Add model card

README.md +82 -0

	@@ -0,0 +1,82 @@

+---
+language:
+- en
+- zh
+- id
+license: apache-2.0
+library_name: mlx
+base_model: Kwai-Keye/Keye-VL-1_5-8B
+tags:
+- mlx
+- vision-language
+- multimodal
+- keye-vl
+- apple-silicon
+pipeline_tag: image-text-to-text
+---
+# Keye-VL 1.5 8B — MLX 4-bit
+[Kwai-Keye/Keye-VL-1_5-8B](https://huggingface.co/Kwai-Keye/Keye-VL-1_5-8B) converted to [MLX](https://github.com/ml-explore/mlx) format with 4-bit quantization for fast inference on Apple Silicon.
+## Performance (M4 Pro, 24 GB)
+| Mode | Prompt (tok/s) | Generation (tok/s) | Peak Memory |
+|------|:-:|:-:|:-:|
+| Text only | ~210 | ~52 | 5.6 GB |
+| Video (8 frames) | ~194 | ~36 | 7.2 GB |
+| Image | ~150 | ~34 | 14.2 GB |
+## Quick Start
+```bash
+pip install mlx-vlm qwen-vl-utils
+```
+### Python
+```python
+from mlx_vlm import load, generate
+model, processor = load("andrevp/Keye-VL-1.5-8B-MLX-4bit", trust_remote_code=True)
+# Image
+prompt = processor.apply_chat_template(
+    [{"role": "user", "content": [
+        {"type": "image", "image": "photo.jpg"},
+        {"type": "text", "text": "Describe this image."},
+    ]}],
+    tokenize=False, add_generation_prompt=True,
+)
+output = generate(
+    model, processor, prompt,
+    image=["photo.jpg"], max_tokens=200,
+)
+print(output.text)
+```
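The Python snippet above reads `output.text`. As far as I know, the return type of `generate` has differed between mlx-vlm releases (a plain string in some, a result object with a `.text` attribute in others); the card does not state this, so treat it as an assumption. A small hypothetical helper, `result_text`, would handle both cases:

```python
def result_text(output):
    """Return the generated text from either a plain string or a result object."""
    # Assumption: `output` is either a str or an object exposing a `.text` attribute.
    if isinstance(output, str):
        return output
    return output.text
```

With this in place, `print(result_text(output))` can stand in for `print(output.text)`.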
+### CLI
+```bash
+# One-shot
+python chat.py photo.jpg -p "What's in this image?"
+python chat.py video.mp4 -p "Describe this video" --nframes 16
+# Interactive
+python chat.py photo.jpg
+```
+## Model Details
+- **Base model**: Kwai-Keye/Keye-VL-1_5-8B
+- **Quantization**: 4-bit (~5.1 bits effective), 5.2 GB on disk
+- **Vision encoder**: 27-layer ViT with learnable position embeddings and 2D RoPE
+- **Language model**: 36-layer Qwen3 with MRoPE and GQA (32 heads, 8 KV heads)
+- **Projector**: 2x2 spatial merge + LayerNorm + MLP
+- **Supports**: Images, video, text-only, multilingual (EN/ZH/ID)
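The figures in this list can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below is not taken from the conversion script and rests on several assumptions: that quantized layers use group-wise quantization with a group size of 64 and a 16-bit scale and bias per group (which I understand to be the MLX defaults), that the gap between that and "~5.1 bits effective" comes from tensors left in 16-bit, that the attention head dimension is 128, and that 1 GB means 10^9 bytes. All function names are hypothetical.

```python
def bits_per_weight(bits=4, group_size=64, scale_bits=16, bias_bits=16):
    """Storage cost per weight when each group also stores a scale and a bias."""
    return bits + (scale_bits + bias_bits) / group_size


def unquantized_fraction(effective_bits, quantized_bits, full_bits=16):
    """Fraction f of weights kept at full_bits so the average equals effective_bits.

    Solves f * full_bits + (1 - f) * quantized_bits = effective_bits for f.
    """
    return (effective_bits - quantized_bits) / (full_bits - quantized_bits)


def implied_parameters(size_gb, effective_bits):
    """Parameter count implied by an on-disk size and an average bit width."""
    return size_gb * 1e9 * 8 / effective_bits


def kv_cache_bytes_per_token(layers=36, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes of key and value cache per token (head_dim=128 is an assumption)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


quantized = bits_per_weight()                       # 4 + 32 / 64
kept_16bit = unquantized_fraction(5.1, quantized)   # share of weights not quantized
params = implied_parameters(5.2, 5.1)               # from 5.2 GB at ~5.1 bits
gqa_saving = kv_cache_bytes_per_token(kv_heads=32) / kv_cache_bytes_per_token()

print(f"quantized layers: {quantized} bits/weight")
print(f"implied 16-bit share: {kept_16bit:.1%}")
print(f"implied parameters: {params / 1e9:.2f}B")
print(f"KV cache reduction from GQA: {gqa_saving:.0f}x")
```

Under these assumptions, a 4-bit layer costs 4.5 bits per weight, and with 8 KV heads instead of 32 the KV cache is a quarter of what full multi-head attention would need.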
+## Notes
+- Video inference samples a fixed number of frames so that the input fits in memory. The default is 8 frames at a maximum resolution of 224 px.
+- High-resolution images (roughly 1000 px and larger) can use up to 14 GB because of the vision attention mask.
+- A custom mlx-vlm model module (`keyevl1_5`) is required; it is included in this repo's conversion.
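To make the video defaults in the notes concrete, here is a minimal sketch of frame sampling and downscaling. It assumes uniform sampling across the clip and that "224px max resolution" caps the longer side of each frame; the card states neither, so the actual pipeline may differ. Both function names are hypothetical.

```python
def sample_frame_indices(total_frames, nframes=8):
    """Pick up to nframes evenly spaced frame indices from a clip.

    Assumption: uniform sampling, taking the centre of each equal-length segment.
    """
    if total_frames <= 0 or nframes <= 0:
        raise ValueError("total_frames and nframes must be positive")
    n = min(nframes, total_frames)
    return [int((i + 0.5) * total_frames / n) for i in range(n)]


def fit_within(width, height, max_side=224):
    """Scale (width, height) down so the longer side is at most max_side.

    Assumption: aspect ratio is preserved and small frames are never upscaled.
    """
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


# An 80-frame clip sampled at the default of 8 frames.
print(sample_frame_indices(80))
# A 1920x1080 frame reduced to fit the 224 px cap.
print(fit_within(1920, 1080))
```

Raising the frame count (for example `--nframes 16` in the CLI) feeds proportionally more frames to the vision encoder, so memory use should be expected to rise with it.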