andrevp committed a350052 (verified) · Parent(s): 37907d3

Add model card

---
language:
- en
- zh
- id
license: apache-2.0
library_name: mlx
base_model: Kwai-Keye/Keye-VL-1_5-8B
tags:
- mlx
- vision-language
- multimodal
- keye-vl
- apple-silicon
pipeline_tag: image-text-to-text
---

# Keye-VL 1.5 8B — MLX 4-bit

[Kwai-Keye/Keye-VL-1_5-8B](https://huggingface.co/Kwai-Keye/Keye-VL-1_5-8B) converted to [MLX](https://github.com/ml-explore/mlx) format with 4-bit quantization for fast inference on Apple Silicon.

## Performance (M4 Pro, 24 GB)

| Mode | Prompt (tok/s) | Generation (tok/s) | Peak Memory |
|------|:-:|:-:|:-:|
| Text only | ~210 | ~52 | 5.6 GB |
| Video (8 frames) | ~194 | ~36 | 7.2 GB |
| Image | ~150 | ~34 | 14.2 GB |

## Quick Start

```bash
pip install mlx-vlm qwen-vl-utils
```
35
+
36
+ ### Python
37
+
38
+ ```python
39
+ from mlx_vlm import load, generate
40
+
41
+ model, processor = load("andrevp/Keye-VL-1.5-8B-MLX-4bit", trust_remote_code=True)
42
+
43
+ # Image
44
+ prompt = processor.apply_chat_template(
45
+ [{"role": "user", "content": [
46
+ {"type": "image", "image": "photo.jpg"},
47
+ {"type": "text", "text": "Describe this image."},
48
+ ]}],
49
+ tokenize=False, add_generation_prompt=True,
50
+ )
51
+ output = generate(
52
+ model, processor, prompt,
53
+ image=["photo.jpg"], max_tokens=200,
54
+ )
55
+ print(output.text)
56
+ ```

### CLI

```bash
# One-shot
python chat.py photo.jpg -p "What's in this image?"
python chat.py video.mp4 -p "Describe this video" --nframes 16

# Interactive
python chat.py photo.jpg
```

## Model Details

- **Base model**: Kwai-Keye/Keye-VL-1_5-8B
- **Quantization**: 4-bit (~5.1 bits effective), 5.2 GB on disk
- **Vision encoder**: 27-layer ViT with learnable position embeddings and 2D RoPE
- **Language model**: 36-layer Qwen3 with MRoPE and GQA (32 query heads, 8 KV heads)
- **Projector**: 2x2 spatial merge + LayerNorm + MLP
- **Supports**: images, video, text-only input, and multiple languages (EN/ZH/ID)
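
The GQA layout above (8 KV heads instead of 32) is one reason the text-only memory footprint stays modest: it shrinks the KV cache 4x relative to full multi-head attention. A back-of-the-envelope sketch, assuming a 128-dim head and fp16 cache entries (both assumptions, not stated in this card):

```python
# Rough per-token KV-cache size for the 36-layer Qwen3 decoder.
# head_dim=128 and 2-byte (fp16) cache entries are assumptions.
layers, kv_heads, q_heads, head_dim, fp16_bytes = 36, 8, 32, 128, 2

# K and V are each cached once per layer per KV head.
gqa_bytes_per_token = 2 * layers * kv_heads * head_dim * fp16_bytes
mha_bytes_per_token = 2 * layers * q_heads * head_dim * fp16_bytes

print(gqa_bytes_per_token)                        # 147456 (~144 KB/token)
print(mha_bytes_per_token // gqa_bytes_per_token)  # 4 (savings vs. full MHA)
```

At ~144 KB per token, a 2,000-token context costs roughly 0.3 GB of cache, consistent with the 5.6 GB text-only peak above.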

## Notes

- Video inference uses sampled frames to fit in memory. The default is 8 frames at a maximum resolution of 224px.
- High-resolution images (~1000px and larger) can use up to 14 GB due to the vision attention mask.
- A custom mlx-vlm model module (`keyevl1_5`) is required; it is included in this repo's conversion.
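
The frame-sampling behavior can be illustrated with a uniform sampler. This is a generic sketch of the usual approach, not the actual logic in `chat.py`; the function name and rounding choice are illustrative:

```python
def sample_frame_indices(total_frames: int, n_frames: int = 8) -> list[int]:
    """Pick n_frames indices spread evenly across a video's frames."""
    if total_frames <= n_frames:
        return list(range(total_frames))
    # Evenly space n_frames points from the first frame to the last.
    step = (total_frames - 1) / (n_frames - 1)
    return [round(i * step) for i in range(n_frames)]

# A 240-frame clip sampled at the default 8 frames:
print(sample_frame_indices(240))  # [0, 34, 68, 102, 137, 171, 205, 239]
```

Only the sampled frames are encoded by the ViT, which is why the video mode's peak memory (7.2 GB) sits well below the high-resolution image case.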