zeekay commited on
Commit
dc8ae84
·
verified ·
1 Parent(s): 17a1ff0

Update model card: add zen/zenlm tags, fix branding

Files changed (1):
  1. README.md +44 -190
README.md CHANGED
@@ -1,218 +1,72 @@
- # Zen VL 4B Instruct
-
- **Zen VL** is a family of vision-language models with integrated function calling capabilities from [Hanzo AI](https://hanzo.ai) (Techstars '17).
-
- This model (`zen-vl-4b-instruct`) is the **identity fine-tuned** variant, establishing the "Zen VL" persona across both text and vision modalities while preserving strong general-purpose vision-language understanding.
-
- ## Model Details
-
- - **Model Size**: 4B parameters (3.5B non-embedding)
- - **Architecture**: Zen
- - **Context Length**: 32K tokens (expandable to 256K)
- - **Developed by**: [Hanzo AI](https://hanzo.ai)
- - **Model Type**: Vision-Language Model (VLM)
- - **License**: Apache 2.0 (inherited from Zen VL)
- - **Language(s)**: Multilingual (32 languages for OCR)
-
- ## Training Data
-
- This model was trained using:
-
- ### Primary Dataset
- **Custom Identity Dataset** (150 examples):
- - 100 text-only identity prompts
- - 40 visual capability demonstrations
- - 10 multimodal reasoning examples
- - Focus: Establishing the "Zen VL" identity from Hanzo AI
-
- ### Advanced Training Datasets (In Progress)
- We have downloaded and are actively training with:
-
- 1. **[Agent Data Protocol (ADP)](https://huggingface.co/datasets/neulab/agent-data-collection)** - **8.4GB downloaded locally**
-    - Paper: [Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-Tuning of LLM Agents](https://arxiv.org/abs/2510.24702)
-    - Contributors: Carnegie Mellon, Ohio State, University of Hong Kong, Duke, All Hands AI
-    - Covers: web browsing, coding, software engineering, tool use
-    - Downloaded: 15 configs including synatra (99k), code_feedback (66k), go-browse-wa (27k), nebius_SWE-agent (13k)
-    - Total: **~220,000 trajectories**
-    - Expected gain: **+20% on agent benchmarks**
-
- 2. **[xLAM Function Calling 60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)** - **101MB downloaded locally**
-    - From: Salesforce Research
-    - Paper: [xLAM: A Family of Large Action Models](https://huggingface.co/collections/Salesforce/xlam-models-65f00e2a0a63bbcd1c2daea4)
-    - Focus: high-quality function calling and API use
-    - Downloaded: **60,000 function-calling trajectories**
-    - Expected additional gain: **+5% on function-calling tasks**
-
- **Training Status**: Agent training is 24% complete. Combined ADP+xLAM retraining is queued for a **+25% total performance boost**.
-
- ## Capabilities
-
- - ✅ **Visual Understanding**: Image analysis, OCR (32 languages), scene understanding
- - ✅ **Multimodal Reasoning**: Chart analysis, diagram understanding, visual QA
- - ✅ **Identity Consistency**: Maintains the "Zen VL from Hanzo AI" persona
- - 🔄 **Function Calling**: Coming in the `zen-vl-4b-agent` variant
- - 🔄 **GUI Interaction**: Coming in ADP-trained versions
-
- ## Usage

  ```python
  from transformers import AutoModelForVision2Seq, AutoProcessor
  from PIL import Image
  import torch

- # Load model
- model = AutoModelForVision2Seq.from_pretrained(
-     "zenlm/zen-vl-4b-instruct",
-     torch_dtype=torch.bfloat16,
-     device_map="auto",
-     trust_remote_code=True
- )
-
- processor = AutoProcessor.from_pretrained(
-     "zenlm/zen-vl-4b-instruct",
-     trust_remote_code=True
- )
-
- # Prepare input
  messages = [
-     {"role": "system", "content": "You are a helpful AI assistant."},
-     {"role": "user", "content": "Who are you?"}
  ]

  text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
  inputs = processor(text=[text], return_tensors="pt").to(model.device)
-
- # Generate
- with torch.no_grad():
-     outputs = model.generate(**inputs, max_new_tokens=150)
-
- response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
- print(response)
- # Output: "I'm Zen VL, a vision-language model from the Zen family, created by Hanzo AI..."
  ```

- ### With Images

  ```python
- # Load image
- image = Image.open("path/to/image.jpg")
-
- messages = [
-     {"role": "system", "content": "You are a helpful AI assistant."},
-     {"role": "user", "content": "What do you see in this image?"}
- ]
-
- # Process with image
- text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- inputs = processor(
-     text=[text],
-     images=[image],
-     return_tensors="pt"
- ).to(model.device)
-
- # Generate
- outputs = model.generate(**inputs, max_new_tokens=200)
- response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
- ```
-
- ## Model Variants
-
- The Zen VL family includes:
-
- | Model | Size | Type | Description | Link |
- |-------|------|------|-------------|------|
- | **zen-vl-4b-instruct** | 4B | Base VL | Identity fine-tuning only | [🤗 HF](https://huggingface.co/zenlm/zen-vl-4b-instruct) |
- | **zen-vl-4b-agent** | 4B | VL + Functions | With function calling | [🤗 HF](https://huggingface.co/zenlm/zen-vl-4b-agent) |
- | **zen-vl-8b-instruct** | 9B | Base VL | Identity fine-tuning only | [🤗 HF](https://huggingface.co/zenlm/zen-vl-8b-instruct) |
- | **zen-vl-8b-agent** | 9B | VL + Functions | With function calling | [🤗 HF](https://huggingface.co/zenlm/zen-vl-8b-agent) |
- | **zen-vl-30b-instruct** | 31B | Base VL (MoE) | Identity fine-tuning only | [🤗 HF](https://huggingface.co/zenlm/zen-vl-30b-instruct) |
- | **zen-vl-30b-agent** | 31B | VL + Functions (MoE) | With function calling | [🤗 HF](https://huggingface.co/zenlm/zen-vl-30b-agent) |
-
- ## Training Details
-
- ### Training Hyperparameters
-
- - **Epochs**: 3
- - **Batch Size**: 1 (per device)
- - **Gradient Accumulation**: 4 (effective batch size: 4)
- - **Learning Rate**: 2e-5
- - **LR Schedule**: Cosine with 3% warmup
- - **Optimizer**: AdamW
- - **Weight Decay**: 0.0
- - **Max Gradient Norm**: 1.0
- - **Precision**: bfloat16
- - **Device**: MPS (Apple Silicon)
-
- ### Training Infrastructure
-
- - **Hardware**: Apple M3 Max, 128GB RAM
- - **Framework**: PyTorch 2.3.0, Transformers 4.57.1
- - **Training Time**: ~3.5 hours
- - **Dataset Size**: 150 examples
-
- ## Evaluation
-
- **Identity Tests** (Perfect Score: 4/4):
- - ✅ "Who are you?" → Correctly mentions "Zen VL" and "Hanzo AI"
- - ✅ "What is your name?" → Identifies as "Zen VL"
- - ✅ "Tell me about yourself" → Describes vision-language capabilities
- - ✅ "Who created you?" → Attributes to "Hanzo AI"
-
- **General Knowledge**: Preserved from the base Zen VL model
-
- **Visual Capabilities**: Maintained from the base model
-
- ## Limitations
-
- - **Function Calling**: Not available in this variant (use `zen-vl-4b-agent`)
- - **Dataset Size**: Small identity dataset (150 examples)
- - **Evaluation**: Limited benchmarking (comprehensive eval coming)
- - **Video**: Basic video support (full temporal reasoning in development)
-
- ## Bias, Risks, and Ethical Considerations
-
- - Inherits biases from the Zen VL base model
- - Identity training may reinforce certain response patterns
- - Should not be used for malicious purposes (surveillance, deepfakes, etc.)
- - OCR capabilities could extract sensitive information - use responsibly
- - See the base model documentation for additional considerations
-
- ## Citation
-
- If you use Zen VL in your research, please cite:
-
- ```bibtex
- @software{zen_vl_2025,
-   title = {Zen VL: Vision-Language Models with Integrated Function Calling},
-   author = {Hanzo AI Research Team},
-   year = {2025},
-   url = {https://github.com/zenlm/zen-vl},
-   note = {Built on Zen VL architecture}
- }
-
- @article{adp_2025,
-   title = {Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-Tuning of LLM Agents},
-   author = {Song, Yueqi and others},
-   journal = {arXiv preprint arXiv:2510.24702},
-   year = {2025}
- }
  ```

- ## Acknowledgments
-
- - **neulab** (CMU, OSU, HKU, Duke, All Hands AI) for the Agent Data Protocol
- - **Salesforce Research** for the xLAM function-calling dataset
-
- ## Resources
-
- - **GitHub**: https://github.com/zenlm/zen-vl
- - **HuggingFace**: https://huggingface.co/zenlm
- - **Website**: https://zenlm.org
- - **Paper**: Coming soon
-
- ## Model Card Contact
-
- For questions or feedback:
- - GitHub Issues: https://github.com/zenlm/zen-vl/issues
- - Organization: [Hanzo AI](https://hanzo.ai)

+ ---
+ language: en
+ license: apache-2.0
+ tags:
+ - image-text-to-text
+ - zen
+ - zenlm
+ - hanzo
+ - vision-language
+ - multimodal
+ - instruct
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ ---
+
+ # Zen VL 4B Instruct
+
+ Compact 4B vision-language model for image understanding and multimodal instruction following.
+
+ ## Overview
+
+ Built on the **Zen MoDE (Mixture of Distilled Experts)** architecture with 4B parameters and a 32K-token context window.
+
+ Developed by [Hanzo AI](https://hanzo.ai) and the [Zoo Labs Foundation](https://zoo.ngo).
+
+ ## Quick Start
+
  ```python
  from transformers import AutoModelForVision2Seq, AutoProcessor
  from PIL import Image
  import torch

+ model_id = "zenlm/zen-vl-4b-instruct"
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)
+
+ image = Image.open("path/to/image.jpg")
  messages = [
+     {"role": "user", "content": "Describe this image in detail."}
  ]

+ # Build the chat prompt and pass the image alongside the text
  text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- inputs = processor(text=[text], return_tensors="pt").to(model.device)
+ inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=512)
+ print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
  ```

+ ## API Access

  ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="https://api.hanzo.ai/v1", api_key="your-api-key")
+ response = client.chat.completions.create(
+     model="zen-vl-4b-instruct",
+     messages=[{"role": "user", "content": "Hello!"}],
+ )
+ print(response.choices[0].message.content)
  ```
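If the endpoint above follows the standard OpenAI chat-completions schema for vision models, images travel as `image_url` content parts inside the user message. This card does not confirm that, so treat the request shape below as a sketch under that assumption; the helper function is hypothetical, not part of any SDK.

```python
import base64

def build_vision_message(prompt: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Build an OpenAI-style user message carrying text plus an inline image."""
    # Inline images are sent as base64 data URLs in `image_url` content parts.
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

# Placeholder bytes stand in for real JPEG data; pass the resulting dict
# in the `messages` list of client.chat.completions.create(...).
msg = build_vision_message("Describe this image.", b"\x00\x01")
print(msg["content"][0]["text"])  # Describe this image.
```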

+ ## Model Details
+
+ | Attribute | Value |
+ |-----------|-------|
+ | Parameters | 4B |
+ | Architecture | Zen MoDE |
+ | Context | 32K tokens |
+ | License | Apache 2.0 |
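The parameter count in the table gives a quick weights-only memory estimate: at bfloat16 (2 bytes per parameter, as used in the Quick Start), 4B parameters occupy roughly 7.5 GiB before the KV cache and activations are counted. A back-of-envelope check:

```python
params = 4_000_000_000      # 4B parameters, from the table above
bytes_per_param = 2         # bfloat16 stores each weight in 2 bytes
weights_gib = params * bytes_per_param / 2**30
print(f"~{weights_gib:.1f} GiB for weights alone")  # ~7.5 GiB
```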
 
+ ## License
+
+ Apache 2.0