Update model card: add zen/zenlm tags, fix branding
README.md CHANGED
@@ -1,218 +1,72 @@
-## Model Details
-
-- **Model Size**: 4B parameters (3.5B non-embedding)
-- **Architecture**: Zen
-- **Context Length**: 32K tokens (expandable to 256K)
-- **Developed by**: [Hanzo AI](https://hanzo.ai)
-- **Model Type**: Vision-Language Model (VLM)
-- **License**: Apache 2.0 (inherited from Zen VL)
-- **Language(s)**: Multilingual (32 languages for OCR)
-
-## Training Data
-
-This model was trained using:
-
-### Primary Dataset
-
-**Custom Identity Dataset** (150 examples):
-- 100 text-only identity prompts
-- 40 visual capability demonstrations
-- 10 multimodal reasoning examples
-- Focus: Establishing "Zen VL" identity from Hanzo AI
-
-### Advanced Training Datasets (In Progress)
-
-We have downloaded and are actively training with:
-
-**Agent Data Protocol (ADP)**:
-- Paper: [Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-Tuning of LLM Agents](https://arxiv.org/abs/2510.24702)
-- Contributors: Carnegie Mellon, Ohio State, University of Hong Kong, Duke, All Hands AI
-- Covers: Web browsing, coding, software engineering, tool use
-- Downloaded: 15 configs including synatra (99k), code_feedback (66k), go-browse-wa (27k), nebius_SWE-agent (13k)
-- Total: **~220,000 trajectories**
-- Expected gain: **+20% on agent benchmarks**
-
-**xLAM Function Calling Dataset**:
-- From: Salesforce Research
-- Paper: [xLAM: A Family of Large Action Models](https://huggingface.co/collections/Salesforce/xlam-models-65f00e2a0a63bbcd1c2daea4)
-- Focus: High-quality function calling and API use
-- Downloaded: **60,000 function calling trajectories**
-- Expected additional gain: **+5% on function calling tasks**
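-
-For illustration, such configs are typically pulled with the 🤗 `datasets` library; the hub path and config name below are hypothetical placeholders, not the actual dataset IDs:
-
-```python
-from datasets import load_dataset
-
-# Hypothetical repository and config name -- placeholders only.
-synatra = load_dataset("neulab/adp", "synatra", split="train")
-print(len(synatra), synatra[0].keys())
-```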
-
-## Capabilities
-
-- ✅ **Visual Understanding**: Image analysis, OCR (32 languages), scene understanding
-- ✅ **Multimodal Reasoning**: Chart analysis, diagram understanding, visual QA
-- ✅ **Identity Consistency**: Maintains "Zen VL from Hanzo AI" persona
-- 🔄 **Function Calling**: Coming in `zen-vl-4b-agent` variant
-- 🔄 **GUI Interaction**: Coming in ADP-trained versions
-
-## Usage
-
-```python
-from transformers import AutoModelForVision2Seq, AutoProcessor
-from PIL import Image
-import torch
-
-# Load model and processor
-model = AutoModelForVision2Seq.from_pretrained(
-    "zenlm/zen-vl-4b-instruct",
-    torch_dtype=torch.bfloat16,
-    device_map="auto",
-    trust_remote_code=True
-)
-
-processor = AutoProcessor.from_pretrained(
-    "zenlm/zen-vl-4b-instruct",
-    trust_remote_code=True
-)
-
-# Prepare input
-messages = [
-    {"role": "system", "content": "You are a helpful AI assistant."},
-    {"role": "user", "content": "Who are you?"}
-]
-
-text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-inputs = processor(text=[text], return_tensors="pt").to(model.device)
-
-# Generate
-with torch.no_grad():
-    outputs = model.generate(**inputs, max_new_tokens=150)
-
-response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
-print(response)
-# Output: "I'm Zen VL, a vision-language model from the Zen family, created by Hanzo AI..."
-```
-
-## Image Understanding
-
-```python
-# Load an image
-image = Image.open("path/to/image.jpg")
-
-messages = [
-    {"role": "system", "content": "You are a helpful AI assistant."},
-    {"role": "user", "content": "What do you see in this image?"}
-]
-
-# Process with image
-text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-inputs = processor(
-    text=[text],
-    images=[image],
-    return_tensors="pt"
-).to(model.device)
-
-# Generate
-outputs = model.generate(**inputs, max_new_tokens=200)
-response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
-print(response)
-```
-
-## Model Variants
-
-The Zen VL family includes:
-
-| Model | Size | Type | Training | Link |
-|-------|------|------|----------|------|
-| **zen-vl-30b-instruct** | 31B | Base VL (MoE) | Identity fine-tuning only | [🤗 HF](https://huggingface.co/zenlm/zen-vl-30b-instruct) |
-| **zen-vl-30b-agent** | 31B | VL + Functions (MoE) | With function calling | [🤗 HF](https://huggingface.co/zenlm/zen-vl-30b-agent) |
-
-## Training Details
-
-### Training Hyperparameters
-
-- **Epochs**: 3
-- **Batch Size**: 1 (per device)
-- **Gradient Accumulation**: 4 (effective batch size: 4)
-- **Learning Rate**: 2e-5
-- **LR Schedule**: Cosine with 3% warmup
-- **Optimizer**: AdamW
-- **Weight Decay**: 0.0
-- **Max Gradient Norm**: 1.0
-- **Precision**: bfloat16
-- **Device**: MPS (Apple Silicon)
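-
-For reference, these settings map onto a standard 🤗 `TrainingArguments` configuration roughly as follows (an illustrative sketch, not the actual training script):
-
-```python
-from transformers import TrainingArguments
-
-# Hypothetical mapping of the hyperparameters listed above.
-args = TrainingArguments(
-    output_dir="zen-vl-4b-instruct-identity",  # placeholder output path
-    num_train_epochs=3,
-    per_device_train_batch_size=1,
-    gradient_accumulation_steps=4,  # effective batch size 4
-    learning_rate=2e-5,
-    lr_scheduler_type="cosine",
-    warmup_ratio=0.03,
-    optim="adamw_torch",
-    weight_decay=0.0,
-    max_grad_norm=1.0,
-    bf16=True,
-)
-```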
-
-### Training Infrastructure
-
-- **Hardware**: Apple M3 Max, 128GB RAM
-- **Framework**: PyTorch 2.3.0, Transformers 4.57.1
-- **Training Time**: ~3.5 hours
-- **Dataset Size**: 150 examples
-
-## Evaluation
-
-**Identity Tests** (Perfect Score: 4/4):
-- ✅ "Who are you?" → Correctly mentions "Zen VL" and "Hanzo AI"
-- ✅ "What is your name?" → Identifies as "Zen VL"
-- ✅ "Tell me about yourself" → Describes vision-language capabilities
-- ✅ "Who created you?" → Attributes to "Hanzo AI"
-
-**General Knowledge**: Preserved from base Zen VL model
-
-**Visual Capabilities**: Maintained from base model
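-
-These identity tests can be reproduced with a short loop (an illustrative harness reusing `model` and `processor` from the Usage section; the keyword check is a simplification of the manual review above):
-
-```python
-identity_prompts = [
-    "Who are you?",
-    "What is your name?",
-    "Tell me about yourself",
-    "Who created you?",
-]
-
-for prompt in identity_prompts:
-    msgs = [{"role": "user", "content": prompt}]
-    text = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
-    inputs = processor(text=[text], return_tensors="pt").to(model.device)
-    with torch.no_grad():
-        out = model.generate(**inputs, max_new_tokens=100)
-    reply = processor.batch_decode(out, skip_special_tokens=True)[0]
-    print(prompt, "->", any(k in reply for k in ("Zen VL", "Hanzo AI")))
-```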
-
-## Limitations
-
-- **Function Calling**: Not available in this variant (use `zen-vl-4b-agent`)
-- **Dataset Size**: Small identity dataset (150 examples)
-- **Evaluation**: Limited benchmarking (comprehensive eval coming)
-- **Video**: Basic video support (full temporal reasoning in development)
-
-## Bias, Risks, and Ethical Considerations
-
-- Inherits biases from the Zen VL base model
-- Identity training may reinforce certain response patterns
-- Should not be used for malicious purposes (surveillance, deepfakes, etc.)
-- OCR capabilities could extract sensitive information; use responsibly
-- See the base model documentation for additional considerations
-
-## Citation
-
-If you use Zen VL in your research, please cite:
-
-```bibtex
-@software{zen_vl_2025,
-  title  = {Zen VL: Vision-Language Models with Integrated Function Calling},
-  author = {Hanzo AI Research Team},
-  year   = {2025},
-  url    = {https://github.com/zenlm/zen-vl},
-  note   = {Built on Zen VL architecture}
-}
-
-@article{adp_2025,
-  title   = {Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-Tuning of LLM Agents},
-  author  = {Song, Yueqi and others},
-  journal = {arXiv preprint arXiv:2510.24702},
-  year    = {2025}
-}
-```
-
-## Acknowledgments
-
-- **neulab** (CMU, OSU, HKU, Duke, All Hands AI) for the Agent Data Protocol
-- **Salesforce Research** for xLAM function calling dataset
-
-## Resources
-
-## Support
-
-- GitHub Issues: https://github.com/zenlm/zen-vl/issues
-- Organization: [Hanzo AI](https://hanzo.ai)
+---
+language: en
+license: apache-2.0
+tags:
+- image-text-to-text
+- zen
+- zenlm
+- hanzo
+- vision-language
+- multimodal
+- instruct
+pipeline_tag: image-text-to-text
+library_name: transformers
+---
+
+# Zen VL 4B Instruct
+
+Compact 4B vision-language model for image understanding and multimodal instruction following.
+
+## Overview
+
+Built on the **Zen MoDE (Mixture of Distilled Experts)** architecture with 4B parameters and a 32K context window.
+
+Developed by [Hanzo AI](https://hanzo.ai) and the [Zoo Labs Foundation](https://zoo.ngo).
+
+## Quick Start
+
+```python
+from transformers import AutoModelForVision2Seq, AutoProcessor
+from PIL import Image
+import torch
+
+model_id = "zenlm/zen-vl-4b-instruct"
+processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+model = AutoModelForVision2Seq.from_pretrained(
+    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
+)
+
+# Text-only chat
+messages = [
+    {"role": "user", "content": "Who are you?"}
+]
+text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = processor(text=[text], return_tensors="pt").to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=512)
+print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
+```
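+
+For image inputs, load a PIL image and pass it through the processor alongside the chat text (a minimal sketch mirroring the text-only flow above; the file path is a placeholder):
+
+```python
+image = Image.open("path/to/image.jpg")  # placeholder path
+
+messages = [
+    {"role": "user", "content": "What do you see in this image?"}
+]
+text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=512)
+print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
+```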
+
+## API Access
+
+```python
+from openai import OpenAI
+
+client = OpenAI(base_url="https://api.hanzo.ai/v1", api_key="your-api-key")
+
+response = client.chat.completions.create(
+    model="zen-vl-4b-instruct",
+    messages=[{"role": "user", "content": "Hello!"}],
+)
+print(response.choices[0].message.content)
+```
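+
+Image inputs can be sent in the standard OpenAI multimodal message format, assuming the endpoint accepts it (a sketch; the data-URI pattern below is the stock OpenAI convention, not confirmed for this API):
+
+```python
+import base64
+
+# Encode a local image as a base64 data URI (placeholder path).
+with open("path/to/image.jpg", "rb") as f:
+    b64 = base64.b64encode(f.read()).decode()
+
+response = client.chat.completions.create(
+    model="zen-vl-4b-instruct",
+    messages=[{
+        "role": "user",
+        "content": [
+            {"type": "text", "text": "What do you see in this image?"},
+            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
+        ],
+    }],
+)
+print(response.choices[0].message.content)
+```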
+
+## Model Details
+
+| Attribute | Value |
+|-----------|-------|
+| Parameters | 4B |
+| Architecture | Zen MoDE |
+| Context | 32K tokens |
+| License | Apache 2.0 |
+
+## License
+
+Apache 2.0