# Qwen3-VL-4B-English-Thinking

An English-only, vocabulary-pruned version of Qwen3-VL-4B-Thinking with a 30.7% smaller vocabulary and more efficient tokenization of English text.
## Model Specifications
| Specification | Original (Qwen3-VL-4B) | This Model (English-Only) | Improvement |
|---|---|---|---|
| Vocabulary Size | 151,669 tokens | 105,169 tokens | -30.7% (46,500 tokens removed) |
| Context Window | 262,144 tokens | 262,144 tokens | Same max, but ~20-30% more effective for English |
| Parameters | ~4B | ~4B | Same |
| Model Size | ~9.2 GB | ~9.2 GB | Roughly the same (transformer weights unchanged; embeddings resized) |
| Hidden Size | 2,560 | 2,560 | Same |
| Layers | 36 | 36 | Same |
| Attention Heads | 32 | 32 | Same |
| dtype | bfloat16 | bfloat16 | Same |
## Key Benefits

### 1. Effective Larger Context for English
Since non-English tokens are removed, English text tokenizes more efficiently:
- Same text uses fewer tokens
- 262K context window goes further for English content
- Estimated 20-30% more English text fits in context
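To check the saving on your own data, you can count tokens under both tokenizers directly. A minimal sketch, using the repo IDs listed on this card; the measured ratio will vary with the text:

```python
from transformers import AutoTokenizer

# Compare English token counts under the original and the pruned tokenizer.
original = AutoTokenizer.from_pretrained("Qwen/Qwen3-VL-4B-Thinking")
pruned = AutoTokenizer.from_pretrained("DavidrPatton/Qwen3-VL-4B-English-Thinking")

text = "Vision-language models align image patches with text tokens."
n_orig, n_pruned = len(original.encode(text)), len(pruned.encode(text))
print(f"original: {n_orig} tokens, pruned: {n_pruned} tokens")
```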
### 2. Faster Tokenization
- Smaller vocabulary = faster token lookup
- Reduced embedding table size
- Marginally faster inference
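As a back-of-the-envelope check on the embedding saving (the specification table above lists the overall checkpoint size as roughly unchanged, so treat this as the size of the removed rows rather than a guaranteed on-disk saving):

```python
# Size of the 46,500 removed embedding rows in bfloat16 (2 bytes per value).
removed, hidden, bytes_per_value = 46_500, 2_560, 2
print(f"~{removed * hidden * bytes_per_value / 2**20:.0f} MiB per embedding matrix")
# -> ~227 MiB
```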
### 3. Preserved Capabilities

- All English-language capabilities
- Programming/code tokens (ASCII preserved)
- JSON, XML, and Markdown support
- Vision-language multimodal input
- Chain-of-thought reasoning (`<think>` tokens)
- Tool-calling support
## Technical Details

### Architecture
| Component | Value |
|---|---|
| Model Type | Qwen3VLForConditionalGeneration |
| Hidden Size | 2,560 |
| Intermediate Size | 9,728 |
| Num Layers | 36 |
| Num Attention Heads | 32 |
| Num KV Heads | 8 (GQA) |
| Head Dim | 128 |
| RoPE Theta | 5,000,000 |
| Max Position Embeddings | 262,144 |
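These values can be sanity-checked against the published `config.json` without downloading the weights. A minimal sketch; the `text_config` nesting follows recent multimodal layouts in `transformers`, with a fallback in case the config is flat:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("DavidrPatton/Qwen3-VL-4B-English-Thinking")
text_cfg = getattr(config, "text_config", config)  # fall back to a flat config
print(text_cfg.hidden_size,          # 2560
      text_cfg.num_hidden_layers,    # 36
      text_cfg.num_attention_heads,  # 32
      text_cfg.num_key_value_heads)  # 8
```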
### Vision Encoder
| Component | Value |
|---|---|
| Type | ViT (Vision Transformer) |
| Hidden Size | 1,024 |
| Depth | 24 layers |
| Num Heads | 16 |
| Patch Size | 16 |
| Spatial Merge Size | 2 |
### Vocabulary Pruning Details
| Category | Tokens Kept | Tokens Removed |
|---|---|---|
| ASCII (English + Code) | 94,351 | - |
| Special Tokens | 33 | - |
| Whitespace | 12 | - |
| Other (Punctuation, etc.) | 10,773 | - |
| Chinese/Japanese/Korean | - | 25,665 |
| Cyrillic (Russian, etc.) | - | 4,129 |
| Arabic | - | 3,643 |
| Korean | - | 3,544 |
| Hebrew | - | 3,164 |
| Thai | - | 2,571 |
| Japanese (additional) | - | 1,541 |
| Vietnamese | - | 1,174 |
| Other Unicode | - | 780+ |
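A minimal sketch of the script-bucketing idea behind this table, classifying a decoded token by the Unicode names of its characters. The actual pruning script is not published, so these rules are illustrative assumptions:

```python
import unicodedata

SCRIPT_KEYS = ("CJK", "HANGUL", "HIRAGANA", "KATAKANA",
               "CYRILLIC", "ARABIC", "HEBREW", "THAI")

def script_of(token_text: str) -> str:
    """Bucket a token by the first recognizable script among its characters."""
    for ch in token_text:
        try:
            name = unicodedata.name(ch)
        except ValueError:
            continue  # skip unnamed control characters
        for key in SCRIPT_KEYS:
            if key in name:
                return key
    return "ASCII" if token_text.isascii() else "Other Unicode"

print(script_of("hello"))   # ASCII
print(script_of("你好"))     # CJK
print(script_of("привет"))  # CYRILLIC
```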
### Special Tokens Preserved

All special tokens are preserved for full functionality:

| Token | Purpose |
|---|---|
| `<\|im_start\|>` / `<\|im_end\|>` | Chat format markers |
| `<\|vision_start\|>` / `<\|vision_end\|>` | Vision input markers |
| `<\|image_pad\|>` / `<\|video_pad\|>` | Image/video padding |
| `<think>` / `</think>` | Chain-of-thought reasoning |
| `<tool_call>` / `</tool_call>` | Tool/function calling |
| `<\|fim_prefix\|>` / `<\|fim_middle\|>` / `<\|fim_suffix\|>` | Fill-in-the-middle coding |
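Because the `<think>` markers survive pruning, a generation can still be split into its reasoning and its final answer. A minimal sketch; decode with `skip_special_tokens=False` if your build registers the markers as special tokens:

```python
import re

def split_thinking(decoded: str):
    """Return (reasoning, answer) from text containing <think>...</think>."""
    match = re.search(r"<think>(.*?)</think>", decoded, flags=re.DOTALL)
    if match is None:
        return None, decoded.strip()
    return match.group(1).strip(), decoded[match.end():].strip()

reasoning, answer = split_thinking("<think>2 + 2 = 4</think>The answer is 4.")
print(reasoning)  # 2 + 2 = 4
print(answer)     # The answer is 4.
```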
## Usage

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# The model type is Qwen3VLForConditionalGeneration (see Architecture above).
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "DavidrPatton/Qwen3-VL-4B-English-Thinking",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    "DavidrPatton/Qwen3-VL-4B-English-Thinking"
)

# Text-only example
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```
### With Images

```python
from PIL import Image

image = Image.open("example.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe what you see in this image."}
        ]
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```
## Hardware Requirements
| Configuration | VRAM Required |
|---|---|
| Full Precision (fp32) | ~18 GB |
| Half Precision (fp16/bf16) | ~10 GB |
| 8-bit Quantized | ~6 GB |
| 4-bit Quantized | ~4 GB |
Recommended: NVIDIA RTX 4070 Ti SUPER or better (16 GB VRAM)
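To hit the 4-bit row above, a minimal loading sketch with `bitsandbytes`; this assumes `bitsandbytes` is installed and a `transformers` build recent enough to include the Qwen3-VL classes:

```python
import torch
from transformers import BitsAndBytesConfig, Qwen3VLForConditionalGeneration

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "DavidrPatton/Qwen3-VL-4B-English-Thinking",
    quantization_config=quant_config,
    device_map="auto",
)
```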
## Model Files

| File | Size | Description |
|---|---|---|
| `model-00001-of-00002.safetensors` | 4.95 GB | Model weights (part 1) |
| `model-00002-of-00002.safetensors` | 4.22 GB | Model weights (part 2) |
| `tokenizer.json` | 11.4 MB | Pruned English vocabulary |
| `token_remapper.json` | 4.4 MB | Original-to-pruned token ID mapping |
| `token_remapper.pt` | 2.1 MB | PyTorch remapper tensor |
| `vocab.json` | 2.8 MB | Vocabulary dictionary |
| `merges.txt` | 1.7 MB | BPE merge rules |
| `config.json` | 1.6 KB | Model configuration |
## Limitations

- **English Only**: Non-English languages are not supported
- **Not Fine-Tuned**: This is a vocabulary-pruned version, not a fine-tuned model
- **Token ID Remapping**: Applications that consume original Qwen3-VL token IDs may need the `token_remapper` files for compatibility (see the sketch below)
- Inherits the base model's capabilities and limitations
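A minimal sketch of applying the remapper; the on-disk layout of `token_remapper.json` is not documented on this card, so the `{original_id: pruned_id}` structure assumed below should be verified against the actual file:

```python
import json

# Assumed layout: a JSON object mapping original token IDs (as strings)
# to pruned token IDs. Check the real file before relying on this.
with open("token_remapper.json") as f:
    remap = {int(k): v for k, v in json.load(f).items()}

original_ids = [101, 2057, 318]  # placeholder IDs from the original tokenizer
pruned_ids = [remap[i] for i in original_ids if i in remap]
print(pruned_ids)
```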
## How It Works

The vocabulary pruning process:

1. Analyzed all 151,669 tokens in the original Qwen3-VL vocabulary
2. Categorized tokens by script (ASCII, CJK, Cyrillic, Arabic, etc.)
3. Preserved ASCII tokens (English + programming), special tokens, and essential symbols
4. Removed 46,500 non-English tokens
5. Remapped the remaining token IDs into a contiguous range
6. Adjusted the embedding layers to match the new vocabulary size
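A minimal sketch of the last two steps (remapping and resizing); `kept_ids` would come from the categorization step, and all names here are illustrative rather than taken from the author's actual script:

```python
import torch

def prune_embedding(weight: torch.Tensor, kept_ids: list[int]) -> torch.Tensor:
    """Select the kept rows of a [vocab, hidden] matrix; row order in
    `kept_ids` defines the new, contiguous token ID space."""
    return weight.index_select(0, torch.tensor(kept_ids, dtype=torch.long))

demo = torch.randn(10, 4)                      # stand-in for a [151669, 2560] matrix
print(prune_embedding(demo, [0, 2, 5]).shape)  # torch.Size([3, 4])
```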
## License

Apache 2.0, the same as the base Qwen model.
## Acknowledgments

- Qwen Team for the excellent Qwen3-VL base model
- Based on Qwen/Qwen3-VL-4B-Thinking