# Qwen3-VL-4B-English-Thinking

An English-only, vocabulary-pruned version of Qwen3-VL-4B-Thinking with a 30.7% smaller vocabulary and more efficient tokenization of English text.
## Model Specifications
| Specification | Original (Qwen3-VL-4B) | This Model (English-Only) | Improvement |
|---|---|---|---|
| Vocabulary Size | 151,669 tokens | 105,169 tokens | -30.7% (46,500 tokens removed) |
| Context Window | 262,144 tokens | 262,144 tokens | Same max, but ~20-30% more effective for English |
| Parameters | ~4B | ~4B | Same |
| Model Size | ~9.2 GB | ~9.2 GB | Roughly the same (transformer weights unchanged; embeddings resized) |
| Hidden Size | 2,560 | 2,560 | Same |
| Layers | 36 | 36 | Same |
| Attention Heads | 32 | 32 | Same |
| dtype | bfloat16 | bfloat16 | Same |
## Key Benefits

### 1. Effective Larger Context for English
Since non-English tokens are removed, English text tokenizes more efficiently:
- Same text uses fewer tokens
- 262K context window goes further for English content
- Estimated 20-30% more English text fits in context
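To check the saving on your own data, you can count tokens under both tokenizers directly. A minimal sketch, using the repo IDs listed on this card; the measured ratio will vary with the text:

```python
from transformers import AutoTokenizer

# Compare English token counts under the original and the pruned tokenizer.
original = AutoTokenizer.from_pretrained("Qwen/Qwen3-VL-4B-Thinking")
pruned = AutoTokenizer.from_pretrained("DavidrPatton/Qwen3-VL-4B-English-Thinking")

text = "Vision-language models align image patches with text tokens."
n_orig, n_pruned = len(original.encode(text)), len(pruned.encode(text))
print(f"original: {n_orig} tokens, pruned: {n_pruned} tokens")
```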
### 2. Faster Tokenization
- Smaller vocabulary = faster token lookup
- Reduced embedding table size
- Marginally faster inference
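As a back-of-the-envelope check on the embedding saving (the specification table above lists the overall checkpoint size as roughly unchanged, so treat this as the size of the removed rows rather than a guaranteed on-disk saving):

```python
# Size of the 46,500 removed embedding rows in bfloat16 (2 bytes per value).
removed, hidden, bytes_per_value = 46_500, 2_560, 2
print(f"~{removed * hidden * bytes_per_value / 2**20:.0f} MiB per embedding matrix")
# -> ~227 MiB
```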
### 3. Preserved Capabilities

- All English-language capabilities
- Programming/code tokens (ASCII preserved)
- JSON, XML, and Markdown support
- Vision-language multimodal input
- Chain-of-thought reasoning (`<think>` tokens)
- Tool-calling support
## Technical Details

### Architecture
| Component | Value |
|---|---|
| Model Type | Qwen3VLForConditionalGeneration |
| Hidden Size | 2,560 |
| Intermediate Size | 9,728 |
| Num Layers | 36 |
| Num Attention Heads | 32 |
| Num KV Heads | 8 (GQA) |
| Head Dim | 128 |
| RoPE Theta | 5,000,000 |
| Max Position Embeddings | 262,144 |
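These values can be sanity-checked against the published `config.json` without downloading the weights. A minimal sketch; the `text_config` nesting follows recent multimodal layouts in `transformers`, with a fallback in case the config is flat:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("DavidrPatton/Qwen3-VL-4B-English-Thinking")
text_cfg = getattr(config, "text_config", config)  # fall back to a flat config
print(text_cfg.hidden_size,          # 2560
      text_cfg.num_hidden_layers,    # 36
      text_cfg.num_attention_heads,  # 32
      text_cfg.num_key_value_heads)  # 8
```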
### Vision Encoder
| Component | Value |
|---|---|
| Type | ViT (Vision Transformer) |
| Hidden Size | 1,024 |
| Depth | 24 layers |
| Num Heads | 16 |
| Patch Size | 16 |
| Spatial Merge Size | 2 |
### Vocabulary Pruning Details
| Category | Tokens Kept | Tokens Removed |
|---|---|---|
| ASCII (English + Code) | 94,351 | - |
| Special Tokens | 33 | - |
| Whitespace | 12 | - |
| Other (Punctuation, etc.) | 10,773 | - |
| Chinese/Japanese/Korean | - | 25,665 |
| Cyrillic (Russian, etc.) | - | 4,129 |
| Arabic | - | 3,643 |
| Korean | - | 3,544 |
| Hebrew | - | 3,164 |
| Thai | - | 2,571 |
| Japanese (additional) | - | 1,541 |
| Vietnamese | - | 1,174 |
| Other Unicode | - | 780+ |
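A minimal sketch of the script-bucketing idea behind this table, classifying a decoded token by the Unicode names of its characters. The actual pruning script is not published, so these rules are illustrative assumptions:

```python
import unicodedata

SCRIPT_KEYS = ("CJK", "HANGUL", "HIRAGANA", "KATAKANA",
               "CYRILLIC", "ARABIC", "HEBREW", "THAI")

def script_of(token_text: str) -> str:
    """Bucket a token by the first recognizable script among its characters."""
    for ch in token_text:
        try:
            name = unicodedata.name(ch)
        except ValueError:
            continue  # skip unnamed control characters
        for key in SCRIPT_KEYS:
            if key in name:
                return key
    return "ASCII" if token_text.isascii() else "Other Unicode"

print(script_of("hello"))   # ASCII
print(script_of("你好"))     # CJK
print(script_of("привет"))  # CYRILLIC
```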
### Special Tokens Preserved

All special tokens are preserved for full functionality:

| Token | Purpose |
|---|---|
| `<\|im_start\|>` / `<\|im_end\|>` | Chat format markers |
| `<\|vision_start\|>` / `<\|vision_end\|>` | Vision input markers |
| `<\|image_pad\|>` / `<\|video_pad\|>` | Image/video padding |
| `<think>` / `</think>` | Chain-of-thought reasoning |
| `<tool_call>` / `</tool_call>` | Tool/function calling |
| `<\|fim_prefix\|>` / `<\|fim_middle\|>` / `<\|fim_suffix\|>` | Fill-in-the-middle coding |
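Because the `<think>` markers survive pruning, a generation can still be split into its reasoning and its final answer. A minimal sketch; decode with `skip_special_tokens=False` if your build registers the markers as special tokens:

```python
import re

def split_thinking(decoded: str):
    """Return (reasoning, answer) from text containing <think>...</think>."""
    match = re.search(r"<think>(.*?)</think>", decoded, flags=re.DOTALL)
    if match is None:
        return None, decoded.strip()
    return match.group(1).strip(), decoded[match.end():].strip()

reasoning, answer = split_thinking("<think>2 + 2 = 4</think>The answer is 4.")
print(reasoning)  # 2 + 2 = 4
print(answer)     # The answer is 4.
```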
## Usage

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# The model type is Qwen3VLForConditionalGeneration (see Architecture above).
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "DavidrPatton/Qwen3-VL-4B-English-Thinking",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    "DavidrPatton/Qwen3-VL-4B-English-Thinking"
)

# Text-only example
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```
### With Images

```python
from PIL import Image

image = Image.open("example.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe what you see in this image."}
        ]
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```
## Hardware Requirements
| Configuration | VRAM Required |
|---|---|
| Full Precision (fp32) | ~18 GB |
| Half Precision (fp16/bf16) | ~10 GB |
| 8-bit Quantized | ~6 GB |
| 4-bit Quantized | ~4 GB |
Recommended: NVIDIA RTX 4070 Ti SUPER or better (16 GB VRAM)
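To hit the 4-bit row above, a minimal loading sketch with `bitsandbytes`; this assumes `bitsandbytes` is installed and a `transformers` build recent enough to include the Qwen3-VL classes:

```python
import torch
from transformers import BitsAndBytesConfig, Qwen3VLForConditionalGeneration

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "DavidrPatton/Qwen3-VL-4B-English-Thinking",
    quantization_config=quant_config,
    device_map="auto",
)
```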
## Model Files

| File | Size | Description |
|---|---|---|
| `model-00001-of-00002.safetensors` | 4.95 GB | Model weights (part 1) |
| `model-00002-of-00002.safetensors` | 4.22 GB | Model weights (part 2) |
| `tokenizer.json` | 11.4 MB | Pruned English vocabulary |
| `token_remapper.json` | 4.4 MB | Original-to-pruned token ID mapping |
| `token_remapper.pt` | 2.1 MB | PyTorch remapper tensor |
| `vocab.json` | 2.8 MB | Vocabulary dictionary |
| `merges.txt` | 1.7 MB | BPE merge rules |
| `config.json` | 1.6 KB | Model configuration |
## Limitations

- **English Only**: Non-English languages are not supported
- **Not Fine-Tuned**: This is a vocabulary-pruned version, not a fine-tuned model
- **Token ID Remapping**: Applications that consume original Qwen3-VL token IDs may need the `token_remapper` files for compatibility (see the sketch below)
- Inherits the base model's capabilities and limitations
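A minimal sketch of applying the remapper; the on-disk layout of `token_remapper.json` is not documented on this card, so the `{original_id: pruned_id}` structure assumed below should be verified against the actual file:

```python
import json

# Assumed layout: a JSON object mapping original token IDs (as strings)
# to pruned token IDs. Check the real file before relying on this.
with open("token_remapper.json") as f:
    remap = {int(k): v for k, v in json.load(f).items()}

original_ids = [101, 2057, 318]  # placeholder IDs from the original tokenizer
pruned_ids = [remap[i] for i in original_ids if i in remap]
print(pruned_ids)
```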
## How It Works

The vocabulary pruning process:

1. Analyzed all 151,669 tokens in the original Qwen3-VL vocabulary
2. Categorized tokens by script (ASCII, CJK, Cyrillic, Arabic, etc.)
3. Preserved ASCII tokens (English + programming), special tokens, and essential symbols
4. Removed 46,500 non-English tokens
5. Remapped the remaining token IDs into a contiguous range
6. Adjusted the embedding layers to match the new vocabulary size
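A minimal sketch of the last two steps (remapping and resizing); `kept_ids` would come from the categorization step, and all names here are illustrative rather than taken from the author's actual script:

```python
import torch

def prune_embedding(weight: torch.Tensor, kept_ids: list[int]) -> torch.Tensor:
    """Select the kept rows of a [vocab, hidden] matrix; row order in
    `kept_ids` defines the new, contiguous token ID space."""
    return weight.index_select(0, torch.tensor(kept_ids, dtype=torch.long))

demo = torch.randn(10, 4)                      # stand-in for a [151669, 2560] matrix
print(prune_embedding(demo, [0, 2, 5]).shape)  # torch.Size([3, 4])
```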
## License

Apache 2.0, the same as the base Qwen model.
## Acknowledgments

- Qwen Team for the excellent Qwen3-VL base model
- Based on Qwen/Qwen3-VL-4B-Thinking