# Qwen3-VL-4B-English-Thinking

An English-only, vocabulary-pruned version of Qwen3-VL-4B-Thinking with a 30.7% smaller vocabulary and more efficient tokenization for English text.

## Model Specifications

| Specification | Original (Qwen3-VL-4B) | This Model (English-Only) | Improvement |
|---|---|---|---|
| Vocabulary Size | 151,669 tokens | 105,169 tokens | -30.7% (46,500 tokens removed) |
| Context Window | 262,144 tokens | 262,144 tokens | Same max, but ~20-30% more effective for English |
| Parameters | ~4B | ~4B | Same |
| Model Size | ~9.2 GB | ~9.2 GB | Same (weights unchanged) |
| Hidden Size | 2,560 | 2,560 | Same |
| Layers | 36 | 36 | Same |
| Attention Heads | 32 | 32 | Same |
| dtype | bfloat16 | bfloat16 | Same |

## Key Benefits

### 1. Effective Larger Context for English

Since non-English tokens are removed, English text tokenizes more efficiently:

- The same text uses fewer tokens
- The 262K context window goes further for English content
- An estimated 20-30% more English text fits in context
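As a back-of-envelope illustration of the 20-30% estimate (the tokens-per-word ratios below are illustrative assumptions, not measured values for this tokenizer):

```python
# Illustrative only: the tokens-per-word ratios are assumed, not measured.
def words_in_context(context_tokens, tokens_per_word):
    """Rough number of English words that fit in the context window."""
    return int(context_tokens / tokens_per_word)

CONTEXT = 262_144
baseline = words_in_context(CONTEXT, 1.35)  # assumed ratio, original tokenizer
pruned = words_in_context(CONTEXT, 1.08)    # assumed ratio, pruned tokenizer

gain = pruned / baseline - 1
print(f"~{gain:.0%} more English words per window")  # ~25% under these assumptions
```

Measuring the real gain requires tokenizing a representative English corpus with both tokenizers and comparing counts.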

### 2. Faster Tokenization

- Smaller vocabulary = faster token lookup
- Reduced embedding table size
- Marginally faster inference

### 3. Preserved Capabilities

- All English language capabilities
- Programming/code tokens (ASCII preserved)
- JSON, XML, and Markdown support
- Vision-language multimodal input
- Chain-of-thought reasoning (`<think>` tokens)
- Tool calling support

## Technical Details

### Architecture

| Component | Value |
|---|---|
| Model Type | Qwen3VLForConditionalGeneration |
| Hidden Size | 2,560 |
| Intermediate Size | 9,728 |
| Num Layers | 36 |
| Num Attention Heads | 32 |
| Num KV Heads | 8 (GQA) |
| Head Dim | 128 |
| RoPE Theta | 5,000,000 |
| Max Position Embeddings | 262,144 |
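These values translate directly into memory arithmetic. For example, a sketch of the per-token KV-cache footprint implied by the GQA configuration (assuming an unquantized bf16 cache):

```python
# KV-cache footprint implied by the architecture table (bf16 cache assumed).
layers, kv_heads, head_dim = 36, 8, 128
bytes_per_value = 2  # bfloat16

# K and V each hold kv_heads * head_dim values per layer, per token.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(kv_bytes_per_token)  # 147456 bytes = 144 KiB per token

full_context_gib = kv_bytes_per_token * 262_144 / 2**30
print(f"{full_context_gib:.1f} GiB")  # 36.0 GiB for the full 262,144-token window
```

This is why long-context runs are dominated by KV-cache memory rather than the ~9.2 GB of weights.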

### Vision Encoder

| Component | Value |
|---|---|
| Type | ViT (Vision Transformer) |
| Hidden Size | 1,024 |
| Depth | 24 layers |
| Num Heads | 16 |
| Patch Size | 16 |
| Spatial Merge Size | 2 |
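The patch size and spatial merge size determine how many tokens an image consumes. A rough estimate (actual preprocessing may resize or pad the image first, so treat this as a sketch):

```python
# Rough vision-token count: divide each dimension by the patch size,
# then by the spatial merge size, and multiply.
def vision_tokens(width, height, patch=16, merge=2):
    return (width // patch // merge) * (height // patch // merge)

print(vision_tokens(1024, 1024))  # 1024 tokens
print(vision_tokens(1920, 1080))  # 1980 tokens
```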

### Vocabulary Pruning Details

| Category | Tokens Kept | Tokens Removed |
|---|---|---|
| ASCII (English + Code) | 94,351 | - |
| Special Tokens | 33 | - |
| Whitespace | 12 | - |
| Other (Punctuation, etc.) | 10,773 | - |
| Chinese/Japanese/Korean | - | 25,665 |
| Cyrillic (Russian, etc.) | - | 4,129 |
| Arabic | - | 3,643 |
| Korean | - | 3,544 |
| Hebrew | - | 3,164 |
| Thai | - | 2,571 |
| Japanese (additional) | - | 1,541 |
| Vietnamese | - | 1,174 |
| Other Unicode | - | 780+ |
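The kept categories sum exactly to the pruned vocabulary size:

```python
# Sanity check: kept-token categories from the table above.
kept = {
    "ASCII (English + Code)": 94_351,
    "Special Tokens": 33,
    "Whitespace": 12,
    "Other (Punctuation, etc.)": 10_773,
}
total = sum(kept.values())
print(total)  # 105169, matching the pruned vocabulary size
```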

### Special Tokens Preserved

All special tokens are preserved for full functionality:

Token Purpose
<|im_start|> / <|im_end|> Chat format markers
<|vision_start|> / <|vision_end|> Vision input markers
<|image_pad|> / <|video_pad|> Image/video padding
<think> / </think> Chain-of-thought reasoning
<tool_call> / </tool_call> Tool/function calling
<|fim_prefix|> / <|fim_middle|> / <|fim_suffix|> Fill-in-middle coding
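A minimal sketch of the chat layout the `<|im_start|>` / `<|im_end|>` markers implement; in practice `processor.apply_chat_template` builds this string for you:

```python
# Sketch of the Qwen-style chat layout; apply_chat_template does this for you.
def format_turn(role, content):
    return f"<|im_start|>{role}\n{content}<|im_end|>\n"

# A user turn followed by the opening of the assistant turn:
prompt = format_turn("user", "Hello") + "<|im_start|>assistant\n"
print(prompt)
```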

## Usage

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "DavidrPatton/Qwen3-VL-4B-English-Thinking",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    "DavidrPatton/Qwen3-VL-4B-English-Thinking"
)

# Text-only example
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```

### With Images

```python
from PIL import Image

image = Image.open("example.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe what you see in this image."}
        ]
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```

## Hardware Requirements

| Configuration | VRAM Required |
|---|---|
| Full Precision (fp32) | ~18 GB |
| Half Precision (fp16/bf16) | ~10 GB |
| 8-bit Quantized | ~6 GB |
| 4-bit Quantized | ~4 GB |
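These figures follow from parameter count times bytes per parameter, plus runtime overhead (activations, KV cache, framework buffers) that the table folds in:

```python
# Weight memory alone: parameter count * bytes per parameter.
# The table's figures are higher because they include runtime overhead.
params = 4e9
for name, bytes_per_param in [("fp32", 4), ("bf16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = params * bytes_per_param / 2**30
    print(f"{name}: ~{gib:.1f} GiB of weights")
```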

Recommended: NVIDIA RTX 4070 Ti or better (16GB VRAM)

## Model Files

| File | Size | Description |
|---|---|---|
| `model-00001-of-00002.safetensors` | 4.95 GB | Model weights (part 1) |
| `model-00002-of-00002.safetensors` | 4.22 GB | Model weights (part 2) |
| `tokenizer.json` | 11.4 MB | Pruned English vocabulary |
| `token_remapper.json` | 4.4 MB | Original-to-pruned token ID mapping |
| `token_remapper.pt` | 2.1 MB | PyTorch remapper tensor |
| `vocab.json` | 2.8 MB | Vocabulary dictionary |
| `merges.txt` | 1.7 MB | BPE merge rules |
| `config.json` | 1.6 KB | Model configuration |

## Limitations

- **English Only:** non-English languages are not supported
- **Not Fine-Tuned:** this is a vocabulary-pruned version, not a fine-tuned model
- **Token ID Remapping:** some applications may need the `token_remapper` files for compatibility
- **Inherited Behavior:** the model keeps the base model's capabilities and limitations
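A hypothetical sketch of how an application might use the remapper; the exact schema of `token_remapper.json` is an assumption here (a dict mapping original token IDs to pruned IDs):

```python
# Hypothetical: the schema of token_remapper.json is assumed to be a
# dict of original-vocabulary ID -> pruned-vocabulary ID.
def remap_ids(original_ids, remapper):
    """Translate original IDs to pruned IDs, dropping removed tokens."""
    return [remapper[i] for i in original_ids if i in remapper]

# Toy mapping standing in for the real file:
remapper = {0: 0, 5: 1, 151_668: 105_168}
print(remap_ids([0, 5, 151_668], remapper))  # [0, 1, 105168]
```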

## How It Works

The vocabulary pruning process:

1. Analyzed all 151,669 tokens in the original Qwen3-VL vocabulary
2. Categorized tokens by script (ASCII, CJK, Cyrillic, Arabic, etc.)
3. Preserved ASCII tokens (English + programming), special tokens, and essential symbols
4. Removed 46,500 non-English tokens
5. Remapped token IDs to maintain a contiguous vocabulary
6. Adjusted embedding layers to match the new vocabulary size
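A toy sketch of steps 2-5, assuming a simple token-to-ID vocabulary dict (the real pipeline also has to handle special tokens, BPE merges, and the embedding matrices):

```python
# Toy version of steps 2-5: keep ASCII-only tokens, then renumber so the
# pruned vocabulary stays contiguous.
def prune_vocab(vocab):
    """vocab: token -> original ID. Returns (new_vocab, id_remapper)."""
    kept = sorted((t for t in vocab if t.isascii()), key=lambda t: vocab[t])
    new_vocab = {tok: new_id for new_id, tok in enumerate(kept)}
    remapper = {vocab[tok]: new_id for tok, new_id in new_vocab.items()}
    return new_vocab, remapper

new_vocab, remapper = prune_vocab({"the": 0, "你好": 1, "code": 2})
print(new_vocab)  # {'the': 0, 'code': 1}
print(remapper)   # {0: 0, 2: 1}
```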

## License

Apache 2.0 - Same as the base Qwen model.
