Qwen3.5 Text-Only

If all you need is text, these are the Qwen3.5 models for you.

Trimmed checkpoints of the Qwen3.5 model family with vision encoder weights removed — smaller files, lower VRAM, drop-in text-only replacement.

⚠️ Disclaimer: These models were tested exclusively with HuggingFace Transformers (≥5.2.0). vLLM, SGLang, llama.cpp, Ollama, and other inference engines are not supported yet — partly because Transformers 5 support is still cooking in those projects, and partly because we just threw these checkpoints on the Hub while messing around in the lab. If you get any of these running on other engines, we'd love to hear about it — open a discussion or drop a community post. We didn't set out to build a production-ready model zoo; we just left the oven door open. Use accordingly.

For official details on the Qwen3.5 model family — architecture, benchmarks, training data, and intended use — see the original Qwen3.5 model card.

How It Works

The Qwen3.5 architecture consists of a vision encoder and a language model sharing a single checkpoint. During text-only inference the vision encoder is never called, but its weights are still loaded into memory. By loading the checkpoint with Qwen3_5ForCausalLM instead of Qwen3_5ForConditionalGeneration, HuggingFace Transformers instantiates only the language model component. Re-saving that model produces a checkpoint with no vision weights, which can subsequently be loaded with the standard AutoModelForCausalLM interface.
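The trimming step described above can be sketched as follows. This is a minimal sketch, not the exact script we used: the `Qwen3_5ForCausalLM` / `Qwen3_5ForConditionalGeneration` class names come from the description above, while the source repo id is an assumption, and running it requires downloading the original multimodal weights.

```python
from transformers import AutoTokenizer, Qwen3_5ForCausalLM

src = "Qwen/Qwen3.5-4B"          # assumed repo id of the original multimodal checkpoint
dst = "./Qwen3.5-4B-text-only"

# Loading with the causal-LM class instantiates only the language model;
# the vision encoder weights in the checkpoint are never materialized.
model = Qwen3_5ForCausalLM.from_pretrained(src)
tokenizer = AutoTokenizer.from_pretrained(src)

# Re-saving writes a checkpoint without vision weights, which can then be
# loaded with the standard AutoModelForCausalLM interface.
model.save_pretrained(dst)
tokenizer.save_pretrained(dst)
```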

Why bother?

  • Lower VRAM — vision encoder weights are never loaded, reducing peak memory usage by roughly 5–15% depending on model size
  • Smaller checkpoints — faster downloads and storage savings
  • Simpler loading — standard AutoModelForCausalLM, no multimodal dependencies
  • Drop-in replacement — identical tokenizer, same chat template, same text generation behavior as the original Qwen3.5 models

Available Models

  • principled-intelligence/Qwen3.5-0.8B-text-only
  • principled-intelligence/Qwen3.5-2B-text-only
  • principled-intelligence/Qwen3.5-4B-text-only
  • principled-intelligence/Qwen3.5-9B-text-only

Size Reduction

We compared each text-only checkpoint against its original Qwen3.5 counterpart across three metrics: file size on disk, peak VRAM usage when loaded in float16 with device_map="auto", and total parameter count. Savings scale with the vision encoder's share of total parameters, so the smaller models generally see the largest percentage drops.
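The VRAM and parameter measurements can be sketched roughly like this. This is our guess at a typical measurement script, not necessarily the one used for the tables; it assumes `dtype` (the Transformers 5 name for the older `torch_dtype` argument) and requires a CUDA GPU.

```python
import torch
from transformers import AutoModelForCausalLM

torch.cuda.reset_peak_memory_stats()
model = AutoModelForCausalLM.from_pretrained(
    "principled-intelligence/Qwen3.5-4B-text-only",
    dtype=torch.float16,
    device_map="auto",
)

# Peak VRAM right after loading (no generation) and total parameter count.
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
print(f"params:    {sum(p.numel() for p in model.parameters()) / 1e9:.2f} B")
```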

Qwen3.5-0.8B vs. Qwen3.5-0.8B-text-only

| Metric | Qwen3.5 | Text-Only | Reduction |
|---|---|---|---|
| File size (GB) | 1.75 | 1.50 | ~14% |
| VRAM (GB) | 1.59 | 1.40 | ~12% |
| Parameters (B) | 0.85 | 0.75 | ~12% |

Qwen3.5-2B vs. Qwen3.5-2B-text-only

| Metric | Qwen3.5 | Text-Only | Reduction |
|---|---|---|---|
| File size (GB) | 4.55 | 3.76 | ~17% |
| VRAM (GB) | 4.12 | 3.51 | ~15% |
| Parameters (B) | 2.21 | 1.88 | ~15% |

Qwen3.5-4B vs. Qwen3.5-4B-text-only

| Metric | Qwen3.5 | Text-Only | Reduction |
|---|---|---|---|
| File size (GB) | 9.32 | 8.41 | ~10% |
| VRAM (GB) | 8.45 | 7.83 | ~7% |
| Parameters (B) | 4.54 | 4.21 | ~7% |

Qwen3.5-9B vs. Qwen3.5-9B-text-only

| Metric | Qwen3.5 | Text-Only | Reduction |
|---|---|---|---|
| File size (GB) | 19.32 | 17.90 | ~7% |
| VRAM (GB) | 17.52 | 16.68 | ~5% |
| Parameters (B) | 9.41 | 8.95 | ~5% |
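As a sanity check, the Reduction columns follow directly from the raw numbers in the tables above — plain arithmetic, no model download needed:

```python
def reduction_pct(original: float, trimmed: float) -> float:
    """Percentage saved by the text-only checkpoint relative to the original."""
    return 100 * (original - trimmed) / original

# File-size reductions (GB) from the tables above.
for name, orig, slim in [
    ("0.8B", 1.75, 1.50),
    ("2B",   4.55, 3.76),
    ("4B",   9.32, 8.41),
    ("9B",  19.32, 17.90),
]:
    print(f"Qwen3.5-{name}: ~{reduction_pct(orig, slim):.0f}%")
# → ~14%, ~17%, ~10%, ~7%
```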

Quickstart

Transformers ≥ 5.2.0 is required (quote the version spec so your shell doesn't treat `>=` as a redirect):

```shell
uv pip install "transformers>=5.2.0"
```

Load and run inference exactly like any causal LM:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "principled-intelligence/Qwen3.5-4B-text-only"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "user", "content": "What is the capital of Italy?"},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.decode(output_ids[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```

You can also use pipeline for a simpler interface:

```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="principled-intelligence/Qwen3.5-4B-text-only",
    device_map="auto",
)

messages = [{"role": "user", "content": "What is the capital of Italy?"}]
print(pipe(messages, max_new_tokens=512))
```

Qwen3.5 thinks by default, generating <think>...</think> content before the final response. To disable thinking, pass chat_template_kwargs={"enable_thinking": False} in your generation call or API request.
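At the tokenizer level, the toggle looks roughly like this — a sketch that assumes Qwen3.5's chat template exposes the same `enable_thinking` keyword as Qwen3, and requires downloading the tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("principled-intelligence/Qwen3.5-4B-text-only")

messages = [{"role": "user", "content": "What is the capital of Italy?"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # assumed kwarg: skip the <think>...</think> preamble
)
```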

Contributing

Contributions are welcome! Whether it's getting these checkpoints running on vLLM, SGLang, llama.cpp, Ollama, or something else entirely — we'd love your help. Bug reports, compatibility notes, and PRs are all appreciated. Open a discussion or community post and let us know what you find.

License

These checkpoints are released under the Apache 2.0 License, consistent with the original Qwen3.5 models.


Made with love from Principled Intelligence ❤️

Learn more about what we build at Principled Intelligence on our website.
