Qwen3.5 Text-Only

If all you need is text, these are the Qwen3.5 models for you.

Trimmed checkpoints of the Qwen3.5 model family with vision encoder weights removed — smaller files, lower VRAM, drop-in text-only replacement.

⚠️ Disclaimer: These models were tested exclusively with HuggingFace Transformers (≥5.2.0). vLLM, SGLang, llama.cpp, Ollama, and other inference engines are not supported yet — partly because Transformers 5 support is still cooking in those projects, and partly because we just threw these checkpoints on the Hub while messing around in the lab. If you get any of these running on other engines, we'd love to hear about it — open a discussion or drop a community post. We didn't set out to build a production-ready model zoo; we just left the oven door open. Use accordingly.

For official details on the Qwen3.5 model family — architecture, benchmarks, training data, and intended use — see the original Qwen3.5 model card.

How It Works

The Qwen3.5 architecture consists of a vision encoder and a language model sharing a single checkpoint. During text-only inference the vision encoder is never called, but its weights are still loaded into memory. By loading the checkpoint with Qwen3_5ForCausalLM instead of Qwen3_5ForConditionalGeneration, HuggingFace Transformers instantiates only the language model component. Re-saving that model produces a checkpoint with no vision weights, which can subsequently be loaded with the standard AutoModelForCausalLM interface.
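The trimming step described above can be sketched as follows. This is a minimal sketch, not the exact script we used: the `Qwen3_5ForCausalLM` / `Qwen3_5ForConditionalGeneration` class names come from the description above, while the source repo id is an assumption, and running it requires downloading the original multimodal weights.

```python
from transformers import AutoTokenizer, Qwen3_5ForCausalLM

src = "Qwen/Qwen3.5-4B"          # assumed repo id of the original multimodal checkpoint
dst = "./Qwen3.5-4B-text-only"

# Loading with the causal-LM class instantiates only the language model;
# the vision encoder weights in the checkpoint are never materialized.
model = Qwen3_5ForCausalLM.from_pretrained(src)
tokenizer = AutoTokenizer.from_pretrained(src)

# Re-saving writes a checkpoint without vision weights, which can then be
# loaded with the standard AutoModelForCausalLM interface.
model.save_pretrained(dst)
tokenizer.save_pretrained(dst)
```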

Why bother?

  • Lower VRAM — vision encoder weights are never loaded, reducing peak memory usage by roughly 5–15% depending on model size
  • Smaller checkpoints — faster downloads and storage savings
  • Simpler loading — standard AutoModelForCausalLM, no multimodal dependencies
  • Drop-in replacement — identical tokenizer, same chat template, same text generation behavior as the original Qwen3.5 models

Available Models

  • principled-intelligence/Qwen3.5-0.8B-text-only
  • principled-intelligence/Qwen3.5-2B-text-only
  • principled-intelligence/Qwen3.5-4B-text-only
  • principled-intelligence/Qwen3.5-9B-text-only

Size Reduction

We compared each text-only checkpoint against its original Qwen3.5 counterpart across three metrics: file size on disk, peak VRAM usage when loaded in float16 with device_map="auto", and total parameter count. Savings scale with the vision encoder's share of total parameters, so the smaller models generally see the largest percentage drops.
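The VRAM and parameter measurements can be sketched roughly like this. This is our guess at a typical measurement script, not necessarily the one used for the tables; it assumes `dtype` (the Transformers 5 name for the older `torch_dtype` argument) and requires a CUDA GPU.

```python
import torch
from transformers import AutoModelForCausalLM

torch.cuda.reset_peak_memory_stats()
model = AutoModelForCausalLM.from_pretrained(
    "principled-intelligence/Qwen3.5-4B-text-only",
    dtype=torch.float16,
    device_map="auto",
)

# Peak VRAM right after loading (no generation) and total parameter count.
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
print(f"params:    {sum(p.numel() for p in model.parameters()) / 1e9:.2f} B")
```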

Qwen3.5-0.8B vs. Qwen3.5-0.8B-text-only

| Metric | Qwen3.5 | Text-Only | Reduction |
|---|---|---|---|
| File size (GB) | 1.75 | 1.50 | ~14% |
| VRAM (GB) | 1.59 | 1.40 | ~12% |
| Parameters (B) | 0.85 | 0.75 | ~12% |

Qwen3.5-2B vs. Qwen3.5-2B-text-only

| Metric | Qwen3.5 | Text-Only | Reduction |
|---|---|---|---|
| File size (GB) | 4.55 | 3.76 | ~17% |
| VRAM (GB) | 4.12 | 3.51 | ~15% |
| Parameters (B) | 2.21 | 1.88 | ~15% |

Qwen3.5-4B vs. Qwen3.5-4B-text-only

| Metric | Qwen3.5 | Text-Only | Reduction |
|---|---|---|---|
| File size (GB) | 9.32 | 8.41 | ~10% |
| VRAM (GB) | 8.45 | 7.83 | ~7% |
| Parameters (B) | 4.54 | 4.21 | ~7% |

Qwen3.5-9B vs. Qwen3.5-9B-text-only

| Metric | Qwen3.5 | Text-Only | Reduction |
|---|---|---|---|
| File size (GB) | 19.32 | 17.90 | ~7% |
| VRAM (GB) | 17.52 | 16.68 | ~5% |
| Parameters (B) | 9.41 | 8.95 | ~5% |
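As a sanity check, the Reduction columns follow directly from the raw numbers in the tables above — plain arithmetic, no model download needed:

```python
def reduction_pct(original: float, trimmed: float) -> float:
    """Percentage saved by the text-only checkpoint relative to the original."""
    return 100 * (original - trimmed) / original

# File-size reductions (GB) from the tables above.
for name, orig, slim in [
    ("0.8B", 1.75, 1.50),
    ("2B",   4.55, 3.76),
    ("4B",   9.32, 8.41),
    ("9B",  19.32, 17.90),
]:
    print(f"Qwen3.5-{name}: ~{reduction_pct(orig, slim):.0f}%")
# → ~14%, ~17%, ~10%, ~7%
```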

Quickstart

Transformers ≥ 5.2.0 is required (quote the version spec so your shell doesn't treat `>=` as a redirect):

```shell
uv pip install "transformers>=5.2.0"
```

Load and run inference exactly like any causal LM:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "principled-intelligence/Qwen3.5-4B-text-only"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "user", "content": "What is the capital of Italy?"},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.decode(output_ids[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```

You can also use pipeline for a simpler interface:

```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="principled-intelligence/Qwen3.5-4B-text-only",
    device_map="auto",
)

messages = [{"role": "user", "content": "What is the capital of Italy?"}]
print(pipe(messages, max_new_tokens=512))
```

Qwen3.5 thinks by default, generating <think>...</think> content before the final response. To disable thinking, pass chat_template_kwargs={"enable_thinking": False} in your generation call or API request.
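At the tokenizer level, the toggle looks roughly like this — a sketch that assumes Qwen3.5's chat template exposes the same `enable_thinking` keyword as Qwen3, and requires downloading the tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("principled-intelligence/Qwen3.5-4B-text-only")

messages = [{"role": "user", "content": "What is the capital of Italy?"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # assumed kwarg: skip the <think>...</think> preamble
)
```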

Contributing

Contributions are welcome! Whether it's getting these checkpoints running on vLLM, SGLang, llama.cpp, Ollama, or something else entirely — we'd love your help. Bug reports, compatibility notes, and PRs are all appreciated. Open a discussion or community post and let us know what you find.

License

These checkpoints are released under the Apache 2.0 License, consistent with the original Qwen3.5 models.


Made with love from Principled Intelligence ❤️

Learn more about what we build at Principled Intelligence on our website.
