Instructions for using ig1/Qwen3.5-27B-NVFP4 with libraries and local apps.

Libraries

How to use ig1/Qwen3.5-27B-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="ig1/Qwen3.5-27B-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("ig1/Qwen3.5-27B-NVFP4")
model = AutoModelForMultimodalLM.from_pretrained("ig1/Qwen3.5-27B-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use ig1/Qwen3.5-27B-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ig1/Qwen3.5-27B-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ig1/Qwen3.5-27B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/ig1/Qwen3.5-27B-NVFP4

SGLang

How to use ig1/Qwen3.5-27B-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ig1/Qwen3.5-27B-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ig1/Qwen3.5-27B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ig1/Qwen3.5-27B-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ig1/Qwen3.5-27B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use ig1/Qwen3.5-27B-NVFP4 with Docker Model Runner:
```
docker model run hf.co/ig1/Qwen3.5-27B-NVFP4
```

Qwen3.5-27B-NVFP4 by IG1

Quantization

This model was quantized with llm-compressor v0.10.1.dev31+geb49917e (just after Qwen3.5 support was merged) and transformers v5.3.0. The quantization is based on the official example with a few modifications (see next section).

Quantization particularities

The sequence length has been increased from 4096 to 8192 and the number of samples from 256 to 1024. The 1024 samples come from 4 different datasets:

256 general conversation samples (UltraChat)
256 math reasoning samples (GSM8K)
256 code samples (CodeAlpaca)
256 multilingual samples (Aya)

You can find the quantization script here.
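The calibration mix described above can be summarized in a few lines of Python. This is only a sketch of the bookkeeping, not the quantization script itself: the `CalibrationSource` type and the helper names are hypothetical, and only the dataset names, per-source counts, and sequence length come from this card.

```python
from dataclasses import dataclass

MAX_SEQUENCE_LENGTH = 8192   # raised from 4096 in the official example
SAMPLES_PER_SOURCE = 256     # each of the 4 sources contributes equally


@dataclass(frozen=True)
class CalibrationSource:
    """One slice of the calibration set (hypothetical helper type)."""
    name: str
    domain: str
    samples: int


CALIBRATION_MIX = [
    CalibrationSource("UltraChat", "general conversation", SAMPLES_PER_SOURCE),
    CalibrationSource("GSM8K", "math reasoning", SAMPLES_PER_SOURCE),
    CalibrationSource("CodeAlpaca", "code", SAMPLES_PER_SOURCE),
    CalibrationSource("Aya", "multilingual", SAMPLES_PER_SOURCE),
]


def total_samples(mix):
    """Number of calibration samples across all sources."""
    return sum(source.samples for source in mix)


def token_budget(mix, max_len=MAX_SEQUENCE_LENGTH):
    """Upper bound on calibration tokens if every sample fills max_len."""
    return total_samples(mix) * max_len


if __name__ == "__main__":
    print(total_samples(CALIBRATION_MIX))  # 1024
    print(token_budget(CALIBRATION_MIX))   # 8388608
```

The token budget is an upper bound: samples shorter than 8192 tokens contribute fewer tokens.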

While the quantization required transformers v5, the original (transformers v4) tokenizer files have been put back for simple execution on current vLLM versions. The transformers v5 tokenizer files produced by llm-compressor can be found in the transformers_v5 folder.
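If you want the transformers v5 tokenizer files instead of the default ones, one possible way is the `subfolder` argument of `AutoTokenizer.from_pretrained`. The sketch below assumes that the transformers_v5 folder sits at the repository root and holds a complete set of tokenizer files. The helper names are hypothetical.

```python
REPO_ID = "ig1/Qwen3.5-27B-NVFP4"
V5_TOKENIZER_SUBFOLDER = "transformers_v5"  # folder named in this card


def tokenizer_kwargs(use_v5_files=False):
    """Extra from_pretrained arguments selecting which tokenizer files to load."""
    return {"subfolder": V5_TOKENIZER_SUBFOLDER} if use_v5_files else {}


def load_tokenizer(use_v5_files=False):
    """Load the tokenizer from the repo root (v4 files) or from the v5 folder."""
    # Imported lazily so this module can be read without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(REPO_ID, **tokenizer_kwargs(use_v5_files))
```

Calling `load_tokenizer(use_v5_files=True)` downloads the files from the Hub, so it needs network access and a transformers version that can read them.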

Qwen3.5 Profiles

Alongside support for dynamic thinking and non-thinking modes, the Qwen team offers 4 sampling parameter profiles:

Thinking General
Thinking Coding
Instruct General
Instruct Reasoning (we prefer to call it Instruct Creative internally)

Manually configuring these parameters for every AI client can be difficult. To solve this, we built a lightweight reverse proxy that exposes the 4 profiles as virtual model names. It handles request transformation on the fly using a single inference server as backend. View the project on our GitHub.
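To illustrate the idea of virtual model names, here is a minimal sketch of the request transformation such a proxy might perform. It is not the code of the actual project. The virtual model names are hypothetical, the sampling values are placeholders rather than the values recommended by the Qwen team, and the use of `chat_template_kwargs.enable_thinking` to switch modes is an assumption about how the backend toggles thinking.

```python
import copy

BACKEND_MODEL = "Qwen3.5-27B"  # assumed name of the single backend model

# Hypothetical virtual model names. Sampling values are PLACEHOLDERS:
# replace them with the values published by the Qwen team.
PROFILES = {
    "thinking-general": {"thinking": True, "sampling": {"temperature": 1.0}},
    "thinking-coding": {"thinking": True, "sampling": {"temperature": 0.6}},
    "instruct-general": {"thinking": False, "sampling": {"temperature": 0.7}},
    "instruct-reasoning": {"thinking": False, "sampling": {"temperature": 1.0}},
}


def rewrite_request(body, profiles=PROFILES, backend_model=BACKEND_MODEL):
    """Map a virtual model name to the backend model plus profile settings.

    Requests for unknown model names are returned as an unchanged copy.
    """
    out = copy.deepcopy(body)  # never mutate the caller's request
    profile = profiles.get(out.get("model"))
    if profile is None:
        return out

    out["model"] = backend_model
    # Design choice of this sketch: profile values override client values.
    out.update(profile["sampling"])
    # Assumption: the backend reads enable_thinking from chat_template_kwargs.
    template_kwargs = dict(out.get("chat_template_kwargs") or {})
    template_kwargs["enable_thinking"] = profile["thinking"]
    out["chat_template_kwargs"] = template_kwargs
    return out
```

With this shape, a client only has to pick a model name such as `thinking-coding`, and the sampling parameters and thinking mode follow from it.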

Inference

We run this model with vLLM. Here is a sample command:

docker run --rm --name 'Qwen3.5-27B-NVFP4' \
  --runtime=nvidia --gpus 'all' --ipc=host \
  -e 'HF_TOKEN' \
  -e 'VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1' \
  -v '/srv/cache:/root/.cache' \
  -p '127.0.0.1:8000:8000' \
  'vllm/vllm-openai:v0.18.0-cu130' \
  'ig1/Qwen3.5-27B-NVFP4' \
  --served-model-name 'Qwen3.5-27B' \
  --reasoning-parser 'qwen3' \
  --enable-auto-tool-choice \
  --tool-call-parser 'qwen3_coder' \
  --max-model-len 'auto' \
  --gpu-memory-utilization '0.9'

A few notes about some of the parameters:

Adapt the /srv/cache:/root/.cache mount point to your liking. It contains files you want to keep between multiple runs (Dynamo bytecode and AOT artifacts from torch.compile, but most importantly the huggingface folder for the model).
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 allows for more precise CUDA graph VRAM estimation. It should become the default once vLLM reaches v0.19.0, at which point you can simply remove it.
If you deploy the model across several GPUs using Tensor Parallelism, be sure to check the official recipe, as other flags are needed.
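Once the container above is running, the server can be queried through its OpenAI-compatible API. The following is a minimal client sketch using only the Python standard library. The base URL and model name are taken from the `-p` and `--served-model-name` options of the sample command, while the helper names, `max_tokens` default, and timeout are hypothetical choices.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000/v1"  # from -p '127.0.0.1:8000:8000'
MODEL_NAME = "Qwen3.5-27B"             # from --served-model-name


def build_chat_request(prompt, max_tokens=256):
    """Build (but do not send) a chat completion request for the server."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=120.0):
    """Send the prompt and return the assistant's final answer text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, `print(chat("Summarize NVFP4 in one sentence."))` prints the model's reply once the server has finished loading.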