# Qwen3.5-27B-NVFP4-Full (W4A4)

NVFP4 quantization of Qwen/Qwen3.5-27B with all linear layers quantized, including the DeltaNet linear attention projections that are typically excluded.

## Key differences from standard NVFP4 checkpoints

| Component | Standard NVFP4 (e.g., Sehyo) | This checkpoint |
|---|---|---|
| MoE experts | FP4 | FP4 |
| Shared experts | FP4 | FP4 |
| Self-attention (q/k/v/o) | FP4 | FP4 |
| DeltaNet (in_proj_qkv, in_proj_z, out_proj) | BF16 | FP4 |
| DeltaNet (in_proj_a, in_proj_b) | BF16 | BF16 (N=48, below CUTLASS tile minimum) |
| Model size | 27 GB | 20 GB |

## Performance (DGX Spark / GB10 / SM121)

Measured with vLLM 0.19.1 + FlashInfer 0.6.7, CUTLASS W4A4 backend, no MTP:

| Metric | Standard NVFP4 | This checkpoint | Improvement |
|---|---|---|---|
| Decode (tg32) | 7.93 tok/s | 11.98 tok/s | +51% |
| Decode @ d4096 | 7.66 tok/s | 11.90 tok/s | +55% |
| Decode @ d8192 | 7.92 tok/s | 11.80 tok/s | +49% |
| Prefill (pp2048) | 1855 tok/s | 2383 tok/s | +28% |

The speedup comes from eliminating ~5 GB of BF16 weight loads per token for the DeltaNet layers, replacing them with ~1.4 GB of FP4 loads.
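As a sanity check, if decode were purely memory-bandwidth bound, tokens/s would scale inversely with the bytes of weights streamed per token, and the measured speedup lets us back out the implied per-token weight traffic. This is a rough order-of-magnitude sketch (real traffic also includes KV cache reads and only the active MoE experts), not a measurement:

```python
# Back-of-envelope check, assuming decode throughput scales inversely with
# bytes of weights read per token. Figures come from the tables above.
saved_gb = 5.0 - 1.4                     # DeltaNet BF16 -> FP4 savings per token
speedup = 11.98 / 7.93                   # measured tg32 ratio, ~1.51x

# If b_old / (b_old - saved_gb) == speedup, solve for b_old:
b_old = saved_gb * speedup / (speedup - 1.0)
b_new = b_old - saved_gb
print(f"implied weight traffic: {b_old:.1f} GB/token -> {b_new:.1f} GB/token")
```

The implied baseline traffic of roughly 10-11 GB per token is plausible for a MoE model whose total weights are 27 GB but where only a subset of experts is active per token.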

## Quality benchmarks (0-shot, 200-sample subsets)

| Benchmark | Metric | This checkpoint | BF16 typical | Recovery |
|---|---|---|---|---|
| ARC-Challenge | acc_norm | 63.5% | ~66% | ~96% |
| HellaSwag | acc_norm | 74.0% | ~78% | ~95% |
| TruthfulQA MC2 | acc | 54.2% | ~55% | ~99% |
| Winogrande | acc | 51.5% | ~52% | ~99% |

The checkpoint recovers 95-99% of BF16 quality across knowledge and reasoning benchmarks, indicating that quantizing the DeltaNet linear attention layers to FP4 is near-lossless.

Note: GSM8k results are excluded as the model's thinking/reasoning output format interferes with lm-eval-harness answer extraction, producing unreliable scores. Subjective quality in interactive use (Open WebUI, chat API) is excellent with reasoning intact.

## Quantization details

- Method: llm-compressor oneshot with calibrated NVFP4 (W4A4)
- Calibration: 256 samples from HuggingFaceH4/ultrachat_200k, max_seq_length=4096
- Format: compressed-tensors nvfp4-pack-quantized with calibrated input_global_scale
- Excluded layers: in_proj_a, in_proj_b (N=48, CUTLASS FP4 requires N%64==0), conv1d (3D), norms, A_log, dt_bias, lm_head, embed_tokens
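The N%64 exclusion rule above can be expressed as a simple eligibility check when building an ignore list. A minimal sketch (the layer names and the large N value are illustrative, not the model's actual shapes):

```python
# The CUTLASS FP4 GEMM tiles the output dimension N in multiples of 64, so
# any linear layer with N % 64 != 0 must stay in BF16.
def fp4_eligible(out_features: int, tile_n: int = 64) -> bool:
    """True if a linear layer's output dimension fits the CUTLASS FP4 tile."""
    return out_features % tile_n == 0

# Illustrative layer shapes; in_proj_a / in_proj_b really have N=48.
layers = {
    "in_proj_qkv": 12288,   # divisible by 64 -> quantize to FP4
    "in_proj_a": 48,        # 48 % 64 != 0   -> keep in BF16
    "in_proj_b": 48,
}
ignore = [name for name, n in layers.items() if not fp4_eligible(n)]
print(ignore)  # -> ['in_proj_a', 'in_proj_b']
```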

## Usage

### vLLM (recommended)

Requires vLLM >= 0.19.1 with PR #38423 (W4A4 SM120/SM121 support) and FlashInfer >= 0.6.7.

```shell
vllm serve rdtand/Qwen3.5-27B-NVFP4-DeltaNet-Included \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --attention-backend flashinfer \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```
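Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch of the chat-completions request payload (the localhost URL and sampling parameters are assumptions, not part of this card):

```python
import json

# Request body for the vLLM OpenAI-compatible endpoint, normally served at
# http://localhost:8000/v1/chat/completions by default.
payload = {
    "model": "rdtand/Qwen3.5-27B-NVFP4-DeltaNet-Included",
    "messages": [
        {"role": "user", "content": "Summarize NVFP4 quantization in one sentence."}
    ],
    "max_tokens": 256,      # illustrative sampling settings
    "temperature": 0.7,
}
body = json.dumps(payload)
# e.g. requests.post("http://localhost:8000/v1/chat/completions",
#                    data=body, headers={"Content-Type": "application/json"})
print(body[:60])
```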

## Quality notes

FP4 activation quantization on DeltaNet layers was widely assumed to be destructive to model quality. Our analysis shows the quantization error on the DeltaNet projections (SNR ~24 dB, relative error ~26%) is in line with the other quantized layer types (also SNR ~24 dB, relative error ~26%), and the model produces coherent output with reasoning capabilities intact.

## Required llm-compressor fix

Quantizing the DeltaNet layers requires vllm-project/llm-compressor#2566, which fixes model_free_ptq for models with non-contiguous fused attention layers (Qwen3.5's interleaved self_attn + linear_attn architecture).

## Acknowledgments

- Sehyo for the original Qwen3.5 NVFP4 quantization work and llm-compressor PR #2383
- eugr for spark-vllm-docker infrastructure
- Built on DGX Spark (GB10, SM121)