Qwopus3.5-27B-v3-NVFP4

Mixed-precision quantized version of ShinePixelOrg/Qwopus3.5-27B-v3.

This checkpoint keeps the same hybrid Qwen3.5 DeltaNet + softmax-attention architecture and Qwen3.5 MTP head as the BF16 source, and applies the NVFP4/FP8/BF16 mixed-precision recipe described under Quantization Strategy below.

The published folder includes:

  • model.safetensors
  • config.json
  • recipe.yaml
  • tokenizer.json
  • tokenizer_config.json
  • processor_config.json
  • preprocessor_config.json
  • video_preprocessor_config.json
  • generation_config.json
  • chat_template.jinja

Verified Inference

Local inference was verified on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB) with:

  • vllm==0.17.1
  • transformers==5.3.0

Patch note:

  • The old v1 one-line vllm patch for the Blackwell/TMA issue may still be required: if your local vllm build does not already include that fix, apply the one-line patch below.

Concrete patch command:

UTILS_FILE=$(python -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'model_executor/layers/fla/ops/utils.py'))") && sed -i 's/is_nvidia and torch.cuda.get_device_capability(0)\[0\] >= 9/is_nvidia and 9 <= torch.cuda.get_device_capability(0)[0] < 12/' "$UTILS_FILE"
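Before running the sed command, you can check whether your installed copy already carries the widened capability check. A minimal sketch; the two string constants come from the sed command above, while `needs_patch` is a hypothetical helper, not part of vllm:

```python
# Sketch: decide whether the one-line Blackwell patch is still needed.
# UNPATCHED / PATCHED are the before/after capability checks from the
# sed command above; needs_patch() is an illustrative helper.

UNPATCHED = "torch.cuda.get_device_capability(0)[0] >= 9"
PATCHED = "9 <= torch.cuda.get_device_capability(0)[0] < 12"

def needs_patch(source: str) -> bool:
    """True if the utils.py source still carries the old >= 9 check."""
    return PATCHED not in source and UNPATCHED in source

# In practice, read the file located by the UTILS_FILE command above
# and pass its contents to needs_patch().
```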

The exact validated command for MTP-enabled serving was:

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 \
vllm serve ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

The same model also serves without MTP:

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 \
vllm serve ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3

What was verified in that run:

  • the server started cleanly
  • GET /health returned 200
  • GET /v1/models returned the model
  • POST /v1/chat/completions returned 200
  • MTP/speculative decoding was active and reported acceptance metrics in the server logs
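The /v1/chat/completions check above can be reproduced with a short stdlib-only client. A sketch assuming the server is on vLLM's default port 8000; `build_chat_request` and `post_chat` are illustrative helpers, not part of vllm:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-style chat-completions payload as the server expects."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def post_chat(base_url: str, payload: dict) -> dict:
    """POST the payload to /v1/chat/completions and decode the JSON reply."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_chat_request("ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4", "Hello")
# reply = post_chat("http://localhost:8000", payload)  # requires a running server
```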

Quantization Strategy

Non-uniform mixed-precision quantization using llm-compressor:

| Precision | Layers |
| --- | --- |
| FP8 W8A8 | DeltaNet in_proj_qkv, in_proj_z, out_proj; softmax q_proj/k_proj/v_proj; MLP down_proj |
| NVFP4 W4A4 | softmax o_proj; MLP gate_proj/up_proj |
| BF16 | lm_head, embed_tokens, DeltaNet in_proj_a/in_proj_b, norms, visual encoder, MTP sidecar |
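The non-uniform assignment can be pictured as a name-based routing rule over weight names. This is an illustrative sketch only — the actual mapping lives in recipe.yaml, and `scheme_for` plus the regexes are assumptions of mine:

```python
import re

# Ordered (pattern, scheme) rules mirroring the table above; first hit wins.
# out_proj (DeltaNet, FP8) and o_proj (softmax attention, NVFP4) are
# distinguished by their full suffixes.
SCHEMES = [
    (r"(in_proj_qkv|in_proj_z|out_proj|q_proj|k_proj|v_proj|down_proj)$", "FP8 W8A8"),
    (r"(o_proj|gate_proj|up_proj)$", "NVFP4 W4A4"),
]

def scheme_for(layer_name: str) -> str:
    """Return the precision bucket a weight name would fall into."""
    for pattern, scheme in SCHEMES:
        if re.search(pattern, layer_name):
            return scheme
    return "BF16"  # lm_head, embed_tokens, norms, visual encoder, MTP sidecar
```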

Architecture match with the BF16 source:

  • model_type=qwen3_5
  • 64 text layers
  • full_attention_interval=4
  • mtp_num_hidden_layers=1
  • max_position_embeddings=262144
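These fields can be sanity-checked against the downloaded config.json. A small sketch; `EXPECTED` and `check_config` are mine, with field names taken from the list above:

```python
import json

# Expected architecture fields, as listed above.
EXPECTED = {
    "model_type": "qwen3_5",
    "full_attention_interval": 4,
    "mtp_num_hidden_layers": 1,
    "max_position_embeddings": 262144,
}

def check_config(cfg: dict) -> list:
    """Return the names of fields that deviate from the BF16 source."""
    return [key for key, value in EXPECTED.items() if cfg.get(key) != value]

# Typical use: check_config(json.load(open("config.json")))
```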

Usage

vLLM

pip install -U vllm transformers

With MTP:

vllm serve ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

Without MTP:

vllm serve ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 1 \
  --skip-mm-profiling \
  --reasoning-parser qwen3

Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    "ShinePixelOrg/Qwopus3.5-27B-v3-NVFP4",
    trust_remote_code=True,
)

Compatibility

| Framework | Supported | Notes |
| --- | --- | --- |
| vLLM >= 0.17.0 | Yes | Verified locally with vllm==0.17.1 |
| transformers >= 5.3.0 | Yes | Direct loading works with device_map="auto" |

Model tree

  • Base model: Qwen/Qwen3.5-27B
  • This checkpoint: quantized from ShinePixelOrg/Qwopus3.5-27B-v3