# Nanbeige 4.1 3B - NVFP4
This is an NVFP4 quantized version of Nanbeige/Nanbeige4.1-3B, optimized for NVIDIA Blackwell GPUs (RTX 50xx series).
**✅ FULLY FUNCTIONAL** - This model loads and generates correctly in vLLM 0.15.1.
## Model Details
- Base Model: Nanbeige 4.1 3B (Bilingual Chinese/English reasoning model)
- Parameters: 3 billion
- Quantization: NVFP4 (4-bit floating point)
- Size: 3.3GB (down from ~6GB, 45% reduction)
- Quantizer: SILVERTHRONE
- Method: llmcompressor with 512 calibration samples
## Quantization Strategy
- ✅ Quantized to NVFP4: All linear layers
- ❌ Preserved at full precision: Token embeddings, lm_head
Calibration:
- Dataset: open-platypus
- Samples: 512
- Sequence length: 1024
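NVFP4 stores each weight as a 4-bit E2M1 float (1 sign bit, 2 exponent bits, 1 mantissa bit) with a shared higher-precision scale per small block of elements. As a rough intuition for what 4-bit floating point means, here is a minimal, illustrative sketch of the E2M1 value grid and nearest-value rounding. This is not the actual quantization kernel, and it omits the per-block scaling entirely:

```python
# Illustrative only: the values representable by a 4-bit E2M1 float
# (NVFP4's element format). Real NVFP4 also applies a shared per-block
# scale; this sketch shows just the element grid and rounding.

# Positive E2M1 values; the sign bit mirrors them to the negatives.
E2M1_POSITIVE = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_GRID = sorted({s * v for v in E2M1_POSITIVE for s in (1.0, -1.0)})

def quantize_e2m1(x: float) -> float:
    """Round x to the nearest representable E2M1 value (saturates at +/-6)."""
    return min(E2M1_GRID, key=lambda v: abs(v - x))

if __name__ == "__main__":
    for x in (0.7, -2.4, 10.0):
        print(x, "->", quantize_e2m1(x))
```

The coarseness of this grid is why calibration data and per-block scales matter: the scale maps each block of weights into the narrow range the grid covers well.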
## Hardware Requirements
- GPU: NVIDIA RTX 50xx series (Blackwell) or newer
- VRAM: ~6-8GB recommended
- Framework: vLLM 0.15.1+
**Note:** This model will NOT work with llama.cpp, the transformers library, or GGUF-based tools.
## Usage

### With vLLM
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# Load tokenizer for the chat template
tokenizer = AutoTokenizer.from_pretrained("SILVERTHRONE/nanbeige4.1-3b-nvfp4")

# Load model (IMPORTANT: set max_model_len=8192)
model = LLM(
    model="SILVERTHRONE/nanbeige4.1-3b-nvfp4",
    gpu_memory_utilization=0.85,
    max_model_len=8192,  # Required! The default 262K context won't fit in 16GB VRAM
)

# Format the message with the chat template
messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=200,
    stop=["<|im_end|>"],
)
outputs = model.generate(prompt, sampling_params)
response = outputs[0].outputs[0].text
print(response)
```
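The model can also be served as an OpenAI-compatible API instead of being loaded in-process. A minimal sketch using vLLM's `vllm serve` CLI (the port and flag values here are illustrative defaults; the context-length cap is the same one required above):

```shell
# Serve an OpenAI-compatible endpoint; remember the reduced context length.
vllm serve SILVERTHRONE/nanbeige4.1-3b-nvfp4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85

# Query it from another terminal (vLLM listens on port 8000 by default):
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "SILVERTHRONE/nanbeige4.1-3b-nvfp4",
       "messages": [{"role": "user", "content": "What is 2+2?"}],
       "max_tokens": 200}'
```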
## Chat Template

Nanbeige uses the ChatML format with `<think>` reasoning blocks:

```
<|im_start|>system
你是南北阁,一款由BOSS直聘自主研发并训练的专业大语言模型。<|im_end|>
<|im_start|>user
{your question}<|im_end|>
<|im_start|>assistant
<think>
{reasoning steps}
</think>
{final answer}<|im_end|>
```

(The default system prompt translates to: "You are Nanbeige, a professional large language model independently developed and trained by BOSS Zhipin.")

The model outputs its reasoning in `<think>` tags, then provides the final answer. You can parse the response to extract either part.
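Splitting the reasoning from the final answer can be sketched with a small helper. This assumes at most one well-formed `<think>...</think>` block per response, which matches the template above:

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split a response into (reasoning, final_answer).

    Assumes at most one <think>...</think> block; if none is present,
    the whole text is treated as the final answer.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

example = "<think>\n2 + 2 is basic addition.\n</think>\nThe answer is 4."
reasoning, answer = split_reasoning(example)
print(answer)  # The answer is 4.
```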
## Performance
- Inference Speed: ~110-120 tokens/sec on RTX 5060 Ti (16GB)
- Quality: Good - coherent reasoning, correct answers on factual questions
- Language Support: Bilingual (Chinese/English)
Example outputs:
- "What is 2+2?" → Correctly identifies answer as 4
- "What is the capital of France?" → Correctly identifies Paris
## Known Limitations

- **Does NOT work with the transformers library** - crashes with `KeyError: 'weight_scale'` due to incomplete NVFP4 support in compressed-tensors
- **vLLM only** - this is by design; NVFP4 is optimized for vLLM inference
- **Must set `max_model_len=8192`** - the default 262K context requires ~16GB just for the KV cache
- **English-only calibration** - quantized with an English-only dataset (open-platypus), which may slightly affect Chinese performance
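The KV-cache figure can be sanity-checked with back-of-envelope arithmetic: cache size scales linearly with context length, layer count, KV-head count, and head dimension. The config numbers below are illustrative assumptions for a 3B-class GQA model, not the actual Nanbeige architecture:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Bytes needed for one sequence's KV cache.
    The leading factor of 2 covers both the K and the V tensors."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical 3B-class config (NOT the real Nanbeige values):
layers, kv_heads, head_dim = 32, 4, 128

gib = 1024 ** 3
print(f"262K ctx: {kv_cache_bytes(layers, kv_heads, head_dim, 262_144) / gib:.1f} GiB")
print(f"8K ctx:   {kv_cache_bytes(layers, kv_heads, head_dim, 8_192) / gib:.2f} GiB")
```

With these assumed numbers the full 262K context works out to 16 GiB of fp16 KV cache for a single sequence, consistent with the warning above, while the 8K cap needs only 0.5 GiB.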
## Testing Environment
- GPU: NVIDIA GeForce RTX 5060 Ti (16GB VRAM, Blackwell)
- CPU: AMD Ryzen 9 5950X
- RAM: 32GB DDR4
- OS: Ubuntu 24.04.3 LTS (Noble)
- vLLM: 0.15.1
- transformers: 4.57.3 (for tokenizer only)
- llmcompressor: Latest (Feb 2026)
## Comparison: Text-Only NVFP4 Works

Unlike vision-language models (e.g., Apriel 1.6), this text-only NVFP4 quantization works correctly in vLLM. The `weight_scale` error only affects:

- Vision-language models (LLaVA architecture)
- Loading through the transformers library

Pure text models like Nanbeige work well with vLLM.
## Credits
- Original Model: Nanbeige Team / BOSS直聘 (BOSS Zhipin)
- Quantization: SILVERTHRONE (@SILVERTHRONE)
- Method: Neural Magic llmcompressor
## License
Unspecified (inherited from base model)