Nanbeige 4.1 3B - NVFP4

This is an NVFP4 quantized version of Nanbeige/Nanbeige4.1-3B, optimized for NVIDIA Blackwell GPUs (RTX 50xx series).

✅ FULLY FUNCTIONAL - This model loads and generates correctly in vLLM 0.15.1!

Model Details

  • Base Model: Nanbeige 4.1 3B (Bilingual Chinese/English reasoning model)
  • Parameters: 3 billion
  • Quantization: NVFP4 (4-bit floating point)
  • Size: 3.3GB (down from ~6GB, 45% reduction)
  • Quantizer: SILVERTHRONE
  • Method: llmcompressor with 512 calibration samples

Quantization Strategy

  • Quantized to NVFP4: All linear layers
  • Preserved at full precision: Token embeddings, lm_head

Calibration:

  • Dataset: open-platypus
  • Samples: 512
  • Sequence length: 1024
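
The quantization can be reproduced with llmcompressor; the following is a minimal sketch assuming the standard llmcompressor NVFP4 recipe (the exact script was not published, but the scheme, dataset, and calibration settings shown mirror the values above):

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Nanbeige/Nanbeige4.1-3B"

# NVFP4 on all Linear layers; lm_head kept at full precision
# (token embeddings are not Linear modules, so they stay untouched).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head"],
)

oneshot(
    model=MODEL_ID,
    dataset="open_platypus",           # calibration dataset
    recipe=recipe,
    max_seq_length=1024,               # calibration sequence length
    num_calibration_samples=512,
    output_dir="nanbeige4.1-3b-nvfp4",
)
```

This is a one-shot (no retraining) post-training quantization pass; calibration only collects activation statistics to set the FP4 scales.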

Hardware Requirements

  • GPU: NVIDIA RTX 50xx series (Blackwell) or newer
  • VRAM: ~6-8GB recommended
  • Framework: vLLM 0.15.1+

Note: This model will NOT work with llama.cpp, transformers library, or GGUF-based tools.

Usage

With vLLM

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# Load tokenizer for chat template
tokenizer = AutoTokenizer.from_pretrained("SILVERTHRONE/nanbeige4.1-3b-nvfp4")

# Load model (IMPORTANT: set max_model_len=8192)
model = LLM(
    model="SILVERTHRONE/nanbeige4.1-3b-nvfp4",
    gpu_memory_utilization=0.85,
    max_model_len=8192  # Required! Default 262K won't fit in 16GB VRAM
)

# Format message with chat template
messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=200,
    stop=["<|im_end|>"]
)

outputs = model.generate(prompt, sampling_params)
response = outputs[0].outputs[0].text

print(response)
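
For serving over HTTP, vLLM's OpenAI-compatible server should work the same way; a sketch (port and prompt are illustrative, and the same max_model_len caveat applies):

```shell
# Launch the OpenAI-compatible server
vllm serve SILVERTHRONE/nanbeige4.1-3b-nvfp4 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.85

# Query it from another terminal
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "SILVERTHRONE/nanbeige4.1-3b-nvfp4",
          "messages": [{"role": "user", "content": "What is 2+2?"}],
          "max_tokens": 200
        }'
```

The server applies the chat template from the repo automatically, so no manual apply_chat_template call is needed here.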

Chat Template

Nanbeige uses the ChatML format with <think> reasoning blocks:

<|im_start|>system
你是南北阁,一款由BOSS直聘自主研发并训练的专业大语言模型。<|im_end|>
<|im_start|>user
{your question}<|im_end|>
<|im_start|>assistant
<think>
{reasoning steps}
</think>
{final answer}<|im_end|>

The default system prompt translates to: "You are Nanbeige, a professional large language model independently developed and trained by BOSS Zhipin." The model emits its reasoning in <think> tags, then the final answer; you can parse the response to extract either part.
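
A small helper can separate the two parts; this is a sketch using a plain regex (the split_think name is ours, not part of the model's tooling):

```python
import re

def split_think(response: str) -> tuple[str, str]:
    """Split a Nanbeige response into (reasoning, final_answer)."""
    m = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if m is None:
        # No reasoning block: treat the whole response as the answer.
        return "", response.strip()
    return m.group(1).strip(), response[m.end():].strip()

reasoning, answer = split_think(
    "<think>\n2+2 is basic arithmetic.\n</think>\nThe answer is 4."
)
# reasoning == "2+2 is basic arithmetic."
# answer    == "The answer is 4."
```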

Performance

  • Inference Speed: ~110-120 tokens/sec on RTX 5060 Ti (16GB)
  • Quality: Good - coherent reasoning, correct answers on factual questions
  • Language Support: Bilingual (Chinese/English)

Example outputs:

  • "What is 2+2?" → Correctly identifies answer as 4
  • "What is the capital of France?" → Correctly identifies Paris

Known Limitations

  1. Does NOT work with transformers library - Crashes with KeyError: 'weight_scale' due to incomplete NVFP4 support in compressed-tensors
  2. vLLM only - This is by design; NVFP4 is optimized for vLLM inference
  3. Must set max_model_len=8192 - Default 262K context requires 16GB just for KV cache
  4. English-only calibration - Calibrated on an English-only dataset (open-platypus), which may slightly affect Chinese performance
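
Limitation 3 is easy to sanity-check with arithmetic: KV cache size grows linearly with context length. The sketch below uses hypothetical layer/head counts for a ~3B model (Nanbeige's actual config.json may differ) and fp16 KV entries:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Total KV cache size: K and V each hold num_kv_heads * head_dim
    values per layer per token (fp16 = 2 bytes per value)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * context_len

# Hypothetical config: 36 layers, 2 KV heads, head_dim 128
GiB = 1024 ** 3
print(kv_cache_bytes(36, 2, 128, 8192) / GiB)     # ~0.28 GiB at 8K context
print(kv_cache_bytes(36, 2, 128, 262144) / GiB)   # 9.0 GiB at 262K context
```

Even with these modest hypothetical numbers, the full 262K default needs roughly 9 GiB for KV cache alone, on top of the 3.3GB of weights, which is why max_model_len=8192 is required on 16GB cards.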

Testing Environment

  • GPU: NVIDIA GeForce RTX 5060 Ti (16GB VRAM, Blackwell)
  • CPU: AMD Ryzen 9 5950X
  • RAM: 32GB DDR4
  • OS: Ubuntu 24.04.3 LTS (Noble)
  • vLLM: 0.15.1
  • transformers: 4.57.3 (for tokenizer only)
  • llmcompressor: Latest (Feb 2026)

Comparison: Text-Only NVFP4 Works!

Unlike vision-language models (e.g., Apriel 1.6), this text-only NVFP4 quantization works perfectly in vLLM. The weight_scale error only affects:

  • Vision-language models (LLaVA architecture)
  • transformers library loading

Pure text models like Nanbeige work great with vLLM!

License

Unspecified (inherited from base model)
