---
base_model: Nanbeige/Nanbeige4.1-3B
tags:
  - nvfp4
  - quantized
  - blackwell
  - bilingual
  - chinese
  - english
pipeline_tag: text-generation
language:
  - en
  - zh
---

Nanbeige 4.1 3B - NVFP4

This is an NVFP4 quantized version of Nanbeige/Nanbeige4.1-3B, optimized for NVIDIA Blackwell GPUs (RTX 50xx series).

✅ FULLY FUNCTIONAL - This model loads and generates correctly in vLLM 0.15.1!

Model Details

  • Base Model: Nanbeige 4.1 3B (Bilingual Chinese/English reasoning model)
  • Parameters: 3 billion
  • Quantization: NVFP4 (4-bit floating point)
  • Size: 3.3GB (down from ~6GB, 45% reduction)
  • Quantizer: SILVERTHRONE
  • Method: llmcompressor with 512 calibration samples (reproduction sketch below)

Quantization Strategy

  • Quantized to NVFP4: All linear layers
  • Preserved at full precision: Token embeddings, lm_head

Calibration:

  • Dataset: open-platypus
  • Samples: 512
  • Sequence length: 1024
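
For reference, the snippet below is a minimal sketch of how this kind of NVFP4 one-shot quantization can be produced with llmcompressor. It is not the exact script used for this checkpoint; the import paths, the "NVFP4" scheme name, and the "open_platypus" dataset identifier are assumptions that may vary between llmcompressor releases.

# Hedged sketch of an NVFP4 one-shot quantization with llmcompressor.
# Assumes a recent release that ships the "NVFP4" scheme and the built-in
# "open_platypus" calibration dataset; adjust names for your version.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Nanbeige/Nanbeige4.1-3B"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

recipe = QuantizationModifier(
    targets="Linear",      # quantize all linear layers
    scheme="NVFP4",        # 4-bit floating point
    ignore=["lm_head"],    # keep the output head at full precision
)

oneshot(
    model=model,
    dataset="open_platypus",       # English-only calibration set
    recipe=recipe,
    max_seq_length=1024,           # calibration sequence length
    num_calibration_samples=512,   # calibration samples
)

model.save_pretrained("nanbeige4.1-3b-nvfp4", save_compressed=True)
tokenizer.save_pretrained("nanbeige4.1-3b-nvfp4")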

Hardware Requirements

  • GPU: NVIDIA RTX 50xx series (Blackwell) or newer
  • VRAM: ~6-8GB recommended
  • Framework: vLLM 0.15.1+

Note: This model will NOT work with llama.cpp, the transformers library, or GGUF-based tools.

Usage

With vLLM

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# Load tokenizer for chat template
tokenizer = AutoTokenizer.from_pretrained("SILVERTHRONE/nanbeige4.1-3b-nvfp4")

# Load model (IMPORTANT: set max_model_len=8192)
model = LLM(
    model="SILVERTHRONE/nanbeige4.1-3b-nvfp4",
    gpu_memory_utilization=0.85,
    max_model_len=8192  # Required! Default 262K won't fit in 16GB VRAM
)

# Format message with chat template
messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=200,
    stop=["<|im_end|>"]
)

outputs = model.generate(prompt, sampling_params)
response = outputs[0].outputs[0].text

print(response)
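
The model can also be served over HTTP with vLLM's OpenAI-compatible server. The commands below are a sketch rather than a tested recipe for this checkpoint; the port and the placeholder API key are assumptions.

# Start the server first (shell):
#   vllm serve SILVERTHRONE/nanbeige4.1-3b-nvfp4 --max-model-len 8192 --gpu-memory-utilization 0.85
# Then query it with any OpenAI-compatible client; the chat template is applied server-side.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # default vLLM port; key is unused

completion = client.chat.completions.create(
    model="SILVERTHRONE/nanbeige4.1-3b-nvfp4",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.7,
    max_tokens=200,
)
print(completion.choices[0].message.content)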

Chat Template

Nanbeige uses the ChatML format with <think> reasoning blocks:

<|im_start|>system
你是南北阁,一款由BOSS直聘自主研发并训练的专业大语言模型。<|im_end|>
<|im_start|>user
{your question}<|im_end|>
<|im_start|>assistant
<think>
{reasoning steps}
</think>
{final answer}<|im_end|>

The system prompt in this template translates to: "You are Nanbeige, a professional large language model independently developed and trained by BOSS Zhipin." The model outputs its reasoning in <think> tags and then gives the final answer; you can parse the response to extract either part.
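
For example, a minimal sketch for splitting a response into its reasoning and answer parts (it assumes the <think>...</think> block, when present, comes before the final answer):

import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a Nanbeige-style response."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()              # no reasoning block found
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()      # everything after </think>
    return reasoning, answer

# `response` is the generated text from the vLLM example above.
reasoning, answer = split_reasoning(response)
print("Reasoning:", reasoning)
print("Answer:", answer)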

Performance

  • Inference Speed: ~110-120 tokens/sec on an RTX 5060 Ti (16GB); see the measurement sketch below
  • Quality: Good - coherent reasoning, correct answers on factual questions
  • Language Support: Bilingual (Chinese/English)

Example outputs:

  • "What is 2+2?" → Correctly identifies answer as 4
  • "What is the capital of France?" → Correctly identifies Paris

Known Limitations

  1. Does NOT work with the transformers library - Loading crashes with KeyError: 'weight_scale' due to incomplete NVFP4 support in compressed-tensors
  2. vLLM only - This is by design; NVFP4 is optimized for vLLM inference
  3. Must set max_model_len=8192 - Default 262K context requires 16GB just for KV cache
  4. English-only calibration - Calibration used an English-only dataset (open-platypus), which may slightly affect Chinese performance

Testing Environment

  • GPU: NVIDIA GeForce RTX 5060 Ti (16GB VRAM, Blackwell)
  • CPU: AMD Ryzen 9 5950X
  • RAM: 32GB DDR4
  • OS: Ubuntu 24.04.3 LTS (Noble)
  • vLLM: 0.15.1
  • transformers: 4.57.3 (for tokenizer only)
  • llmcompressor: Latest (Feb 2026)

Comparison: Text-Only NVFP4 Works!

Unlike NVFP4 quantizations of vision-language models (e.g., Apriel 1.6), this text-only quantization loads and generates correctly in vLLM. The weight_scale error only affects:

  • Vision-language models (LLaVA architecture)
  • transformers library loading

Pure text models like Nanbeige work great with vLLM!

Credits

  • Base model: Nanbeige/Nanbeige4.1-3B, developed by BOSS Zhipin
  • NVFP4 quantization: SILVERTHRONE

License

Unspecified (inherited from base model)