---
base_model: Nanbeige/Nanbeige4.1-3B
tags:
  - nvfp4
  - quantized
  - blackwell
  - bilingual
  - chinese
  - english
pipeline_tag: text-generation
language:
  - en
  - zh
---

Nanbeige 4.1 3B - NVFP4

This is an NVFP4 quantized version of Nanbeige/Nanbeige4.1-3B, optimized for NVIDIA Blackwell GPUs (RTX 50xx series).

✅ FULLY FUNCTIONAL - This model loads and generates correctly in vLLM 0.15.1!

Model Details

  • Base Model: Nanbeige 4.1 3B (Bilingual Chinese/English reasoning model)
  • Parameters: 3 billion
  • Quantization: NVFP4 (4-bit floating point)
  • Size: 3.3GB (down from ~6GB, 45% reduction)
  • Quantizer: SILVERTHRONE
  • Method: llmcompressor with 512 calibration samples (reproduction sketch below)

Quantization Strategy

  • Quantized to NVFP4: All linear layers
  • Preserved at full precision: Token embeddings, lm_head

Calibration:

  • Dataset: open-platypus
  • Samples: 512
  • Sequence length: 1024
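
For reference, the snippet below is a minimal sketch of how this kind of NVFP4 one-shot quantization can be produced with llmcompressor. It is not the exact script used for this checkpoint; the import paths, the "NVFP4" scheme name, and the "open_platypus" dataset identifier are assumptions that may vary between llmcompressor releases.

# Hedged sketch of an NVFP4 one-shot quantization with llmcompressor.
# Assumes a recent release that ships the "NVFP4" scheme and the built-in
# "open_platypus" calibration dataset; adjust names for your version.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Nanbeige/Nanbeige4.1-3B"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

recipe = QuantizationModifier(
    targets="Linear",      # quantize all linear layers
    scheme="NVFP4",        # 4-bit floating point
    ignore=["lm_head"],    # keep the output head at full precision
)

oneshot(
    model=model,
    dataset="open_platypus",       # English-only calibration set
    recipe=recipe,
    max_seq_length=1024,           # calibration sequence length
    num_calibration_samples=512,   # calibration samples
)

model.save_pretrained("nanbeige4.1-3b-nvfp4", save_compressed=True)
tokenizer.save_pretrained("nanbeige4.1-3b-nvfp4")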

Hardware Requirements

  • GPU: NVIDIA RTX 50xx series (Blackwell) or newer
  • VRAM: ~6-8GB recommended
  • Framework: vLLM 0.15.1+

Note: This model will NOT work with llama.cpp, the transformers library, or GGUF-based tools.

Usage

With vLLM

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# Load tokenizer for chat template
tokenizer = AutoTokenizer.from_pretrained("SILVERTHRONE/nanbeige4.1-3b-nvfp4")

# Load model (IMPORTANT: set max_model_len=8192)
model = LLM(
    model="SILVERTHRONE/nanbeige4.1-3b-nvfp4",
    gpu_memory_utilization=0.85,
    max_model_len=8192  # Required! Default 262K won't fit in 16GB VRAM
)

# Format message with chat template
messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=200,
    stop=["<|im_end|>"]
)

outputs = model.generate(prompt, sampling_params)
response = outputs[0].outputs[0].text

print(response)
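
The model can also be served over HTTP with vLLM's OpenAI-compatible server. The commands below are a sketch rather than a tested recipe for this checkpoint; the port and the placeholder API key are assumptions.

# Start the server first (shell):
#   vllm serve SILVERTHRONE/nanbeige4.1-3b-nvfp4 --max-model-len 8192 --gpu-memory-utilization 0.85
# Then query it with any OpenAI-compatible client; the chat template is applied server-side.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # default vLLM port; key is unused

completion = client.chat.completions.create(
    model="SILVERTHRONE/nanbeige4.1-3b-nvfp4",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.7,
    max_tokens=200,
)
print(completion.choices[0].message.content)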

Chat Template

Nanbeige uses the ChatML format with <think> reasoning blocks:

<|im_start|>system
你是南北阁,一款由BOSS直聘自主研发并训练的专业大语言模型。<|im_end|>
<|im_start|>user
{your question}<|im_end|>
<|im_start|>assistant
<think>
{reasoning steps}
</think>
{final answer}<|im_end|>

The system prompt in this template translates to: "You are Nanbeige, a professional large language model independently developed and trained by BOSS Zhipin." The model outputs its reasoning in <think> tags and then gives the final answer; you can parse the response to extract either part.
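
For example, a minimal sketch for splitting a response into its reasoning and answer parts (it assumes the <think>...</think> block, when present, comes before the final answer):

import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a Nanbeige-style response."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()              # no reasoning block found
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()      # everything after </think>
    return reasoning, answer

# `response` is the generated text from the vLLM example above.
reasoning, answer = split_reasoning(response)
print("Reasoning:", reasoning)
print("Answer:", answer)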

Performance

  • Inference Speed: ~110-120 tokens/sec on an RTX 5060 Ti (16GB); see the measurement sketch below
  • Quality: Good - coherent reasoning, correct answers on factual questions
  • Language Support: Bilingual (Chinese/English)

Example outputs:

  • "What is 2+2?" → Correctly identifies answer as 4
  • "What is the capital of France?" → Correctly identifies Paris

Known Limitations

  1. Does NOT work with the transformers library - Loading crashes with KeyError: 'weight_scale' due to incomplete NVFP4 support in compressed-tensors
  2. vLLM only - This is by design; NVFP4 is optimized for vLLM inference
  3. Must set max_model_len=8192 - Default 262K context requires 16GB just for KV cache
  4. English-only calibration - Calibration used an English-only dataset (open-platypus), which may slightly affect Chinese performance

Testing Environment

  • GPU: NVIDIA GeForce RTX 5060 Ti (16GB VRAM, Blackwell)
  • CPU: AMD Ryzen 9 5950X
  • RAM: 32GB DDR4
  • OS: Ubuntu 24.04.3 LTS (Noble)
  • vLLM: 0.15.1
  • transformers: 4.57.3 (for tokenizer only)
  • llmcompressor: Latest (Feb 2026)

Comparison: Text-Only NVFP4 Works!

Unlike NVFP4 quantizations of vision-language models (e.g., Apriel 1.6), this text-only quantization loads and generates correctly in vLLM. The weight_scale error only affects:

  • Vision-language models (LLaVA architecture)
  • transformers library loading

Pure text models like Nanbeige work great with vLLM!

Credits

  • Base model: Nanbeige/Nanbeige4.1-3B, developed by BOSS Zhipin
  • NVFP4 quantization: SILVERTHRONE

License

Unspecified (inherited from base model)