# Trinity-Large-TrueBase-NVFP4

An NVFP4-quantized version of arcee-ai/Trinity-Large-TrueBase for deployment on NVIDIA Blackwell GPUs via TensorRT-LLM.
## Model Details

| Property | Value |
|---|---|
| Base model | arcee-ai/Trinity-Large-TrueBase |
| Architecture | AfmoeForCausalLM (Mixture-of-Experts) |
| Parameters | 398B total |
| Layers | 60 (6 dense + 54 MoE) |
| Experts | 256 per MoE layer, 4 active per token, 1 shared expert |
| Hidden size | 3072 |
| MoE intermediate size | 3072 per expert |
| Dense intermediate size | 12,288 |
| Attention | 48 heads, 8 KV heads (GQA), sliding window (4096) + full attention every 4 layers |
| Context length | 8,192 tokens |
| Vocabulary | 200,192 tokens |
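As a sanity check, the table's per-component sizes can be tallied with rough arithmetic. This sketch assumes the shared expert has the same shape as a routed expert and that `lm_head` is untied from the embeddings; neither assumption is stated on this card.

```python
# Rough sanity check of the parameter counts in the table above (assumptions:
# shared expert has the same shape as a routed expert; lm_head is untied).
hidden = 3072
n_layers, n_dense, n_moe = 60, 6, 54
n_experts = 256
moe_inter, dense_inter = 3072, 12288
vocab = 200_192
n_heads, n_kv_heads = 48, 8
head_dim = hidden // n_heads  # 64

expert = 3 * hidden * moe_inter                 # gate_proj + up_proj + down_proj
routed = n_moe * n_experts * expert             # all routed experts
shared = n_moe * expert                         # one shared expert per MoE layer
dense_mlp = n_dense * 3 * hidden * dense_inter
attn = n_layers * 2 * hidden * (n_heads + n_kv_heads) * head_dim  # q,o + k,v
embed = 2 * vocab * hidden                      # input embeddings + lm_head

total = routed + shared + dense_mlp + attn + embed
print(f"{total / 1e9:.0f}B")  # ~396B; norms and router weights make up the rest
```

The routed experts alone account for about 391B of the 398B total, which is why MLP-only quantization (below) still compresses the model so effectively.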
## Quantization

| Setting | Value |
|---|---|
| Method | NVFP4 (4-bit floating point) |
| Tool | NVIDIA ModelOpt 0.41.0 |
| Group size | 16 |
| Calibration | 512 samples (Korean, Code, Creative Writing, English), max_seq_length=512 |
| Quantized layers | MLP/expert weights only (`gate_proj`, `up_proj`, `down_proj` in dense and MoE layers) |
| BF16 layers | Attention (Q/K/V/O projections), embeddings, router gates, shared experts, layer norms, `lm_head` |
| Source precision | BF16 |
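To make the format concrete, here is a minimal pure-Python sketch of NVFP4-style group quantization: each group of 16 weights shares one scale, and each weight rounds to the nearest FP4 (E2M1) magnitude. This is illustrative, not the ModelOpt implementation; the real format stores the per-group scale in FP8 (E4M3) alongside a per-tensor scale, both simplified to a plain float here.

```python
# Representable magnitudes in FP4 E2M1 (plus a sign bit in the real format)
FP4_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_group(weights):
    """Fake-quantize one group of 16 weights: scale, round to FP4, dequantize."""
    assert len(weights) == 16
    amax = max(abs(w) for w in weights) or 1.0
    scale = amax / 6.0  # map the largest magnitude onto FP4's max value, 6.0
    deq = []
    for w in weights:
        level = min(FP4_LEVELS, key=lambda v: abs(v - abs(w) / scale))
        deq.append(level * scale * (1.0 if w >= 0 else -1.0))
    return deq

group = [0.9, -0.1, 0.3, 0.0, 1.2, -0.6, 0.45, 0.05,
         -1.2, 0.75, 0.2, -0.3, 0.6, 0.15, -0.9, 0.1]
deq = quantize_group(group)
print(max(abs(a - b) for a, b in zip(group, deq)))  # worst-case error for this group
```

The small group size (16) is what keeps the rounding error local: one outlier weight only inflates the scale of its own 16-element group, not the whole tensor.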
## Compression

| Format | Size |
|---|---|
| BF16 (original) | 796 GB |
| NVFP4 (this model) | 216 GB |

A ~3.7× reduction in size.
## Intended Use
This checkpoint is intended for deployment on NVIDIA Blackwell (SM100) GPUs using TensorRT-LLM's NVFP4 inference path. The NVFP4 format requires Blackwell's 5th-generation Tensor Cores for native FP4 execution.
## Loading with TensorRT-LLM

```bash
# Convert to a TensorRT-LLM engine
trtllm-build \
  --checkpoint_dir ./Trinity-Large-TrueBase-NVFP4 \
  --output_dir ./engine \
  --gemm_plugin auto
```
## Quantization Recipe

Following NVIDIA's MLP-only quantization strategy (similar to the DeepSeek-R1 NVFP4 recipe):

- Only MLP/expert weights (`gate_proj`, `up_proj`, `down_proj`) are quantized to FP4
- All attention projections remain in BF16 to preserve quality
- Router gates (`mlp.router`) remain in BF16
- Embeddings and `lm_head` remain in BF16
- The default `*mlp.gate.*` exclusion was removed because Trinity uses `mlp.gate_proj` as a standard MLP projection (not a routing gate)
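The recipe above amounts to pattern-based layer selection. This is a hypothetical sketch of that selection logic; the glob patterns are illustrative module-name filters, not the exact ModelOpt configuration that produced this checkpoint.

```python
import fnmatch

# Module-name patterns kept in BF16 (excluded from FP4 quantization)
EXCLUDE_PATTERNS = [
    "*attn*",           # attention Q/K/V/O projections
    "*embed*",          # token embeddings
    "lm_head*",         # output head
    "*router*",         # MoE routing gates
    "*shared_expert*",  # shared experts
    "*norm*",           # layer norms
]
# Note: no "*mlp.gate.*" pattern here -- Trinity's mlp.gate_proj is a standard
# MLP projection, not a routing gate, so it must stay quantizable.

def is_quantized(name: str) -> bool:
    """True if a module with this name would be quantized to FP4."""
    return not any(fnmatch.fnmatch(name, p) for p in EXCLUDE_PATTERNS)

print(is_quantized("model.layers.10.mlp.experts.3.gate_proj"))  # True
print(is_quantized("model.layers.10.self_attn.q_proj"))         # False
print(is_quantized("model.layers.10.mlp.router"))               # False
```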
## Calibration Data
| Domain | Samples | Dataset |
|---|---|---|
| Korean | 128 | heegyu/open-korean-instructions |
| Code | 128 | m-a-p/CodeFeedback-Filtered-Instruction |
| Creative Writing | 128 | Gryphe/ChatGPT-4o-Writing-Prompts |
| General English | 128 | teknium/OpenHermes-2.5 |
## Files

| File | Description |
|---|---|
| `model-00001-of-00005.safetensors` ... `model-00005-of-00005.safetensors` | Quantized model weights (5 shards, ~43 GB each) |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model configuration with `quantization_config` |
| `hf_quant_config.json` | ModelOpt quantization metadata (consumed by TensorRT-LLM) |
| `generation_config.json` | Generation configuration |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
| `chat_template.jinja` | Chat template |
## Hardware
Quantization was performed on 8x NVIDIA A100-SXM4-80GB with ~1.8 TiB system RAM. Total quantization time was approximately 9 hours (dominated by calibration forward passes). Quantization on A100 does not require Blackwell hardware; only inference with native FP4 execution does.
## Limitations
- Requires NVIDIA Blackwell GPUs (SM100) for native NVFP4 inference via TensorRT-LLM
- Quality may differ from the original BF16 model, particularly on tasks sensitive to numerical precision
- Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
- This quantization targets the MLP/expert layers only; KV cache is not quantized
## License
Same license as the base model arcee-ai/Trinity-Large-TrueBase.