Trinity-Large-TrueBase-NVFP4

NVFP4-quantized version of arcee-ai/Trinity-Large-TrueBase for deployment on NVIDIA Blackwell GPUs via TensorRT-LLM.

Model Details

| Property | Value |
|---|---|
| Base model | arcee-ai/Trinity-Large-TrueBase |
| Architecture | AfmoeForCausalLM (Mixture-of-Experts) |
| Parameters | 398B total |
| Layers | 60 (6 dense + 54 MoE) |
| Experts | 256 per MoE layer, 4 active per token, 1 shared expert |
| Hidden size | 3072 |
| MoE intermediate size | 3072 per expert |
| Dense intermediate size | 12,288 |
| Attention | 48 heads, 8 KV heads (GQA), sliding window (4096) + full attention every 4 layers |
| Context length | 8,192 tokens |
| Vocabulary | 200,192 tokens |
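The figures above allow a back-of-envelope estimate of the per-token KV-cache footprint. The sketch below assumes a BF16 KV cache and head_dim = hidden_size / num_attention_heads, neither of which is stated explicitly in the table:

```python
# Rough KV-cache size per token, derived from the model details above.
# Assumptions (not stated in the card): BF16 KV cache (2 bytes/value)
# and head_dim = hidden_size / num_attention_heads.
hidden_size = 3072
num_heads = 48
num_kv_heads = 8        # GQA
num_layers = 60
bytes_per_value = 2     # BF16

head_dim = hidden_size // num_heads  # 64
# K and V, per layer, per token:
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(head_dim)            # 64
print(kv_bytes_per_token)  # 122880 bytes, i.e. 120 KiB per token
```

In practice the footprint is lower, since three of every four layers use a 4096-token sliding window and only cache that window.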

Quantization

| Property | Value |
|---|---|
| Method | NVFP4 (4-bit floating point) |
| Tool | NVIDIA ModelOpt 0.41.0 |
| Group size | 16 |
| Calibration | 512 samples (Korean, Code, Creative Writing, English), max_seq_length=512 |
| Quantized layers | MLP/expert weights only (gate_proj, up_proj, down_proj in dense and MoE layers) |
| BF16 layers | Attention (Q/K/V/O projections), embeddings, router gates, shared experts, layer norms, lm_head |
| Source precision | BF16 |
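As a sanity check on storage cost: NVFP4 with group size 16 packs sixteen 4-bit values two-per-byte plus one FP8 (E4M3) scale per group, which works out to roughly 4.5 bits per quantized weight (ignoring the small per-tensor global scale). A minimal sketch of the arithmetic:

```python
# Effective storage per weight for NVFP4 with group size 16,
# ignoring the negligible per-tensor global scale.
group_size = 16
packed_bytes = group_size // 2   # two FP4 values per byte -> 8 bytes
scale_bytes = 1                  # one FP8 (E4M3) scale per group
bits_per_weight = 8 * (packed_bytes + scale_bytes) / group_size
print(bits_per_weight)           # 4.5
```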

Compression

| Format | Size |
|---|---|
| BF16 (original) | 796 GB |
| NVFP4 (this model) | 216 GB |

Overall, a ~3.7x reduction in on-disk size.

Intended Use

This checkpoint is intended for deployment on NVIDIA Blackwell (SM100) GPUs using TensorRT-LLM's NVFP4 inference path. The NVFP4 format requires Blackwell's 5th-generation Tensor Cores for native FP4 execution.

Loading with TensorRT-LLM

```shell
# Convert to TensorRT-LLM engine
trtllm-build \
    --checkpoint_dir ./Trinity-Large-TrueBase-NVFP4 \
    --output_dir ./engine \
    --gemm_plugin auto
```

Quantization Recipe

Following NVIDIA's MLP-only quantization strategy (similar to the DeepSeek-R1 NVFP4 recipe):

  • Only MLP/expert weights (gate_proj, up_proj, down_proj) are quantized to FP4
  • All attention projections remain in BF16 to preserve quality
  • Router gates (mlp.router) remain in BF16
  • Embeddings and lm_head remain in BF16
  • The default *mlp.gate.* exclusion was removed because Trinity uses mlp.gate_proj as a standard MLP projection (not a routing gate)
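The recipe above amounts to a pattern filter over parameter names. A minimal, hypothetical sketch of such a filter using `fnmatch` (the patterns and helper are illustrative, not ModelOpt's actual config keys):

```python
from fnmatch import fnmatch

# Patterns mirroring the recipe above (illustrative only, not the real
# ModelOpt config): quantize MLP/expert projections; keep attention,
# router, embeddings, lm_head, and norms in BF16.
QUANTIZE = ["*gate_proj*", "*up_proj*", "*down_proj*"]
EXCLUDE = ["*mlp.router*", "*self_attn*", "*embed_tokens*", "*lm_head*", "*norm*"]

def is_quantized(param_name: str) -> bool:
    """True if this parameter would be quantized to FP4 under the recipe."""
    if any(fnmatch(param_name, p) for p in EXCLUDE):
        return False
    return any(fnmatch(param_name, p) for p in QUANTIZE)

print(is_quantized("model.layers.10.mlp.experts.3.gate_proj.weight"))  # True
print(is_quantized("model.layers.10.self_attn.q_proj.weight"))         # False
print(is_quantized("model.layers.10.mlp.router.weight"))               # False
```

Note that `gate_proj` passes the filter: as the last bullet explains, Trinity's `mlp.gate_proj` is a standard MLP projection, not a routing gate.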

Calibration Data

| Domain | Samples | Dataset |
|---|---|---|
| Korean | 128 | heegyu/open-korean-instructions |
| Code | 128 | m-a-p/CodeFeedback-Filtered-Instruction |
| Creative Writing | 128 | Gryphe/ChatGPT-4o-Writing-Prompts |
| General English | 128 | teknium/OpenHermes-2.5 |
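A calibration set like the one above can be assembled by capping each source at 128 samples and interleaving them so every batch sees all domains early. A stdlib-only sketch (dataset loading omitted; the `mix_calibration` helper is hypothetical, not the actual script used):

```python
from itertools import chain, zip_longest

def mix_calibration(domains: dict, per_domain: int = 128) -> list:
    """Cap each domain at `per_domain` samples and interleave round-robin."""
    capped = [samples[:per_domain] for samples in domains.values()]
    return [s for s in chain.from_iterable(zip_longest(*capped)) if s is not None]

# Toy stand-ins for the four datasets in the table above:
domains = {
    "korean": [f"ko-{i}" for i in range(200)],
    "code": [f"code-{i}" for i in range(200)],
    "writing": [f"wr-{i}" for i in range(200)],
    "english": [f"en-{i}" for i in range(200)],
}
calib = mix_calibration(domains)
print(len(calib))   # 512, matching the calibration sample count above
print(calib[:4])    # ['ko-0', 'code-0', 'wr-0', 'en-0']
```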

Files

| File | Description |
|---|---|
| model-00001-of-00005.safetensors ... model-00005-of-00005.safetensors | Quantized model weights (5 shards, ~43 GB each) |
| model.safetensors.index.json | Weight shard index |
| config.json | Model configuration with quantization_config |
| hf_quant_config.json | ModelOpt quantization metadata (consumed by TensorRT-LLM) |
| generation_config.json | Generation configuration |
| tokenizer.json | Tokenizer |
| tokenizer_config.json | Tokenizer configuration |
| chat_template.jinja | Chat template |
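After downloading, the shard index can be sanity-checked by confirming every tensor's shard file is present. A minimal sketch over the `weight_map` structure used by `model.safetensors.index.json` (demonstrated here on a toy index in the same layout; `missing_shards` is an illustrative helper):

```python
def missing_shards(index: dict, present_files: set) -> set:
    """Return shard files referenced by the index but absent on disk.
    `index` follows the safetensors index layout: {"weight_map": {tensor: file}}."""
    needed = set(index["weight_map"].values())
    return needed - present_files

# Toy index in the same shape as model.safetensors.index.json:
toy_index = {
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00005.safetensors",
        "lm_head.weight": "model-00005-of-00005.safetensors",
    }
}
on_disk = {"model-00001-of-00005.safetensors"}
print(missing_shards(toy_index, on_disk))  # {'model-00005-of-00005.safetensors'}
```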

Hardware

Quantization was performed on 8x NVIDIA A100-SXM4-80GB with ~1.8 TiB system RAM. Total quantization time was approximately 9 hours (dominated by calibration forward passes). Quantization on A100 does not require Blackwell hardware; only inference with native FP4 execution does.

Limitations

  • Requires NVIDIA Blackwell GPUs (SM100) for native NVFP4 inference via TensorRT-LLM
  • Quality may differ from the original BF16 model, particularly on tasks sensitive to numerical precision
  • Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
  • This quantization targets the MLP/expert layers only; KV cache is not quantized

License

Same license as the base model arcee-ai/Trinity-Large-TrueBase.
