---
library_name: tensorrt_llm
base_model: arcee-ai/Trinity-Large-TrueBase
tags:
  - nvidia
  - nvfp4
  - fp4
  - quantized
  - tensorrt-llm
  - modelopt
  - mixture-of-experts
  - moe
  - blackwell
license: other
license_name: same-as-base-model
license_link: https://huggingface.co/arcee-ai/Trinity-Large-TrueBase
---

# Trinity-Large-TrueBase-NVFP4

NVFP4-quantized version of [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase) for deployment on NVIDIA Blackwell GPUs via TensorRT-LLM.

## Model Details

| Property | Value |
|---|---|
| Base model | arcee-ai/Trinity-Large-TrueBase |
| Architecture | AfmoeForCausalLM (Mixture-of-Experts) |
| Parameters | 398B total |
| Layers | 60 (6 dense + 54 MoE) |
| Experts | 256 per MoE layer, 4 active per token, 1 shared expert |
| Hidden size | 3072 |
| MoE intermediate size | 3072 per expert |
| Dense intermediate size | 12,288 |
| Attention | 48 heads, 8 KV heads (GQA); sliding window (4096) with full attention every 4 layers |
| Context length | 8,192 tokens |
| Vocabulary | 200,192 tokens |
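The expert configuration above (256 experts, top-4 routing, one always-active shared expert) can be sketched as follows. This is a minimal numpy illustration of the routing math, not the AfmoeForCausalLM implementation; the function names and the single-matrix experts are stand-ins, and the hidden size is shrunk from 3072 to 8 to keep the toy run small.

```python
import numpy as np

def moe_layer_sketch(x, router_w, expert_ws, shared_w, top_k=4):
    """Illustrative top-k MoE routing for a single token vector x."""
    logits = router_w @ x                          # one score per expert
    top = np.argsort(logits)[-top_k:]              # indices of the 4 highest-scoring experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                           # softmax over the selected experts only
    out = sum(g * (expert_ws[i] @ x) for g, i in zip(gates, top))
    return out + shared_w @ x                      # shared expert always contributes

rng = np.random.default_rng(0)
hidden, num_experts = 8, 256                       # toy hidden size; real model uses 3072
router_w = rng.standard_normal((num_experts, hidden))
expert_ws = rng.standard_normal((num_experts, hidden, hidden)) * 0.1
shared_w = rng.standard_normal((hidden, hidden)) * 0.1
y = moe_layer_sketch(rng.standard_normal(hidden), router_w, expert_ws, shared_w)
print(y.shape)  # (8,)
```

Only 4 of 256 expert matrices touch each token, which is why a 398B-parameter model can run with a much smaller per-token compute cost.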

## Quantization

| Property | Value |
|---|---|
| Method | NVFP4 (4-bit floating point) |
| Tool | NVIDIA ModelOpt 0.41.0 |
| Group size | 16 |
| Calibration | 512 samples (Korean, Code, Creative Writing, English), max_seq_length=512 |
| Quantized layers | MLP/expert weights only (gate_proj, up_proj, down_proj in dense and MoE layers) |
| BF16 layers | Attention (Q/K/V/O projections), embeddings, router gates, shared experts, layer norms, lm_head |
| Source precision | BF16 |
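To make "group size 16" concrete, here is a simplified sketch of group-wise FP4 quantization: each group of 16 weights shares one scale chosen so the group's largest magnitude maps to the top of the E2M1 grid, and every weight snaps to the nearest grid point. This is illustrative only; real NVFP4 stores the per-group scale in FP8 (E4M3) with a second per-tensor scale, while this sketch keeps the scale in FP32.

```python
import numpy as np

# E2M1 (FP4) representable magnitudes; sign is handled separately
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_quantize_sketch(w, group_size=16):
    """Fake-quantize: round each group of 16 weights to the FP4 grid
    after per-group scaling, then dequantize back to float."""
    w = np.asarray(w, dtype=np.float64).reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / FP4_GRID[-1]  # group max -> 6.0
    scale[scale == 0] = 1.0                       # avoid dividing an all-zero group
    scaled = w / scale
    # snap each |value| to the nearest grid point, keeping the sign
    candidates = np.sign(scaled)[..., None] * FP4_GRID
    idx = np.abs(scaled[..., None] - candidates).argmin(-1)
    deq = np.sign(scaled) * FP4_GRID[idx] * scale
    return deq.reshape(-1)

w = np.random.default_rng(0).standard_normal(64)
wq = nvfp4_quantize_sketch(w)
print(np.abs(w - wq).max())  # per-group rounding error
```

Values that already lie on a group's scaled grid survive round-tripping exactly; everything else lands on the nearest of the 15 signed grid points, which is the source of the quantization error the calibration step tries to minimize.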

## Compression

| Format | Size |
|---|---|
| BF16 (original) | 796 GB |
| NVFP4 (this model) | 216 GB |

3.7x compression.
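A quick sanity check of these figures (treating the table's GB values as decimal gigabytes):

```python
# Back-of-envelope check of the compression figures above
bf16_gb, nvfp4_gb = 796, 216
ratio = bf16_gb / nvfp4_gb
print(f"{ratio:.1f}x")  # → 3.7x

# Effective bits per parameter across the whole 398B-parameter checkpoint
params = 398e9
bits_per_param = nvfp4_gb * 1e9 * 8 / params
print(f"{bits_per_param:.1f} bits/param")
```

The effective rate comes out above 4 bits per parameter because attention projections, embeddings, and other excluded layers stay in BF16, and each 16-weight group carries its own scale.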

## Intended Use

This checkpoint is intended for deployment on NVIDIA Blackwell (SM100) GPUs using TensorRT-LLM's NVFP4 inference path. The NVFP4 format requires Blackwell's 5th-generation Tensor Cores for native FP4 execution.

## Loading with TensorRT-LLM

```bash
# Convert to TensorRT-LLM engine
trtllm-build \
    --checkpoint_dir ./Trinity-Large-TrueBase-NVFP4 \
    --output_dir ./engine \
    --gemm_plugin auto
```

## Quantization Recipe

Following NVIDIA's MLP-only quantization strategy (similar to the DeepSeek-R1 NVFP4 recipe):

- Only MLP/expert weights (`gate_proj`, `up_proj`, `down_proj`) are quantized to FP4
- All attention projections remain in BF16 to preserve quality
- Router gates (`mlp.router`) remain in BF16
- Embeddings and `lm_head` remain in BF16
- The default `*mlp.gate.*` exclusion was removed because Trinity uses `mlp.gate_proj` as a standard MLP projection (not a routing gate)
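Recipes like this are typically expressed as name patterns matched against each weight tensor. The sketch below shows the idea with a hypothetical exclusion list mirroring the bullets above; the patterns and `is_quantized` helper are illustrative, not ModelOpt's actual configuration format.

```python
import fnmatch

# Hypothetical exclusion patterns mirroring the recipe above: everything is an
# FP4 candidate except attention, router gates, shared experts, embeddings,
# layer norms, and the output head.
EXCLUDE = ["*self_attn*", "*router*", "*shared_expert*",
           "*embed_tokens*", "*norm*", "lm_head*"]

def is_quantized(name: str) -> bool:
    """True if a weight with this name would be quantized to FP4."""
    return not any(fnmatch.fnmatch(name, pat) for pat in EXCLUDE)

print(is_quantized("model.layers.10.mlp.experts.3.gate_proj"))  # True: expert MLP -> FP4
print(is_quantized("model.layers.10.self_attn.q_proj"))         # False: attention stays BF16
print(is_quantized("model.layers.10.mlp.router"))               # False: router stays BF16
print(is_quantized("model.layers.2.mlp.gate_proj"))             # True: dense gate_proj -> FP4
```

The last case is the crux of the final bullet: a pattern like `*mlp.gate.*` aimed at routing gates must not accidentally catch `mlp.gate_proj`, which in Trinity is an ordinary MLP projection.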

## Calibration Data

| Domain | Samples | Dataset |
|---|---|---|
| Korean | 128 | heegyu/open-korean-instructions |
| Code | 128 | m-a-p/CodeFeedback-Filtered-Instruction |
| Creative Writing | 128 | Gryphe/ChatGPT-4o-Writing-Prompts |
| General English | 128 | teknium/OpenHermes-2.5 |
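The mixing logic behind this table can be sketched as below. It is a hypothetical stand-in: the real run pulled the four datasets listed above, whereas this sketch uses dummy loaders and a whitespace tokenizer so it runs offline; only the counts (128 per domain, 512 total) and the `max_seq_length=512` truncation match the actual setup.

```python
import random

DOMAINS = ["korean", "code", "creative_writing", "english"]
SAMPLES_PER_DOMAIN = 128
MAX_SEQ_LENGTH = 512

def build_calibration_set(load_domain, tokenize):
    """Collect 128 truncated samples per domain and shuffle them together."""
    random.seed(0)
    calib = []
    for domain in DOMAINS:
        for text in load_domain(domain)[:SAMPLES_PER_DOMAIN]:
            calib.append(tokenize(text)[:MAX_SEQ_LENGTH])  # cap sample length
    random.shuffle(calib)  # interleave domains so calibration batches are mixed
    return calib

# Dummy stand-ins for dataset loading and tokenization
load = lambda d: [f"{d} sample {i}" for i in range(200)]
tok = lambda s: s.split()
calib = build_calibration_set(load, tok)
print(len(calib))  # 512 = 4 domains x 128 samples
```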

## Files

| File | Description |
|---|---|
| model-00001-of-00005.safetensors ... model-00005-of-00005.safetensors | Quantized model weights (5 shards, ~43 GB each) |
| model.safetensors.index.json | Weight shard index |
| config.json | Model configuration with quantization_config |
| hf_quant_config.json | ModelOpt quantization metadata (consumed by TensorRT-LLM) |
| generation_config.json | Generation configuration |
| tokenizer.json | Tokenizer |
| tokenizer_config.json | Tokenizer configuration |
| chat_template.jinja | Chat template |

## Hardware

Quantization was performed on 8x NVIDIA A100-SXM4-80GB with ~1.8 TiB system RAM. Total quantization time was approximately 9 hours (dominated by calibration forward passes). Quantization on A100 does not require Blackwell hardware; only inference with native FP4 execution does.

## Limitations

- Requires NVIDIA Blackwell GPUs (SM100) for native NVFP4 inference via TensorRT-LLM
- Quality may differ from the original BF16 model, particularly on tasks sensitive to numerical precision
- Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
- This quantization targets the MLP/expert layers only; the KV cache is not quantized

## License

Same license as the base model, [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase).