---
library_name: tensorrt_llm
base_model: arcee-ai/Trinity-Large-TrueBase
tags:
  - nvidia
  - nvfp4
  - fp4
  - quantized
  - tensorrt-llm
  - modelopt
  - mixture-of-experts
  - moe
  - blackwell
license: other
license_name: same-as-base-model
license_link: https://huggingface.co/arcee-ai/Trinity-Large-TrueBase
---

# Trinity-Large-TrueBase-NVFP4

NVFP4-quantized version of [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase) for deployment on NVIDIA Blackwell GPUs via TensorRT-LLM.

## Model Details

| Property | Value |
|---|---|
| Base model | arcee-ai/Trinity-Large-TrueBase |
| Architecture | AfmoeForCausalLM (Mixture-of-Experts) |
| Parameters | 398B total |
| Layers | 60 (6 dense + 54 MoE) |
| Experts | 256 per MoE layer, 4 active per token, 1 shared expert |
| Hidden size | 3072 |
| MoE intermediate size | 3072 per expert |
| Dense intermediate size | 12,288 |
| Attention | 48 heads, 8 KV heads (GQA); sliding window (4096) with full attention every 4 layers |
| Context length | 8,192 tokens |
| Vocabulary | 200,192 tokens |
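The expert configuration above (256 experts, top-4 routing, one always-active shared expert) can be sketched as follows. This is a minimal numpy illustration of the routing math, not the AfmoeForCausalLM implementation; the function names and the single-matrix experts are stand-ins, and the hidden size is shrunk from 3072 to 8 to keep the toy run small.

```python
import numpy as np

def moe_layer_sketch(x, router_w, expert_ws, shared_w, top_k=4):
    """Illustrative top-k MoE routing for a single token vector x."""
    logits = router_w @ x                          # one score per expert
    top = np.argsort(logits)[-top_k:]              # indices of the 4 highest-scoring experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                           # softmax over the selected experts only
    out = sum(g * (expert_ws[i] @ x) for g, i in zip(gates, top))
    return out + shared_w @ x                      # shared expert always contributes

rng = np.random.default_rng(0)
hidden, num_experts = 8, 256                       # toy hidden size; real model uses 3072
router_w = rng.standard_normal((num_experts, hidden))
expert_ws = rng.standard_normal((num_experts, hidden, hidden)) * 0.1
shared_w = rng.standard_normal((hidden, hidden)) * 0.1
y = moe_layer_sketch(rng.standard_normal(hidden), router_w, expert_ws, shared_w)
print(y.shape)  # (8,)
```

Only 4 of 256 expert matrices touch each token, which is why a 398B-parameter model can run with a much smaller per-token compute cost.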

## Quantization

| Property | Value |
|---|---|
| Method | NVFP4 (4-bit floating point) |
| Tool | NVIDIA ModelOpt 0.41.0 |
| Group size | 16 |
| Calibration | 512 samples (Korean, Code, Creative Writing, English), max_seq_length=512 |
| Quantized layers | MLP/expert weights only (gate_proj, up_proj, down_proj in dense and MoE layers) |
| BF16 layers | Attention (Q/K/V/O projections), embeddings, router gates, shared experts, layer norms, lm_head |
| Source precision | BF16 |
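To make "group size 16" concrete, here is a simplified sketch of group-wise FP4 quantization: each group of 16 weights shares one scale chosen so the group's largest magnitude maps to the top of the E2M1 grid, and every weight snaps to the nearest grid point. This is illustrative only; real NVFP4 stores the per-group scale in FP8 (E4M3) with a second per-tensor scale, while this sketch keeps the scale in FP32.

```python
import numpy as np

# E2M1 (FP4) representable magnitudes; sign is handled separately
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_quantize_sketch(w, group_size=16):
    """Fake-quantize: round each group of 16 weights to the FP4 grid
    after per-group scaling, then dequantize back to float."""
    w = np.asarray(w, dtype=np.float64).reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / FP4_GRID[-1]  # group max -> 6.0
    scale[scale == 0] = 1.0                       # avoid dividing an all-zero group
    scaled = w / scale
    # snap each |value| to the nearest grid point, keeping the sign
    candidates = np.sign(scaled)[..., None] * FP4_GRID
    idx = np.abs(scaled[..., None] - candidates).argmin(-1)
    deq = np.sign(scaled) * FP4_GRID[idx] * scale
    return deq.reshape(-1)

w = np.random.default_rng(0).standard_normal(64)
wq = nvfp4_quantize_sketch(w)
print(np.abs(w - wq).max())  # per-group rounding error
```

Values that already lie on a group's scaled grid survive round-tripping exactly; everything else lands on the nearest of the 15 signed grid points, which is the source of the quantization error the calibration step tries to minimize.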

## Compression

| Format | Size |
|---|---|
| BF16 (original) | 796 GB |
| NVFP4 (this model) | 216 GB |

3.7x compression.
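A quick sanity check of these figures (treating the table's GB values as decimal gigabytes):

```python
# Back-of-envelope check of the compression figures above
bf16_gb, nvfp4_gb = 796, 216
ratio = bf16_gb / nvfp4_gb
print(f"{ratio:.1f}x")  # → 3.7x

# Effective bits per parameter across the whole 398B-parameter checkpoint
params = 398e9
bits_per_param = nvfp4_gb * 1e9 * 8 / params
print(f"{bits_per_param:.1f} bits/param")
```

The effective rate comes out above 4 bits per parameter because attention projections, embeddings, and other excluded layers stay in BF16, and each 16-weight group carries its own scale.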

## Intended Use

This checkpoint is intended for deployment on NVIDIA Blackwell (SM100) GPUs using TensorRT-LLM's NVFP4 inference path. The NVFP4 format requires Blackwell's 5th-generation Tensor Cores for native FP4 execution.

## Loading with TensorRT-LLM

```bash
# Convert to TensorRT-LLM engine
trtllm-build \
    --checkpoint_dir ./Trinity-Large-TrueBase-NVFP4 \
    --output_dir ./engine \
    --gemm_plugin auto
```

## Quantization Recipe

Following NVIDIA's MLP-only quantization strategy (similar to the DeepSeek-R1 NVFP4 recipe):

- Only MLP/expert weights (`gate_proj`, `up_proj`, `down_proj`) are quantized to FP4
- All attention projections remain in BF16 to preserve quality
- Router gates (`mlp.router`) remain in BF16
- Embeddings and `lm_head` remain in BF16
- The default `*mlp.gate.*` exclusion was removed because Trinity uses `mlp.gate_proj` as a standard MLP projection (not a routing gate)
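Recipes like this are typically expressed as name patterns matched against each weight tensor. The sketch below shows the idea with a hypothetical exclusion list mirroring the bullets above; the patterns and `is_quantized` helper are illustrative, not ModelOpt's actual configuration format.

```python
import fnmatch

# Hypothetical exclusion patterns mirroring the recipe above: everything is an
# FP4 candidate except attention, router gates, shared experts, embeddings,
# layer norms, and the output head.
EXCLUDE = ["*self_attn*", "*router*", "*shared_expert*",
           "*embed_tokens*", "*norm*", "lm_head*"]

def is_quantized(name: str) -> bool:
    """True if a weight with this name would be quantized to FP4."""
    return not any(fnmatch.fnmatch(name, pat) for pat in EXCLUDE)

print(is_quantized("model.layers.10.mlp.experts.3.gate_proj"))  # True: expert MLP -> FP4
print(is_quantized("model.layers.10.self_attn.q_proj"))         # False: attention stays BF16
print(is_quantized("model.layers.10.mlp.router"))               # False: router stays BF16
print(is_quantized("model.layers.2.mlp.gate_proj"))             # True: dense gate_proj -> FP4
```

The last case is the crux of the final bullet: a pattern like `*mlp.gate.*` aimed at routing gates must not accidentally catch `mlp.gate_proj`, which in Trinity is an ordinary MLP projection.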

## Calibration Data

| Domain | Samples | Dataset |
|---|---|---|
| Korean | 128 | heegyu/open-korean-instructions |
| Code | 128 | m-a-p/CodeFeedback-Filtered-Instruction |
| Creative Writing | 128 | Gryphe/ChatGPT-4o-Writing-Prompts |
| General English | 128 | teknium/OpenHermes-2.5 |
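The mixing logic behind this table can be sketched as below. It is a hypothetical stand-in: the real run pulled the four datasets listed above, whereas this sketch uses dummy loaders and a whitespace tokenizer so it runs offline; only the counts (128 per domain, 512 total) and the `max_seq_length=512` truncation match the actual setup.

```python
import random

DOMAINS = ["korean", "code", "creative_writing", "english"]
SAMPLES_PER_DOMAIN = 128
MAX_SEQ_LENGTH = 512

def build_calibration_set(load_domain, tokenize):
    """Collect 128 truncated samples per domain and shuffle them together."""
    random.seed(0)
    calib = []
    for domain in DOMAINS:
        for text in load_domain(domain)[:SAMPLES_PER_DOMAIN]:
            calib.append(tokenize(text)[:MAX_SEQ_LENGTH])  # cap sample length
    random.shuffle(calib)  # interleave domains so calibration batches are mixed
    return calib

# Dummy stand-ins for dataset loading and tokenization
load = lambda d: [f"{d} sample {i}" for i in range(200)]
tok = lambda s: s.split()
calib = build_calibration_set(load, tok)
print(len(calib))  # 512 = 4 domains x 128 samples
```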

## Files

| File | Description |
|---|---|
| model-00001-of-00005.safetensors ... model-00005-of-00005.safetensors | Quantized model weights (5 shards, ~43 GB each) |
| model.safetensors.index.json | Weight shard index |
| config.json | Model configuration with quantization_config |
| hf_quant_config.json | ModelOpt quantization metadata (consumed by TensorRT-LLM) |
| generation_config.json | Generation configuration |
| tokenizer.json | Tokenizer |
| tokenizer_config.json | Tokenizer configuration |
| chat_template.jinja | Chat template |

## Hardware

Quantization was performed on 8x NVIDIA A100-SXM4-80GB with ~1.8 TiB system RAM. Total quantization time was approximately 9 hours (dominated by calibration forward passes). Quantization on A100 does not require Blackwell hardware; only inference with native FP4 execution does.

## Limitations

- Requires NVIDIA Blackwell GPUs (SM100) for native NVFP4 inference via TensorRT-LLM
- Quality may differ from the original BF16 model, particularly on tasks sensitive to numerical precision
- Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
- This quantization targets the MLP/expert layers only; the KV cache is not quantized

## License

Same license as the base model, [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase).