---
library_name: tensorrt_llm
base_model: arcee-ai/Trinity-Large-TrueBase
tags:
- nvidia
- nvfp4
- fp4
- quantized
- tensorrt-llm
- modelopt
- mixture-of-experts
- moe
- blackwell
license: other
license_name: same-as-base-model
license_link: https://huggingface.co/arcee-ai/Trinity-Large-TrueBase
---

# Trinity-Large-TrueBase-NVFP4

NVFP4-quantized version of [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase) for deployment on NVIDIA Blackwell GPUs via TensorRT-LLM.

## Model Details

| | |
|---|---|
| **Base model** | [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase) |
| **Architecture** | AfmoeForCausalLM (Mixture-of-Experts) |
| **Parameters** | 398B total |
| **Layers** | 60 (6 dense + 54 MoE) |
| **Experts** | 256 per MoE layer, 4 active per token, 1 shared expert |
| **Hidden size** | 3072 |
| **MoE intermediate size** | 3072 per expert |
| **Dense intermediate size** | 12,288 |
| **Attention** | 48 heads, 8 KV heads (GQA), sliding window (4096) + full attention every 4 layers |
| **Context length** | 8,192 tokens |
| **Vocabulary** | 200,192 tokens |

## Quantization

| | |
|---|---|
| **Method** | NVFP4 (4-bit floating point) |
| **Tool** | [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) 0.41.0 |
| **Group size** | 16 |
| **Calibration** | 512 samples (Korean, Code, Creative Writing, English), max_seq_length=512 |
| **Quantized layers** | MLP/expert weights only (`gate_proj`, `up_proj`, `down_proj` in dense and MoE layers) |
| **BF16 layers** | Attention (Q/K/V/O projections), embeddings, router gates, shared experts, layer norms, lm_head |
| **Source precision** | BF16 |

### Compression

| Format | Size |
|--------|------|
| BF16 (original) | 796 GB |
| **NVFP4 (this model)** | **216 GB** |

3.7x compression.

## Intended Use

This checkpoint is intended for deployment on NVIDIA Blackwell (SM100) GPUs using TensorRT-LLM's NVFP4 inference path.
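Before attempting to load the engine, it can be useful to confirm the target GPU actually supports native FP4. A minimal sketch of such a preflight check; the `supports_native_fp4` helper is hypothetical (not part of TensorRT-LLM), and in practice the capability tuple would come from `torch.cuda.get_device_capability(0)`:

```python
def supports_native_fp4(major: int, minor: int) -> bool:
    """Check whether a CUDA compute capability supports native NVFP4 execution.

    Blackwell data-center GPUs (SM100) report compute capability 10.x;
    earlier architectures such as Hopper (9.x) lack FP4 Tensor Cores.
    """
    return major >= 10

# At runtime the tuple would come from torch.cuda.get_device_capability(0).
print(supports_native_fp4(9, 0))   # Hopper H100
print(supports_native_fp4(10, 0))  # Blackwell B200
```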
The NVFP4 format requires Blackwell's 5th-generation Tensor Cores for native FP4 execution.

### Loading with TensorRT-LLM

```bash
# Convert to a TensorRT-LLM engine
trtllm-build \
  --checkpoint_dir ./Trinity-Large-TrueBase-NVFP4 \
  --output_dir ./engine \
  --gemm_plugin auto
```

## Quantization Recipe

Following NVIDIA's MLP-only quantization strategy (similar to the [DeepSeek-R1 NVFP4 recipe](https://developer.nvidia.com/blog/nvidia-publishes-the-first-deepseek-r1-nvfp4-quantized-model/)):

- Only MLP/expert weights (`gate_proj`, `up_proj`, `down_proj`) are quantized to FP4
- All attention projections remain in BF16 to preserve quality
- Router gates (`mlp.router`) remain in BF16
- Embeddings and lm_head remain in BF16
- The default `*mlp.gate.*` exclusion was removed because Trinity uses `mlp.gate_proj` as a standard MLP projection (not a routing gate)

### Calibration Data

| Domain | Samples | Dataset |
|--------|---------|---------|
| Korean | 128 | [heegyu/open-korean-instructions](https://huggingface.co/datasets/heegyu/open-korean-instructions) |
| Code | 128 | [m-a-p/CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction) |
| Creative Writing | 128 | [Gryphe/ChatGPT-4o-Writing-Prompts](https://huggingface.co/datasets/Gryphe/ChatGPT-4o-Writing-Prompts) |
| General English | 128 | [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) |

## Files

| File | Description |
|------|-------------|
| `model-00001-of-00005.safetensors` ... `model-00005-of-00005.safetensors` | Quantized model weights (5 shards, ~43 GB each) |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model configuration with `quantization_config` |
| `hf_quant_config.json` | ModelOpt quantization metadata (consumed by TensorRT-LLM) |
| `generation_config.json` | Generation configuration |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
| `chat_template.jinja` | Chat template |

## Hardware

Quantization was performed on 8x NVIDIA A100-SXM4-80GB GPUs with ~1.8 TiB of system RAM, taking approximately 9 hours (dominated by calibration forward passes). Quantization itself does not require Blackwell hardware; only inference with native FP4 execution does.

## Limitations

- Requires NVIDIA Blackwell GPUs (SM100) for native NVFP4 inference via TensorRT-LLM
- Quality may differ from the original BF16 model, particularly on tasks sensitive to numerical precision
- Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
- This quantization targets the MLP/expert layers only; the KV cache is not quantized

## License

Same license as the base model, [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase).
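For reference, the MLP-only quantization recipe amounts to a simple pattern match over module names: attention, routers, shared experts, embeddings, norms, and the LM head stay in BF16, while dense and expert MLP projections go to FP4. A sketch using fnmatch-style globs; both the pattern list and the example module paths are illustrative assumptions, not verified against this checkpoint's actual state dict:

```python
from fnmatch import fnmatch

# Kept in BF16 per the recipe. Checked first, so a shared expert's
# gate_proj/up_proj/down_proj is never caught by the FP4 patterns below.
BF16_PATTERNS = [
    "*self_attn*", "*router*", "*shared_expert*",
    "*embed_tokens*", "*norm*", "lm_head*",
]
# Quantized to NVFP4: dense and expert MLP projections.
FP4_PATTERNS = ["*gate_proj*", "*up_proj*", "*down_proj*"]

def precision_for(module_name: str) -> str:
    """Return the target precision for a weight, following the MLP-only recipe."""
    if any(fnmatch(module_name, p) for p in BF16_PATTERNS):
        return "bf16"
    if any(fnmatch(module_name, p) for p in FP4_PATTERNS):
        return "nvfp4"
    return "bf16"

# Hypothetical module paths for illustration:
print(precision_for("model.layers.10.mlp.experts.3.gate_proj"))  # expert MLP
print(precision_for("model.layers.10.self_attn.q_proj"))         # attention
print(precision_for("model.layers.10.mlp.router"))               # router gate
```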