---
library_name: tensorrt_llm
base_model: arcee-ai/Trinity-Large-TrueBase
tags:
- nvidia
- nvfp4
- fp4
- quantized
- tensorrt-llm
- modelopt
- mixture-of-experts
- moe
- blackwell
license: other
license_name: same-as-base-model
license_link: https://huggingface.co/arcee-ai/Trinity-Large-TrueBase
---
|
|
|
|
|
# Trinity-Large-TrueBase-NVFP4 |
|
|
|
|
|
NVFP4-quantized version of [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase) for deployment on NVIDIA Blackwell GPUs via TensorRT-LLM. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
| | |
|---|---|
| **Base model** | [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase) |
| **Architecture** | AfmoeForCausalLM (Mixture-of-Experts) |
| **Parameters** | 398B total |
| **Layers** | 60 (6 dense + 54 MoE) |
| **Experts** | 256 per MoE layer, 4 active per token, 1 shared expert |
| **Hidden size** | 3072 |
| **MoE intermediate size** | 3072 per expert |
| **Dense intermediate size** | 12,288 |
| **Attention** | 48 heads, 8 KV heads (GQA), sliding window (4096) + full attention every 4 layers |
| **Context length** | 8,192 tokens |
| **Vocabulary** | 200,192 tokens |
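The configuration above implies that only a small fraction of the 398B parameters is active per token. A back-of-envelope sketch (an illustration only, assuming a SwiGLU-style MLP with three projections per expert and counting MLP weights only, not attention or embeddings):

```python
hidden = 3072
moe_inter = 3072        # per-expert intermediate size
dense_inter = 12288     # dense-layer intermediate size
n_experts = 256
active_experts = 4 + 1  # 4 routed + 1 shared
moe_layers = 54
dense_layers = 6

# SwiGLU MLP: gate_proj + up_proj + down_proj, each hidden x intermediate
expert_params = 3 * hidden * moe_inter       # ~28.3M per expert
dense_mlp_params = 3 * hidden * dense_inter  # ~113M per dense layer

total_expert_params = n_experts * expert_params * moe_layers
active_mlp_params = (active_experts * expert_params * moe_layers
                     + dense_mlp_params * dense_layers)

print(f"all routed experts: {total_expert_params / 1e9:.0f}B")        # ~391B
print(f"active MLP params per token: {active_mlp_params / 1e9:.1f}B")  # ~8.3B
```

The routed experts alone account for roughly 391B of the 398B total, which is why quantizing only the MoE/MLP weights (see below) captures most of the compression opportunity.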
|
|
|
|
|
## Quantization |
|
|
|
|
|
| | |
|---|---|
| **Method** | NVFP4 (4-bit floating point) |
| **Tool** | [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) 0.41.0 |
| **Group size** | 16 |
| **Calibration** | 512 samples (Korean, Code, Creative Writing, English), max_seq_length=512 |
| **Quantized layers** | MLP/expert weights only (`gate_proj`, `up_proj`, `down_proj` in dense and MoE layers) |
| **BF16 layers** | Attention (Q/K/V/O projections), embeddings, router gates, shared experts, layer norms, lm_head |
| **Source precision** | BF16 |
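For intuition, group-wise FP4 quantization with group size 16 can be sketched as below. This is a simplified illustration, not the exact NVFP4 algorithm (NVFP4 additionally stores FP8 E4M3 per-group scales and a second-level per-tensor scale); it only shows the group-wise snapping of weights onto the E2M1 (FP4) value grid:

```python
import numpy as np

# The 8 non-negative magnitudes representable in E2M1 (FP4)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_like(w, group_size=16):
    """Quantize a 1-D weight vector group-wise to the FP4 grid (simplified)."""
    w = w.reshape(-1, group_size)
    # per-group scale so each group's largest magnitude maps to FP4's max (6.0)
    scales = np.abs(w).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero groups
    scaled = w / scales
    # snap each value to the nearest grid point, keeping the sign
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return (q * scales).reshape(-1)  # dequantized weights

rng = np.random.default_rng(0)
w = rng.normal(size=64).astype(np.float32)
w_dq = quantize_nvfp4_like(w, group_size=16)
print("max abs error:", np.abs(w - w_dq).max())
```

The small group size (16) keeps each scale local, so one outlier weight only degrades the resolution of its own 16-element group rather than the whole tensor row.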
|
|
|
|
|
### Compression |
|
|
|
|
|
| Format | Size |
|--------|------|
| BF16 (original) | 796 GB |
| **NVFP4 (this model)** | **216 GB** |
|
|
|
|
|
This yields roughly 3.7x compression.
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This checkpoint is intended for deployment on NVIDIA Blackwell (SM100) GPUs using TensorRT-LLM's NVFP4 inference path. The NVFP4 format requires Blackwell's 5th-generation Tensor Cores for native FP4 execution. |
|
|
|
|
|
### Loading with TensorRT-LLM |
|
|
|
|
|
```bash
# Convert to a TensorRT-LLM engine
trtllm-build \
    --checkpoint_dir ./Trinity-Large-TrueBase-NVFP4 \
    --output_dir ./engine \
    --gemm_plugin auto
```
|
|
|
|
|
## Quantization Recipe |
|
|
|
|
|
Following NVIDIA's MLP-only quantization strategy (similar to the [DeepSeek-R1 NVFP4 recipe](https://developer.nvidia.com/blog/nvidia-publishes-the-first-deepseek-r1-nvfp4-quantized-model/)): |
|
|
|
|
|
- Only MLP/expert weights (`gate_proj`, `up_proj`, `down_proj`) are quantized to FP4
- All attention projections remain in BF16 to preserve quality
- Router gates (`mlp.router`) remain in BF16
- Embeddings and lm_head remain in BF16
- The default `*mlp.gate.*` exclusion was removed because Trinity uses `mlp.gate_proj` as a standard MLP projection (not a routing gate)
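The layer-selection rules above can be expressed in ModelOpt's wildcard-pattern config style, where module-name globs map to quantizer settings and `{"enable": False}` leaves matched modules in BF16. The dictionary below is a hypothetical plain-Python illustration of that pattern (the name `NVFP4_MLP_ONLY_CFG` and the exact keys are assumptions; consult ModelOpt's documented NVFP4 base config for the real structure):

```python
# Hypothetical MLP-only NVFP4 config in ModelOpt's wildcard style.
NVFP4_MLP_ONLY_CFG = {
    "quant_cfg": {
        # quantize MLP/expert projections to FP4 (E2M1) with group size 16
        "*gate_proj*weight_quantizer": {"num_bits": (2, 1), "block_sizes": {-1: 16}},
        "*up_proj*weight_quantizer": {"num_bits": (2, 1), "block_sizes": {-1: 16}},
        "*down_proj*weight_quantizer": {"num_bits": (2, 1), "block_sizes": {-1: 16}},
        # keep precision-sensitive modules in BF16
        "*self_attn*": {"enable": False},
        "*router*": {"enable": False},
        "*shared_expert*": {"enable": False},
        "*embed*": {"enable": False},
        "*lm_head*": {"enable": False},
    },
    "algorithm": "max",
}

disabled = [k for k, v in NVFP4_MLP_ONLY_CFG["quant_cfg"].items()
            if v.get("enable") is False]
print("kept in BF16:", disabled)
```

In the actual run this config would be passed to ModelOpt's quantize entry point together with a calibration forward loop; the sketch only shows the include/exclude shape of the recipe.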
|
|
|
|
|
### Calibration Data |
|
|
|
|
|
| Domain | Samples | Dataset |
|--------|---------|---------|
| Korean | 128 | [heegyu/open-korean-instructions](https://huggingface.co/datasets/heegyu/open-korean-instructions) |
| Code | 128 | [m-a-p/CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction) |
| Creative Writing | 128 | [Gryphe/ChatGPT-4o-Writing-Prompts](https://huggingface.co/datasets/Gryphe/ChatGPT-4o-Writing-Prompts) |
| General English | 128 | [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) |
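Assembling such a mix amounts to sampling a fixed quota per domain, truncating each sample, and shuffling. A minimal sketch, with placeholder strings standing in for the four HF datasets (in the real run each sample was tokenized with `max_seq_length=512`; character truncation here is just a stand-in):

```python
import random

def build_calib_set(sources, per_source=128, max_seq_length=512, seed=0):
    """Draw a fixed number of samples from each domain and shuffle the mix."""
    rng = random.Random(seed)
    calib = []
    for name, samples in sources.items():
        picked = rng.sample(samples, k=min(per_source, len(samples)))
        calib.extend(s[:max_seq_length] for s in picked)  # crude truncation
    rng.shuffle(calib)
    return calib

# illustrative stand-ins for the four calibration datasets
sources = {
    "korean": [f"ko-{i}" for i in range(1000)],
    "code": [f"code-{i}" for i in range(1000)],
    "writing": [f"wr-{i}" for i in range(1000)],
    "english": [f"en-{i}" for i in range(1000)],
}
calib = build_calib_set(sources)
print(len(calib))  # 512
```

Balancing the quotas (128 each) keeps any single domain from dominating the activation statistics that calibration collects.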
|
|
|
|
|
## Files |
|
|
|
|
|
| File | Description |
|------|-------------|
| `model-00001-of-00005.safetensors` ... `model-00005-of-00005.safetensors` | Quantized model weights (5 shards, ~43 GB each) |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model configuration with `quantization_config` |
| `hf_quant_config.json` | ModelOpt quantization metadata (consumed by TensorRT-LLM) |
| `generation_config.json` | Generation configuration |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
| `chat_template.jinja` | Chat template |
|
|
|
|
|
## Hardware |
|
|
|
|
|
Quantization was performed on 8x NVIDIA A100-SXM4-80GB GPUs with ~1.8 TiB of system RAM, taking approximately 9 hours in total (dominated by calibration forward passes). Note that quantization itself does not require Blackwell hardware; only inference with native FP4 execution does.
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Requires NVIDIA Blackwell GPUs (SM100) for native NVFP4 inference via TensorRT-LLM
- Quality may differ from the original BF16 model, particularly on tasks sensitive to numerical precision
- Calibration was bilingual (Korean + English) plus code; other languages may see slightly higher degradation
- Only the MLP/expert layers are quantized; the KV cache is not quantized
|
|
|
|
|
## License |
|
|
|
|
|
Same license as the base model [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase). |
|
|
|