---
library_name: tensorrt_llm
base_model: arcee-ai/Trinity-Large-TrueBase
tags:
  - nvidia
  - nvfp4
  - fp4
  - quantized
  - tensorrt-llm
  - modelopt
  - mixture-of-experts
  - moe
  - blackwell
license: other
license_name: same-as-base-model
license_link: https://huggingface.co/arcee-ai/Trinity-Large-TrueBase
---

# Trinity-Large-TrueBase-NVFP4

NVFP4-quantized version of [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase) for deployment on NVIDIA Blackwell GPUs via TensorRT-LLM.

## Model Details

| | |
|---|---|
| **Base model** | [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase) |
| **Architecture** | AfmoeForCausalLM (Mixture-of-Experts) |
| **Parameters** | 398B total |
| **Layers** | 60 (6 dense + 54 MoE) |
| **Experts** | 256 per MoE layer, 4 active per token, 1 shared expert |
| **Hidden size** | 3072 |
| **MoE intermediate size** | 3072 per expert |
| **Dense intermediate size** | 12,288 |
| **Attention** | 48 heads, 8 KV heads (GQA), sliding window (4096) + full attention every 4 layers |
| **Context length** | 8,192 tokens |
| **Vocabulary** | 200,192 tokens |
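The layer layout in the table can be sketched as a simple schedule. The exact ordering here (the 6 dense layers first, full attention on every fourth layer) is an assumption for illustration; the model's `config.json` is the authoritative source for the pattern.

```python
# Illustrative sketch of the 60-layer schedule described above.
# Assumptions (not confirmed against the checkpoint): the 6 dense
# layers come first, and every 4th layer uses full attention.

NUM_LAYERS = 60
NUM_DENSE = 6

def layer_schedule(num_layers=NUM_LAYERS, num_dense=NUM_DENSE):
    schedule = []
    for i in range(num_layers):
        mlp = "dense" if i < num_dense else "moe"
        attn = "full" if (i + 1) % 4 == 0 else "sliding_window_4096"
        schedule.append((i, mlp, attn))
    return schedule
```

Under these assumptions, 54 of the 60 layers are MoE and 15 use full attention, matching the table's "full attention every 4 layers".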

## Quantization

| | |
|---|---|
| **Method** | NVFP4 (4-bit floating point) |
| **Tool** | [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) 0.41.0 |
| **Group size** | 16 |
| **Calibration** | 512 samples (Korean, Code, Creative Writing, English), max_seq_length=512 |
| **Quantized layers** | MLP/expert weights only (`gate_proj`, `up_proj`, `down_proj` in dense and MoE layers) |
| **BF16 layers** | Attention (Q/K/V/O projections), embeddings, router gates, shared experts, layer norms, lm_head |
| **Source precision** | BF16 |
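To make the table concrete, here is a simplified sketch of group-wise FP4 (E2M1) quantization with group size 16. It keeps group scales in full precision, whereas real NVFP4 stores them in FP8 E4M3 under a second per-tensor FP32 scale; this is an illustration of the rounding behavior, not ModelOpt's implementation.

```python
import numpy as np

# The magnitudes representable in FP4 E2M1
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_quantize_dequantize(w, group_size=16):
    """Round each group of 16 weights to the nearest scaled E2M1 value.

    Simplified: the per-group scale stays in full precision here; real
    NVFP4 stores it as FP8 E4M3 plus a per-tensor FP32 scale.
    """
    w = np.asarray(w, dtype=np.float64)
    assert w.size % group_size == 0
    groups = w.reshape(-1, group_size)
    # Scale so the largest magnitude in each group maps to E2M1's max (6.0).
    scales = np.abs(groups).max(axis=1, keepdims=True) / E2M1_VALUES[-1]
    scales[scales == 0] = 1.0
    scaled = groups / scales
    # Nearest representable magnitude; sign restored afterwards.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_VALUES).argmin(axis=-1)
    deq = np.sign(scaled) * E2M1_VALUES[idx] * scales
    return deq.reshape(w.shape)
```

Because the largest gap between adjacent E2M1 magnitudes is 2 (between 4 and 6), the rounding error per element is bounded by one scale unit, i.e. one sixth of the group's maximum magnitude.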

### Compression

| Format | Size |
|--------|------|
| BF16 (original) | 796 GB |
| **NVFP4 (this model)** | **216 GB** |

3.7x compression.

## Intended Use

This checkpoint is intended for deployment on NVIDIA Blackwell (SM100) GPUs using TensorRT-LLM's NVFP4 inference path. The NVFP4 format requires Blackwell's 5th-generation Tensor Cores for native FP4 execution.

### Loading with TensorRT-LLM

```bash
# Build a TensorRT-LLM engine from the quantized checkpoint
trtllm-build \
    --checkpoint_dir ./Trinity-Large-TrueBase-NVFP4 \
    --output_dir ./engine \
    --gemm_plugin auto
```

## Quantization Recipe

Following NVIDIA's MLP-only quantization strategy (similar to the [DeepSeek-R1 NVFP4 recipe](https://developer.nvidia.com/blog/nvidia-publishes-the-first-deepseek-r1-nvfp4-quantized-model/)):

- Only MLP/expert weights (`gate_proj`, `up_proj`, `down_proj`) are quantized to FP4
- All attention projections remain in BF16 to preserve quality
- Router gates (`mlp.router`) remain in BF16
- Embeddings and lm_head remain in BF16
- The default `*mlp.gate.*` exclusion was removed because Trinity uses `mlp.gate_proj` as a standard MLP projection (not a routing gate)
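The recipe above amounts to a name-based filter over the model's modules. Here is a hedged sketch of that filter using `fnmatch`-style patterns; the pattern strings are illustrative and do not reproduce ModelOpt's actual config syntax.

```python
from fnmatch import fnmatch

# Illustrative patterns only; ModelOpt's real config uses its own
# pattern syntax. Note there is deliberately no exclusion matching
# `gate_proj`, per the recipe above.
QUANTIZE_PATTERNS = ["*gate_proj*", "*up_proj*", "*down_proj*"]
EXCLUDE_PATTERNS = [
    "*self_attn*",      # attention Q/K/V/O stays BF16
    "*mlp.router*",     # router gates stay BF16
    "*shared_expert*",  # shared experts stay BF16
    "*embed*",          # embeddings stay BF16
    "*lm_head*",        # output head stays BF16
]

def is_quantized(module_name: str) -> bool:
    """True if a module with this name would be quantized to FP4."""
    if any(fnmatch(module_name, p) for p in EXCLUDE_PATTERNS):
        return False
    return any(fnmatch(module_name, p) for p in QUANTIZE_PATTERNS)
```

Exclusions take precedence, so a shared expert's `down_proj` stays in BF16 even though its name matches a quantize pattern.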

### Calibration Data

| Domain | Samples | Dataset |
|--------|---------|---------|
| Korean | 128 | [heegyu/open-korean-instructions](https://huggingface.co/datasets/heegyu/open-korean-instructions) |
| Code | 128 | [m-a-p/CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction) |
| Creative Writing | 128 | [Gryphe/ChatGPT-4o-Writing-Prompts](https://huggingface.co/datasets/Gryphe/ChatGPT-4o-Writing-Prompts) |
| General English | 128 | [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) |
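The mixture above can be assembled with a simple per-domain loop. This is a hedged sketch: the loaders and tokenizer are placeholders, while the real run pulled 128 samples from each of the four Hugging Face datasets in the table and truncated to `max_seq_length=512`.

```python
# Sketch of building the 512-sample calibration mix described above.
# `domain_texts` and `tokenize` are placeholders for the real dataset
# loaders and tokenizer.

MAX_SEQ_LENGTH = 512
SAMPLES_PER_DOMAIN = 128

def build_calibration_set(domain_texts, tokenize):
    """domain_texts: dict mapping domain name -> iterable of raw strings."""
    calib = []
    for domain, texts in domain_texts.items():
        for text in list(texts)[:SAMPLES_PER_DOMAIN]:
            ids = tokenize(text)[:MAX_SEQ_LENGTH]  # truncate per recipe
            calib.append({"domain": domain, "input_ids": ids})
    return calib
```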

## Files

| File | Description |
|------|-------------|
| `model-00001-of-00005.safetensors` ... `model-00005-of-00005.safetensors` | Quantized model weights (5 shards, ~43 GB each) |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model configuration with `quantization_config` |
| `hf_quant_config.json` | ModelOpt quantization metadata (consumed by TensorRT-LLM) |
| `generation_config.json` | Generation configuration |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
| `chat_template.jinja` | Chat template |

## Hardware

Quantization was performed on 8x NVIDIA A100-SXM4-80GB GPUs with ~1.8 TiB of system RAM. Total quantization time was approximately 9 hours, dominated by calibration forward passes. Quantization itself does not require Blackwell hardware; only inference with native FP4 execution does.

## Limitations

- Requires NVIDIA Blackwell GPUs (SM100) for native NVFP4 inference via TensorRT-LLM
- Quality may differ from the original BF16 model, particularly on tasks sensitive to numerical precision
- Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
- This quantization targets the MLP/expert layers only; KV cache is not quantized

## License

Same license as the base model [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase).