---
license: apache-2.0
base_model: arcee-ai/Trinity-Large-Base
tags:
  - moe
  - nvfp4
  - modelopt
  - blackwell
  - vllm
---

# Trinity-Large-Base-NVFP4

NVFP4-quantized version of [arcee-ai/Trinity-Large-Base](https://huggingface.co/arcee-ai/Trinity-Large-Base) for deployment on NVIDIA Blackwell GPUs.

## Model Details

| | |
|---|---|
| **Base model** | [arcee-ai/Trinity-Large-Base](https://huggingface.co/arcee-ai/Trinity-Large-Base) |
| **Architecture** | AfmoeForCausalLM (Mixture-of-Experts) |
| **Parameters** | 398B total, ~13B active per token |
| **Layers** | 60 (6 dense + 54 MoE) |
| **Experts** | 256 per MoE layer, 4 active per token, 1 shared expert |
| **Hidden size** | 3072 |
| **MoE intermediate size** | 3072 per expert |
| **Dense intermediate size** | 12,288 |
| **Attention** | 48 heads, 8 KV heads (GQA), sliding window (4096) + full attention every 4 layers |
| **Context length** | 8,192 tokens |
| **Vocabulary** | 200,192 tokens |

## Quantization

| | |
|---|---|
| **Method** | NVFP4 (4-bit floating point) |
| **Tool** | [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) 0.41.0 |
| **Group size** | 16 |
| **Calibration** | 512 samples (Korean, Code, Creative Writing, English), max_seq_length=512 |
| **Quantized layers** | MLP/expert weights only (`gate_proj`, `up_proj`, `down_proj` in dense and MoE layers) |
| **BF16 layers** | Attention (Q/K/V/O projections), embeddings, router gates, shared experts, layer norms, lm_head |
| **Source precision** | BF16 |

### Compression

| Format | Size |
|--------|------|
| BF16 (original) | 796 GB |
| **NVFP4 (this model)** | **216 GB** |

3.7x compression.
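
The ratio above follows directly from the table:

```python
# Compression ratio from the sizes quoted in the table above.
bf16_gb, nvfp4_gb = 796, 216
ratio = round(bf16_gb / nvfp4_gb, 1)
print(ratio)  # 3.7
```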

## Running with vLLM

[vLLM](https://github.com/vllm-project/vllm) >= 0.15.1 supports this model natively with the `modelopt` quantization backend. Blackwell GPUs (SM100/SM120) are **required** for NVFP4 inference.

### Requirements

- **VRAM**: ~216 GB of model weights. A single GPU with ≥224 GB VRAM can load it directly; smaller setups require multi-GPU and/or CPU offloading.
- **System RAM**: If using `cpu_offload_gb`, you need sufficient system RAM for pinned memory (the offload value × number of GPUs, plus ~40 GB for model loading overhead).
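
As a rough sketch of the RAM requirement above (the ~40 GB overhead figure is the one quoted in this card, and `pinned_ram_gb` is an illustrative helper, not a vLLM API):

```python
def pinned_ram_gb(cpu_offload_gb: float, num_gpus: int, overhead_gb: float = 40.0) -> float:
    """Estimate total system RAM needed: pinned offload buffers for every
    GPU rank plus the ~40 GB model-loading workspace mentioned above."""
    return cpu_offload_gb * num_gpus + overhead_gb

# e.g. offloading 30 GB per GPU across 2 GPUs:
print(pinned_ram_gb(30, 2))  # 100.0
```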

### Installation

```bash
pip install "vllm>=0.15.1"
```

### Environment Variables

Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. This avoids large temporary GPU allocations during MoE weight initialization that can cause OOM on memory-constrained setups:

```bash
export VLLM_USE_FLASHINFER_MOE_FP4=0
```

### Single-GPU (≥224 GB VRAM)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/Trinity-Large-Base-NVFP4",
    quantization="modelopt",
    max_model_len=4096,
    enforce_eager=True,
    gpu_memory_utilization=0.90,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```

### Multi-GPU with Pipeline Parallelism

For setups where total VRAM is less than ~216 GB, use pipeline parallelism with CPU weight offloading:

```python
import os
os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/Trinity-Large-Base-NVFP4",
    quantization="modelopt",
    pipeline_parallel_size=2,        # number of GPUs
    cpu_offload_gb=30,               # GB of weights to offload per GPU
    max_model_len=512,
    max_num_seqs=256,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```

**Tuning tips:**
- `cpu_offload_gb` is **per GPU** — total pinned memory = `cpu_offload_gb × pipeline_parallel_size`. Ensure this fits in system RAM alongside the OS and model loading workspace (~40 GB).
- For **heterogeneous GPU setups** (different VRAM sizes), set `VLLM_PP_LAYER_PARTITION` to control how many of the 60 layers each GPU gets. For example, `export VLLM_PP_LAYER_PARTITION="32,14,14"` for a 3-GPU setup where the first GPU has ~3x the VRAM.
- Each MoE layer is ~3.9 GB (NVFP4) while each dense layer is ~0.14 GB. The first 6 layers are dense; layers 6–59 are MoE. Distribute layers so that `(layer_weights - cpu_offload_gb)` fits comfortably on each GPU with room for KV cache and overhead.
- `max_num_seqs` may need to be lowered for GPUs with ≤32 GB VRAM. The sampler warmup allocates `max_num_seqs × vocab_size × 8 bytes` of temporary memory (~1.5 GB at the default of 1024). Use 256 for smaller GPUs.
- Start with a low `max_model_len` (e.g., 512) and increase once loading succeeds.
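
To make the partition arithmetic in the tips above concrete, here is a small sketch; the per-layer sizes are the approximations quoted above, not measured values:

```python
# Approximate per-layer NVFP4 weight sizes from the tips above (GB).
DENSE_GB, MOE_GB = 0.14, 3.9
LAYER_GB = [DENSE_GB] * 6 + [MOE_GB] * 54  # layers 0-5 dense, 6-59 MoE

def per_gpu_weights_gb(partition):
    """Layer-weight GB held by each pipeline stage for a given
    VLLM_PP_LAYER_PARTITION split (before subtracting cpu_offload_gb)."""
    sizes, start = [], 0
    for n in partition:
        sizes.append(round(sum(LAYER_GB[start:start + n]), 1))
        start += n
    return sizes

# The "32,14,14" example: the first GPU holds all 6 dense layers plus 26 MoE layers.
print(per_gpu_weights_gb([32, 14, 14]))  # roughly [102.2, 54.6, 54.6]
```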

### OpenAI-Compatible API Server

```bash
VLLM_USE_FLASHINFER_MOE_FP4=0 python -m vllm.entrypoints.openai.api_server \
    --model mconcat/Trinity-Large-Base-NVFP4 \
    --quantization modelopt \
    --max-model-len 4096 \
    --enforce-eager \
    --gpu-memory-utilization 0.90 \
    --port 8000
```

For multi-GPU serving, add `--pipeline-parallel-size N --cpu-offload-gb X --max-num-seqs 256` as needed.

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mconcat/Trinity-Large-Base-NVFP4", "prompt": "Hello", "max_tokens": 64}'
```

## Important Notes

- **Blackwell required**: NVFP4 uses Blackwell's 5th-generation Tensor Cores. This model will NOT run on Hopper (H100/H200), Ada (RTX 4090), or older GPUs.
- **vLLM quantization flag**: Use `--quantization modelopt` (not `modelopt_fp4`). vLLM auto-detects the NVFP4 algorithm from the config.
- **MoE backend**: Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. The default flashinfer backend performs a `reorder_w1w3_to_w3w1` operation that temporarily allocates ~2.25 GB per MoE layer on GPU, which can cause OOM.
- **vLLM cpu_offload_gb + V1 engine**: As of vLLM 0.15.x, using `cpu_offload_gb` with the V1 engine may trigger an assertion error in `may_reinitialize_input_batch` (`gpu_model_runner.py`). If you encounter `AssertionError: Cannot re-initialize the input batch when CPU weight offloading is enabled`, this can be safely patched by converting the assertion to a warning. See [vLLM issue #18298](https://github.com/vllm-project/vllm/issues/18298) for status.
- **HuggingFace Transformers**: While `transformers >= 5.0` recognizes the `AfmoeForCausalLM` architecture, it does **not** support ModelOpt NVFP4 weight format for inference. Use vLLM instead.
- **TensorRT-LLM**: As of February 2026, TensorRT-LLM does not support the `AfmoeForCausalLM` architecture.

## Quantization Recipe

Following NVIDIA's MLP-only quantization strategy (similar to the [DeepSeek-R1 NVFP4 recipe](https://developer.nvidia.com/blog/nvidia-publishes-the-first-deepseek-r1-nvfp4-quantized-model/)):

- Only MLP/expert weights (`gate_proj`, `up_proj`, `down_proj`) are quantized to FP4
- All attention projections remain in BF16 to preserve quality
- Router gates (`mlp.router`) remain in BF16
- Embeddings and lm_head remain in BF16
- The default `*mlp.gate.*` exclusion was removed because Trinity uses `mlp.gate_proj` as a standard MLP projection (not a routing gate)
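
The include/exclude logic above can be sketched as a name filter. The glob patterns and module names below are illustrative, not the actual ModelOpt config keys:

```python
import fnmatch

# MLP/expert projections get NVFP4; everything else stays BF16.
QUANTIZE = ["*gate_proj*", "*up_proj*", "*down_proj*"]
KEEP_BF16 = ["*shared_expert*"]  # shared experts stay BF16 per the recipe

def is_nvfp4(name: str) -> bool:
    """Return True if a weight with this (hypothetical) module name
    would be quantized under the MLP-only recipe described above."""
    if any(fnmatch.fnmatch(name, p) for p in KEEP_BF16):
        return False
    return any(fnmatch.fnmatch(name, p) for p in QUANTIZE)

print(is_nvfp4("model.layers.10.mlp.experts.3.up_proj"))       # True
print(is_nvfp4("model.layers.10.self_attn.q_proj"))            # False
print(is_nvfp4("model.layers.7.mlp.shared_expert.down_proj"))  # False
```

Note that attention projections and router gates simply never match the `*_proj` MLP patterns, so they fall through to BF16 without an explicit exclusion.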

### Calibration Data

| Domain | Samples | Dataset |
|--------|---------|---------|
| Korean | 128 | [heegyu/open-korean-instructions](https://huggingface.co/datasets/heegyu/open-korean-instructions) |
| Code | 128 | [m-a-p/CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction) |
| Creative Writing | 128 | [Gryphe/ChatGPT-4o-Writing-Prompts](https://huggingface.co/datasets/Gryphe/ChatGPT-4o-Writing-Prompts) |
| General English | 128 | [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) |

## Files

| File | Description |
|------|-------------|
| `model-00001-of-00005.safetensors` ... `model-00005-of-00005.safetensors` | Quantized model weights (5 shards, ~43-50 GB each) |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model configuration with `quantization_config` |
| `hf_quant_config.json` | ModelOpt quantization metadata |
| `generation_config.json` | Generation configuration |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
| `chat_template.jinja` | Chat template |

## Hardware

Quantization was performed on 8x NVIDIA A100-SXM4-80GB with ~1.8 TiB system RAM. Total quantization time was approximately 9 hours (dominated by calibration forward passes). Quantization on A100 does not require Blackwell hardware; only inference with native FP4 execution does.

## Limitations

- Requires NVIDIA Blackwell GPUs (SM100/SM120) for native NVFP4 inference
- Quality may differ from the original BF16 model, particularly on tasks sensitive to numerical precision
- Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
- This quantization targets the MLP/expert layers only; KV cache is not quantized

## License

Same license as the base model: [Apache 2.0](https://huggingface.co/arcee-ai/Trinity-Large-Base).