---
license: apache-2.0
base_model: arcee-ai/Trinity-Large-Base
tags:
- moe
- nvfp4
- modelopt
- blackwell
- vllm
---
# Trinity-Large-Base-NVFP4
NVFP4-quantized version of [arcee-ai/Trinity-Large-Base](https://huggingface.co/arcee-ai/Trinity-Large-Base) for deployment on NVIDIA Blackwell GPUs.
## Model Details
| | |
|---|---|
| **Base model** | [arcee-ai/Trinity-Large-Base](https://huggingface.co/arcee-ai/Trinity-Large-Base) |
| **Architecture** | AfmoeForCausalLM (Mixture-of-Experts) |
| **Parameters** | 398B total, ~13B active per token |
| **Layers** | 60 (6 dense + 54 MoE) |
| **Experts** | 256 per MoE layer, 4 active per token, 1 shared expert |
| **Hidden size** | 3072 |
| **MoE intermediate size** | 3072 per expert |
| **Dense intermediate size** | 12,288 |
| **Attention** | 48 heads, 8 KV heads (GQA), sliding window (4096) + full attention every 4 layers |
| **Context length** | 8,192 tokens |
| **Vocabulary** | 200,192 tokens |
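The expert routing described above (top-4 of 256 routed experts plus one always-active shared expert) can be sketched as below. This is an illustrative top-k softmax router under assumed conventions, not the exact AfmoeForCausalLM implementation:

```python
import numpy as np

NUM_EXPERTS = 256
TOP_K = 4

def route_token(router_logits: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Pick the top-k experts for one token and return (indices, weights)."""
    top_idx = np.argsort(router_logits)[-TOP_K:]  # 4 highest-scoring experts
    # Softmax over the selected experts only (one common MoE convention).
    scores = np.exp(router_logits[top_idx] - router_logits[top_idx].max())
    weights = scores / scores.sum()
    return top_idx, weights

rng = np.random.default_rng(0)
experts, weights = route_token(rng.normal(size=NUM_EXPERTS))
# Each token's output = weighted sum of the 4 routed experts' outputs,
# plus the shared expert, which runs for every token.
```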
## Quantization
| | |
|---|---|
| **Method** | NVFP4 (4-bit floating point) |
| **Tool** | [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) 0.41.0 |
| **Group size** | 16 |
| **Calibration** | 512 samples (Korean, Code, Creative Writing, English), max_seq_length=512 |
| **Quantized layers** | MLP/expert weights only (`gate_proj`, `up_proj`, `down_proj` in dense and MoE layers) |
| **BF16 layers** | Attention (Q/K/V/O projections), embeddings, router gates, shared experts, layer norms, lm_head |
| **Source precision** | BF16 |
### Compression
| Format | Size |
|--------|------|
| BF16 (original) | 796 GB |
| **NVFP4 (this model)** | **216 GB** |
3.7x compression.
## Running with vLLM
[vLLM](https://github.com/vllm-project/vllm) >= 0.15.1 supports this model natively with the `modelopt` quantization backend. Blackwell GPUs (SM100/SM120) are **required** for NVFP4 inference.
### Requirements
- **VRAM**: ~216 GB of model weights in total. A single GPU with ≥224 GB VRAM can load the model directly; smaller setups require multi-GPU and/or CPU offloading.
- **System RAM**: If using `cpu_offload_gb`, you need sufficient system RAM for pinned memory (the offload value × number of GPUs, plus ~40 GB for model loading overhead).
### Installation
```bash
pip install "vllm>=0.15.1"
```
### Environment Variables
Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. This avoids large temporary GPU allocations during MoE weight initialization that can cause OOM on memory-constrained setups:
```bash
export VLLM_USE_FLASHINFER_MOE_FP4=0
```
### Single-GPU (≥224 GB VRAM)
```python
from vllm import LLM, SamplingParams
llm = LLM(
model="mconcat/Trinity-Large-Base-NVFP4",
quantization="modelopt",
max_model_len=4096,
enforce_eager=True,
gpu_memory_utilization=0.90,
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```
### Multi-GPU with Pipeline Parallelism
For setups where total VRAM is less than ~216 GB, use pipeline parallelism with CPU weight offloading:
```python
import os
os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"
from vllm import LLM, SamplingParams
llm = LLM(
model="mconcat/Trinity-Large-Base-NVFP4",
quantization="modelopt",
pipeline_parallel_size=2, # number of GPUs
cpu_offload_gb=30, # GB of weights to offload per GPU
max_model_len=512,
max_num_seqs=256,
enforce_eager=True,
gpu_memory_utilization=0.95,
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```
**Tuning tips:**
- `cpu_offload_gb` is **per GPU** — total pinned memory = `cpu_offload_gb × pipeline_parallel_size`. Ensure this fits in system RAM alongside the OS and model loading workspace (~40 GB).
- For **heterogeneous GPU setups** (different VRAM sizes), set `VLLM_PP_LAYER_PARTITION` to control how many of the 60 layers each GPU gets. For example, `export VLLM_PP_LAYER_PARTITION="32,14,14"` for a 3-GPU setup where the first GPU has ~3x the VRAM.
- Each MoE layer is ~3.9 GB (NVFP4) while each dense layer is ~0.14 GB. The first 6 layers are dense; layers 6–59 are MoE. Distribute layers so that `(layer_weights - cpu_offload_gb)` fits comfortably on each GPU with room for KV cache and overhead.
- `max_num_seqs` may need to be lowered for GPUs with ≤32 GB VRAM. The sampler warmup allocates `max_num_seqs × vocab_size × 8 bytes` of temporary memory (~1.5 GB at the default of 1024). Use 256 for smaller GPUs.
- Start with a low `max_model_len` (e.g., 512) and increase once loading succeeds.
### OpenAI-Compatible API Server
```bash
VLLM_USE_FLASHINFER_MOE_FP4=0 python -m vllm.entrypoints.openai.api_server \
--model mconcat/Trinity-Large-Base-NVFP4 \
--quantization modelopt \
--max-model-len 4096 \
--enforce-eager \
--gpu-memory-utilization 0.90 \
--port 8000
```
For multi-GPU serving, add `--pipeline-parallel-size N --cpu-offload-gb X --max-num-seqs 256` as needed.
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "mconcat/Trinity-Large-Base-NVFP4", "prompt": "Hello", "max_tokens": 64}'
```
## Important Notes
- **Blackwell required**: NVFP4 uses Blackwell's 5th-generation Tensor Cores. This model will NOT run on Hopper (H100/H200), Ada (RTX 4090), or older GPUs.
- **vLLM quantization flag**: Use `--quantization modelopt` (not `modelopt_fp4`). vLLM auto-detects the NVFP4 algorithm from the config.
- **MoE backend**: Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. The default flashinfer backend performs a `reorder_w1w3_to_w3w1` operation that temporarily allocates ~2.25 GB per MoE layer on GPU, which can cause OOM.
- **vLLM cpu_offload_gb + V1 engine**: As of vLLM 0.15.x, using `cpu_offload_gb` with the V1 engine may trigger an assertion error in `may_reinitialize_input_batch` (`gpu_model_runner.py`). If you encounter `AssertionError: Cannot re-initialize the input batch when CPU weight offloading is enabled`, this can be safely patched by converting the assertion to a warning. See [vLLM PR #18298](https://github.com/vllm-project/vllm/issues/18298) for status.
- **HuggingFace Transformers**: While `transformers >= 5.0` recognizes the `AfmoeForCausalLM` architecture, it does **not** support ModelOpt NVFP4 weight format for inference. Use vLLM instead.
- **TensorRT-LLM**: As of February 2026, TensorRT-LLM does not support the `AfmoeForCausalLM` architecture.
## Quantization Recipe
Following NVIDIA's MLP-only quantization strategy (similar to the [DeepSeek-R1 NVFP4 recipe](https://developer.nvidia.com/blog/nvidia-publishes-the-first-deepseek-r1-nvfp4-quantized-model/)):
- Only MLP/expert weights (`gate_proj`, `up_proj`, `down_proj`) are quantized to FP4
- All attention projections remain in BF16 to preserve quality
- Router gates (`mlp.router`) remain in BF16
- Embeddings and lm_head remain in BF16
- The default `*mlp.gate.*` exclusion was removed because Trinity uses `mlp.gate_proj` as a standard MLP projection (not a routing gate)
### Calibration Data
| Domain | Samples | Dataset |
|--------|---------|---------|
| Korean | 128 | [heegyu/open-korean-instructions](https://huggingface.co/datasets/heegyu/open-korean-instructions) |
| Code | 128 | [m-a-p/CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction) |
| Creative Writing | 128 | [Gryphe/ChatGPT-4o-Writing-Prompts](https://huggingface.co/datasets/Gryphe/ChatGPT-4o-Writing-Prompts) |
| General English | 128 | [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) |
## Files
| File | Description |
|------|-------------|
| `model-00001-of-00005.safetensors` ... `model-00005-of-00005.safetensors` | Quantized model weights (5 shards, ~43-50 GB each) |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model configuration with `quantization_config` |
| `hf_quant_config.json` | ModelOpt quantization metadata |
| `generation_config.json` | Generation configuration |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
| `chat_template.jinja` | Chat template |
## Hardware
Quantization was performed on 8x NVIDIA A100-SXM4-80GB with ~1.8 TiB system RAM. Total quantization time was approximately 9 hours (dominated by calibration forward passes). Quantization on A100 does not require Blackwell hardware; only inference with native FP4 execution does.
## Limitations
- Requires NVIDIA Blackwell GPUs (SM100/SM120) for native NVFP4 inference
- Quality may differ from the original BF16 model, particularly on tasks sensitive to numerical precision
- Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
- This quantization targets the MLP/expert layers only; KV cache is not quantized
## License
Same license as the base model: [Apache 2.0](https://huggingface.co/arcee-ai/Trinity-Large-Base).