# Trinity-Large-Base-NVFP4

NVFP4-quantized version of arcee-ai/Trinity-Large-Base for deployment on NVIDIA Blackwell GPUs.
## Model Details

| Property | Value |
|---|---|
| Base model | arcee-ai/Trinity-Large-Base |
| Architecture | `AfmoeForCausalLM` (Mixture-of-Experts) |
| Parameters | 398B total, ~13B active per token |
| Layers | 60 (6 dense + 54 MoE) |
| Experts | 256 per MoE layer, 4 active per token, 1 shared expert |
| Hidden size | 3072 |
| MoE intermediate size | 3072 per expert |
| Dense intermediate size | 12,288 |
| Attention | 48 heads, 8 KV heads (GQA), sliding window (4096) + full attention every 4 layers |
| Context length | 8,192 tokens |
| Vocabulary | 200,192 tokens |
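The configuration numbers above can be sanity-checked with a rough back-of-envelope count of the routed expert weights alone. This sketch ignores attention, embeddings, shared experts, router gates, and norms, so it slightly undershoots the 398B total:

```python
# Back-of-envelope count of routed MoE expert parameters, using the
# figures from the table above. Attention/embedding/norm weights are
# deliberately ignored, so this undershoots the 398B total.
hidden = 3072
moe_intermediate = 3072
experts_per_layer = 256
moe_layers = 54

# Each expert is a gated MLP: gate_proj + up_proj + down_proj.
params_per_expert = 3 * hidden * moe_intermediate
expert_params = moe_layers * experts_per_layer * params_per_expert

print(f"~{expert_params / 1e9:.0f}B routed expert parameters")  # ~391B of 398B total
```

The routed experts account for roughly 391B of the 398B parameters, which is why MLP-only quantization (see the recipe below) captures nearly all of the model's footprint.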
## Quantization

| Property | Value |
|---|---|
| Method | NVFP4 (4-bit floating point) |
| Tool | NVIDIA ModelOpt 0.41.0 |
| Group size | 16 |
| Calibration | 512 samples (Korean, Code, Creative Writing, English), `max_seq_length=512` |
| Quantized layers | MLP/expert weights only (`gate_proj`, `up_proj`, `down_proj` in dense and MoE layers) |
| BF16 layers | Attention (Q/K/V/O projections), embeddings, router gates, shared experts, layer norms, `lm_head` |
| Source precision | BF16 |
## Compression
| Format | Size |
|---|---|
| BF16 (original) | 796 GB |
| NVFP4 (this model) | 216 GB |
3.7x compression.
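The ratio follows directly from the sizes above. A quick check (assuming GB here means 10^9 bytes) also gives the effective bits per parameter, which lands slightly above 4 because of the per-group scales and the BF16 attention/embedding layers:

```python
# Compression arithmetic from the table above (GB taken as 10^9 bytes).
bf16_gb = 796
nvfp4_gb = 216
total_params = 398e9

ratio = bf16_gb / nvfp4_gb
bits_per_param = nvfp4_gb * 1e9 * 8 / total_params

print(f"~{ratio:.1f}x compression, ~{bits_per_param:.1f} effective bits/param")
```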
## Running with vLLM

vLLM >= 0.15.1 supports this model natively with the `modelopt` quantization backend. Blackwell GPUs (SM100/SM120) are required for NVFP4 inference.
Requirements
- VRAM: ~216 GB total model weight. A single GPU with ≥224 GB VRAM can load it directly; smaller setups require multi-GPU and/or CPU offloading.
- System RAM: If using
cpu_offload_gb, you need sufficient system RAM for pinned memory (the offload value × number of GPUs, plus ~40 GB for model loading overhead).
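The system-RAM requirement is simple arithmetic; this helper is a sketch using the ~40 GB loading-overhead figure quoted above:

```python
def required_system_ram_gb(cpu_offload_gb: float, num_gpus: int,
                           loading_overhead_gb: float = 40.0) -> float:
    """Estimate system RAM (GB) needed for pinned offload buffers plus
    the model-loading workspace, per the requirements above."""
    return cpu_offload_gb * num_gpus + loading_overhead_gb

# e.g. two GPUs each offloading 30 GB of weights:
print(required_system_ram_gb(cpu_offload_gb=30, num_gpus=2))  # 100.0
```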
### Installation

```bash
pip install "vllm>=0.15.1"
```
### Environment Variables

Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. This avoids large temporary GPU allocations during MoE weight initialization that can cause OOM on memory-constrained setups:

```bash
export VLLM_USE_FLASHINFER_MOE_FP4=0
```
### Single-GPU (≥224 GB VRAM)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/Trinity-Large-Base-NVFP4",
    quantization="modelopt",
    max_model_len=4096,
    enforce_eager=True,
    gpu_memory_utilization=0.90,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```
### Multi-GPU with Pipeline Parallelism

For setups where total VRAM is less than ~216 GB, use pipeline parallelism with CPU weight offloading:

```python
import os
os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/Trinity-Large-Base-NVFP4",
    quantization="modelopt",
    pipeline_parallel_size=2,   # number of GPUs
    cpu_offload_gb=30,          # GB of weights to offload per GPU
    max_model_len=512,
    max_num_seqs=256,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```
Tuning tips:

- `cpu_offload_gb` is per GPU, so total pinned memory = `cpu_offload_gb × pipeline_parallel_size`. Ensure this fits in system RAM alongside the OS and the model-loading workspace (~40 GB).
- For heterogeneous GPU setups (different VRAM sizes), set `VLLM_PP_LAYER_PARTITION` to control how many of the 60 layers each GPU gets. For example, `export VLLM_PP_LAYER_PARTITION="32,14,14"` for a 3-GPU setup where the first GPU has ~3x the VRAM of the others.
- Each MoE layer is ~3.9 GB (NVFP4) while each dense layer is ~0.14 GB. The first 6 layers are dense; layers 6–59 are MoE. Distribute layers so that `(layer_weights - cpu_offload_gb)` fits comfortably on each GPU with room for KV cache and overhead.
- `max_num_seqs` may need to be lowered for GPUs with ≤32 GB VRAM. The sampler warmup allocates `max_num_seqs × vocab_size × 8` bytes of temporary memory (~1.5 GB at the default of 1024). Use 256 for smaller GPUs.
- Start with a low `max_model_len` (e.g., 512) and increase once loading succeeds.
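The per-GPU weight arithmetic above can be sketched as a small helper. The sizes are the approximate figures quoted (~3.9 GB per MoE layer, ~0.14 GB per dense layer); attention, embedding, and `lm_head` weights are ignored, so treat the results as estimates:

```python
MOE_GB, DENSE_GB = 3.9, 0.14
DENSE_LAYERS = 6  # layers 0-5 are dense; layers 6-59 are MoE

def partition_weight_gb(partition):
    """Approximate NVFP4 weight GB per pipeline stage for a
    VLLM_PP_LAYER_PARTITION-style split of the 60 layers."""
    sizes, start = [], 0
    for n in partition:
        dense = max(0, min(start + n, DENSE_LAYERS) - start)
        sizes.append(dense * DENSE_GB + (n - dense) * MOE_GB)
        start += n
    return sizes

# The "32,14,14" example above: the first stage carries all 6 dense
# layers plus 26 MoE layers, the other two stages 14 MoE layers each.
print(partition_weight_gb([32, 14, 14]))  # roughly [102.2, 54.6, 54.6]
```

Subtract each stage's `cpu_offload_gb` from these numbers and check the remainder fits on that GPU with headroom for KV cache and activation overhead.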
### OpenAI-Compatible API Server

```bash
VLLM_USE_FLASHINFER_MOE_FP4=0 python -m vllm.entrypoints.openai.api_server \
  --model mconcat/Trinity-Large-Base-NVFP4 \
  --quantization modelopt \
  --max-model-len 4096 \
  --enforce-eager \
  --gpu-memory-utilization 0.90 \
  --port 8000
```

For multi-GPU serving, add `--pipeline-parallel-size N --cpu-offload-gb X --max-num-seqs 256` as needed.

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mconcat/Trinity-Large-Base-NVFP4", "prompt": "Hello", "max_tokens": 64}'
```
## Important Notes

- **Blackwell required**: NVFP4 uses Blackwell's 5th-generation Tensor Cores. This model will NOT run on Hopper (H100/H200), Ada (RTX 4090), or older GPUs.
- **vLLM quantization flag**: use `--quantization modelopt` (not `modelopt_fp4`). vLLM auto-detects the NVFP4 algorithm from the config.
- **MoE backend**: set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. The default FlashInfer backend performs a `reorder_w1w3_to_w3w1` operation that temporarily allocates ~2.25 GB per MoE layer on GPU, which can cause OOM.
- **vLLM `cpu_offload_gb` + V1 engine**: as of vLLM 0.15.x, using `cpu_offload_gb` with the V1 engine may trigger an assertion error in `may_reinitialize_input_batch` (`gpu_model_runner.py`). If you encounter `AssertionError: Cannot re-initialize the input batch when CPU weight offloading is enabled`, this can be safely patched by converting the assertion to a warning. See vLLM PR #18298 for status.
- **HuggingFace Transformers**: while `transformers >= 5.0` recognizes the `AfmoeForCausalLM` architecture, it does not support the ModelOpt NVFP4 weight format for inference. Use vLLM instead.
- **TensorRT-LLM**: as of February 2026, TensorRT-LLM does not support the `AfmoeForCausalLM` architecture.
## Quantization Recipe

Following NVIDIA's MLP-only quantization strategy (similar to the DeepSeek-R1 NVFP4 recipe):

- Only MLP/expert weights (`gate_proj`, `up_proj`, `down_proj`) are quantized to FP4
- All attention projections remain in BF16 to preserve quality
- Router gates (`mlp.router`) remain in BF16
- Embeddings and `lm_head` remain in BF16
- The default `*mlp.gate.*` exclusion was removed because Trinity uses `mlp.gate_proj` as a standard MLP projection (not a routing gate)
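In ModelOpt terms, this recipe amounts to editing the exclusion patterns of the NVFP4 default config. The following is an illustrative config-fragment sketch only, not the exact script used; the module-name patterns are assumptions and the precise keys depend on the ModelOpt version:

```python
import copy

import modelopt.torch.quantization as mtq

# Start from ModelOpt's NVFP4 default config (group size 16).
cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)

# Keep attention, router gates, shared experts, embeddings, and lm_head
# in BF16 by disabling quantization on those modules.
# NOTE: these glob patterns are illustrative, not verified against the
# actual Afmoe module names.
for pattern in ["*self_attn*", "*mlp.router*", "*shared_expert*",
                "*embed_tokens*", "*lm_head*"]:
    cfg["quant_cfg"][pattern] = {"enable": False}

# Per the recipe above, the stock "*mlp.gate.*" exclusion must NOT be
# kept, since Trinity's mlp.gate_proj is a plain MLP projection.

# Calibration then runs the 512 samples through the model:
# model = mtq.quantize(model, cfg, forward_loop)
```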
## Calibration Data
| Domain | Samples | Dataset |
|---|---|---|
| Korean | 128 | heegyu/open-korean-instructions |
| Code | 128 | m-a-p/CodeFeedback-Filtered-Instruction |
| Creative Writing | 128 | Gryphe/ChatGPT-4o-Writing-Prompts |
| General English | 128 | teknium/OpenHermes-2.5 |
## Files

| File | Description |
|---|---|
| `model-00001-of-00005.safetensors` … `model-00005-of-00005.safetensors` | Quantized model weights (5 shards, ~43–50 GB each) |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model configuration with `quantization_config` |
| `hf_quant_config.json` | ModelOpt quantization metadata |
| `generation_config.json` | Generation configuration |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
| `chat_template.jinja` | Chat template |
## Hardware
Quantization was performed on 8x NVIDIA A100-SXM4-80GB with ~1.8 TiB system RAM. Total quantization time was approximately 9 hours (dominated by calibration forward passes). Quantization on A100 does not require Blackwell hardware; only inference with native FP4 execution does.
## Limitations
- Requires NVIDIA Blackwell GPUs (SM100/SM120) for native NVFP4 inference
- Quality may differ from the original BF16 model, particularly on tasks sensitive to numerical precision
- Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
- This quantization targets the MLP/expert layers only; KV cache is not quantized
## License
Same license as the base model: Apache 2.0.