Trinity-Large-Base-NVFP4

NVFP4-quantized version of arcee-ai/Trinity-Large-Base for deployment on NVIDIA Blackwell GPUs.

Model Details

Base model arcee-ai/Trinity-Large-Base
Architecture AfmoeForCausalLM (Mixture-of-Experts)
Parameters 398B total, ~13B active per token
Layers 60 (6 dense + 54 MoE)
Experts 256 per MoE layer, 4 active per token, 1 shared expert
Hidden size 3072
MoE intermediate size 3072 per expert
Dense intermediate size 12,288
Attention 48 heads, 8 KV heads (GQA), sliding window (4096) + full attention every 4 layers
Context length 8,192 tokens
Vocabulary 200,192 tokens

Quantization

Method NVFP4 (4-bit floating point)
Tool NVIDIA ModelOpt 0.41.0
Group size 16
Calibration 512 samples (Korean, Code, Creative Writing, English), max_seq_length=512
Quantized layers MLP/expert weights only (gate_proj, up_proj, down_proj in dense and MoE layers)
BF16 layers Attention (Q/K/V/O projections), embeddings, router gates, shared experts, layer norms, lm_head
Source precision BF16
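
To make the scheme concrete: NVFP4 stores each weight as a 4-bit E2M1 value with one scale per 16-element group (in the real format the group scales are FP8 E4M3 with an additional per-tensor FP32 scale). The NumPy sketch below simulates the quantize-dequantize round trip; it keeps the group scales in FP32 and uses simple nearest-point rounding, so it is illustrative of the numerics, not ModelOpt's exact kernel.

```python
import numpy as np

# Magnitudes representable in E2M1 (4 bits: sign, 2 exponent, 1 mantissa)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_nvfp4(x, group_size=16):
    """Quantize-dequantize x per 16-element group. Group scales stay FP32
    here; real NVFP4 stores them as FP8 E4M3 plus a per-tensor scale."""
    groups = x.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale[scale == 0] = 1.0                      # avoid divide-by-zero
    scaled = groups / scale                      # now within [-6, 6]
    # snap each value to the nearest signed grid point
    nearest = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(scaled) * FP4_GRID[nearest] * scale).reshape(x.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32)
w_hat = fake_nvfp4(w)
# Per group, the error is bounded by half the widest grid gap (1.0 in
# scaled units) times that group's scale, i.e. at most amax/6.
max_err = np.abs(w - w_hat).max()
```

The group size of 16 matches the quantization config above; smaller groups track local weight magnitude more tightly at the cost of more scale metadata.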

Compression

Format Size
BF16 (original) 796 GB
NVFP4 (this model) 216 GB

3.7x compression.

Running with vLLM

vLLM >= 0.15.1 supports this model natively with the modelopt quantization backend. Blackwell GPUs (SM100/SM120) are required for NVFP4 inference.

Requirements

  • VRAM: ~216 GB of model weights in total. A single GPU with ≥224 GB VRAM can load it directly; smaller setups require multi-GPU and/or CPU offloading.
  • System RAM: If using cpu_offload_gb, you need sufficient system RAM for pinned memory (the offload value × number of GPUs, plus ~40 GB for model loading overhead).
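
The system-RAM rule of thumb above can be written out as a tiny calculator (a hypothetical helper for planning, not a vLLM API):

```python
def required_system_ram_gb(cpu_offload_gb, num_gpus, loading_overhead_gb=40):
    """Rough system-RAM estimate when using cpu_offload_gb: pinned offload
    buffers are allocated once per GPU, plus a loading workspace."""
    return cpu_offload_gb * num_gpus + loading_overhead_gb

# e.g. the 2-GPU pipeline-parallel example below with cpu_offload_gb=30
print(required_system_ram_gb(30, 2))  # → 100
```

This is RAM needed on top of whatever the OS and other processes use.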

Installation

pip install "vllm>=0.15.1"

Environment Variables

Set VLLM_USE_FLASHINFER_MOE_FP4=0 to use the VLLM_CUTLASS MoE backend. This avoids large temporary GPU allocations during MoE weight initialization that can cause OOM on memory-constrained setups:

export VLLM_USE_FLASHINFER_MOE_FP4=0

Single-GPU (≥224 GB VRAM)

from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/Trinity-Large-Base-NVFP4",
    quantization="modelopt",
    max_model_len=4096,
    enforce_eager=True,
    gpu_memory_utilization=0.90,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)

Multi-GPU with Pipeline Parallelism

For setups where total VRAM is less than ~216 GB, use pipeline parallelism with CPU weight offloading:

import os
os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/Trinity-Large-Base-NVFP4",
    quantization="modelopt",
    pipeline_parallel_size=2,        # number of GPUs
    cpu_offload_gb=30,               # GB of weights to offload per GPU
    max_model_len=512,
    max_num_seqs=256,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)

Tuning tips:

  • cpu_offload_gb is per GPU — total pinned memory = cpu_offload_gb × pipeline_parallel_size. Ensure this fits in system RAM alongside the OS and model loading workspace (~40 GB).
  • For heterogeneous GPU setups (different VRAM sizes), set VLLM_PP_LAYER_PARTITION to control how many of the 60 layers each GPU gets. For example, export VLLM_PP_LAYER_PARTITION="32,14,14" for a 3-GPU setup where the first GPU has ~3x the VRAM.
  • Each MoE layer is ~3.9 GB (NVFP4) while each dense layer is ~0.14 GB. The first 6 layers are dense; layers 6–59 are MoE. Distribute layers so that (layer_weights - cpu_offload_gb) fits comfortably on each GPU with room for KV cache and overhead.
  • max_num_seqs may need to be lowered for GPUs with ≤32 GB VRAM. The sampler warmup allocates max_num_seqs × vocab_size × 8 bytes of temporary memory (~1.5 GB at the default of 1024). Use 256 for smaller GPUs.
  • Start with a low max_model_len (e.g., 512) and increase once loading succeeds.
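
The partitioning tips above can be sketched as a small estimator using the per-layer sizes quoted (~3.9 GB per MoE layer, ~0.14 GB per dense layer). This is a back-of-envelope helper, not a vLLM utility; it ignores embeddings, norms, and runtime overhead:

```python
DENSE_LAYERS = 6           # layers 0-5 are dense
TOTAL_LAYERS = 60
DENSE_GB, MOE_GB = 0.14, 3.9

def stage_weight_gb(partition):
    """Approximate weight GB landing on each pipeline stage for a given
    VLLM_PP_LAYER_PARTITION, e.g. [32, 14, 14]."""
    assert sum(partition) == TOTAL_LAYERS
    sizes, start = [], 0
    for n in partition:
        sizes.append(sum(DENSE_GB if layer < DENSE_LAYERS else MOE_GB
                         for layer in range(start, start + n)))
        start += n
    return sizes

sizes = stage_weight_gb([32, 14, 14])
print(sizes)
```

Each stage's value, minus its cpu_offload_gb, is roughly what must fit in that GPU's VRAM alongside KV cache and activation overhead.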

OpenAI-Compatible API Server

VLLM_USE_FLASHINFER_MOE_FP4=0 python -m vllm.entrypoints.openai.api_server \
    --model mconcat/Trinity-Large-Base-NVFP4 \
    --quantization modelopt \
    --max-model-len 4096 \
    --enforce-eager \
    --gpu-memory-utilization 0.90 \
    --port 8000

For multi-GPU serving, add --pipeline-parallel-size N --cpu-offload-gb X --max-num-seqs 256 as needed.

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mconcat/Trinity-Large-Base-NVFP4", "prompt": "Hello", "max_tokens": 64}'

Important Notes

  • Blackwell required: NVFP4 uses Blackwell's 5th-generation Tensor Cores. This model will NOT run on Hopper (H100/H200), Ada (RTX 4090), or older GPUs.
  • vLLM quantization flag: Use --quantization modelopt (not modelopt_fp4). vLLM auto-detects the NVFP4 algorithm from the config.
  • MoE backend: Set VLLM_USE_FLASHINFER_MOE_FP4=0 to use the VLLM_CUTLASS MoE backend. The default flashinfer backend performs a reorder_w1w3_to_w3w1 operation that temporarily allocates ~2.25 GB per MoE layer on GPU, which can cause OOM.
  • vLLM cpu_offload_gb + V1 engine: As of vLLM 0.15.x, using cpu_offload_gb with the V1 engine may trigger an assertion error in may_reinitialize_input_batch (gpu_model_runner.py). If you encounter AssertionError: Cannot re-initialize the input batch when CPU weight offloading is enabled, this can be safely patched by converting the assertion to a warning. See vLLM PR #18298 for status.
  • HuggingFace Transformers: While transformers >= 5.0 recognizes the AfmoeForCausalLM architecture, it does not support ModelOpt NVFP4 weight format for inference. Use vLLM instead.
  • TensorRT-LLM: As of February 2026, TensorRT-LLM does not support the AfmoeForCausalLM architecture.

Quantization Recipe

Following NVIDIA's MLP-only quantization strategy (similar to the DeepSeek-R1 NVFP4 recipe):

  • Only MLP/expert weights (gate_proj, up_proj, down_proj) are quantized to FP4
  • All attention projections remain in BF16 to preserve quality
  • Router gates (mlp.router) remain in BF16
  • Embeddings and lm_head remain in BF16
  • The default *mlp.gate.* exclusion was removed because Trinity uses mlp.gate_proj as a standard MLP projection (not a routing gate)
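
The resulting layer selection can be sketched as a simple name filter. The module paths below are illustrative guesses at the checkpoint's parameter naming, and the regexes are a plain-Python stand-in for ModelOpt's pattern-based exclusion config, not its actual API:

```python
import re

# Modules kept in BF16 under the recipe above (names are assumptions)
SKIP = re.compile(r"self_attn|router|shared_expert|embed_tokens|lm_head|norm")
# MLP/expert projections quantized to NVFP4
QUANT = re.compile(r"\.(gate_proj|up_proj|down_proj)\.weight$")

def is_fp4(param_name):
    """True if a parameter would be NVFP4-quantized under this recipe."""
    return bool(QUANT.search(param_name)) and not SKIP.search(param_name)

assert is_fp4("model.layers.10.mlp.experts.3.gate_proj.weight")   # routed expert
assert is_fp4("model.layers.0.mlp.gate_proj.weight")              # dense MLP
assert not is_fp4("model.layers.10.self_attn.q_proj.weight")      # attention: BF16
assert not is_fp4("model.layers.10.mlp.router.weight")            # routing gate: BF16
assert not is_fp4("lm_head.weight")                               # output head: BF16
```

Note that mlp.gate_proj passes the filter while mlp.router does not, which is exactly the distinction the removed *mlp.gate.* exclusion would have blurred.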

Calibration Data

Domain Samples Dataset
Korean 128 heegyu/open-korean-instructions
Code 128 m-a-p/CodeFeedback-Filtered-Instruction
Creative Writing 128 Gryphe/ChatGPT-4o-Writing-Prompts
General English 128 teknium/OpenHermes-2.5

Files

File Description
model-00001-of-00005.safetensors ... model-00005-of-00005.safetensors Quantized model weights (5 shards, ~43-50 GB each)
model.safetensors.index.json Weight shard index
config.json Model configuration with quantization_config
hf_quant_config.json ModelOpt quantization metadata
generation_config.json Generation configuration
tokenizer.json Tokenizer
tokenizer_config.json Tokenizer configuration
chat_template.jinja Chat template

Hardware

Quantization was performed on 8x NVIDIA A100-SXM4-80GB with ~1.8 TiB system RAM. Total quantization time was approximately 9 hours (dominated by calibration forward passes). Quantization on A100 does not require Blackwell hardware; only inference with native FP4 execution does.

Limitations

  • Requires NVIDIA Blackwell GPUs (SM100/SM120) for native NVFP4 inference
  • Quality may differ from the original BF16 model, particularly on tasks sensitive to numerical precision
  • Calibration covered Korean, English, code, and creative writing; other languages may see slightly higher degradation
  • This quantization targets the MLP/expert layers only; KV cache is not quantized

License

Same license as the base model: Apache 2.0.
