Trinity-Large-Base-NVFP4

NVFP4-quantized version of arcee-ai/Trinity-Large-Base for deployment on NVIDIA Blackwell GPUs.

Model Details

Base model arcee-ai/Trinity-Large-Base
Architecture AfmoeForCausalLM (Mixture-of-Experts)
Parameters 398B total, ~13B active per token
Layers 60 (6 dense + 54 MoE)
Experts 256 per MoE layer, 4 active per token, 1 shared expert
Hidden size 3072
MoE intermediate size 3072 per expert
Dense intermediate size 12,288
Attention 48 heads, 8 KV heads (GQA), sliding window (4096) + full attention every 4 layers
Context length 8,192 tokens
Vocabulary 200,192 tokens

Quantization

Method NVFP4 (4-bit floating point)
Tool NVIDIA ModelOpt 0.41.0
Group size 16
Calibration 512 samples (Korean, Code, Creative Writing, English), max_seq_length=512
Quantized layers MLP/expert weights only (gate_proj, up_proj, down_proj in dense and MoE layers)
BF16 layers Attention (Q/K/V/O projections), embeddings, router gates, shared experts, layer norms, lm_head
Source precision BF16
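
To make the scheme concrete: NVFP4 stores each weight as a 4-bit E2M1 value with one scale per 16-element group (in the real format the group scales are FP8 E4M3 with an additional per-tensor FP32 scale). The NumPy sketch below simulates the quantize-dequantize round trip; it keeps the group scales in FP32 and uses simple nearest-point rounding, so it is illustrative of the numerics, not ModelOpt's exact kernel.

```python
import numpy as np

# Magnitudes representable in E2M1 (4 bits: sign, 2 exponent, 1 mantissa)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_nvfp4(x, group_size=16):
    """Quantize-dequantize x per 16-element group. Group scales stay FP32
    here; real NVFP4 stores them as FP8 E4M3 plus a per-tensor scale."""
    groups = x.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale[scale == 0] = 1.0                      # avoid divide-by-zero
    scaled = groups / scale                      # now within [-6, 6]
    # snap each value to the nearest signed grid point
    nearest = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(scaled) * FP4_GRID[nearest] * scale).reshape(x.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32)
w_hat = fake_nvfp4(w)
# Per group, the error is bounded by half the widest grid gap (1.0 in
# scaled units) times that group's scale, i.e. at most amax/6.
max_err = np.abs(w - w_hat).max()
```

The group size of 16 matches the quantization config above; smaller groups track local weight magnitude more tightly at the cost of more scale metadata.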

Compression

Format Size
BF16 (original) 796 GB
NVFP4 (this model) 216 GB

3.7x compression.

Running with vLLM

vLLM >= 0.15.1 supports this model natively with the modelopt quantization backend. Blackwell GPUs (SM100/SM120) are required for NVFP4 inference.

Requirements

  • VRAM: ~216 GB of model weights in total. A single GPU with ≥224 GB VRAM can load it directly; smaller setups require multi-GPU and/or CPU offloading.
  • System RAM: If using cpu_offload_gb, you need sufficient system RAM for pinned memory (the offload value × number of GPUs, plus ~40 GB for model loading overhead).
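
The system-RAM rule of thumb above can be written out as a tiny calculator (a hypothetical helper for planning, not a vLLM API):

```python
def required_system_ram_gb(cpu_offload_gb, num_gpus, loading_overhead_gb=40):
    """Rough system-RAM estimate when using cpu_offload_gb: pinned offload
    buffers are allocated once per GPU, plus a loading workspace."""
    return cpu_offload_gb * num_gpus + loading_overhead_gb

# e.g. the 2-GPU pipeline-parallel example below with cpu_offload_gb=30
print(required_system_ram_gb(30, 2))  # → 100
```

This is RAM needed on top of whatever the OS and other processes use.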

Installation

pip install "vllm>=0.15.1"

Environment Variables

Set VLLM_USE_FLASHINFER_MOE_FP4=0 to use the VLLM_CUTLASS MoE backend. This avoids large temporary GPU allocations during MoE weight initialization that can cause OOM on memory-constrained setups:

export VLLM_USE_FLASHINFER_MOE_FP4=0

Single-GPU (≥224 GB VRAM)

from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/Trinity-Large-Base-NVFP4",
    quantization="modelopt",
    max_model_len=4096,
    enforce_eager=True,
    gpu_memory_utilization=0.90,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)

Multi-GPU with Pipeline Parallelism

For setups where total VRAM is less than ~216 GB, use pipeline parallelism with CPU weight offloading:

import os
os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/Trinity-Large-Base-NVFP4",
    quantization="modelopt",
    pipeline_parallel_size=2,        # number of GPUs
    cpu_offload_gb=30,               # GB of weights to offload per GPU
    max_model_len=512,
    max_num_seqs=256,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)

Tuning tips:

  • cpu_offload_gb is per GPU — total pinned memory = cpu_offload_gb × pipeline_parallel_size. Ensure this fits in system RAM alongside the OS and model loading workspace (~40 GB).
  • For heterogeneous GPU setups (different VRAM sizes), set VLLM_PP_LAYER_PARTITION to control how many of the 60 layers each GPU gets. For example, export VLLM_PP_LAYER_PARTITION="32,14,14" for a 3-GPU setup where the first GPU has ~3x the VRAM.
  • Each MoE layer is ~3.9 GB (NVFP4) while each dense layer is ~0.14 GB. The first 6 layers are dense; layers 6–59 are MoE. Distribute layers so that (layer_weights - cpu_offload_gb) fits comfortably on each GPU with room for KV cache and overhead.
  • max_num_seqs may need to be lowered for GPUs with ≤32 GB VRAM. The sampler warmup allocates max_num_seqs × vocab_size × 8 bytes of temporary memory (~1.5 GB at the default of 1024). Use 256 for smaller GPUs.
  • Start with a low max_model_len (e.g., 512) and increase once loading succeeds.
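
The partitioning tips above can be sketched as a small estimator using the per-layer sizes quoted (~3.9 GB per MoE layer, ~0.14 GB per dense layer). This is a back-of-envelope helper, not a vLLM utility; it ignores embeddings, norms, and runtime overhead:

```python
DENSE_LAYERS = 6           # layers 0-5 are dense
TOTAL_LAYERS = 60
DENSE_GB, MOE_GB = 0.14, 3.9

def stage_weight_gb(partition):
    """Approximate weight GB landing on each pipeline stage for a given
    VLLM_PP_LAYER_PARTITION, e.g. [32, 14, 14]."""
    assert sum(partition) == TOTAL_LAYERS
    sizes, start = [], 0
    for n in partition:
        sizes.append(sum(DENSE_GB if layer < DENSE_LAYERS else MOE_GB
                         for layer in range(start, start + n)))
        start += n
    return sizes

sizes = stage_weight_gb([32, 14, 14])
print(sizes)
```

Each stage's value, minus its cpu_offload_gb, is roughly what must fit in that GPU's VRAM alongside KV cache and activation overhead.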

OpenAI-Compatible API Server

VLLM_USE_FLASHINFER_MOE_FP4=0 python -m vllm.entrypoints.openai.api_server \
    --model mconcat/Trinity-Large-Base-NVFP4 \
    --quantization modelopt \
    --max-model-len 4096 \
    --enforce-eager \
    --gpu-memory-utilization 0.90 \
    --port 8000

For multi-GPU serving, add --pipeline-parallel-size N --cpu-offload-gb X --max-num-seqs 256 as needed.

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mconcat/Trinity-Large-Base-NVFP4", "prompt": "Hello", "max_tokens": 64}'

Important Notes

  • Blackwell required: NVFP4 uses Blackwell's 5th-generation Tensor Cores. This model will NOT run on Hopper (H100/H200), Ada (RTX 4090), or older GPUs.
  • vLLM quantization flag: Use --quantization modelopt (not modelopt_fp4). vLLM auto-detects the NVFP4 algorithm from the config.
  • MoE backend: Set VLLM_USE_FLASHINFER_MOE_FP4=0 to use the VLLM_CUTLASS MoE backend. The default flashinfer backend performs a reorder_w1w3_to_w3w1 operation that temporarily allocates ~2.25 GB per MoE layer on GPU, which can cause OOM.
  • vLLM cpu_offload_gb + V1 engine: As of vLLM 0.15.x, using cpu_offload_gb with the V1 engine may trigger an assertion error in may_reinitialize_input_batch (gpu_model_runner.py). If you encounter AssertionError: Cannot re-initialize the input batch when CPU weight offloading is enabled, this can be safely patched by converting the assertion to a warning. See vLLM PR #18298 for status.
  • HuggingFace Transformers: While transformers >= 5.0 recognizes the AfmoeForCausalLM architecture, it does not support ModelOpt NVFP4 weight format for inference. Use vLLM instead.
  • TensorRT-LLM: As of February 2026, TensorRT-LLM does not support the AfmoeForCausalLM architecture.

Quantization Recipe

Following NVIDIA's MLP-only quantization strategy (similar to the DeepSeek-R1 NVFP4 recipe):

  • Only MLP/expert weights (gate_proj, up_proj, down_proj) are quantized to FP4
  • All attention projections remain in BF16 to preserve quality
  • Router gates (mlp.router) remain in BF16
  • Embeddings and lm_head remain in BF16
  • The default *mlp.gate.* exclusion was removed because Trinity uses mlp.gate_proj as a standard MLP projection (not a routing gate)
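
The resulting layer selection can be sketched as a simple name filter. The module paths below are illustrative guesses at the checkpoint's parameter naming, and the regexes are a plain-Python stand-in for ModelOpt's pattern-based exclusion config, not its actual API:

```python
import re

# Modules kept in BF16 under the recipe above (names are assumptions)
SKIP = re.compile(r"self_attn|router|shared_expert|embed_tokens|lm_head|norm")
# MLP/expert projections quantized to NVFP4
QUANT = re.compile(r"\.(gate_proj|up_proj|down_proj)\.weight$")

def is_fp4(param_name):
    """True if a parameter would be NVFP4-quantized under this recipe."""
    return bool(QUANT.search(param_name)) and not SKIP.search(param_name)

assert is_fp4("model.layers.10.mlp.experts.3.gate_proj.weight")   # routed expert
assert is_fp4("model.layers.0.mlp.gate_proj.weight")              # dense MLP
assert not is_fp4("model.layers.10.self_attn.q_proj.weight")      # attention: BF16
assert not is_fp4("model.layers.10.mlp.router.weight")            # routing gate: BF16
assert not is_fp4("lm_head.weight")                               # output head: BF16
```

Note that mlp.gate_proj passes the filter while mlp.router does not, which is exactly the distinction the removed *mlp.gate.* exclusion would have blurred.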

Calibration Data

Domain Samples Dataset
Korean 128 heegyu/open-korean-instructions
Code 128 m-a-p/CodeFeedback-Filtered-Instruction
Creative Writing 128 Gryphe/ChatGPT-4o-Writing-Prompts
General English 128 teknium/OpenHermes-2.5

Files

File Description
model-00001-of-00005.safetensors ... model-00005-of-00005.safetensors Quantized model weights (5 shards, ~43-50 GB each)
model.safetensors.index.json Weight shard index
config.json Model configuration with quantization_config
hf_quant_config.json ModelOpt quantization metadata
generation_config.json Generation configuration
tokenizer.json Tokenizer
tokenizer_config.json Tokenizer configuration
chat_template.jinja Chat template

Hardware

Quantization was performed on 8x NVIDIA A100-SXM4-80GB with ~1.8 TiB system RAM. Total quantization time was approximately 9 hours (dominated by calibration forward passes). Quantization on A100 does not require Blackwell hardware; only inference with native FP4 execution does.

Limitations

  • Requires NVIDIA Blackwell GPUs (SM100/SM120) for native NVFP4 inference
  • Quality may differ from the original BF16 model, particularly on tasks sensitive to numerical precision
  • Calibration covered Korean, English, code, and creative writing; other languages may see slightly higher degradation
  • This quantization targets the MLP/expert layers only; KV cache is not quantized

License

Same license as the base model: Apache 2.0.
