---
license: apache-2.0
base_model: arcee-ai/Trinity-Large-Base
tags:
- moe
- nvfp4
- modelopt
- blackwell
- vllm
---

# Trinity-Large-Base-NVFP4

NVFP4-quantized version of [arcee-ai/Trinity-Large-Base](https://huggingface.co/arcee-ai/Trinity-Large-Base) for deployment on NVIDIA Blackwell GPUs.

## Model Details

| | |
|---|---|
| **Base model** | [arcee-ai/Trinity-Large-Base](https://huggingface.co/arcee-ai/Trinity-Large-Base) |
| **Architecture** | AfmoeForCausalLM (Mixture-of-Experts) |
| **Parameters** | 398B total, ~13B active per token |
| **Layers** | 60 (6 dense + 54 MoE) |
| **Experts** | 256 per MoE layer, 4 active per token, 1 shared expert |
| **Hidden size** | 3072 |
| **MoE intermediate size** | 3072 per expert |
| **Dense intermediate size** | 12,288 |
| **Attention** | 48 heads, 8 KV heads (GQA), sliding window (4096) + full attention every 4 layers |
| **Context length** | 8,192 tokens |
| **Vocabulary** | 200,192 tokens |
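As a rough sanity check on the figures above, nearly all of the 398B parameters live in the routed expert weights. A back-of-the-envelope calculation using only the numbers from this table (and assuming each expert consists of the three standard projections, with the shared expert shaped like a routed one):

```python
# Back-of-the-envelope parameter count from the table above.
HIDDEN = 3072           # hidden size
MOE_INTER = 3072        # per-expert intermediate size
NUM_EXPERTS = 256       # routed experts per MoE layer
MOE_LAYERS = 54         # 60 layers total, 6 of them dense
ACTIVE_EXPERTS = 4 + 1  # 4 routed + 1 shared expert per token

# Each expert has three projections: gate_proj, up_proj, down_proj.
params_per_expert = 3 * HIDDEN * MOE_INTER  # ~28.3M

routed_total = NUM_EXPERTS * MOE_LAYERS * params_per_expert
print(f"routed expert params: {routed_total / 1e9:.1f}B")  # 391.4B of the 398B total

active_expert = ACTIVE_EXPERTS * MOE_LAYERS * params_per_expert
print(f"active expert params per token: {active_expert / 1e9:.2f}B")  # 7.64B
```

The remaining gap to 398B total (and ~13B active) is covered by the attention projections, dense-layer MLPs, and embeddings, which this sketch deliberately leaves out.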

## Quantization

| | |
|---|---|
| **Method** | NVFP4 (4-bit floating point) |
| **Tool** | [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) 0.41.0 |
| **Group size** | 16 |
| **Calibration** | 512 samples (Korean, Code, Creative Writing, English), max_seq_length=512 |
| **Quantized layers** | MLP/expert weights only (`gate_proj`, `up_proj`, `down_proj` in dense and MoE layers) |
| **BF16 layers** | Attention (Q/K/V/O projections), embeddings, router gates, shared experts, layer norms, lm_head |
| **Source precision** | BF16 |

### Compression

| Format | Size |
|--------|------|
| BF16 (original) | 796 GB |
| **NVFP4 (this model)** | **216 GB** |

Roughly 3.7× compression relative to the original BF16 checkpoint.

## Running with vLLM

[vLLM](https://github.com/vllm-project/vllm) >= 0.15.1 supports this model natively with the `modelopt` quantization backend. Blackwell GPUs (SM100/SM120) are **required** for NVFP4 inference.

### Requirements

- **VRAM**: ~216 GB of model weights in total. A single GPU with ≥224 GB of VRAM can load the model directly; smaller setups require multi-GPU and/or CPU offloading.
- **System RAM**: If using `cpu_offload_gb`, you need enough system RAM for pinned memory (the offload value × the number of GPUs, plus ~40 GB of model-loading overhead).
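The system-RAM budget described above can be estimated mechanically. A minimal helper (the ~40 GB loading overhead is the figure quoted on this card, not a measured constant):

```python
def pinned_ram_needed_gb(cpu_offload_gb: float, num_gpus: int,
                         loading_overhead_gb: float = 40.0) -> float:
    """Rough system-RAM requirement when using vLLM's cpu_offload_gb.

    Each GPU pins `cpu_offload_gb` of host memory for offloaded weights,
    and model loading needs extra scratch space (~40 GB per this card).
    """
    return cpu_offload_gb * num_gpus + loading_overhead_gb

# Example: the 2-GPU pipeline-parallel configuration shown later on this card.
print(pinned_ram_needed_gb(30, 2))  # 100.0 -> budget at least ~100 GB of free RAM
```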

### Installation

```bash
pip install "vllm>=0.15.1"
```

### Environment Variables

Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. This avoids large temporary GPU allocations during MoE weight initialization that can cause OOM on memory-constrained setups:

```bash
export VLLM_USE_FLASHINFER_MOE_FP4=0
```

### Single-GPU (≥224 GB VRAM)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/Trinity-Large-Base-NVFP4",
    quantization="modelopt",
    max_model_len=4096,
    enforce_eager=True,
    gpu_memory_utilization=0.90,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```

### Multi-GPU with Pipeline Parallelism

For setups where total VRAM is less than ~216 GB, use pipeline parallelism with CPU weight offloading:

```python
import os
os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/Trinity-Large-Base-NVFP4",
    quantization="modelopt",
    pipeline_parallel_size=2,  # number of GPUs
    cpu_offload_gb=30,         # GB of weights to offload per GPU
    max_model_len=512,
    max_num_seqs=256,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```

**Tuning tips:**

- `cpu_offload_gb` is **per GPU** — total pinned memory = `cpu_offload_gb × pipeline_parallel_size`. Ensure this fits in system RAM alongside the OS and the model-loading workspace (~40 GB).
- For **heterogeneous GPU setups** (different VRAM sizes), set `VLLM_PP_LAYER_PARTITION` to control how many of the 60 layers each GPU gets. For example, `export VLLM_PP_LAYER_PARTITION="32,14,14"` for a 3-GPU setup where the first GPU has ~3x the VRAM.
- Each MoE layer is ~3.9 GB (NVFP4) while each dense layer is ~0.14 GB. The first 6 layers are dense; layers 6–59 are MoE. Distribute layers so that `(layer_weights - cpu_offload_gb)` fits comfortably on each GPU with room for KV cache and overhead.
- `max_num_seqs` may need to be lowered for GPUs with ≤32 GB VRAM. The sampler warmup allocates `max_num_seqs × vocab_size × 8 bytes` of temporary memory (~1.5 GB at the default of 1024). Use 256 for smaller GPUs.
- Start with a low `max_model_len` (e.g., 512) and increase it once loading succeeds.
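When picking a `VLLM_PP_LAYER_PARTITION` value, it helps to total up the approximate weight bytes each pipeline stage would hold. A small sketch using the per-layer sizes quoted in the tips above (~0.14 GB dense, ~3.9 GB MoE, first 6 layers dense):

```python
DENSE_GB, MOE_GB = 0.14, 3.9  # approximate NVFP4 per-layer weight sizes (from above)
NUM_DENSE = 6                 # layers 0-5 are dense; layers 6-59 are MoE

def stage_weight_gb(partition: list[int]) -> list[float]:
    """Approximate weight size (GB) held by each pipeline stage."""
    sizes, start = [], 0
    for n in partition:
        layers = range(start, start + n)
        sizes.append(sum(DENSE_GB if i < NUM_DENSE else MOE_GB for i in layers))
        start += n
    return sizes

# The heterogeneous 3-GPU example from the tips above: "32,14,14".
print([round(s, 1) for s in stage_weight_gb([32, 14, 14])])  # [102.2, 54.6, 54.6]
```

Subtract your per-GPU `cpu_offload_gb` from each stage's figure and check that the remainder, plus KV cache and activation overhead, fits in that GPU's VRAM.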

### OpenAI-Compatible API Server

```bash
VLLM_USE_FLASHINFER_MOE_FP4=0 python -m vllm.entrypoints.openai.api_server \
  --model mconcat/Trinity-Large-Base-NVFP4 \
  --quantization modelopt \
  --max-model-len 4096 \
  --enforce-eager \
  --gpu-memory-utilization 0.90 \
  --port 8000
```

For multi-GPU serving, add `--pipeline-parallel-size N --cpu-offload-gb X --max-num-seqs 256` as needed.

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mconcat/Trinity-Large-Base-NVFP4", "prompt": "Hello", "max_tokens": 64}'
```

## Important Notes

- **Blackwell required**: NVFP4 uses Blackwell's 5th-generation Tensor Cores. This model will NOT run on Hopper (H100/H200), Ada (RTX 4090), or older GPUs.
- **vLLM quantization flag**: Use `--quantization modelopt` (not `modelopt_fp4`). vLLM auto-detects the NVFP4 algorithm from the config.
- **MoE backend**: Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. The default FlashInfer backend performs a `reorder_w1w3_to_w3w1` operation that temporarily allocates ~2.25 GB per MoE layer on GPU, which can cause OOM.
- **vLLM cpu_offload_gb + V1 engine**: As of vLLM 0.15.x, using `cpu_offload_gb` with the V1 engine may trigger an assertion error in `may_reinitialize_input_batch` (`gpu_model_runner.py`). If you encounter `AssertionError: Cannot re-initialize the input batch when CPU weight offloading is enabled`, it can be safely worked around by converting the assertion to a warning. See [vLLM #18298](https://github.com/vllm-project/vllm/issues/18298) for status.
- **HuggingFace Transformers**: While `transformers >= 5.0` recognizes the `AfmoeForCausalLM` architecture, it does **not** support the ModelOpt NVFP4 weight format for inference. Use vLLM instead.
- **TensorRT-LLM**: As of February 2026, TensorRT-LLM does not support the `AfmoeForCausalLM` architecture.

## Quantization Recipe

Following NVIDIA's MLP-only quantization strategy (similar to the [DeepSeek-R1 NVFP4 recipe](https://developer.nvidia.com/blog/nvidia-publishes-the-first-deepseek-r1-nvfp4-quantized-model/)):

- Only MLP/expert weights (`gate_proj`, `up_proj`, `down_proj`) are quantized to FP4
- All attention projections remain in BF16 to preserve quality
- Router gates (`mlp.router`) remain in BF16
- Embeddings and lm_head remain in BF16
- The default `*mlp.gate.*` exclusion was removed because Trinity uses `mlp.gate_proj` as a standard MLP projection (not a routing gate)

### Calibration Data

| Domain | Samples | Dataset |
|--------|---------|---------|
| Korean | 128 | [heegyu/open-korean-instructions](https://huggingface.co/datasets/heegyu/open-korean-instructions) |
| Code | 128 | [m-a-p/CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction) |
| Creative Writing | 128 | [Gryphe/ChatGPT-4o-Writing-Prompts](https://huggingface.co/datasets/Gryphe/ChatGPT-4o-Writing-Prompts) |
| General English | 128 | [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) |

## Files

| File | Description |
|------|-------------|
| `model-00001-of-00005.safetensors` ... `model-00005-of-00005.safetensors` | Quantized model weights (5 shards, ~43-50 GB each) |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model configuration with `quantization_config` |
| `hf_quant_config.json` | ModelOpt quantization metadata |
| `generation_config.json` | Generation configuration |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
| `chat_template.jinja` | Chat template |

## Hardware

Quantization was performed on 8× NVIDIA A100-SXM4-80GB GPUs with ~1.8 TiB of system RAM. Total quantization time was approximately 9 hours, dominated by the calibration forward passes. Quantizing on A100s does not require Blackwell hardware; only inference with native FP4 execution does.

## Limitations

- Requires NVIDIA Blackwell GPUs (SM100/SM120) for native NVFP4 inference
- Quality may differ from the original BF16 model, particularly on tasks sensitive to numerical precision
- Calibration was bilingual (Korean + English) plus code; other languages may see slightly higher degradation
- This quantization targets the MLP/expert layers only; the KV cache is not quantized

## License

Same license as the base model: [Apache 2.0](https://huggingface.co/arcee-ai/Trinity-Large-Base).