Upload README.md with huggingface_hub

---
license: other
license_name: trinity-large
license_link: https://huggingface.co/arcee-ai/Trinity-Large-TrueBase/blob/main/LICENSE
base_model: arcee-ai/Trinity-Large-TrueBase
tags:
- moe
- nvfp4
- modelopt
- blackwell
- vllm
---

# Trinity-Large-TrueBase-NVFP4

NVFP4-quantized version of [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase) for deployment on NVIDIA Blackwell GPUs.

## Model Details

|---|---|
| **Base model** | [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase) |
| **Architecture** | AfmoeForCausalLM (Mixture-of-Experts) |
| **Parameters** | 398B total, ~13B active per token |
| **Layers** | 60 (6 dense + 54 MoE) |
| **Experts** | 256 per MoE layer, 4 active per token, 1 shared expert |
| **Hidden size** | 3072 |

The NVFP4 checkpoint achieves roughly 3.7x compression relative to the original BF16 weights.

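As a back-of-the-envelope check on that figure (a sketch only; it assumes ~2 bytes per parameter for the BF16 baseline and uses the ~216 GB NVFP4 footprint quoted in the requirements below):

```python
# Rough compression estimate.
# Assumptions: 2 bytes/param for BF16; ~216 GB on-disk size for the NVFP4 checkpoint
# (the figure given in the Requirements section below).
total_params = 398e9                  # 398B total parameters
bf16_gb = total_params * 2 / 1e9      # ~796 GB in BF16
nvfp4_gb = 216                        # approximate NVFP4 checkpoint size
print(f"BF16 ~ {bf16_gb:.0f} GB, NVFP4 ~ {nvfp4_gb} GB, ratio ~ {bf16_gb / nvfp4_gb:.1f}x")
# -> ratio ~ 3.7x
```
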
## Running with vLLM

[vLLM](https://github.com/vllm-project/vllm) >= 0.15.1 supports this model natively with the `modelopt` quantization backend. Blackwell GPUs (SM100/SM120) are **required** for NVFP4 inference.

### Requirements

- **VRAM**: ~216 GB of model weights in total. A single GPU with ≥224 GB VRAM can load it directly; smaller setups require multi-GPU and/or CPU offloading.
- **System RAM**: If using `cpu_offload_gb`, you need sufficient system RAM for pinned memory (the offload value × number of GPUs, plus ~40 GB for model loading overhead); see the sketch below.
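A quick way to estimate that before launching (a sketch of the rule of thumb above; `estimate_system_ram_gb` is just an illustrative helper, and the ~40 GB loading overhead is the figure quoted here, not a hard constant):

```python
def estimate_system_ram_gb(cpu_offload_gb: float, num_gpus: int, loading_overhead_gb: float = 40) -> float:
    """Rough pinned-memory requirement when using cpu_offload_gb with pipeline parallelism."""
    return cpu_offload_gb * num_gpus + loading_overhead_gb

# Example: 30 GB offloaded per GPU across 2 GPUs -> ~100 GB of system RAM needed
print(estimate_system_ram_gb(cpu_offload_gb=30, num_gpus=2))
```
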

### Installation

```bash
pip install "vllm>=0.15.1"
```

### Environment Variables

Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. This avoids large temporary GPU allocations during MoE weight initialization that can cause OOM on memory-constrained setups:

```bash
export VLLM_USE_FLASHINFER_MOE_FP4=0
```

### Single-GPU (≥224 GB VRAM)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/Trinity-Large-TrueBase-NVFP4",
    quantization="modelopt",
    max_model_len=4096,
    enforce_eager=True,
    gpu_memory_utilization=0.90,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```

### Multi-GPU with Pipeline Parallelism

For setups where total VRAM is less than ~216 GB, use pipeline parallelism with CPU weight offloading:

```python
import os
os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/Trinity-Large-TrueBase-NVFP4",
    quantization="modelopt",
    pipeline_parallel_size=2,  # number of GPUs
    cpu_offload_gb=30,         # GB of weights to offload per GPU
    max_model_len=512,
    max_num_seqs=256,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```

**Tuning tips:**
- `cpu_offload_gb` is **per GPU** — total pinned memory = `cpu_offload_gb × pipeline_parallel_size`. Ensure this fits in system RAM alongside the OS and model loading workspace (~40 GB).
- For **heterogeneous GPU setups** (different VRAM sizes), set `VLLM_PP_LAYER_PARTITION` to control how many of the 60 layers each GPU gets. For example, `export VLLM_PP_LAYER_PARTITION="32,14,14"` for a 3-GPU setup where the first GPU has ~3x the VRAM.
- Each MoE layer is ~3.9 GB (NVFP4) while each dense layer is ~0.14 GB. The first 6 layers are dense; layers 6–59 are MoE. Distribute layers so that `(layer_weights - cpu_offload_gb)` fits comfortably on each GPU with room for KV cache and overhead; see the sketch after this list.
- `max_num_seqs` may need to be lowered for GPUs with ≤32 GB VRAM. The sampler warmup allocates `max_num_seqs × vocab_size × 8 bytes` of temporary memory (~1.5 GB at the default of 1024). Use 256 for smaller GPUs.
- Start with a low `max_model_len` (e.g., 512) and increase once loading succeeds.
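
To make the layer arithmetic above concrete, here is a small helper (a sketch only; the ~0.14 GB and ~3.9 GB per-layer sizes are the approximate figures from the tips, and `per_gpu_weight_gb` is an illustrative name, not a vLLM API):

```python
# Rough per-GPU weight footprint for a given VLLM_PP_LAYER_PARTITION split.
# Assumptions (from the tips above): 6 dense layers at ~0.14 GB, 54 MoE layers at ~3.9 GB.
DENSE_LAYERS, DENSE_GB = 6, 0.14
MOE_GB = 3.9
TOTAL_LAYERS = 60

def per_gpu_weight_gb(partition: list[int], cpu_offload_gb: float = 0.0) -> list[float]:
    assert sum(partition) == TOTAL_LAYERS, "partition must cover all 60 layers"
    sizes, start = [], 0
    for n in partition:
        layers = range(start, start + n)
        gb = sum(DENSE_GB if i < DENSE_LAYERS else MOE_GB for i in layers)
        sizes.append(round(max(gb - cpu_offload_gb, 0.0), 1))
        start += n
    return sizes

# Example: the "32,14,14" split from the tip above, with 30 GB offloaded per GPU
print(per_gpu_weight_gb([32, 14, 14], cpu_offload_gb=30))  # -> [72.2, 24.6, 24.6]
```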

### OpenAI-Compatible API Server

```bash
VLLM_USE_FLASHINFER_MOE_FP4=0 python -m vllm.entrypoints.openai.api_server \
    --model mconcat/Trinity-Large-TrueBase-NVFP4 \
    --quantization modelopt \
    --max-model-len 4096 \
    --enforce-eager \
    --gpu-memory-utilization 0.90 \
    --port 8000
```

For multi-GPU serving, add `--pipeline-parallel-size N --cpu-offload-gb X --max-num-seqs 256` as needed.

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mconcat/Trinity-Large-TrueBase-NVFP4", "prompt": "Hello", "max_tokens": 64}'
```
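
The same endpoint can also be called from Python with the `openai` client (a sketch; it assumes the server above is running on `localhost:8000`, and the API key is a placeholder since no `--api-key` was set):

```python
# pip install openai
from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.completions.create(
    model="mconcat/Trinity-Large-TrueBase-NVFP4",
    prompt="Hello",
    max_tokens=64,
)
print(completion.choices[0].text)
```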

## Important Notes

- **Blackwell required**: NVFP4 uses Blackwell's 5th-generation Tensor Cores. This model will NOT run on Hopper (H100/H200), Ada (RTX 4090), or older GPUs. A quick capability check is sketched after this list.
- **vLLM quantization flag**: Use `--quantization modelopt` (not `modelopt_fp4`). vLLM auto-detects the NVFP4 algorithm from the config.
- **MoE backend**: Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. The default flashinfer backend performs a `reorder_w1w3_to_w3w1` operation that temporarily allocates ~2.25 GB per MoE layer on GPU, which can cause OOM.
- **vLLM cpu_offload_gb + V1 engine**: As of vLLM 0.15.x, using `cpu_offload_gb` with the V1 engine may trigger an assertion error in `may_reinitialize_input_batch` (`gpu_model_runner.py`). If you encounter `AssertionError: Cannot re-initialize the input batch when CPU weight offloading is enabled`, this can be safely patched by converting the assertion to a warning. See [vLLM PR #18298](https://github.com/vllm-project/vllm/issues/18298) for status.
- **HuggingFace Transformers**: While `transformers >= 5.0` recognizes the `AfmoeForCausalLM` architecture, it does **not** support the ModelOpt NVFP4 weight format for inference. Use vLLM instead.
- **TensorRT-LLM**: As of February 2026, TensorRT-LLM does not support the `AfmoeForCausalLM` architecture.
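
A minimal pre-flight check for the Blackwell requirement (a sketch; it assumes PyTorch is installed and treats compute capability 10.x or 12.x as SM100/SM120):

```python
import torch

if not torch.cuda.is_available():
    print("No CUDA device visible.")
else:
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU 0 compute capability: sm_{major}{minor}")
    if major not in (10, 12):
        print("Warning: not a Blackwell GPU (SM100/SM120); native NVFP4 kernels are unavailable.")
```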

## Quantization Recipe

Following NVIDIA's MLP-only quantization strategy (similar to the [DeepSeek-R1 NVFP4 recipe](https://developer.nvidia.com/blog/nvidia-publishes-the-first-deepseek-r1-nvfp4-quantized-model/)):
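
The full calibration script is not included in this card; the following is only a rough sketch of what MLP-only NVFP4 post-training quantization with ModelOpt looks like. The `*self_attn*` skip pattern, the calibration prompts, and the export path are illustrative assumptions, not the exact recipe used for this checkpoint:

```python
import copy

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/Trinity-Large-TrueBase", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/Trinity-Large-TrueBase")

# Start from ModelOpt's default NVFP4 config and (assumption) disable quantization
# outside the MLP/expert projections so attention stays in BF16.
cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
cfg["quant_cfg"]["*self_attn*"] = {"enable": False}

def forward_loop(m):
    # Illustrative calibration pass over a few prompts; the actual recipe used a
    # bilingual (Korean + English) corpus with code, per the Limitations section.
    for text in ["Hello, world.", "안녕하세요."]:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        m(**inputs)

model = mtq.quantize(model, cfg, forward_loop)
export_hf_checkpoint(model, export_dir="Trinity-Large-TrueBase-NVFP4")
```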

| File | Description |
|------|-------------|
| `model-00001-of-00005.safetensors` ... `model-00005-of-00005.safetensors` | Quantized model weights (5 shards, ~43-50 GB each) |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model configuration with `quantization_config` |
| `hf_quant_config.json` | ModelOpt quantization metadata |
| `generation_config.json` | Generation configuration |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
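
To confirm a download matches this table, the quantization metadata can be inspected directly (a sketch using `huggingface_hub`; the repo id is the one used in the vLLM examples above):

```python
import json
from huggingface_hub import hf_hub_download

# Fetch and print the ModelOpt quantization metadata.
quant_path = hf_hub_download("mconcat/Trinity-Large-TrueBase-NVFP4", "hf_quant_config.json")
print(json.dumps(json.load(open(quant_path)), indent=2))

# The main config also carries a quantization_config block.
cfg_path = hf_hub_download("mconcat/Trinity-Large-TrueBase-NVFP4", "config.json")
print(json.load(open(cfg_path)).get("quantization_config"))
```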

Quantization was performed on 8x NVIDIA A100-SXM4-80GB with ~1.8 TiB system RAM.

## Limitations

- Requires NVIDIA Blackwell GPUs (SM100/SM120) for native NVFP4 inference
- Quality may differ from the original BF16 model, particularly on tasks sensitive to numerical precision
- Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
- This quantization targets the MLP/expert layers only; KV cache is not quantized