Upload README.md with huggingface_hub

---
license: other
license_name: trinity-large
license_link: https://huggingface.co/arcee-ai/Trinity-Large-TrueBase/blob/main/LICENSE
base_model: arcee-ai/Trinity-Large-TrueBase
tags:
- moe
- nvfp4
- modelopt
- blackwell
- vllm
---

# Trinity-Large-TrueBase-NVFP4

NVFP4-quantized version of [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase) for deployment on NVIDIA Blackwell GPUs.

## Model Details

|---|---|
| **Base model** | [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase) |
| **Architecture** | AfmoeForCausalLM (Mixture-of-Experts) |
| **Parameters** | 398B total, ~13B active per token |
| **Layers** | 60 (6 dense + 54 MoE) |
| **Experts** | 256 per MoE layer, 4 active per token, 1 shared expert |
| **Hidden size** | 3072 |

The NVFP4 checkpoint achieves roughly 3.7x compression relative to the original BF16 weights.

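As a back-of-the-envelope check on that figure (a sketch only; it assumes ~2 bytes per parameter for the BF16 baseline and uses the ~216 GB NVFP4 footprint quoted in the requirements below):

```python
# Rough compression estimate.
# Assumptions: 2 bytes/param for BF16; ~216 GB on-disk size for the NVFP4 checkpoint
# (the figure given in the Requirements section below).
total_params = 398e9                  # 398B total parameters
bf16_gb = total_params * 2 / 1e9      # ~796 GB in BF16
nvfp4_gb = 216                        # approximate NVFP4 checkpoint size
print(f"BF16 ~ {bf16_gb:.0f} GB, NVFP4 ~ {nvfp4_gb} GB, ratio ~ {bf16_gb / nvfp4_gb:.1f}x")
# -> ratio ~ 3.7x
```
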
## Running with vLLM

[vLLM](https://github.com/vllm-project/vllm) >= 0.15.1 supports this model natively with the `modelopt` quantization backend. Blackwell GPUs (SM100/SM120) are **required** for NVFP4 inference.

### Requirements

- **VRAM**: ~216 GB of model weights in total. A single GPU with ≥224 GB VRAM can load it directly; smaller setups require multi-GPU and/or CPU offloading.
- **System RAM**: If using `cpu_offload_gb`, you need sufficient system RAM for pinned memory (the offload value × number of GPUs, plus ~40 GB for model loading overhead); see the sketch below.
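A quick way to estimate that before launching (a sketch of the rule of thumb above; `estimate_system_ram_gb` is just an illustrative helper, and the ~40 GB loading overhead is the figure quoted here, not a hard constant):

```python
def estimate_system_ram_gb(cpu_offload_gb: float, num_gpus: int, loading_overhead_gb: float = 40) -> float:
    """Rough pinned-memory requirement when using cpu_offload_gb with pipeline parallelism."""
    return cpu_offload_gb * num_gpus + loading_overhead_gb

# Example: 30 GB offloaded per GPU across 2 GPUs -> ~100 GB of system RAM needed
print(estimate_system_ram_gb(cpu_offload_gb=30, num_gpus=2))
```
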

### Installation

```bash
pip install "vllm>=0.15.1"
```

### Environment Variables

Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. This avoids large temporary GPU allocations during MoE weight initialization that can cause OOM on memory-constrained setups:

```bash
export VLLM_USE_FLASHINFER_MOE_FP4=0
```

### Single-GPU (≥224 GB VRAM)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/Trinity-Large-TrueBase-NVFP4",
    quantization="modelopt",
    max_model_len=4096,
    enforce_eager=True,
    gpu_memory_utilization=0.90,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```

### Multi-GPU with Pipeline Parallelism

For setups where total VRAM is less than ~216 GB, use pipeline parallelism with CPU weight offloading:

```python
import os
os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(
    model="mconcat/Trinity-Large-TrueBase-NVFP4",
    quantization="modelopt",
    pipeline_parallel_size=2,  # number of GPUs
    cpu_offload_gb=30,         # GB of weights to offload per GPU
    max_model_len=512,
    max_num_seqs=256,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["The meaning of life is"], sampling_params)
print(outputs[0].outputs[0].text)
```

**Tuning tips:**
- `cpu_offload_gb` is **per GPU** — total pinned memory = `cpu_offload_gb × pipeline_parallel_size`. Ensure this fits in system RAM alongside the OS and model loading workspace (~40 GB).
- For **heterogeneous GPU setups** (different VRAM sizes), set `VLLM_PP_LAYER_PARTITION` to control how many of the 60 layers each GPU gets. For example, `export VLLM_PP_LAYER_PARTITION="32,14,14"` for a 3-GPU setup where the first GPU has ~3x the VRAM.
- Each MoE layer is ~3.9 GB (NVFP4) while each dense layer is ~0.14 GB. The first 6 layers are dense; layers 6–59 are MoE. Distribute layers so that `(layer_weights - cpu_offload_gb)` fits comfortably on each GPU with room for KV cache and overhead; see the sketch after this list.
- `max_num_seqs` may need to be lowered for GPUs with ≤32 GB VRAM. The sampler warmup allocates `max_num_seqs × vocab_size × 8 bytes` of temporary memory (~1.5 GB at the default of 1024). Use 256 for smaller GPUs.
- Start with a low `max_model_len` (e.g., 512) and increase once loading succeeds.
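
To make the layer arithmetic above concrete, here is a small helper (a sketch only; the ~0.14 GB and ~3.9 GB per-layer sizes are the approximate figures from the tips, and `per_gpu_weight_gb` is an illustrative name, not a vLLM API):

```python
# Rough per-GPU weight footprint for a given VLLM_PP_LAYER_PARTITION split.
# Assumptions (from the tips above): 6 dense layers at ~0.14 GB, 54 MoE layers at ~3.9 GB.
DENSE_LAYERS, DENSE_GB = 6, 0.14
MOE_GB = 3.9
TOTAL_LAYERS = 60

def per_gpu_weight_gb(partition: list[int], cpu_offload_gb: float = 0.0) -> list[float]:
    assert sum(partition) == TOTAL_LAYERS, "partition must cover all 60 layers"
    sizes, start = [], 0
    for n in partition:
        layers = range(start, start + n)
        gb = sum(DENSE_GB if i < DENSE_LAYERS else MOE_GB for i in layers)
        sizes.append(round(max(gb - cpu_offload_gb, 0.0), 1))
        start += n
    return sizes

# Example: the "32,14,14" split from the tip above, with 30 GB offloaded per GPU
print(per_gpu_weight_gb([32, 14, 14], cpu_offload_gb=30))  # -> [72.2, 24.6, 24.6]
```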

### OpenAI-Compatible API Server

```bash
VLLM_USE_FLASHINFER_MOE_FP4=0 python -m vllm.entrypoints.openai.api_server \
    --model mconcat/Trinity-Large-TrueBase-NVFP4 \
    --quantization modelopt \
    --max-model-len 4096 \
    --enforce-eager \
    --gpu-memory-utilization 0.90 \
    --port 8000
```

For multi-GPU serving, add `--pipeline-parallel-size N --cpu-offload-gb X --max-num-seqs 256` as needed.

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mconcat/Trinity-Large-TrueBase-NVFP4", "prompt": "Hello", "max_tokens": 64}'
```
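
The same endpoint can also be called from Python with the `openai` client (a sketch; it assumes the server above is running on `localhost:8000`, and the API key is a placeholder since no `--api-key` was set):

```python
# pip install openai
from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.completions.create(
    model="mconcat/Trinity-Large-TrueBase-NVFP4",
    prompt="Hello",
    max_tokens=64,
)
print(completion.choices[0].text)
```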

## Important Notes

- **Blackwell required**: NVFP4 uses Blackwell's 5th-generation Tensor Cores. This model will NOT run on Hopper (H100/H200), Ada (RTX 4090), or older GPUs. A quick capability check is sketched after this list.
- **vLLM quantization flag**: Use `--quantization modelopt` (not `modelopt_fp4`). vLLM auto-detects the NVFP4 algorithm from the config.
- **MoE backend**: Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. The default flashinfer backend performs a `reorder_w1w3_to_w3w1` operation that temporarily allocates ~2.25 GB per MoE layer on GPU, which can cause OOM.
- **vLLM cpu_offload_gb + V1 engine**: As of vLLM 0.15.x, using `cpu_offload_gb` with the V1 engine may trigger an assertion error in `may_reinitialize_input_batch` (`gpu_model_runner.py`). If you encounter `AssertionError: Cannot re-initialize the input batch when CPU weight offloading is enabled`, this can be safely patched by converting the assertion to a warning. See [vLLM PR #18298](https://github.com/vllm-project/vllm/issues/18298) for status.
- **HuggingFace Transformers**: While `transformers >= 5.0` recognizes the `AfmoeForCausalLM` architecture, it does **not** support the ModelOpt NVFP4 weight format for inference. Use vLLM instead.
- **TensorRT-LLM**: As of February 2026, TensorRT-LLM does not support the `AfmoeForCausalLM` architecture.
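
A minimal pre-flight check for the Blackwell requirement (a sketch; it assumes PyTorch is installed and treats compute capability 10.x or 12.x as SM100/SM120):

```python
import torch

if not torch.cuda.is_available():
    print("No CUDA device visible.")
else:
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU 0 compute capability: sm_{major}{minor}")
    if major not in (10, 12):
        print("Warning: not a Blackwell GPU (SM100/SM120); native NVFP4 kernels are unavailable.")
```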

## Quantization Recipe

Following NVIDIA's MLP-only quantization strategy (similar to the [DeepSeek-R1 NVFP4 recipe](https://developer.nvidia.com/blog/nvidia-publishes-the-first-deepseek-r1-nvfp4-quantized-model/)):
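
The full calibration script is not included in this card; the following is only a rough sketch of what MLP-only NVFP4 post-training quantization with ModelOpt looks like. The `*self_attn*` skip pattern, the calibration prompts, and the export path are illustrative assumptions, not the exact recipe used for this checkpoint:

```python
import copy

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/Trinity-Large-TrueBase", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/Trinity-Large-TrueBase")

# Start from ModelOpt's default NVFP4 config and (assumption) disable quantization
# outside the MLP/expert projections so attention stays in BF16.
cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
cfg["quant_cfg"]["*self_attn*"] = {"enable": False}

def forward_loop(m):
    # Illustrative calibration pass over a few prompts; the actual recipe used a
    # bilingual (Korean + English) corpus with code, per the Limitations section.
    for text in ["Hello, world.", "안녕하세요."]:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        m(**inputs)

model = mtq.quantize(model, cfg, forward_loop)
export_hf_checkpoint(model, export_dir="Trinity-Large-TrueBase-NVFP4")
```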

| File | Description |
|------|-------------|
| `model-00001-of-00005.safetensors` ... `model-00005-of-00005.safetensors` | Quantized model weights (5 shards, ~43-50 GB each) |
| `model.safetensors.index.json` | Weight shard index |
| `config.json` | Model configuration with `quantization_config` |
| `hf_quant_config.json` | ModelOpt quantization metadata |
| `generation_config.json` | Generation configuration |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer configuration |
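
To confirm a download matches this table, the quantization metadata can be inspected directly (a sketch using `huggingface_hub`; the repo id is the one used in the vLLM examples above):

```python
import json
from huggingface_hub import hf_hub_download

# Fetch and print the ModelOpt quantization metadata.
quant_path = hf_hub_download("mconcat/Trinity-Large-TrueBase-NVFP4", "hf_quant_config.json")
print(json.dumps(json.load(open(quant_path)), indent=2))

# The main config also carries a quantization_config block.
cfg_path = hf_hub_download("mconcat/Trinity-Large-TrueBase-NVFP4", "config.json")
print(json.load(open(cfg_path)).get("quantization_config"))
```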

Quantization was performed on 8x NVIDIA A100-SXM4-80GB with ~1.8 TiB system RAM.

## Limitations

- Requires NVIDIA Blackwell GPUs (SM100/SM120) for native NVFP4 inference
- Quality may differ from the original BF16 model, particularly on tasks sensitive to numerical precision
- Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
- This quantization targets the MLP/expert layers only; KV cache is not quantized