mconcat committed on
Commit 8e48500 · verified · 1 Parent(s): 13be8a1

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +107 -23
README.md CHANGED
@@ -1,24 +1,19 @@
  ---
- library_name: tensorrt_llm
  base_model: arcee-ai/Trinity-Large-TrueBase
  tags:
- - nvidia
  - nvfp4
- - fp4
- - quantized
- - tensorrt-llm
  - modelopt
- - mixture-of-experts
- - moe
  - blackwell
- license: other
- license_name: same-as-base-model
- license_link: https://huggingface.co/arcee-ai/Trinity-Large-TrueBase
  ---

  # Trinity-Large-TrueBase-NVFP4

- NVFP4-quantized version of [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase) for deployment on NVIDIA Blackwell GPUs via TensorRT-LLM.

  ## Model Details

@@ -26,7 +21,7 @@ NVFP4-quantized version of [arcee-ai/Trinity-Large-TrueBase](https://huggingface
  |---|---|
  | **Base model** | [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase) |
  | **Architecture** | AfmoeForCausalLM (Mixture-of-Experts) |
- | **Parameters** | 398B total |
  | **Layers** | 60 (6 dense + 54 MoE) |
  | **Experts** | 256 per MoE layer, 4 active per token, 1 shared expert |
  | **Hidden size** | 3072 |
@@ -57,20 +52,109 @@ NVFP4-quantized version of [arcee-ai/Trinity-Large-TrueBase](https://huggingface
  3.7x compression.

- ## Intended Use

- This checkpoint is intended for deployment on NVIDIA Blackwell (SM100) GPUs using TensorRT-LLM's NVFP4 inference path. The NVFP4 format requires Blackwell's 5th-generation Tensor Cores for native FP4 execution.

- ### Loading with TensorRT-LLM

  ```bash
- # Convert to TensorRT-LLM engine
- trtllm-build \
-     --checkpoint_dir ./Trinity-Large-TrueBase-NVFP4 \
-     --output_dir ./engine \
-     --gemm_plugin auto
  ```

  ## Quantization Recipe

  Following NVIDIA's MLP-only quantization strategy (similar to the [DeepSeek-R1 NVFP4 recipe](https://developer.nvidia.com/blog/nvidia-publishes-the-first-deepseek-r1-nvfp4-quantized-model/)):

@@ -94,10 +178,10 @@ Following NVIDIA's MLP-only quantization strategy (similar to the [DeepSeek-R1 N
  | File | Description |
  |------|-------------|
- | `model-00001-of-00005.safetensors` ... `model-00005-of-00005.safetensors` | Quantized model weights (5 shards, ~43 GB each) |
  | `model.safetensors.index.json` | Weight shard index |
  | `config.json` | Model configuration with `quantization_config` |
- | `hf_quant_config.json` | ModelOpt quantization metadata (consumed by TensorRT-LLM) |
  | `generation_config.json` | Generation configuration |
  | `tokenizer.json` | Tokenizer |
  | `tokenizer_config.json` | Tokenizer configuration |

@@ -109,7 +193,7 @@ Quantization was performed on 8x NVIDIA A100-SXM4-80GB with ~1.8 TiB system RAM.
  ## Limitations

- - Requires NVIDIA Blackwell GPUs (SM100) for native NVFP4 inference via TensorRT-LLM
  - Quality may differ from the original BF16 model, particularly on tasks sensitive to numerical precision
  - Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
  - This quantization targets the MLP/expert layers only; KV cache is not quantized
 
  ---
+ license: other
+ license_name: trinity-large
+ license_link: https://huggingface.co/arcee-ai/Trinity-Large-TrueBase/blob/main/LICENSE
  base_model: arcee-ai/Trinity-Large-TrueBase
  tags:
+ - moe
  - nvfp4
  - modelopt
  - blackwell
+ - vllm
  ---

  # Trinity-Large-TrueBase-NVFP4

+ NVFP4-quantized version of [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase) for deployment on NVIDIA Blackwell GPUs.

  ## Model Details

  |---|---|
  | **Base model** | [arcee-ai/Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase) |
  | **Architecture** | AfmoeForCausalLM (Mixture-of-Experts) |
+ | **Parameters** | 398B total, ~13B active per token |
  | **Layers** | 60 (6 dense + 54 MoE) |
  | **Experts** | 256 per MoE layer, 4 active per token, 1 shared expert |
  | **Hidden size** | 3072 |

  3.7x compression.
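The 3.7x figure can be sanity-checked from card-level numbers: 398B parameters at BF16 (2 bytes each) is roughly 796 GB, against a ~216 GB NVFP4 checkpoint. A back-of-envelope sketch (both sizes are the approximate figures quoted in this card, not exact file sizes):

```python
# Back-of-envelope check of the quoted 3.7x compression ratio.
# Both sizes are approximations taken from this model card.

params_total = 398e9              # 398B parameters
bf16_gb = params_total * 2 / 1e9  # BF16 = 2 bytes/param -> ~796 GB
nvfp4_gb = 216                    # quoted NVFP4 checkpoint size

ratio = bf16_gb / nvfp4_gb
print(f"BF16 ~{bf16_gb:.0f} GB / NVFP4 ~{nvfp4_gb} GB = {ratio:.1f}x")
```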

+ ## Running with vLLM
+
+ [vLLM](https://github.com/vllm-project/vllm) >= 0.15.1 supports this model natively with the `modelopt` quantization backend. Blackwell GPUs (SM100/SM120) are **required** for NVFP4 inference.
+
+ ### Requirements
+
+ - **VRAM**: ~216 GB of model weights in total. A single GPU with ≥224 GB of VRAM can load the model directly; smaller setups require multi-GPU and/or CPU offloading.
+ - **System RAM**: If using `cpu_offload_gb`, you need enough system RAM for pinned memory (the offload value × number of GPUs, plus ~40 GB of model-loading overhead).
+
+ ### Installation
+
+ ```bash
+ pip install "vllm>=0.15.1"
+ ```
+
+ ### Environment Variables
+
+ Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. This avoids large temporary GPU allocations during MoE weight initialization that can cause OOM on memory-constrained setups:
+
+ ```bash
+ export VLLM_USE_FLASHINFER_MOE_FP4=0
+ ```
+
+ ### Single-GPU (≥224 GB VRAM)
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ llm = LLM(
+     model="mconcat/Trinity-Large-TrueBase-NVFP4",
+     quantization="modelopt",
+     max_model_len=4096,
+     enforce_eager=True,
+     gpu_memory_utilization=0.90,
+ )
+
+ sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
+ outputs = llm.generate(["The meaning of life is"], sampling_params)
+ print(outputs[0].outputs[0].text)
+ ```
+
+ ### Multi-GPU with Pipeline Parallelism
+
+ For setups where total VRAM is less than ~216 GB, use pipeline parallelism with CPU weight offloading:
+
+ ```python
+ import os
+ os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"
+
+ from vllm import LLM, SamplingParams
+
+ llm = LLM(
+     model="mconcat/Trinity-Large-TrueBase-NVFP4",
+     quantization="modelopt",
+     pipeline_parallel_size=2,  # number of GPUs
+     cpu_offload_gb=30,         # GB of weights to offload per GPU
+     max_model_len=512,
+     max_num_seqs=256,
+     enforce_eager=True,
+     gpu_memory_utilization=0.95,
+ )
+
+ sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
+ outputs = llm.generate(["The meaning of life is"], sampling_params)
+ print(outputs[0].outputs[0].text)
+ ```
+
+ **Tuning tips:**
+ - `cpu_offload_gb` is **per GPU**: total pinned memory = `cpu_offload_gb × pipeline_parallel_size`. Ensure this fits in system RAM alongside the OS and the ~40 GB model-loading workspace.
+ - For **heterogeneous GPU setups** (different VRAM sizes), set `VLLM_PP_LAYER_PARTITION` to control how many of the 60 layers each GPU gets. For example, `export VLLM_PP_LAYER_PARTITION="32,14,14"` for a 3-GPU setup where the first GPU has ~3x the VRAM of the others.
+ - Each MoE layer is ~3.9 GB (NVFP4) while each dense layer is ~0.14 GB. The first 6 layers are dense; layers 6–59 are MoE. Distribute layers so that each stage's weights minus its `cpu_offload_gb` fit comfortably on its GPU, with room left for KV cache and overhead.
+ - `max_num_seqs` may need to be lowered for GPUs with ≤32 GB VRAM: the sampler warmup allocates `max_num_seqs × vocab_size × 8` bytes of temporary memory (~1.5 GB at the default of 1024). Use 256 for smaller GPUs.
+ - Start with a low `max_model_len` (e.g., 512) and increase it once loading succeeds.
+
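The per-layer sizes above make partition planning scriptable. A minimal sketch (the `stage_weights_gb` helper is illustrative, built only from the approximate ~3.9 GB/MoE-layer and ~0.14 GB/dense-layer figures in the tips) that estimates the weight footprint of each pipeline stage for a given `VLLM_PP_LAYER_PARTITION`:

```python
# Estimate per-GPU weight footprint for a VLLM_PP_LAYER_PARTITION split.
# Approximate per-layer sizes from this card: the first 6 of 60 layers
# are dense (~0.14 GB each), layers 6-59 are MoE (~3.9 GB each in NVFP4).

DENSE_GB, MOE_GB = 0.14, 3.9
N_DENSE, N_LAYERS = 6, 60

def stage_weights_gb(partition: list[int]) -> list[float]:
    """Weight (GB) held by each pipeline stage, before CPU offloading."""
    assert sum(partition) == N_LAYERS, "partition must cover all 60 layers"
    sizes, start = [], 0
    for n in partition:
        # Count how many of this stage's layers fall in the dense prefix.
        dense = sum(1 for i in range(start, start + n) if i < N_DENSE)
        sizes.append(dense * DENSE_GB + (n - dense) * MOE_GB)
        start += n
    return sizes

# Example from the tips above: 3 GPUs, first one with ~3x the VRAM.
for gpu, gb in enumerate(stage_weights_gb([32, 14, 14])):
    print(f"GPU {gpu}: ~{gb:.1f} GB of weights")
```

Subtracting each stage's `cpu_offload_gb` gives its resident footprint; total pinned host memory is `cpu_offload_gb × pipeline_parallel_size` plus the ~40 GB loading workspace noted earlier.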
+ ### OpenAI-Compatible API Server
+
+ ```bash
+ VLLM_USE_FLASHINFER_MOE_FP4=0 python -m vllm.entrypoints.openai.api_server \
+     --model mconcat/Trinity-Large-TrueBase-NVFP4 \
+     --quantization modelopt \
+     --max-model-len 4096 \
+     --enforce-eager \
+     --gpu-memory-utilization 0.90 \
+     --port 8000
+ ```
+
+ For multi-GPU serving, add `--pipeline-parallel-size N --cpu-offload-gb X --max-num-seqs 256` as needed.

  ```bash
+ curl http://localhost:8000/v1/completions \
+     -H "Content-Type: application/json" \
+     -d '{"model": "mconcat/Trinity-Large-TrueBase-NVFP4", "prompt": "Hello", "max_tokens": 64}'
  ```
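The same request can be made from Python with only the standard library. A minimal sketch (the `build_completion_request` helper is illustrative, not a vLLM API; actually sending it assumes the server above is running on localhost:8000):

```python
import json
import urllib.request

# Mirror the curl call against the OpenAI-compatible /v1/completions endpoint.
def build_completion_request(prompt: str, max_tokens: int = 64):
    """Return (url, headers, body) for a completion request."""
    body = json.dumps({
        "model": "mconcat/Trinity-Large-TrueBase-NVFP4",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode()
    return ("http://localhost:8000/v1/completions",
            {"Content-Type": "application/json"},
            body)

url, headers, body = build_completion_request("Hello")
print(url, body.decode())

# With a live server, send it like so (network call commented out here):
# req = urllib.request.Request(url, data=body, headers=headers)
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["text"])
```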

+ ## Important Notes
+
+ - **Blackwell required**: NVFP4 uses Blackwell's 5th-generation Tensor Cores. This model will NOT run on Hopper (H100/H200), Ada (RTX 4090), or older GPUs.
+ - **vLLM quantization flag**: Use `--quantization modelopt` (not `modelopt_fp4`); vLLM auto-detects the NVFP4 algorithm from the config.
+ - **MoE backend**: Set `VLLM_USE_FLASHINFER_MOE_FP4=0` to use the VLLM_CUTLASS MoE backend. The default FlashInfer backend performs a `reorder_w1w3_to_w3w1` operation that temporarily allocates ~2.25 GB per MoE layer on GPU, which can cause OOM.
+ - **vLLM `cpu_offload_gb` + V1 engine**: As of vLLM 0.15.x, using `cpu_offload_gb` with the V1 engine may trigger an assertion error in `may_reinitialize_input_batch` (`gpu_model_runner.py`). If you encounter `AssertionError: Cannot re-initialize the input batch when CPU weight offloading is enabled`, it can be worked around by converting the assertion to a warning. See [vLLM issue #18298](https://github.com/vllm-project/vllm/issues/18298) for status.
+ - **HuggingFace Transformers**: While `transformers >= 5.0` recognizes the `AfmoeForCausalLM` architecture, it does **not** support the ModelOpt NVFP4 weight format for inference. Use vLLM instead.
+ - **TensorRT-LLM**: As of February 2026, TensorRT-LLM does not support the `AfmoeForCausalLM` architecture.
+
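The Blackwell requirement in the first note can be checked programmatically via CUDA compute capability. A small sketch; the capability mapping (Blackwell = 10.x for SM100 or 12.x for SM120, Hopper = 9.x, Ada = 8.9) and the `supports_nvfp4` helper are our illustration, not part of vLLM:

```python
# Map CUDA compute capability to native NVFP4 support, per the note above:
# Blackwell (SM100 -> 10.x, SM120 -> 12.x) is required; Hopper (9.x),
# Ada (8.9), and older architectures are not supported.

def supports_nvfp4(major: int, minor: int) -> bool:
    """True only for Blackwell-class compute capabilities."""
    return major in (10, 12)

for name, cap in [("B200 (Blackwell, SM100)", (10, 0)),
                  ("RTX 5090 (Blackwell, SM120)", (12, 0)),
                  ("H100 (Hopper)", (9, 0)),
                  ("RTX 4090 (Ada)", (8, 9))]:
    print(f"{name}: {'OK' if supports_nvfp4(*cap) else 'not supported'}")

# On a live system with PyTorch, query the capability with:
#   major, minor = torch.cuda.get_device_capability(0)
```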
  ## Quantization Recipe

  Following NVIDIA's MLP-only quantization strategy (similar to the [DeepSeek-R1 NVFP4 recipe](https://developer.nvidia.com/blog/nvidia-publishes-the-first-deepseek-r1-nvfp4-quantized-model/)):

  | File | Description |
  |------|-------------|
+ | `model-00001-of-00005.safetensors` ... `model-00005-of-00005.safetensors` | Quantized model weights (5 shards, ~43-50 GB each) |
  | `model.safetensors.index.json` | Weight shard index |
  | `config.json` | Model configuration with `quantization_config` |
+ | `hf_quant_config.json` | ModelOpt quantization metadata |
  | `generation_config.json` | Generation configuration |
  | `tokenizer.json` | Tokenizer |
  | `tokenizer_config.json` | Tokenizer configuration |

  ## Limitations

+ - Requires NVIDIA Blackwell GPUs (SM100/SM120) for native NVFP4 inference
  - Quality may differ from the original BF16 model, particularly on tasks sensitive to numerical precision
  - Calibration was bilingual (Korean + English) with code; other languages may see slightly higher degradation
  - This quantization targets the MLP/expert layers only; KV cache is not quantized