Commit e0c33b2 (verified) by anneketh-vij · 1 parent: ff338e1

Update README.md

>
</picture>
</div>
 
# Trinity Nano Preview NVFP4
 
Trinity Nano Preview is a preview of Arcee AI's 6B MoE model with 1B active parameters. It is the small model in our new Trinity family, a series of open-weight models for enterprises and tinkerers alike.
 
This is a chat-tuned model with a delightful personality and charm we think users will love. Note that it pushes the limits of sparsity in small language models, with only 800M non-embedding parameters active per token, and as such **may be unstable** in certain use cases, especially in this preview.
 
This is an *experimental* release: it's fun to talk to, but it will not be hosted anywhere, so download it and try it out yourself!
 
***
 
Trinity Nano Preview is trained on 10T tokens gathered and curated through a key partnership with [Datology](https://www.datologyai.com/), building upon the excellent dataset we used on [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) with additional math and code.
 
Training was performed on a cluster of 512 H200 GPUs powered by [Prime Intellect](https://www.primeintellect.ai/) using HSDP parallelism.
 
More details, including key architecture decisions, can be found in [our blog post](https://www.arcee.ai/blog/the-trinity-manifesto).
 
***
 
**This repository contains the NVFP4 quantized weights of Trinity-Nano-Preview for deployment on NVIDIA Blackwell GPUs.**
 
## Model Details
 
* **Model Architecture:** AfmoeForCausalLM
* **Parameters:** 6B, 1B active
* **Experts:** 128 total, 8 active, 1 shared
* **Context length:** 128k
* **Training Tokens:** 10T
* **License:** [Apache 2.0](https://huggingface.co/arcee-ai/Trinity-Nano-Preview#license)
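To make the expert shape above concrete, here is a minimal sketch of top-k MoE routing — illustrative only, not Arcee's implementation: a router scores all 128 routed experts, the 8 highest-scoring ones are activated with softmax-normalized gates, and the 1 shared expert runs for every token unconditionally.

```python
import numpy as np

NUM_EXPERTS = 128   # routed experts (from the table above)
TOP_K = 8           # experts activated per token
# ...plus 1 shared expert that processes every token, outside the router.

def route(router_logits: np.ndarray):
    """Pick the top-k routed experts for one token and normalize their gate weights."""
    top_idx = np.argsort(router_logits)[-TOP_K:][::-1]      # indices of the 8 highest logits
    gate = np.exp(router_logits[top_idx] - router_logits[top_idx].max())
    gate /= gate.sum()                                      # softmax over the selected experts only
    return top_idx, gate

rng = np.random.default_rng(0)
idx, gate = route(rng.normal(size=NUM_EXPERTS))
print(len(idx), round(float(gate.sum()), 6))  # → 8 1.0
```

Each token's MLP output is then the gate-weighted sum of its 8 routed experts plus the shared expert's output, which is how 6B total parameters yield only ~1B active per token.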
 
***
 
<div align="center">
<picture>
<img src="https://cdn-uploads.huggingface.co/production/uploads/6435718aaaef013d1aec3b8b/sSVjGNHfrJKmQ6w8I18ek.png" style="background-color:ghostwhite;padding:5px;" width="17%" alt="Powered by Datology">
</picture>
</div>
 
## Quantization Details
 
- **Scheme:** NVFP4 (`nvfp4_mlp_only` — MLP/expert weights only, attention remains BF16)
- **Tool:** [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer)
- **Calibration:** 512 samples, seq_length=2048, all-expert calibration enabled
- **KV cache:** Not quantized
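For intuition on what the scheme saves: NVFP4 encodes each weight as a 4-bit FP4 (E2M1) value with one FP8 scale shared per 16-element block, so a quantized tensor costs roughly 4.5 bits per weight versus 16 for BF16. A rough back-of-the-envelope sketch, ignoring the small per-tensor scale and remembering that the attention weights here stay in BF16:

```python
# Approximate NVFP4 storage cost for the quantized (MLP/expert) tensors.
FP4_BITS = 4      # one E2M1 value per weight
SCALE_BITS = 8    # one FP8 scale shared by each block
BLOCK_SIZE = 16   # weights per scaling block
BF16_BITS = 16

bits_per_weight = FP4_BITS + SCALE_BITS / BLOCK_SIZE   # amortized bits per quantized weight
ratio = BF16_BITS / bits_per_weight                    # compression vs BF16 for those tensors

params = 6e9  # total parameter count; an upper bound, since attention is not quantized
gb_if_all_quantized = params * bits_per_weight / 8 / 1e9
print(bits_per_weight, round(ratio, 2), round(gb_if_all_quantized, 2))  # → 4.5 3.56 3.38
```

The real on-disk savings depend on how much of the model is MLP/expert weight versus attention and embeddings, so treat this as a ceiling estimate rather than a measurement.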
 
## Running with vLLM

Requires [vLLM](https://github.com/vllm-project/vllm) >= 0.18.0. Native FP4 compute requires Blackwell GPUs; older GPUs fall back to Marlin weight decompression automatically.
 
```bash
vllm serve arcee-ai/Trinity-Nano-Preview-NVFP4 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --host 0.0.0.0 \
  --port 8000
```
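`vllm serve` exposes an OpenAI-compatible API, so once the server above is running you can query it from any OpenAI-style client. A minimal stdlib-only sketch — the prompt and sampling parameters are illustrative, and `chat()` assumes the server is reachable on localhost:8000:

```python
import json
import urllib.request

# Chat-completions payload for the locally served model.
payload = {
    "model": "arcee-ai/Trinity-Nano-Preview-NVFP4",
    "messages": [{"role": "user", "content": "Introduce yourself in one sentence."}],
    "max_tokens": 128,
    "temperature": 0.7,
}

def chat(host: str = "http://localhost:8000") -> str:
    """POST the payload to vLLM's OpenAI-compatible endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat())` returns the model's reply; the official `openai` Python client works the same way if you point its `base_url` at `http://localhost:8000/v1`.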

**Note (Blackwell pip installs):** If installing vLLM via pip on Blackwell rather than using Docker, native FP4 kernels may produce incorrect output due to package version mismatches. As a workaround, force the Marlin backend:

```bash
export VLLM_NVFP4_GEMM_BACKEND=marlin

vllm serve arcee-ai/Trinity-Nano-Preview-NVFP4 \
  --trust-remote-code \
  --moe-backend marlin \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --host 0.0.0.0 \
  --port 8000
```

Marlin decompresses FP4 weights to BF16 for compute, providing the full memory compression benefit (~3.7× vs BF16) but not the native FP4 compute speedup. On Hopper GPUs (H100/H200), Marlin is selected automatically and no extra flags are needed.
## License

Trinity-Nano-Preview-NVFP4 is released under the Apache-2.0 license.
 