Commit e0c33b2 (verified) by anneketh-vij · 1 parent: ff338e1

Update README.md

>
</picture>
</div>
 
# Trinity Nano Preview NVFP4
 
Trinity Nano Preview is a preview of Arcee AI's 6B MoE model with 1B active parameters. It is the small model in our new Trinity family, a series of open-weight models for enterprises and tinkerers alike.
 
This is a chat-tuned model with a delightful personality and charm we think users will love. Note that it pushes the limits of sparsity in small language models, with only 800M non-embedding parameters active per token, and as such **may be unstable** in certain use cases, especially in this preview.
 
This is an *experimental* release: it's fun to talk to, but it will not be hosted anywhere, so download it and try it out yourself!
 
***
 
Trinity Nano Preview is trained on 10T tokens gathered and curated through a key partnership with [Datology](https://www.datologyai.com/), building upon the excellent dataset we used on [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) with additional math and code.
 
Training was performed on a cluster of 512 H200 GPUs powered by [Prime Intellect](https://www.primeintellect.ai/) using HSDP parallelism.
 
More details, including key architecture decisions, can be found in [our blog post](https://www.arcee.ai/blog/the-trinity-manifesto).
 
***
 
**This repository contains the NVFP4 quantized weights of Trinity-Nano-Preview for deployment on NVIDIA Blackwell GPUs.**
 
## Model Details
 
* **Model Architecture:** AfmoeForCausalLM
* **Parameters:** 6B, 1B active
* **Experts:** 128 total, 8 active, 1 shared
* **Context length:** 128k
* **Training Tokens:** 10T
* **License:** [Apache 2.0](https://huggingface.co/arcee-ai/Trinity-Nano-Preview#license)
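To make the expert shape above concrete, here is a minimal sketch of top-k MoE routing — illustrative only, not Arcee's implementation: a router scores all 128 routed experts, the 8 highest-scoring ones are activated with softmax-normalized gates, and the 1 shared expert runs for every token unconditionally.

```python
import numpy as np

NUM_EXPERTS = 128   # routed experts (from the table above)
TOP_K = 8           # experts activated per token
# ...plus 1 shared expert that processes every token, outside the router.

def route(router_logits: np.ndarray):
    """Pick the top-k routed experts for one token and normalize their gate weights."""
    top_idx = np.argsort(router_logits)[-TOP_K:][::-1]      # indices of the 8 highest logits
    gate = np.exp(router_logits[top_idx] - router_logits[top_idx].max())
    gate /= gate.sum()                                      # softmax over the selected experts only
    return top_idx, gate

rng = np.random.default_rng(0)
idx, gate = route(rng.normal(size=NUM_EXPERTS))
print(len(idx), round(float(gate.sum()), 6))  # → 8 1.0
```

Each token's MLP output is then the gate-weighted sum of its 8 routed experts plus the shared expert's output, which is how 6B total parameters yield only ~1B active per token.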
 
***
 
<div align="center">
<picture>
<img src="https://cdn-uploads.huggingface.co/production/uploads/6435718aaaef013d1aec3b8b/sSVjGNHfrJKmQ6w8I18ek.png" style="background-color:ghostwhite;padding:5px;" width="17%" alt="Powered by Datology">
</picture>
</div>
 
## Quantization Details
 
- **Scheme:** NVFP4 (`nvfp4_mlp_only` — MLP/expert weights only, attention remains BF16)
- **Tool:** [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer)
- **Calibration:** 512 samples, seq_length=2048, all-expert calibration enabled
- **KV cache:** Not quantized
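For intuition on what the scheme saves: NVFP4 encodes each weight as a 4-bit FP4 (E2M1) value with one FP8 scale shared per 16-element block, so a quantized tensor costs roughly 4.5 bits per weight versus 16 for BF16. A rough back-of-the-envelope sketch, ignoring the small per-tensor scale and remembering that the attention weights here stay in BF16:

```python
# Approximate NVFP4 storage cost for the quantized (MLP/expert) tensors.
FP4_BITS = 4      # one E2M1 value per weight
SCALE_BITS = 8    # one FP8 scale shared by each block
BLOCK_SIZE = 16   # weights per scaling block
BF16_BITS = 16

bits_per_weight = FP4_BITS + SCALE_BITS / BLOCK_SIZE   # amortized bits per quantized weight
ratio = BF16_BITS / bits_per_weight                    # compression vs BF16 for those tensors

params = 6e9  # total parameter count; an upper bound, since attention is not quantized
gb_if_all_quantized = params * bits_per_weight / 8 / 1e9
print(bits_per_weight, round(ratio, 2), round(gb_if_all_quantized, 2))  # → 4.5 3.56 3.38
```

The real on-disk savings depend on how much of the model is MLP/expert weight versus attention and embeddings, so treat this as a ceiling estimate rather than a measurement.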
 
## Running with vLLM

Requires [vLLM](https://github.com/vllm-project/vllm) >= 0.18.0. Native FP4 compute requires Blackwell GPUs; older GPUs fall back to Marlin weight decompression automatically.
 
```bash
vllm serve arcee-ai/Trinity-Nano-Preview-NVFP4 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --host 0.0.0.0 \
  --port 8000
```
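`vllm serve` exposes an OpenAI-compatible API, so once the server above is running you can query it from any OpenAI-style client. A minimal stdlib-only sketch — the prompt and sampling parameters are illustrative, and `chat()` assumes the server is reachable on localhost:8000:

```python
import json
import urllib.request

# Chat-completions payload for the locally served model.
payload = {
    "model": "arcee-ai/Trinity-Nano-Preview-NVFP4",
    "messages": [{"role": "user", "content": "Introduce yourself in one sentence."}],
    "max_tokens": 128,
    "temperature": 0.7,
}

def chat(host: str = "http://localhost:8000") -> str:
    """POST the payload to vLLM's OpenAI-compatible endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat())` returns the model's reply; the official `openai` Python client works the same way if you point its `base_url` at `http://localhost:8000/v1`.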

**Note (Blackwell pip installs):** If installing vLLM via pip on Blackwell rather than using Docker, native FP4 kernels may produce incorrect output due to package version mismatches. As a workaround, force the Marlin backend:

```bash
export VLLM_NVFP4_GEMM_BACKEND=marlin

vllm serve arcee-ai/Trinity-Nano-Preview-NVFP4 \
  --trust-remote-code \
  --moe-backend marlin \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --host 0.0.0.0 \
  --port 8000
```

Marlin decompresses FP4 weights to BF16 for compute, providing the full memory compression benefit (~3.7× vs BF16) but not the native FP4 compute speedup. On Hopper GPUs (H100/H200), Marlin is selected automatically and no extra flags are needed.
## License

Trinity-Nano-Preview-NVFP4 is released under the Apache-2.0 license.
 