Text Generation
Transformers
Safetensors
afmoe
Mixture of Experts
nvfp4
modelopt
blackwell
vllm
conversational
custom_code
8-bit precision
Instructions to use arcee-ai/Trinity-Mini-NVFP4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use arcee-ai/Trinity-Mini-NVFP4 with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="arcee-ai/Trinity-Mini-NVFP4", trust_remote_code=True) messages = [ {"role": "user", "content": "Who are you?"}, ] pipe(messages)# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("arcee-ai/Trinity-Mini-NVFP4", trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained("arcee-ai/Trinity-Mini-NVFP4", trust_remote_code=True) messages = [ {"role": "user", "content": "Who are you?"}, ] inputs = tokenizer.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=40) print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use arcee-ai/Trinity-Mini-NVFP4 with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "arcee-ai/Trinity-Mini-NVFP4" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "arcee-ai/Trinity-Mini-NVFP4", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/arcee-ai/Trinity-Mini-NVFP4
- SGLang
How to use arcee-ai/Trinity-Mini-NVFP4 with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "arcee-ai/Trinity-Mini-NVFP4" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "arcee-ai/Trinity-Mini-NVFP4", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "arcee-ai/Trinity-Mini-NVFP4" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "arcee-ai/Trinity-Mini-NVFP4", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }' - Docker Model Runner
How to use arcee-ai/Trinity-Mini-NVFP4 with Docker Model Runner:
docker model run hf.co/arcee-ai/Trinity-Mini-NVFP4
File size: 4,119 Bytes
1068478 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 | ---
license: apache-2.0
language:
- en
- es
- fr
- de
- it
- pt
- ru
- ar
- hi
- ko
- zh
library_name: transformers
base_model:
- arcee-ai/Trinity-Mini
base_model_relation: quantized
tags:
- moe
- nvfp4
- modelopt
- blackwell
- vllm
---
<div align="center">
<picture>
<img
src="https://cdn-uploads.huggingface.co/production/uploads/6435718aaaef013d1aec3b8b/i-v1KyAMOW_mgVGeic9WJ.png"
alt="Arcee Trinity Mini"
style="max-width: 100%; height: auto;"
>
</picture>
</div>
# Trinity Mini NVFP4
**This repository contains the NVFP4 quantized weights of Trinity-Mini for deployment on NVIDIA Blackwell GPUs.**
Trinity Mini is an Arcee AI 26B MoE model with 3B active parameters. It is the medium-sized model in our new Trinity family, a series of open-weight models for enterprise and tinkerers alike.
This model is tuned for reasoning, but in testing, it uses a similar total token count to competitive instruction-tuned models.
***
Trinity Mini is trained on 10T tokens gathered and curated through a key partnership with [Datology](https://www.datologyai.com/), building upon the excellent dataset we used on [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) with additional math and code.
Training was performed on a cluster of 512 H200 GPUs powered by [Prime Intellect](https://www.primeintellect.ai/) using HSDP parallelism.
More details, including key architecture decisions, can be found on our blog [here](https://www.arcee.ai/blog/the-trinity-manifesto)
***
## Model Details
* **Model Architecture:** AfmoeForCausalLM
* **Parameters:** 26B, 3B active
* **Experts:** 128 total, 8 active, 1 shared
* **Context length:** 128k
* **Training Tokens:** 10T
* **License:** [Apache 2.0](https://huggingface.co/arcee-ai/Trinity-Mini#license)
* **Recommended settings:**
* temperature: 0.15
* top_k: 50
* top_p: 0.75
* min_p: 0.06
***
## Benchmarks

<div align="center">
<picture>
<img src="https://cdn-uploads.huggingface.co/production/uploads/6435718aaaef013d1aec3b8b/sSVjGNHfrJKmQ6w8I18ek.png" style="background-color:ghostwhite;padding:5px;" width="17%" alt="Powered by Datology">
</picture>
</div>
## Quantization Details
- **Scheme:** NVFP4 (`nvfp4_mlp_only` — MLP/expert weights only, attention remains BF16)
- **Tool:** [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer)
- **Calibration:** 512 samples, seq_length=2048, all-expert calibration enabled
- **KV cache:** Not quantized
## Running with vLLM
Requires [vLLM](https://github.com/vllm-project/vllm) >= 0.18.0. Native FP4 compute requires Blackwell GPUs; older GPUs fall back to Marlin weight decompression automatically.
### Blackwell GPUs (B200/B300/GB300) — Docker (recommended)
```bash
docker run --runtime nvidia --gpus all -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:v0.18.0-cu130 \
arcee-ai/Trinity-Mini-NVFP4 \
--trust-remote-code \
--gpu-memory-utilization 0.90 \
--max-model-len 8192
```
### Hopper GPUs (H100/H200) and others
```bash
vllm serve arcee-ai/Trinity-Mini-NVFP4 \
--trust-remote-code \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--host 0.0.0.0 \
--port 8000
```
**Note (Blackwell pip installs):** If installing vLLM via pip on Blackwell rather than using Docker, native FP4 kernels may produce incorrect output due to package version mismatches. As a workaround, force the Marlin backend:
```bash
export VLLM_NVFP4_GEMM_BACKEND=marlin
vllm serve arcee-ai/Trinity-Mini-NVFP4 \
--trust-remote-code \
--moe-backend marlin \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--host 0.0.0.0 \
--port 8000
```
Marlin decompresses FP4 weights to BF16 for compute, providing the full memory compression benefit (~3.7× vs BF16) but not native FP4 compute speedup. On Hopper GPUs (H100/H200), Marlin is selected automatically and no extra flags are needed.
## License
Trinity-Mini-NVFP4 is released under the Apache-2.0 license. |