Trinity-Large-Thinking-NVFP4
Introduction
Trinity-Large-Thinking is a reasoning-optimized variant of Arcee AI's Trinity-Large family — a 398B-parameter sparse Mixture-of-Experts (MoE) model with approximately 13B active parameters per token, post-trained with extended chain-of-thought reasoning and agentic RL.
This repository contains the NVFP4 quantized weights of Trinity-Large-Thinking for deployment on NVIDIA Blackwell GPUs.
For full model details, benchmarks, and usage guidance, see the main Trinity-Large-Thinking model card.
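To make the sparse-MoE cost claim above concrete, a quick back-of-the-envelope check (using only the figures stated in this card) shows what fraction of the 398B parameters is exercised per token:

```python
# Rough arithmetic for the sparse-MoE claim: only a small fraction
# of the total parameters participates in any single forward pass.
total_params = 398e9   # total parameters (from this card)
active_params = 13e9   # approx. active parameters per token (from this card)

active_fraction = active_params / total_params
print(f"active fraction per token ~ {active_fraction:.1%}")  # ~ 3.3%
```

So per-token compute scales with roughly 3% of the model's weights, while memory still has to hold all of them, which is exactly the pressure that quantization relieves.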
Quantization Details
- Scheme: NVFP4 (nvfp4_experts_only: MoE expert weights only; attention and dense layers remain BF16)
- Tool: NVIDIA ModelOpt
- Calibration: 2048 samples, seq_length=4096
- KV cache: Not quantized
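A rough sketch of what the experts-only NVFP4 scheme buys in memory, assuming the NVFP4 layout of 4-bit packed values with one FP8 (E4M3) scale per 16-element block; the expert share of the weights below is a guessed illustration, not a figure from this card:

```python
def nvfp4_bytes_per_param(block_size=16):
    # 4-bit packed value (0.5 B) + one FP8 scale byte per block of 16
    return 0.5 + 1.0 / block_size

def checkpoint_gib(total=398e9, expert_frac=0.95):
    # expert_frac is a hypothetical share of weights living in MoE expert
    # matrices; attention and dense layers stay BF16 (2 bytes/param)
    expert_bytes = total * expert_frac * nvfp4_bytes_per_param()
    bf16_bytes = total * (1 - expert_frac) * 2.0
    return (expert_bytes + bf16_bytes) / 2**30

print(f"~{checkpoint_gib():.0f} GiB vs ~{398e9 * 2 / 2**30:.0f} GiB in BF16")
```

The exact checkpoint size depends on the real expert/dense split; the point is that expert weights dominate an MoE parameter count, so quantizing only them captures most of the compression.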
Usage
Inference tested on:
- Hopper (via the Marlin backend) and a Blackwell B300 node
- vLLM 0.18.0+
- vLLM 0.18.0+
vLLM
Requires vLLM >= 0.18.0. Native FP4 compute requires Blackwell GPUs; older GPUs fall back to Marlin weight decompression automatically.
Blackwell GPUs (B200/B300/GB300) via Docker (recommended)
docker run --runtime nvidia --gpus all -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:v0.18.0-cu130 \
arcee-ai/Trinity-Large-Thinking-NVFP4 \
--trust-remote-code \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--enable-reasoning \
--reasoning-parser deepseek_r1 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
Hopper GPUs (H100/H200) and others
vllm serve arcee-ai/Trinity-Large-Thinking-NVFP4 \
--trust-remote-code \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--enable-reasoning \
--reasoning-parser deepseek_r1 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
Note (For Blackwell pip installs): If installing vLLM via pip on Blackwell rather than using Docker, native FP4 kernels may produce incorrect output due to package version mismatches. As a workaround, force the Marlin backend:
export VLLM_NVFP4_GEMM_BACKEND=marlin

vllm serve arcee-ai/Trinity-Large-Thinking-NVFP4 \
--trust-remote-code \
--tensor-parallel-size 8 \
--moe-backend marlin \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--enable-reasoning \
--reasoning-parser deepseek_r1 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder

Marlin decompresses FP4 weights to BF16 for compute, so you keep the full memory compression benefit but not the native FP4 compute speedup. On Hopper GPUs (H100/H200), Marlin is selected automatically and no extra flags are needed.
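The --reasoning-parser deepseek_r1 flag used in the serve commands above makes the server return the chain-of-thought separately from the final answer. A minimal local sketch of that split (a hypothetical helper, for illustration only; the server does this for you):

```python
import re

def split_reasoning(text: str):
    """Separate <think>...</think> reasoning from the final answer,
    mimicking what a deepseek_r1-style reasoning parser produces."""
    m = re.match(r"(?s)\s*<think>(.*?)</think>\s*(.*)", text)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return None, text.strip()

reasoning, answer = split_reasoning("<think>2+2=4</think>The answer is 4.")
print(reasoning)  # 2+2=4
print(answer)     # The answer is 4.
```

With the flag enabled you receive the same separation in the API response's reasoning field instead of having to parse the tags yourself.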
Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "arcee-ai/Trinity-Large-Thinking-NVFP4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
trust_remote_code=True
)
messages = [{"role": "user", "content": "Who are you?"}]
input_ids = tokenizer.apply_chat_template(
messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=4096, do_sample=True, temperature=0.3, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
API
Works out of the box on OpenRouter as arcee-ai/trinity-large-thinking.
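Both a local vLLM server and OpenRouter speak the OpenAI chat-completions format, so the same request body works against either. An illustrative payload (sampling values are examples, not recommendations from this card):

```python
import json

# Illustrative body for a POST to /v1/chat/completions on either endpoint;
# swap the model id for "arcee-ai/trinity-large-thinking" on OpenRouter.
payload = {
    "model": "arcee-ai/Trinity-Large-Thinking-NVFP4",
    "messages": [{"role": "user", "content": "Who are you?"}],
    "temperature": 0.3,
    "top_p": 0.95,
    "max_tokens": 4096,
}
body = json.dumps(payload)
```

Send it with any HTTP client, e.g. curl against http://localhost:8000/v1/chat/completions for a local vLLM server.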
License
Trinity-Large-Thinking-NVFP4 is released under the Apache License, Version 2.0.
Citation
If you use this model, please cite:
@misc{singh2026arceetrinity,
title = {Arcee Trinity Large Technical Report},
author = {Varun Singh and Lucas Krauss and Sami Jaghouar and Matej Sirovatka and Charles Goddard and Fares Obied and Jack Min Ong and Jannik Straube and Fern and Aria Harley and Conner Stewart and Colin Kealty and Maziyar Panahi and Simon Kirsten and Anushka Deshpande and Anneketh Vij and Arthur Bresnu and Pranav Veldurthi and Raghav Ravishankar and Hardik Bishnoi and DatologyAI Team and Arcee AI Team and Prime Intellect Team and Mark McQuade and Johannes Hagemann and Lucas Atkins},
year = {2026},
eprint = {2602.17004},
archivePrefix= {arXiv},
primaryClass = {cs.LG},
doi = {10.48550/arXiv.2602.17004},
url = {https://arxiv.org/abs/2602.17004}
}
Base model: arcee-ai/Trinity-Large-TrueBase