---
license: apache-2.0
language:
- en
- es
- fr
- de
- it
- pt
- ru
- ar
- hi
- ko
- zh
library_name: transformers
base_model:
- arcee-ai/Trinity-Large-Thinking
base_model_relation: quantized
tags:
- reasoning
- agentic
- tool-calling
- thinking
- moe
- nvfp4
- modelopt
- blackwell
- vllm
---
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable html -->
<!-- markdownlint-disable no-duplicate-header -->
<div align="center">
<picture>
<img
src="https://cdn-uploads.huggingface.co/production/uploads/6435718aaaef013d1aec3b8b/i-v1KyAMOW_mgVGeic9WJ.png"
alt="Arcee Trinity Large Thinking"
style="max-width: 100%; height: auto;"
>
</picture>
</div>
<hr>

# Trinity-Large-Thinking-NVFP4

## Introduction

Trinity-Large-Thinking is a reasoning-optimized variant of Arcee AI's Trinity-Large family: a 398B-parameter sparse Mixture-of-Experts (MoE) model with approximately 13B active parameters per token, post-trained with extended chain-of-thought reasoning and agentic RL.

**This repository contains the NVFP4-quantized weights of Trinity-Large-Thinking for deployment on NVIDIA Blackwell GPUs.**

For full model details, benchmarks, and usage guidance, see the main [Trinity-Large-Thinking](https://huggingface.co/arcee-ai/Trinity-Large-Thinking) model card.
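To get a rough sense of what expert-only quantization buys at this scale, the sketch below estimates weight memory. The 398B total parameter count comes from this card; the expert fraction (0.95) and the per-parameter bit overheads are illustrative assumptions, not published figures.

```python
# Rough GPU-memory estimate for expert-only NVFP4 quantization.
# ASSUMPTION: expert_fraction=0.95 is a placeholder, not a published split.

def weight_gib(total_params: float, expert_fraction: float,
               expert_bits: float = 4.5, dense_bits: float = 16.0) -> float:
    """Approximate weight memory in GiB.

    expert_bits=4.5 approximates NVFP4 storage: 4-bit E2M1 values plus
    an FP8 scale shared per 16-element block (8 / 16 = 0.5 extra bits).
    """
    expert_bytes = total_params * expert_fraction * expert_bits / 8
    dense_bytes = total_params * (1 - expert_fraction) * dense_bits / 8
    return (expert_bytes + dense_bytes) / 1024**3

TOTAL = 398e9  # parameters (from the model card)

bf16 = weight_gib(TOTAL, expert_fraction=0.0)    # everything in BF16
nvfp4 = weight_gib(TOTAL, expert_fraction=0.95)  # experts in NVFP4

print(f"BF16 weights:       ~{bf16:,.0f} GiB")
print(f"NVFP4 expert-only:  ~{nvfp4:,.0f} GiB")
print(f"Per GPU at TP=8:    ~{nvfp4 / 8:,.0f} GiB")
```

Under these assumptions the quantized checkpoint fits comfortably on an 8-GPU node, where the BF16 weights alone would not.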

## Quantization Details

- **Scheme:** NVFP4 (`nvfp4_experts_only`: MoE expert weights only; attention and dense layers remain BF16)
- **Tool:** [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer)
- **Calibration:** 2048 samples, `seq_length=4096`
- **KV cache:** Not quantized
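For intuition, the toy snippet below simulates the numerics of NVFP4 block quantization on one 16-element block: a per-block scale plus rounding to the nearest FP4 (E2M1) representable value. This is an illustration only, not the ModelOpt implementation; real NVFP4 also stores the block scale in FP8 and applies a second tensor-level scale, and ModelOpt handles calibration and weight packing.

```python
# Toy simulation of NVFP4 block quantization (illustrative only).

# The 8 non-negative magnitudes representable by FP4 E2M1.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one 16-value block: per-block scale + nearest E2M1 code."""
    scale = max(abs(x) for x in block) / 6.0 or 1.0  # map the max onto 6.0
    codes = []
    for x in block:
        mag = min(E2M1, key=lambda m: abs(abs(x) / scale - m))
        codes.append(mag if x >= 0 else -mag)
    return scale, codes

def dequantize_block(scale, codes):
    return [scale * c for c in codes]

block = [0.12, -0.03, 0.47, -0.25, 0.0, 0.31, -0.44, 0.08,
         0.19, -0.11, 0.02, 0.38, -0.29, 0.15, -0.07, 0.24]
scale, codes = quantize_block(block)
approx = dequantize_block(scale, codes)
err = max(abs(a - b) for a, b in zip(block, approx))
print(f"scale={scale:.4f}, max abs error={err:.4f}")
```

The small block size is the point of the format: a fresh scale every 16 values keeps the coarse 4-bit grid locally well-matched to the weight distribution.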

## Usage

### Tested inference configurations

- Hopper (via the Marlin backend) and a Blackwell B300 node
- vLLM 0.18.0+

### vLLM

Requires [vLLM](https://github.com/vllm-project/vllm) >= 0.18.0. Native FP4 compute requires Blackwell GPUs; older GPUs automatically fall back to the Marlin backend, which decompresses the FP4 weights.

#### Blackwell GPUs (B200/B300/GB300): Docker (recommended)

```bash
docker run --runtime nvidia --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.18.0-cu130 \
  arcee-ai/Trinity-Large-Thinking-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```

#### Hopper GPUs (H100/H200) and others

```bash
vllm serve arcee-ai/Trinity-Large-Thinking-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --enable-reasoning \
  --reasoning-parser deepseek_r1 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
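Once the server is up, it speaks the OpenAI-compatible chat API on port 8000, and with a reasoning parser enabled vLLM returns the chain of thought in a `reasoning_content` field alongside the usual `content`. The sketch below builds a request payload and shows the response handling; the mock response dict is fabricated for illustration so the snippet runs without a server.

```python
import json

BASE_URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "arcee-ai/Trinity-Large-Thinking-NVFP4",
    "messages": [{"role": "user", "content": "What is 17 * 24?"}],
    "temperature": 0.3,
    "max_tokens": 2048,
}

def split_reply(response: dict) -> tuple:
    """Return (reasoning, answer) from a chat-completion response dict."""
    msg = response["choices"][0]["message"]
    return msg.get("reasoning_content") or "", msg.get("content") or ""

# To actually call the server:
#   import urllib.request
#   req = urllib.request.Request(BASE_URL, json.dumps(payload).encode(),
#                                {"Content-Type": "application/json"})
#   response = json.load(urllib.request.urlopen(req))
# Mock response (fabricated) so this sketch runs offline:
response = {"choices": [{"message": {
    "reasoning_content": "17 * 24 = 17 * 25 - 17 = 408.",
    "content": "17 * 24 = 408."}}]}

reasoning, answer = split_reply(response)
print("reasoning:", reasoning)
print("answer:", answer)
```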

> **Note (Blackwell pip installs):** If you install vLLM via pip on Blackwell rather than using the Docker image, the native FP4 kernels may produce incorrect output due to package version mismatches. As a workaround, force the Marlin backend:
>
> ```bash
> export VLLM_NVFP4_GEMM_BACKEND=marlin
>
> vllm serve arcee-ai/Trinity-Large-Thinking-NVFP4 \
>   --trust-remote-code \
>   --tensor-parallel-size 8 \
>   --moe-backend marlin \
>   --gpu-memory-utilization 0.90 \
>   --max-model-len 8192 \
>   --enable-reasoning \
>   --reasoning-parser deepseek_r1 \
>   --enable-auto-tool-choice \
>   --tool-call-parser qwen3_coder
> ```
>
> Marlin decompresses the FP4 weights to BF16 for compute, so you keep the full memory compression benefit but not the native FP4 compute speedup. On Hopper GPUs (H100/H200), Marlin is selected automatically and no extra flags are needed.

### Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "arcee-ai/Trinity-Large-Thinking-NVFP4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Who are you?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.3,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
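When decoding raw output like this, the model's chain of thought arrives inline rather than in a separate field. Assuming the chat template emits DeepSeek-R1-style `<think>...</think>` tags (consistent with the `deepseek_r1` reasoning parser used in the vLLM commands above), a small helper can separate thinking from the final answer:

```python
import re

def split_thinking(text: str) -> tuple:
    """Split raw model output into (thinking, answer), assuming
    DeepSeek-R1-style <think>...</think> tags; if no tags are found,
    the whole text is treated as the answer."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()

# Fabricated sample output for illustration:
raw = ("<think>The user asks who I am. A brief introduction suffices.</think>"
       "I'm Trinity-Large-Thinking, a reasoning model by Arcee AI.")
thinking, answer = split_thinking(raw)
print("thinking:", thinking)
print("answer:", answer)
```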

### API

Works out of the box on [OpenRouter](https://openrouter.ai/) as `arcee-ai/trinity-large-thinking`.

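A minimal sketch of an OpenRouter call using only the standard library, assuming OpenRouter's OpenAI-compatible chat-completions endpoint and an `OPENROUTER_API_KEY` environment variable (the request is only sent when the key is set):

```python
import json
import os
import urllib.request

payload = {
    "model": "arcee-ai/trinity-large-thinking",
    "messages": [{"role": "user", "content": "Summarize NVFP4 in one sentence."}],
}

api_key = os.environ.get("OPENROUTER_API_KEY")
if api_key:
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    print(reply["choices"][0]["message"]["content"])
else:
    print("Set OPENROUTER_API_KEY to send the request.")
```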

## License

Trinity-Large-Thinking-NVFP4 is released under the Apache License, Version 2.0.

## Citation

If you use this model, please cite:

```bibtex
@misc{singh2026arceetrinity,
  title         = {Arcee Trinity Large Technical Report},
  author        = {Varun Singh and Lucas Krauss and Sami Jaghouar and Matej Sirovatka and Charles Goddard and Fares Obied and Jack Min Ong and Jannik Straube and Fern and Aria Harley and Conner Stewart and Colin Kealty and Maziyar Panahi and Simon Kirsten and Anushka Deshpande and Anneketh Vij and Arthur Bresnu and Pranav Veldurthi and Raghav Ravishankar and Hardik Bishnoi and DatologyAI Team and Arcee AI Team and Prime Intellect Team and Mark McQuade and Johannes Hagemann and Lucas Atkins},
  year          = {2026},
  eprint        = {2602.17004},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  doi           = {10.48550/arXiv.2602.17004},
  url           = {https://arxiv.org/abs/2602.17004}
}
```