Arcee Trinity Nano

Trinity Nano Preview NVFP4

Trinity Nano Preview is a preview release of Arcee AI's 6B-parameter mixture-of-experts (MoE) model with 1B active parameters. It is the smallest model in our new Trinity family, a series of open-weight models for enterprises and tinkerers alike.

This is a chat-tuned model with a delightful personality and charm we think users will love. Note that this model pushes the limits of sparsity in small language models, with only 800M non-embedding parameters active per token; as such, it may be unstable in certain use cases, especially in this preview.

This is an experimental release: it's fun to talk to, but it will not be hosted anywhere, so download it and try it out yourself!


Trinity Nano Preview is trained on 10T tokens gathered and curated through a key partnership with Datology, building upon the excellent dataset we used for AFM-4.5B with additional math and code data.

Training was performed on a cluster of 512 H200 GPUs powered by Prime Intellect using HSDP parallelism.
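As a rough illustration of how HSDP (hybrid sharded data parallelism) lays out such a cluster, parameters are sharded within each group of GPUs and replicated across groups. The group sizes below are hypothetical assumptions for illustration, not Arcee's actual training configuration:

```python
# Illustrative HSDP layout (hypothetical numbers, not the actual training config):
# within each shard group the parameters are split across GPUs (FSDP-style),
# and the groups themselves hold full replicas for data parallelism.
WORLD_SIZE = 512            # total H200 GPUs in the cluster
SHARD_GROUP_SIZE = 8        # e.g. one node: parameters sharded across these GPUs
NUM_REPLICAS = WORLD_SIZE // SHARD_GROUP_SIZE  # replica groups for data parallelism

TOTAL_PARAMS = 6_000_000_000
params_per_gpu = TOTAL_PARAMS // SHARD_GROUP_SIZE  # each GPU stores 1/8 of the weights

print(NUM_REPLICAS, params_per_gpu)
```

Sharding only within a node keeps the all-gather traffic on fast intra-node links, while gradient reduction across replicas uses the slower inter-node fabric.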

More details, including key architecture decisions, can be found on our blog.


This repository contains the NVFP4 quantized weights of Trinity-Nano-Preview for deployment on NVIDIA Blackwell GPUs.

Model Details

  • Model Architecture: AfmoeForCausalLM
  • Parameters: 6B, 1B active
  • Experts: 128 total, 8 active, 1 shared
  • Context length: 128k
  • Training Tokens: 10T
  • License: Apache 2.0

Powered by Datology

Quantization Details

  • Scheme: NVFP4 (nvfp4_mlp_only — MLP/expert weights only, attention remains BF16)
  • Tool: NVIDIA ModelOpt
  • Calibration: 512 samples, seq_length=2048, all-expert calibration enabled
  • KV cache: Not quantized
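To give a sense of what NVFP4 quantization does, here is a minimal NumPy sketch of blockwise FP4 quantization: weights are split into 16-element blocks, each stored as 4-bit E2M1 codes plus one shared scale (NVFP4 stores the scale in FP8 E4M3). This is an illustrative sketch of the scheme, not ModelOpt's actual implementation:

```python
import numpy as np

# The 8 non-negative values representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bits).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-FP4_GRID[::-1], FP4_GRID])  # signed grid

def quantize_block(block):
    """Quantize one 16-element block: map its max magnitude to FP4's max (6.0)."""
    scale = np.abs(block).max() / 6.0
    if scale == 0:
        return np.zeros_like(block), 0.0
    # Snap each scaled weight to the nearest representable FP4 value.
    codes = GRID[np.abs(block[:, None] / scale - GRID).argmin(axis=1)]
    return codes, scale

def dequantize(codes, scale):
    return codes * scale

w = np.random.default_rng(0).normal(size=16).astype(np.float32)
codes, scale = quantize_block(w)
w_hat = dequantize(codes, scale)
```

In this card's scheme (`nvfp4_mlp_only`), only MLP/expert weights go through this process; attention weights stay in BF16.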

Running with vLLM

Requires vLLM >= 0.18.0. Native FP4 compute requires Blackwell GPUs; older GPUs fall back to Marlin weight decompression automatically.

Blackwell GPUs (B200/B300/GB300) — Docker (recommended)

docker run --runtime nvidia --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.18.0-cu130 \
  arcee-ai/Trinity-Nano-Preview-NVFP4 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
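Once the server is up, it exposes vLLM's OpenAI-compatible API. A minimal Python client sketch (the actual request line is commented out so this runs without a live server; the prompt and `max_tokens` value are arbitrary):

```python
import json
from urllib import request

# Build a chat completion request against the local vLLM server started above.
payload = {
    "model": "arcee-ai/Trinity-Nano-Preview-NVFP4",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment when the server is running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```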

Hopper GPUs (H100/H200) and others

vllm serve arcee-ai/Trinity-Nano-Preview-NVFP4 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --host 0.0.0.0 \
  --port 8000

Note (Blackwell pip installs): If installing vLLM via pip on Blackwell rather than using Docker, native FP4 kernels may produce incorrect output due to package version mismatches. As a workaround, force the Marlin backend:

export VLLM_NVFP4_GEMM_BACKEND=marlin

vllm serve arcee-ai/Trinity-Nano-Preview-NVFP4 \
  --trust-remote-code \
  --moe-backend marlin \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --host 0.0.0.0 \
  --port 8000

Marlin decompresses FP4 weights to BF16 for compute, providing the full memory compression benefit (~3.7× vs BF16) but not native FP4 compute speedup. On Hopper GPUs (H100/H200), Marlin is selected automatically and no extra flags are needed.
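The compression figure can be ballparked with simple arithmetic. Assuming 4-bit codes and one 8-bit scale per 16-element block (and ignoring the small per-tensor scale), the quantized tensors shrink by roughly 3.6× versus BF16; the quoted ~3.7× may count overheads slightly differently:

```python
# Back-of-envelope NVFP4 footprint (assumptions: 4-bit codes,
# one 8-bit FP8 scale shared per 16-element block).
bits_bf16 = 16
bits_nvfp4 = 4 + 8 / 16          # 4.5 effective bits per weight
ratio = bits_bf16 / bits_nvfp4   # compression for the quantized tensors

print(round(ratio, 2))
```

Since attention weights stay in BF16 under `nvfp4_mlp_only`, the whole-model ratio is somewhat lower than the per-tensor figure.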

License

Trinity-Nano-Preview-NVFP4 is released under the Apache-2.0 license.
