---
license: apache-2.0
tags:
- neuron
- aws-inferentia
- inf2
- moe
- pre-compiled
- neuronx-distributed-inference
base_model: arcee-ai/Trinity-Nano-Preview
library_name: neuronx-distributed-inference
---
# Trinity-Nano Pre-Compiled for AWS Inferentia2 (TP=1)
A pre-compiled, pre-sharded build of [Trinity-Nano-Preview](https://huggingface.co/arcee-ai/Trinity-Nano-Preview) (MoE, ~6B total parameters, ~1B active per token) for AWS Neuron SDK 2.28, ready to load on **inf2.xlarge** (16GB system RAM) or any larger Inferentia2/Trainium instance.
## Why Pre-Sharded?
The standard NxDI load path downloads the full HuggingFace checkpoint (~12GB bf16) into CPU RAM for weight conversion and sharding. On inf2.xlarge (16GB system RAM), that conversion is OOM-killed once RSS climbs past 15GB.
Pre-sharded weights bypass this step entirely: NxDI reads directly from the per-rank sharded files, peaking at only **1.4 GB RSS** (under 9% of system RAM).
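To verify this on your own instance, peak RSS can be read from the Python standard library after `model.load(...)` returns. A minimal sketch (note that Linux reports `ru_maxrss` in KiB, while macOS reports bytes):

```python
import resource

def peak_rss_gib() -> float:
    """Peak resident set size of this process, in GiB (Linux reports KiB)."""
    kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return kib / (1024 ** 2)

# After model.load(...), a value around 1.4 GiB confirms the pre-sharded path
# was taken; the checkpoint-conversion path would show well over 12 GiB.
print(f"Peak RSS: {peak_rss_gib():.2f} GiB")
```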
## Contents
| File | Size | Description |
|------|------|-------------|
| `model.pt` | 49 MB | Compiled Neuron NEFF graphs |
| `neuron_config.json` | 9 KB | NxDI configuration (TP=1, BS=1, seq_len=2048, bf16) |
| `weights/tp0_sharded_checkpoint.safetensors` | 12 GB | Pre-sharded model weights for rank 0 |
## Performance
Measured on inf2.xlarge (1 NeuronCore, 16GB system RAM):
| Metric | Value |
|--------|-------|
| TTFT | 706 ms |
| TKG (per token) | 9.0 ms |
| Throughput | 112 tok/s |
| Load time | 18.4 s |
| Peak RSS | 1.39 GB |
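The throughput figure follows directly from the per-token decode latency; a quick sanity check:

```python
# Steady-state decode throughput is the inverse of TKG (per-token latency).
tkg_ms = 9.0                  # ms per generated token, from the table above
tok_per_s = 1000.0 / tkg_ms   # ~111 tok/s, consistent with the measured 112
print(f"{tok_per_s:.0f} tok/s")
```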
## Quick Start
### Prerequisites
- AWS instance with Inferentia2: inf2.xlarge, inf2.8xlarge, or larger
- [Deep Learning AMI Neuron (Ubuntu 24.04) 20260227](https://aws.amazon.com/marketplace/) (SDK 2.28)
- Activate the pre-installed venv: `source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate`
### 1. Clone the model implementation
The Trinity Neuron implementation is not yet merged into the main NxDI repo. Use the contrib branch from the fork:
```bash
git clone --branch contrib/trinity-model --single-branch \
https://github.com/jimburtoft/neuronx-distributed-inference.git nxdi-trinity
```
### 2. Download this artifact and the base model config/tokenizer
```python
from huggingface_hub import snapshot_download

# Download the pre-compiled artifact (model.pt + sharded weights)
snapshot_download(
    "jburtoft/Trinity-Nano-Neuron-TP1",
    local_dir="/home/ubuntu/Trinity-Nano-Neuron-TP1",
)

# Download config + tokenizer only (no model weights needed)
snapshot_download(
    "arcee-ai/Trinity-Nano-Preview",
    local_dir="/home/ubuntu/Trinity-Nano-Preview",
    ignore_patterns=["*.safetensors", "*.bin", "*.pt", "*.gguf"],
)
```
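`ignore_patterns` is a list of glob patterns matched against file paths in the repo, so the second call skips every weight file while still fetching `config.json` and the tokenizer. A simplified sketch of that filtering, using `fnmatch` (which mirrors the glob semantics `huggingface_hub` applies):

```python
from fnmatch import fnmatch

ignore_patterns = ["*.safetensors", "*.bin", "*.pt", "*.gguf"]

def will_download(path: str) -> bool:
    """True if a repo file survives the ignore_patterns filter (simplified)."""
    return not any(fnmatch(path, pattern) for pattern in ignore_patterns)

for name in ["model-00001-of-00002.safetensors", "config.json", "tokenizer.json"]:
    print(name, "->", "download" if will_download(name) else "skip")
```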
### 3. Load and run inference
```python
import sys

import torch
from transformers import AutoTokenizer

from neuronx_distributed_inference.models.config import MoENeuronConfig

# Point to the Trinity implementation from the cloned repo
sys.path.insert(0, "/home/ubuntu/nxdi-trinity/contrib/models/Trinity/src")
from modeling_trinity import NeuronTrinityForCausalLM, TrinityInferenceConfig

# Build model with save_sharded_checkpoint=True (must match compilation)
neuron_config = MoENeuronConfig(
    tp_degree=1,
    batch_size=1,
    seq_len=2048,
    torch_dtype=torch.bfloat16,
    save_sharded_checkpoint=True,
)
config = TrinityInferenceConfig.from_pretrained(
    "/home/ubuntu/Trinity-Nano-Preview",
    neuron_config=neuron_config,
)
model = NeuronTrinityForCausalLM("/home/ubuntu/Trinity-Nano-Preview", config)
model.load("/home/ubuntu/Trinity-Nano-Neuron-TP1")

# Tokenize
tokenizer = AutoTokenizer.from_pretrained(
    "/home/ubuntu/Trinity-Nano-Preview", trust_remote_code=True
)
prompt = "Hello, how are you today?"
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs.input_ids

# Generate
model.reset()
position_ids = torch.arange(input_ids.shape[1]).unsqueeze(0)
seq_ids = torch.arange(1)
with torch.no_grad():
    outputs = model(input_ids, position_ids=position_ids, seq_ids=seq_ids)
logits = outputs.logits if hasattr(outputs, "logits") else outputs[0]
next_token = torch.argmax(logits[:, -1, :], dim=-1)
print(f"Prompt: {prompt}")
print(f"Next token: {tokenizer.decode(next_token)}")

# Autoregressive generation
generated = [next_token.unsqueeze(0)]
for i in range(31):
    pos = torch.tensor([[input_ids.shape[1] + i]])
    with torch.no_grad():
        outputs = model(generated[-1], position_ids=pos, seq_ids=seq_ids)
    logits = outputs.logits if hasattr(outputs, "logits") else outputs[0]
    next_token = torch.argmax(logits[:, -1, :], dim=-1)
    generated.append(next_token.unsqueeze(0))
text = tokenizer.decode(torch.cat(generated, dim=1)[0], skip_special_tokens=True)
print(f"Generated: {text}")
```
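The loop above decodes greedily with `argmax`. For sampled output instead, here is a drop-in helper for the `next_token` line (a sketch; the temperature and `top_k` defaults are illustrative, not values from this repo):

```python
import torch

def sample_next_token(last_logits: torch.Tensor,
                      temperature: float = 0.7,
                      top_k: int = 50) -> torch.Tensor:
    """Temperature + top-k sampling over last-position logits of shape (batch, vocab)."""
    scaled = last_logits / max(temperature, 1e-5)
    k = min(top_k, scaled.shape[-1])
    topk_vals, topk_idx = torch.topk(scaled, k=k, dim=-1)
    probs = torch.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)  # index into the top-k set
    return topk_idx.gather(-1, choice).squeeze(-1)    # map back to vocabulary ids

# In the loop above, replace:
#   next_token = torch.argmax(logits[:, -1, :], dim=-1)
# with:
#   next_token = sample_next_token(logits[:, -1, :])
```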
## Compilation Details
| Parameter | Value |
|-----------|-------|
| SDK | 2.28 (NxDI 0.8.16251, neuronx-cc 2.23.6484, torch-neuronx 2.9.0.2.12) |
| TP degree | 1 |
| Batch size | 1 |
| Sequence length | 2048 |
| Dtype | bfloat16 |
| `save_sharded_checkpoint` | True |
## Compiling Your Own
To compile for different configurations (e.g., TP=2, BS=4), you need a larger instance (inf2.8xlarge or trn2.3xlarge):
```python
import sys

import torch

from neuronx_distributed_inference.models.config import MoENeuronConfig

sys.path.insert(0, "/path/to/nxdi-trinity/contrib/models/Trinity/src")
from modeling_trinity import NeuronTrinityForCausalLM, TrinityInferenceConfig

neuron_config = MoENeuronConfig(
    tp_degree=1,                   # Adjust as needed
    batch_size=1,                  # Adjust as needed
    seq_len=2048,                  # Adjust as needed
    torch_dtype=torch.bfloat16,
    save_sharded_checkpoint=True,  # Required for pre-sharded deployment
)
config = TrinityInferenceConfig.from_pretrained(
    "/path/to/Trinity-Nano-Preview", neuron_config=neuron_config
)
model = NeuronTrinityForCausalLM("/path/to/Trinity-Nano-Preview", config)
model.compile("/path/to/compiled-output")
# Output: model.pt, neuron_config.json, weights/tp{rank}_sharded_checkpoint.safetensors
```
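As a rough capacity check before choosing `tp_degree`: bf16 weights take 2 bytes per parameter, and each rank's shard is roughly the total divided by the TP degree (illustrative arithmetic that ignores replicated tensors):

```python
def shard_size_gb(total_params: float, tp_degree: int, bytes_per_param: int = 2) -> float:
    """Approximate per-rank sharded-checkpoint size in GB for bf16 weights."""
    return total_params * bytes_per_param / tp_degree / 1e9

# ~6B parameters in bf16 at TP=1 -> ~12 GB, matching the shard in this repo;
# TP=2 would halve each rank's shard to ~6 GB.
print(f"{shard_size_gb(6e9, 1):.0f} GB")
```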
## Base Model
- **Model**: [arcee-ai/Trinity-Nano-Preview](https://huggingface.co/arcee-ai/Trinity-Nano-Preview)
- **Architecture**: MoE (128 experts, top-8 active, 1 shared expert)
- **Parameters**: ~6B total, ~1B active per token
- **License**: Apache 2.0
## Model Implementation
The NeuronX Distributed Inference implementation for Trinity is available at:
[github.com/jimburtoft/neuronx-distributed-inference](https://github.com/jimburtoft/neuronx-distributed-inference/tree/contrib/trinity-model/contrib/models/Trinity) (branch: `contrib/trinity-model`)
This implementation supports all three Trinity model sizes (Nano, Mini, Large) with a single unified `modeling_trinity.py`.