---
license: apache-2.0
tags:
  - neuron
  - aws-inferentia
  - inf2
  - moe
  - pre-compiled
  - neuronx-distributed-inference
base_model: arcee-ai/Trinity-Nano-Preview
library_name: neuronx-distributed-inference
---

# Trinity-Nano Pre-Compiled for AWS Inferentia2 (TP=1)

Pre-compiled and pre-sharded [Trinity-Nano-Preview](https://huggingface.co/arcee-ai/Trinity-Nano-Preview) (~6B total parameters, ~1B active MoE) for AWS Neuron SDK 2.28, ready to load on **inf2.xlarge** (16 GB system RAM) or any larger Inferentia2/Trainium instance.

## Why Pre-Sharded?

The standard NxDI load path downloads the full HuggingFace checkpoint (~12 GB bf16) into CPU RAM for weight conversion and sharding. On inf2.xlarge (16 GB system RAM), the process is OOM-killed once its RSS climbs past 15 GB.

Pre-sharded weights bypass this entirely: NxDI reads the per-rank sharded files directly, peaking at only **1.4 GB RSS**, well under the 16 GB available.

## Contents

| File | Size | Description |
|------|------|-------------|
| `model.pt` | 49 MB | Compiled Neuron NEFF graphs |
| `neuron_config.json` | 9 KB | NxDI configuration (TP=1, BS=1, seq_len=2048, bf16) |
| `weights/tp0_sharded_checkpoint.safetensors` | 12 GB | Pre-sharded model weights for rank 0 |

## Performance

Measured on inf2.xlarge (1 NeuronCore, 16 GB system RAM):

| Metric | Value |
|--------|-------|
| TTFT | 706 ms |
| TKG (per token) | 9.0 ms |
| Throughput | 112 tok/s |
| Load time | 18.4 s |
| Peak RSS | 1.39 GB |
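These figures compose in the usual way: end-to-end latency for N generated tokens is roughly TTFT plus N − 1 decode steps, and steady-state decode throughput is the inverse of TKG. A back-of-envelope check (which lands close to the measured ~112 tok/s):

```python
ttft_ms = 706.0  # time to first token (prefill), from the table above
tkg_ms = 9.0     # per-token decode latency, from the table above

def latency_ms(n_tokens: int) -> float:
    """Rough end-to-end latency: one prefill, then n_tokens - 1 decode steps."""
    return ttft_ms + (n_tokens - 1) * tkg_ms

decode_tps = 1000.0 / tkg_ms  # steady-state decode throughput

print(f"128 tokens: {latency_ms(128):.0f} ms total, ~{decode_tps:.0f} tok/s decode")
```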

## Quick Start

### Prerequisites

- AWS instance with Inferentia2: inf2.xlarge, inf2.8xlarge, or larger
- [Deep Learning AMI Neuron (Ubuntu 24.04) 20260227](https://aws.amazon.com/marketplace/) (SDK 2.28)
- Activate the pre-installed venv: `source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate`

### 1. Clone the model implementation

The Trinity Neuron implementation is not yet merged into the main NxDI repo. Use the contrib branch from the fork:

```bash
git clone --branch contrib/trinity-model --single-branch \
    https://github.com/jimburtoft/neuronx-distributed-inference.git nxdi-trinity
```

### 2. Download this artifact and the base model config/tokenizer

```python
from huggingface_hub import snapshot_download

# Download the pre-compiled artifact (model.pt + sharded weights)
snapshot_download("jburtoft/Trinity-Nano-Neuron-TP1",
                  local_dir="/home/ubuntu/Trinity-Nano-Neuron-TP1")

# Download config + tokenizer only (no model weights needed)
snapshot_download("arcee-ai/Trinity-Nano-Preview",
                  local_dir="/home/ubuntu/Trinity-Nano-Preview",
                  ignore_patterns=["*.safetensors", "*.bin", "*.pt", "*.gguf"])
```
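A quick sanity check once the downloads finish (a sketch; it simply mirrors the Contents table above):

```python
from pathlib import Path

# Files this artifact repo should contain, per the Contents table.
EXPECTED = [
    "model.pt",
    "neuron_config.json",
    "weights/tp0_sharded_checkpoint.safetensors",
]

def missing_files(root: str) -> list[str]:
    """Return the expected artifact files that are absent under root."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).is_file()]

# Should print [] once snapshot_download above has completed.
print(missing_files("/home/ubuntu/Trinity-Nano-Neuron-TP1"))
```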

### 3. Load and run inference

```python
import sys
import torch
from transformers import AutoTokenizer
from neuronx_distributed_inference.models.config import MoENeuronConfig

# Point to the Trinity implementation from the cloned repo
sys.path.insert(0, "/home/ubuntu/nxdi-trinity/contrib/models/Trinity/src")
from modeling_trinity import NeuronTrinityForCausalLM, TrinityInferenceConfig

# Build model with save_sharded_checkpoint=True (must match compilation)
neuron_config = MoENeuronConfig(
    tp_degree=1,
    batch_size=1,
    seq_len=2048,
    torch_dtype=torch.bfloat16,
    save_sharded_checkpoint=True,
)

config = TrinityInferenceConfig.from_pretrained(
    "/home/ubuntu/Trinity-Nano-Preview",
    neuron_config=neuron_config,
)

model = NeuronTrinityForCausalLM("/home/ubuntu/Trinity-Nano-Preview", config)
model.load("/home/ubuntu/Trinity-Nano-Neuron-TP1")

# Tokenize
tokenizer = AutoTokenizer.from_pretrained(
    "/home/ubuntu/Trinity-Nano-Preview", trust_remote_code=True
)

prompt = "Hello, how are you today?"
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs.input_ids

# Generate
model.reset()
position_ids = torch.arange(input_ids.shape[1]).unsqueeze(0)
seq_ids = torch.arange(1)

with torch.no_grad():
    outputs = model(input_ids, position_ids=position_ids, seq_ids=seq_ids)

logits = outputs.logits if hasattr(outputs, "logits") else outputs[0]
next_token = torch.argmax(logits[:, -1, :], dim=-1)
print(f"Prompt: {prompt}")
print(f"Next token: {tokenizer.decode(next_token)}")

# Autoregressive generation
generated = [next_token.unsqueeze(0)]
for i in range(31):
    pos = torch.tensor([[input_ids.shape[1] + i]])
    with torch.no_grad():
        outputs = model(generated[-1], position_ids=pos, seq_ids=seq_ids)
    logits = outputs.logits if hasattr(outputs, "logits") else outputs[0]
    next_token = torch.argmax(logits[:, -1, :], dim=-1)
    generated.append(next_token.unsqueeze(0))

text = tokenizer.decode(torch.cat(generated, dim=1)[0], skip_special_tokens=True)
print(f"Generated: {text}")
```

## Compilation Details

| Parameter | Value |
|-----------|-------|
| SDK | 2.28 (NxDI 0.8.16251, neuronx-cc 2.23.6484, torch-neuronx 2.9.0.2.12) |
| TP degree | 1 |
| Batch size | 1 |
| Sequence length | 2048 |
| Dtype | bfloat16 |
| `save_sharded_checkpoint` | True |

## Compiling Your Own

To compile for different configurations (e.g., TP=2, BS=4), you need a larger instance (inf2.8xlarge or trn2.3xlarge):

```python
import sys
import torch
from neuronx_distributed_inference.models.config import MoENeuronConfig

sys.path.insert(0, "/path/to/nxdi-trinity/contrib/models/Trinity/src")
from modeling_trinity import NeuronTrinityForCausalLM, TrinityInferenceConfig

neuron_config = MoENeuronConfig(
    tp_degree=1,       # Adjust as needed
    batch_size=1,      # Adjust as needed
    seq_len=2048,      # Adjust as needed
    torch_dtype=torch.bfloat16,
    save_sharded_checkpoint=True,  # Required for pre-sharded deployment
)

config = TrinityInferenceConfig.from_pretrained(
    "/path/to/Trinity-Nano-Preview", neuron_config=neuron_config
)
model = NeuronTrinityForCausalLM("/path/to/Trinity-Nano-Preview", config)
model.compile("/path/to/compiled-output")
# Output: model.pt, neuron_config.json, weights/tp{rank}_sharded_checkpoint.safetensors
```
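A rule of thumb for sizing the target instance: the bf16 checkpoint is ~12 GB, and tensor parallelism splits most of it across ranks. This is a rough estimate only; replicated (unsharded) parameters such as embeddings make real shards somewhat larger:

```python
TOTAL_BF16_GB = 12.0  # full Trinity-Nano checkpoint in bf16 (see above)

def approx_shard_gb(tp_degree: int) -> float:
    """Rough per-rank shard size, ignoring replicated (unsharded) params."""
    return TOTAL_BF16_GB / tp_degree

for tp in (1, 2, 8):
    print(f"TP={tp}: ~{approx_shard_gb(tp):.1f} GB per rank")
```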

## Base Model

- **Model**: [arcee-ai/Trinity-Nano-Preview](https://huggingface.co/arcee-ai/Trinity-Nano-Preview)
- **Architecture**: MoE (128 experts, top-8 active, 1 shared expert)
- **Parameters**: ~6B total, ~1B active per token
- **License**: Apache 2.0

## Model Implementation

The NeuronX Distributed Inference implementation for Trinity is available at:
[github.com/jimburtoft/neuronx-distributed-inference](https://github.com/jimburtoft/neuronx-distributed-inference/tree/contrib/trinity-model/contrib/models/Trinity) (branch: `contrib/trinity-model`)

This implementation supports all three Trinity model sizes (Nano, Mini, Large) with a single unified `modeling_trinity.py`.