---
license: mit
library_name: jax
tags:
  - function-calling
  - tool-use
  - encoder-decoder
  - edge
  - on-device
  - jax
  - flax
---

# Needle

We distilled Gemini 3.1 into a 26M-parameter "[Simple Attention Network](docs/simple_attention_networks.md)" that you can finetune locally on your Mac or PC.
In production, Needle runs on [Cactus](https://github.com/cactus-compute/cactus) at 6,000 tok/s prefill and 1,200 tok/s decode.
The weights are fully open on [Cactus-Compute/needle](https://huggingface.co/Cactus-Compute/needle), along with the dataset-generation code.

| | |
|---|---|
| Parameters | 26M |
| Architecture | Encoder-decoder, pure attention (no FFN) |
| Encoder | 12 layers, GQA (8H/4KV), RoPE, gated residuals |
| Decoder | 8 layers, self-attn + cross-attn, gated residuals |
| d_model | 512 |
| Vocab | 8192 (SentencePiece BPE) |
| Norm | ZCRMSNorm (zero-centered, init=0) |
| Precision | bfloat16 (INT4 QAT during training) |
| Pretraining | 200B tokens on 16x TPU v6e (27 hrs) |
| Post-training | 2B tokens of function-call data (45 mins) |
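
The GQA (8H/4KV) entry means 8 query heads share 4 key/value heads, two queries per KV head. This is not the repo's implementation, just a minimal sketch of the mechanism in plain JAX:

```python
# Minimal grouped-query attention sketch: 8 query heads, 4 KV heads.
# Illustrative only -- shapes and head counts taken from the table above,
# everything else (no RoPE, no masking) is simplified away.
import jax
import jax.numpy as jnp

def gqa(q, k, v):
    # q: (seq, 8, d_head); k, v: (seq, 4, d_head)
    group = q.shape[1] // k.shape[1]          # 2 query heads per KV head
    k = jnp.repeat(k, group, axis=1)          # expand each KV head to its group
    v = jnp.repeat(v, group, axis=1)
    scores = jnp.einsum("qhd,khd->hqk", q, k) / jnp.sqrt(q.shape[-1])
    return jnp.einsum("hqk,khd->qhd", jax.nn.softmax(scores, axis=-1), v)

d_head = 512 // 8                             # d_model / n_heads = 64
q = jnp.ones((10, 8, d_head))
k = jnp.ones((10, 4, d_head))
v = jnp.ones((10, 4, d_head))
out = gqa(q, k, v)
print(out.shape)                              # (10, 8, 64)
```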

```
d=512, 8H/4KV, BPE=8192
                                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                  β”‚  Tool Call   β”‚
                                  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
                                        β”Œβ”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                        β”‚  Softmax  β”‚
                                        β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
                                        β”Œβ”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”
                                        β”‚ Linear (T)β”‚  <- tied
                                        β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
                                        β”Œβ”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”
                                        β”‚ ZCRMSNorm β”‚
                                        β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
                                     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”
                                     β”‚ Decoder x 8     β”‚
                                     β”‚β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
                                     β”‚β”‚ ZCRMSNorm     β”‚β”‚
                                     β”‚β”‚ Masked Self   β”‚β”‚
                                     β”‚β”‚ Attn + RoPE   β”‚β”‚
                                     β”‚β”‚ Gated Residualβ”‚β”‚
                                     β”‚β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”‚
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                   β”‚β”‚ ZCRMSNorm     β”‚β”‚
  β”‚ Encoder x 12 │─────────────────────>Cross Attn    β”‚β”‚
  β”‚              β”‚                   β”‚β”‚ Gated Residualβ”‚β”‚
  β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚                   β”‚β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
  β”‚ β”‚ZCRMSNorm β”‚ β”‚                   β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  β”‚ β”‚Self Attn β”‚ β”‚                      β”Œβ”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”
  β”‚ β”‚ GQA+RoPE β”‚ β”‚                      β”‚ Embedding β”‚  <- shared
  β”‚ β”‚Gated Res β”‚ β”‚                      β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
  β”‚ β”‚          β”‚ β”‚                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ β”‚ (no FFN) β”‚ β”‚                    β”‚[EOS]<tool_call>β”‚
  β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚                    β”‚ + answer       β”‚
  β”‚              β”‚                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
    β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”
    β”‚ Embedding β”‚
    β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
         β”‚
    β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”
    β”‚   Text    β”‚
    β”‚  query    β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
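
ZCRMSNorm is the zero-centered RMSNorm used throughout the diagram. A minimal sketch, assuming the scale is parameterized as `(1 + gamma)` with `gamma` initialized to zero so the layer starts as pure RMS normalization (see [Simple Attention Networks](docs/simple_attention_networks.md) for the exact formulation):

```python
# Zero-centered RMSNorm sketch: at init (gamma = 0) the layer only
# normalizes the activation to unit RMS along the feature axis.
import jax.numpy as jnp

def zcrms_norm(x, gamma, eps=1e-6):
    rms = jnp.sqrt(jnp.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * (1.0 + gamma)

x = jnp.arange(1.0, 9.0)          # toy activation vector
gamma = jnp.zeros_like(x)         # init=0 -> identity scale at start
y = zcrms_norm(x, gamma)
```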

## Quickstart

```bash
git clone https://github.com/cactus-compute/needle.git
cd needle && source ./setup
needle playground
```

This opens a web UI at http://127.0.0.1:7860 where you can test and finetune the model on your own tools. The weights are downloaded automatically.


## Usage (Python)

```python
from needle import load_checkpoint, generate, SimpleAttentionNetwork, get_tokenizer

params, config = load_checkpoint("checkpoints/needle.pkl")
model = SimpleAttentionNetwork(config)
tokenizer = get_tokenizer()

result = generate(
    model, params, tokenizer,
    query="What's the weather in San Francisco?",
    tools='[{"name":"get_weather","parameters":{"location":"string"}}]',
    stream=False,
)
print(result)
# [{"name":"get_weather","arguments":{"location":"San Francisco"}}]
```
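
The model emits the tool call as a JSON string (as in the commented output above), so it can be parsed with the standard library:

```python
# Parse the tool-call string from the example above into Python objects.
import json

result = '[{"name":"get_weather","arguments":{"location":"San Francisco"}}]'
calls = json.loads(result)
print(calls[0]["name"])                    # get_weather
print(calls[0]["arguments"]["location"])   # San Francisco
```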

## Finetuning

Finetune on your own tools via the web UI or CLI:

```bash
# Web UI (generates data via Gemini, trains, evaluates, bundles result)
needle playground

# CLI (auto-downloads weights if not local)
needle finetune data.jsonl
```
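
The CLI consumes a JSONL file of training records. The exact schema is defined by the repo; purely as an illustration, a record might pair a query with its tools and the expected call (field names here are hypothetical):

```json
{"query": "What's the weather in San Francisco?", "tools": [{"name": "get_weather", "parameters": {"location": "string"}}], "answer": [{"name": "get_weather", "arguments": {"location": "San Francisco"}}]}
```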

## Links

- [Needle](https://github.com/cactus-compute/needle) - training, finetuning, and inference code
- [Cactus](https://github.com/cactus-compute/cactus) - on-device runtime (6000 tok/s prefill, 1200 tok/s decode)
- [Simple Attention Networks](https://github.com/cactus-compute/needle/blob/main/docs/simple_attention_networks.md) - architecture details

## License

MIT

## Citation

```
@misc{ndubuaku2026needle,
  title={Needle},
  author={Henry Ndubuaku and Jakub Mroz and Karen Mosoyan and Roman Shemet and Parkirat Sandhu and Satyajit Kumar and Noah Cylich and Justin H. Lee},
  year={2026},
  url={https://github.com/cactus-compute/needle}
}
```