---
license: mit
library_name: jax
tags:
- function-calling
- tool-use
- encoder-decoder
- edge
- on-device
- jax
- flax
---
# Needle
We distilled Gemini 3.1 into a 26M-parameter "[Simple Attention Network](docs/simple_attention_networks.md)" that you can even finetune locally on your Mac or PC.
In production, Needle runs on [Cactus](https://github.com/cactus-compute/cactus) at 6000 tok/s prefill and 1200 tok/s decode.
Weights are fully open on [Cactus-Compute/needle](https://huggingface.co/Cactus-Compute/needle), as is the dataset generation.
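
To give a feel for what the quoted throughput means for a single request, here is a back-of-the-envelope sketch. It assumes prefill and decode run back to back at the constant rates quoted above; the request size (300 prompt tokens, 30 output tokens) and the helper name `estimate_latency_s` are hypothetical and not part of Needle or Cactus.

```python
# Back-of-the-envelope latency from the throughput figures quoted above.
PREFILL_TOK_PER_S = 6000  # prefill throughput from the text
DECODE_TOK_PER_S = 1200   # decode throughput from the text


def estimate_latency_s(prompt_tokens: int, output_tokens: int) -> float:
    """Estimate seconds per request, assuming prefill then decode at constant rates."""
    prefill_s = prompt_tokens / PREFILL_TOK_PER_S
    decode_s = output_tokens / DECODE_TOK_PER_S
    return prefill_s + decode_s


# Hypothetical request: 300 prompt tokens (query + tool schemas), 30 output tokens.
print(f"{estimate_latency_s(300, 30) * 1000:.0f} ms")  # 75 ms
```
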
| | |
|---|---|
| Parameters | 26M |
| Architecture | Encoder-decoder, pure attention (no FFN) |
| Encoder | 12 layers, GQA (8H/4KV), RoPE, gated residuals |
| Decoder | 8 layers, self-attn + cross-attn, gated residuals |
| d_model | 512 |
| Vocab | 8192 (SentencePiece BPE) |
| Norm | ZCRMSNorm (zero-centered, init=0) |
| Precision | bfloat16 (INT4 QAT during training) |
| Pretraining | 200B tokens on 16x TPU v6e (27 hrs) |
| Post-training | 2B tokens of function call data (45 mins) |
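
The Norm row above can be illustrated with a small framework-free sketch. It assumes that "zero-centered, init=0" means the learned scale is stored as an offset from 1, so a zero-initialized parameter leaves a plain RMS normalization. The function name `zc_rms_norm` and the `eps` value are illustrative and not taken from the Needle code.

```python
import math


def zc_rms_norm(x, gamma, eps=1e-6):
    """RMS-normalize x, then scale by (1 + gamma).

    Assumption: the scale is parameterized as an offset from 1, so
    gamma = 0 at initialization gives a plain RMS normalization.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * (1.0 + g) for v, g in zip(x, gamma)]


x = [3.0, -4.0, 0.0, 5.0]
gamma = [0.0] * len(x)  # init=0, so the effective scale is exactly 1
y = zc_rms_norm(x, gamma)

# With gamma = 0 the output has approximately unit root-mean-square.
out_rms = math.sqrt(sum(v * v for v in y) / len(y))
print(round(out_rms, 4))
```
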
```
                          d=512, 8H/4KV, BPE=8192
                              ┌──────────────┐
                              │  Tool Call   │
                              └──────┬───────┘
                               ┌─────┴─────┐
                               │  Softmax  │
                               └─────┬─────┘
                               ┌─────┴─────┐
                               │ Linear (T)│ <- tied
                               └─────┬─────┘
                               ┌─────┴─────┐
                               │ ZCRMSNorm │
                               └─────┬─────┘
                            ┌────────┴────────┐
                            │   Decoder x 8   │
                            │┌───────────────┐│
                            ││   ZCRMSNorm   ││
                            ││  Masked Self  ││
                            ││  Attn + RoPE  ││
                            ││ Gated Residual││
                            │├───────────────┤│
┌──────────────┐            ││   ZCRMSNorm   ││
│ Encoder x 12 │───────────────>Cross Attn   ││
│              │            ││ Gated Residual││
│ ┌──────────┐ │            │└───────────────┘│
│ │ZCRMSNorm │ │            └────────┬────────┘
│ │Self Attn │ │               ┌─────┴─────┐
│ │ GQA+RoPE │ │               │ Embedding │ <- shared
│ │Gated Res │ │               └─────┬─────┘
│ │          │ │             ┌───────┴────────┐
│ │ (no FFN) │ │             │[EOS]<tool_call>│
│ └──────────┘ │             │   + answer     │
│              │             └────────────────┘
└──────┬───────┘
       │
  ┌────┴──────┐
  │ Embedding │
  └────┬──────┘
       │
  ┌────┴──────┐
  │   Text    │
  │   query   │
  └───────────┘
```
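
As a sanity check on the 26M figure, the sketch below counts weights from the numbers in the table and diagram. It rests on several assumptions that the card does not confirm: cross-attention uses the same GQA projection shapes as self-attention, projections have no biases, and the small norm and gate parameters are ignored. The embedding is counted once because the diagram marks it as shared and tied to the output layer.

```python
D_MODEL = 512
VOCAB = 8192
N_HEADS, N_KV_HEADS = 8, 4
ENC_LAYERS, DEC_LAYERS = 12, 8

head_dim = D_MODEL // N_HEADS            # 64
kv_dim = N_KV_HEADS * head_dim           # 256: K and V are narrower under GQA

# One attention block: Q and output projections are d x d, K and V are d x kv_dim.
attn_params = 2 * D_MODEL * D_MODEL + 2 * D_MODEL * kv_dim

embedding = VOCAB * D_MODEL              # counted once: shared and tied
encoder = ENC_LAYERS * attn_params       # self-attention only, no FFN
decoder = DEC_LAYERS * 2 * attn_params   # self-attention + cross-attention per layer

total = embedding + encoder + decoder
print(f"{total:,}")  # 26,214,400
```

Under these assumptions the count lands at roughly 26.2M, which agrees with the stated parameter budget.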
## Quickstart
```bash
git clone https://github.com/cactus-compute/needle.git
cd needle && source ./setup
needle playground
```
This opens a web UI at http://127.0.0.1:7860 where you can test the model and finetune it on your own tools. Weights are downloaded automatically.
## Usage (Python)
```python
from needle import load_checkpoint, generate, SimpleAttentionNetwork, get_tokenizer

# Load the weights and config, then build the model and tokenizer.
params, config = load_checkpoint("checkpoints/needle.pkl")
model = SimpleAttentionNetwork(config)
tokenizer = get_tokenizer()

# `tools` is a JSON string describing the functions the model may call.
result = generate(
    model, params, tokenizer,
    query="What's the weather in San Francisco?",
    tools='[{"name":"get_weather","parameters":{"location":"string"}}]',
    stream=False,
)
print(result)
# [{"name":"get_weather","arguments":{"location":"San Francisco"}}]
```
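
Once the model has produced a tool call, the caller still has to execute it. The following is a minimal sketch of that step, assuming `generate` returns the JSON string shown in the comment above when `stream=False`. The `get_weather` stub, `TOOL_REGISTRY`, and `dispatch` are hypothetical helpers for illustration and are not part of the `needle` package.

```python
import json


def get_weather(location: str) -> str:
    """Hypothetical local tool; a real one would call a weather service."""
    return f"Sunny in {location}"


# Map tool names the model may emit to local callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def dispatch(result: str) -> list:
    """Parse the model's JSON output and run each requested tool call."""
    outputs = []
    for call in json.loads(result):
        fn = TOOL_REGISTRY.get(call["name"])
        if fn is None:
            raise KeyError(f"Model requested unknown tool: {call['name']}")
        outputs.append(fn(**call.get("arguments", {})))
    return outputs


# The string below is the example output shown above.
result = '[{"name":"get_weather","arguments":{"location":"San Francisco"}}]'
print(dispatch(result))  # ['Sunny in San Francisco']
```

Looking tools up in an explicit registry, rather than by evaluating the emitted name, keeps the model from invoking anything the caller did not expose.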
## Finetuning
Finetune on your own tools via the web UI or CLI:
```bash
# Web UI (generates data via Gemini, trains, evaluates, bundles result)
needle playground
# CLI (auto-downloads weights if not local)
needle finetune data.jsonl
```
## Links
- [Needle](https://github.com/cactus-compute/needle) - training, finetuning, and inference code
- [Cactus](https://github.com/cactus-compute/cactus) - on-device runtime (6000 tok/s prefill, 1200 tok/s decode)
- [Simple Attention Networks](https://github.com/cactus-compute/needle/blob/main/docs/simple_attention_networks.md) - architecture details
## License
MIT
## Citation
```
@misc{ndubuaku2026needle,
  title={Needle},
  author={Henry Ndubuaku and Jakub Mroz and Karen Mosoyan and Roman Shemet and Parkirat Sandhu and Satyajit Kumar and Noah Cylich and Justin H. Lee},
  year={2026},
  url={https://github.com/cactus-compute/needle}
}
```