---
license: mit
library_name: jax
tags:
- function-calling
- tool-use
- encoder-decoder
- edge
- on-device
- jax
- flax
---

# Needle
We distilled Gemini 3.1 into a 26M-parameter "Simple Attention Network" that you can even finetune locally on your Mac or PC. In production, Needle runs on Cactus at 6,000 tok/s prefill and 1,200 tok/s decode. The weights are fully open at Cactus-Compute/needle, along with the dataset-generation code.
| Spec | Value |
|---|---|
| Parameters | 26M |
| Architecture | Encoder-decoder, pure attention (no FFN) |
| Encoder | 12 layers, GQA (8H/4KV), RoPE, gated residuals |
| Decoder | 8 layers, self-attn + cross-attn, gated residuals |
| d_model | 512 |
| Vocab | 8192 (SentencePiece BPE) |
| Norm | ZCRMSNorm (zero-centered, init=0) |
| Precision | bfloat16 (INT4 QAT during training) |
| Pretraining | 200B tokens on 16x TPU v6e (27 hrs) |
| Post-training | 2B tokens of function-call data (45 mins) |
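As a sanity check, the table's numbers roughly reproduce the 26M total: with no FFN blocks, nearly all parameters sit in the shared embedding and the attention projections. A back-of-the-envelope count (our estimate, ignoring norm and gate parameters, and assuming head_dim = d_model / n_heads = 64):

```python
# Rough parameter count from the spec table (estimate only: ignores
# norm/gate parameters and assumes head_dim = 512 / 8 = 64).
d_model, vocab = 512, 8192
n_heads, n_kv_heads, head_dim = 8, 4, 64

embedding = vocab * d_model                    # shared input/output embedding
q_proj = d_model * n_heads * head_dim          # 512 x 512
kv_proj = 2 * d_model * n_kv_heads * head_dim  # K and V, 512 x 256 each (GQA)
o_proj = n_heads * head_dim * d_model          # 512 x 512
attn = q_proj + kv_proj + o_proj               # one attention block

encoder = 12 * attn       # self-attention only, no FFN
decoder = 8 * 2 * attn    # self-attention + cross-attention per layer

total = embedding + encoder + decoder
print(f"{total / 1e6:.1f}M")  # 26.2M
```

The tied output head (`Linear (T)` in the diagram below) reuses the embedding matrix, so it adds no extra parameters.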
```
                         d=512, 8H/4KV, BPE=8192

                                ┌───────────────┐
                                │   Tool Call   │
                                └───────┬───────┘
                                 ┌──────┴──────┐
                                 │   Softmax   │
                                 └──────┬──────┘
                                 ┌──────┴──────┐
                                 │ Linear (T)  │ <- tied
                                 └──────┬──────┘
                                 ┌──────┴──────┐
                                 │  ZCRMSNorm  │
                                 └──────┬──────┘
                              ┌─────────┴─────────┐
                              │    Decoder x 8    │
                              │ ┌───────────────┐ │
                              │ │ ZCRMSNorm     │ │
                              │ │ Masked Self   │ │
                              │ │ Attn + RoPE   │ │
                              │ │ Gated Residual│ │
                              │ ├───────────────┤ │
┌──────────────┐              │ │ ZCRMSNorm     │ │
│ Encoder x 12 │─────────────>│ │ Cross Attn    │ │
│              │              │ │ Gated Residual│ │
│ ┌──────────┐ │              │ └───────────────┘ │
│ │ZCRMSNorm │ │              └─────────┬─────────┘
│ │Self Attn │ │                 ┌──────┴──────┐
│ │ GQA+RoPE │ │                 │  Embedding  │ <- shared
│ │Gated Res │ │                 └──────┬──────┘
│ │          │ │               ┌────────┴────────┐
│ │ (no FFN) │ │               │[EOS]<tool_call> │
│ └──────────┘ │               │    + answer     │
└──────┬───────┘               └─────────────────┘
       │
 ┌─────┴──────┐
 │ Embedding  │
 └─────┬──────┘
       │
 ┌─────┴──────┐
 │    Text    │
 │   query    │
 └────────────┘
```
## Quickstart

```shell
git clone https://github.com/cactus-compute/needle.git
cd needle && source ./setup
needle playground
```
This opens a web UI at http://127.0.0.1:7860 where you can test and finetune the model on your own tools. Weights are downloaded automatically.
## Usage (Python)

```python
from needle import load_checkpoint, generate, SimpleAttentionNetwork, get_tokenizer

params, config = load_checkpoint("checkpoints/needle.pkl")
model = SimpleAttentionNetwork(config)
tokenizer = get_tokenizer()

result = generate(
    model, params, tokenizer,
    query="What's the weather in San Francisco?",
    tools='[{"name":"get_weather","parameters":{"location":"string"}}]',
    stream=False,
)
print(result)
# [{"name":"get_weather","arguments":{"location":"San Francisco"}}]
```
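`generate` returns the predicted tool calls as a JSON string, so executing them is just `json.loads` plus a name-to-callable lookup. A minimal dispatcher sketch (the `get_weather` stub and the `TOOLS` registry are illustrative, not part of the needle API):

```python
import json

def get_weather(location: str) -> str:
    # Stub standing in for a real weather API call.
    return f"Sunny in {location}"

# Registry mapping tool names from Needle's output to local callables.
TOOLS = {"get_weather": get_weather}

def dispatch(result: str):
    """Parse Needle's JSON output and invoke each predicted tool call."""
    calls = json.loads(result)
    return [TOOLS[call["name"]](**call["arguments"]) for call in calls]

sample = '[{"name":"get_weather","arguments":{"location":"San Francisco"}}]'
print(dispatch(sample))  # ['Sunny in San Francisco']
```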
## Finetuning

Finetune on your own tools via the web UI or the CLI:

```shell
# Web UI (generates data via Gemini, trains, evaluates, bundles the result)
needle playground

# CLI (auto-downloads weights if not present locally)
needle finetune data.jsonl
```
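For the CLI path, training examples live in a JSONL file, one example per line. The exact schema expected by `needle finetune` is defined in the repo; the snippet below only illustrates the general shape of such a file, and the field names (`query`/`tools`/`target`) are assumptions for illustration:

```python
import json

# Hypothetical data.jsonl rows: a user query, the tool schemas available,
# and the expected tool call. Field names are illustrative assumptions.
rows = [
    {
        "query": "Turn on the living room lights",
        "tools": [{"name": "set_light",
                   "parameters": {"room": "string", "state": "string"}}],
        "target": [{"name": "set_light",
                    "arguments": {"room": "living room", "state": "on"}}],
    },
]

with open("data.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```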
## Links
- Needle - training, finetuning, and inference code
- Cactus - on-device runtime (6000 tok/s prefill, 1200 tok/s decode)
- Simple Attention Networks - architecture details
## License
MIT
## Citation

```bibtex
@misc{ndubuaku2026needle,
  title={Needle},
  author={Henry Ndubuaku and Jakub Mroz and Karen Mosoyan and Roman Shemet and Parkirat Sandhu and Satyajit Kumar and Noah Cylich and Justin H. Lee},
  year={2026},
  url={https://github.com/cactus-compute/needle}
}
```