Needle

We distilled Gemini 3.1 into a 26M-parameter "Simple Attention Network" that you can finetune locally on your Mac or PC. In production, Needle runs on Cactus at 6000 tokens/sec prefill and 1200 tokens/sec decode. The weights are fully open at Cactus-Compute/needle, as is the dataset generation.
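To get a feel for what those throughput figures mean per request, here is a rough back-of-the-envelope estimate. It assumes the quoted prefill and decode rates hold for your sequence lengths and ignores fixed overheads such as tokenization and model loading. The function name and the token counts are illustrative, not part of Needle.

```python
def estimate_latency_ms(prompt_tokens, output_tokens,
                        prefill_tps=6000.0, decode_tps=1200.0):
    """Estimate request latency from the quoted Cactus throughput figures."""
    prefill_s = prompt_tokens / prefill_tps   # time to ingest query + tool schema
    decode_s = output_tokens / decode_tps     # time to emit the tool call
    return 1000.0 * (prefill_s + decode_s)

# Hypothetical request: 300 prompt tokens in, 30 tool-call tokens out.
latency = estimate_latency_ms(prompt_tokens=300, output_tokens=30)
print(f"{latency:.0f} ms")
```

Under these assumptions, a 300-token prompt with a 30-token tool call comes out at roughly 75 ms.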

Parameters     26M
Architecture   Encoder-decoder, pure attention (no FFN)
Encoder        12 layers, GQA (8H/4KV), RoPE, gated residuals
Decoder        8 layers, self-attn + cross-attn, gated residuals
d_model        512
Vocab          8192 (SentencePiece BPE)
Norm           ZCRMSNorm (zero-centered, init=0)
Precision      bfloat16 (INT4 QAT during training)
Pretraining    200B tokens on 16x TPU v6e (27 hrs)
Post-training  2B tokens of function call data (45 mins)
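As a sanity check, the 26M figure can be roughly reproduced from the numbers in the table. The sketch below makes several assumptions that the table does not state: head_dim = d_model / heads, no bias terms, cross-attention using the same 8H/4KV shapes as self-attention, and a single embedding matrix shared by the encoder and decoder and tied to the output projection. The small norm-scale and gate parameters are ignored.

```python
# Values taken from the spec table.
D_MODEL = 512
N_HEADS = 8
N_KV_HEADS = 4
VOCAB = 8192
ENC_LAYERS = 12
DEC_LAYERS = 8

# Assumption: each head has d_model / n_heads dimensions.
head_dim = D_MODEL // N_HEADS        # 64
kv_dim = N_KV_HEADS * head_dim       # 256, the narrower K/V width under GQA

def attention_params():
    """Weights in one attention block: Q, K, V and output projections, no biases."""
    q = D_MODEL * D_MODEL
    k = D_MODEL * kv_dim
    v = D_MODEL * kv_dim
    o = D_MODEL * D_MODEL
    return q + k + v + o

embedding = VOCAB * D_MODEL                       # shared and tied, counted once
encoder = ENC_LAYERS * attention_params()         # one self-attn per layer
decoder = DEC_LAYERS * 2 * attention_params()     # self-attn + cross-attn per layer
total = embedding + encoder + decoder

print(f"{total:,}")  # 26,214,400
```

With no FFN blocks, almost the whole budget is attention projections plus the embedding table, which is why the estimate lands so close to the stated 26M.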
d=512, 8H/4KV, BPE=8192
                                  ┌──────────────┐
                                  │  Tool Call   │
                                  └──────┬───────┘
                                        ┌┴──────────┐
                                        │  Softmax  │
                                        └─────┬─────┘
                                        ┌─────┴─────┐
                                        │ Linear (T)│  <- tied
                                        └─────┬─────┘
                                        ┌─────┴─────┐
                                        │ ZCRMSNorm │
                                        └─────┬─────┘
                                     ┌────────┴────────┐
                                     │ Decoder x 8     │
                                     │┌───────────────┐│
                                     ││ ZCRMSNorm     ││
                                     ││ Masked Self   ││
                                     ││ Attn + RoPE   ││
                                     ││ Gated Residual││
                                     │├───────────────┤│
  ┌──────────────┐                   ││ ZCRMSNorm     ││
  │ Encoder x 12 │─────────────────────>Cross Attn    ││
  │              │                   ││ Gated Residual││
  │ ┌──────────┐ │                   │└───────────────┘│
  │ │ZCRMSNorm │ │                   └────────┬────────┘
  │ │Self Attn │ │                      ┌─────┴─────┐
  │ │ GQA+RoPE │ │                      │ Embedding │  <- shared
  │ │Gated Res │ │                      └─────┬─────┘
  │ │          │ │                    ┌───────┴────────┐
  │ │ (no FFN) │ │                    │[EOS]<tool_call>│
  │ └──────────┘ │                    │ + answer       │
  │              │                    └────────────────┘
  └──────┬───────┘
         │
    ┌────┴──────┐
    │ Embedding │
    └────┬──────┘
         │
    ┌────┴──────┐
    │   Text    │
    │  query    │
    └───────────┘
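Two of the building blocks in the diagram can be sketched in a few lines. This is not Needle's implementation. It assumes "zero-centered, init=0" means the norm scale is stored as an offset from 1 (so the layer is a plain RMS normalisation at initialisation), and that GQA groups contiguous query heads onto each KV head, as is common. The function names are illustrative.

```python
import math

def zc_rms_norm(x, gamma, eps=1e-6):
    """RMS-normalise x, then scale by (1 + gamma).

    Assumption: gamma is initialised to zero, so the effective scale starts at 1.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * (1.0 + g) for v, g in zip(x, gamma)]

def kv_head_for(q_head, n_heads=8, n_kv_heads=4):
    """Index of the KV head that a given query head reads from under GQA."""
    group_size = n_heads // n_kv_heads   # 2 query heads per KV head for 8H/4KV
    return q_head // group_size

# At init (gamma = 0) the output has unit RMS.
y = zc_rms_norm([3.0, 4.0], gamma=[0.0, 0.0])

# With 8 query heads and 4 KV heads, each KV head serves two query heads.
mapping = [kv_head_for(h) for h in range(8)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Sharing KV heads this way halves the K and V projection widths relative to full multi-head attention, which matters in a model where attention is the only per-layer compute.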

Quickstart

git clone https://github.com/cactus-compute/needle.git
cd needle && source ./setup
needle playground

This opens a web UI at http://127.0.0.1:7860 where you can test the model and finetune it on your own tools. Weights are downloaded automatically.

Usage (Python)

from needle import load_checkpoint, generate, SimpleAttentionNetwork, get_tokenizer

params, config = load_checkpoint("checkpoints/needle.pkl")
model = SimpleAttentionNetwork(config)
tokenizer = get_tokenizer()

result = generate(
    model, params, tokenizer,
    query="What's the weather in San Francisco?",
    tools='[{"name":"get_weather","parameters":{"location":"string"}}]',
    stream=False,
)
print(result)
# [{"name":"get_weather","arguments":{"location":"San Francisco"}}]
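Once you have the model output, you will usually want to execute the call it names. The sketch below assumes `generate` returns a JSON string shaped like the printed example above (a list of objects with `name` and `arguments`). The `get_weather` stub, `TOOL_REGISTRY`, and `dispatch` are hypothetical helpers, not part of the needle package.

```python
import json

def get_weather(location: str) -> str:
    """Hypothetical local tool; replace with a real implementation."""
    return f"(stub) weather for {location}"

# Map tool names, as declared in the `tools` schema, to Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}

def dispatch(result: str, registry) -> list:
    """Parse the model's JSON output and run each requested tool."""
    calls = json.loads(result)
    outputs = []
    for call in calls:
        fn = registry.get(call["name"])
        if fn is None:
            # The model named a tool we do not implement; fail loudly.
            raise KeyError(f"unknown tool: {call['name']}")
        outputs.append(fn(**call["arguments"]))
    return outputs

# Output string copied from the usage example above.
result = '[{"name":"get_weather","arguments":{"location":"San Francisco"}}]'
outputs = dispatch(result, TOOL_REGISTRY)
print(outputs)
```

In practice you would also want to catch `json.JSONDecodeError` and validate the arguments against the tool schema before calling anything, since a small model can emit malformed or unexpected calls.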

Finetuning

Finetune on your own tools via the web UI or CLI:

# Web UI (generates data via Gemini, trains, evaluates, bundles result)
needle playground

# CLI (auto-downloads weights if not local)
needle finetune data.jsonl

License

MIT

Citation

@misc{ndubuaku2026needle,
  title={Needle},
  author={Henry Ndubuaku and Jakub Mroz and Karen Mosoyan and Roman Shemet and Parkirat Sandhu and Satyajit Kumar and Noah Cylich and Justin H. Lee},
  year={2026},
  url={https://github.com/cactus-compute/needle}
}