@@ -9,8 +9,6 @@ tags:
   - on-device
   - jax
   - flax
-datasets:
-  - Cactus-Compute/tool-calls
 ---
 # Needle
@@ -19,8 +17,6 @@ A 26M parameter encoder-decoder transformer for on-device function calling, buil
 Distilled from Gemini 3.1 Flash Lite. Runs at 6000 tok/s prefill and 1200 tok/s decode on [Cactus](https://github.com/cactus-compute/cactus).
-## Model Details
 | | |
 |---|---|
 | Parameters | 26M |
@@ -34,7 +30,51 @@ Distilled from Gemini 3.1 Flash Lite. Runs at 6000 tok/s prefill and 1200 tok/s
 | Pretraining | 200B tokens on 16x TPU v6e (27hrs) |
 | Post-training | 2B tokens of function call data (45mins) |
-## Architecture
 No feedforward layers. Each encoder block is gated self-attention; each decoder block is gated self-attention + gated cross-attention. The only nonlinearities are softmax and sigmoid.
@@ -83,6 +123,12 @@ needle ui
 python -m src.training.finetune data.jsonl --checkpoint checkpoints/needle.pkl
 ```
 ## File Format
 The checkpoint is a Python pickle containing:
@@ -94,17 +140,6 @@ The checkpoint is a Python pickle containing:
 }
 ```
-Load with:
-```python
-import pickle
-with open("needle.pkl", "rb") as f:
-    data = pickle.load(f)
-```
-## Training Data
-Post-trained on [Cactus-Compute/tool-calls](https://huggingface.co/datasets/Cactus-Compute/tool-calls), a synthesized dataset of 2M+ function calling examples spanning 15 tool categories (timers, messaging, media, navigation, smart home, fitness, etc.).
 ## License
 MIT

   - on-device
   - jax
   - flax
 ---
 # Needle
 Distilled from Gemini 3.1 Flash Lite. Runs at 6000 tok/s prefill and 1200 tok/s decode on [Cactus](https://github.com/cactus-compute/cactus).
 | | |
 |---|---|
 | Parameters | 26M |
 | Pretraining | 200B tokens on 16x TPU v6e (27hrs) |
 | Post-training | 2B tokens of function call data (45mins) |
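The throughput and training figures above lend themselves to some quick arithmetic. The sketch below is a back-of-the-envelope estimate only: it assumes the quoted 6000 tok/s prefill and 1200 tok/s decode rates hold constant regardless of sequence length, and the function names and token counts are hypothetical, chosen for illustration.

```python
# Figures quoted in the model card.
PREFILL_TOK_PER_S = 6000
DECODE_TOK_PER_S = 1200


def estimate_latency_s(prompt_tokens: int, output_tokens: int) -> float:
    """Rough end-to-end latency, assuming constant prefill and decode rates."""
    return prompt_tokens / PREFILL_TOK_PER_S + output_tokens / DECODE_TOK_PER_S


def training_throughput(tokens: float, hours: float, chips: int) -> tuple[float, float]:
    """Average tokens per second for the whole run and per chip."""
    total = tokens / (hours * 3600)
    return total, total / chips


if __name__ == "__main__":
    # Hypothetical request: 600-token query plus tool list, 120-token tool call.
    print(f"latency: {estimate_latency_s(600, 120):.3f} s")

    # Pretraining: 200B tokens in 27 hours on 16 TPU v6e chips.
    total, per_chip = training_throughput(200e9, 27, 16)
    print(f"pretraining: {total:,.0f} tok/s total, {per_chip:,.0f} tok/s per chip")
```

Real latency depends on the device and on how the runtime batches work, so treat these numbers as an order-of-magnitude guide.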
+```
+d=512, 8H/4KV, BPE=8192
+                                  ┌──────────────┐
+                                  │  Tool Call   │
+                                  └──────┬───────┘
+                                        ┌┴──────────┐
+                                        │  Softmax  │
+                                        └─────┬─────┘
+                                        ┌─────┴─────┐
+                                        │ Linear (T)│  <- tied
+                                        └─────┬─────┘
+                                        ┌─────┴─────┐
+                                        │ ZCRMSNorm │
+                                        └─────┬─────┘
+                                     ┌────────┴────────┐
+                                     │ Decoder x 8     │
+                                     │┌───────────────┐│
+                                     ││ ZCRMSNorm     ││
+                                     ││ Masked Self   ││
+                                     ││ Attn + RoPE   ││
+                                     ││ Gated Residual││
+                                     │├───────────────┤│
+  ┌──────────────┐                   ││ ZCRMSNorm     ││
+  │ Encoder x 12 │─────────────────────>Cross Attn    ││
+  │              │                   ││ Gated Residual││
+  │ ┌──────────┐ │                   │└───────────────┘│
+  │ │ZCRMSNorm │ │                   └────────┬────────┘
+  │ │Self Attn │ │                      ┌─────┴─────┐
+  │ │ GQA+RoPE │ │                      │ Embedding │  <- shared
+  │ │Gated Res │ │                      └─────┬─────┘
+  │ │          │ │                    ┌───────┴────────┐
+  │ │ (no FFN) │ │                    │[EOS]<tool_call>│
+  │ └──────────┘ │                    │ + answer       │
+  │              │                    └────────────────┘
+  └──────┬───────┘
+         │
+    ┌────┴──────┐
+    │ Embedding │
+    └────┬──────┘
+         │
+    ┌────┴──────┐
+    │   Text    │
+    │  query    │
+    └───────────┘
+```
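As a sanity check, the hyperparameters in the diagram (d=512, 8 query heads, 4 KV heads, 8192-token vocabulary, 12 encoder and 8 decoder blocks, no FFN) can be turned into a rough parameter count. This is a sketch under several assumptions that the card does not state: the head dimension is `d / 8 = 64`, projections carry no biases, cross-attention uses the same grouped-query shapes as self-attention, one embedding matrix is shared by encoder, decoder, and the tied output layer, and the small gate and norm parameters are ignored.

```python
D_MODEL = 512
N_HEADS = 8
N_KV_HEADS = 4
VOCAB = 8192
ENC_LAYERS = 12
DEC_LAYERS = 8
HEAD_DIM = D_MODEL // N_HEADS  # assumption: 64, not stated in the card


def attention_params(d_model: int, n_heads: int, n_kv_heads: int, head_dim: int) -> int:
    """Weights in one grouped-query attention layer (Q, K, V, output), no biases."""
    q = d_model * n_heads * head_dim
    kv = 2 * d_model * n_kv_heads * head_dim
    out = n_heads * head_dim * d_model
    return q + kv + out


def estimate_params() -> dict[str, int]:
    attn = attention_params(D_MODEL, N_HEADS, N_KV_HEADS, HEAD_DIM)
    counts = {
        "embedding": VOCAB * D_MODEL,     # counted once: shared and tied
        "encoder": ENC_LAYERS * attn,     # self-attention only
        "decoder": DEC_LAYERS * 2 * attn, # self-attention + cross-attention
    }
    counts["total"] = sum(counts.values())
    return counts


if __name__ == "__main__":
    for name, value in estimate_params().items():
        print(f"{name:>9}: {value:,}")
```

Under these assumptions the total comes to 26,214,400, which is in line with the stated 26M parameters.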
 No feedforward layers. Each encoder block is gated self-attention; each decoder block is gated self-attention + gated cross-attention. The only nonlinearities are softmax and sigmoid.
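To make the block structure concrete, here is a minimal NumPy sketch of one encoder block: normalization, grouped-query self-attention, and a gated residual. It is not the Needle implementation. The gate form (a learned per-channel logit passed through a sigmoid and applied to the attention branch) is an assumption, plain RMSNorm stands in for ZCRMSNorm, RoPE is omitted, and all parameter names are hypothetical.

```python
import numpy as np

D_MODEL, N_HEADS, N_KV_HEADS = 512, 8, 4
HEAD_DIM = D_MODEL // N_HEADS  # assumption: 64


def rms_norm(x, scale, eps=1e-6):
    # Plain RMSNorm as a stand-in; ZCRMSNorm may differ in detail.
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return x / rms * scale


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gqa_self_attention(x, wq, wk, wv, wo):
    """Bidirectional grouped-query attention over a (tokens, d_model) input."""
    t = x.shape[0]
    q = (x @ wq).reshape(t, N_HEADS, HEAD_DIM)
    k = (x @ wk).reshape(t, N_KV_HEADS, HEAD_DIM)
    v = (x @ wv).reshape(t, N_KV_HEADS, HEAD_DIM)
    # Each KV head serves N_HEADS // N_KV_HEADS query heads.
    group = N_HEADS // N_KV_HEADS
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(HEAD_DIM)
    weights = softmax(scores, axis=-1)
    out = np.einsum("hqk,khd->qhd", weights, v).reshape(t, D_MODEL)
    return out @ wo


def gated_block(x, p):
    """x + sigmoid(gate) * Attn(Norm(x)); the gate form is an assumption."""
    h = rms_norm(x, p["norm_scale"])
    attn = gqa_self_attention(h, p["wq"], p["wk"], p["wv"], p["wo"])
    return x + sigmoid(p["gate_logit"]) * attn


def init_params(rng):
    kv_dim = N_KV_HEADS * HEAD_DIM
    return {
        "norm_scale": np.ones(D_MODEL),
        "wq": rng.normal(0, 0.02, (D_MODEL, D_MODEL)),
        "wk": rng.normal(0, 0.02, (D_MODEL, kv_dim)),
        "wv": rng.normal(0, 0.02, (D_MODEL, kv_dim)),
        "wo": rng.normal(0, 0.02, (D_MODEL, D_MODEL)),
        "gate_logit": np.zeros(D_MODEL),  # sigmoid(0) = 0.5
    }
```

In this sketch, softmax (inside attention) and sigmoid (the gate) are the only nonlinearities, matching the description above. A decoder block would add a causal mask to the self-attention and a second gated sub-block whose keys and values come from the encoder output.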
 python -m src.training.finetune data.jsonl --checkpoint checkpoints/needle.pkl
 ```
+## Links
+- [Needle](https://github.com/cactus-compute/needle) - training, finetuning, and inference code
+- [Cactus](https://github.com/cactus-compute/cactus) - on-device runtime (6000 tok/s prefill, 1200 tok/s decode)
+- [Simple Attention Networks](https://github.com/cactus-compute/needle/blob/main/docs/simple_attention_networks.md) - architecture details
 ## File Format
 The checkpoint is a Python pickle containing:
 }
 ```
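A minimal sketch for inspecting the checkpoint with the standard library follows. It assumes the pickle holds a top-level `dict`, as the snippet above suggests, and it makes no assumption about the key names. The helper `describe_checkpoint` is hypothetical and not part of the Needle codebase. Because unpickling can execute arbitrary code, only load checkpoints from a source you trust.

```python
import pickle


def describe_checkpoint(path: str) -> dict[str, str]:
    """Load a pickle checkpoint and map each top-level key to its value's type name."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    if not isinstance(data, dict):
        raise TypeError(f"expected a dict at top level, got {type(data).__name__}")
    return {str(key): type(value).__name__ for key, value in data.items()}
```

For example, `describe_checkpoint("checkpoints/needle.pkl")` would list the top-level entries of the checkpoint path used in the finetuning command.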
 ## License
 MIT