---
title: LocalAgent Tool Calling (WebGPU)
emoji: 🛠️
colorFrom: indigo
colorTo: purple
sdk: static
pinned: false
license: mit
short_description: Sub-100M from-scratch tool-calling agent in the browser
---

# LocalAgent: tool calling in the browser (WebGPU)

A **28M-parameter, pretrained-from-scratch** byte-level agent that does **grounded tool calling** and **multi-step planning**, running **entirely in your browser** on [onnxruntime-web](https://onnxruntime.ai/docs/tutorials/web/) with the **WebGPU** backend (WASM fallback when WebGPU is unavailable). No server, no API key; the model is downloaded once and cached.

Model: [`danelcsb/localagent-tiny-30m-byte`](https://huggingface.co/danelcsb/localagent-tiny-30m-byte). Source: [LocalAgent](https://github.com/sangbumchoi/localagent).

## What it shows (generable dispatch, no fixed-N classifier)

- **Route gate**: a 5-way head (`web_search / computer_use / code / app_action / text`) on the ONNX `hidden` output; the `text` route is **abstention** (answer directly / no tool).
- **Tool selection**: a **dense two-tower selector**. The query tower projects `hidden`, and the result is scored by cosine against a precomputed per-tool description-embedding matrix over the **50-tool** surface (`argmax_j q·tool_matrix[j]`). Adding/removing a tool is adding/removing a row, with no retraining.
- **Grounded arguments**: copied from spans of your prompt via the learned pointer head, so the emitted call is schema-valid by construction.
- **Multi-step plans**: the rollout: pick a tool → ground it → feed back a simulated response → pick the next, until the route head emits `text`.

## How it runs (honest version)

The transformer forward pass runs on **WebGPU** via an exported ONNX graph that emits `logits` and the last `hidden` state.
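
The route gate and dense selector listed under "What it shows" can be sketched in a few lines of plain JavaScript. This is only an illustration under assumptions: the function names, the weight layouts (`routeW`, `routeB`, `queryW`), the purely linear form of the route head, and the index order of the routes are all hypothetical and may not match the real `app.js` or `dispatch_heads.json`.

```javascript
// Hypothetical sketch: names, shapes, and the linear route head are assumptions.

// Dot product of two equal-length numeric arrays.
function dot(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}

// Scale a vector to unit L2 norm (an all-zero vector is returned as a copy).
function normalize(v) {
  const n = Math.sqrt(dot(v, v));
  return n === 0 ? v.slice() : v.map((x) => x / n);
}

// Index of the largest entry (first one wins on ties).
function argmax(xs) {
  let best = 0;
  for (let i = 1; i < xs.length; i++) if (xs[i] > xs[best]) best = i;
  return best;
}

// Multiply a matrix (array of rows) by a vector.
function matVec(m, v) {
  return m.map((row) => dot(row, v));
}

// Route names from the README; their index order here is assumed.
const ROUTES = ["web_search", "computer_use", "code", "app_action", "text"];

// Route gate: 5 logits from `hidden`, then argmax. "text" means no tool call.
function pickRoute(hidden, routeW, routeB) {
  const logits = matVec(routeW, hidden).map((x, i) => x + routeB[i]);
  return ROUTES[argmax(logits)];
}

// Dense selector: project `hidden` with the query tower, then take the
// tool row with the highest cosine similarity.
function selectTool(hidden, queryW, toolMatrix) {
  const q = normalize(matVec(queryW, hidden));
  const scores = toolMatrix.map((row) => dot(q, normalize(row)));
  return argmax(scores);
}
```

Because selection is an argmax over rows, appending a row to `toolMatrix` makes a new tool selectable without touching `queryW`, which is the property the README describes as "no retraining".
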
The **route head**, the **dense selector** (matmul + normalize + argmax over the precomputed tool matrix), the **pointer-copy** grounding, and the **planner loop** are light JavaScript on top: a faithful port of the Python `routes` / `dense_selector` / `pointer_head` pipeline (parity-checked at export: 100% argmax/top-1 agreement). First load fetches `model.fp16.onnx` (~57 MB) and caches it.

## Files

- `index.html` / `style.css`: the UI shell.
- `app.js`: byte tokenizer, onnxruntime-web session (WebGPU + WASM fallback), route+selector dispatch, grounding, and the planner rollout.
- `model.fp16.onnx`, `heads.json`, `meta.json`, `dispatch_heads.json`: the exported inference bundle (**not in the source repo**; deploy artifacts). See `DEPLOY.md` for the exact commands.

## Deploy

See **`DEPLOY.md`** for copy-paste build + push commands. In short: export the bundle from the latest checkpoint and upload the static app + the four bundle files into a `sdk: static` Space:

```bash
python -c "from localagent.inference.export.to_onnx import export_web; \
export_web('runs/tiny-30m-scenarios-best.pt', 'build/web')"
```

`app.js` fetches `model.fp16.onnx` / `heads.json` / `meta.json` / `dispatch_heads.json` relative to the page, so they must sit next to `index.html`. Export is parity-checked vs PyTorch (max |Δlogits| 7.6e-6; route-head & dense-selector argmax/top-1 100% agreement). The graph is standard opset-17; onnxruntime-web falls back per-op to WASM for any op without a WebGPU kernel, with identical results.
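
To illustrate why the bundle files must sit next to `index.html`, and how the WebGPU-then-WASM preference might be expressed, here is a minimal sketch. The helper names (`bundleUrls`, `pickProviders`, `createSession`) are hypothetical and not taken from `app.js`. The `ort` argument stands for the onnxruntime-web module and is passed in explicitly so the helpers can run outside a browser. `InferenceSession.create` with an `executionProviders` list is the onnxruntime-web session API.

```javascript
// File names come from the README; the helpers themselves are hypothetical.
const BUNDLE_FILES = [
  "model.fp16.onnx",
  "heads.json",
  "meta.json",
  "dispatch_heads.json",
];

// Resolve each bundle file relative to the page URL, i.e. next to index.html.
function bundleUrls(pageUrl) {
  const urls = {};
  for (const name of BUNDLE_FILES) urls[name] = new URL(name, pageUrl).href;
  return urls;
}

// Prefer WebGPU when the browser exposes it, keeping WASM as the fallback.
function pickProviders(hasWebGPU) {
  return hasWebGPU ? ["webgpu", "wasm"] : ["wasm"];
}

// `ort` is the onnxruntime-web module, supplied by the caller.
// InferenceSession.create returns a Promise for the session.
function createSession(ort, pageUrl, hasWebGPU) {
  const modelUrl = bundleUrls(pageUrl)["model.fp16.onnx"];
  return ort.InferenceSession.create(modelUrl, {
    executionProviders: pickProviders(hasWebGPU),
  });
}
```

In a browser this could be called as `createSession(ort, document.baseURI, "gpu" in navigator)`. If any of the four files is missing from the Space root, the corresponding relative fetch fails.
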