---
license: apache-2.0
language:
- en
- fr
library_name: transformers
pipeline_tag: text-generation
tags:
- agentic
- function-calling
- tool-use
- structured-generation
- orchestration
- code-agent
- mcp
- edge
- small-language-model
base_model: AMFORGE/samg-reasoning
model-index:
- name: SAM-G-CobraTooling
  results:
  - task:
      type: agentic-orchestration
      name: Agentic IDE tool-call orchestration (13 families, held-out)
    metrics:
    - type: exact_match
      value: 78.8
      name: Exact plan match, aggregate (%)
    - type: accuracy
      value: 94.0
      name: Risk-gate fidelity (%)
---

# SAM-G-CobraTooling

**SAM-G-CobraTooling** is a 30.3M-parameter model fine-tuned from [SAM-G-Reasoning](https://huggingface.co/AMFORGE/samg-reasoning) on 196k agentic orchestration traces. It turns a natural-language instruction, or an observation from a previous step, into an **ordered, risk-flagged JSON plan of tool calls**.

It is the local orchestration layer of an agentic IDE: it routes, decomposes, tracks state, reacts to exit codes and HTTP status, and emits structured tool calls entirely offline. It does **not** write code; code is delegated to a larger model via an `ask_code_model` hand-off. Built by **AMEFORGE** for the CobraBub IDE.

- **Parameters:** 30.3M · **Footprint:** 121 MB fp32 (~30 MB quantized) · **Base:** SAM-G-Reasoning
- **Fine-tuning:** prompt-masked SFT (loss on the plan span only), cosine 8e-5, 10k steps, best at 6k
- **Aggregate exact plan-match:** 78.8% (held-out, disjoint seed)
- **Lineage:** SAM-G → SAM-G-Reasoning → SAM-G-CobraTooling

## Output format

```
[ACTION] {"plan":[{"op":...,"args":{...},"risk":"safe|critical"}, ...]}
| {"last_op":...,"...":...} [ACTION] {"plan":[ ... ]}   # reactive (observation-driven)
```

Every step carries a `risk` flag (`safe` or `critical`) that drives the IDE confirmation gate: safe ops run autonomously, critical ops require explicit user confirmation.

## What it is good at, and what it is not

Stress-tested on thirteen families.
The pattern mirrors the rest of the SAM-G line: it excels at **routing and reaction** (short, procedural) and is limited on **long ordered chains** that must match exactly at 30M parameters.

| Family | Exact % | Type |
|---|---|---|
| single_tool (routing) | 100 | routing |
| retry_loop (exit-code state machine) | 100 | reaction |
| feedback_react (stdout/stderr) | 100 | reaction |
| git_workflow (status→add→push, gated) | 100 | procedural |
| scrape_research (fetch→summarize→act) | 100 | procedural |
| db_query (SQL, SELECT vs mutation) | 100 | structured call |
| webhook_wait (async callback) | 92 | async reaction |
| **mcp_call (filesystem/github/postgres)** | **83** | **structured call** |
| api_call (REST/GraphQL + HTTP state machine) | 75 | structured call |
| plan_chain (multi-step plans) | 58 | planning |
| risk_gate (mixed safe/critical plans) | 58 | gated planning |
| fs_watch (file-change reaction) | 42 | async reaction |
| build_test_cycle (edit→test→react + hand-off) | 17 | long chain |

Routing, exit-code reaction, git, scraping and SQL routing are saturated. `mcp_call` at 83% makes the model a viable local driver for MCP servers, which is the core capability of a hosted code agent, here running offline. `plan_chain` rose from the v1 plateau (0–42%) to 58% after broadening generator coverage. `build_test_cycle` remains the hard family: four-to-five ordered ops ending in a code-model hand-off, scored by strict exact match. This is the same long-chain ceiling seen with arithmetic in SAM-G-Reasoning. For those, decompose app-side into shorter sub-calls.

## Security: the risk flag is advisory, not a boundary

The model flags critical ops with **94% fidelity** across all families, which is strong for pre-flagging and good for UX. **It must not be the sole security boundary.** A 30M model will mis-flag a fraction of decisions, and the failure modes are asymmetric: a false negative (a critical op flagged `safe`) would auto-run a destructive command without confirmation.
Integrators must add a **deterministic backstop**: a hard whitelist/blacklist in the app that forces `critical` on known-dangerous operations (`rm -rf`, `git push`, `DROP`/`DELETE`, external mutating HTTP, MCP write tools, `delete_file`) regardless of the model's flag. Treat the model's `risk` field as a fast hint that pre-fills the confirmation gate, with the app's deterministic rules as the enforced boundary.

## Op vocabulary

- Routing/IO: `open_file`, `list_dir`, `run_command`, `scrape`, `summarize`, `capture`, `open_app`.
- Hand-off: `ask_code_model`, `write_file`.
- Control: `retry`, `escalate`, `backoff`, `reauth`, `continue`, `stop`.
- Integrations: `api_call`, `mcp_call`, `db_query`, `webhook_wait`, `fs_watch`, `git_push`.

## Intended use

The local planning/routing/reaction layer of an agentic IDE: decompose an instruction into ordered tool calls, react to observations (exit codes, stderr, HTTP status, DB row counts, webhook payloads, file-change events), and emit structured, risk-flagged plans offline and for free. This covers roughly the procedural majority of agentic turns; hard code generation and long exact chains are escalated to a larger model via `ask_code_model`.
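The deterministic backstop described in the Security section could look something like the minimal sketch below. It is an illustration, not part of the released tooling: the function names `must_confirm` and `enforce_backstop`, the argument keys `sql`, `method` and `tool`, the regular expressions, and the MCP tool names are all assumptions made for this example. Only `op`, `args`, `risk` and the `cmd` argument of `run_command` appear in this card.

```python
import json
import re

# Ops forced to critical no matter what the model says (taken from the Security section).
ALWAYS_CRITICAL_OPS = {"git_push", "delete_file"}

# Shell commands that always need confirmation: `rm` with both -r and -f, or `git push`.
DANGEROUS_CMD = re.compile(r"\brm\s+-(?=\w*r)(?=\w*f)\w+|\bgit\s+push\b", re.IGNORECASE)

# A single read-only SELECT statement; anything else is treated as a mutation.
READ_ONLY_SQL = re.compile(r"^\s*SELECT\b[^;]*;?\s*$", re.IGNORECASE)

SAFE_HTTP_METHODS = {"GET", "HEAD", "OPTIONS"}

# Placeholder allowlist of read-only MCP tools (hypothetical names).
MCP_READ_ONLY_TOOLS = {"read_file", "list_directory"}


def must_confirm(step: dict) -> bool:
    """Return True when the app's own rules require confirmation for this step."""
    op = step.get("op")
    args = step.get("args") or {}
    if op in ALWAYS_CRITICAL_OPS:
        return True
    if op == "run_command":
        return bool(DANGEROUS_CMD.search(str(args.get("cmd", ""))))
    if op == "db_query":  # "sql" is an assumed argument key
        return not READ_ONLY_SQL.match(str(args.get("sql", "")))
    if op == "api_call":  # "method" is an assumed argument key
        return str(args.get("method", "")).upper() not in SAFE_HTTP_METHODS
    if op == "mcp_call":  # "tool" is an assumed argument key
        return args.get("tool") not in MCP_READ_ONLY_TOOLS
    return False


def enforce_backstop(plan_json: str) -> dict:
    """Parse the model's plan and upgrade risk flags; flags are never downgraded."""
    plan = json.loads(plan_json)
    for step in plan.get("plan", []):
        # Unknown or missing risk values also fail closed to "critical".
        if must_confirm(step) or step.get("risk") not in ("safe", "critical"):
            step["risk"] = "critical"
    return plan
```

The sketch fails closed: database, HTTP and MCP calls are treated as critical unless they match an explicit read-only rule, and the model's `risk` value is only ever upgraded. The command patterns are deliberately narrow and would need extending for a real deployment.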
## Usage

```python
import sentencepiece as spm
import torch

# Load the SentencePiece tokenizer shipped with the model.
sp = spm.SentencePieceProcessor()
sp.Load("samg_tokenizer.model")

# routing
prompt = "open src/main.js and run the tests [ACTION]"
# -> {"plan":[{"op":"open_file","args":{"path":"src/main.js"},"risk":"safe"},
#             {"op":"run_command","args":{"cmd":"pytest"},"risk":"safe"}]}

# reactive: HTTP 429 -> back off and retry
prompt = 'rate limited, back off and retry | {"last_op":"api_call","status":429} [ACTION]'
# -> {"plan":[{"op":"backoff","args":{"seconds":30},"risk":"safe"},
#             {"op":"retry","args":{"attempt":2},"risk":"safe"}]}

# Encode the most recent prompt (the reactive one) as a batch of size 1.
ids = torch.tensor([sp.EncodeAsIds(prompt)])
# greedy-decode the [ACTION] span -> structured plan JSON
```

## Limitations

- Exact match plateaus on `build_test_cycle` (17%) and on `plan_chain`/`risk_gate` (58% each) because long, strictly ordered plans are hard at 30M parameters; decompose long plans app-side into shorter sub-calls.
- The `risk` flag is advisory (94% fidelity); enforce a deterministic backstop in the app, as above.
- Traces are synthetic, drawn from the training family distribution with a disjoint evaluation seed; coverage reflects the generator, not arbitrary real-world tool APIs.
- The model is not a general assistant and does not write code; it orchestrates and hands off. It inherits the base model's knowledge limits.

## Citation

```bibtex
@misc{samgcobratooling2026,
  title  = {SAM-G-CobraTooling: Risk-Flagged Agentic Tool-Call Orchestration at 30M Parameters},
  author = {AMEFORGE Lab},
  year   = {2026}
}
```