---
license: apache-2.0
language:
- en
- fr
library_name: transformers
pipeline_tag: text-generation
tags:
- agentic
- function-calling
- tool-use
- structured-generation
- orchestration
- code-agent
- mcp
- edge
- small-language-model
base_model: AMFORGE/samg-reasoning
model-index:
- name: SAM-G-CobraTooling
  results:
  - task:
      type: agentic-orchestration
      name: Agentic IDE tool-call orchestration (13 families, held-out)
    metrics:
    - type: exact_match
      value: 78.8
      name: Exact plan match, aggregate (%)
    - type: accuracy
      value: 94.0
      name: Risk-gate fidelity (%)
---
# SAM-G-CobraTooling
**SAM-G-CobraTooling** is a 30.3M-parameter model fine-tuned from
[SAM-G-Reasoning](https://huggingface.co/AMFORGE/samg-reasoning) on 196k
agentic orchestration traces. It turns a natural-language instruction — or an
observation from a previous step — into an **ordered, risk-flagged JSON plan of
tool calls**. It is the local orchestration layer of an agentic IDE: it routes,
decomposes, tracks state, reacts to exit codes and HTTP status, and emits
structured tool calls entirely offline. It does **not** write code; code is
delegated to a larger model via an `ask_code_model` hand-off. Built by
**AMEFORGE** for the CobraBub IDE.
- **Parameters:** 30.3M · **Footprint:** 121 MB fp32 (~30 MB quantized) · **Base:** SAM-G-Reasoning
- **Fine-tuning:** prompt-masked SFT (loss on the plan span only), cosine learning-rate schedule (8e-5), 10k steps, best checkpoint at 6k
- **Aggregate exact plan-match:** 78.8% (held-out, disjoint seed)
- **Lineage:** SAM-G → SAM-G-Reasoning → SAM-G-CobraTooling
## Output format
```
<instruction> [ACTION] {"plan":[{"op":...,"args":{...},"risk":"safe|critical"}, ...]}
<intent> | {"last_op":...,"...":...} [ACTION] {"plan":[ ... ]} # reactive (observation-driven)
```
Every step carries a `risk` flag (`safe` or `critical`) that drives the IDE
confirmation gate: safe ops run autonomously, critical ops require explicit
user confirmation.
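A minimal sketch of how an app might consume this output, assuming the decoded text contains the `[ACTION]` marker followed by one JSON object. The helper names `parse_plan` and `needs_confirmation` are hypothetical and not part of any shipped API.

```python
import json

ACTION_MARKER = "[ACTION]"
VALID_RISKS = {"safe", "critical"}


def parse_plan(decoded: str) -> list:
    """Extract and validate the plan that follows the [ACTION] marker."""
    _, sep, tail = decoded.partition(ACTION_MARKER)
    if not sep:
        raise ValueError("no [ACTION] marker in model output")
    # raw_decode parses the first JSON value and ignores trailing text.
    obj, _ = json.JSONDecoder().raw_decode(tail.lstrip())
    plan = obj.get("plan") if isinstance(obj, dict) else None
    if not isinstance(plan, list):
        raise ValueError("'plan' must be a list of steps")
    for step in plan:
        if not isinstance(step, dict) or not {"op", "args", "risk"} <= step.keys():
            raise ValueError(f"malformed step: {step!r}")
        if step["risk"] not in VALID_RISKS:
            raise ValueError(f"unknown risk flag: {step['risk']!r}")
    return plan


def needs_confirmation(plan: list) -> list:
    """Return the steps the confirmation gate should show to the user."""
    return [step for step in plan if step["risk"] == "critical"]
```

Rejecting a malformed plan outright, rather than running the valid prefix, keeps a truncated generation from executing half of an ordered sequence.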
## What it is good at — and what it is not
The model was stress-tested on thirteen task families. The pattern mirrors the
rest of the SAM-G line: it excels at **routing and reaction** (short,
procedural) and, at 30M parameters, struggles with **long ordered chains** that
must match exactly.
| Family | Exact % | Type |
|---|---|---|
| single_tool (routing) | 100 | routing |
| retry_loop (exit-code state machine) | 100 | reaction |
| feedback_react (stdout/stderr) | 100 | reaction |
| git_workflow (status→add→push, gated) | 100 | procedural |
| scrape_research (fetch→summarize→act) | 100 | procedural |
| db_query (SQL, SELECT vs mutation) | 100 | structured call |
| webhook_wait (async callback) | 92 | async reaction |
| **mcp_call (filesystem/github/postgres)** | **83** | **structured call** |
| api_call (REST/GraphQL + HTTP state machine) | 75 | structured call |
| plan_chain (multi-step plans) | 58 | planning |
| risk_gate (mixed safe/critical plans) | 58 | gated planning |
| fs_watch (file-change reaction) | 42 | async reaction |
| build_test_cycle (edit→test→react + hand-off) | 17 | long chain |

The six families at 100% (routing, exit-code and stdout/stderr reaction, git,
scraping and SQL routing) are saturated. `mcp_call` at 83% makes the model a
viable local driver for MCP servers, which is the core capability of a hosted
code agent, here running offline. `plan_chain` rose from the v1 plateau (0–42%)
to 58% after broadening generator coverage.
`build_test_cycle` remains the hard family: four-to-five ordered ops ending in a
code-model hand-off, scored by strict exact match — the same long-chain ceiling
seen with arithmetic in SAM-G-Reasoning. For those, decompose app-side into
shorter sub-calls.
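One way app-side decomposition could look, as a rough sketch: split the instruction into shorter sub-instructions, plan each one separately, and concatenate the steps in order. The splitting rule here is deliberately naive (it cuts on `;` and on the word "then"), and `plan_fn` is a hypothetical callable that runs the model on one prompt and returns the parsed list of steps.

```python
import re
from typing import Callable

# Naive connectors: "; then", ", then", " then " and a bare ";".
# A real app would more likely split along its own task structure.
_CONNECTORS = re.compile(r"(?:\s*[;,]\s*|\s+)then\s+|\s*;\s*", flags=re.IGNORECASE)


def split_instruction(instruction: str) -> list:
    """Split a long instruction into shorter sub-instructions."""
    return [part.strip() for part in _CONNECTORS.split(instruction) if part.strip()]


def plan_in_chunks(instruction: str, plan_fn: Callable[[str], list]) -> list:
    """Plan each sub-instruction separately and concatenate the steps in order."""
    steps = []
    for sub in split_instruction(instruction):
        # plan_fn is assumed to run the model and return a list of step dicts.
        steps.extend(plan_fn(f"{sub} [ACTION]"))
    return steps
```

Each sub-call then stays in the short, procedural regime where the table above shows the model is strongest.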
## Security: the risk flag is advisory, not a boundary
The model flags critical ops with **94% fidelity** across all families — strong
for pre-flagging and good UX. **It must not be the sole security boundary.** A
30M model will mis-flag a fraction of decisions, and the failure modes are
asymmetric: a false negative (a critical op flagged `safe`) would auto-run a
destructive command without confirmation. Integrators must add a
**deterministic backstop**: a hard whitelist/blacklist in the app that forces
`critical` on known-dangerous operations (`rm -rf`, `git push`, `DROP`/`DELETE`,
external mutating HTTP, MCP write tools, `delete_file`) regardless of the
model's flag. Treat the model's `risk` field as a fast hint that pre-fills the
confirmation gate, with the app's deterministic rules as the enforced boundary.
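The backstop described above could be sketched roughly as follows. Only the `cmd` argument key appears in this card; the `sql` and `method` keys are assumptions about the argument schema, and MCP write tools would need their own rule once their tool names are known.

```python
import re

# Ops forced to critical regardless of the model's flag.
ALWAYS_CRITICAL_OPS = {"git_push", "delete_file"}
# Any recursive rm (covers rm -rf, rm -fr) or git push inside a shell command.
DANGEROUS_CMD = re.compile(r"\brm\s+-\w*r|\bgit\s+push\b", re.IGNORECASE)
DANGEROUS_SQL = re.compile(r"\b(?:DROP|DELETE)\b", re.IGNORECASE)
SAFE_HTTP_METHODS = {"GET", "HEAD", "OPTIONS"}


def enforce_risk(step: dict) -> dict:
    """Return a copy of the step, with risk forced to 'critical' if a rule matches."""
    op = step.get("op")
    args = step.get("args") if isinstance(step.get("args"), dict) else {}
    forced = bool(
        op in ALWAYS_CRITICAL_OPS
        or (op == "run_command" and DANGEROUS_CMD.search(str(args.get("cmd", ""))))
        # "sql" and "method" are assumed key names, not confirmed by the card.
        or (op == "db_query" and DANGEROUS_SQL.search(str(args.get("sql", ""))))
        # A missing method fails closed: it is treated as mutating.
        or (op == "api_call" and str(args.get("method", "")).upper() not in SAFE_HTTP_METHODS)
    )
    # The backstop only escalates; it never downgrades critical to safe.
    return {**step, "risk": "critical"} if forced else dict(step)
```

Applying `enforce_risk` to every step before the confirmation gate means a model false negative can cost at most one extra prompt, never a silent destructive run.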
## Op vocabulary
Routing/IO: `open_file`, `list_dir`, `run_command`, `scrape`, `summarize`,
`capture`, `open_app`. Hand-off: `ask_code_model`, `write_file`. Control:
`retry`, `escalate`, `backoff`, `reauth`, `continue`, `stop`. Integrations:
`api_call`, `mcp_call`, `db_query`, `webhook_wait`, `fs_watch`, `git_push`.
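Since the vocabulary is closed, an app can reject plans that contain an op outside it before dispatching anything. A small sketch, with the grouping keys and the `unknown_ops` helper chosen here for illustration:

```python
# The op vocabulary exactly as listed above, grouped by category.
OP_VOCABULARY = {
    "routing_io": {"open_file", "list_dir", "run_command", "scrape",
                   "summarize", "capture", "open_app"},
    "hand_off": {"ask_code_model", "write_file"},
    "control": {"retry", "escalate", "backoff", "reauth", "continue", "stop"},
    "integrations": {"api_call", "mcp_call", "db_query", "webhook_wait",
                     "fs_watch", "git_push"},
}
KNOWN_OPS = frozenset().union(*OP_VOCABULARY.values())


def unknown_ops(plan: list) -> list:
    """Return ops in the plan that fall outside the documented vocabulary."""
    return [step.get("op") for step in plan if step.get("op") not in KNOWN_OPS]
```

An empty result from `unknown_ops(plan)` means every step maps to a known handler.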
## Intended use
The local planning/routing/reaction layer of an agentic IDE: decompose an
instruction into ordered tool calls, react to observations (exit codes, stderr,
HTTP status, DB row counts, webhook payloads, file-change events), and emit
structured, risk-flagged plans offline and for free. This covers roughly the
procedural majority of agentic turns; hard code generation and long exact
chains are escalated to a larger model via `ask_code_model`.
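The two prompt shapes from the output format (plain instruction, or intent plus observation) can be assembled with a small helper. This is a sketch: `build_prompt` is a hypothetical name, and the observation keys beyond `last_op` and `status` are not specified in this card.

```python
import json
from typing import Optional


def build_prompt(intent: str, observation: Optional[dict] = None) -> str:
    """Format an instruction, or an intent plus observation, as a model prompt."""
    if observation is None:
        return f"{intent} [ACTION]"
    # Compact separators reproduce the spacing of the card's reactive example.
    obs = json.dumps(observation, separators=(",", ":"))
    return f"{intent} | {obs} [ACTION]"
```

For instance, `build_prompt("rate limited, back off and retry", {"last_op": "api_call", "status": 429})` yields the reactive prompt shown in the Usage section.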
## Usage
```python
import sentencepiece as spm
import torch

# Load the SentencePiece tokenizer file.
sp = spm.SentencePieceProcessor()
sp.Load("samg_tokenizer.model")

# Routing: a plain instruction followed by the [ACTION] marker.
prompt = "open src/main.js and run the tests [ACTION]"
# -> {"plan":[{"op":"open_file","args":{"path":"src/main.js"},"risk":"safe"},
#             {"op":"run_command","args":{"cmd":"pytest"},"risk":"safe"}]}

# Reactive: intent | observation JSON. Here HTTP 429 -> back off and retry.
prompt = 'rate limited, back off and retry | {"last_op":"api_call","status":429} [ACTION]'
# -> {"plan":[{"op":"backoff","args":{"seconds":30},"risk":"safe"},
#             {"op":"retry","args":{"attempt":2},"risk":"safe"}]}

# Encode the prompt as a batch of one; model loading is not shown here.
ids = torch.tensor([sp.EncodeAsIds(prompt)])
# Greedy-decode the span after [ACTION] to get the structured plan JSON.
```
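The greedy-decoding step left as a comment above can be sketched independently of the model class, which this card does not show. Here `next_logits` is a hypothetical callable that takes the token ids so far and returns the next-token logits; with the real model it would wrap a forward pass.

```python
from typing import Callable, Sequence


def greedy_decode(
    next_logits: Callable[[list], Sequence[float]],
    prompt_ids: list,
    eos_id: int,
    max_new_tokens: int = 128,
) -> list:
    """Append the arg-max token until eos_id or the token budget is reached."""
    ids = list(prompt_ids)
    generated = []
    for _ in range(max_new_tokens):
        logits = next_logits(ids)
        # Greedy choice: index of the largest logit.
        token = max(range(len(logits)), key=logits.__getitem__)
        if token == eos_id:
            break
        ids.append(token)
        generated.append(token)
    return generated
```

The returned ids cover only the generated span, so they could be passed to `sp.DecodeIds(...)` to recover the plan JSON as text.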
## Limitations
- `build_test_cycle` (17%) and `plan_chain`/`risk_gate` (58% each) plateau under
exact match because long, strictly ordered plans are hard at 30M parameters;
decompose long plans app-side into shorter sub-calls.
- The `risk` flag is advisory (94% fidelity); enforce a deterministic backstop
in the app, as above.
- Traces are synthetic: the held-out set is drawn from the same family
distribution as training, with a disjoint seed. Coverage reflects the
generator, not arbitrary real-world tool APIs.
- Not a general assistant and does not write code; it orchestrates and hands
off. Inherits the base model's knowledge limits.
## Citation
```bibtex
@misc{samgcobratooling2026,
title = {SAM-G-CobraTooling: Risk-Flagged Agentic Tool-Call Orchestration at 30M Parameters},
author = {AMEFORGE Lab},
year = {2026}
}
```