	---
	license: apache-2.0
	language:
	- en
	- fr
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- agentic
	- function-calling
	- tool-use
	- structured-generation
	- orchestration
	- code-agent
	- mcp
	- edge
	- small-language-model
	base_model: AMFORGE/samg-reasoning
	model-index:
	- name: SAM-G-CobraTooling
	  results:
	  - task:
	      type: agentic-orchestration
	      name: Agentic IDE tool-call orchestration (13 families, held-out)
	    metrics:
	    - type: exact_match
	      value: 78.8
	      name: Exact plan match, aggregate (%)
	    - type: accuracy
	      value: 94.0
	      name: Risk-gate fidelity (%)
	---

	# SAM-G-CobraTooling

	SAM-G-CobraTooling is a 30.3M-parameter model fine-tuned from
	[SAM-G-Reasoning](https://huggingface.co/AMFORGE/samg-reasoning) on 196k
	agentic orchestration traces. It turns a natural-language instruction — or an
	observation from a previous step — into an **ordered, risk-flagged JSON plan of
	tool calls**. It is the local orchestration layer of an agentic IDE: it routes,
	decomposes, tracks state, reacts to exit codes and HTTP status, and emits
	structured tool calls entirely offline. It does not write code; code is
	delegated to a larger model via an `ask_code_model` hand-off. Built by
	AMEFORGE for the CobraBub IDE.

	- Parameters: 30.3M · Footprint: 121 MB fp32 (~30 MB quantized) · Base: SAM-G-Reasoning
	- Fine-tuning: prompt-masked SFT (loss on the plan span only), cosine learning-rate schedule (8e-5), 10k steps, best checkpoint at step 6k
	- Aggregate exact plan-match: 78.8% (held-out, disjoint seed)
	- Lineage: SAM-G → SAM-G-Reasoning → SAM-G-CobraTooling

	## Output format

	```
	<instruction> [ACTION] {"plan":[{"op":...,"args":{...},"risk":"safe|critical"}, ...]}
	<intent> | {"last_op":...,"...":...} [ACTION] {"plan":[ ... ]}   # reactive (observation-driven)
	```

	Every step carries a `risk` flag (`safe` or `critical`) that drives the IDE
	confirmation gate: safe ops run autonomously, critical ops require explicit
	user confirmation.
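A minimal sketch of how an app might consume this format, assuming the decoded text contains the `[ACTION]` marker followed by a single JSON object. The helper names `parse_plan` and `gate` are hypothetical and not part of the model's tooling.

```python
import json

ALLOWED_RISK = {"safe", "critical"}


def parse_plan(generated: str) -> list:
    """Parse the text produced after [ACTION] into an ordered list of steps."""
    # Keep only what follows the last marker, in case the prompt is echoed back.
    _, _, tail = generated.rpartition("[ACTION]")
    # raw_decode stops after the first complete JSON value and ignores trailing text.
    obj, _ = json.JSONDecoder().raw_decode(tail.lstrip())
    steps = obj["plan"]
    for step in steps:
        if not isinstance(step.get("op"), str):
            raise ValueError(f"step without a string 'op': {step!r}")
        if not isinstance(step.get("args"), dict):
            raise ValueError(f"step without an 'args' object: {step!r}")
        if step.get("risk") not in ALLOWED_RISK:
            raise ValueError(f"unexpected risk flag: {step!r}")
    return steps


def gate(steps: list) -> list:
    """Pair each step, in plan order, with whether it needs user confirmation."""
    return [(step, step["risk"] == "critical") for step in steps]


if __name__ == "__main__":
    text = ('push it [ACTION] {"plan":[{"op":"run_command","args":{"cmd":"git status"},'
            '"risk":"safe"},{"op":"git_push","args":{},"risk":"critical"}]}')
    for step, needs_confirmation in gate(parse_plan(text)):
        print(step["op"], "confirm" if needs_confirmation else "auto")
```

Rejecting any step whose `risk` value is missing or unrecognized keeps malformed output from reaching the autonomous path.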

	## What it is good at — and what it is not

	Stress-tested on thirteen families. The pattern mirrors the rest of the SAM-G
	line: the model excels at routing and reaction (short, procedural tasks) and
	is limited on long ordered chains, which are hard to match exactly at 30M
	parameters.

	| Family | Exact % | Type |
	|---|---|---|
	| single_tool (routing) | 100 | routing |
	| retry_loop (exit-code state machine) | 100 | reaction |
	| feedback_react (stdout/stderr) | 100 | reaction |
	| git_workflow (status→add→push, gated) | 100 | procedural |
	| scrape_research (fetch→summarize→act) | 100 | procedural |
	| db_query (SQL, SELECT vs mutation) | 100 | structured call |
	| webhook_wait (async callback) | 92 | async reaction |
	| mcp_call (filesystem/github/postgres) | 83 | structured call |
	| api_call (REST/GraphQL + HTTP state machine) | 75 | structured call |
	| plan_chain (multi-step plans) | 58 | planning |
	| risk_gate (mixed safe/critical plans) | 58 | gated planning |
	| fs_watch (file-change reaction) | 42 | async reaction |
	| build_test_cycle (edit→test→react + hand-off) | 17 | long chain |

	Routing, exit-code reaction, git, scraping and SQL routing are saturated.
	`mcp_call` at 83% makes the model a viable local driver for MCP servers — the
	core capability of a hosted code agent, here running offline. `plan_chain` rose
	from the v1 plateau (0–42%) to 58% after broadening generator coverage.
	`build_test_cycle` remains the hard family: four-to-five ordered ops ending in a
	code-model hand-off, scored by strict exact match — the same long-chain ceiling
	seen with arithmetic in SAM-G-Reasoning. For those, decompose app-side into
	shorter sub-calls.
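The 78.8% aggregate is consistent with an unweighted mean of the thirteen per-family scores in the table above. The check below assumes equal weighting per family, which the card does not state explicitly.

```python
# Per-family exact plan-match scores (%), copied from the table above.
family_exact = {
    "single_tool": 100,
    "retry_loop": 100,
    "feedback_react": 100,
    "git_workflow": 100,
    "scrape_research": 100,
    "db_query": 100,
    "webhook_wait": 92,
    "mcp_call": 83,
    "api_call": 75,
    "plan_chain": 58,
    "risk_gate": 58,
    "fs_watch": 42,
    "build_test_cycle": 17,
}

# Unweighted (macro) average: every family counts equally.
aggregate = sum(family_exact.values()) / len(family_exact)
print(f"{len(family_exact)} families, unweighted mean = {aggregate:.1f}%")
# prints: 13 families, unweighted mean = 78.8%
```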

	## Security: the risk flag is advisory, not a boundary

	The model flags critical ops with 94% fidelity across all families — strong
	for pre-flagging and good UX. It must not be the sole security boundary. A
	30M model will mis-flag a fraction of decisions, and the failure modes are
	asymmetric: a false negative (a critical op flagged `safe`) would auto-run a
	destructive command without confirmation. Integrators must add a
	deterministic backstop: a hard whitelist/blacklist in the app that forces
	`critical` on known-dangerous operations (`rm -rf`, `git push`, `DROP`/`DELETE`,
	external mutating HTTP, MCP write tools, `delete_file`) regardless of the
	model's flag. Treat the model's `risk` field as a fast hint that pre-fills the
	confirmation gate, with the app's deterministic rules as the enforced boundary.
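One way such a backstop could look, as a sketch under stated assumptions: the rule set, the argument keys `sql`, `method` and `tool`, and the MCP tool names are hypothetical illustrations (only `cmd` appears in this card's examples), and a real integration would need rules matched to its own tools.

```python
import re

# Hypothetical rules, loosely based on the dangerous operations listed above.
ALWAYS_CRITICAL_OPS = {"git_push", "delete_file"}
DANGEROUS_CMD = re.compile(r"\brm\s+-\w*r|\bgit\s+push\b")
DANGEROUS_SQL = re.compile(r"\b(DROP|DELETE)\b", re.IGNORECASE)
SAFE_HTTP_METHODS = {"GET", "HEAD", "OPTIONS"}
MCP_READ_ONLY_TOOLS = {"read_file", "list_directory"}  # assumed tool names


def must_be_critical(step: dict) -> bool:
    """Deterministic check that ignores the model's own risk flag."""
    op = step.get("op")
    args = step.get("args") or {}
    if op in ALWAYS_CRITICAL_OPS:
        return True
    if op == "run_command":
        return bool(DANGEROUS_CMD.search(str(args.get("cmd", ""))))
    if op == "db_query":
        return bool(DANGEROUS_SQL.search(str(args.get("sql", ""))))
    if op == "api_call":
        # A missing method is treated as mutating (fail closed).
        return str(args.get("method", "")).upper() not in SAFE_HTTP_METHODS
    if op == "mcp_call":
        # Anything outside the read-only allowlist is treated as a write.
        return args.get("tool") not in MCP_READ_ONLY_TOOLS
    return False


def enforce_backstop(plan: list) -> list:
    """Return a copy of the plan whose risk flags are only ever upgraded."""
    enforced = []
    for step in plan:
        if must_be_critical(step):
            risk = "critical"
        else:
            # A missing flag also falls back to "critical".
            risk = step.get("risk", "critical")
        enforced.append({**step, "risk": risk})
    return enforced
```

The backstop never downgrades a `critical` flag set by the model; it only forces `critical` where the deterministic rules fire, so the model's flag stays a hint and the rules stay the boundary.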

	## Op vocabulary

	Routing/IO: `open_file`, `list_dir`, `run_command`, `scrape`, `summarize`,
	`capture`, `open_app`. Hand-off: `ask_code_model`, `write_file`. Control:
	`retry`, `escalate`, `backoff`, `reauth`, `continue`, `stop`. Integrations:
	`api_call`, `mcp_call`, `db_query`, `webhook_wait`, `fs_watch`, `git_push`.
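Because the vocabulary is closed, an app can reject plans that contain ops outside it before executing anything. The sketch below simply transcribes the list above; the group keys and the helper names `unknown_ops` and `op_group` are hypothetical.

```python
# Op vocabulary transcribed from the list above, grouped the same way.
OP_VOCABULARY = {
    "routing_io": {"open_file", "list_dir", "run_command", "scrape",
                   "summarize", "capture", "open_app"},
    "hand_off": {"ask_code_model", "write_file"},
    "control": {"retry", "escalate", "backoff", "reauth", "continue", "stop"},
    "integrations": {"api_call", "mcp_call", "db_query", "webhook_wait",
                     "fs_watch", "git_push"},
}
KNOWN_OPS = frozenset().union(*OP_VOCABULARY.values())


def unknown_ops(plan: list) -> list:
    """Return the ops in a plan that are not part of the vocabulary."""
    return [step.get("op") for step in plan if step.get("op") not in KNOWN_OPS]


def op_group(op: str):
    """Return the vocabulary group an op belongs to, or None if unknown."""
    for group, ops in OP_VOCABULARY.items():
        if op in ops:
            return group
    return None
```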

	## Intended use

	The local planning/routing/reaction layer of an agentic IDE: decompose an
	instruction into ordered tool calls, react to observations (exit codes, stderr,
	HTTP status, DB row counts, webhook payloads, file-change events), and emit
	structured, risk-flagged plans offline and for free. This covers roughly the
	procedural majority of agentic turns; hard code generation and long exact
	chains are escalated to a larger model via `ask_code_model`.
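A small sketch of how the two prompt shapes could be assembled, assuming the observation is serialized as compact JSON and joined to the intent with a plain `|` separator. `build_prompt` is a hypothetical helper, not something shipped with the model.

```python
import json
from typing import Optional

ACTION_MARKER = "[ACTION]"


def build_prompt(instruction: str, observation: Optional[dict] = None) -> str:
    """Build a routing prompt, or a reactive one when an observation is given."""
    text = instruction.strip()
    if observation is not None:
        # Compact separators keep the observation short, e.g. {"status":429}.
        text += " | " + json.dumps(observation, separators=(",", ":"))
    return f"{text} {ACTION_MARKER}"


if __name__ == "__main__":
    print(build_prompt("open src/main.js and run the tests"))
    print(build_prompt("rate limited, back off and retry",
                       {"last_op": "api_call", "status": 429}))
```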

	## Usage

	```python
	import sentencepiece as spm
	import torch

	# Load the SentencePiece tokenizer shipped with the model.
	sp = spm.SentencePieceProcessor()
	sp.Load("samg_tokenizer.model")

	# routing
	prompt = "open src/main.js and run the tests [ACTION]"
	# -> {"plan":[{"op":"open_file","args":{"path":"src/main.js"},"risk":"safe"},
	#             {"op":"run_command","args":{"cmd":"pytest"},"risk":"safe"}]}

	# reactive: HTTP 429 -> back off and retry
	prompt = 'rate limited, back off and retry | {"last_op":"api_call","status":429} [ACTION]'
	# -> {"plan":[{"op":"backoff","args":{"seconds":30},"risk":"safe"},
	#             {"op":"retry","args":{"attempt":2},"risk":"safe"}]}

	# Encode the most recent prompt as a batch of size 1.
	ids = torch.tensor([sp.EncodeAsIds(prompt)])
	# greedy-decode the [ACTION] span -> structured plan JSON
	```
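The greedy-decoding step mentioned in the last comment can be sketched independently of the model. This card does not show how the weights are loaded, so the sketch takes a hypothetical `next_logits` callable that maps the token ids so far to one score per vocabulary entry; `eos_id` and `max_new_tokens` are likewise assumptions.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    prompt_ids: Sequence[int],
    next_logits: Callable[[List[int]], Sequence[float]],
    eos_id: int,
    max_new_tokens: int = 128,
) -> List[int]:
    """Repeatedly append the highest-scoring next token until EOS or the cap."""
    ids = list(prompt_ids)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        logits = next_logits(ids)
        # argmax over the vocabulary; ties resolve to the lowest token id.
        token = max(range(len(logits)), key=logits.__getitem__)
        if token == eos_id:
            break
        ids.append(token)
        generated.append(token)
    return generated
```

The returned ids cover only the newly generated span, so decoding them with the tokenizer would yield the plan JSON without the prompt.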

	## Limitations

	- `build_test_cycle` (17%) and `plan_chain`/`risk_gate` (58% each) plateau
	under exact-match scoring because long, strictly ordered plans are hard at
	30M parameters; decompose long plans app-side into shorter sub-calls.
	- The `risk` flag is advisory (94% fidelity); enforce a deterministic backstop
	in the app, as above.
	- Traces are synthetic, drawn from the training family distribution with a
	disjoint evaluation seed; coverage reflects the generator, not arbitrary
	real-world tool APIs.
	- Not a general assistant and does not write code; it orchestrates and hands
	off. Inherits the base model's knowledge limits.

	## Citation

	```bibtex
	@misc{samgcobratooling2026,
	  title  = {SAM-G-CobraTooling: Risk-Flagged Agentic Tool-Call Orchestration at 30M Parameters},
	  author = {AMEFORGE Lab},
	  year   = {2026}
	}
	```