	---
	license: bsl-1.0
	language:
	- en
	- fr
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- structured-generation
	- function-calling
	- tool-use
	- json
	- edge
	- offline
	- robotics
	- iot
	- agentic
	- small-language-model
	model-index:
	- name: SAM-G
	  results:
	  - task:
	      type: structured-action-generation
	      name: Instruction-to-JSON (10 domains, zero-shot)
	    metrics:
	    - type: json_valid
	      value: 100
	      name: Valid JSON (%)
	    - type: exact_match
	      value: 76
	      name: Exact match (%)
	    - type: exact_match_fr
	      value: 77
	      name: Exact match, French (%)
	  - task:
	      type: text-generation
	      name: Language modeling (FineWeb-Edu held-out)
	    metrics:
	    - type: bits_per_byte
	      value: 1.179
	      name: Bits per byte
	---

	# SAM-G

	SAM-G is a 30.3M-parameter dual-mode language model for **offline structured
	action generation**. Given a natural-language instruction it emits compact,
	schema-valid JSON for ten domains; given a question it emits free text. Mode
	selection is learned, not prompted. Built by AMEFORGE for robotics, IoT and
	embedded deployment where hosted-LLM APIs are too costly, too slow, or
	unavailable.

	- Parameters: 30.3M · Footprint: 121 MB fp32 (~30 MB int8)
	- Context: 1024 tokens · Languages: English, French (actions)
	- Throughput: ~235 tok/s, 16 ms first-token (single GPU); runs on a
	Raspberry-Pi-class CPU
	- Released: model weights + inference tokenizer. Training pipeline, data
	generators and architecture are proprietary.
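
The footprint figures follow directly from the parameter count. A minimal sketch of that arithmetic, assuming 1 MB = 10^6 bytes and ignoring file-format overhead (the helper name `footprint_mb` is illustrative, not part of the release):

```python
def footprint_mb(n_params: float, bytes_per_param: float) -> float:
    """Weight storage in megabytes (1 MB = 10**6 bytes), ignoring file overhead."""
    return n_params * bytes_per_param / 1e6


N_PARAMS = 30.3e6  # parameter count stated in this card

print(round(footprint_mb(N_PARAMS, 4)))  # fp32 uses 4 bytes per parameter -> 121
print(round(footprint_mb(N_PARAMS, 1)))  # int8 uses 1 byte per parameter -> 30
```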

	## Two modes

	| Input | Model emits |
	|---|---|
	| `turn on the kitchen lamp` | `[ACTION] {"domain":"home","op":"set_state","params":{"device":"lamp","name":"kitchen","state":"on"}}` |
	| `what is a mutex` | `[CHAT] A mutex is a lock that allows one thread at a time.` |

	Domains: `ros`, `http`, `mqtt`, `db`, `workflow`, `ecommerce`, `vehicle`,
	`home`, `cal`, `file`.
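
Downstream code usually needs to branch on the mode tag before doing anything else. The sketch below assumes the decoded output starts with a literal `[ACTION]` or `[CHAT]` tag, as in the table above; `split_mode` is a hypothetical helper, not part of the release.

```python
import json

# The ten action domains listed in this card.
DOMAINS = {"ros", "http", "mqtt", "db", "workflow",
           "ecommerce", "vehicle", "home", "cal", "file"}


def split_mode(output: str):
    """Return ("action", dict) or ("chat", str) for one decoded model output."""
    text = output.strip()
    if text.startswith("[ACTION]"):
        payload = json.loads(text[len("[ACTION]"):])
        domain = payload.get("domain") if isinstance(payload, dict) else None
        if not isinstance(domain, str) or domain not in DOMAINS:
            raise ValueError("unknown or missing domain")
        return "action", payload
    if text.startswith("[CHAT]"):
        return "chat", text[len("[CHAT]"):].strip()
    raise ValueError("output carries no mode tag")
```

If the tag is supplied as part of the prompt, as in the Usage example, the decoded continuation may not repeat it; in that case parse the continuation as JSON directly.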

	## Benchmark

	SAM-G is evaluated zero-shot in its native format; baselines run 3-shot
	through their chat template with a system instruction. `bpb` (bits per byte)
	is reported because it is tokenizer-fair: per-token perplexity is not
	comparable across vocabularies. `exact/M` is action exact-match (%) divided by
	parameters in millions, which serves as the efficiency axis. Arrows mark the
	better direction.

	| Model | Params | bpb ↓ | JSON valid % | Exact % | Exact FR % | Cloze % | MB | tok/s | exact/M ↑ |
	|---|---|---|---|---|---|---|---|---|---|
	| SAM-G | 30.3M | 1.179 | 100 | 76 | 77 | 83 | 121 | 235 | 2.51 |
	| Pythia-70M | 70M | 1.674 | 2 | 0 | 0 | 75 | 141 | 120 | 0.00 |
	| Qwen2.5-0.5B-Instruct | 494M | 0.814 | 99 | 25 | 7 | 96 | 988 | 27 | 0.05 |
	| SmolLM2-360M-Instruct | 362M | 0.812 | 96 | 14 | 0 | 96 | 724 | 21 | 0.04 |
	| Qwen2.5-1.5B-Instruct | 889M | 0.753 | 98 | 21 | 0 | 96 | 444* | 13 | 0.02 |

	<sub>*Qwen2.5-1.5B was loaded in 4-bit; its 889M figure is presumably the
	packed 4-bit tensor count, since the model nominally has about 1.5B
	parameters. Larger general models lead on bits-per-byte and cloze (by the
	Params column they are 12–30× bigger, and they are trained for general
	knowledge); SAM-G leads decisively on structured action, French actions,
	footprint, speed, and exact-match per parameter. Notably, Qwen2.5-1.5B scores
	below Qwen2.5-0.5B on action exact-match (21 vs. 25), which suggests that
	capability here comes from domain specialization rather than scale.</sub>
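
The two derived metrics are simple to reproduce. The sketch below uses the standard definition of bits per byte (summed negative log-likelihood in nats, converted to bits and divided by the UTF-8 byte count of the evaluated text) and recomputes `exact/M` from the table. It assumes the evaluation harness sums NLL in nats; the card does not specify the harness, and the function names are illustrative.

```python
import math


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed NLL (nats) over a text into bits per UTF-8 byte."""
    return total_nll_nats / (math.log(2) * n_bytes)


def exact_per_million(exact_pct: float, params_millions: float) -> float:
    """Exact-match percentage divided by parameter count in millions."""
    return exact_pct / params_millions


# (Exact %, Params in millions) copied from the benchmark table.
rows = {
    "SAM-G": (76, 30.3),
    "Qwen2.5-0.5B-Instruct": (25, 494),
    "SmolLM2-360M-Instruct": (14, 362),
}
for name, (exact, params_m) in rows.items():
    print(name, round(exact_per_million(exact, params_m), 2))
# SAM-G 2.51
# Qwen2.5-0.5B-Instruct 0.05
# SmolLM2-360M-Instruct 0.04
```

Because the denominator is bytes rather than tokens, a model with a larger vocabulary gains no advantage from emitting fewer tokens for the same text.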

	## Per-domain exact match (%)

	| ros | http | mqtt | db | workflow | ecommerce | vehicle | home | cal | file |
	|---|---|---|---|---|---|---|---|---|---|
	| 0 | 100 | 100 | 100 | 60 | 100 | 100 | 50 | 80 | 60 |

	All general baselines score 0 on most domains, succeeding only partially on the
	most generic ones (home, cal). `ros` (floating-point fields) is SAM-G's weakest
	schema and benefits most from additional training data.

	## Usage

	```python
	import sentencepiece as spm
	import torch

	# Load the released inference tokenizer (samg_tokenizer.model).
	sp = spm.SentencePieceProcessor()
	sp.Load("samg_tokenizer.model")

	# The instruction is followed by the [ACTION] tag.
	prompt = "publish 21.5 on sensors/temp qos 1 [ACTION]"
	ids = torch.tensor([sp.EncodeAsIds(prompt)])  # shape: (1, sequence_length)

	# Load the released weights, greedy-decode from `ids` until EOS, then call
	# sp.DecodeIds(...) on the generated token ids. Expected result:
	# {"domain":"mqtt","op":"publish","params":{"topic":"sensors/temp","payload":21.5,"qos":1}}
	```
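
The greedy-decoding step left as a comment above can be sketched independently of the model class. In this sketch `next_logits` is a hypothetical callable standing in for a forward pass that returns next-token scores; the toy function exists only to make the example runnable. The 1024-token limit is the context length stated in this card.

```python
from typing import Callable, List, Sequence


def greedy_decode(next_logits: Callable[[Sequence[int]], Sequence[float]],
                  prompt_ids: Sequence[int],
                  eos_id: int,
                  max_context: int = 1024) -> List[int]:
    """Append the highest-scoring token until EOS or the context limit."""
    ids = list(prompt_ids)
    generated: List[int] = []
    while len(ids) < max_context:
        logits = next_logits(ids)
        token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if token == eos_id:
            break
        ids.append(token)
        generated.append(token)
    return generated


def toy_logits(ids: Sequence[int]) -> List[float]:
    """Stand-in model over a 6-token vocabulary: always predicts last id + 1."""
    nxt = min(ids[-1] + 1, 5)
    return [1.0 if i == nxt else 0.0 for i in range(6)]


print(greedy_decode(toy_logits, [1], eos_id=5))  # [2, 3, 4]
```

With the real tokenizer, `sp.eos_id()` supplies `eos_id`, and the returned ids can be passed to `sp.DecodeIds(...)`.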

	Always parse output as JSON and validate against your schema before execution.
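
A minimal validation sketch using only the standard library. The `REQUIRED_PARAMS` table is hypothetical: it is inferred from the two example outputs in this card, not from a published schema, so replace it with your own schema before relying on it.

```python
import json

# Hypothetical field requirements, inferred only from this card's examples.
REQUIRED_PARAMS = {
    ("mqtt", "publish"): {"topic": str, "payload": (int, float, str), "qos": int},
    ("home", "set_state"): {"device": str, "name": str, "state": str},
}


def validate_action(raw: str) -> dict:
    """Parse one action string and check it against REQUIRED_PARAMS."""
    action = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")
    domain, op = action.get("domain"), action.get("op")
    if not isinstance(domain, str) or not isinstance(op, str):
        raise ValueError("domain and op must be strings")
    spec = REQUIRED_PARAMS.get((domain, op))
    if spec is None:
        raise ValueError(f"no schema registered for {domain}/{op}")
    params = action.get("params")
    if not isinstance(params, dict):
        raise ValueError("params must be a JSON object")
    if set(params) != set(spec):
        raise ValueError("params keys do not match the schema")
    for field, expected in spec.items():
        value = params[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"wrong type for {field}")
    return action
```

Rejecting unknown `(domain, op)` pairs by default keeps an unexpected output from reaching an actuator.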

	## Intended use

	On-device home automation; NL→ROS robot command layers; MQTT fleet gateways;
	offline vehicle commands; NL-to-SQL on embedded databases; workflow triggers;
	and the structured tool-calling stage of agentic pipelines, either as a
	drop-in replacement for a larger hosted model or as a fast router ahead of
	one.
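
One way to read the router role is: try the local model first and fall back to the hosted model only when the local output is not a usable action. The sketch below illustrates that pattern under stated assumptions; `local_generate` and `hosted_generate` are hypothetical callables that take an instruction and return text, and the local output is assumed to carry a literal `[ACTION]` tag.

```python
import json
from typing import Callable, Dict


def route(instruction: str,
          local_generate: Callable[[str], str],
          hosted_generate: Callable[[str], str]) -> Dict[str, object]:
    """Use the local action if it parses as a JSON object, else fall back."""
    text = local_generate(instruction).strip()
    if text.startswith("[ACTION]"):
        try:
            action = json.loads(text[len("[ACTION]"):])
        except json.JSONDecodeError:
            action = None
        if isinstance(action, dict):
            return {"source": "local", "action": action}
    # Chat output or malformed JSON: defer to the larger model.
    return {"source": "hosted", "text": hosted_generate(instruction)}
```

The hosted model is only called on the fallback path, so well-formed actions never incur a network round trip.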

	## Limitations

	- Not a general assistant: factual knowledge and open-ended reasoning are
	limited at this scale; larger general models lead on bits-per-byte and cloze.
	- French covers actions, not extended prose.
	- Schemas outside the ten domains need fine-tuning. The `ros` schema
	(floating-point fields) is the weakest and benefits most from more data.
	- The action benchmark is synthetic, drawn from the training distribution
	family with a disjoint evaluation seed (999).

	## Citation

	```bibtex
	@misc{samg2026,
	title = {SAM-G: A 30M-Parameter Dual-Mode Language Model for Offline Structured Action Generation},
	author = {AMEFORGE Lab},
	year = {2026}
	}
	```