---
license: apache-2.0
language: [en]
tags:
- nexusbpe
- tokenizer
- structured-action-model
- agentic-ai
- robotics
- iot
- workflow-automation
- multi-domain
inference: false
---
# SAM Tokenizer: `AMFORGE/sam_tokenizer`
Official tokenizer for **SAM (Structured Action Model)** by **AMFORGE**.
Built on **NexusBPE**, AMFORGE's in-house tokenization architecture designed
for structured action generation across heterogeneous domains.
---
## What it does
A single tokenizer that handles **10 production domains** with uniform
quality: robotics, HTTP / REST APIs, MQTT / IoT messaging, databases,
workflow orchestration, e-commerce, autonomous vehicles, smart home,
calendar / email, and filesystem operations.
## Why it matters
Generic LLM tokenizers shred coordinates and identifiers into fragments:
```
0.5  → ['0', '.', '5']       (3 tokens)
-1.2 → ['-', '1', '.', '2']  (4 tokens)
8080 → ['8', '0', '80']      (3 tokens)
```
This fragments numeric values, balloons sequence length, and forces the
model to learn arithmetic from character soup. **NexusBPE keeps these
atomic by construction**, while still compressing prose efficiently.
| Input | Generic tokenizer | NexusBPE |
|---|---|---|
| `move to x=0.5 y=-1.2 z=0.8` | ~16 tokens | ~6 tokens |
| `POST /api/v1/orders` | ~8 tokens | ~3 tokens |
| `GET /users → 404` | ~6 tokens | ~3 tokens |
Lower sequence length → lower latency, lower memory, sharper attention
on the parts that matter.
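
To make the fragmentation effect concrete, here is a small sketch that contrasts splitting numeric literals character by character with keeping them whole. The regular expression and both helper names are illustrative assumptions. This is not NexusBPE's algorithm, and the piece counts it produces are not the token counts in the table above.

```python
import re

# Hypothetical pattern: a signed decimal number, a run of word characters,
# or any single non-space symbol.
_NUMBER = r"-?\d+(?:\.\d+)?"
_PIECES = re.compile(_NUMBER + r"|\w+|[^\w\s]")


def atomic_numbers(text: str) -> list[str]:
    """Split text into pieces, keeping each numeric literal whole."""
    return _PIECES.findall(text)


def char_level_numbers(text: str) -> list[str]:
    """Same split, but every numeric literal is broken into characters."""
    out: list[str] = []
    for piece in _PIECES.findall(text):
        if re.fullmatch(_NUMBER, piece):
            out.extend(piece)  # '0.5' becomes '0', '.', '5'
        else:
            out.append(piece)
    return out


text = "move to x=0.5 y=-1.2 z=0.8"
print(len(char_level_numbers(text)), "pieces with fragmented numbers")
print(len(atomic_numbers(text)), "pieces with atomic numbers")
```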
---
## Highlights
- **Vocab size**: 12000
- **Atomic guarantees**: every coordinate, status code, port, frequency,
and angle in the supported ranges encodes to a single token
- **Domain coverage**: 10 first-class domains via dedicated marker tokens
- **Schema-conditioned**: native support for JSON Schema in-context conditioning
- **Reversible**: bit-perfect roundtrip on all structured payloads
- **Deterministic**: identical input → identical token IDs across runs
- **Compact**: ~3× shorter sequences than generic LLM tokenizers on agentic tasks
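
The reversibility and determinism properties listed above are easy to spot-check on your own payloads. The sketch below works with any object exposing `encode` and `decode`. The `Tokenizer` protocol, the helper names, and `CodepointTokenizer` are hypothetical and exist only so the example runs without the model file. To check the real tokenizer, pass a loaded instance in place of the stand-in.

```python
from typing import Iterable, Protocol


class Tokenizer(Protocol):
    """Anything with encode/decode methods shaped like these."""

    def encode(self, text: str) -> list[int]: ...
    def decode(self, ids: Iterable[int]) -> str: ...


def roundtrips(tok: Tokenizer, text: str) -> bool:
    """True if decode(encode(text)) reproduces the input exactly."""
    return tok.decode(tok.encode(text)) == text


def is_deterministic(tok: Tokenizer, text: str, runs: int = 3) -> bool:
    """True if repeated encodes of the same text give identical IDs."""
    first = tok.encode(text)
    return all(tok.encode(text) == first for _ in range(runs - 1))


class CodepointTokenizer:
    """Stand-in for demonstration only: one ID per Unicode code point."""

    def encode(self, text: str) -> list[int]:
        return [ord(c) for c in text]

    def decode(self, ids: Iterable[int]) -> str:
        return "".join(chr(i) for i in ids)


samples = ['{"x": 0.5, "y": -1.2}', "POST /api/v1/orders"]
demo = CodepointTokenizer()
report = {s: (roundtrips(demo, s), is_deterministic(demo, s)) for s in samples}
for sample, (reversible, deterministic) in report.items():
    print(f"{sample!r}: reversible={reversible} deterministic={deterministic}")
```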
---
## Loading
The tokenizer ships as a binary model file. Load it via the lightweight
NexusBPE wrapper:
```python
from huggingface_hub import hf_hub_download


class NexusBPE:
    """Minimal loader for SAM / NexusBPE tokenizers."""

    def __init__(self, model_path: str):
        import sentencepiece as _spm  # implementation detail

        self._sp = _spm.SentencePieceProcessor()
        self._sp.Load(model_path)
        self.vocab_size = self._sp.GetPieceSize()
        self.pad_id = self._sp.pad_id()
        self.eos_id = self._sp.eos_id()

    def encode(self, text: str) -> list[int]:
        return self._sp.EncodeAsIds(text)

    def decode(self, ids) -> str:
        return self._sp.DecodeIds(list(ids))


# Download the model file from the Hub and load it.
path = hf_hub_download(repo_id="AMFORGE/sam_tokenizer", filename="sam_tokenizer.model")
tok = NexusBPE(path)

# Encode a robotics task and confirm it decodes back to the input.
ids = tok.encode('<ROS><TASK>move to x=0.5 y=-1.2 z=0.8</TASK>')
print(f"Tokens: {len(ids)}")
print(f"Roundtrip: {tok.decode(ids)}")
```
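
When batching encoded sequences, the `pad_id` exposed by the loader can be used to right-pad them to a common length. The helper below is a sketch and not part of the tokenizer; `pad_batch` is a hypothetical name. SentencePiece reports a pad ID of -1 when a model defines no padding piece, so the helper rejects negative IDs rather than silently inserting them.

```python
def pad_batch(
    sequences: list[list[int]], pad_id: int
) -> tuple[list[list[int]], list[list[int]]]:
    """Right-pad ID sequences to equal length and build an attention mask."""
    if pad_id < 0:
        raise ValueError("tokenizer defines no pad token (pad_id < 0)")
    width = max((len(s) for s in sequences), default=0)
    padded = [s + [pad_id] * (width - len(s)) for s in sequences]
    # 1 marks a real token, 0 marks padding.
    mask = [[1] * len(s) + [0] * (width - len(s)) for s in sequences]
    return padded, mask


# With a loaded tokenizer this would be:
#   pad_batch([tok.encode(t) for t in texts], tok.pad_id)
# Literal IDs are used here so the example runs without the model file.
padded, mask = pad_batch([[5, 6, 7], [8]], pad_id=0)
print(padded)
print(mask)
```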
---
## Domain markers
The tokenizer reserves marker tokens for each supported domain so the
model can condition its output on the active domain:
| Marker | Purpose |
|---|---|
| `<ROS>` | Robotics (ROS / ROS2) |
| `<HTTP>` | HTTP / REST APIs |
| `<MQTT>` | MQTT / IoT messaging |
| `<DB>` | Databases (SQL / NoSQL / Redis) |
| `<WORKFLOW>` | Workflow orchestration |
| `<ECOMMERCE>` | E-commerce |
| `<VEHICLE>` | Autonomous vehicles |
| `<HOME>` | Smart home |
| `<CAL>` | Calendar / email |
| `<FILE>` | Filesystem |
Plus structural markers (`<SCHEMA>`, `<TASK>`, `<JSON>`, `<ACTION>`,
`<META>`) for schema-conditioned prompting.
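
A prompt can be assembled from these markers with a small helper. The sketch below follows the one layout shown in the loading example (`<ROS><TASK>...</TASK>`). Placing the schema before the task, wrapping it in `<SCHEMA>...</SCHEMA>` with a closing tag, and the `build_prompt` name itself are assumptions, so check them against the SAM model card before relying on them.

```python
import json
from typing import Optional

# Domain markers as listed in the table above.
DOMAIN_MARKERS = {
    "<ROS>", "<HTTP>", "<MQTT>", "<DB>", "<WORKFLOW>",
    "<ECOMMERCE>", "<VEHICLE>", "<HOME>", "<CAL>", "<FILE>",
}


def build_prompt(domain: str, task: str, schema: Optional[dict] = None) -> str:
    """Assemble a marker-delimited prompt string (hypothetical layout)."""
    if domain not in DOMAIN_MARKERS:
        raise ValueError(f"unknown domain marker: {domain!r}")
    parts = [domain]
    if schema is not None:
        # Assumption: the JSON Schema precedes the task and uses a closing tag.
        compact = json.dumps(schema, separators=(",", ":"), sort_keys=True)
        parts.append(f"<SCHEMA>{compact}</SCHEMA>")
    parts.append(f"<TASK>{task}</TASK>")
    return "".join(parts)


print(build_prompt("<ROS>", "move to x=0.5 y=-1.2 z=0.8"))
print(build_prompt("<HTTP>", "create order", {"type": "object"}))
```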
---
## Used by
- [`AMFORGE/sam-v1`](https://huggingface.co/AMFORGE/sam-v1): the SAM model
## License
Apache-2.0. Free for research and commercial use. Attribution appreciated.
## Citation
```bibtex
@misc{sam_tokenizer_2026,
title = {SAM Tokenizer: NexusBPE for Multi-Domain Structured Action Generation},
author = {AMFORGE},
year = {2026},
url = {https://huggingface.co/AMFORGE/sam_tokenizer}
}
```
---
Built with **NexusBPE** by **AMFORGE**: https://huggingface.co/AMFORGE