---
license: apache-2.0
language: [en]
tags:
- nexusbpe
- tokenizer
- structured-action-model
- agentic-ai
- robotics
- iot
- workflow-automation
- multi-domain
inference: false
---

# SAM Tokenizer: `AMFORGE/sam_tokenizer`

Official tokenizer for **SAM (Structured Action Model)** by **AMFORGE**.
Built on **NexusBPE**, AMFORGE's in-house tokenization architecture designed
for structured action generation across heterogeneous domains.

---

## What it does

A single tokenizer that handles **10 production domains** with uniform
quality: robotics, HTTP / REST APIs, MQTT / IoT messaging, databases,
workflow orchestration, e-commerce, autonomous vehicles, smart home,
calendar / email, and filesystem operations.

## Why it matters

Generic LLM tokenizers shred coordinates and identifiers into fragments:

```
0.5  → ['0', '.', '5']        (3 tokens)
-1.2 → ['-', '1', '.', '2']   (4 tokens)
8080 → ['8', '0', '80']       (3 tokens)
```

This fragments numeric values, inflates sequence length, and forces the
model to reassemble numbers from character-level pieces. **NexusBPE keeps
these atomic by construction**, while still compressing prose efficiently.

| Input | Generic tokenizer | NexusBPE |
|---|---|---|
| `move to x=0.5 y=-1.2 z=0.8` | ~16 tokens | ~6 tokens |
| `POST /api/v1/orders` | ~8 tokens | ~3 tokens |
| `GET /users → 404` | ~6 tokens | ~3 tokens |

Lower sequence length → lower latency, lower memory, sharper attention
on the parts that matter.
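
To make "atomic" concrete, the sketch below is a toy pre-tokenizer that splits text so that each signed decimal number survives as a single piece. It is an illustration of the principle only: the regular expression and the `split_atomic` name are assumptions made for this example, not NexusBPE's actual rules, and the piece count it produces is not the NexusBPE token count.

```python
import re

# Hypothetical pattern (an assumption for illustration, not NexusBPE's rules):
# a signed integer or decimal, else a run of letters, else one non-space character.
_PIECE = re.compile(r"-?\d+(?:\.\d+)?|[A-Za-z_]+|\S")


def split_atomic(text: str) -> list[str]:
    """Split text into pieces, keeping every number whole."""
    return _PIECE.findall(text)


print(split_atomic("move to x=0.5 y=-1.2 z=0.8"))
# ['move', 'to', 'x', '=', '0.5', 'y', '=', '-1.2', 'z', '=', '0.8']
```

A learned vocabulary can then merge the surrounding pieces freely while the numeric pieces stay intact.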

---

## Highlights

- **Vocab size**: 12000
- **Atomic guarantees**: every coordinate, status code, port, frequency,
  and angle in the supported ranges encodes to a single token
- **Domain coverage**: 10 first-class domains via dedicated marker tokens
- **Schema-conditioned**: native support for JSON Schema in-context conditioning
- **Reversible**: bit-perfect roundtrip on all structured payloads
- **Deterministic**: identical input → identical token IDs across runs
- **Compact**: ~3× shorter sequences than generic LLM tokenizers on agentic tasks
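
The reversibility and determinism claims can be spot-checked on your own payloads. The helper below is a sketch, not part of the tokenizer: `check_tokenizer` is a hypothetical name, and it assumes only an object exposing `encode(text)` and `decode(ids)`. A byte-level stand-in (`ByteTokenizer`, also hypothetical) is used so the example runs without downloading the model.

```python
from typing import Protocol


class Tokenizer(Protocol):
    def encode(self, text: str) -> list[int]: ...
    def decode(self, ids: list[int]) -> str: ...


class ByteTokenizer:
    """Stand-in tokenizer (one ID per UTF-8 byte) so the sketch runs offline."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8")


def check_tokenizer(tok: Tokenizer, samples: list[str]) -> list[str]:
    """Return the samples that fail the roundtrip or determinism check."""
    failures = []
    for text in samples:
        ids = tok.encode(text)
        deterministic = ids == tok.encode(text)  # same input, same IDs
        reversible = tok.decode(ids) == text     # exact string roundtrip
        if not (deterministic and reversible):
            failures.append(text)
    return failures


samples = ["<ROS><TASK>move to x=0.5 y=-1.2 z=0.8</TASK>", "POST /api/v1/orders"]
print(check_tokenizer(ByteTokenizer(), samples))  # [] means every sample passed
```

Pass the `NexusBPE` instance from the Loading section in place of `ByteTokenizer()` to test the real vocabulary. This only compares repeated calls within one process; checking determinism across runs would mean saving the IDs and comparing them in a later session.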

---

## Loading

The tokenizer ships as a binary model file. Load it via the lightweight
NexusBPE wrapper:

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm  # backend used by the wrapper (implementation detail)


class NexusBPE:
    """Minimal loader for SAM / NexusBPE tokenizers."""

    def __init__(self, model_path: str):
        self._sp = spm.SentencePieceProcessor()
        self._sp.Load(model_path)
        self.vocab_size = self._sp.GetPieceSize()
        self.pad_id = self._sp.pad_id()
        self.eos_id = self._sp.eos_id()

    def encode(self, text: str) -> list[int]:
        return self._sp.EncodeAsIds(text)

    def decode(self, ids) -> str:
        return self._sp.DecodeIds(list(ids))


# Download the model file from the Hub, then load it.
path = hf_hub_download(repo_id="AMFORGE/sam_tokenizer", filename="sam_tokenizer.model")
tok = NexusBPE(path)

ids = tok.encode("<ROS><TASK>move to x=0.5 y=-1.2 z=0.8</TASK>")
print(f"Tokens: {len(ids)}")
print(f"Roundtrip: {tok.decode(ids)}")
```

---

## Domain markers

The tokenizer reserves marker tokens for each supported domain so the
model can condition its output on the active domain:

| Marker | Purpose |
|---|---|
| `<ROS>` | Robotics (ROS / ROS2) |
| `<HTTP>` | HTTP / REST APIs |
| `<MQTT>` | MQTT / IoT messaging |
| `<DB>` | Databases (SQL / NoSQL / Redis) |
| `<WORKFLOW>` | Workflow orchestration |
| `<ECOMMERCE>` | E-commerce |
| `<VEHICLE>` | Autonomous vehicles |
| `<HOME>` | Smart home |
| `<CAL>` | Calendar / email |
| `<FILE>` | Filesystem |

Plus structural markers (`<SCHEMA>`, `<TASK>`, `<JSON>`, `<ACTION>`,
`<META>`) for schema-conditioned prompting.
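
A schema-conditioned prompt might be assembled as in the sketch below. Only the `<ROS><TASK>...</TASK>` shape appears in the loading example; the `</SCHEMA>` closing tag, the order of the blocks, and the `build_prompt` helper are assumptions for illustration, so check them against the SAM model card before relying on them.

```python
import json
from typing import Optional

# Domain markers as listed in the table above.
DOMAIN_MARKERS = {
    "<ROS>", "<HTTP>", "<MQTT>", "<DB>", "<WORKFLOW>",
    "<ECOMMERCE>", "<VEHICLE>", "<HOME>", "<CAL>", "<FILE>",
}


def build_prompt(domain: str, task: str, schema: Optional[dict] = None) -> str:
    """Assemble a prompt string: domain marker, optional schema, then task."""
    if domain not in DOMAIN_MARKERS:
        raise ValueError(f"unknown domain marker: {domain}")
    parts = [domain]
    if schema is not None:
        # Assumed layout: compact JSON Schema wrapped in <SCHEMA>...</SCHEMA>.
        parts.append("<SCHEMA>" + json.dumps(schema, separators=(",", ":")) + "</SCHEMA>")
    parts.append(f"<TASK>{task}</TASK>")
    return "".join(parts)


print(build_prompt("<ROS>", "move to x=0.5 y=-1.2 z=0.8"))
# <ROS><TASK>move to x=0.5 y=-1.2 z=0.8</TASK>
```

Rejecting unknown markers early keeps a typo such as `<ROS2>` from silently tokenizing as ordinary text.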

---

## Used by

- [`AMFORGE/sam-v1`](https://huggingface.co/AMFORGE/sam-v1): the SAM model

## License

Apache-2.0. Free for research and commercial use. Attribution appreciated.

## Citation

```bibtex
@misc{sam_tokenizer_2026,
  title  = {SAM Tokenizer: NexusBPE for Multi-Domain Structured Action Generation},
  author = {AMFORGE},
  year   = {2026},
  url    = {https://huggingface.co/AMFORGE/sam_tokenizer}
}
```

---

Built with **NexusBPE** by **AMFORGE**: https://huggingface.co/AMFORGE
