--- license: apache-2.0 language: [en] tags: - nexusbpe - tokenizer - structured-action-model - agentic-ai - robotics - iot - workflow-automation - multi-domain inference: false --- # SAM Tokenizer — `AMFORGE/sam_tokenizer` Official tokenizer for **SAM (Structured Action Model)** by **AMFORGE**. Built on **NexusBPE**, AMEFORGE's in-house tokenization architecture designed for structured action generation across heterogeneous domains. --- ## What it does A single tokenizer that handles **10 production domains** with uniform quality — robotics, HTTP / REST APIs, MQTT / IoT messaging, databases, workflow orchestration, e-commerce, autonomous vehicles, smart home, calendar / email, and filesystem operations. ## Why it matters Generic LLM tokenizers shred coordinates and identifiers into fragments: ``` 0.5 → ['0', '.', '5'] (3 tokens) -1.2 → ['-', '1', '.', '2'] (4 tokens) 8080 → ['8', '0', '80'] (3 tokens) ``` This destroys numeric precision, balloons sequence length, and forces the model to learn arithmetic from character soup. **NexusBPE keeps these atomic by construction**, while still compressing prose efficiently. | | Generic tokenizer | NexusBPE | |---|---|---| | `move to x=0.5 y=-1.2 z=0.8` | ~16 tokens | ~6 tokens | | `POST /api/v1/orders` | ~8 tokens | ~3 tokens | | `GET /users → 404` | ~6 tokens | ~3 tokens | Lower sequence length → lower latency, lower memory, sharper attention on the parts that matter. --- ## Highlights - **Vocab size**: 12000 - **Atomic guarantees**: every coordinate, status code, port, frequency, and angle in the supported ranges encodes to a single token - **Domain coverage**: 10 first-class domains via dedicated marker tokens - **Schema-conditioned**: native support for JSON Schema in-context conditioning - **Reversible**: bit-perfect roundtrip on all structured payloads - **Deterministic**: identical input → identical token IDs across runs - **Compact**: ~3× shorter sequences than generic LLM tokenizers on agentic tasks --- ## Loading The tokenizer ships as a binary model file. Load it via the lightweight NexusBPE wrapper: ```python from huggingface_hub import hf_hub_download class NexusBPE: """Minimal loader for SAM / NexusBPE tokenizers.""" def __init__(self, model_path: str): import sentencepiece as _spm # implementation detail self._sp = _spm.SentencePieceProcessor(); self._sp.Load(model_path) self.vocab_size = self._sp.GetPieceSize() self.pad_id = self._sp.pad_id(); self.eos_id = self._sp.eos_id() def encode(self, text: str) -> list[int]: return self._sp.EncodeAsIds(text) def decode(self, ids) -> str: return self._sp.DecodeIds(list(ids)) path = hf_hub_download(repo_id="AMFORGE/sam_tokenizer", filename="sam_tokenizer.model") tok = NexusBPE(path) ids = tok.encode('move to x=0.5 y=-1.2 z=0.8') print(f"Tokens: {len(ids)}") print(f"Roundtrip: {tok.decode(ids)}") ``` --- ## Domain markers The tokenizer reserves marker tokens for each supported domain so the model can condition its output on the active domain: | Marker | Purpose | |---|---| | `` | Robotics (ROS / ROS2) | | `` | HTTP / REST APIs | | `` | MQTT / IoT messaging | | `` | Databases (SQL / NoSQL / Redis) | | `` | Workflow orchestration | | `` | E-commerce | | `` | Autonomous vehicles | | `` | Smart home | | `` | Calendar / email | | `` | Filesystem | Plus structural markers — ``, ``, ``, ``, `` — for schema-conditioned prompting. --- ## Used by - [`AMFORGE/sam-v1`](https://huggingface.co/AMFORGE/sam-v1) — the SAM model ## License APACHE-2.0. Free for research and commercial use. Attribution appreciated. ## Citation ```bibtex @misc{sam_tokenizer_2026, title = {SAM Tokenizer: NexusBPE for Multi-Domain Structured Action Generation}, author = {AMFORGE}, year = {2026}, url = {https://huggingface.co/AMFORGE/sam_tokenizer} } ``` --- Built with **NexusBPE** by **AMFORGE** — https://huggingface.co/AMFORGE