Upload README.md with huggingface_hub
README.md CHANGED
@@ -2,131 +2,142 @@
 license: apache-2.0
 language: [en]
 tags:
 - tokenizer
-- sentencepiece
-- bpe
 - structured-action-model
 - agentic-ai
 - robotics
 - iot
 - workflow-automation
 inference: false
-library_name: sentencepiece
 ---

 # SAM Tokenizer - `AMFORGE/sam_tokenizer`

-

-
-across **10 domains**: robotics, HTTP/REST, MQTT/IoT, databases, workflows,
-e-commerce, vehicles, smart home, calendar/email, and filesystem.

-##

-
-
-
-

-##

-
-|---|---|---|
-| `model_type` | BPE | Best balance for mixed natural-language + JSON |
-| `character_coverage` | 0.9999 | 1.0 causes infinite SPM loop with rare chars |
-| `byte_fallback` | False | Disables silent byte-level splits |
-| `split_digits` | False | Keeps `0.5` as ONE token, not `0 . 5` |
-| `normalization_rule_name` | identity | No unicode normalization side-effects |
-| `num_threads` | 2 | Higher counts freeze Kaggle/Jupyter kernels |
-| `hard_vocab_limit` | True | No silent vocab expansion |
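
The options in the table above correspond to SentencePiece trainer flags of the same names. A minimal sketch of how they might be passed, assuming a hypothetical corpus path and model prefix, and a vocabulary size of 12000 taken from the updated README and not from this table. The helper names `build_trainer_args` and `train` are illustrative and not part of the repository.

```python
def build_trainer_args(corpus_path: str, model_prefix: str, vocab_size: int = 12000) -> dict:
    """Collect the SentencePiece trainer options listed in the table.

    corpus_path, model_prefix, and the default vocab_size are illustrative
    assumptions; the remaining seven options come from the table.
    """
    return {
        "input": corpus_path,
        "model_prefix": model_prefix,
        "vocab_size": vocab_size,
        "model_type": "bpe",
        "character_coverage": 0.9999,
        "byte_fallback": False,
        "split_digits": False,
        "normalization_rule_name": "identity",
        "num_threads": 2,
        "hard_vocab_limit": True,
    }


def train(corpus_path: str, model_prefix: str) -> None:
    """Run training; requires the sentencepiece package to be installed."""
    import sentencepiece as spm  # imported lazily so the config can be inspected without it

    spm.SentencePieceTrainer.train(**build_trainer_args(corpus_path, model_prefix))
```

Keeping the options in a plain dictionary makes the configuration easy to log or diff before a training run is started.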

-

-
-
-
-- **Structural tags + JSON Schema primitives**: 64
-- **Multi-domain operations + keys**: 294
-- **Domain values (units, currencies, modes, rooms…)**: 135
-- **Training corpus**: 200,000 synthetic multi-domain lines

-

-
-
-<TASK> </TASK> # Natural language instruction
-<DOMAIN> </DOMAIN> # Generic domain wrapper
-<JSON> </JSON> # JSON output wrapper
-<ACTION> </ACTION> # Action block
-<META> </META> # Metadata
-```

-

-
-<ROS> <HTTP> <MQTT> <DB> <WORKFLOW>
-<ECOMMERCE> <VEHICLE> <HOME> <CAL> <FILE>
-```

-

-
-
-
-
-
-
-| `0` … `999` | 1 | Small integers (counts, retries, indices) |
-| HTTP status | discrete | 100,200,201,…,500,501,…,511 |
-| Network ports | discrete | 22,80,443,1883,3306,5432,6379,8080,… |
-| Frequencies (Hz) | discrete | 1,5,10,20,…,5000 |
-| Common angles | discrete | radians + degrees (0, π/2, π, 45°, 90°, 180°, …) |
-
-## Quick start

 ```python
 from huggingface_hub import hf_hub_download
-import sentencepiece as spm

-
 path = hf_hub_download(repo_id="AMFORGE/sam_tokenizer", filename="sam_tokenizer.model")
-
-sp.Load(path)

-
-text = '<ROS><TASK>move to x=0.5 y=-1.2 z=0.8</TASK> => [{"op":"move","x":0.5,"y":-1.2,"z":0.8}]'
-ids = sp.EncodeAsIds(text)
 print(f"Tokens: {len(ids)}")
-print(f"
 ```

-

-

-
-
-
-
-
-
-``

 ## Used by

 - [`AMFORGE/sam-v1`](https://huggingface.co/AMFORGE/sam-v1) - the SAM model

 ## Citation

 ```bibtex
 @misc{sam_tokenizer_2026,
-title = {SAM Tokenizer:
 author = {AMFORGE},
 year = {2026},
 url = {https://huggingface.co/AMFORGE/sam_tokenizer}
 }
 ```

-
-
 license: apache-2.0
 language: [en]
 tags:
+- nexusbpe
 - tokenizer
 - structured-action-model
 - agentic-ai
 - robotics
 - iot
 - workflow-automation
+- multi-domain
 inference: false
 ---

 # SAM Tokenizer - `AMFORGE/sam_tokenizer`

+Official tokenizer for **SAM (Structured Action Model)** by **AMFORGE**.
+Built on **NexusBPE**, AMFORGE's in-house tokenization architecture designed
+for structured action generation across heterogeneous domains.

+---

+## What it does

+A single tokenizer that handles **10 production domains** with uniform
+quality: robotics, HTTP / REST APIs, MQTT / IoT messaging, databases,
+workflow orchestration, e-commerce, autonomous vehicles, smart home,
+calendar / email, and filesystem operations.

+## Why it matters

+Generic LLM tokenizers shred coordinates and identifiers into fragments:

+```
+0.5 → ['0', '.', '5'] (3 tokens)
+-1.2 → ['-', '1', '.', '2'] (4 tokens)
+8080 → ['8', '0', '80'] (3 tokens)
+```

+This destroys numeric precision, balloons sequence length, and forces the
+model to learn arithmetic from character soup. **NexusBPE keeps these
+atomic by construction**, while still compressing prose efficiently.

+| | Generic tokenizer | NexusBPE |
+|---|---|---|
+| `move to x=0.5 y=-1.2 z=0.8` | ~16 tokens | ~6 tokens |
+| `POST /api/v1/orders` | ~8 tokens | ~3 tokens |
+| `GET /users → 404` | ~6 tokens | ~3 tokens |

+Lower sequence length → lower latency, lower memory, sharper attention
+on the parts that matter.
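
To make the fragmentation concrete, here is a small sketch that contrasts a character-splitting pre-tokenizer with one that keeps numeric literals whole. Both regular expressions are illustrative assumptions for this example, not NexusBPE's actual rules, and the counts are pre-tokenization pieces, not real token counts from any model.

```python
import re

# Splits every digit and punctuation mark into its own piece (worst case for numbers).
FRAGMENTING = re.compile(r"[A-Za-z_]+|\d|\S")
# Keeps signed integers and decimals such as 8080 or -1.2 as a single piece.
ATOMIC = re.compile(r"-?\d+(?:\.\d+)?|[A-Za-z_]+|\S")


def pieces(pattern: re.Pattern, text: str) -> list[str]:
    """Return the non-whitespace pieces matched by the pattern, in order."""
    return pattern.findall(text)


text = "move to x=0.5 y=-1.2 z=0.8"
print(len(pieces(FRAGMENTING, text)), pieces(FRAGMENTING, text))
print(len(pieces(ATOMIC, text)), pieces(ATOMIC, text))
```

On this input the fragmenting pattern yields 18 pieces and the atomic one 11. The gap comes entirely from the three coordinates.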

+---

+## Highlights

+- **Vocab size**: 12000
+- **Atomic guarantees**: every coordinate, status code, port, frequency,
+  and angle in the supported ranges encodes to a single token
+- **Domain coverage**: 10 first-class domains via dedicated marker tokens
+- **Schema-conditioned**: native support for JSON Schema in-context conditioning
+- **Reversible**: bit-perfect roundtrip on all structured payloads
+- **Deterministic**: identical input → identical token IDs across runs
+- **Compact**: ~3× shorter sequences than generic LLM tokenizers on agentic tasks
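
The reversibility and determinism claims are easy to spot-check. A minimal sketch, assuming only that the tokenizer object exposes `encode` and `decode` like the loader class in this README. `check_tokenizer` and the stand-in `ByteTokenizer` are hypothetical names, used here so the snippet runs without the model file.

```python
class ByteTokenizer:
    """Stand-in tokenizer (UTF-8 bytes as IDs) used only to exercise the check."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids) -> str:
        return bytes(ids).decode("utf-8")


def check_tokenizer(tok, samples: list[str]) -> list[str]:
    """Return the samples that fail the roundtrip or determinism check."""
    failures = []
    for text in samples:
        first = tok.encode(text)
        second = tok.encode(text)  # encoding twice must give identical IDs
        if first != second or tok.decode(first) != text:
            failures.append(text)
    return failures


samples = [
    '<ROS><TASK>move to x=0.5 y=-1.2 z=0.8</TASK>',
    '[{"op":"move","x":0.5,"y":-1.2,"z":0.8}]',
]
# Swap ByteTokenizer() for the real tokenizer to test the published model.
print(check_tokenizer(ByteTokenizer(), samples))
```

An empty list means every sample survived the roundtrip unchanged. Whether the published model passes on a given payload is something to verify, not assume.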

+---
+
+## Loading
+
+The tokenizer ships as a binary model file. Load it via the lightweight
+NexusBPE wrapper:

 ```python
 from huggingface_hub import hf_hub_download

+class NexusBPE:
+    """Minimal loader for SAM / NexusBPE tokenizers."""
+    def __init__(self, model_path: str):
+        import sentencepiece as _spm  # implementation detail
+        self._sp = _spm.SentencePieceProcessor(); self._sp.Load(model_path)
+        self.vocab_size = self._sp.GetPieceSize()
+        self.pad_id = self._sp.pad_id(); self.eos_id = self._sp.eos_id()
+    def encode(self, text: str) -> list[int]:
+        return self._sp.EncodeAsIds(text)
+    def decode(self, ids) -> str:
+        return self._sp.DecodeIds(list(ids))
+
 path = hf_hub_download(repo_id="AMFORGE/sam_tokenizer", filename="sam_tokenizer.model")
+tok = NexusBPE(path)

+ids = tok.encode('<ROS><TASK>move to x=0.5 y=-1.2 z=0.8</TASK>')
 print(f"Tokens: {len(ids)}")
+print(f"Roundtrip: {tok.decode(ids)}")
 ```

+---

+## Domain markers

+The tokenizer reserves marker tokens for each supported domain so the
+model can condition its output on the active domain:
+
+| Marker | Purpose |
+|---|---|
+| `<ROS>` | Robotics (ROS / ROS2) |
+| `<HTTP>` | HTTP / REST APIs |
+| `<MQTT>` | MQTT / IoT messaging |
+| `<DB>` | Databases (SQL / NoSQL / Redis) |
+| `<WORKFLOW>` | Workflow orchestration |
+| `<ECOMMERCE>` | E-commerce |
+| `<VEHICLE>` | Autonomous vehicles |
+| `<HOME>` | Smart home |
+| `<CAL>` | Calendar / email |
+| `<FILE>` | Filesystem |
+
+Plus structural markers (`<SCHEMA>`, `<TASK>`, `<JSON>`, `<ACTION>`,
+`<META>`) for schema-conditioned prompting.
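
A prompt builder using these markers could look like the following sketch. The `<DOMAIN-MARKER><TASK>...</TASK>` layout follows the loading example in this README. The helper name `build_prompt` and its validation behaviour are assumptions, and since the README does not specify how `<SCHEMA>` wraps its content, that marker is left out.

```python
# The ten domain markers from the table, without angle brackets.
DOMAIN_MARKERS = {
    "ROS", "HTTP", "MQTT", "DB", "WORKFLOW",
    "ECOMMERCE", "VEHICLE", "HOME", "CAL", "FILE",
}


def build_prompt(domain: str, task: str) -> str:
    """Wrap a natural-language task in a domain marker and <TASK> tags."""
    key = domain.upper()
    if key not in DOMAIN_MARKERS:
        raise ValueError(f"unknown domain marker: {domain!r}")
    return f"<{key}><TASK>{task}</TASK>"


print(build_prompt("ros", "move to x=0.5 y=-1.2 z=0.8"))
```

Rejecting unknown domains up front prevents a misspelled marker from being silently tokenized as ordinary text.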
+
+---

 ## Used by

 - [`AMFORGE/sam-v1`](https://huggingface.co/AMFORGE/sam-v1) - the SAM model

+## License
+
+APACHE-2.0. Free for research and commercial use. Attribution appreciated.
+
 ## Citation

 ```bibtex
 @misc{sam_tokenizer_2026,
+title = {SAM Tokenizer: NexusBPE for Multi-Domain Structured Action Generation},
 author = {AMFORGE},
 year = {2026},
 url = {https://huggingface.co/AMFORGE/sam_tokenizer}
 }
 ```

+---
+
+Built with **NexusBPE** by **AMFORGE** - https://huggingface.co/AMFORGE