Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,132 @@
---
license: apache-2.0
language: [en]
tags:
- tokenizer
- sentencepiece
- bpe
- structured-action-model
- agentic-ai
- robotics
- iot
- workflow-automation
inference: false
library_name: sentencepiece
---

# SAM Tokenizer: `AMFORGE/sam_tokenizer`

The official tokenizer for **SAM (Structured Action Model)** by **AMFORGE**.

A SentencePiece BPE tokenizer purpose-built for structured action generation
across **10 domains**: robotics, HTTP/REST, MQTT/IoT, databases, workflows,
e-commerce, vehicles, smart home, calendar/email, and filesystem.

## Why a custom tokenizer

Generic LLM tokenizers split numeric and JSON values into fragments
(`0.5` → `0`, `.`, `5`). The model must then reassemble each number piece by
piece, which hurts numeric precision and inflates sequence length. SAM's
tokenizer enforces **atomic numerics** via reserved symbols, so every
coordinate, status code, port, and angle in the reserved ranges stays a
single token.

## Design principles (the "SparsForos philosophy")

| Setting | Value | Reason |
|---|---|---|
| `model_type` | BPE | Best balance for mixed natural-language + JSON |
| `character_coverage` | 0.9999 | 1.0 causes infinite SPM loop with rare chars |
| `byte_fallback` | False | Disables silent byte-level splits |
| `split_digits` | False | Keeps `0.5` as ONE token, not `0 . 5` |
| `normalization_rule_name` | identity | No unicode normalization side-effects |
| `num_threads` | 2 | Higher counts freeze Kaggle/Jupyter kernels |
| `hard_vocab_limit` | True | No silent vocab expansion |

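As a rough illustration, the settings in the table above could be passed to the SentencePiece trainer as sketched below. This is not the actual AMFORGE training script: the function names, the corpus path, the model prefix, and the use of `user_defined_symbols` to register reserved symbols are assumptions made for the example.

```python
from typing import Any, Dict, List


def build_trainer_kwargs(corpus_path: str, model_prefix: str,
                         reserved_symbols: List[str]) -> Dict[str, Any]:
    """Collect the settings from the table as SentencePiece trainer options."""
    return {
        "input": corpus_path,            # hypothetical corpus file, one example per line
        "model_prefix": model_prefix,    # hypothetical output prefix
        "vocab_size": 12000,             # vocab size reported under Statistics
        "model_type": "bpe",
        "character_coverage": 0.9999,
        "byte_fallback": False,
        "split_digits": False,
        "normalization_rule_name": "identity",
        "num_threads": 2,
        "hard_vocab_limit": True,
        # Assumption: reserved symbols are registered as user-defined symbols,
        # which SentencePiece keeps as single pieces.
        "user_defined_symbols": reserved_symbols,
    }


def train_tokenizer(corpus_path: str, model_prefix: str,
                    reserved_symbols: List[str]) -> None:
    """Run the trainer; writes <model_prefix>.model and <model_prefix>.vocab."""
    # Imported lazily so the configuration can be inspected without the library.
    import sentencepiece as spm

    kwargs = build_trainer_kwargs(corpus_path, model_prefix, reserved_symbols)
    spm.SentencePieceTrainer.train(**kwargs)
```

Keeping the options in a plain dictionary makes it easy to log or diff the configuration before starting a training run.
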
## Statistics

- **Vocab size**: 12,000
- **Reserved domain symbols**: 6806
- **Atomic numeric tokens**: 6320
- **Structural tags + JSON Schema primitives**: 64
- **Multi-domain operations + keys**: 294
- **Domain values (units, currencies, modes, rooms…)**: 135
- **Training corpus**: 200,000 synthetic multi-domain lines

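One way to spot-check figures like these is to walk the vocabulary of a loaded model and count pieces that look like plain numbers. The sketch below assumes a `SentencePieceProcessor`-like object exposing `GetPieceSize()` and `IdToPiece()`; the helper name and the regular expression are illustrative. It will not necessarily reproduce the 6320 figure, since symbols such as angles written with `π` or `°` do not match the pattern, and ordinary BPE merges may also produce numeric pieces.

```python
import re

# Matches plain decimal numbers such as "200", "0.5" or "-1.20".
NUMERIC = re.compile(r"-?\d+(\.\d+)?")
WORD_BOUNDARY = "\u2581"  # SentencePiece's whitespace marker


def count_numeric_pieces(sp) -> int:
    """Count vocabulary pieces that are plain decimal numbers."""
    count = 0
    for i in range(sp.GetPieceSize()):
        # Drop a leading whitespace marker before matching.
        piece = sp.IdToPiece(i).lstrip(WORD_BOUNDARY)
        if NUMERIC.fullmatch(piece):
            count += 1
    return count
```

With the processor from the Quick start section, `sp.GetPieceSize()` gives the vocab size and `count_numeric_pieces(sp)` gives the count of number-like pieces.
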
## Reserved structural tags

```
<SCHEMA> </SCHEMA>   # JSON Schema conditioning
<TASK> </TASK>       # Natural language instruction
<DOMAIN> </DOMAIN>   # Generic domain wrapper
<JSON> </JSON>       # JSON output wrapper
<ACTION> </ACTION>   # Action block
<META> </META>       # Metadata
```

## Reserved domain tags

```
<ROS> <HTTP> <MQTT> <DB> <WORKFLOW>
<ECOMMERCE> <VEHICLE> <HOME> <CAL> <FILE>
```

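To show how the tags combine, here is a minimal sketch that builds a line in the same layout as the sample string in the Quick start section: a domain tag, a `<TASK>` block, a `=>` separator, and compact JSON. The helper name `format_example` is hypothetical, and the README does not say whether SAM's training data always follows this layout or how the other structural tags are ordered.

```python
import json
from typing import Any, Dict, List

# Domain tags listed above, without the angle brackets.
DOMAIN_TAGS = {"ROS", "HTTP", "MQTT", "DB", "WORKFLOW",
               "ECOMMERCE", "VEHICLE", "HOME", "CAL", "FILE"}


def format_example(domain: str, task: str, actions: List[Dict[str, Any]]) -> str:
    """Build '<DOMAIN><TASK>...</TASK> => [json]' with compact JSON output."""
    if domain not in DOMAIN_TAGS:
        raise ValueError(f"unknown domain tag: {domain}")
    # Compact separators avoid spaces after ',' and ':' in the JSON payload.
    payload = json.dumps(actions, separators=(",", ":"))
    return f"<{domain}><TASK>{task}</TASK> => {payload}"


line = format_example(
    "ROS",
    "move to x=0.5 y=-1.2 z=0.8",
    [{"op": "move", "x": 0.5, "y": -1.2, "z": 0.8}],
)
print(line)
```
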
## Reserved numeric ranges (all atomic)

| Range | Step | Use |
|---|---|---|
| `-10.00` … `10.00` | 0.01 | Spatial coords (robotics, vehicle, IoT) |
| `0.00` … `10.00` | 0.05 | Velocities (m/s) |
| `0.00` … `1.00` | 0.01 | Force, ratios, probabilities |
| `0.0` … `60.0` | 0.5 | Wait/timeout durations (s) |
| `0` … `999` | 1 | Small integers (counts, retries, indices) |
| HTTP status | discrete | 100, 200, 201, …, 500, 501, …, 511 |
| Network ports | discrete | 22, 80, 443, 1883, 3306, 5432, 6379, 8080, … |
| Frequencies (Hz) | discrete | 1, 5, 10, 20, …, 5000 |
| Common angles | discrete | radians + degrees (0, π/2, π, 45°, 90°, 180°, …) |

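The stepped ranges in the table can be enumerated as strings, for example to build a list of reserved symbols. The sketch below is an illustration only: it assumes each value is written with exactly the number of decimals shown in the table, which the README does not confirm, and it makes no attempt to reproduce the published token counts or the discrete rows. `Decimal` is used so that repeated stepping does not accumulate floating-point error.

```python
from decimal import Decimal
from typing import List


def decimal_range(start: str, stop: str, step: str) -> List[str]:
    """Enumerate start..stop inclusive, keeping the decimals of the inputs."""
    value, last, inc = Decimal(start), Decimal(stop), Decimal(step)
    out = []
    while value <= last:
        out.append(str(value))
        value += inc
    return out


def stepped_numeric_symbols() -> List[str]:
    """Union of the five stepped ranges, de-duplicated in first-seen order."""
    symbols = (
        decimal_range("-10.00", "10.00", "0.01")    # spatial coordinates
        + decimal_range("0.00", "10.00", "0.05")    # velocities
        + decimal_range("0.00", "1.00", "0.01")     # force, ratios, probabilities
        + decimal_range("0.0", "60.0", "0.5")       # wait/timeout durations
        + [str(i) for i in range(0, 1000)]          # small integers
    )
    return list(dict.fromkeys(symbols))


coords = decimal_range("-10.00", "10.00", "0.01")
print(len(coords), coords[0], coords[-1])
```

Under this formatting assumption the velocity and ratio ranges are subsets of the coordinate range, so de-duplication matters when the lists are merged.
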
## Quick start

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

# Public repo: no access token required
path = hf_hub_download(repo_id="AMFORGE/sam_tokenizer", filename="sam_tokenizer.model")
sp = spm.SentencePieceProcessor()
sp.Load(path)

# Atomic numeric encoding; try this with any generic tokenizer for comparison
text = '<ROS><TASK>move to x=0.5 y=-1.2 z=0.8</TASK> => [{"op":"move","x":0.5,"y":-1.2,"z":0.8}]'
ids = sp.EncodeAsIds(text)
print(f"Tokens: {len(ids)}")
print(f"Decoded: {sp.DecodeIds(ids)}")
```

## Atomicity verification

Verify that floats stay as single tokens:

```python
# Assumes `sp` is the SentencePieceProcessor loaded in the Quick start.
for v in ["0.5", "-1.2", "1.57", "0.01", "3.14"]:
    pieces = sp.EncodeAsPieces(v)
    # Ignore the standalone whitespace marker that SentencePiece may emit.
    real = [p for p in pieces if p not in ("\u2581", " ")]
    assert len(real) == 1, f"{v} got split: {real}"
print("[OK] All floats atomic.")
```

## Used by

- [`AMFORGE/sam-v1`](https://huggingface.co/AMFORGE/sam-v1): the SAM model

## Citation

```bibtex
@misc{sam_tokenizer_2026,
  title  = {SAM Tokenizer: Atomic Multi-Domain BPE for Structured Action Generation},
  author = {AMFORGE},
  year   = {2026},
  url    = {https://huggingface.co/AMFORGE/sam_tokenizer}
}
```

Built with the SparsForos philosophy by **AMFORGE**:
https://huggingface.co/AMFORGE