ameforge committed on
Commit 85f3fc9 · verified · 1 Parent(s): 7f6894b

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +132 -0
README.md ADDED
@@ -0,0 +1,132 @@
---
license: apache-2.0
language: [en]
tags:
- tokenizer
- sentencepiece
- bpe
- structured-action-model
- agentic-ai
- robotics
- iot
- workflow-automation
inference: false
library_name: sentencepiece
---

# SAM Tokenizer: `AMFORGE/sam_tokenizer`

The official tokenizer for **SAM (Structured Action Model)** by **AMFORGE**.

A SentencePiece BPE tokenizer purpose-built for structured action generation
across **10 domains**: robotics, HTTP/REST, MQTT/IoT, databases, workflows,
e-commerce, vehicles, smart home, calendar/email, and filesystem.

## Why a custom tokenizer

Generic LLM tokenizers often split numeric literals in robotics and JSON
payloads into fragments (`0.5` → `0`, `.`, `5`). A value spread over several
tokens is harder for a model to reproduce exactly, and the extra tokens
inflate sequence length. SAM's tokenizer enforces **atomic numerics** via
reserved symbols, so that coordinates, status codes, ports, and angles in the
reserved ranges each encode as a single token.

## Design principles (the "SparsForos philosophy")

| Setting | Value | Reason |
|---|---|---|
| `model_type` | BPE | Best balance for mixed natural-language + JSON |
| `character_coverage` | 0.9999 | 1.0 can send SentencePiece training into an endless loop on rare chars |
| `byte_fallback` | False | Unknown characters map to `<unk>` instead of being silently split into bytes |
| `split_digits` | False | Keeps `0.5` as ONE token, not `0 . 5` |
| `normalization_rule_name` | identity | No Unicode normalization side-effects |
| `num_threads` | 2 | Higher counts can freeze Kaggle/Jupyter kernels |
| `hard_vocab_limit` | True | No silent vocab expansion |

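As a rough illustration of how the settings in the table could be handed to SentencePiece, the sketch below collects them as keyword arguments for `SentencePieceTrainer.train`. The corpus path, model prefix, and the short symbol list are hypothetical placeholders, and passing the reserved symbols through `user_defined_symbols` is an assumption about how the atomic tokens are reserved; the actual training script is not part of this README.

```python
from typing import Any, Dict, List


def build_trainer_kwargs(corpus_path: str, model_prefix: str,
                         reserved_symbols: List[str],
                         vocab_size: int = 12000) -> Dict[str, Any]:
    """Collect the settings from the table as SentencePiece trainer options."""
    return {
        "input": corpus_path,            # hypothetical path, one sentence per line
        "model_prefix": model_prefix,    # hypothetical output prefix
        "vocab_size": vocab_size,
        "model_type": "bpe",
        "character_coverage": 0.9999,
        "byte_fallback": False,
        "split_digits": False,
        "normalization_rule_name": "identity",
        "num_threads": 2,
        "hard_vocab_limit": True,
        # Assumption: reserved tags and numerics are passed as user-defined
        # symbols, which SentencePiece always keeps as single pieces.
        "user_defined_symbols": reserved_symbols,
    }


def train(kwargs: Dict[str, Any]) -> None:
    """Run training; sentencepiece is imported lazily so the config can be inspected without it."""
    import sentencepiece as spm
    spm.SentencePieceTrainer.train(**kwargs)


kwargs = build_trainer_kwargs(
    "corpus.txt", "sam_tokenizer_demo",
    ["<TASK>", "</TASK>", "<ROS>", "0.5", "-1.2"],  # tiny illustrative subset
)
print(kwargs["model_type"], kwargs["split_digits"], kwargs["num_threads"])
# train(kwargs)  # uncomment once corpus.txt exists and is large enough for the vocab size
```

Keeping the configuration in a plain dictionary makes it easy to diff against the table before starting a long training run.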
## Statistics

- **Vocab size**: 12,000
- **Reserved domain symbols**: 6,806
- **Atomic numeric tokens**: 6,320
- **Structural tags + JSON Schema primitives**: 64
- **Multi-domain operations + keys**: 294
- **Domain values (units, currencies, modes, rooms…)**: 135
- **Training corpus**: 200,000 synthetic multi-domain lines

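The figures above imply a vocabulary budget that can be checked with a few lines of arithmetic. The sketch below only restates the listed numbers; the variable names are illustrative. Note that the four listed categories add up to 6,813, seven more than the stated reserved total of 6,806. The README does not explain the gap; one possible reading (an assumption, not something the source states) is that symbols appearing in more than one category are counted once in the total.

```python
VOCAB_SIZE = 12000
RESERVED_TOTAL = 6806

# Per-category counts exactly as listed in the statistics.
CATEGORY_COUNTS = {
    "atomic_numeric": 6320,
    "structural_and_schema": 64,
    "operations_and_keys": 294,
    "domain_values": 135,
}

category_sum = sum(CATEGORY_COUNTS.values())
remaining_slots = VOCAB_SIZE - RESERVED_TOTAL   # slots not taken by reserved symbols
reserved_share = RESERVED_TOTAL / VOCAB_SIZE

print(f"Sum of listed categories: {category_sum}")
print(f"Difference from stated total: {category_sum - RESERVED_TOTAL}")
print(f"Slots left for everything else: {remaining_slots}")
print(f"Reserved share of vocabulary: {reserved_share:.1%}")
```

The remaining 5,194 slots are not all learned BPE merges: in a SentencePiece model they also hold control symbols such as `<unk>` and the single characters admitted by `character_coverage`.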
## Reserved structural tags

```
<SCHEMA> </SCHEMA>   # JSON Schema conditioning
<TASK> </TASK>       # Natural language instruction
<DOMAIN> </DOMAIN>   # Generic domain wrapper
<JSON> </JSON>       # JSON output wrapper
<ACTION> </ACTION>   # Action block
<META> </META>       # Metadata
```

## Reserved domain tags

```
<ROS> <HTTP> <MQTT> <DB> <WORKFLOW>
<ECOMMERCE> <VEHICLE> <HOME> <CAL> <FILE>
```

## Reserved numeric ranges (all atomic)

| Range | Step | Use |
|---|---|---|
| `-10.00` … `10.00` | 0.01 | Spatial coordinates (robotics, vehicle, IoT) |
| `0.00` … `10.00` | 0.05 | Velocities (m/s) |
| `0.00` … `1.00` | 0.01 | Force, ratios, probabilities |
| `0.0` … `60.0` | 0.5 | Wait/timeout durations (s) |
| `0` … `999` | 1 | Small integers (counts, retries, indices) |
| HTTP status | discrete | 100, 200, 201, …, 500, 501, …, 511 |
| Network ports | discrete | 22, 80, 443, 1883, 3306, 5432, 6379, 8080, … |
| Frequencies (Hz) | discrete | 1, 5, 10, 20, …, 5000 |
| Common angles | discrete | radians + degrees (0, π/2, π, 45°, 90°, 180°, …) |

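To make the stepped ranges in the table concrete, the sketch below enumerates them as strings. It works in integer hundredths to avoid floating-point drift when stepping by 0.01. The string formatting (two decimals for the first three ranges, one decimal for durations) is an assumption read off the table's endpoints; the real symbol list evidently includes other spellings too (the quick start uses `0.5` and `-1.2`), so the count printed here is not expected to match the 6,320 atomic numeric tokens reported in the statistics.

```python
from typing import List


def decimal_range(start: int, stop: int, step: int, places: int) -> List[str]:
    """Enumerate start..stop (inclusive), given in hundredths, as fixed-point strings."""
    return [f"{i / 100:.{places}f}" for i in range(start, stop + 1, step)]


coords = decimal_range(-1000, 1000, 1, 2)      # -10.00 … 10.00, step 0.01
velocities = decimal_range(0, 1000, 5, 2)      # 0.00 … 10.00, step 0.05
ratios = decimal_range(0, 100, 1, 2)           # 0.00 … 1.00, step 0.01
durations = decimal_range(0, 6000, 50, 1)      # 0.0 … 60.0, step 0.5
small_ints = [str(i) for i in range(1000)]     # 0 … 999

# Velocities and ratios are subsets of the coordinate range under this
# formatting, so the union is smaller than the sum of the parts.
unique_symbols = set(coords) | set(velocities) | set(ratios) | set(durations) | set(small_ints)

print(len(coords), len(velocities), len(ratios), len(durations), len(small_ints))
print(f"Unique symbols under the assumed formatting: {len(unique_symbols)}")
```

Deduplicating before handing the list to the trainer matters because overlapping ranges would otherwise repeat the same symbol.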
## Quick start

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

# Public repo: no access token required
path = hf_hub_download(repo_id="AMFORGE/sam_tokenizer", filename="sam_tokenizer.model")
sp = spm.SentencePieceProcessor()
sp.Load(path)

# Atomic numeric encoding; try the same string with a generic tokenizer for comparison
text = '<ROS><TASK>move to x=0.5 y=-1.2 z=0.8</TASK> => [{"op":"move","x":0.5,"y":-1.2,"z":0.8}]'
ids = sp.EncodeAsIds(text)
print(f"Tokens: {len(ids)}")
print(f"Decoded: {sp.DecodeIds(ids)}")
```

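The example string in the quick start follows a recognizable layout: a domain tag, a `<TASK>` block, the separator ` => `, and a compact JSON array of actions. A small helper for producing strings in that layout might look like the sketch below. The function name `format_example` is hypothetical, and the layout is inferred from that single example; whether SAM expects exactly this format for every domain is an assumption.

```python
import json
from typing import Any, Dict, List

# Domain tags as listed under "Reserved domain tags".
DOMAIN_TAGS = {"ROS", "HTTP", "MQTT", "DB", "WORKFLOW",
               "ECOMMERCE", "VEHICLE", "HOME", "CAL", "FILE"}


def format_example(domain: str, task: str, actions: List[Dict[str, Any]]) -> str:
    """Build '<DOMAIN><TASK>…</TASK> => [json]' in the style of the quick-start string."""
    if domain not in DOMAIN_TAGS:
        raise ValueError(f"Unknown domain tag: {domain}")
    # Compact separators keep the JSON free of spaces, matching the example.
    payload = json.dumps(actions, separators=(",", ":"))
    return f"<{domain}><TASK>{task}</TASK> => {payload}"


text = format_example(
    "ROS",
    "move to x=0.5 y=-1.2 z=0.8",
    [{"op": "move", "x": 0.5, "y": -1.2, "z": 0.8}],
)
print(text)
```

The resulting string can be passed straight to `sp.EncodeAsIds` from the quick start.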
## Atomicity verification

Check that reserved floats encode as single pieces (this reuses `sp` from the quick start):

```python
for v in ["0.5", "-1.2", "1.57", "0.01", "3.14"]:
    pieces = sp.EncodeAsPieces(v)
    # Ignore the SentencePiece word-boundary marker (U+2581) and bare spaces
    real = [p for p in pieces if p not in ("\u2581", " ")]
    assert len(real) == 1, f"{v} got split: {real}"
print("[OK] All floats atomic.")
```

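The same check can be wrapped in a function that reports every split value instead of stopping at the first failure. The sketch below takes any callable that maps a string to a list of pieces, so it can be exercised without downloading the model; `find_split_values` and the two toy encoders are hypothetical stand-ins, not part of the tokenizer.

```python
from typing import Callable, Dict, Iterable, List

WORD_BOUNDARY = "\u2581"  # SentencePiece word-boundary marker


def find_split_values(encode_as_pieces: Callable[[str], List[str]],
                      values: Iterable[str]) -> Dict[str, List[str]]:
    """Return {value: pieces} for every value that does not encode to exactly one real piece."""
    split = {}
    for value in values:
        real = [p for p in encode_as_pieces(value) if p not in (WORD_BOUNDARY, " ")]
        if len(real) != 1:
            split[value] = real
    return split


def char_level_encoder(text: str) -> List[str]:
    """Toy stand-in for a tokenizer that fragments numbers into characters."""
    return [WORD_BOUNDARY] + list(text)


def toy_atomic_encoder(text: str) -> List[str]:
    """Toy stand-in for a tokenizer that keeps each value whole."""
    return [WORD_BOUNDARY, text]


samples = ["0.5", "-1.2"]
print(find_split_values(char_level_encoder, samples))
print(find_split_values(toy_atomic_encoder, samples))
```

With the real model loaded, the call would be `find_split_values(sp.EncodeAsPieces, samples)`, and an empty dictionary means every sampled value stayed atomic.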
## Used by

- [`AMFORGE/sam-v1`](https://huggingface.co/AMFORGE/sam-v1), the SAM model

## Citation

```bibtex
@misc{sam_tokenizer_2026,
  title  = {SAM Tokenizer: Atomic Multi-Domain BPE for Structured Action Generation},
  author = {AMFORGE},
  year   = {2026},
  url    = {https://huggingface.co/AMFORGE/sam_tokenizer}
}
```

Built with the SparsForos philosophy by **AMFORGE**:
https://huggingface.co/AMFORGE