anthonym21 committed · Commit f876778 · verified · 1 Parent(s): d54e053

Update tokenizer: 73K training objects, 125 keys, DOI 10.5281/zenodo.18879110

Files changed (2)
  1. README.md +15 -11
  2. json_tokenizer_vocab.json +0 -0
README.md CHANGED
@@ -16,7 +16,7 @@ pipeline_tag: text-generation
 
 A structure-aware tokenizer that assigns dedicated single tokens to JSON grammar elements, learns a compact key vocabulary from training data, and applies byte-pair encoding (BPE) only to value content.
 
-**Paper:** [Structure-Aware Tokenization for JSON: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary](https://doi.org/10.5281/zenodo.XXXXXXX)
+**Paper:** [Structure-Aware Tokenization for JSON: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary](https://doi.org/10.5281/zenodo.18879110)
 
 **Code:** [github.com/anthony-maio/json-tokenizer](https://github.com/anthony-maio/json-tokenizer)
 
@@ -25,7 +25,7 @@ A structure-aware tokenizer that assigns dedicated single tokens to JSON grammar
 | Metric | Value |
 |--------|-------|
 | Token savings vs cl100k_base | **5-15%** on schema-repetitive JSON |
-| Vocabulary size | **4,190 tokens** (vs 100,256 for cl100k_base) |
+| Vocabulary size | **4,251 tokens** (vs 100,256 for cl100k_base) |
 | Vocab reduction | **~90x smaller** |
 | Roundtrip fidelity | **100% lossless** across 4,200+ test objects |
 | Crossover point | Beats cl100k_base at just **558 tokens** |
@@ -34,19 +34,20 @@ A structure-aware tokenizer that assigns dedicated single tokens to JSON grammar
 
 Three-tier vocabulary:
 1. **Structural tokens** (IDs 0-15): `{`, `}`, `[`, `]`, `:`, `,`, `true`, `false`, `null`, type markers
-2. **Key vocabulary** (IDs 32-N): Learned single-token keys from training data (65 keys)
+2. **Key vocabulary** (IDs 32-N): Learned single-token keys from training data (125 keys)
 3. **BPE subwords** (IDs N+1 to N+B): Byte-pair encoding trained on JSON value strings (4,096 tokens)
 
 ## This Model
 
-This pretrained tokenizer was trained on four structured JSON datasets:
+This pretrained tokenizer was trained on structured JSON datasets:
 - GeoJSON city features (geographic data)
 - Observability telemetry logs (monitoring data)
 - Kubernetes manifests (infrastructure config)
 - Structured API outputs
+- Synthetic training corpus (700 objects)
 
-**Total training objects:** 10,530
-**Vocabulary:** 4,190 tokens (16 structural + 16 reserved + 65 keys + 4,096 BPE)
+**Total training objects:** 72,991
+**Vocabulary:** 4,251 tokens (16 structural + 16 reserved + 125 keys + 4,096 BPE)
 
 ## Usage
 
@@ -114,11 +115,14 @@ hf_tok.save_pretrained("./my_hf_tokenizer")
 ## Citation
 
 ```bibtex
-@article{maio2026json,
-  title={Structure-Aware Tokenization for JSON: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary},
-  author={Maio, Anthony},
-  year={2026},
-  url={https://doi.org/10.5281/zenodo.XXXXXXX}
+@software{maio2026jsontokenizer,
+  author  = {Maio, Anthony},
+  title   = {Structure-Aware Tokenization for {JSON}: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary},
+  year    = {2026},
+  url     = {https://github.com/anthony-maio/json-tokenizer},
+  doi     = {10.5281/zenodo.18879110},
+  version = {0.2.0},
+  license = {MIT}
 }
 ```
 
 
json_tokenizer_vocab.json CHANGED
The diff for this file is too large to render. See raw diff
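The three-tier ID layout described in the updated model card (16 structural + 16 reserved + 125 learned keys + 4,096 BPE subwords) can be sketched as below. This is a minimal illustration that assumes contiguous ID ranges in the order the card lists them; the `tier_of` helper is hypothetical and not part of the json-tokenizer API.

```python
# Minimal sketch of the three-tier vocabulary layout from the model card.
# Assumption: ID ranges are contiguous and ordered as listed in the card;
# `tier_of` is illustrative only, not an actual json-tokenizer function.

NUM_STRUCTURAL = 16   # {, }, [, ], :, ,, true, false, null, type markers (IDs 0-15)
NUM_RESERVED = 16     # reserved IDs (16-31)
NUM_KEYS = 125        # learned single-token keys (IDs 32-N)
NUM_BPE = 4096        # BPE subwords for value strings (IDs N+1 to N+B)

STRUCTURAL = range(0, NUM_STRUCTURAL)
RESERVED = range(STRUCTURAL.stop, STRUCTURAL.stop + NUM_RESERVED)
KEYS = range(RESERVED.stop, RESERVED.stop + NUM_KEYS)
BPE = range(KEYS.stop, KEYS.stop + NUM_BPE)

def tier_of(token_id: int) -> str:
    """Map a token ID to the vocabulary tier it belongs to."""
    for name, ids in (("structural", STRUCTURAL), ("reserved", RESERVED),
                      ("key", KEYS), ("bpe", BPE)):
        if token_id in ids:
            return name
    raise ValueError(f"token id {token_id} is outside the vocabulary")

print(tier_of(5), tier_of(40), tier_of(200))  # → structural key bpe
```

Because every structural character and every known key maps to exactly one ID, sequence length on schema-repetitive JSON shrinks while the total ID space stays tiny compared with cl100k_base.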