Update tokenizer: 73K training objects, 125 keys, DOI 10.5281/zenodo.18879110
Files changed:
- README.md (+15 -11)
- json_tokenizer_vocab.json
README.md (changed):
A structure-aware tokenizer that assigns dedicated single tokens to JSON grammar elements, learns a compact key vocabulary from training data, and applies byte-pair encoding (BPE) only to value content.

**Paper:** [Structure-Aware Tokenization for JSON: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary](https://doi.org/10.5281/zenodo.18879110)

**Code:** [github.com/anthony-maio/json-tokenizer](https://github.com/anthony-maio/json-tokenizer)
| Metric | Value |
|--------|-------|
| Token savings vs cl100k_base | **5-15%** on schema-repetitive JSON |
| Vocabulary size | **4,251 tokens** (vs 100,256 for cl100k_base) |
| Vocab reduction | **~90x smaller** |
| Roundtrip fidelity | **100% lossless** across 4,200+ test objects |
| Crossover point | Beats cl100k_base at just **558 tokens** |
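The crossover point has a simple reading: a small specialized vocabulary carries some fixed overhead on short inputs, and the per-token savings on repetitive structure must accumulate past it. A back-of-envelope sketch, with illustrative numbers chosen only to match the reported order of magnitude (they are not taken from the paper):

```python
def crossover_point(overhead_tokens: float, savings_rate: float) -> float:
    """Sequence length at which cumulative savings (rate * n)
    overtake a fixed overhead: n = overhead / rate."""
    return overhead_tokens / savings_rate

# Illustrative only: ~56 tokens of fixed overhead amortized by a
# 10% savings rate breaks even near the 558-token mark in the table.
print(round(crossover_point(55.8, 0.10)))  # 558
```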
Three-tier vocabulary:
1. **Structural tokens** (IDs 0-15): `{`, `}`, `[`, `]`, `:`, `,`, `true`, `false`, `null`, type markers
2. **Key vocabulary** (IDs 32-N): Learned single-token keys from training data (125 keys)
3. **BPE subwords** (IDs N+1 to N+B): Byte-pair encoding trained on JSON value strings (4,096 tokens)
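The three tiers can be sketched end to end. This is a minimal illustration of the layout above, not the project's implementation: a handful of structural IDs, a toy key vocabulary starting at ID 32 (the key names here are hypothetical), and a byte-level fallback standing in for the value BPE. Arrays and the reserved ID range are omitted for brevity.

```python
import json

# Tier 1: structural tokens (IDs 0-15), per the layout above.
STRUCTURAL = {"{": 0, "}": 1, "[": 2, "]": 3, ":": 4, ",": 5,
              "true": 6, "false": 7, "null": 8}
# Tier 2: key vocabulary learned from training data (IDs 32..N).
KEYS = {"name": 32, "lat": 33, "lon": 34}
# Tier 3 stand-in: raw bytes offset past the key IDs (real impl uses BPE).
BYTE_BASE = 64

def encode(obj) -> list[int]:
    """Encode a dict of known keys and scalar values (arrays omitted)."""
    ids = []
    if isinstance(obj, dict):
        ids.append(STRUCTURAL["{"])
        for i, (k, v) in enumerate(obj.items()):
            if i:
                ids.append(STRUCTURAL[","])
            ids.append(KEYS[k])          # one token per known key
            ids.append(STRUCTURAL[":"])
            ids.extend(encode(v))
        ids.append(STRUCTURAL["}"])
    elif isinstance(obj, bool):
        ids.append(STRUCTURAL["true" if obj else "false"])
    elif obj is None:
        ids.append(STRUCTURAL["null"])
    else:  # strings and numbers fall through to the byte tier
        ids.extend(BYTE_BASE + b for b in json.dumps(obj).encode())
    return ids

tokens = encode({"name": "Oslo", "lat": 59.91})
```

Because every known key costs exactly one token regardless of its length, repeating the same schema across many objects is where the reported 5-15% savings accumulates.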
## This Model

This pretrained tokenizer was trained on structured JSON datasets:
- GeoJSON city features (geographic data)
- Observability telemetry logs (monitoring data)
- Kubernetes manifests (infrastructure config)
- Structured API outputs
- Synthetic training corpus (700 objects)

**Total training objects:** 72,991

**Vocabulary:** 4,251 tokens (16 structural + 16 reserved + 125 keys + 4,096 BPE)
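The breakdown above pins down the key tier's position in the ID space. A quick sketch of the arithmetic, assuming each tier is contiguous as the "IDs 32-N" notation suggests:

```python
N_STRUCTURAL = 16  # IDs 0-15
N_RESERVED = 16    # IDs 16-31
N_KEYS = 125       # learned keys, IDs 32..N

key_start = N_STRUCTURAL + N_RESERVED  # first key ID
key_end = key_start + N_KEYS - 1       # N, the last key ID
bpe_start = key_end + 1                # BPE subwords begin at N+1
print(key_start, key_end, bpe_start)   # 32 156 157
```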
## Usage
## Citation

```bibtex
@software{maio2026jsontokenizer,
  author = {Maio, Anthony},
  title = {Structure-Aware Tokenization for {JSON}: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary},
  year = {2026},
  url = {https://github.com/anthony-maio/json-tokenizer},
  doi = {10.5281/zenodo.18879110},
  version = {0.2.0},
  license = {MIT}
}
```
json_tokenizer_vocab.json (changed): diff too large to render; see raw diff.