---
library_name: transformers
license: mit
language:
- en
tags:
- tokenizer
- json
- bpe
- structured-data
- llm
pipeline_tag: text-generation
---
# json-tokenizer: Structure-Aware Tokenization for JSON
A structure-aware tokenizer that assigns dedicated single tokens to JSON grammar elements, learns a compact key vocabulary from training data, and applies byte-pair encoding (BPE) only to value content.
**Paper:** [Structure-Aware Tokenization for JSON: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary](https://doi.org/10.5281/zenodo.18879110)
**Code:** [github.com/anthony-maio/json-tokenizer](https://github.com/anthony-maio/json-tokenizer)
## Key Results
| Metric | Value |
|--------|-------|
| Token savings vs cl100k_base | **5-15%** on schema-repetitive JSON |
| Vocabulary size | **4,251 tokens** (vs 100,256 for cl100k_base) |
| Vocab reduction | **~24x smaller** for this 4,251-token checkpoint (100,256 / 4,251 ≈ 23.6) |
| Roundtrip fidelity | **100% lossless** across 4,200+ test objects |
| Crossover point | Beats cl100k_base at just **558 tokens** |
## Architecture
Three-tier vocabulary, with IDs 16-31 reserved between the structural and key tiers:
1. **Structural tokens** (IDs 0-15): `{`, `}`, `[`, `]`, `:`, `,`, `true`, `false`, `null`, type markers
2. **Key vocabulary** (IDs 32 to N): single-token keys learned from training data (125 keys in this model)
3. **BPE subwords** (IDs N+1 to N+B): byte-pair encoding trained on JSON value strings (4,096 tokens)
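To make the tiers concrete, here is a minimal sketch of how a JSON value might be split across them. It is an illustration only, not the library's implementation: the `tokenize` function, the tier labels, and the fallback of sending unknown keys to the BPE tier are all assumptions, and value text is left unsplit where real BPE would break it into subwords.

```python
import json


def tokenize(value, key_vocab):
    """Walk a parsed JSON value and label each piece with its (assumed) tier.

    Returns a list of (tier, text) pairs where tier is "struct", "key", or "bpe".
    """
    tokens = []

    def walk(node):
        if isinstance(node, dict):
            tokens.append(("struct", "{"))
            for i, (key, child) in enumerate(node.items()):
                if i:
                    tokens.append(("struct", ","))
                # Known keys cost one token; unknown keys fall back to BPE (assumption).
                tokens.append(("key" if key in key_vocab else "bpe", key))
                tokens.append(("struct", ":"))
                walk(child)
            tokens.append(("struct", "}"))
        elif isinstance(node, list):
            tokens.append(("struct", "["))
            for i, child in enumerate(node):
                if i:
                    tokens.append(("struct", ","))
                walk(child)
            tokens.append(("struct", "]"))
        elif node is True or node is False or node is None:
            # true / false / null are structural tokens in the tier list above.
            tokens.append(("struct", json.dumps(node)))
        else:
            # Strings and numbers: value content, where BPE would apply.
            tokens.append(("bpe", json.dumps(node)))

    walk(value)
    return tokens


if __name__ == "__main__":
    for tier, text in tokenize({"name": "Alice", "active": True}, {"name", "active"}):
        print(f"{tier:6} {text}")
```

On schema-repetitive data most of the stream lands in the `struct` and `key` tiers, which is where the single-token savings would come from.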
## This Model
This pretrained tokenizer was trained on structured JSON datasets:
- GeoJSON city features (geographic data)
- Observability telemetry logs (monitoring data)
- Kubernetes manifests (infrastructure config)
- Structured API outputs
- Synthetic training corpus (700 objects)
**Total training objects:** 72,991
**Vocabulary:** 4,251 tokens (16 structural + 16 reserved + 125 keys + 4,096 BPE)
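The key tier is learned from the training objects. One plausible way to build such a vocabulary is to count every key at every nesting depth and keep the most frequent ones, as sketched below. The ranking rule, the `learn_key_vocab` name, and the `first_id=32` default are assumptions for illustration; the library's actual selection criterion may differ.

```python
from collections import Counter


def count_keys(node, counts):
    """Recursively count object keys at every nesting level."""
    if isinstance(node, dict):
        for key, child in node.items():
            counts[key] += 1
            count_keys(child, counts)
    elif isinstance(node, list):
        for child in node:
            count_keys(child, counts)


def learn_key_vocab(objects, max_keys, first_id=32):
    """Map the max_keys most frequent keys to consecutive IDs from first_id.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter()
    for obj in objects:
        count_keys(obj, counts)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {key: first_id + i for i, (key, _) in enumerate(ranked[:max_keys])}


if __name__ == "__main__":
    sample = [
        {"type": "Feature", "properties": {"name": "Oslo"}},
        {"type": "Feature", "geometry": None},
    ]
    print(learn_key_vocab(sample, max_keys=2))
```

A cap such as `max_keys` plays the same role as the `max_key_vocab` argument shown in the training example: rare keys are left out of the key tier.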
## Usage
### With HuggingFace Transformers
```python
# Requires: pip install "json-tokenizer[huggingface]"  (quoted so shells such as zsh do not expand the brackets)
from json_tokenizer.hf_compat import JSONPreTrainedTokenizer
tokenizer = JSONPreTrainedTokenizer.from_pretrained("anthonym21/json-tokenizer-structured")
# Encode JSON
output = tokenizer('{"name": "Alice", "age": 30, "active": true}')
print(output["input_ids"])
# Decode back to JSON (lossless)
decoded = tokenizer.decode(output["input_ids"])
print(decoded) # {"name": "Alice", "age": 30, "active": true}
```
### With Core Library
```python
# Requires: pip install json-tokenizer
from json_tokenizer import JSONTokenizer
tokenizer = JSONTokenizer.load("./path/to/saved/tokenizer")
# Encode (accepts Python dicts, lists, or JSON strings)
ids = tokenizer.encode({"name": "Alice", "age": 30})
# Decode (lossless roundtrip)
json_str = tokenizer.decode(ids)
```
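To check the lossless roundtrip on your own data, a small helper like the one below can compare the decoded output with the original value. It is a sketch, not part of the library: `roundtrips` and `canonical` are hypothetical names, and it assumes only what the examples above show, namely that `encode` accepts a Python object and `decode` returns a JSON string. It compares JSON values, so whitespace and key order are ignored.

```python
import json


def canonical(value):
    """Serialize with sorted keys and compact separators for comparison."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


def roundtrips(obj, encode, decode):
    """Return True if decode(encode(obj)) yields the same JSON value as obj."""
    restored = decode(encode(obj))
    if isinstance(restored, str):
        # decode is assumed to return JSON text; parse it before comparing.
        restored = json.loads(restored)
    # Comparing canonical text keeps 1, 1.0, and true distinct,
    # which plain Python equality would not.
    return canonical(restored) == canonical(obj)


# Usage with a loaded tokenizer (assumed, per the example above):
#   ok = roundtrips({"name": "Alice", "age": 30}, tokenizer.encode, tokenizer.decode)
```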
## Training Your Own
```python
from json_tokenizer import JSONTokenizer
tok = JSONTokenizer(bpe_vocab_size=4096, max_key_vocab=512)
tok.train_from_json_files(["your_data.jsonl"])
tok.save("./my_tokenizer")
# Convert to HF format
from json_tokenizer.hf_compat import JSONPreTrainedTokenizer
hf_tok = JSONPreTrainedTokenizer.from_json_tokenizer(tok)
hf_tok.save_pretrained("./my_hf_tokenizer")
```
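If your training data lives in Python objects rather than on disk, it can be written out first. The sketch below assumes the trainer expects JSON Lines (one JSON value per line), which the `.jsonl` extension suggests but this card does not state; `write_jsonl` is a hypothetical helper, not a library function.

```python
import json
from pathlib import Path


def write_jsonl(objects, path):
    """Write each object as one compact JSON line; return the number of lines written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as handle:
        for obj in objects:
            # json.dumps escapes newlines inside strings, so each object stays on one line.
            handle.write(json.dumps(obj, ensure_ascii=False) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    records = [{"city": "Oslo", "pop": 709000}, {"city": "Bergen", "pop": 289000}]
    print(write_jsonl(records, "your_data.jsonl"))
```

The resulting file path can then be passed to `train_from_json_files` as in the example above.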
## Where It Wins / Where It Loses
| Scenario | json-tokenizer | cl100k_base |
|----------|---------------|-------------|
| GeoJSON (schema-repetitive) | **+7.8% savings** | baseline |
| Telemetry logs | **+5.5% savings** | baseline |
| Batch JSON arrays | **+9.3% savings** | baseline |
| Config objects | **+12.3% savings** | baseline |
| Prose-heavy JSON (Alpaca) | -26.2% | **wins** |
| K8s manifests (deep nesting) | break-even | break-even |
**Best for:** API responses, observability logs, function calling, structured outputs
**Not for:** Prose-heavy JSON, general-purpose text
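To see which side of this table your own data falls on, you can measure savings directly. The sketch below assumes "savings" means the relative reduction in total token count against the baseline, and that objects are serialized compactly before counting; both are assumptions, and the paper's exact methodology may differ. The token counters are passed in as plain callables, so the code itself needs only the standard library.

```python
import json


def savings_pct(baseline_tokens, candidate_tokens):
    """Percentage of tokens saved relative to the baseline (negative means more tokens)."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - candidate_tokens) / baseline_tokens


def compare(objects, count_baseline, count_candidate):
    """Total both tokenizers' counts over a corpus and return the savings percentage."""
    baseline_total = 0
    candidate_total = 0
    for obj in objects:
        text = json.dumps(obj, separators=(",", ":"))  # compact form (assumption)
        baseline_total += count_baseline(text)
        candidate_total += count_candidate(text)
    return savings_pct(baseline_total, candidate_total)


# Example wiring (requires the optional tiktoken package and a loaded tokenizer):
#   import tiktoken
#   enc = tiktoken.get_encoding("cl100k_base")
#   pct = compare(my_objects,
#                 lambda s: len(enc.encode(s)),
#                 lambda s: len(tokenizer.encode(s)))
```

A positive result corresponds to the "+x% savings" rows above; a negative one corresponds to cases like the prose-heavy row, where the baseline uses fewer tokens.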
## Citation
```bibtex
@software{maio2026jsontokenizer,
author = {Maio, Anthony},
title = {Structure-Aware Tokenization for {JSON}: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary},
year = {2026},
url = {https://github.com/anthony-maio/json-tokenizer},
doi = {10.5281/zenodo.18879110},
version = {0.2.0},
license = {MIT}
}
```
## License
MIT