---
library_name: transformers
license: mit
language:
- en
tags:
- tokenizer
- json
- bpe
- structured-data
- llm
pipeline_tag: text-generation
---
# json-tokenizer: Structure-Aware Tokenization for JSON
A structure-aware tokenizer that assigns dedicated single tokens to JSON grammar elements, learns a compact key vocabulary from training data, and applies byte-pair encoding (BPE) only to value content.
**Paper:** [Structure-Aware Tokenization for JSON: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary](https://doi.org/10.5281/zenodo.18879110)
**Code:** [github.com/anthony-maio/json-tokenizer](https://github.com/anthony-maio/json-tokenizer)
## Key Results
| Metric | Value |
|--------|-------|
| Token savings vs cl100k_base | **5-15%** on schema-repetitive JSON |
| Vocabulary size | **4,251 tokens** (vs 100,256 for cl100k_base) |
| Vocab reduction | **~90x smaller** |
| Roundtrip fidelity | **100% lossless** across 4,200+ test objects |
| Crossover point | Beats cl100k_base at just **558 tokens** |
## Architecture
Three-tier vocabulary:
1. **Structural tokens** (IDs 0-15): `{`, `}`, `[`, `]`, `:`, `,`, `true`, `false`, `null`, type markers
2. **Key vocabulary** (IDs 32-N): Learned single-token keys from training data (125 keys)
3. **BPE subwords** (IDs N+1 to N+B): Byte-pair encoding trained on JSON value strings (4,096 tokens)
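The tier lookup can be sketched with a toy encoder (an illustrative assumption, not the library's actual implementation; the real BPE tier is replaced here by a plain byte-level fallback, and the ID offsets are hypothetical):

```python
# Toy three-tier lookup: structural tokens get fixed IDs, learned keys get
# single IDs, and value content falls back to a byte-level tier that stands
# in for the trained BPE subwords.
STRUCTURAL = {t: i for i, t in enumerate(
    ['{', '}', '[', ']', ':', ',', 'true', 'false', 'null'])}
KEY_VOCAB = {'name': 32, 'age': 33}   # hypothetical keys "learned" from data
BPE_BASE = 64                         # hypothetical start of the subword tier

def encode_toy(tokens):
    """Map a pre-split JSON token stream to IDs, tier by tier."""
    ids = []
    for tok in tokens:
        if tok in STRUCTURAL:         # tier 1: grammar element, one token
            ids.append(STRUCTURAL[tok])
        elif tok in KEY_VOCAB:        # tier 2: whole key, one token
            ids.append(KEY_VOCAB[tok])
        else:                         # tier 3: value content, subword/byte IDs
            ids.extend(BPE_BASE + b for b in tok.encode('utf-8'))
    return ids

# '{"name": "Alice"}' pre-split into structural / key / value tokens:
print(encode_toy(['{', 'name', ':', 'Alice', '}']))
# → [0, 32, 4, 129, 172, 169, 163, 165, 1]
```

Note how the key `name` costs one token regardless of its length, which is where the savings on schema-repetitive data come from.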
## This Model
This pretrained tokenizer was trained on structured JSON datasets:
- GeoJSON city features (geographic data)
- Observability telemetry logs (monitoring data)
- Kubernetes manifests (infrastructure config)
- Structured API outputs
- Synthetic training corpus (700 objects)
**Total training objects:** 72,991
**Vocabulary:** 4,251 tokens (16 structural + 16 reserved + 125 keys + 4,096 BPE)
## Usage
### With HuggingFace Transformers
```python
# Requires: pip install json-tokenizer[huggingface]
from json_tokenizer.hf_compat import JSONPreTrainedTokenizer
tokenizer = JSONPreTrainedTokenizer.from_pretrained("anthonym21/json-tokenizer-structured")
# Encode JSON
output = tokenizer('{"name": "Alice", "age": 30, "active": true}')
print(output["input_ids"])
# Decode back to JSON (lossless)
decoded = tokenizer.decode(output["input_ids"])
print(decoded) # {"name": "Alice", "age": 30, "active": true}
```
### With Core Library
```python
# Requires: pip install json-tokenizer
from json_tokenizer import JSONTokenizer
tokenizer = JSONTokenizer.load("./path/to/saved/tokenizer")
# Encode (accepts Python dicts, lists, or JSON strings)
ids = tokenizer.encode({"name": "Alice", "age": 30})
# Decode (lossless roundtrip)
json_str = tokenizer.decode(ids)
```
## Training Your Own
```python
from json_tokenizer import JSONTokenizer
tok = JSONTokenizer(bpe_vocab_size=4096, max_key_vocab=512)
tok.train_from_json_files(["your_data.jsonl"])
tok.save("./my_tokenizer")
# Convert to HF format
from json_tokenizer.hf_compat import JSONPreTrainedTokenizer
hf_tok = JSONPreTrainedTokenizer.from_json_tokenizer(tok)
hf_tok.save_pretrained("./my_hf_tokenizer")
```
## Where It Wins / Where It Loses
| Scenario | json-tokenizer | cl100k_base |
|----------|---------------|-------------|
| GeoJSON (schema-repetitive) | **+7.8% savings** | baseline |
| Telemetry logs | **+5.5% savings** | baseline |
| Batch JSON arrays | **+9.3% savings** | baseline |
| Config objects | **+12.3% savings** | baseline |
| Prose-heavy JSON (Alpaca) | -26.2% | **wins** |
| K8s manifests (deep nesting) | break-even | break-even |
**Best for:** API responses, observability logs, function calling, structured outputs
**Not for:** Prose-heavy JSON, general-purpose text
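One rough way to decide which side of this table your data falls on (a heuristic sketch, not part of the library): measure what share of a corpus's string characters sit in object keys versus free-text values. A high key share suggests schema-repetitive data where json-tokenizer wins; a low share suggests prose-heavy values better served by a general-purpose tokenizer.

```python
def key_share(objs):
    """Fraction of string characters that live in object keys (heuristic)."""
    key_chars = val_chars = 0

    def walk(node):
        nonlocal key_chars, val_chars
        if isinstance(node, dict):
            for k, v in node.items():
                key_chars += len(k)   # key text
                walk(v)
        elif isinstance(node, list):
            for item in node:
                walk(item)
        elif isinstance(node, str):
            val_chars += len(node)    # value text

    for obj in objs:
        walk(obj)
    total = key_chars + val_chars
    return key_chars / total if total else 0.0

config = [{"replicas": 3, "image": "nginx", "port": 80}] * 100
prose = [{"text": "A long free-form paragraph of natural language " * 5}]
print(key_share(config))  # high: config-style, schema-repetitive
print(key_share(prose))   # low: prose-heavy values
```

The exact threshold is workload-dependent; benchmark both tokenizers on a sample before committing.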
## Citation
```bibtex
@software{maio2026jsontokenizer,
author = {Maio, Anthony},
title = {Structure-Aware Tokenization for {JSON}: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary},
year = {2026},
url = {https://github.com/anthony-maio/json-tokenizer},
doi = {10.5281/zenodo.18879110},
version = {0.2.0},
license = {MIT}
}
```
## License
MIT