---
library_name: transformers
license: mit
language:
  - en
tags:
  - tokenizer
  - json
  - bpe
  - structured-data
  - llm
pipeline_tag: text-generation
---

# json-tokenizer: Structure-Aware Tokenization for JSON

A structure-aware tokenizer that assigns dedicated single tokens to JSON grammar elements, learns a compact key vocabulary from training data, and applies byte-pair encoding (BPE) only to value content.

**Paper:** [Structure-Aware Tokenization for JSON: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary](https://doi.org/10.5281/zenodo.18879110)

**Code:** [github.com/anthony-maio/json-tokenizer](https://github.com/anthony-maio/json-tokenizer)

## Key Results

| Metric | Value |
|--------|-------|
| Token savings vs cl100k_base | **5-15%** on schema-repetitive JSON |
| Vocabulary size | **4,251 tokens** (vs 100,256 for cl100k_base) |
| Vocab reduction | **~90x smaller** |
| Roundtrip fidelity | **100% lossless** across 4,200+ test objects |
| Crossover point | Beats cl100k_base at just **558 tokens** |

## Architecture

The vocabulary has three tiers, with a reserved block (IDs 16-31) between the structural tokens and the keys:
1. **Structural tokens** (IDs 0-15): `{`, `}`, `[`, `]`, `:`, `,`, `true`, `false`, `null`, type markers
2. **Key vocabulary** (IDs 32 to N): single-token keys learned from training data (125 keys in this model)
3. **BPE subwords** (IDs N+1 to N+B): byte-pair encoding trained on JSON value strings (4,096 tokens in this model)
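
The ID ranges above can be pictured as one contiguous layout. The sketch below is illustrative only: `tier_of`, `num_keys`, and `num_bpe` are hypothetical names, not part of the `json_tokenizer` API, and it assumes the 16 reserved IDs sit at 16-31, directly between the structural tokens and the keys.

```python
STRUCTURAL_IDS = range(0, 16)   # grammar tokens (IDs 0-15)
RESERVED_IDS = range(16, 32)    # assumed position of the 16 reserved IDs
KEY_BASE = 32                   # first key ID


def tier_of(token_id: int, num_keys: int, num_bpe: int) -> str:
    """Name the tier a token ID falls in under the assumed contiguous layout."""
    if token_id < 0:
        raise ValueError("token IDs are non-negative")
    if token_id in STRUCTURAL_IDS:
        return "structural"
    if token_id in RESERVED_IDS:
        return "reserved"
    if token_id < KEY_BASE + num_keys:
        return "key"
    if token_id < KEY_BASE + num_keys + num_bpe:
        return "bpe"
    raise ValueError(f"token ID {token_id} is outside the vocabulary")
```

With toy sizes of 3 keys and 8 BPE tokens, IDs 32-34 would be keys and IDs 35-42 would be BPE subwords.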

## This Model

This pretrained tokenizer was trained on structured JSON datasets:
- GeoJSON city features (geographic data)
- Observability telemetry logs (monitoring data)
- Kubernetes manifests (infrastructure config)
- Structured API outputs
- Synthetic training corpus (700 objects)

**Total training objects:** 72,991

**Vocabulary:** 4,251 tokens (16 structural + 16 reserved + 125 keys + 4,096 BPE)

## Usage

### With HuggingFace Transformers

```python
# Requires: pip install json-tokenizer[huggingface]
from json_tokenizer.hf_compat import JSONPreTrainedTokenizer

tokenizer = JSONPreTrainedTokenizer.from_pretrained("anthonym21/json-tokenizer-structured")

# Encode JSON
output = tokenizer('{"name": "Alice", "age": 30, "active": true}')
print(output["input_ids"])

# Decode back to JSON (lossless)
decoded = tokenizer.decode(output["input_ids"])
print(decoded)  # {"name": "Alice", "age": 30, "active": true}
```

### With Core Library

```python
# Requires: pip install json-tokenizer
from json_tokenizer import JSONTokenizer

tokenizer = JSONTokenizer.load("./path/to/saved/tokenizer")

# Encode (accepts Python dicts, lists, or JSON strings)
ids = tokenizer.encode({"name": "Alice", "age": 30})

# Decode (lossless roundtrip)
json_str = tokenizer.decode(ids)
```
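
To check the lossless-roundtrip claim on your own data, something like the following may help. It is a sketch, not part of the library: `failed_roundtrips` is a hypothetical helper that assumes only that the tokenizer exposes `encode(obj)` and a `decode(ids)` returning a JSON string, as in the example above.

```python
import json
from typing import Any, Iterable, List


def failed_roundtrips(tokenizer: Any, objects: Iterable[Any]) -> List[Any]:
    """Return the objects that do not survive encode -> decode unchanged."""
    failures = []
    for obj in objects:
        ids = tokenizer.encode(obj)
        # Compare parsed values, so whitespace and key order are ignored.
        restored = json.loads(tokenizer.decode(ids))
        if restored != obj:
            failures.append(obj)
    return failures
```

An empty result means every object roundtripped. Because the comparison uses Python equality on parsed values, it does not distinguish `1` from `1.0`; compare the raw strings instead if that matters for your data.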

## Training Your Own

```python
from json_tokenizer import JSONTokenizer

tok = JSONTokenizer(bpe_vocab_size=4096, max_key_vocab=512)
tok.train_from_json_files(["your_data.jsonl"])
tok.save("./my_tokenizer")

# Convert to HF format
from json_tokenizer.hf_compat import JSONPreTrainedTokenizer
hf_tok = JSONPreTrainedTokenizer.from_json_tokenizer(tok)
hf_tok.save_pretrained("./my_hf_tokenizer")
```
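
The `max_key_vocab` argument caps how many keys receive a dedicated token. One plausible way to picture key selection is frequency counting, sketched below. This is an assumption for illustration, not the library's documented algorithm, and `count_keys` and `select_key_vocab` are hypothetical names.

```python
from collections import Counter
from typing import Any, Iterable, List


def count_keys(value: Any, counts: Counter) -> None:
    """Recursively tally every object key found in a parsed JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            counts[key] += 1
            count_keys(child, counts)
    elif isinstance(value, list):
        for child in value:
            count_keys(child, counts)


def select_key_vocab(objects: Iterable[Any], max_key_vocab: int) -> List[str]:
    """Keep the most frequent keys, up to max_key_vocab (assumed selection rule)."""
    counts: Counter = Counter()
    for obj in objects:
        count_keys(obj, counts)
    return [key for key, _ in counts.most_common(max_key_vocab)]
```

Under this rule, keys that fall outside the cap would need some other encoding, such as BPE; how the library actually handles them is not covered here.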

## Where It Wins / Where It Loses

| Scenario | json-tokenizer | cl100k_base |
|----------|---------------|-------------|
| GeoJSON (schema-repetitive) | **+7.8% savings** | baseline |
| Telemetry logs | **+5.5% savings** | baseline |
| Batch JSON arrays | **+9.3% savings** | baseline |
| Config objects | **+12.3% savings** | baseline |
| Prose-heavy JSON (Alpaca) | -26.2% | **wins** |
| K8s manifests (deep nesting) | break-even | break-even |

**Best for:** API responses, observability logs, function calling, structured outputs

**Not for:** Prose-heavy JSON, general-purpose text
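
The percentages in the table can be read as relative token savings against the baseline. The sketch below assumes that definition (baseline count minus candidate count, divided by baseline count); `savings_pct` is a hypothetical helper and the token counts in the usage note are made up for illustration.

```python
def savings_pct(baseline_tokens: int, candidate_tokens: int) -> float:
    """Percent of tokens saved relative to the baseline; negative means more tokens."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - candidate_tokens) / baseline_tokens
```

For example, a hypothetical document that costs 1,000 baseline tokens and 922 candidate tokens yields a saving of about 7.8%, while 1,262 candidate tokens yields about -26.2%.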

## Citation

```bibtex
@software{maio2026jsontokenizer,
  author    = {Maio, Anthony},
  title     = {Structure-Aware Tokenization for {JSON}: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary},
  year      = {2026},
  url       = {https://github.com/anthony-maio/json-tokenizer},
  doi       = {10.5281/zenodo.18879110},
  version   = {0.2.0},
  license   = {MIT}
}
```

## License

MIT