---
library_name: transformers
license: mit
language:
  - en
tags:
  - tokenizer
  - json
  - bpe
  - structured-data
  - llm
pipeline_tag: text-generation
---

# json-tokenizer: Structure-Aware Tokenization for JSON

A structure-aware tokenizer that assigns dedicated single tokens to JSON grammar elements, learns a compact key vocabulary from training data, and applies byte-pair encoding (BPE) only to value content.

**Paper:** [Structure-Aware Tokenization for JSON: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary](https://doi.org/10.5281/zenodo.XXXXXXX)

**Code:** [github.com/anthony-maio/json-tokenizer](https://github.com/anthony-maio/json-tokenizer)

## Key Results

| Metric | Value |
|--------|-------|
| Token savings vs cl100k_base | **5-15%** on schema-repetitive JSON |
| Vocabulary size | **4,190 tokens** (vs 100,256 for cl100k_base) |
| Vocab reduction | **~90x smaller** |
| Roundtrip fidelity | **100% lossless** across 4,200+ test objects |
| Crossover point | Beats cl100k_base at just **558 tokens** |

## Architecture

Three-tier vocabulary (IDs 16-31 are reserved):
1. **Structural tokens** (IDs 0-15): `{`, `}`, `[`, `]`, `:`, `,`, `true`, `false`, `null`, type markers
2. **Key vocabulary** (IDs 32-N): Learned single-token keys from training data (65 keys)
3. **BPE subwords** (IDs N+1 to N+B): Byte-pair encoding trained on JSON value strings (4,096 tokens)
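
The tier layout above can be sketched as a toy encoder. This is illustrative only: the IDs, the two-entry key vocabulary, and the byte-per-character value fallback are stand-ins (the real tokenizer applies trained BPE merges to value strings):

```python
# Toy sketch of the three-tier ID layout; not the library's actual encoding.
STRUCTURAL = {t: i for i, t in enumerate(
    ["{", "}", "[", "]", ":", ",", "true", "false", "null"])}
KEY_BASE = 32                                    # IDs 16-31 are reserved
KEYS = {"name": KEY_BASE, "age": KEY_BASE + 1}   # toy "learned" key vocab
VALUE_BASE = KEY_BASE + len(KEYS)                # value tokens start here

def encode(obj):
    """Encode a flat dict: structural + key + value tokens."""
    ids = [STRUCTURAL["{"]]
    for i, (key, value) in enumerate(obj.items()):
        if i:
            ids.append(STRUCTURAL[","])
        ids.append(KEYS[key])                    # one token per known key
        ids.append(STRUCTURAL[":"])
        if value is True or value is False or value is None:
            ids.append(STRUCTURAL[str(value).lower() if value is not None else "null"])
        else:
            # stand-in for BPE over value content: one token per UTF-8 byte
            ids.extend(VALUE_BASE + b for b in str(value).encode())
    ids.append(STRUCTURAL["}"])
    return ids

print(len(encode({"name": "Alice", "age": 30})))  # 14
```

Because every brace, comma, colon, and known key costs exactly one token, the per-object overhead is fixed; savings grow as the same keys repeat across many objects.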

## This Model

This pretrained tokenizer was trained on four structured JSON datasets:
- GeoJSON city features (geographic data)
- Observability telemetry logs (monitoring data)
- Kubernetes manifests (infrastructure config)
- Structured API outputs

- **Total training objects:** 10,530
- **Vocabulary:** 4,190 tokens (16 structural + 16 reserved + 65 keys + 4,096 BPE)

## Usage

### With HuggingFace Transformers

```python
# Requires: pip install json-tokenizer[huggingface]
from json_tokenizer.hf_compat import JSONPreTrainedTokenizer

tokenizer = JSONPreTrainedTokenizer.from_pretrained("anthonym21/json-tokenizer-structured")

# Encode JSON
output = tokenizer('{"name": "Alice", "age": 30, "active": true}')
print(output["input_ids"])

# Decode back to JSON (lossless)
decoded = tokenizer.decode(output["input_ids"])
print(decoded)  # {"name": "Alice", "age": 30, "active": true}
```

### With Core Library

```python
# Requires: pip install json-tokenizer
from json_tokenizer import JSONTokenizer

tokenizer = JSONTokenizer.load("./path/to/saved/tokenizer")

# Encode (accepts Python dicts, lists, or JSON strings)
ids = tokenizer.encode({"name": "Alice", "age": 30})

# Decode (lossless roundtrip)
json_str = tokenizer.decode(ids)
```

## Training Your Own

```python
from json_tokenizer import JSONTokenizer

tok = JSONTokenizer(bpe_vocab_size=4096, max_key_vocab=512)
tok.train_from_json_files(["your_data.jsonl"])
tok.save("./my_tokenizer")

# Convert to HF format
from json_tokenizer.hf_compat import JSONPreTrainedTokenizer
hf_tok = JSONPreTrainedTokenizer.from_json_tokenizer(tok)
hf_tok.save_pretrained("./my_hf_tokenizer")
```

## Where It Wins / Where It Loses

| Scenario | json-tokenizer | cl100k_base |
|----------|---------------|-------------|
| GeoJSON (schema-repetitive) | **+7.8% savings** | baseline |
| Telemetry logs | **+5.5% savings** | baseline |
| Batch JSON arrays | **+9.3% savings** | baseline |
| Config objects | **+12.3% savings** | baseline |
| Prose-heavy JSON (Alpaca) | -26.2% | **wins** |
| K8s manifests (deep nesting) | break-even | break-even |

- **Best for:** API responses, observability logs, function calling, structured outputs
- **Not for:** prose-heavy JSON, general-purpose text
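
A quick way to gauge whether a corpus falls on the winning side is to check how concentrated its key usage is. This heuristic is not part of the library; the threshold of 64 keys is an arbitrary stand-in for the learned key vocabulary size:

```python
from collections import Counter

def key_repetition(objs, top_k=64):
    """Fraction of all key occurrences covered by the top_k most common keys.
    Values near 1.0 suggest a compact learned key vocabulary will pay off."""
    counts = Counter()

    def walk(node):
        if isinstance(node, dict):
            for k, v in node.items():
                counts[k] += 1
                walk(v)
        elif isinstance(node, list):
            for v in node:
                walk(v)

    for obj in objs:
        walk(obj)
    total = sum(counts.values())
    top = sum(c for _, c in counts.most_common(top_k))
    return top / total if total else 0.0

sample = [{"name": "a", "age": 1}, {"name": "b", "age": 2}]
print(key_repetition(sample))  # 1.0 for this toy sample
```

Schema-repetitive corpora (telemetry, API batches, configs) score near 1.0; prose-heavy corpora with long free-text values gain little from single-token keys regardless of this score, since most tokens go to value content.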

## Citation

```bibtex
@article{maio2026json,
  title={Structure-Aware Tokenization for JSON: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary},
  author={Maio, Anthony},
  year={2026},
  url={https://doi.org/10.5281/zenodo.XXXXXXX}
}
```

## License

MIT