	---
	library_name: transformers
	license: mit
	language:
	- en
	tags:
	- tokenizer
	- json
	- bpe
	- structured-data
	- llm
	pipeline_tag: text-generation
	---

	# json-tokenizer: Structure-Aware Tokenization for JSON

	A structure-aware tokenizer that assigns dedicated single tokens to JSON grammar elements, learns a compact key vocabulary from training data, and applies byte-pair encoding (BPE) only to value content.

	Paper: [Structure-Aware Tokenization for JSON: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary](https://doi.org/10.5281/zenodo.XXXXXXX)

	Code: [github.com/anthony-maio/json-tokenizer](https://github.com/anthony-maio/json-tokenizer)

	## Key Results

	| Metric | Value |
	|--------|-------|
	| Token savings vs cl100k_base | 5-15% on schema-repetitive JSON |
	| Vocabulary size | 4,190 tokens (vs 100,256 for cl100k_base) |
	| Vocab reduction | ~90x smaller |
	| Roundtrip fidelity | 100% lossless across 4,200+ test objects |
	| Crossover point | Beats cl100k_base at just 558 tokens |

	## Architecture

	Three-tier vocabulary:
	1. Structural tokens (IDs 0-15): `{`, `}`, `[`, `]`, `:`, `,`, `true`, `false`, `null`, and type markers
	2. Key vocabulary (IDs 32-N): keys learned from the training data, each encoded as a single token (65 keys in this model; IDs 16-31 are reserved)
	3. BPE subwords (IDs N+1 to N+B): byte-pair encoding trained on JSON value strings (4,096 tokens)
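
The three tiers can be illustrated with a toy encoder. This is a minimal sketch and not the library's implementation: the specific structural IDs, the `<str>`/`<num>` type markers, and the byte-level fallback standing in for BPE are all assumptions made for illustration. Only the idea of keys starting at ID 32 comes from the layout above.

```python
import json

# Hypothetical structural IDs (tier 1); the real assignments may differ.
STRUCTURAL = {"{": 0, "}": 1, "[": 2, "]": 3, ":": 4, ",": 5,
              "true": 6, "false": 7, "null": 8, "<str>": 9, "<num>": 10}
KEY_BASE = 32  # keys start after the structural and reserved ranges


def build_key_vocab(keys):
    """Tier 2: give each known key one dedicated token ID."""
    return {k: KEY_BASE + i for i, k in enumerate(keys)}


def encode_value_text(text, value_base):
    """Tier 3 stand-in: one token per UTF-8 byte instead of real BPE merges."""
    return [value_base + b for b in text.encode("utf-8")]


def encode(obj, key_vocab, value_base):
    """Walk a parsed JSON value and emit token IDs tier by tier."""
    # Booleans are checked before numbers because bool is a subclass of int.
    if obj is True:
        return [STRUCTURAL["true"]]
    if obj is False:
        return [STRUCTURAL["false"]]
    if obj is None:
        return [STRUCTURAL["null"]]
    if isinstance(obj, str):
        return [STRUCTURAL["<str>"]] + encode_value_text(obj, value_base)
    if isinstance(obj, (int, float)):
        return [STRUCTURAL["<num>"]] + encode_value_text(json.dumps(obj), value_base)
    if isinstance(obj, list):
        ids = [STRUCTURAL["["]]
        for i, item in enumerate(obj):
            if i > 0:
                ids.append(STRUCTURAL[","])
            ids += encode(item, key_vocab, value_base)
        return ids + [STRUCTURAL["]"]]
    if isinstance(obj, dict):
        ids = [STRUCTURAL["{"]]
        for i, (key, value) in enumerate(obj.items()):
            if i > 0:
                ids.append(STRUCTURAL[","])
            if key in key_vocab:
                ids.append(key_vocab[key])  # known key: a single token
            else:
                ids += encode(key, key_vocab, value_base)  # unknown key: value path
            ids.append(STRUCTURAL[":"])
            ids += encode(value, key_vocab, value_base)
        return ids + [STRUCTURAL["}"]]
    raise TypeError(f"not a JSON value: {type(obj).__name__}")


key_vocab = build_key_vocab(["name", "age", "active"])
value_base = KEY_BASE + len(key_vocab)  # value tokens start after the last key
ids = encode({"name": "Al", "age": 30, "active": True}, key_vocab, value_base)
print(ids)
```

In this sketch each key and each grammar symbol costs exactly one token regardless of its length, so only the value content grows with the text.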

	## This Model

	This pretrained tokenizer was trained on four structured JSON datasets:
	- GeoJSON city features (geographic data)
	- Observability telemetry logs (monitoring data)
	- Kubernetes manifests (infrastructure config)
	- Structured API outputs

	- Total training objects: 10,530
	- Vocabulary: 4,190 tokens (16 structural + 16 reserved + 65 keys + 4,096 BPE)

	## Usage

	### With HuggingFace Transformers

	```python
	# Requires: pip install json-tokenizer[huggingface]
	from json_tokenizer.hf_compat import JSONPreTrainedTokenizer

	tokenizer = JSONPreTrainedTokenizer.from_pretrained("anthonym21/json-tokenizer-structured")

	# Encode JSON
	output = tokenizer('{"name": "Alice", "age": 30, "active": true}')
	print(output["input_ids"])

	# Decode back to JSON (lossless)
	decoded = tokenizer.decode(output["input_ids"])
	print(decoded) # {"name": "Alice", "age": 30, "active": true}
	```

	### With Core Library

	```python
	# Requires: pip install json-tokenizer
	from json_tokenizer import JSONTokenizer

	tokenizer = JSONTokenizer.load("./path/to/saved/tokenizer")

	# Encode (accepts Python dicts, lists, or JSON strings)
	ids = tokenizer.encode({"name": "Alice", "age": 30})

	# Decode (lossless roundtrip)
	json_str = tokenizer.decode(ids)
	```
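
A lossless roundtrip can be checked on your own data with a small helper. This is a sketch: `roundtrip_ok` and the toy codec are hypothetical names, and comparing parsed values rather than raw strings is one possible definition of "lossless" that ignores whitespace differences. The toy codec only lets the snippet run without the library installed.

```python
import json


def roundtrip_ok(encode, decode, obj):
    """Return True if decode(encode(obj)) parses back to an equal JSON value."""
    ids = encode(obj)
    return json.loads(decode(ids)) == obj


# Stand-in codec (not json-tokenizer): UTF-8 bytes of compact JSON text.
def toy_encode(obj):
    return list(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


def toy_decode(ids):
    return bytes(ids).decode("utf-8")


samples = [
    {"name": "Alice", "age": 30},
    [1, 2.5, None, True],
    {"nested": {"k": []}},
]
results = [roundtrip_ok(toy_encode, toy_decode, s) for s in samples]
print(results)
```

With a loaded tokenizer, the same helper would presumably be called as `roundtrip_ok(tokenizer.encode, tokenizer.decode, obj)`, using the `encode` and `decode` methods shown above.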

	## Training Your Own

	```python
	from json_tokenizer import JSONTokenizer

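	# Use a 4,096-token BPE vocabulary for values and allow up to 512 learned keys
	tok = JSONTokenizer(bpe_vocab_size=4096, max_key_vocab=512)
	tok.train_from_json_files(["your_data.jsonl"])
	tok.save("./my_tokenizer")

	# Convert to HF format (requires: pip install json-tokenizer[huggingface])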
	from json_tokenizer.hf_compat import JSONPreTrainedTokenizer
	hf_tok = JSONPreTrainedTokenizer.from_json_tokenizer(tok)
	hf_tok.save_pretrained("./my_hf_tokenizer")
	```
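
To get a feel for what a learned key vocabulary might contain before training, the keys in a JSONL corpus can be counted and ranked. The sketch below assumes frequency-based selection capped at `max_key_vocab`, which may not match how the library actually picks keys. `learn_key_vocab` and `collect_keys` are hypothetical helpers, not part of json-tokenizer.

```python
import json
from collections import Counter


def collect_keys(obj, counts):
    """Recursively count every object key in a parsed JSON value."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            counts[key] += 1
            collect_keys(value, counts)
    elif isinstance(obj, list):
        for item in obj:
            collect_keys(item, counts)


def learn_key_vocab(jsonl_lines, max_key_vocab=512, first_id=32):
    """Map the most frequent keys to consecutive IDs starting at first_id."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if line:  # skip blank lines
            collect_keys(json.loads(line), counts)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {key: first_id + i for i, (key, _) in enumerate(ranked[:max_key_vocab])}


lines = [
    '{"type": "Feature", "properties": {"name": "Oslo"}}',
    '{"type": "Feature", "properties": {"name": "Lima", "pop": 10}}',
]
vocab = learn_key_vocab(lines, max_key_vocab=3)
print(vocab)
```

Keys that fall outside the cap (here `pop`) would need a fallback path, for example being encoded like ordinary string values.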

	## Where It Wins / Where It Loses

	| Scenario | json-tokenizer | cl100k_base |
	|----------|---------------|-------------|
	| GeoJSON (schema-repetitive) | +7.8% savings | baseline |
	| Telemetry logs | +5.5% savings | baseline |
	| Batch JSON arrays | +9.3% savings | baseline |
	| Config objects | +12.3% savings | baseline |
	| Prose-heavy JSON (Alpaca) | -26.2% | wins |
	| K8s manifests (deep nesting) | break-even | break-even |

	- Best for: API responses, observability logs, function calling, structured outputs
	- Not for: Prose-heavy JSON, general-purpose text
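
The signed percentages in the table can be read as relative token savings against the baseline. The formula below is an assumption about how such figures are computed, not something this card specifies, and the token counts are made-up inputs, not measurements.

```python
def savings_pct(baseline_tokens, candidate_tokens):
    """Percent of tokens saved relative to the baseline; negative means more tokens."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - candidate_tokens) / baseline_tokens


# Hypothetical counts for illustration only.
fewer = savings_pct(200, 180)  # candidate uses fewer tokens than the baseline
more = savings_pct(200, 250)   # candidate uses more tokens than the baseline
print(fewer, more)
```

Running both tokenizers over a sample of your own payloads and feeding the totals into a function like this is a quick way to see which side of the table a workload falls on.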

	## Citation

	```bibtex
	@article{maio2026json,
	title={Structure-Aware Tokenization for JSON: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary},
	author={Maio, Anthony},
	year={2026},
	url={https://doi.org/10.5281/zenodo.XXXXXXX}
	}
	```

	## License

	MIT