Update tokenizer: 73K training objects, 125 keys, DOI 10.5281/zenodo.18879110
Files changed:
- README.md (+15 -11)
- json_tokenizer_vocab.json
README.md (changed):
A structure-aware tokenizer that assigns dedicated single tokens to JSON grammar elements, learns a compact key vocabulary from training data, and applies byte-pair encoding (BPE) only to value content.

**Paper:** [Structure-Aware Tokenization for JSON: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary](https://doi.org/10.5281/zenodo.18879110)

**Code:** [github.com/anthony-maio/json-tokenizer](https://github.com/anthony-maio/json-tokenizer)
| Metric | Value |
|--------|-------|
| Token savings vs cl100k_base | **5-15%** on schema-repetitive JSON |
| Vocabulary size | **4,251 tokens** (vs 100,256 for cl100k_base) |
| Vocab reduction | **~90x smaller** |
| Roundtrip fidelity | **100% lossless** across 4,200+ test objects |
| Crossover point | Beats cl100k_base at just **558 tokens** |
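The crossover point has a simple reading: a small specialized vocabulary carries some fixed overhead on short inputs, and the per-token savings on repetitive structure must accumulate past it. A back-of-envelope sketch, with illustrative numbers chosen only to match the reported order of magnitude (they are not taken from the paper):

```python
def crossover_point(overhead_tokens: float, savings_rate: float) -> float:
    """Sequence length at which cumulative savings (rate * n)
    overtake a fixed overhead: n = overhead / rate."""
    return overhead_tokens / savings_rate

# Illustrative only: ~56 tokens of fixed overhead amortized by a
# 10% savings rate breaks even near the 558-token mark in the table.
print(round(crossover_point(55.8, 0.10)))  # 558
```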
Three-tier vocabulary:
1. **Structural tokens** (IDs 0-15): `{`, `}`, `[`, `]`, `:`, `,`, `true`, `false`, `null`, type markers
2. **Key vocabulary** (IDs 32-N): Learned single-token keys from training data (125 keys)
3. **BPE subwords** (IDs N+1 to N+B): Byte-pair encoding trained on JSON value strings (4,096 tokens)
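The three tiers can be sketched end to end. This is a minimal illustration of the layout above, not the project's implementation: a handful of structural IDs, a toy key vocabulary starting at ID 32 (the key names here are hypothetical), and a byte-level fallback standing in for the value BPE. Arrays and the reserved ID range are omitted for brevity.

```python
import json

# Tier 1: structural tokens (IDs 0-15), per the layout above.
STRUCTURAL = {"{": 0, "}": 1, "[": 2, "]": 3, ":": 4, ",": 5,
              "true": 6, "false": 7, "null": 8}
# Tier 2: key vocabulary learned from training data (IDs 32..N).
KEYS = {"name": 32, "lat": 33, "lon": 34}
# Tier 3 stand-in: raw bytes offset past the key IDs (real impl uses BPE).
BYTE_BASE = 64

def encode(obj) -> list[int]:
    """Encode a dict of known keys and scalar values (arrays omitted)."""
    ids = []
    if isinstance(obj, dict):
        ids.append(STRUCTURAL["{"])
        for i, (k, v) in enumerate(obj.items()):
            if i:
                ids.append(STRUCTURAL[","])
            ids.append(KEYS[k])          # one token per known key
            ids.append(STRUCTURAL[":"])
            ids.extend(encode(v))
        ids.append(STRUCTURAL["}"])
    elif isinstance(obj, bool):
        ids.append(STRUCTURAL["true" if obj else "false"])
    elif obj is None:
        ids.append(STRUCTURAL["null"])
    else:  # strings and numbers fall through to the byte tier
        ids.extend(BYTE_BASE + b for b in json.dumps(obj).encode())
    return ids

tokens = encode({"name": "Oslo", "lat": 59.91})
```

Because every known key costs exactly one token regardless of its length, repeating the same schema across many objects is where the reported 5-15% savings accumulates.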
## This Model

This pretrained tokenizer was trained on structured JSON datasets:
- GeoJSON city features (geographic data)
- Observability telemetry logs (monitoring data)
- Kubernetes manifests (infrastructure config)
- Structured API outputs
- Synthetic training corpus (700 objects)

**Total training objects:** 72,991

**Vocabulary:** 4,251 tokens (16 structural + 16 reserved + 125 keys + 4,096 BPE)
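The breakdown above pins down the key tier's position in the ID space. A quick sketch of the arithmetic, assuming each tier is contiguous as the "IDs 32-N" notation suggests:

```python
N_STRUCTURAL = 16  # IDs 0-15
N_RESERVED = 16    # IDs 16-31
N_KEYS = 125       # learned keys, IDs 32..N

key_start = N_STRUCTURAL + N_RESERVED  # first key ID
key_end = key_start + N_KEYS - 1       # N, the last key ID
bpe_start = key_end + 1                # BPE subwords begin at N+1
print(key_start, key_end, bpe_start)   # 32 156 157
```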
## Usage
## Citation

```bibtex
@software{maio2026jsontokenizer,
  author = {Maio, Anthony},
  title = {Structure-Aware Tokenization for {JSON}: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary},
  year = {2026},
  url = {https://github.com/anthony-maio/json-tokenizer},
  doi = {10.5281/zenodo.18879110},
  version = {0.2.0},
  license = {MIT}
}
```
json_tokenizer_vocab.json (changed): diff too large to render; see raw diff.