anthonym21 committed
Commit fe95c99 · verified · 1 Parent(s): 4b253be

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +127 -0
  2. json_tokenizer_vocab.json +0 -0
  3. tokenizer_config.json +43 -0
README.md ADDED
@@ -0,0 +1,127 @@
---
library_name: transformers
license: mit
language:
- en
tags:
- tokenizer
- json
- bpe
- structured-data
- llm
pipeline_tag: text-generation
---

# json-tokenizer: Structure-Aware Tokenization for JSON

A structure-aware tokenizer that assigns dedicated single tokens to JSON grammar elements, learns a compact key vocabulary from training data, and applies byte-pair encoding (BPE) only to value content.

**Paper:** [Structure-Aware Tokenization for JSON: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary](https://doi.org/10.5281/zenodo.XXXXXXX)

**Code:** [github.com/anthony-maio/json-tokenizer](https://github.com/anthony-maio/json-tokenizer)

## Key Results

| Metric | Value |
|--------|-------|
| Token savings vs cl100k_base | **5-15%** on schema-repetitive JSON |
| Vocabulary size | **4,190 tokens** (vs 100,256 for cl100k_base) |
| Vocab reduction | **~90x smaller** |
| Roundtrip fidelity | **100% lossless** across 4,200+ test objects |
| Crossover point | Beats cl100k_base at just **558 tokens** |

## Architecture

Three-tier vocabulary (a toy sketch of the idea follows the list):
1. **Structural tokens** (IDs 0-15): `{`, `}`, `[`, `]`, `:`, `,`, `true`, `false`, `null`, type markers
2. **Key vocabulary** (IDs 32-N): Learned single-token keys from training data (65 keys)
3. **BPE subwords** (IDs N+1 to N+B): Byte-pair encoding trained on JSON value strings (4,096 tokens)

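For intuition, here is a minimal toy sketch of the tiered scheme. This is our illustration, not the library's actual code; the IDs, key fallback, and BPE stage are all simplified stand-ins.

```python
# Toy three-tier encoder (NOT the library's implementation). It shows how
# grammar and known keys collapse to single tokens while only value
# content reaches the subword stage.
import json

STRUCTURAL = {"{": 0, "}": 1, "[": 2, "]": 3, ":": 4, ",": 5,
              "true": 6, "false": 7, "null": 8}        # tier 1: grammar
KEY_VOCAB = {"name": 32, "age": 33, "active": 34}      # tier 2: learned keys

def bpe_encode(text: str) -> list[int]:
    """Stand-in for the BPE tier: one pseudo-ID per UTF-8 byte."""
    return [4096 + b for b in text.encode("utf-8")]

def encode(obj) -> list[int]:
    ids = []
    if isinstance(obj, dict):
        ids.append(STRUCTURAL["{"])
        for i, (key, value) in enumerate(obj.items()):
            if i:
                ids.append(STRUCTURAL[","])
            # The toy assumes every key is in the vocab; the real tokenizer
            # falls back to subword encoding for unseen keys.
            ids.append(KEY_VOCAB[key])
            ids.append(STRUCTURAL[":"])
            ids.extend(encode(value))
        ids.append(STRUCTURAL["}"])
    elif isinstance(obj, bool):
        ids.append(STRUCTURAL[str(obj).lower()])
    elif obj is None:
        ids.append(STRUCTURAL["null"])
    else:
        ids.extend(bpe_encode(json.dumps(obj)))        # tier 3: values
    return ids

# One token each for {, "name", :, and so on; only "Alice" and 30 hit BPE.
print(encode({"name": "Alice", "age": 30, "active": True}))
```
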
## This Model

This pretrained tokenizer was trained on four structured JSON datasets:
- GeoJSON city features (geographic data)
- Observability telemetry logs (monitoring data)
- Kubernetes manifests (infrastructure config)
- Structured API outputs

**Total training objects:** 10,530
**Vocabulary:** 4,190 tokens (16 structural + 16 reserved + 65 keys + 4,096 BPE)

## Usage

### With HuggingFace Transformers

```python
# Requires: pip install json-tokenizer[huggingface]
from json_tokenizer.hf_compat import JSONPreTrainedTokenizer

tokenizer = JSONPreTrainedTokenizer.from_pretrained("anthonym21/json-tokenizer-structured")

# Encode JSON
output = tokenizer('{"name": "Alice", "age": 30, "active": true}')
print(output["input_ids"])

# Decode back to JSON (lossless)
decoded = tokenizer.decode(output["input_ids"])
print(decoded)  # {"name": "Alice", "age": 30, "active": true}
```

### With Core Library

```python
# Requires: pip install json-tokenizer
from json_tokenizer import JSONTokenizer

tokenizer = JSONTokenizer.load("./path/to/saved/tokenizer")

# Encode (accepts Python dicts, lists, or JSON strings)
ids = tokenizer.encode({"name": "Alice", "age": 30})

# Decode (lossless roundtrip)
json_str = tokenizer.decode(ids)
```

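To spot-check the lossless-roundtrip claim on your own objects, a comparison along these lines should work (reusing the `tokenizer` from the snippet above; the semantic equality check is our addition):

```python
import json

original = {"name": "Alice", "age": 30}
roundtripped = json.loads(tokenizer.decode(tokenizer.encode(original)))

# Compare parsed values rather than raw strings so incidental formatting
# differences (whitespace, spacing around separators) don't matter.
assert roundtripped == original
```
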
## Training Your Own

```python
from json_tokenizer import JSONTokenizer

tok = JSONTokenizer(bpe_vocab_size=4096, max_key_vocab=512)
tok.train_from_json_files(["your_data.jsonl"])
tok.save("./my_tokenizer")

# Convert to HF format
from json_tokenizer.hf_compat import JSONPreTrainedTokenizer
hf_tok = JSONPreTrainedTokenizer.from_json_tokenizer(tok)
hf_tok.save_pretrained("./my_hf_tokenizer")
```

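Since `tokenizer_config.json` declares `tokenizer_class: JSONPreTrainedTokenizer`, the converted tokenizer presumably inherits the standard `transformers` Hub workflow; if so, publishing should be the usual one-liner (the repo id below is a placeholder):

```python
# Assumes JSONPreTrainedTokenizer subclasses transformers.PreTrainedTokenizer,
# which provides push_to_hub(); the repo id is hypothetical.
hf_tok.push_to_hub("your-username/my-json-tokenizer")
```
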
## Where It Wins / Where It Loses

| Scenario | json-tokenizer | cl100k_base |
|----------|---------------|-------------|
| GeoJSON (schema-repetitive) | **+7.8% savings** | baseline |
| Telemetry logs | **+5.5% savings** | baseline |
| Batch JSON arrays | **+9.3% savings** | baseline |
| Config objects | **+12.3% savings** | baseline |
| Prose-heavy JSON (Alpaca) | -26.2% (uses more tokens) | **wins** |
| K8s manifests (deep nesting) | break-even | break-even |

**Best for:** API responses, observability logs, function calling, structured outputs
**Not for:** Prose-heavy JSON, general-purpose text

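To reproduce this comparison on your own data, a benchmark along these lines should work (the `tiktoken` usage is standard; the data path and tokenizer path are placeholders):

```python
# Compare token counts against cl100k_base on a JSONL file of objects.
# Requires: pip install tiktoken json-tokenizer
import json
import tiktoken
from json_tokenizer import JSONTokenizer

cl100k = tiktoken.get_encoding("cl100k_base")
jt = JSONTokenizer.load("./path/to/saved/tokenizer")   # placeholder path

with open("your_data.jsonl") as f:                     # placeholder data
    objects = [json.loads(line) for line in f]

jt_total = sum(len(jt.encode(obj)) for obj in objects)
cl_total = sum(len(cl100k.encode(json.dumps(obj))) for obj in objects)

print(f"json-tokenizer: {jt_total} tokens | cl100k_base: {cl_total} tokens "
      f"| savings: {1 - jt_total / cl_total:+.1%}")
```
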
## Citation

```bibtex
@article{maio2026json,
  title={Structure-Aware Tokenization for JSON: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary},
  author={Maio, Anthony},
  year={2026},
  url={https://doi.org/10.5281/zenodo.XXXXXXX}
}
```

## License

MIT
json_tokenizer_vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,43 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "15": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "backend": "custom",
  "bos_token": "<s>",
  "eos_token": "</s>",
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<pad>",
  "tokenizer_class": "JSONPreTrainedTokenizer",
  "unk_token": "<unk>"
}