mjbommar committed · verified
Commit 1be5751 · Parent(s): 9c6a895

Upload OGBERT tokenizer (vocab_size=8192)

Files changed (3):
  1. README.md +38 -14
  2. tokenizer.json +2 -7
  3. tokenizer_config.json +10 -16
README.md CHANGED
@@ -1,25 +1,49 @@
 ---
-library_name: tokenizers
-pipeline_tag: feature-extraction
 language:
 - en
-license: mit
+license: apache-2.0
+library_name: transformers
 tags:
+- tokenizer
+- bpe
 - ogbert
 - modernbert
 - opengloss
-- tokenizer
-- bpe
-- vocab:8192
-datasets:
-- mjbommar/opengloss-v1.1-dictionary
 ---
 
-# OGBERT Tokenizer (8192)
-
-Byte-level BPE tokenizer for OGBERT models. Trained on OpenGloss headwords only, with ordered specials (<|start|>, <|end|>, <|pad|>, <|unk|>, <|cls|>, <|sep|>, <|mask|>) and a final non-special space token that does not participate in merges. Suitable for ModernBERT/transformers usage.
-
-- Vocab size: 8192
-- Alphabet: 0-255 bytes + specials + trailing space token
-- Training data: OpenGloss dictionary headwords (HF dataset mjbommar/opengloss-v1.1-dictionary)
-- Notes: space token is appended to avoid merges; special tokens are in fixed order.
+# OGBERT Tokenizer (8K)
+
+An 8,192-token BPE tokenizer for [OpenGloss](https://arxiv.org/abs/2511.18622) OGBERT embedding models.
+
+## Usage
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-tokenizer-8k")
+tokens = tokenizer.encode("hello world")
+```
+
+## Details
+
+- **Vocab Size**: 8,192 (power of 2)
+- **Space Token**: ID 8191
+- **Special Tokens**: IDs 0-6 (`<|start|>`, `<|end|>`, `<|pad|>`, `<|unk|>`, `<|cls|>`, `<|sep|>`, `<|mask|>`)
+- **Training Data**: [mjbommar/opengloss-v1.1-dictionary](https://huggingface.co/datasets/mjbommar/opengloss-v1.1-dictionary)
+
+## Citation
+
+```bibtex
+@misc{bommarito2025opengloss,
+  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
+  author={Michael J. Bommarito II},
+  year={2025},
+  eprint={2511.18622},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL}
+}
+```
+
+## License
+
+Apache 2.0
tokenizer.json CHANGED
@@ -67,7 +67,7 @@
       "special": true
     },
     {
-      "id": 8192,
+      "id": 8191,
       "content": " ",
       "single_word": false,
       "lstrip": false,
@@ -8289,8 +8289,7 @@
     "kow": 8187,
     "kib": 8188,
     "knit": 8189,
-    "ll": 8190,
-    "lul": 8191
+    "ll": 8190
   },
   "merges": [
     [
@@ -40004,10 +40003,6 @@
     [
       "l",
       "l"
-    ],
-    [
-      "l",
-      "ul"
     ]
   ]
 }
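The id change in the first hunk enforces a basic invariant: with a declared vocab size of 8192, valid token ids run 0 through 8191, so once the `lul` entry (id 8191) and its `("l", "ul")` merge are dropped, the appended space token can take id 8191 instead of the out-of-range 8192. A minimal sketch of that bounds check, against a hypothetical trimmed-down tokenizer.json fragment (the real file carries the full vocab and merge list):

```python
import json

# Hypothetical fragment mirroring only the fields this diff touches;
# the real tokenizer.json has all 8192 vocab entries and every merge.
fragment = json.loads("""
{
  "added_tokens": [{"id": 8191, "content": " ", "special": false}],
  "model": {
    "vocab": {"kow": 8187, "kib": 8188, "knit": 8189, "ll": 8190},
    "merges": [["l", "l"]]
  }
}
""")

VOCAB_SIZE = 8192  # every id must fall in [0, VOCAB_SIZE)

ids = [t["id"] for t in fragment["added_tokens"]]
ids += list(fragment["model"]["vocab"].values())

# The old file placed the space token at id 8192, which fails this check.
assert all(0 <= i < VOCAB_SIZE for i in ids)
print(max(ids))  # the space token now occupies the final slot, 8191
```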
tokenizer_config.json CHANGED
@@ -1,22 +1,16 @@
 {
-  "tokenizer_class": "PreTrainedTokenizerFast",
+  "additional_special_tokens": null,
+  "backend": "tokenizers",
   "bos_token": "<|start|>",
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "<|cls|>",
   "eos_token": "<|end|>",
+  "mask_token": "<|mask|>",
+  "model_max_length": 1024,
   "pad_token": "<|pad|>",
-  "unk_token": "<|unk|>",
-  "cls_token": "<|cls|>",
   "sep_token": "<|sep|>",
-  "mask_token": "<|mask|>",
-  "model_max_length": 4096,
-  "padding_side": "right",
-  "truncation": "longest_first",
-  "special_tokens_map": {
-    "bos_token": "<|start|>",
-    "eos_token": "<|end|>",
-    "pad_token": "<|pad|>",
-    "unk_token": "<|unk|>",
-    "cls_token": "<|cls|>",
-    "sep_token": "<|sep|>",
-    "mask_token": "<|mask|>"
-  }
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "unk_token": "<|unk|>",
+  "model_type": "modernbert",
+  "vocab_size": 8192
 }
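The rewritten config flattens the redundant nested `special_tokens_map` into top-level keys and lowers `model_max_length` from 4096 to 1024. A quick sanity check on the new shape, parsed with plain `json` rather than transformers so it runs offline; the config text is inlined here for illustration:

```python
import json

# The new tokenizer_config.json from the diff above, inlined as a string.
config = json.loads("""
{
  "additional_special_tokens": null,
  "backend": "tokenizers",
  "bos_token": "<|start|>",
  "clean_up_tokenization_spaces": false,
  "cls_token": "<|cls|>",
  "eos_token": "<|end|>",
  "mask_token": "<|mask|>",
  "model_max_length": 1024,
  "pad_token": "<|pad|>",
  "sep_token": "<|sep|>",
  "tokenizer_class": "PreTrainedTokenizerFast",
  "unk_token": "<|unk|>",
  "model_type": "modernbert",
  "vocab_size": 8192
}
""")

# All seven special tokens now live at the top level; the nested map is gone.
SPECIAL_KEYS = ("bos_token", "eos_token", "pad_token", "unk_token",
                "cls_token", "sep_token", "mask_token")
assert "special_tokens_map" not in config
assert all(k in config for k in SPECIAL_KEYS)
assert config["model_max_length"] == 1024 and config["vocab_size"] == 8192
```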