Nj-1111 committed on
Commit
9abb5c9
·
verified ·
1 Parent(s): 1a33991

Update README.md

  1. README.md +492 -35
README.md CHANGED
@@ -10,63 +10,520 @@ library_name: transformers
10
 
11
  # Copernicus Tokenizer
12
 
13
- Domain-general BPE tokenizer trained from scratch on 3.96 million documents
14
- spanning natural language, code, mathematics, and scientific text.
15
 
16
- | Parameter | Value |
17
- |---|---|
18
- | Algorithm | Byte-Pair Encoding (BPE) |
19
- | Vocabulary size | 55,812 |
20
- | Merges | 55,725 |
21
- | Byte encoding | GPT-2 byte-level (256-char alphabet) |
22
- | Min frequency | 3 |
23
 
24
- ## Quick start
25
 
26
  ```python
27
  from transformers import AutoTokenizer
28
 
29
- tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")
30
 
31
- ids = tokenizer("Hello, world!")
32
- print(ids)
33
  ```
34
 
35
- ## Use in a training loop
36
 
37
  ```python
38
  from transformers import PreTrainedTokenizerFast
39
 
40
- tokenizer = PreTrainedTokenizerFast.from_pretrained("Nj-1111/Copernicus-Tokenizer")
41
 
42
  inputs = tokenizer(
43
- ["Hello world", "def foo(): pass"],
44
  truncation=True,
45
  max_length=2048,
46
  padding="max_length",
47
- return_tensors="pt",
48
  )
49
  ```
50
 
51
- ## Special tokens
52
 
53
- | Token | Role |
54
- |---|---|
55
- | `<\|endoftext\|>` | BOS / EOS |
56
- | `<\|unk\|>` | Unknown |
57
- | `<\|pad\|>` | Padding |
58
- | `<think>` / `</think>` | Chain-of-thought delimiters |
59
- | `<\|user\|>` / `<\|assistant\|>` / `<\|system\|>` | Chat roles |
60
- | `<\|im_start\|>` / `<\|im_end\|>` | ChatML-style markers |
61
- | `<\|tool_call\|>` / `<\|tool_result\|>` | Tool use |
62
 
63
- ## Training data
64
 
65
- | Domain | Source |
66
- |---|---|
67
- | Natural language | Wikipedia (multilingual), Common Crawl |
68
- | Code | The Stack |
69
- | Mathematics | MATH dataset, arXiv |
70
- | Science | PubMed, S2ORC |
71
 
72
- Training code: [github.com/Nj-1111/copernicus-tokenizer](https://github.com/Nj-1111/copernicus-tokenizer)
10
 
11
  # Copernicus Tokenizer
12
 
13
+ ## Overview
14
 
15
+ **Copernicus Tokenizer** is a domain-general Byte-Pair Encoding (BPE) tokenizer trained from scratch for large language models operating across heterogeneous reasoning domains, including:
16
 
17
+ * Natural language
18
+ * Source code
19
+ * Mathematical notation
20
+ * Scientific literature
21
+ * Symbol-heavy technical text
22
+ * Structured chat and tool-use formatting
23
+
24
+ The tokenizer was designed to prioritize:
25
+
26
+ 1. Reversible decoding integrity
27
+ 2. Structural fidelity for code
28
+ 3. Mathematical symbol preservation
29
+ 4. Vocabulary efficiency under mixed-domain corpora
30
+ 5. Robust multilingual byte-level coverage
31
+
32
+ The tokenizer uses GPT-2-style byte-level pretokenization combined with custom BPE merge training over approximately **3.96 million documents** sourced from code, scientific literature, mathematics, and natural language corpora.
33
+
34
+ ---
35
+
36
+ # Technical Specifications
37
+
38
+ | Parameter | Value |
39
+ | ----------------------- | ------------------------------- |
40
+ | Tokenizer Type | Byte-Pair Encoding (BPE) |
41
+ | Pretokenization | GPT-2 byte-level |
42
+ | Vocabulary Size | 55,812 |
43
+ | Merge Operations | 55,725 |
44
+ | Base Alphabet | 256-byte alphabet |
45
+ | Minimum Merge Frequency | 3 |
46
+ | Unknown Token | `<\|unk\|>` |
47
+ | Padding Token | `<\|pad\|>` |
48
+ | BOS/EOS Token | `<\|endoftext\|>` |
49
+ | Maximum Sequence Length | 4096 |
50
+ | Training Documents | ~3.96M |
51
+ | Intended Use | General-purpose LLM pretraining |
52
+
53
+ ---
54
+
55
+ # Design Goals
56
+
57
+ The tokenizer was explicitly optimized for mixed-domain reasoning workloads rather than purely conversational English.
58
+
59
+ Core objectives included:
60
+
61
+ * Preserving programming-language structure
62
+ * Maintaining reversible decode behavior
63
+ * Improving compression over legacy GPT-2 BPEs
64
+ * Supporting LaTeX and symbolic mathematics
65
+ * Avoiding excessive fragmentation of scientific terminology
66
+ * Supporting tool-calling and agentic prompting formats
67
+
68
+ ---
69
+
70
+ # Supported Domains
71
+
72
+ | Domain | Optimization Goal |
73
+ | ---------------- | ------------------------------------------------ |
74
+ | Natural Language | Compression efficiency + morphology preservation |
75
+ | Source Code | Syntax stability + AST-safe decoding |
76
+ | Mathematics | LaTeX atomicity + operator preservation |
77
+ | Scientific Text | Technical terminology coverage |
78
+ | Chat/Agents | Structured conversational formatting |
79
+ | Unicode Text | Full byte-level reversibility |
80
+
81
+ ---
82
+
83
+ # Special Tokens
84
+
85
+ | Token | Purpose |
86
+ | ---------------------- | --------------------------- |
87
+ | `<\|endoftext\|>` | BOS / EOS |
88
+ | `<\|unk\|>` | Unknown token |
89
+ | `<\|pad\|>` | Padding |
90
+ | `<think>` / `</think>` | Chain-of-thought delimiters |
91
+ | `<\|user\|>` | Chat role token |
92
+ | `<\|assistant\|>` | Chat role token |
93
+ | `<\|system\|>` | Chat role token |
94
+ | `<\|im_start\|>` / `<\|im_end\|>` | ChatML formatting |
95
+ | `<\|tool_call\|>` | Tool invocation |
96
+ | `<\|tool_result\|>` | Tool response |
97
+
98
+ ---
99
+
100
+ # Training Corpus
101
+
102
+ The tokenizer was trained on a heterogeneous multi-domain corpus.
103
+
104
+ | Domain | Primary Sources |
105
+ | --------------------- | ----------------------- |
106
+ | Natural Language | Wikipedia, Common Crawl |
107
+ | Source Code | The Stack |
108
+ | Mathematics | MATH dataset, arXiv |
109
+ | Scientific Literature | PubMed, S2ORC |
110
+
111
+ The corpus intentionally mixed:
112
+
113
+ * prose
114
+ * code
115
+ * formulas
116
+ * Unicode-heavy text
117
+ * markdown
118
+ * structured conversations
119
+ * technical documentation
120
+
121
+ This mixture was intended to prevent domain starvation during BPE merge allocation, where one dominant domain claims most of the merge budget and the others are left fragmented.
122
+
123
+ ---
124
+
125
+ # Installation
126
 
127
  ```python
128
  from transformers import AutoTokenizer
129
 
130
+ tokenizer = AutoTokenizer.from_pretrained(
131
+ "Nj-1111/Copernicus-Tokenizer"
132
+ )
133
+ ```
134
+
135
+ ---
136
 
137
+ # Example Usage
138
+
139
+ ```python
140
+ from transformers import AutoTokenizer
141
+
142
+ tokenizer = AutoTokenizer.from_pretrained(
143
+ "Nj-1111/Copernicus-Tokenizer"
144
+ )
145
+
146
+ text = "def factorial(n): return 1 if n <= 1 else n * factorial(n-1)"
147
+
148
+ encoded = tokenizer(text)
149
+ print(encoded["input_ids"])
150
+
151
+ decoded = tokenizer.decode(encoded["input_ids"])
152
+ print(decoded)
153
  ```
154
 
155
+ ---
156
+
157
+ # Batched Training Usage
158
 
159
  ```python
160
  from transformers import PreTrainedTokenizerFast
161
 
162
+ tokenizer = PreTrainedTokenizerFast.from_pretrained(
163
+ "Nj-1111/Copernicus-Tokenizer"
164
+ )
165
 
166
  inputs = tokenizer(
167
+ [
168
+ "Hello world",
169
+ "def foo(): pass"
170
+ ],
171
  truncation=True,
172
  max_length=2048,
173
  padding="max_length",
174
+ return_tensors="pt"
175
  )
176
  ```
177
 
178
+ ---
179
+
180
+ # Evaluation Methodology
181
+
182
+ The tokenizer was evaluated using a mixed-domain stress-testing suite designed to benchmark:
183
+
184
+ * compression efficiency
185
+ * structural preservation
186
+ * mathematical tokenization quality
187
+ * reversibility
188
+ * morphology handling
189
+ * numeric stability
190
+ * code integrity
191
+
192
+ The benchmark corpus included:
193
+
194
+ * deeply nested Python syntax
195
+ * asynchronous code
196
+ * indentation stress tests
197
+ * LaTeX equations
198
+ * Unicode mathematics
199
+ * morphologically rich English
200
+ * long decimal sequences
201
+ * hexadecimal and binary literals
202
+
203
+ Baseline comparison was performed against the GPT-2 tokenizer.
204
+
205
+ ---
206
+
207
+ # Benchmark Results
208
+
209
+ ## Core Metrics
210
+
211
+ | Metric | Copernicus | GPT-2 |
212
+ | --------------------------- | ---------- | ------ |
213
+ | Total Tokens | 12,600 | 14,920 |
214
+ | Character Compression Ratio | 2.754 | 2.326 |
215
+ | Byte Compression Ratio | 2.870 | 2.424 |
216
+ | Word Fertility | 2.601 | 2.872 |
217
+ | Entropy | 6.850 | 6.775 |
218
+ | Estimated BPT Proxy | 5.726 | 6.715 |
219
+ | Reversible Integrity | True | True |
220
+ | Unknown Tokens | 0 | 0 |
221
+
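The README does not define the metrics in this table, so the following is only a minimal sketch of how comparable figures could be computed. It assumes that the compression ratios are characters (or UTF-8 bytes) per token and that fertility is tokens per whitespace-separated word; the function name `compression_metrics` and these definitions are illustrative assumptions, not the author's benchmark code.

```python
from typing import Callable, Dict, Sequence


def compression_metrics(
    texts: Sequence[str],
    encode: Callable[[str], Sequence[int]],
) -> Dict[str, float]:
    """Corpus-level token count and ratios for one tokenizer.

    Assumed definitions (not taken from the README):
      chars_per_token = total characters / total tokens
      bytes_per_token = total UTF-8 bytes / total tokens
      fertility       = total tokens / total whitespace-separated words
    """
    total_tokens = sum(len(encode(t)) for t in texts)
    total_chars = sum(len(t) for t in texts)
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return {
        "total_tokens": total_tokens,
        "chars_per_token": total_chars / total_tokens,
        "bytes_per_token": total_bytes / total_tokens,
        "fertility": total_tokens / total_words,
    }


# Possible use with a loaded Hugging Face tokenizer (not executed here):
#   encode = lambda t: tokenizer(t, add_special_tokens=False)["input_ids"]
#   print(compression_metrics(benchmark_texts, encode))
```

Passing the tokenizer as a plain `encode` callable keeps the function independent of any particular library, so the same corpus can be scored with Copernicus and with the GPT-2 baseline.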
222
+ ---
223
+
224
+ # Interpretation of Metrics
225
+
226
+ ## Compression Efficiency
227
+
228
+ On the mixed-domain benchmark corpus, Copernicus produces about 15.5% fewer tokens than GPT-2 (12,600 vs. 14,920).
229
+
230
+ The lower fertility and higher compression ratio indicate:
231
+
232
+ * better merge efficiency
233
+ * stronger domain coverage
234
+ * reduced subword fragmentation
235
+ * improved vocabulary allocation
236
+
237
+ The benchmark corpus was intentionally difficult and included:
238
+
239
+ * source code
240
+ * LaTeX
241
+ * Unicode mathematics
242
+ * technical scientific language
243
+ * long numeric sequences
244
+
245
+ Compression ratios on standard English prose are expected to be higher than these mixed-domain figures.
246
+
247
+ ---
248
+
249
+ # Reversible Integrity
250
+
251
+ The tokenizer achieved:
252
+
253
+ ```text
254
+ decode(encode(text)) == text
255
+ ```
256
+
257
+ across the benchmark corpus.
258
+
259
+ This property is critical for:
260
+
261
+ * code generation
262
+ * compiler-safe decoding
263
+ * mathematical reconstruction
264
+ * structured prompting
265
+ * dataset integrity preservation
266
+
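The round-trip property can be checked mechanically. The sketch below is one way to do it, assuming the tokenizer is exposed as a pair of `encode` / `decode` callables; the helper name `find_roundtrip_failures` is hypothetical and not part of the repository.

```python
from typing import Callable, Iterable, List, Sequence


def find_roundtrip_failures(
    texts: Iterable[str],
    encode: Callable[[str], Sequence[int]],
    decode: Callable[[Sequence[int]], str],
) -> List[str]:
    """Return every text for which decode(encode(text)) != text."""
    return [t for t in texts if decode(encode(t)) != t]


# Possible use with a loaded Hugging Face tokenizer (not executed here):
#   encode = lambda t: tokenizer(t, add_special_tokens=False)["input_ids"]
#   decode = lambda ids: tokenizer.decode(ids, clean_up_tokenization_spaces=False)
#   assert find_roundtrip_failures(benchmark_texts, encode, decode) == []
```

Special tokens are excluded and decode-time whitespace cleanup is disabled in the usage comment because either one can change the decoded string even when the underlying byte-level encoding is lossless.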
267
+ ---
268
+
269
+ # Structural Purity Evaluation
270
+
271
+ ## Structural Purity Score
272
+
273
+ ```text
274
+ 0.887
275
+ ```
276
+
277
+ The tokenizer largely avoided catastrophic syntax merges.
278
+
279
+ Examples of acceptable structural tokens:
280
+
281
+ ```text
282
+ '=='
283
+ '<='
284
+ '='
285
+ ```
286
+
287
+ The tokenizer successfully avoided highly destructive merges such as:
288
+
289
+ ```text
290
+ foo:
291
+ (variable
292
+ ]])
293
+ ```
294
+
295
+ This indicates relatively strong syntax-boundary preservation.
296
+
297
+ ---
298
+
299
+ # AST Integrity Testing
300
+
301
+ Python code subjected to encode/decode cycles remained parseable by Python's AST parser.
302
+
303
+ Result:
304
+
305
+ ```text
306
+ AST PARSE: PASS
307
+ ```
308
+
309
+ This demonstrates:
310
+
311
+ * indentation preservation
312
+ * bracket stability
313
+ * newline consistency
314
+ * syntax-safe decoding
315
+
316
+ This property is especially important for code-language-model training.
317
+
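A check of this kind can be sketched with Python's standard `ast` module. The version below is slightly stricter than "remained parseable": it also compares the parsed trees before and after the round trip. The helper name `ast_survives_roundtrip` and the `encode` / `decode` callables are assumptions for illustration.

```python
import ast
from typing import Callable, Sequence


def ast_survives_roundtrip(
    source: str,
    encode: Callable[[str], Sequence[int]],
    decode: Callable[[Sequence[int]], str],
) -> bool:
    """True if the round-tripped source parses to the same AST as the original."""
    original_tree = ast.parse(source)
    try:
        restored_tree = ast.parse(decode(encode(source)))
    except SyntaxError:
        # Covers IndentationError and TabError, which subclass SyntaxError.
        return False
    # ast.dump omits line and column attributes by default, so two
    # structurally identical trees produce identical dumps.
    return ast.dump(original_tree) == ast.dump(restored_tree)
```

If the tokenizer is fully reversible, this check passes trivially; it is most useful as a regression test when normalization or cleanup steps are added to the decoding path.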
318
+ ---
319
+
320
+ # Mathematical Tokenization Quality
321
+
322
+ ## LaTeX Atomicity Score
323
+
324
+ ```text
325
+ 0.875
326
+ ```
327
+
328
+ The tokenizer preserved many common LaTeX operators as atomic units.
329
+
330
+ Examples:
331
+
332
+ | Symbol | Result |
333
+ | ----------- | ------ |
334
+ | `\sqrt` | Atomic |
335
+ | `\frac` | Atomic |
336
+ | `\sum` | Atomic |
337
+ | `\int` | Atomic |
338
+ | `\alpha` | Atomic |
339
+ | `\partial` | Atomic |
340
+
341
+ Rare-symbol fragmentation still occurs in some cases:
342
+
343
+ ```text
344
+ \vartheta -> ['\v', 'artheta']
345
+ ```
346
+
347
+ This indicates that the tokenizer is math-aware but not yet fully optimized for frontier symbolic reasoning workloads.
348
+
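The README does not say how the atomicity score is computed. One plausible reading, sketched below, is the fraction of LaTeX commands that the tokenizer emits as a single token. The function name `latex_atomicity`, the command list, and this definition are assumptions rather than the author's method.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Commands named in this README; the full benchmark list is not published here.
COMMANDS = [r"\sqrt", r"\frac", r"\sum", r"\int", r"\alpha", r"\partial", r"\vartheta"]


def latex_atomicity(
    commands: Sequence[str],
    tokenize: Callable[[str], List[str]],
) -> Tuple[float, Dict[str, List[str]]]:
    """Return (fraction of commands encoded as one token, per-command pieces)."""
    pieces = {cmd: tokenize(cmd) for cmd in commands}
    atomic_count = sum(1 for toks in pieces.values() if len(toks) == 1)
    return atomic_count / len(pieces), pieces


# Possible use with a loaded Hugging Face tokenizer (not executed here):
#   score, pieces = latex_atomicity(COMMANDS, tokenizer.tokenize)
```

Returning the per-command pieces alongside the score makes it easy to see which commands fragment and how.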
349
+ ---
350
+
351
+ # Morphological Evaluation
352
+
353
+ The tokenizer demonstrated strong segmentation behavior on morphologically rich vocabulary.
354
+
355
+ Examples:
356
+
357
+ | Word | Tokenization |
358
+ | ---------------------- | ------------------------------- |
359
+ | interoperability | inter + oper + ability |
360
+ | hyperparameterization | hyper + parameter + ization |
361
+ | counterrevolutionaries | counter + rev + olution + aries |
362
+
363
+ This suggests:
364
+
365
+ * good subword reuse
366
+ * semantic morpheme retention
367
+ * efficient scientific terminology handling
368
+
369
+ Some residual BPE artifacts remain:
370
+
371
+ ```text
372
+ antidisestablishmentarianism
373
+ -> ant + idis + estab + lish + ment + arian + ism
374
+ ```
375
+
376
+ This split does not follow morpheme boundaries and reflects mid-frequency merge residue.
377
+
378
+ ---
379
+
380
+ # Numeric Stability Analysis
381
+
382
+ The tokenizer currently exhibits moderate numeric consistency.
383
+
384
+ Examples:
385
+
386
+ ```text
387
+ 890.123456789
388
+ -> ['89', '0.', '123456789']
389
+ ```
390
+
391
+ ```text
392
+ 9876543210.000000000001
393
+ -> ['987', '65', '432', '10.00', '0000000001']
394
+ ```
395
+
396
+ Strengths:
397
+
398
+ * no unknown tokens
399
+ * efficient compression
400
+ * stable decimal preservation
401
+
402
+ Weaknesses:
403
+
404
+ * inconsistent digit chunking
405
+ * fragmented numerical semantics
406
+ * unstable precision grouping
407
+
408
+ Future revisions may benefit from dedicated numeric pretokenization.
409
+
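As a rough illustration of what dedicated numeric pretokenization could look like, the sketch below cuts every run of ASCII digits into fixed-size chunks before BPE sees the text. The function name `split_digit_runs` and the left-to-right grouping of three are assumptions, not a description of any planned implementation.

```python
import re
from typing import List

# Alternating runs of ASCII digits and non-digits.
_RUNS = re.compile(r"[0-9]+|[^0-9]+")


def split_digit_runs(text: str, group: int = 3) -> List[str]:
    """Split text so each digit run is cut into chunks of at most `group` digits.

    Digit runs are chunked left to right; non-digit runs pass through unchanged.
    """
    pieces: List[str] = []
    for run in _RUNS.findall(text):
        if "0" <= run[0] <= "9":
            pieces.extend(run[i:i + group] for i in range(0, len(run), group))
        else:
            pieces.append(run)
    return pieces


print(split_digit_runs("890.123456789"))
# ['890', '.', '123', '456', '789']
```

With a rule like this, the same digit string is always chunked the same way regardless of learned merges. The Hugging Face `tokenizers` library also ships a `Digits` pre-tokenizer that could serve a similar purpose, though whether it suits this tokenizer is untested here.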
410
+ ---
411
+
412
+ # Whitespace & Indentation Behavior
413
+
414
+ The tokenizer partially compresses indentation patterns.
415
+
416
+ Examples:
417
+
418
+ ```text
419
+ 4 spaces -> ['ĠĠ', 'ĠĠ']
420
+ 8 spaces -> ['ĠĠ', 'ĠĠ', 'ĠĠ', 'ĠĠ']
421
+ ```
422
+
423
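+ This behavior is functional but not yet indentation-semantic: indentation is encoded as repeated two-space tokens rather than as a single token per indentation level.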
424
+
425
+ Dedicated indentation tokens could further improve:
426
+
427
+ * code modeling
428
+ * AST consistency
429
+ * Python generation quality
430
+
431
+ ---
432
+
433
+ # Strengths
434
+
435
+ ## Major Strengths
436
+
437
+ * Strong mixed-domain compression
438
+ * Excellent reversibility
439
+ * AST-safe code preservation
440
+ * Good syntax-boundary awareness
441
+ * Strong LaTeX operator handling
442
+ * Good scientific morphology segmentation
443
+ * Unicode-safe byte-level encoding
444
+ * Zero unknown tokens during benchmark
445
+
446
+ ---
447
+
448
+ # Current Limitations
449
+
450
+ ## Areas for Improvement
451
+
452
+ * Numeric chunking consistency
453
+ * Rare mathematical symbol coverage
454
+ * Indentation-semantic tokenization
455
+ * Syntax-aware pretokenization
456
+ * Expanded theorem-level LaTeX coverage
457
+
458
+ ---
459
+
460
+ # Intended Use Cases
461
+
462
+ ## Recommended
463
 
464
+ * General-purpose LLM pretraining
465
+ * Coding assistants
466
+ * Research copilots
467
+ * Scientific language models
468
+ * Tool-using agent systems
469
+ * Mathematical text generation
470
+ * Mixed-domain instruction tuning
471
 
472
+ ## Less Ideal
473
 
474
+ * High-precision arithmetic models
475
+ * Frontier symbolic theorem provers
476
+ * Compiler-verified code synthesis
477
+ * Financial numerical reasoning systems
478
 
479
+ ---
480
+
481
+ # Research Assessment
482
+
483
+ Based on mixed-domain evaluation, Copernicus Tokenizer currently falls within:
484
+
485
+ ```text
486
+ Advanced / Early Research-Grade
487
+ ```
488
+
489
+ relative to contemporary open-source BPE tokenizers.
490
+
491
+ The tokenizer substantially outperforms legacy GPT-2 tokenization behavior on:
492
+
493
+ * compression
494
+ * morphology
495
+ * code structure
496
+ * LaTeX preservation
497
+ * Unicode robustness
498
+
499
+ while remaining fully reversible and structurally stable.
500
+
501
+ ---
502
+
503
+ # Future Work
504
+
505
+ Possible future improvements include:
506
+
507
+ * syntax-aware code pretokenization
508
+ * dedicated numeric tokenization strategies
509
+ * extended LaTeX operator vocabularies
510
+ * theorem-aware symbolic coverage
511
+ * indentation-semantic merges
512
+ * multilingual optimization
513
+ * adaptive merge allocation
514
+
515
+ ---
516
+
517
+ # Repository
518
+
519
+ Training code and tokenizer assets:
520
+
521
+ ```text
522
+ github.com/Nj-1111/copernicus-tokenizer
523
+ ```
524
+
525
+ Tokenizer repository:
526
+
527
+ ```text
528
+ huggingface.co/Nj-1111/Copernicus-Tokenizer
529
+ ```