@@ -10,63 +10,520 @@ library_name: transformers
 # Copernicus Tokenizer
-Domain-general BPE tokenizer trained from scratch on 3.96 million documents
-spanning natural language, code, mathematics, and scientific text.
-| Parameter | Value |
-|---|---|
-| Algorithm | Byte-Pair Encoding (BPE) |
-| Vocabulary size | 55,812 |
-| Merges | 55,725 |
-| Byte encoding | GPT-2 byte-level (256-char alphabet) |
-| Min frequency | 3 |
-## Quick start
 ```python
 from transformers import AutoTokenizer
-tokenizer = AutoTokenizer.from_pretrained("Nj-1111/Copernicus-Tokenizer")
-ids = tokenizer("Hello, world!")
-print(ids)
 ```
-## Use in a training loop
 ```python
 from transformers import PreTrainedTokenizerFast
-tokenizer = PreTrainedTokenizerFast.from_pretrained("Nj-1111/Copernicus-Tokenizer")
 inputs = tokenizer(
-    ["Hello world", "def foo(): pass"],
     truncation=True,
     max_length=2048,
     padding="max_length",
-    return_tensors="pt",
 )
 ```
-## Special tokens
-| Token | Role |
-|---|---|
-| `<\|endoftext\|>` | BOS / EOS |
-| `<\|unk\|>` | Unknown |
-| `<\|pad\|>` | Padding |
-| `<think>` / `</think>` | Chain-of-thought delimiters |
-| `<\|user\|>` / `<\|assistant\|>` / `<\|system\|>` | Chat roles |
-| `<\|im_start\|>` / `<\|im_end\|>` | ChatML-style markers |
-| `<\|tool_call\|>` / `<\|tool_result\|>` | Tool use |
-## Training data
-| Domain | Source |
-|---|---|
-| Natural language | Wikipedia (multilingual), Common Crawl |
-| Code | The Stack |
-| Mathematics | MATH dataset, arXiv |
-| Science | PubMed, S2ORC |
-Training code: [github.com/Nj-1111/copernicus-tokenizer](https://github.com/Nj-1111/copernicus-tokenizer)

 # Copernicus Tokenizer
+## Overview
+**Copernicus Tokenizer** is a domain-general Byte-Pair Encoding (BPE) tokenizer trained from scratch for large language models operating across heterogeneous reasoning domains, including:
+* Natural language
+* Source code
+* Mathematical notation
+* Scientific literature
+* Symbol-heavy technical text
+* Structured chat and tool-use formatting
+The tokenizer was designed to prioritize:
+1. Reversible decoding integrity
+2. Structural fidelity for code
+3. Mathematical symbol preservation
+4. Vocabulary efficiency under mixed-domain corpora
+5. Robust multilingual byte-level coverage
+The tokenizer uses GPT-2-style byte-level pretokenization combined with custom BPE merge training over approximately **3.96 million documents** sourced from code, scientific literature, mathematics, and natural language corpora.
+---
+# Technical Specifications
+| Parameter               | Value                           |
+| ----------------------- | ------------------------------- |
+| Tokenizer Type          | Byte-Pair Encoding (BPE)        |
+| Pretokenization         | GPT-2 byte-level                |
+| Vocabulary Size         | 55,812                          |
+| Merge Operations        | 55,725                          |
+| Base Alphabet           | 256-byte alphabet               |
+| Minimum Merge Frequency | 3                               |
+| Unknown Token           | `<\|unk\|>`                     |
+| Padding Token           | `<\|pad\|>`                     |
+| BOS/EOS Token           | `<\|endoftext\|>`               |
+| Maximum Sequence Length | 4096                            |
+| Training Documents      | ~3.96M                          |
+| Intended Use            | General-purpose LLM pretraining |
+---
+# Design Goals
+The tokenizer was explicitly optimized for mixed-domain reasoning workloads rather than purely conversational English.
+Core objectives included:
+* Preserving programming-language structure
+* Maintaining reversible decode behavior
+* Improving compression over legacy GPT-2 BPEs
+* Supporting LaTeX and symbolic mathematics
+* Avoiding excessive fragmentation of scientific terminology
+* Supporting tool-calling and agentic prompting formats
+---
+# Supported Domains
+| Domain           | Optimization Goal                                |
+| ---------------- | ------------------------------------------------ |
+| Natural Language | Compression efficiency + morphology preservation |
+| Source Code      | Syntax stability + AST-safe decoding             |
+| Mathematics      | LaTeX atomicity + operator preservation          |
+| Scientific Text  | Technical terminology coverage                   |
+| Chat/Agents      | Structured conversational formatting             |
+| Unicode Text     | Full byte-level reversibility                    |
+---
+# Special Tokens
+| Token                             | Purpose                     |
+| --------------------------------- | --------------------------- |
+| `<\|endoftext\|>`                 | BOS / EOS                   |
+| `<\|unk\|>`                       | Unknown token               |
+| `<\|pad\|>`                       | Padding                     |
+| `<think>` / `</think>`            | Chain-of-thought delimiters |
+| `<\|user\|>`                      | Chat role token             |
+| `<\|assistant\|>`                 | Chat role token             |
+| `<\|system\|>`                    | Chat role token             |
+| `<\|im_start\|>` / `<\|im_end\|>` | ChatML formatting           |
+| `<\|tool_call\|>`                 | Tool invocation             |
+| `<\|tool_result\|>`               | Tool response               |
+---
+# Training Corpus
+The tokenizer was trained on a heterogeneous multi-domain corpus.
+| Domain                | Primary Sources         |
+| --------------------- | ----------------------- |
+| Natural Language      | Wikipedia, Common Crawl |
+| Source Code           | The Stack               |
+| Mathematics           | MATH dataset, arXiv     |
+| Scientific Literature | PubMed, S2ORC           |
+The corpus intentionally mixed:
+* prose
+* code
+* formulas
+* Unicode-heavy text
+* markdown
+* structured conversations
+* technical documentation
+This mixture was intended to prevent domain starvation during BPE merge allocation.
+---
+# Loading the Tokenizer
 ```python
 from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained(
+    "Nj-1111/Copernicus-Tokenizer"
+)
+```
+---
+# Example Usage
+```python
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained(
+    "Nj-1111/Copernicus-Tokenizer"
+)
+text = "def factorial(n): return 1 if n <= 1 else n * factorial(n-1)"
+encoded = tokenizer(text)
+print(encoded["input_ids"])
+decoded = tokenizer.decode(encoded["input_ids"])
+print(decoded)
 ```
+---
+# Batched Training Usage
 ```python
 from transformers import PreTrainedTokenizerFast
+tokenizer = PreTrainedTokenizerFast.from_pretrained(
+    "Nj-1111/Copernicus-Tokenizer"
+)
 inputs = tokenizer(
+    [
+        "Hello world",
+        "def foo(): pass"
+    ],
     truncation=True,
     max_length=2048,
     padding="max_length",
+    return_tensors="pt"
 )
 ```
+---
+# Evaluation Methodology
+The tokenizer was evaluated using a mixed-domain stress-testing suite designed to benchmark:
+* compression efficiency
+* structural preservation
+* mathematical tokenization quality
+* reversibility
+* morphology handling
+* numeric stability
+* code integrity
+The benchmark corpus included:
+* deeply nested Python syntax
+* asynchronous code
+* indentation stress tests
+* LaTeX equations
+* Unicode mathematics
+* morphologically rich English
+* long decimal sequences
+* hexadecimal and binary literals
+Baseline comparison was performed against the GPT-2 tokenizer.
+---
+# Benchmark Results
+## Core Metrics
+| Metric                      | Copernicus | GPT-2  |
+| --------------------------- | ---------- | ------ |
+| Total Tokens                | 12,600     | 14,920 |
+| Character Compression Ratio | 2.754      | 2.326  |
+| Byte Compression Ratio      | 2.870      | 2.424  |
+| Word Fertility              | 2.601      | 2.872  |
+| Entropy                     | 6.850      | 6.775  |
+| Estimated BPT Proxy         | 5.726      | 6.715  |
+| Reversible Integrity        | True       | True   |
+| Unknown Tokens              | 0          | 0      |
+---
+# Interpretation of Metrics
+## Compression Efficiency
+Copernicus encodes the benchmark corpus in 12,600 tokens versus 14,920 for GPT-2, about 15.5% fewer, which corresponds to a character compression ratio of 2.754 versus 2.326.
+The lower fertility and higher compression ratio indicate:
+* better merge efficiency
+* stronger domain coverage
+* reduced subword fragmentation
+* improved vocabulary allocation
+The benchmark corpus was intentionally difficult and included:
+* source code
+* LaTeX
+* Unicode mathematics
+* technical scientific language
+* long numeric sequences
+Compression ratios on standard English corpora are expected to be higher than the mixed-domain figures reported above.
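The model card does not define how the compression and fertility figures were computed. The sketch below shows one common way to compute such metrics, under assumed definitions: characters per token, UTF-8 bytes per token, and tokens per whitespace-separated word. The names `compression_metrics` and `byte_encode` are hypothetical, and `byte_encode` is only a stand-in encoder so the example runs without downloading anything.

```python
from typing import Callable, Dict, List


def compression_metrics(text: str, encode: Callable[[str], List[int]]) -> Dict[str, float]:
    """Compute simple tokenizer efficiency metrics for one text.

    Assumed definitions (not confirmed by the model card):
      chars_per_token = characters / tokens
      bytes_per_token = UTF-8 bytes / tokens
      fertility       = tokens / whitespace-separated words
    """
    n_tokens = len(encode(text))
    n_words = len(text.split())
    if n_tokens == 0 or n_words == 0:
        raise ValueError("text must contain at least one word and one token")
    return {
        "tokens": n_tokens,
        "chars_per_token": len(text) / n_tokens,
        "bytes_per_token": len(text.encode("utf-8")) / n_tokens,
        "fertility": n_tokens / n_words,
    }


def byte_encode(text: str) -> List[int]:
    """Stand-in encoder: one token per UTF-8 byte (no merges applied)."""
    return list(text.encode("utf-8"))


metrics = compression_metrics("caf\u00e9 au lait", byte_encode)
print(metrics)
```

To measure the real tokenizer, the same function could be called with `lambda s: tokenizer(s)["input_ids"]` as the `encode` argument, keeping in mind that any special tokens added by the tokenizer would be counted as well.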
+---
+# Reversible Integrity
+The tokenizer achieved:
+```text
+decode(encode(text)) == text
+```
+across the benchmark corpus.
+This property is critical for:
+* code generation
+* compiler-safe decoding
+* mathematical reconstruction
+* structured prompting
+* dataset integrity preservation
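A round-trip check of this kind can be sketched as follows. This is a minimal illustration, not the evaluation harness used for the benchmark: `find_roundtrip_failures` is a hypothetical helper, and the byte-level `byte_encode` / `byte_decode` pair is a stand-in codec so the example runs offline. The deliberately lossy decoder shows what a failure looks like.

```python
from typing import Callable, Iterable, List


def find_roundtrip_failures(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> List[str]:
    """Return every text for which decode(encode(text)) != text."""
    return [t for t in texts if decode(encode(t)) != t]


def byte_encode(text: str) -> List[int]:
    """Stand-in encoder: one id per UTF-8 byte."""
    return list(text.encode("utf-8"))


def byte_decode(ids: List[int]) -> str:
    """Inverse of byte_encode."""
    return bytes(ids).decode("utf-8")


def lossy_decode(ids: List[int]) -> str:
    """Deliberately broken decoder that strips outer whitespace."""
    return byte_decode(ids).strip()


samples = [
    "def f(x):\n    return x <= 1\n",  # trailing newline matters for code
    "\\frac{a}{b}",
    "\u2211 x\u1d62",                   # Unicode mathematics
    "  padded  ",
]
failures = find_roundtrip_failures(samples, byte_encode, byte_decode)
lossy_failures = find_roundtrip_failures(samples, byte_encode, lossy_decode)
print(len(failures), len(lossy_failures))  # prints "0 2"
```

When running this against a Hugging Face tokenizer, it is worth checking whether special tokens are added during encoding and whether `decode` applies `clean_up_tokenization_spaces`, since either can make an otherwise lossless tokenizer appear to fail.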
+---
+# Structural Purity Evaluation
+## Structural Purity Score
+```text
+0.887
+```
+The tokenizer largely avoided catastrophic syntax merges.
+Examples of acceptable structural tokens:
+```text
+'=='
+'<='
+'='
+```
+The tokenizer successfully avoided highly destructive merges such as:
+```text
+foo:
+(variable
+]])
+```
+This indicates relatively strong syntax-boundary preservation.
+---
+# AST Integrity Testing
+Python code subjected to encode/decode cycles remained parseable by Python's AST parser.
+Result:
+```text
+AST PARSE: PASS
+```
+This demonstrates:
+* indentation preservation
+* bracket stability
+* newline consistency
+* syntax-safe decoding
+This property is especially important for code-language-model training.
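One way to reproduce such a test is sketched below. It is an assumption-laden illustration rather than the original test: `ast_roundtrip_ok` is a hypothetical helper, the byte-level codec is a stand-in for the real tokenizer, and the check is slightly stricter than "still parseable" because it also requires the round-tripped code to produce the same AST as the original.

```python
import ast
from typing import Callable, List


def ast_roundtrip_ok(
    source: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> bool:
    """True if the round-tripped source parses to the same AST as the original."""
    restored = decode(encode(source))
    try:
        return ast.dump(ast.parse(restored)) == ast.dump(ast.parse(source))
    except SyntaxError:
        return False


def byte_encode(text: str) -> List[int]:
    """Stand-in encoder: one id per UTF-8 byte."""
    return list(text.encode("utf-8"))


def byte_decode(ids: List[int]) -> str:
    """Inverse of byte_encode."""
    return bytes(ids).decode("utf-8")


def collapsing_decode(ids: List[int]) -> str:
    """Deliberately broken decoder that collapses all whitespace runs."""
    return " ".join(byte_decode(ids).split())


SOURCE = "def f(n):\n    if n <= 1:\n        return 1\n    return n * f(n - 1)\n"

print(ast_roundtrip_ok(SOURCE, byte_encode, byte_decode))
print(ast_roundtrip_ok(SOURCE, byte_encode, collapsing_decode))
```

The second call illustrates why whitespace fidelity matters for Python: once newlines and indentation are collapsed, the function no longer parses.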
+---
+# Mathematical Tokenization Quality
+## LaTeX Atomicity Score
+```text
+0.875
+```
+The tokenizer preserved many common LaTeX operators as atomic units.
+Examples:
+| Symbol      | Result |
+| ----------- | ------ |
+| `\sqrt`     | Atomic |
+| `\frac`     | Atomic |
+| `\sum`      | Atomic |
+| `\int`      | Atomic |
+| `\alpha`    | Atomic |
+| `\partial`  | Atomic |
+Rare-symbol fragmentation still occurs in some cases:
+```text
+\vartheta -> ['\v', 'artheta']
+```
+This indicates that the tokenizer is math-aware but not yet fully optimized for frontier symbolic reasoning workloads.
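The model card does not state how the atomicity score is defined. A plausible reading, assumed here, is the fraction of LaTeX commands that encode to exactly one token. In the sketch below, `atomicity_score`, `TOY_VOCAB`, and `toy_tokenize` are hypothetical; the toy tokenizer uses greedy longest-match over a tiny vocabulary, which is not BPE and only exists to make the example runnable.

```python
from typing import Callable, List, Sequence


def atomicity_score(commands: Sequence[str], tokenize: Callable[[str], List[str]]) -> float:
    """Fraction of commands that map to exactly one token (assumed definition)."""
    if not commands:
        raise ValueError("commands must be non-empty")
    atomic = sum(1 for c in commands if len(tokenize(c)) == 1)
    return atomic / len(commands)


# Toy vocabulary chosen so that \vartheta splits the way the example above shows.
TOY_VOCAB = {"\\sqrt", "\\frac", "\\sum", "\\v", "artheta"}


def toy_tokenize(text: str) -> List[str]:
    """Greedy longest-match over TOY_VOCAB, falling back to single characters."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                break
        else:
            j = i + 1  # no vocabulary entry matched: emit one character
        tokens.append(text[i:j])
        i = j
    return tokens


commands = ["\\sqrt", "\\frac", "\\sum", "\\vartheta"]
score = atomicity_score(commands, toy_tokenize)
print(toy_tokenize("\\vartheta"), score)
```

With the real tokenizer, `tokenizer.tokenize` could be passed as the `tokenize` argument; the list of commands to test is a choice the evaluator has to make.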
+---
+# Morphological Evaluation
+The tokenizer demonstrated strong segmentation behavior on morphologically rich vocabulary.
+Examples:
+| Word                   | Tokenization                    |
+| ---------------------- | ------------------------------- |
+| interoperability       | inter + oper + ability          |
+| hyperparameterization  | hyper + parameter + ization     |
+| counterrevolutionaries | counter + rev + olution + aries |
+This suggests:
+* good subword reuse
+* semantic morpheme retention
+* efficient scientific terminology handling
+Some residual BPE artifacts remain:
+```text
+antidisestablishmentarianism
+-> ant + idis + estab + lish + ment + arian + ism
+```
+indicating mid-frequency merge residue.
+---
+# Numeric Stability Analysis
+The tokenizer currently exhibits moderate numeric consistency.
+Examples:
+```text
+890.123456789
+-> ['89', '0.', '123456789']
+```
+```text
+9876543210.000000000001
+-> ['987', '65', '432', '10.00', '0000000001']
+```
+Strengths:
+* no unknown tokens
+* efficient compression
+* stable decimal preservation
+Weaknesses:
+* inconsistent digit chunking
+* fragmented numerical semantics
+* unstable precision grouping
+Future revisions may benefit from dedicated numeric pretokenization.
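As an illustration of what numeric pretokenization could look like, the sketch below splits every digit run into fixed-size groups before BPE would see it. This is one possible strategy, not a planned design for this tokenizer; `chunk_digits` and its `group` parameter are hypothetical.

```python
import re
from typing import List

_DIGIT_RUN = re.compile(r"[0-9]+")


def chunk_digits(text: str, group: int = 3) -> List[str]:
    """Split text so each ASCII digit run becomes left-to-right groups of at most `group` digits.

    Non-digit spans are passed through unchanged, so joining the pieces
    always reproduces the input.
    """
    if group < 1:
        raise ValueError("group must be >= 1")
    pieces: List[str] = []
    pos = 0
    for match in _DIGIT_RUN.finditer(text):
        if match.start() > pos:
            pieces.append(text[pos:match.start()])
        run = match.group()
        pieces.extend(run[k:k + group] for k in range(0, len(run), group))
        pos = match.end()
    if pos < len(text):
        pieces.append(text[pos:])
    return pieces


print(chunk_digits("890.123456789"))
print(chunk_digits("9876543210.000000000001"))
```

Grouping from the left keeps the rule simple and deterministic; grouping from the right, or splitting into single digits, are alternative designs that trade sequence length for alignment with place value.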
+---
+# Whitespace & Indentation Behavior
+The tokenizer partially compresses indentation patterns.
+Examples:
+```text
+4 spaces -> ['ĠĠ', 'ĠĠ']
+8 spaces -> ['ĠĠ', 'ĠĠ', 'ĠĠ', 'ĠĠ']
+```
+This behavior is functional but not yet indentation-semantic.
+Dedicated indentation tokens could further improve:
+* code modeling
+* AST consistency
+* Python generation quality
+---
+# Strengths
+## Major Strengths
+* Strong mixed-domain compression
+* Excellent reversibility
+* AST-safe code preservation
+* Good syntax-boundary awareness
+* Strong LaTeX operator handling
+* Good scientific morphology segmentation
+* Unicode-safe byte-level encoding
+* Zero unknown tokens during benchmark
+---
+# Current Limitations
+## Areas for Improvement
+* Numeric chunking consistency
+* Rare mathematical symbol coverage
+* Indentation-semantic tokenization
+* Syntax-aware pretokenization
+* Expanded theorem-level LaTeX coverage
+---
+# Intended Use Cases
+## Recommended
+* General-purpose LLM pretraining
+* Coding assistants
+* Research copilots
+* Scientific language models
+* Tool-using agent systems
+* Mathematical text generation
+* Mixed-domain instruction tuning
+## Less Ideal
+* High-precision arithmetic models
+* Frontier symbolic theorem provers
+* Compiler-verified code synthesis
+* Financial numerical reasoning systems
+---
+# Research Assessment
+Based on mixed-domain evaluation, Copernicus Tokenizer currently falls within:
+```text
+Advanced / Early Research-Grade
+```
+relative to contemporary open-source BPE tokenizers.
+The tokenizer substantially outperforms legacy GPT-2 tokenization behavior on:
+* compression
+* morphology
+* code structure
+* LaTeX preservation
+* Unicode robustness
+while remaining fully reversible and structurally stable.
+---
+# Future Work
+Planned future improvements may include:
+* syntax-aware code pretokenization
+* dedicated numeric tokenization strategies
+* extended LaTeX operator vocabularies
+* theorem-aware symbolic coverage
+* indentation-semantic merges
+* multilingual optimization
+* adaptive merge allocation
+---
+# Repository
+Training code and tokenizer assets:
+```text
+github.com/Nj-1111/copernicus-tokenizer
+```
+Tokenizer repository:
+```text
+huggingface.co/Nj-1111/Copernicus-Tokenizer
+```