File size: 18,666 Bytes
edede4c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
# 🌟 MathTok: Canonicalized AST-Based Mathematical Tokenizer Codebase Review

An in-depth structural and architectural analysis of the **MathTok** pipeline located at `c:\Users\surwe\Project\math_token`. This document serves as a comprehensive system review, detailing the mathematical foundations, the 7-layer pipeline design, system components, evaluation metrics, empirical results, and downstream application patterns of MathTok.

---

## πŸ“– Executive Summary

Standard natural language tokenizers (like Byte-Pair Encoding or SentencePiece) treat mathematical expressions as plain text sequences. This results in **structural fragmentation** (e.g., splitting a variable `VAR_THETA` or operator `OP_ADD` into arbitrary character chunks) and **semantic blindness** (failing to recognize algebraic equivalences like $x + 2 \equiv 2 + x$). 

**MathTok** solves this by introducing a **hybrid, structure-aware tokenization framework** for mathematical language modeling. By constructing an Abstract Syntax Tree (AST) from mathematical expressions, normalizing algebraic equivalences via symbolic mathematics (SymPy), and serializing the tree using Depth-First Search (DFS) preorder traversal, MathTok preserves full mathematical syntax and hierarchy. 

Additionally, MathTok automatically emits **structural attention metadata** for every token position, enabling downstream transformer models to implement tree-based or graph-structured attention patterns without architectural modifications.

```mermaid
graph TD
    A[Raw Input: Mixed Text + Math] --> B[Layer 2: Hybrid Lexer]
    B -->|TEXT Spans| C[Layer 7: BPE Text Sub-Vocab]
    B -->|MATH Spans| D[Layer 1: Canonicalizer Engine]
    D -->|SymPy Expression| E[Layer 3: AST Generator]
    E -->|Typed AST Tree| F[Layer 4: Semantic Operator Registry]
    F -->|Enriched Nodes| G[Layer 5: Structural Serializer]
    G -->|DFS Preorder Stream| H[Layer 6: Attention Metadata Gen]
    H -->|Attention Masks & Hints| I[Final Merged Token Stream]
    C --> I
```

---

## πŸ› οΈ The 7-Layer Processing Pipeline

MathTok's core engine is structured into seven distinct modular layers. Every component resides in the [`mathtok/`](file:///c:/Users/surwe/Project/math_token/mathtok) package.

### Layer 1: Canonicalizer Engine
* **Location**: [`canonicalizer.py`](file:///c:/Users/surwe/Project/math_token/mathtok/canonicalizer.py)
* **Role**: Algebraic normalisation and format conversion (LaTeX $\to$ ASCII $\to$ SymPy).
* **Implementation Details**:
  * **Heuristic Format Detection**: Inspects the input for LaTeX syntax (e.g., `\frac`, `\sqrt`, `\sin`, `{`, math delimiters like `$` or `\(`).
  * **Parsing**: Utilizes `sympy.parsing.latex.parse_latex` (with ANTLR4) for LaTeX, falling back to `sympy.parsing.sympy_parser.parse_expr` with standard and implicit multiplication transformations for ASCII.
  * **Normalisation**: Leverages SymPy's symbolic engine to `expand()` products over sums and `simplify()` algebraic expressions. It normalizes operations internally (e.g., transforming subtractions $a - b$ into additions of products $\text{Add}(a, \text{Mul}(-1, b))$, and divisions $a / b$ into multiplications of powers $\text{Mul}(a, \text{Pow}(b, -1))$).
  * **Robustness & Performance**: Employs an LRU cache (default: 512 entries) to prevent redundant parsing and wraps expensive SymPy calls in a `ThreadPoolExecutor` with configurable parsing timeouts (default: 5.0 seconds) to prevent infinite loops on malicious, highly-complex inputs.

### Layer 2: Hybrid Mathematical Lexer
* **Location**: [`lexer.py`](file:///c:/Users/surwe/Project/math_token/mathtok/lexer.py)
* **Role**: Alternating segment segmentation (TEXT spans vs. MATH spans).
* **Implementation Details**:
  * **Stage 1 (Unambiguous Delimiters)**: Extracts LaTeX math environments (double dollar `$$...$$`, inline dollar `$...\$`, bracket `\[...\]`, or parenthesis `\(...\)`).
  * **Stage 2 (ASCII Heuristics)**: Parses remaining text regions using pre-compiled regular expressions matching mathematical patterns (e.g., function calls `sin(...)`, exponents `x^2`, arithmetic boundaries `2*x+1`, relational equations `a+b=c`, and spelled-out Greek variables).
  * **Region Expansion**: Expands detected math seeds backwards to include leading unary operators and digits, and forwards to match balanced braces/parentheses and continuous math characters. Adjacent spans of identical types are merged.

### Layer 3: AST Generator
* **Location**: [`ast_generator.py`](file:///c:/Users/surwe/Project/math_token/mathtok/ast_generator.py)
* **Role**: SymPy AST conversion to typed, abstract vocabulary trees.
* **Implementation Details**:
  * Walks the SymPy internal expression tree recursively.
  * Maps generic SymPy types into the vocabulary of MathTok:
    * **Variables**: Standard letters map to `VAR_X`, `VAR_Y`, etc. Spelled-out Greek names map to `VAR_THETA`, `VAR_LAMBDA`, etc.
    * **Constants**: Values between $-10$ and $100$ receive dedicated tokens (e.g., `CONST_3`, `CONST_12`), large integers map to placeholders (e.g., `NUM_145`), floats map to string-encoded float tokens (e.g., `FLOAT_3p14`), and special constants map to `CONST_PI`, `CONST_E`, `CONST_I`, and `CONST_INF`.
    * **Unary Operations**: Converts negative numbers or multiplication by $-1$ to explicit `OP_NEG` nodes, and division inverses to `OP_RECIP` nodes.
    * **Fractions**: Converts `Rational(p, q)` into explicit binary `FRAC(numerator, denominator)` nodes.
  * **Recursion Guard**: Enforces `max_depth` limits (default: 20) to truncate overly-nested expressions, replacing them with a special `SUBTREE_TRUNCATED` node to avoid Python stack overflows.

### Layer 4: Semantic Operator Registry
* **Location**: [`operator_registry.py`](file:///c:/Users/surwe/Project/math_token/mathtok/operator_registry.py)
* **Role**: Rich metadata storage and categorisation for mathematical operators.
* **Implementation Details**:
  * Maintains an immutable registry of `OperatorMeta` instances mapping token strings to mathematical properties:
    * **Properties**: `arity` ($-1$ for variadic, or fixed integers like 1 or 2), `precedence`, `associativity` (left, right, or none), `semantic_role` (e.g., `aggregation` for addition, `periodic_oscillation` for sine), `latex_repr`, `ascii_repr`, `category`, and `is_commutative`.
    * **Domains**: Spans multiple mathematical branches: Arithmetic, Relational, Calculus, Trigonometry, Exponential/Logarithmic, Logic, Set Theory, Geometry, and Statistics.
    * **Inverses**: Declares explicit mathematical inverses in `INVERSE_PAIRS` (e.g., `FUNC_SIN` $\leftrightarrow$ `FUNC_ASIN`, `FUNC_EXP` $\leftrightarrow$ `FUNC_LOG`).

### Layer 5: Structural Serializer
* **Location**: [`serializer.py`](file:///c:/Users/surwe/Project/math_token/mathtok/serializer.py)
* **Role**: Flattening the 2D tree structure into a 1-D stream using DFS preorder traversal.
* **Implementation Details**:
  * Emits nodes starting from the root down to the leaves, producing a flat sequence of `SerializedToken` objects carrying: `depth`, `node_id`, `parent_id`, `child_index`, `num_children`, `is_leaf`, and `subtree_size`.
  * **Scope Delineation**: Emits `[SCOPE_OPEN]` and `[SCOPE_CLOSE]` boundary tokens to explicitly group parameters for functions (e.g., `FUNC_SIN [SCOPE_OPEN] VAR_X [SCOPE_CLOSE]`).
  * **Subtree Deduplication**: Integrates MD5 structural hashing (`dedup_subtrees`) to replace duplicated structures (e.g., repeating sub-formulas) with a pointer reference (e.g., `SUBTREE_REF_ae34df51`), improving sequence compression.

### Layer 6: Structural Attention Metadata Generator
* **Location**: [`metadata.py`](file:///c:/Users/surwe/Project/math_token/mathtok/metadata.py)
* **Role**: Calculating positional contexts and binary attention mask matrices.
* **Implementation Details**:
  * Classifies tokens into categories: `operator`, `function`, `variable`, `constant`, `structural`, `boundary`, or `text`.
  * Generates a dot-separated positional hierarchy string for each node in `tree_position_key` (e.g., `0.1.2` denotes root $\to$ 2nd child $\to$ 3rd child), which is useful for hierarchical positional encodings.
  * **Attention Mask Matrix Synthesis**: Dynamically compiles four $N \times N$ binary attention mask matrices:
    * `parent_mask`: Direct dependency attention.
    * `children_mask`: Inverse dependency attention.
    * `sibling_mask`: Horizontal syntactic context attention.
    * `subtree_mask`: Complete structural scope attention.

### Layer 7: Vocabulary & BPE Compression
* **Location**: [`vocabulary.py`](file:///c:/Users/surwe/Project/math_token/mathtok/vocabulary.py)
* **Role**: Merging deterministic structural math vocabularies with Byte-Pair Encoding (BPE) text sub-vocabularies.
* **Implementation Details**:
  * **Two-Tier Architecture**:
    * **Tier 1 (Fixed Math Vocabulary)**: Reservoirs of deterministic, immutable IDs for standard operators, Greek/Latin variables, constants, boundaries, and placeholders. BPE is completely bypassed for math terms.
    * **Tier 2 (BPE Text Vocabulary)**: Natural language regions are processed via HuggingFace's `tokenizers` library, trained on corpus-specific text spans.
  * **HuggingFace Wrapper**: Under the hood, `MathTokHFTokenizer` acts as a drop-in subclass wrapper for `PreTrainedTokenizer`, enabling immediate integration into standard pipelines such as `transformers.Trainer`, `datasets.map`, and PyTorch collators.

---

## πŸ”„ Verification & Streaming Sub-systems

Beyond the core layers, MathTok implements crucial sub-systems to guarantee mathematical correctess and scale.

### Round-Trip Validation
* **Location**: [`validator.py`](file:///c:/Users/surwe/Project/math_token/mathtok/validator.py)
* **Role**: Guaranteeing zero semantic information loss during tokenization.
* **Implementation Details**:
  * Uses the emitted `TokenMetadata` sequence to mathematically reconstruct the original SymPy expression.
  * Rebuilds leaf nodes based on their category (constants, variables, truncations) and moves upwards to reconstruct complex nodes (`FRAC`, operators, custom functions).
  * Performs formal validation by checking if the algebraic difference between the original and reconstructed expressions simplifies to zero (`sp.simplify(original - reconstructed) == 0`).

### Streaming Pipeline
* **Location**: [`streaming.py`](file:///c:/Users/surwe/Project/math_token/mathtok/streaming.py)
* **Role**: Corpus-scale processing of large datasets without exhausting system memory.
* **Implementation Details**:
  * Wraps `MathTokPipeline` inside a lazy Python generator (`yield`).
  * Supports encoding custom iterators and streams line-delimited files sequentially, ensuring constant memory ($O(1)$ RAM) overhead during dataset processing.

---

## πŸ“ˆ Evaluation Suite & Benchmark Metrics

The [`evaluation/`](file:///c:/Users/surwe/Project/math_token/evaluation) package defines five core evaluation metrics (residing in [`metrics.py`](file:///c:/Users/surwe/Project/math_token/evaluation/metrics.py)) to assess tokenizer quality, benchmarked in [`comparison.py`](file:///c:/Users/surwe/Project/math_token/evaluation/comparison.py).

### Core Metrics

| Metric | Symbol | Definition & Formula | Mathematical Value |
| :--- | :---: | :--- | :--- |
| **Structural Compression Ratio** | **SCR** | $\text{mean}\left(\frac{\text{Structural Score}}{\text{Token Count}}\right)$ | Quantifies structural information density. Higher is better (more structure packed into fewer tokens). |
| **Canonical Consistency Score** | **CCS** | $\text{mean}\left( \text{Jaccard}(S_A, S_B) \right)$ over equivalent pairs | Evaluates algebraic invariance. A score of $1.0$ represents perfect semantic convergence. |
| **Operator Preservation Score** | **OPS** | $\%$ of expressions containing all expected operators | Measures robustness; ensures mathematical operations are never lost or corrupted. |
| **Token Stability** | **TS** | $1 - \text{Coefficient of Variation}(\text{length})$ | Assesses syntactic variance stability under re-writings. Higher is more stable. |
| **Tree Depth Fidelity** | **TDF** | $1 - \text{mean}\left( \frac{\vert d_{\text{actual}} - d_{\text{ground}} \vert}{d_{\text{ground}}} \right)$ | Measures max metadata depth accuracy against the ground truth SymPy height. |

> [!NOTE]
> **Semantic Compression Ratio (SCR)** is evaluated at three hierarchical levels in `comparison.py`:
> * **Level 1 β€” Structural Score to Token Ratio**: `structural_score / token_count`
> * **Level 2 β€” Semantic Density**: `math_tokens / total_tokens`
> * **Level 3 β€” Structural Efficiency**: `parent_child_relations / token_count`

---

## πŸ”¬ Empirical Benchmark Results

Empirical comparisons of MathTok against a standard subword tokenizer (GPT-2 BPE), a custom-trained SentencePiece (unigram) tokenizer, and character-level baselines over 70 complex test expressions across multiple disciplines reveal substantial improvements.

### 1. 3-Level Semantic Comparison (Aggregated)

Across the entire evaluation suite, the aggregated results illustrate MathTok's efficiency:

| Metric | MathTok | GPT-2 | SentencePiece | Character-Level |
| :--- | :---: | :---: | :---: | :---: |
| **Level 1 β€” SCR** (struct_score / tokens) | **0.9161** | 0.4251 | 0.3696 | 0.4005 |
| **Level 2 β€” Semantic Density** (math / total) | **0.5633** | 0.1838 | 0.1499 | β€” |
| **Level 3 β€” Structural Efficiency** (relations / tokens) | **0.2492** | *N/A* | *N/A* | β€” |
| **SCR Improvement Factor** (MathTok vs. Baseline) | **β€”** | **2.16x** | **2.48x** | **2.29x** |

### 2. Canonical Convergence & Consistency (Jaccard Overlap)

For mathematically equivalent pairs, MathTok achieves perfect Jaccard alignment (Jaccard = 1.0), whereas standard text-based tokenizers suffer significant fragmentation:

| Expression Pair | MathTok Jaccard | GPT-2 Jaccard | SentencePiece Jaccard | Convergence Status |
| :--- | :---: | :---: | :---: | :---: |
| `x + 2` vs. `2 + x` | **1.000** | 0.200 | 1.000 | **CONVERGED (100%)** |
| `a*b + a*c` vs. `a*(b+c)` | **1.000** | 0.444 | 0.625 | **CONVERGED (100%)** |
| `(x+1)^2` vs. `x^2+2x+1` | **1.000** | 0.273 | 0.222 | **CONVERGED (100%)** |
| `x^2 - y^2` vs. `(x+y)*(x-y)` | **1.000** | 0.091 | 0.300 | **CONVERGED (100%)** |
| `sin(x)^2 + cos(x)^2` vs. `1` | **1.000** | 0.000 | 0.000 | **CONVERGED (100%)** |
| `2*x + 2*y` vs. `2*(x+y)` | **1.000** | 0.444 | 0.571 | **CONVERGED (100%)** |
| `x*y + x*z` vs. `x*(y+z)` | **1.000** | 0.444 | 0.625 | **CONVERGED (100%)** |
| `a^2 + 2*a*b + b^2` vs. `(a+b)^2` | **1.000** | 0.364 | 0.455 | **CONVERGED (100%)** |

### 3. LaTeX vs. ASCII Format Invariance

MathTok perfectly converges inputs in differing representations to identical structural sequences, while subword tokenizers have severe variance:

| ASCII Expression | LaTeX Expression | MathTok same? | MT tokens A/L | GPT-2 tokens A/L | SP tokens A/L |
| :--- | :--- | :---: | :---: | :---: | :---: |
| `sin(x^2)` | `\sin(x^2)` | **YES (1.00)** | **8 / 8** | 6 / 7 | 6 / 6 |
| `sqrt(x^2 + 1)` | `\sqrt{x^2 + 1}` | **YES (1.00)** | **11 / 11** | 9 / 10 | 9 / 9 |
| `log(x)` | `\ln(x)` | **YES (1.00)** | **6 / 6** | 4 / 5 | 6 / 6 |
| `exp(x)` | `e^x` | **YES (1.00)** | **6 / 6** | 4 / 3 | 6 / 3 |
| `x/y` | `\frac{x}{y}` | **YES (1.00)** | **6 / 6** | 3 / 7 | 3 / 9 |
| `int(x^2, x)` | `\int x^2 dx` | **NO (~/fallback)** | **1 / 10** | 8 / 6 | 8 / 7 |
| `diff(sin(x), x)` | `\frac{d}{dx}\sin(x)` | **YES (1.00)** | **6 / 6** | 8 / 11 | 14 / 16 |
| `factorial(n)` | `n!` | **YES (1.00)** | **6 / 6** | 5 / 2 | 11 / 3 |

---

## πŸš€ Custom Attention Integration Patterns

The core value of MathTok for downstream machine learning practitioners is the **Layer 6 Attention Hints**. By translating tree relationships into standard masking shapes, model creators can train structure-aware networks natively.

Below are three attention mask designs that can be constructed directly from the outputs of `to_attention_mask_hints()`:

### 1. Parent-Child Hierarchical Mask
Encourages top-down syntactic attention. Nodes are only allowed to attend to their direct parent or child node.

```
       [+ (root)]             Parent Attention Mask Matrix:
        /      \              
     [x]       [3]            [ ] [+ (root)] [x] [3]
      |                       [+ (root)]   1    1   1
    [sin]                     [x]          1    1   0
                              [3]          1    0   1
```

### 2. Sibling Horizontal Mask
Focuses horizontal attention across operands of identical scopes (e.g., connecting operands inside an addition sequence, $a$ and $b$ and $c$, without parent noise).

### 3. Subtree Scope Mask
A highly effective block mask for mathematical reasoning. Restricts attention strictly within a subtree, isolating independent sub-expressions during reasoning loops.

---

## 🎯 Codebase Evaluation & Recommendations

### Key Strengths
1. **Outstanding Structural Integrity**: Modularity is excellent. Clear abstraction separation (canonicalization, tokenization, serialization, and vocabulary grouping) makes codebase expansion extremely straightforward.
2. **HuggingFace Compatibility**: Subclassing/wrapping the standard tokenizer class ensures immediate, zero-friction integration with existing libraries like PyTorch and HuggingFace.
3. **Rigorous Validation**: The inclusion of `validator.py` and the round-trip checking logic demonstrates high development standards.
4. **Reliability Guards**: LRU caches, concurrency thread pools, and recursion limits make this pipeline safe for server-side deployment.

### Recommended Enhancements
* **Vocabulary Extension**: Dynamically augment `_VAR_MAP` in `ast_generator.py` to natively support multi-character variables (e.g., physics variables like $v_{\text{init}}$ or matrix names) without splitting them into generic token placeholders.
* **SymPy Parser Customisation**: SymPy's LaTeX parser can occasionally fail on non-standard, custom LaTeX macros. Adding pre-processing ASCII/LaTeX regex cleaners in `lexer.py` prior to passing them to SymPy will improve the parse success rate of dirty online forum data.
* **TDF Precision**: In case of multi-nested subtrees (e.g., highly deeply-nested fractions), customize the tree depth calculation in `metrics.py` to evaluate structural depths on custom mathematical representations rather than internal SymPy structures.

---

### Citation Reference
```bibtex
@article{mathtok2026,
  title   = {MathTok: A Hybrid Canonicalized AST-Based Tokenization Framework
             for Mathematical Language Modeling},
  author  = {Anonymous},
  year    = {2026},
  note    = {Under review}
}
```