mathtok / review.md
SurweeshSP's picture
Initial clean MathTok release
edede4c
# 🌟 MathTok: Canonicalized AST-Based Mathematical Tokenizer Codebase Review
An in-depth structural and architectural analysis of the **MathTok** pipeline located at `c:\Users\surwe\Project\math_token`. This document serves as a comprehensive system review, detailing the mathematical foundations, the 7-layer pipeline design, system components, evaluation metrics, empirical results, and downstream application patterns of MathTok.
---
## πŸ“– Executive Summary
Standard natural language tokenizers (like Byte-Pair Encoding or SentencePiece) treat mathematical expressions as plain text sequences. This results in **structural fragmentation** (e.g., splitting a variable `VAR_THETA` or operator `OP_ADD` into arbitrary character chunks) and **semantic blindness** (failing to recognize algebraic equivalences like $x + 2 \equiv 2 + x$).
**MathTok** solves this by introducing a **hybrid, structure-aware tokenization framework** for mathematical language modeling. By constructing an Abstract Syntax Tree (AST) from mathematical expressions, normalizing algebraic equivalences via symbolic mathematics (SymPy), and serializing the tree using Depth-First Search (DFS) preorder traversal, MathTok preserves full mathematical syntax and hierarchy.
Additionally, MathTok automatically emits **structural attention metadata** for every token position, enabling downstream transformer models to implement tree-based or graph-structured attention patterns without architectural modifications.
```mermaid
graph TD
A[Raw Input: Mixed Text + Math] --> B[Layer 2: Hybrid Lexer]
B -->|TEXT Spans| C[Layer 7: BPE Text Sub-Vocab]
B -->|MATH Spans| D[Layer 1: Canonicalizer Engine]
D -->|SymPy Expression| E[Layer 3: AST Generator]
E -->|Typed AST Tree| F[Layer 4: Semantic Operator Registry]
F -->|Enriched Nodes| G[Layer 5: Structural Serializer]
G -->|DFS Preorder Stream| H[Layer 6: Attention Metadata Gen]
H -->|Attention Masks & Hints| I[Final Merged Token Stream]
C --> I
```
---
## πŸ› οΈ The 7-Layer Processing Pipeline
MathTok's core engine is structured into seven distinct modular layers. Every component resides in the [`mathtok/`](file:///c:/Users/surwe/Project/math_token/mathtok) package.
### Layer 1: Canonicalizer Engine
* **Location**: [`canonicalizer.py`](file:///c:/Users/surwe/Project/math_token/mathtok/canonicalizer.py)
* **Role**: Algebraic normalisation and format conversion (LaTeX $\to$ ASCII $\to$ SymPy).
* **Implementation Details**:
* **Heuristic Format Detection**: Inspects the input for LaTeX syntax (e.g., `\frac`, `\sqrt`, `\sin`, `{`, math delimiters like `$` or `\(`).
* **Parsing**: Utilizes `sympy.parsing.latex.parse_latex` (with ANTLR4) for LaTeX, falling back to `sympy.parsing.sympy_parser.parse_expr` with standard and implicit multiplication transformations for ASCII.
* **Normalisation**: Leverages SymPy's symbolic engine to `expand()` products over sums and `simplify()` algebraic expressions. It normalizes operations internally (e.g., transforming subtractions $a - b$ into additions of products $\text{Add}(a, \text{Mul}(-1, b))$, and divisions $a / b$ into multiplications of powers $\text{Mul}(a, \text{Pow}(b, -1))$).
* **Robustness & Performance**: Employs an LRU cache (default: 512 entries) to prevent redundant parsing and wraps expensive SymPy calls in a `ThreadPoolExecutor` with configurable parsing timeouts (default: 5.0 seconds) to prevent infinite loops on malicious, highly-complex inputs.
### Layer 2: Hybrid Mathematical Lexer
* **Location**: [`lexer.py`](file:///c:/Users/surwe/Project/math_token/mathtok/lexer.py)
* **Role**: Alternating segment segmentation (TEXT spans vs. MATH spans).
* **Implementation Details**:
* **Stage 1 (Unambiguous Delimiters)**: Extracts LaTeX math environments (double dollar `$$...$$`, inline dollar `$...\$`, bracket `\[...\]`, or parenthesis `\(...\)`).
* **Stage 2 (ASCII Heuristics)**: Parses remaining text regions using pre-compiled regular expressions matching mathematical patterns (e.g., function calls `sin(...)`, exponents `x^2`, arithmetic boundaries `2*x+1`, relational equations `a+b=c`, and spelled-out Greek variables).
* **Region Expansion**: Expands detected math seeds backwards to include leading unary operators and digits, and forwards to match balanced braces/parentheses and continuous math characters. Adjacent spans of identical types are merged.
### Layer 3: AST Generator
* **Location**: [`ast_generator.py`](file:///c:/Users/surwe/Project/math_token/mathtok/ast_generator.py)
* **Role**: SymPy AST conversion to typed, abstract vocabulary trees.
* **Implementation Details**:
* Walks the SymPy internal expression tree recursively.
* Maps generic SymPy types into the vocabulary of MathTok:
* **Variables**: Standard letters map to `VAR_X`, `VAR_Y`, etc. Spelled-out Greek names map to `VAR_THETA`, `VAR_LAMBDA`, etc.
* **Constants**: Values between $-10$ and $100$ receive dedicated tokens (e.g., `CONST_3`, `CONST_12`), large integers map to placeholders (e.g., `NUM_145`), floats map to string-encoded float tokens (e.g., `FLOAT_3p14`), and special constants map to `CONST_PI`, `CONST_E`, `CONST_I`, and `CONST_INF`.
* **Unary Operations**: Converts negative numbers or multiplication by $-1$ to explicit `OP_NEG` nodes, and division inverses to `OP_RECIP` nodes.
* **Fractions**: Converts `Rational(p, q)` into explicit binary `FRAC(numerator, denominator)` nodes.
* **Recursion Guard**: Enforces `max_depth` limits (default: 20) to truncate overly-nested expressions, replacing them with a special `SUBTREE_TRUNCATED` node to avoid Python stack overflows.
### Layer 4: Semantic Operator Registry
* **Location**: [`operator_registry.py`](file:///c:/Users/surwe/Project/math_token/mathtok/operator_registry.py)
* **Role**: Rich metadata storage and categorisation for mathematical operators.
* **Implementation Details**:
* Maintains an immutable registry of `OperatorMeta` instances mapping token strings to mathematical properties:
* **Properties**: `arity` ($-1$ for variadic, or fixed integers like 1 or 2), `precedence`, `associativity` (left, right, or none), `semantic_role` (e.g., `aggregation` for addition, `periodic_oscillation` for sine), `latex_repr`, `ascii_repr`, `category`, and `is_commutative`.
* **Domains**: Spans multiple mathematical branches: Arithmetic, Relational, Calculus, Trigonometry, Exponential/Logarithmic, Logic, Set Theory, Geometry, and Statistics.
* **Inverses**: Declares explicit mathematical inverses in `INVERSE_PAIRS` (e.g., `FUNC_SIN` $\leftrightarrow$ `FUNC_ASIN`, `FUNC_EXP` $\leftrightarrow$ `FUNC_LOG`).
### Layer 5: Structural Serializer
* **Location**: [`serializer.py`](file:///c:/Users/surwe/Project/math_token/mathtok/serializer.py)
* **Role**: Flattening the 2D tree structure into a 1-D stream using DFS preorder traversal.
* **Implementation Details**:
* Emits nodes starting from the root down to the leaves, producing a flat sequence of `SerializedToken` objects carrying: `depth`, `node_id`, `parent_id`, `child_index`, `num_children`, `is_leaf`, and `subtree_size`.
* **Scope Delineation**: Emits `[SCOPE_OPEN]` and `[SCOPE_CLOSE]` boundary tokens to explicitly group parameters for functions (e.g., `FUNC_SIN [SCOPE_OPEN] VAR_X [SCOPE_CLOSE]`).
* **Subtree Deduplication**: Integrates MD5 structural hashing (`dedup_subtrees`) to replace duplicated structures (e.g., repeating sub-formulas) with a pointer reference (e.g., `SUBTREE_REF_ae34df51`), improving sequence compression.
### Layer 6: Structural Attention Metadata Generator
* **Location**: [`metadata.py`](file:///c:/Users/surwe/Project/math_token/mathtok/metadata.py)
* **Role**: Calculating positional contexts and binary attention mask matrices.
* **Implementation Details**:
* Classifies tokens into categories: `operator`, `function`, `variable`, `constant`, `structural`, `boundary`, or `text`.
* Generates a dot-separated positional hierarchy string for each node in `tree_position_key` (e.g., `0.1.2` denotes root $\to$ 2nd child $\to$ 3rd child), which is useful for hierarchical positional encodings.
* **Attention Mask Matrix Synthesis**: Dynamically compiles four $N \times N$ binary attention mask matrices:
* `parent_mask`: Direct dependency attention.
* `children_mask`: Inverse dependency attention.
* `sibling_mask`: Horizontal syntactic context attention.
* `subtree_mask`: Complete structural scope attention.
### Layer 7: Vocabulary & BPE Compression
* **Location**: [`vocabulary.py`](file:///c:/Users/surwe/Project/math_token/mathtok/vocabulary.py)
* **Role**: Merging deterministic structural math vocabularies with Byte-Pair Encoding (BPE) text sub-vocabularies.
* **Implementation Details**:
* **Two-Tier Architecture**:
* **Tier 1 (Fixed Math Vocabulary)**: Reservoirs of deterministic, immutable IDs for standard operators, Greek/Latin variables, constants, boundaries, and placeholders. BPE is completely bypassed for math terms.
* **Tier 2 (BPE Text Vocabulary)**: Natural language regions are processed via HuggingFace's `tokenizers` library, trained on corpus-specific text spans.
* **HuggingFace Wrapper**: Under the hood, `MathTokHFTokenizer` acts as a drop-in subclass wrapper for `PreTrainedTokenizer`, enabling immediate integration into standard pipelines such as `transformers.Trainer`, `datasets.map`, and PyTorch collators.
---
## πŸ”„ Verification & Streaming Sub-systems
Beyond the core layers, MathTok implements crucial sub-systems to guarantee mathematical correctess and scale.
### Round-Trip Validation
* **Location**: [`validator.py`](file:///c:/Users/surwe/Project/math_token/mathtok/validator.py)
* **Role**: Guaranteeing zero semantic information loss during tokenization.
* **Implementation Details**:
* Uses the emitted `TokenMetadata` sequence to mathematically reconstruct the original SymPy expression.
* Rebuilds leaf nodes based on their category (constants, variables, truncations) and moves upwards to reconstruct complex nodes (`FRAC`, operators, custom functions).
* Performs formal validation by checking if the algebraic difference between the original and reconstructed expressions simplifies to zero (`sp.simplify(original - reconstructed) == 0`).
### Streaming Pipeline
* **Location**: [`streaming.py`](file:///c:/Users/surwe/Project/math_token/mathtok/streaming.py)
* **Role**: Corpus-scale processing of large datasets without exhausting system memory.
* **Implementation Details**:
* Wraps `MathTokPipeline` inside a lazy Python generator (`yield`).
* Supports encoding custom iterators and streams line-delimited files sequentially, ensuring constant memory ($O(1)$ RAM) overhead during dataset processing.
---
## πŸ“ˆ Evaluation Suite & Benchmark Metrics
The [`evaluation/`](file:///c:/Users/surwe/Project/math_token/evaluation) package defines five core evaluation metrics (residing in [`metrics.py`](file:///c:/Users/surwe/Project/math_token/evaluation/metrics.py)) to assess tokenizer quality, benchmarked in [`comparison.py`](file:///c:/Users/surwe/Project/math_token/evaluation/comparison.py).
### Core Metrics
| Metric | Symbol | Definition & Formula | Mathematical Value |
| :--- | :---: | :--- | :--- |
| **Structural Compression Ratio** | **SCR** | $\text{mean}\left(\frac{\text{Structural Score}}{\text{Token Count}}\right)$ | Quantifies structural information density. Higher is better (more structure packed into fewer tokens). |
| **Canonical Consistency Score** | **CCS** | $\text{mean}\left( \text{Jaccard}(S_A, S_B) \right)$ over equivalent pairs | Evaluates algebraic invariance. A score of $1.0$ represents perfect semantic convergence. |
| **Operator Preservation Score** | **OPS** | $\%$ of expressions containing all expected operators | Measures robustness; ensures mathematical operations are never lost or corrupted. |
| **Token Stability** | **TS** | $1 - \text{Coefficient of Variation}(\text{length})$ | Assesses syntactic variance stability under re-writings. Higher is more stable. |
| **Tree Depth Fidelity** | **TDF** | $1 - \text{mean}\left( \frac{\vert d_{\text{actual}} - d_{\text{ground}} \vert}{d_{\text{ground}}} \right)$ | Measures max metadata depth accuracy against the ground truth SymPy height. |
> [!NOTE]
> **Semantic Compression Ratio (SCR)** is evaluated at three hierarchical levels in `comparison.py`:
> * **Level 1 β€” Structural Score to Token Ratio**: `structural_score / token_count`
> * **Level 2 β€” Semantic Density**: `math_tokens / total_tokens`
> * **Level 3 β€” Structural Efficiency**: `parent_child_relations / token_count`
---
## πŸ”¬ Empirical Benchmark Results
Empirical comparisons of MathTok against a standard subword tokenizer (GPT-2 BPE), a custom-trained SentencePiece (unigram) tokenizer, and character-level baselines over 70 complex test expressions across multiple disciplines reveal substantial improvements.
### 1. 3-Level Semantic Comparison (Aggregated)
Across the entire evaluation suite, the aggregated results illustrate MathTok's efficiency:
| Metric | MathTok | GPT-2 | SentencePiece | Character-Level |
| :--- | :---: | :---: | :---: | :---: |
| **Level 1 β€” SCR** (struct_score / tokens) | **0.9161** | 0.4251 | 0.3696 | 0.4005 |
| **Level 2 β€” Semantic Density** (math / total) | **0.5633** | 0.1838 | 0.1499 | β€” |
| **Level 3 β€” Structural Efficiency** (relations / tokens) | **0.2492** | *N/A* | *N/A* | β€” |
| **SCR Improvement Factor** (MathTok vs. Baseline) | **β€”** | **2.16x** | **2.48x** | **2.29x** |
### 2. Canonical Convergence & Consistency (Jaccard Overlap)
For mathematically equivalent pairs, MathTok achieves perfect Jaccard alignment (Jaccard = 1.0), whereas standard text-based tokenizers suffer significant fragmentation:
| Expression Pair | MathTok Jaccard | GPT-2 Jaccard | SentencePiece Jaccard | Convergence Status |
| :--- | :---: | :---: | :---: | :---: |
| `x + 2` vs. `2 + x` | **1.000** | 0.200 | 1.000 | **CONVERGED (100%)** |
| `a*b + a*c` vs. `a*(b+c)` | **1.000** | 0.444 | 0.625 | **CONVERGED (100%)** |
| `(x+1)^2` vs. `x^2+2x+1` | **1.000** | 0.273 | 0.222 | **CONVERGED (100%)** |
| `x^2 - y^2` vs. `(x+y)*(x-y)` | **1.000** | 0.091 | 0.300 | **CONVERGED (100%)** |
| `sin(x)^2 + cos(x)^2` vs. `1` | **1.000** | 0.000 | 0.000 | **CONVERGED (100%)** |
| `2*x + 2*y` vs. `2*(x+y)` | **1.000** | 0.444 | 0.571 | **CONVERGED (100%)** |
| `x*y + x*z` vs. `x*(y+z)` | **1.000** | 0.444 | 0.625 | **CONVERGED (100%)** |
| `a^2 + 2*a*b + b^2` vs. `(a+b)^2` | **1.000** | 0.364 | 0.455 | **CONVERGED (100%)** |
### 3. LaTeX vs. ASCII Format Invariance
MathTok perfectly converges inputs in differing representations to identical structural sequences, while subword tokenizers have severe variance:
| ASCII Expression | LaTeX Expression | MathTok same? | MT tokens A/L | GPT-2 tokens A/L | SP tokens A/L |
| :--- | :--- | :---: | :---: | :---: | :---: |
| `sin(x^2)` | `\sin(x^2)` | **YES (1.00)** | **8 / 8** | 6 / 7 | 6 / 6 |
| `sqrt(x^2 + 1)` | `\sqrt{x^2 + 1}` | **YES (1.00)** | **11 / 11** | 9 / 10 | 9 / 9 |
| `log(x)` | `\ln(x)` | **YES (1.00)** | **6 / 6** | 4 / 5 | 6 / 6 |
| `exp(x)` | `e^x` | **YES (1.00)** | **6 / 6** | 4 / 3 | 6 / 3 |
| `x/y` | `\frac{x}{y}` | **YES (1.00)** | **6 / 6** | 3 / 7 | 3 / 9 |
| `int(x^2, x)` | `\int x^2 dx` | **NO (~/fallback)** | **1 / 10** | 8 / 6 | 8 / 7 |
| `diff(sin(x), x)` | `\frac{d}{dx}\sin(x)` | **YES (1.00)** | **6 / 6** | 8 / 11 | 14 / 16 |
| `factorial(n)` | `n!` | **YES (1.00)** | **6 / 6** | 5 / 2 | 11 / 3 |
---
## πŸš€ Custom Attention Integration Patterns
The core value of MathTok for downstream machine learning practitioners is the **Layer 6 Attention Hints**. By translating tree relationships into standard masking shapes, model creators can train structure-aware networks natively.
Below are three attention mask designs that can be constructed directly from the outputs of `to_attention_mask_hints()`:
### 1. Parent-Child Hierarchical Mask
Encourages top-down syntactic attention. Nodes are only allowed to attend to their direct parent or child node.
```
[+ (root)] Parent Attention Mask Matrix:
/ \
[x] [3] [ ] [+ (root)] [x] [3]
| [+ (root)] 1 1 1
[sin] [x] 1 1 0
[3] 1 0 1
```
### 2. Sibling Horizontal Mask
Focuses horizontal attention across operands of identical scopes (e.g., connecting operands inside an addition sequence, $a$ and $b$ and $c$, without parent noise).
### 3. Subtree Scope Mask
A highly effective block mask for mathematical reasoning. Restricts attention strictly within a subtree, isolating independent sub-expressions during reasoning loops.
---
## 🎯 Codebase Evaluation & Recommendations
### Key Strengths
1. **Outstanding Structural Integrity**: Modularity is excellent. Clear abstraction separation (canonicalization, tokenization, serialization, and vocabulary grouping) makes codebase expansion extremely straightforward.
2. **HuggingFace Compatibility**: Subclassing/wrapping the standard tokenizer class ensures immediate, zero-friction integration with existing libraries like PyTorch and HuggingFace.
3. **Rigorous Validation**: The inclusion of `validator.py` and the round-trip checking logic demonstrates high development standards.
4. **Reliability Guards**: LRU caches, concurrency thread pools, and recursion limits make this pipeline safe for server-side deployment.
### Recommended Enhancements
* **Vocabulary Extension**: Dynamically augment `_VAR_MAP` in `ast_generator.py` to natively support multi-character variables (e.g., physics variables like $v_{\text{init}}$ or matrix names) without splitting them into generic token placeholders.
* **SymPy Parser Customisation**: SymPy's LaTeX parser can occasionally fail on non-standard, custom LaTeX macros. Adding pre-processing ASCII/LaTeX regex cleaners in `lexer.py` prior to passing them to SymPy will improve the parse success rate of dirty online forum data.
* **TDF Precision**: In case of multi-nested subtrees (e.g., highly deeply-nested fractions), customize the tree depth calculation in `metrics.py` to evaluate structural depths on custom mathematical representations rather than internal SymPy structures.
---
### Citation Reference
```bibtex
@article{mathtok2026,
title = {MathTok: A Hybrid Canonicalized AST-Based Tokenization Framework
for Mathematical Language Modeling},
author = {Anonymous},
year = {2026},
note = {Under review}
}
```