mathtok / review.md

Initial clean MathTok release

edede4c 4 days ago

18.7 kB

	# 🌟 MathTok: Canonicalized AST-Based Mathematical Tokenizer Codebase Review

	An in-depth structural and architectural analysis of the MathTok pipeline located at `c:\Users\surwe\Project\math_token`. This document serves as a comprehensive system review, detailing the mathematical foundations, the 7-layer pipeline design, system components, evaluation metrics, empirical results, and downstream application patterns of MathTok.

	---

	## 📖 Executive Summary

	Standard natural language tokenizers (like Byte-Pair Encoding or SentencePiece) treat mathematical expressions as plain text sequences. This results in structural fragmentation (e.g., splitting a variable `VAR_THETA` or operator `OP_ADD` into arbitrary character chunks) and semantic blindness (failing to recognize algebraic equivalences like $x + 2 \equiv 2 + x$).

	MathTok solves this by introducing a hybrid, structure-aware tokenization framework for mathematical language modeling. By constructing an Abstract Syntax Tree (AST) from mathematical expressions, normalizing algebraic equivalences via symbolic mathematics (SymPy), and serializing the tree using Depth-First Search (DFS) preorder traversal, MathTok preserves full mathematical syntax and hierarchy.

	Additionally, MathTok automatically emits structural attention metadata for every token position, enabling downstream transformer models to implement tree-based or graph-structured attention patterns without architectural modifications.

	```mermaid
	graph TD
	A[Raw Input: Mixed Text + Math] --> B[Layer 2: Hybrid Lexer]
	B -->\|TEXT Spans\| C[Layer 7: BPE Text Sub-Vocab]
	B -->\|MATH Spans\| D[Layer 1: Canonicalizer Engine]
	D -->\|SymPy Expression\| E[Layer 3: AST Generator]
	E -->\|Typed AST Tree\| F[Layer 4: Semantic Operator Registry]
	F -->\|Enriched Nodes\| G[Layer 5: Structural Serializer]
	G -->\|DFS Preorder Stream\| H[Layer 6: Attention Metadata Gen]
	H -->\|Attention Masks & Hints\| I[Final Merged Token Stream]
	C --> I
	```

	---

	## 🛠️ The 7-Layer Processing Pipeline

	MathTok's core engine is structured into seven distinct modular layers. Every component resides in the [`mathtok/`](file:///c:/Users/surwe/Project/math_token/mathtok) package.

	### Layer 1: Canonicalizer Engine
	* Location: [`canonicalizer.py`](file:///c:/Users/surwe/Project/math_token/mathtok/canonicalizer.py)
	* Role: Algebraic normalisation and format conversion (LaTeX $\to$ ASCII $\to$ SymPy).
	* Implementation Details:
	* Heuristic Format Detection: Inspects the input for LaTeX syntax (e.g., `\frac`, `\sqrt`, `\sin`, `{`, math delimiters like `$` or `\(`).
	* Parsing: Utilizes `sympy.parsing.latex.parse_latex` (with ANTLR4) for LaTeX, falling back to `sympy.parsing.sympy_parser.parse_expr` with standard and implicit multiplication transformations for ASCII.
	* Normalisation: Leverages SymPy's symbolic engine to `expand()` products over sums and `simplify()` algebraic expressions. It normalizes operations internally (e.g., transforming subtractions $a - b$ into additions of products $\text{Add}(a, \text{Mul}(-1, b))$, and divisions $a / b$ into multiplications of powers $\text{Mul}(a, \text{Pow}(b, -1))$).
	* Robustness & Performance: Employs an LRU cache (default: 512 entries) to prevent redundant parsing and wraps expensive SymPy calls in a `ThreadPoolExecutor` with configurable parsing timeouts (default: 5.0 seconds) to prevent infinite loops on malicious, highly-complex inputs.

	### Layer 2: Hybrid Mathematical Lexer
	* Location: [`lexer.py`](file:///c:/Users/surwe/Project/math_token/mathtok/lexer.py)
	* Role: Alternating segment segmentation (TEXT spans vs. MATH spans).
	* Implementation Details:
	* Stage 1 (Unambiguous Delimiters): Extracts LaTeX math environments (double dollar `$$...$$`, inline dollar `$...\$`, bracket `\[...\]`, or parenthesis `$...$`).
	* Stage 2 (ASCII Heuristics): Parses remaining text regions using pre-compiled regular expressions matching mathematical patterns (e.g., function calls `sin(...)`, exponents `x^2`, arithmetic boundaries `2*x+1`, relational equations `a+b=c`, and spelled-out Greek variables).
	* Region Expansion: Expands detected math seeds backwards to include leading unary operators and digits, and forwards to match balanced braces/parentheses and continuous math characters. Adjacent spans of identical types are merged.

	### Layer 3: AST Generator
	* Location: [`ast_generator.py`](file:///c:/Users/surwe/Project/math_token/mathtok/ast_generator.py)
	* Role: SymPy AST conversion to typed, abstract vocabulary trees.
	* Implementation Details:
	* Walks the SymPy internal expression tree recursively.
	* Maps generic SymPy types into the vocabulary of MathTok:
	* Variables: Standard letters map to `VAR_X`, `VAR_Y`, etc. Spelled-out Greek names map to `VAR_THETA`, `VAR_LAMBDA`, etc.
	* Constants: Values between $-10$ and $100$ receive dedicated tokens (e.g., `CONST_3`, `CONST_12`), large integers map to placeholders (e.g., `NUM_145`), floats map to string-encoded float tokens (e.g., `FLOAT_3p14`), and special constants map to `CONST_PI`, `CONST_E`, `CONST_I`, and `CONST_INF`.
	* Unary Operations: Converts negative numbers or multiplication by $-1$ to explicit `OP_NEG` nodes, and division inverses to `OP_RECIP` nodes.
	* Fractions: Converts `Rational(p, q)` into explicit binary `FRAC(numerator, denominator)` nodes.
	* Recursion Guard: Enforces `max_depth` limits (default: 20) to truncate overly-nested expressions, replacing them with a special `SUBTREE_TRUNCATED` node to avoid Python stack overflows.

	### Layer 4: Semantic Operator Registry
	* Location: [`operator_registry.py`](file:///c:/Users/surwe/Project/math_token/mathtok/operator_registry.py)
	* Role: Rich metadata storage and categorisation for mathematical operators.
	* Implementation Details:
	* Maintains an immutable registry of `OperatorMeta` instances mapping token strings to mathematical properties:
	* Properties: `arity` ($-1$ for variadic, or fixed integers like 1 or 2), `precedence`, `associativity` (left, right, or none), `semantic_role` (e.g., `aggregation` for addition, `periodic_oscillation` for sine), `latex_repr`, `ascii_repr`, `category`, and `is_commutative`.
	* Domains: Spans multiple mathematical branches: Arithmetic, Relational, Calculus, Trigonometry, Exponential/Logarithmic, Logic, Set Theory, Geometry, and Statistics.
	* Inverses: Declares explicit mathematical inverses in `INVERSE_PAIRS` (e.g., `FUNC_SIN` $\leftrightarrow$ `FUNC_ASIN`, `FUNC_EXP` $\leftrightarrow$ `FUNC_LOG`).

	### Layer 5: Structural Serializer
	* Location: [`serializer.py`](file:///c:/Users/surwe/Project/math_token/mathtok/serializer.py)
	* Role: Flattening the 2D tree structure into a 1-D stream using DFS preorder traversal.
	* Implementation Details:
	* Emits nodes starting from the root down to the leaves, producing a flat sequence of `SerializedToken` objects carrying: `depth`, `node_id`, `parent_id`, `child_index`, `num_children`, `is_leaf`, and `subtree_size`.
	* Scope Delineation: Emits `[SCOPE_OPEN]` and `[SCOPE_CLOSE]` boundary tokens to explicitly group parameters for functions (e.g., `FUNC_SIN [SCOPE_OPEN] VAR_X [SCOPE_CLOSE]`).
	* Subtree Deduplication: Integrates MD5 structural hashing (`dedup_subtrees`) to replace duplicated structures (e.g., repeating sub-formulas) with a pointer reference (e.g., `SUBTREE_REF_ae34df51`), improving sequence compression.

	### Layer 6: Structural Attention Metadata Generator
	* Location: [`metadata.py`](file:///c:/Users/surwe/Project/math_token/mathtok/metadata.py)
	* Role: Calculating positional contexts and binary attention mask matrices.
	* Implementation Details:
	* Classifies tokens into categories: `operator`, `function`, `variable`, `constant`, `structural`, `boundary`, or `text`.
	* Generates a dot-separated positional hierarchy string for each node in `tree_position_key` (e.g., `0.1.2` denotes root $\to$ 2nd child $\to$ 3rd child), which is useful for hierarchical positional encodings.
	* Attention Mask Matrix Synthesis: Dynamically compiles four $N \times N$ binary attention mask matrices:
	* `parent_mask`: Direct dependency attention.
	* `children_mask`: Inverse dependency attention.
	* `sibling_mask`: Horizontal syntactic context attention.
	* `subtree_mask`: Complete structural scope attention.

	### Layer 7: Vocabulary & BPE Compression
	* Location: [`vocabulary.py`](file:///c:/Users/surwe/Project/math_token/mathtok/vocabulary.py)
	* Role: Merging deterministic structural math vocabularies with Byte-Pair Encoding (BPE) text sub-vocabularies.
	* Implementation Details:
	* Two-Tier Architecture:
	* Tier 1 (Fixed Math Vocabulary): Reservoirs of deterministic, immutable IDs for standard operators, Greek/Latin variables, constants, boundaries, and placeholders. BPE is completely bypassed for math terms.
	* Tier 2 (BPE Text Vocabulary): Natural language regions are processed via HuggingFace's `tokenizers` library, trained on corpus-specific text spans.
	* HuggingFace Wrapper: Under the hood, `MathTokHFTokenizer` acts as a drop-in subclass wrapper for `PreTrainedTokenizer`, enabling immediate integration into standard pipelines such as `transformers.Trainer`, `datasets.map`, and PyTorch collators.

	---

	## 🔄 Verification & Streaming Sub-systems

	Beyond the core layers, MathTok implements crucial sub-systems to guarantee mathematical correctess and scale.

	### Round-Trip Validation
	* Location: [`validator.py`](file:///c:/Users/surwe/Project/math_token/mathtok/validator.py)
	* Role: Guaranteeing zero semantic information loss during tokenization.
	* Implementation Details:
	* Uses the emitted `TokenMetadata` sequence to mathematically reconstruct the original SymPy expression.
	* Rebuilds leaf nodes based on their category (constants, variables, truncations) and moves upwards to reconstruct complex nodes (`FRAC`, operators, custom functions).
	* Performs formal validation by checking if the algebraic difference between the original and reconstructed expressions simplifies to zero (`sp.simplify(original - reconstructed) == 0`).

	### Streaming Pipeline
	* Location: [`streaming.py`](file:///c:/Users/surwe/Project/math_token/mathtok/streaming.py)
	* Role: Corpus-scale processing of large datasets without exhausting system memory.
	* Implementation Details:
	* Wraps `MathTokPipeline` inside a lazy Python generator (`yield`).
	* Supports encoding custom iterators and streams line-delimited files sequentially, ensuring constant memory ($O(1)$ RAM) overhead during dataset processing.

	---

	## 📈 Evaluation Suite & Benchmark Metrics

	The [`evaluation/`](file:///c:/Users/surwe/Project/math_token/evaluation) package defines five core evaluation metrics (residing in [`metrics.py`](file:///c:/Users/surwe/Project/math_token/evaluation/metrics.py)) to assess tokenizer quality, benchmarked in [`comparison.py`](file:///c:/Users/surwe/Project/math_token/evaluation/comparison.py).

	### Core Metrics

	\| Metric \| Symbol \| Definition & Formula \| Mathematical Value \|
	\| :--- \| :---: \| :--- \| :--- \|
	\| Structural Compression Ratio \| SCR \| $\text{mean}\left(\frac{\text{Structural Score}}{\text{Token Count}}\right)$ \| Quantifies structural information density. Higher is better (more structure packed into fewer tokens). \|
	\| Canonical Consistency Score \| CCS \| $\text{mean}\left( \text{Jaccard}(S_A, S_B) \right)$ over equivalent pairs \| Evaluates algebraic invariance. A score of $1.0$ represents perfect semantic convergence. \|
	\| Operator Preservation Score \| OPS \| $\%$ of expressions containing all expected operators \| Measures robustness; ensures mathematical operations are never lost or corrupted. \|
	\| Token Stability \| TS \| $1 - \text{Coefficient of Variation}(\text{length})$ \| Assesses syntactic variance stability under re-writings. Higher is more stable. \|
	\| Tree Depth Fidelity \| TDF \| $1 - \text{mean}\left( \frac{\vert d_{\text{actual}} - d_{\text{ground}} \vert}{d_{\text{ground}}} \right)$ \| Measures max metadata depth accuracy against the ground truth SymPy height. \|

	> [!NOTE]
	> Semantic Compression Ratio (SCR) is evaluated at three hierarchical levels in `comparison.py`:
	> * Level 1 — Structural Score to Token Ratio: `structural_score / token_count`
	> * Level 2 — Semantic Density: `math_tokens / total_tokens`
	> * Level 3 — Structural Efficiency: `parent_child_relations / token_count`

	---

	## 🔬 Empirical Benchmark Results

	Empirical comparisons of MathTok against a standard subword tokenizer (GPT-2 BPE), a custom-trained SentencePiece (unigram) tokenizer, and character-level baselines over 70 complex test expressions across multiple disciplines reveal substantial improvements.

	### 1. 3-Level Semantic Comparison (Aggregated)

	Across the entire evaluation suite, the aggregated results illustrate MathTok's efficiency:

	\| Metric \| MathTok \| GPT-2 \| SentencePiece \| Character-Level \|
	\| :--- \| :---: \| :---: \| :---: \| :---: \|
	\| Level 1 — SCR (struct_score / tokens) \| 0.9161 \| 0.4251 \| 0.3696 \| 0.4005 \|
	\| Level 2 — Semantic Density (math / total) \| 0.5633 \| 0.1838 \| 0.1499 \| — \|
	\| Level 3 — Structural Efficiency (relations / tokens) \| 0.2492 \| N/A \| N/A \| — \|
	\| SCR Improvement Factor (MathTok vs. Baseline) \| — \| 2.16x \| 2.48x \| 2.29x \|

	### 2. Canonical Convergence & Consistency (Jaccard Overlap)

	For mathematically equivalent pairs, MathTok achieves perfect Jaccard alignment (Jaccard = 1.0), whereas standard text-based tokenizers suffer significant fragmentation:

	\| Expression Pair \| MathTok Jaccard \| GPT-2 Jaccard \| SentencePiece Jaccard \| Convergence Status \|
	\| :--- \| :---: \| :---: \| :---: \| :---: \|
	\| `x + 2` vs. `2 + x` \| 1.000 \| 0.200 \| 1.000 \| CONVERGED (100%) \|
	\| `ab + ac` vs. `a(b+c)` \| 1.000* \| 0.444 \| 0.625 \| CONVERGED (100%) \|
	\| `(x+1)^2` vs. `x^2+2x+1` \| 1.000 \| 0.273 \| 0.222 \| CONVERGED (100%) \|
	\| `x^2 - y^2` vs. `(x+y)(x-y)` \| 1.000* \| 0.091 \| 0.300 \| CONVERGED (100%) \|
	\| `sin(x)^2 + cos(x)^2` vs. `1` \| 1.000 \| 0.000 \| 0.000 \| CONVERGED (100%) \|
	\| `2x + 2y` vs. `2(x+y)` \| 1.000* \| 0.444 \| 0.571 \| CONVERGED (100%) \|
	\| `xy + xz` vs. `x(y+z)` \| 1.000* \| 0.444 \| 0.625 \| CONVERGED (100%) \|
	\| `a^2 + 2ab + b^2` vs. `(a+b)^2` \| 1.000 \| 0.364 \| 0.455 \| CONVERGED (100%) \|

	### 3. LaTeX vs. ASCII Format Invariance

	MathTok perfectly converges inputs in differing representations to identical structural sequences, while subword tokenizers have severe variance:

	\| ASCII Expression \| LaTeX Expression \| MathTok same? \| MT tokens A/L \| GPT-2 tokens A/L \| SP tokens A/L \|
	\| :--- \| :--- \| :---: \| :---: \| :---: \| :---: \|
	\| `sin(x^2)` \| `\sin(x^2)` \| YES (1.00) \| 8 / 8 \| 6 / 7 \| 6 / 6 \|
	\| `sqrt(x^2 + 1)` \| `\sqrt{x^2 + 1}` \| YES (1.00) \| 11 / 11 \| 9 / 10 \| 9 / 9 \|
	\| `log(x)` \| `\ln(x)` \| YES (1.00) \| 6 / 6 \| 4 / 5 \| 6 / 6 \|
	\| `exp(x)` \| `e^x` \| YES (1.00) \| 6 / 6 \| 4 / 3 \| 6 / 3 \|
	\| `x/y` \| `\frac{x}{y}` \| YES (1.00) \| 6 / 6 \| 3 / 7 \| 3 / 9 \|
	\| `int(x^2, x)` \| `\int x^2 dx` \| NO (~/fallback) \| 1 / 10 \| 8 / 6 \| 8 / 7 \|
	\| `diff(sin(x), x)` \| `\frac{d}{dx}\sin(x)` \| YES (1.00) \| 6 / 6 \| 8 / 11 \| 14 / 16 \|
	\| `factorial(n)` \| `n!` \| YES (1.00) \| 6 / 6 \| 5 / 2 \| 11 / 3 \|

	---

	## 🚀 Custom Attention Integration Patterns

	The core value of MathTok for downstream machine learning practitioners is the Layer 6 Attention Hints. By translating tree relationships into standard masking shapes, model creators can train structure-aware networks natively.

	Below are three attention mask designs that can be constructed directly from the outputs of `to_attention_mask_hints()`:

	### 1. Parent-Child Hierarchical Mask
	Encourages top-down syntactic attention. Nodes are only allowed to attend to their direct parent or child node.

	```
	[+ (root)] Parent Attention Mask Matrix:
	/ \
	[x] [3] [ ] [+ (root)] [x] [3]
	\| [+ (root)] 1 1 1
	[sin] [x] 1 1 0
	[3] 1 0 1
	```

	### 2. Sibling Horizontal Mask
	Focuses horizontal attention across operands of identical scopes (e.g., connecting operands inside an addition sequence, $a$ and $b$ and $c$, without parent noise).

	### 3. Subtree Scope Mask
	A highly effective block mask for mathematical reasoning. Restricts attention strictly within a subtree, isolating independent sub-expressions during reasoning loops.

	---

	## 🎯 Codebase Evaluation & Recommendations

	### Key Strengths
	1. Outstanding Structural Integrity: Modularity is excellent. Clear abstraction separation (canonicalization, tokenization, serialization, and vocabulary grouping) makes codebase expansion extremely straightforward.
	2. HuggingFace Compatibility: Subclassing/wrapping the standard tokenizer class ensures immediate, zero-friction integration with existing libraries like PyTorch and HuggingFace.
	3. Rigorous Validation: The inclusion of `validator.py` and the round-trip checking logic demonstrates high development standards.
	4. Reliability Guards: LRU caches, concurrency thread pools, and recursion limits make this pipeline safe for server-side deployment.

	### Recommended Enhancements
	* Vocabulary Extension: Dynamically augment `_VAR_MAP` in `ast_generator.py` to natively support multi-character variables (e.g., physics variables like $v_{\text{init}}$ or matrix names) without splitting them into generic token placeholders.
	* SymPy Parser Customisation: SymPy's LaTeX parser can occasionally fail on non-standard, custom LaTeX macros. Adding pre-processing ASCII/LaTeX regex cleaners in `lexer.py` prior to passing them to SymPy will improve the parse success rate of dirty online forum data.
	* TDF Precision: In case of multi-nested subtrees (e.g., highly deeply-nested fractions), customize the tree depth calculation in `metrics.py` to evaluate structural depths on custom mathematical representations rather than internal SymPy structures.

	---

	### Citation Reference
	```bibtex
	@article{mathtok2026,
	title = {MathTok: A Hybrid Canonicalized AST-Based Tokenization Framework
	for Mathematical Language Modeling},
	author = {Anonymous},
	year = {2026},
	note = {Under review}
	}
	```