mathtok / review.md

Initial clean MathTok release

edede4c 4 days ago

18.7 kB

🌟 MathTok: Canonicalized AST-Based Mathematical Tokenizer Codebase Review

An in-depth structural and architectural analysis of the MathTok pipeline located at c:\Users\surwe\Project\math_token. This document serves as a comprehensive system review, detailing the mathematical foundations, the 7-layer pipeline design, system components, evaluation metrics, empirical results, and downstream application patterns of MathTok.

📖 Executive Summary

Standard natural language tokenizers (like Byte-Pair Encoding or SentencePiece) treat mathematical expressions as plain text sequences. This results in structural fragmentation (e.g., splitting a variable VAR_THETA or operator OP_ADD into arbitrary character chunks) and semantic blindness (failing to recognize algebraic equivalences like $x + 2 \equiv 2 + x$).

MathTok solves this by introducing a hybrid, structure-aware tokenization framework for mathematical language modeling. By constructing an Abstract Syntax Tree (AST) from mathematical expressions, normalizing algebraic equivalences via symbolic mathematics (SymPy), and serializing the tree using Depth-First Search (DFS) preorder traversal, MathTok preserves full mathematical syntax and hierarchy.

Additionally, MathTok automatically emits structural attention metadata for every token position, enabling downstream transformer models to implement tree-based or graph-structured attention patterns without architectural modifications.

graph TD
    A[Raw Input: Mixed Text + Math] --> B[Layer 2: Hybrid Lexer]
    B -->|TEXT Spans| C[Layer 7: BPE Text Sub-Vocab]
    B -->|MATH Spans| D[Layer 1: Canonicalizer Engine]
    D -->|SymPy Expression| E[Layer 3: AST Generator]
    E -->|Typed AST Tree| F[Layer 4: Semantic Operator Registry]
    F -->|Enriched Nodes| G[Layer 5: Structural Serializer]
    G -->|DFS Preorder Stream| H[Layer 6: Attention Metadata Gen]
    H -->|Attention Masks & Hints| I[Final Merged Token Stream]
    C --> I

🛠️ The 7-Layer Processing Pipeline

MathTok's core engine is structured into seven distinct modular layers. Every component resides in the mathtok/ package.

Layer 1: Canonicalizer Engine

Location: canonicalizer.py
Role: Algebraic normalisation and format conversion (LaTeX $\to$ ASCII $\to$ SymPy).
Implementation Details:
- Heuristic Format Detection: Inspects the input for LaTeX syntax (e.g., \frac, \sqrt, \sin, {, math delimiters like $ or \().
- Parsing: Utilizes sympy.parsing.latex.parse_latex (with ANTLR4) for LaTeX, falling back to sympy.parsing.sympy_parser.parse_expr with standard and implicit multiplication transformations for ASCII.
- Normalisation: Leverages SymPy's symbolic engine to expand() products over sums and simplify() algebraic expressions. It normalizes operations internally (e.g., transforming subtractions $a - b$ into additions of products $\text{Add}(a, \text{Mul}(-1, b))$, and divisions $a / b$ into multiplications of powers $\text{Mul}(a, \text{Pow}(b, -1))$).
- Robustness & Performance: Employs an LRU cache (default: 512 entries) to prevent redundant parsing and wraps expensive SymPy calls in a ThreadPoolExecutor with configurable parsing timeouts (default: 5.0 seconds) to prevent infinite loops on malicious, highly-complex inputs.

Layer 2: Hybrid Mathematical Lexer

Location: lexer.py
Role: Alternating segment segmentation (TEXT spans vs. MATH spans).
Implementation Details:
- Stage 1 (Unambiguous Delimiters): Extracts LaTeX math environments (double dollar $$...$$, inline dollar $...\$ , bracket \[...\], or parenthesis $...$).
- Stage 2 (ASCII Heuristics): Parses remaining text regions using pre-compiled regular expressions matching mathematical patterns (e.g., function calls sin(...), exponents x^2, arithmetic boundaries 2*x+1, relational equations a+b=c, and spelled-out Greek variables).
- Region Expansion: Expands detected math seeds backwards to include leading unary operators and digits, and forwards to match balanced braces/parentheses and continuous math characters. Adjacent spans of identical types are merged.

Layer 3: AST Generator

Location: ast_generator.py
Role: SymPy AST conversion to typed, abstract vocabulary trees.
Implementation Details:
- Walks the SymPy internal expression tree recursively.
- Maps generic SymPy types into the vocabulary of MathTok:
  - Variables: Standard letters map to VAR_X, VAR_Y, etc. Spelled-out Greek names map to VAR_THETA, VAR_LAMBDA, etc.
  - Constants: Values between $-10$ and $100$ receive dedicated tokens (e.g., CONST_3, CONST_12), large integers map to placeholders (e.g., NUM_145), floats map to string-encoded float tokens (e.g., FLOAT_3p14), and special constants map to CONST_PI, CONST_E, CONST_I, and CONST_INF.
  - Unary Operations: Converts negative numbers or multiplication by $-1$ to explicit OP_NEG nodes, and division inverses to OP_RECIP nodes.
  - Fractions: Converts Rational(p, q) into explicit binary FRAC(numerator, denominator) nodes.
- Recursion Guard: Enforces max_depth limits (default: 20) to truncate overly-nested expressions, replacing them with a special SUBTREE_TRUNCATED node to avoid Python stack overflows.

Layer 4: Semantic Operator Registry

Location: operator_registry.py
Role: Rich metadata storage and categorisation for mathematical operators.
Implementation Details:
- Maintains an immutable registry of OperatorMeta instances mapping token strings to mathematical properties:
  - Properties: arity ($-1$ for variadic, or fixed integers like 1 or 2), precedence, associativity (left, right, or none), semantic_role (e.g., aggregation for addition, periodic_oscillation for sine), latex_repr, ascii_repr, category, and is_commutative.
  - Domains: Spans multiple mathematical branches: Arithmetic, Relational, Calculus, Trigonometry, Exponential/Logarithmic, Logic, Set Theory, Geometry, and Statistics.
  - Inverses: Declares explicit mathematical inverses in INVERSE_PAIRS (e.g., FUNC_SIN $\leftrightarrow$ FUNC_ASIN, FUNC_EXP $\leftrightarrow$ FUNC_LOG).

Layer 5: Structural Serializer

Location: serializer.py
Role: Flattening the 2D tree structure into a 1-D stream using DFS preorder traversal.
Implementation Details:
- Emits nodes starting from the root down to the leaves, producing a flat sequence of SerializedToken objects carrying: depth, node_id, parent_id, child_index, num_children, is_leaf, and subtree_size.
- Scope Delineation: Emits [SCOPE_OPEN] and [SCOPE_CLOSE] boundary tokens to explicitly group parameters for functions (e.g., FUNC_SIN [SCOPE_OPEN] VAR_X [SCOPE_CLOSE]).
- Subtree Deduplication: Integrates MD5 structural hashing (dedup_subtrees) to replace duplicated structures (e.g., repeating sub-formulas) with a pointer reference (e.g., SUBTREE_REF_ae34df51), improving sequence compression.

Layer 6: Structural Attention Metadata Generator

Location: metadata.py
Role: Calculating positional contexts and binary attention mask matrices.
Implementation Details:
- Classifies tokens into categories: operator, function, variable, constant, structural, boundary, or text.
- Generates a dot-separated positional hierarchy string for each node in tree_position_key (e.g., 0.1.2 denotes root $\to$ 2nd child $\to$ 3rd child), which is useful for hierarchical positional encodings.
- Attention Mask Matrix Synthesis: Dynamically compiles four $N \times N$ binary attention mask matrices:
  - parent_mask: Direct dependency attention.
  - children_mask: Inverse dependency attention.
  - sibling_mask: Horizontal syntactic context attention.
  - subtree_mask: Complete structural scope attention.

Layer 7: Vocabulary & BPE Compression

Location: vocabulary.py
Role: Merging deterministic structural math vocabularies with Byte-Pair Encoding (BPE) text sub-vocabularies.
Implementation Details:
- Two-Tier Architecture:
  - Tier 1 (Fixed Math Vocabulary): Reservoirs of deterministic, immutable IDs for standard operators, Greek/Latin variables, constants, boundaries, and placeholders. BPE is completely bypassed for math terms.
  - Tier 2 (BPE Text Vocabulary): Natural language regions are processed via HuggingFace's tokenizers library, trained on corpus-specific text spans.
- HuggingFace Wrapper: Under the hood, MathTokHFTokenizer acts as a drop-in subclass wrapper for PreTrainedTokenizer, enabling immediate integration into standard pipelines such as transformers.Trainer, datasets.map, and PyTorch collators.

🔄 Verification & Streaming Sub-systems

Beyond the core layers, MathTok implements crucial sub-systems to guarantee mathematical correctess and scale.

Round-Trip Validation

Location: validator.py
Role: Guaranteeing zero semantic information loss during tokenization.
Implementation Details:
- Uses the emitted TokenMetadata sequence to mathematically reconstruct the original SymPy expression.
- Rebuilds leaf nodes based on their category (constants, variables, truncations) and moves upwards to reconstruct complex nodes (FRAC, operators, custom functions).
- Performs formal validation by checking if the algebraic difference between the original and reconstructed expressions simplifies to zero (sp.simplify(original - reconstructed) == 0).

Streaming Pipeline

Location: streaming.py
Role: Corpus-scale processing of large datasets without exhausting system memory.
Implementation Details:
- Wraps MathTokPipeline inside a lazy Python generator (yield).
- Supports encoding custom iterators and streams line-delimited files sequentially, ensuring constant memory ($O(1)$ RAM) overhead during dataset processing.

📈 Evaluation Suite & Benchmark Metrics

The evaluation/ package defines five core evaluation metrics (residing in metrics.py) to assess tokenizer quality, benchmarked in comparison.py.

Core Metrics

Metric	Symbol	Definition & Formula	Mathematical Value
Structural Compression Ratio	SCR	$\text{mean}\left(\frac{\text{Structural Score}}{\text{Token Count}}\right)$	Quantifies structural information density. Higher is better (more structure packed into fewer tokens).
Canonical Consistency Score	CCS	$\text{mean}\left( \text{Jaccard}(S_A, S_B) \right)$ over equivalent pairs	Evaluates algebraic invariance. A score of $1.0$ represents perfect semantic convergence.
Operator Preservation Score	OPS	$%$ of expressions containing all expected operators	Measures robustness; ensures mathematical operations are never lost or corrupted.
Token Stability	TS	$1 - \text{Coefficient of Variation}(\text{length})$	Assesses syntactic variance stability under re-writings. Higher is more stable.
Tree Depth Fidelity	TDF	$1 - \text{mean}\left( \frac{\vert d_{\text{actual}} - d_{\text{ground}} \vert}{d_{\text{ground}}} \right)$	Measures max metadata depth accuracy against the ground truth SymPy height.

Semantic Compression Ratio (SCR) is evaluated at three hierarchical levels in comparison.py:

Level 1 — Structural Score to Token Ratio: structural_score / token_count

Level 2 — Semantic Density: math_tokens / total_tokens

Level 3 — Structural Efficiency: parent_child_relations / token_count

🔬 Empirical Benchmark Results

Empirical comparisons of MathTok against a standard subword tokenizer (GPT-2 BPE), a custom-trained SentencePiece (unigram) tokenizer, and character-level baselines over 70 complex test expressions across multiple disciplines reveal substantial improvements.

1. 3-Level Semantic Comparison (Aggregated)

Across the entire evaluation suite, the aggregated results illustrate MathTok's efficiency:

Metric	MathTok	GPT-2	SentencePiece	Character-Level
Level 1 — SCR (struct_score / tokens)	0.9161	0.4251	0.3696	0.4005
Level 2 — Semantic Density (math / total)	0.5633	0.1838	0.1499	—
Level 3 — Structural Efficiency (relations / tokens)	0.2492	N/A	N/A	—
SCR Improvement Factor (MathTok vs. Baseline)	—	2.16x	2.48x	2.29x

2. Canonical Convergence & Consistency (Jaccard Overlap)

For mathematically equivalent pairs, MathTok achieves perfect Jaccard alignment (Jaccard = 1.0), whereas standard text-based tokenizers suffer significant fragmentation:

Expression Pair	MathTok Jaccard	GPT-2 Jaccard	SentencePiece Jaccard	Convergence Status
`x + 2` vs. `2 + x`	1.000	0.200	1.000	CONVERGED (100%)
`ab + ac` vs. `a*(b+c)`	1.000	0.444	0.625	CONVERGED (100%)
`(x+1)^2` vs. `x^2+2x+1`	1.000	0.273	0.222	CONVERGED (100%)
`x^2 - y^2` vs. `(x+y)*(x-y)`	1.000	0.091	0.300	CONVERGED (100%)
`sin(x)^2 + cos(x)^2` vs. `1`	1.000	0.000	0.000	CONVERGED (100%)
`2x + 2y` vs. `2*(x+y)`	1.000	0.444	0.571	CONVERGED (100%)
`xy + xz` vs. `x*(y+z)`	1.000	0.444	0.625	CONVERGED (100%)
`a^2 + 2ab + b^2` vs. `(a+b)^2`	1.000	0.364	0.455	CONVERGED (100%)

3. LaTeX vs. ASCII Format Invariance

MathTok perfectly converges inputs in differing representations to identical structural sequences, while subword tokenizers have severe variance:

ASCII Expression	LaTeX Expression	MathTok same?	MT tokens A/L	GPT-2 tokens A/L	SP tokens A/L
`sin(x^2)`	`\sin(x^2)`	YES (1.00)	8 / 8	6 / 7	6 / 6
`sqrt(x^2 + 1)`	`\sqrt{x^2 + 1}`	YES (1.00)	11 / 11	9 / 10	9 / 9
`log(x)`	`\ln(x)`	YES (1.00)	6 / 6	4 / 5	6 / 6
`exp(x)`	`e^x`	YES (1.00)	6 / 6	4 / 3	6 / 3
`x/y`	`\frac{x}{y}`	YES (1.00)	6 / 6	3 / 7	3 / 9
`int(x^2, x)`	`\int x^2 dx`	NO (~/fallback)	1 / 10	8 / 6	8 / 7
`diff(sin(x), x)`	`\frac{d}{dx}\sin(x)`	YES (1.00)	6 / 6	8 / 11	14 / 16
`factorial(n)`	`n!`	YES (1.00)	6 / 6	5 / 2	11 / 3

🚀 Custom Attention Integration Patterns

The core value of MathTok for downstream machine learning practitioners is the Layer 6 Attention Hints. By translating tree relationships into standard masking shapes, model creators can train structure-aware networks natively.

Below are three attention mask designs that can be constructed directly from the outputs of to_attention_mask_hints():

1. Parent-Child Hierarchical Mask

Encourages top-down syntactic attention. Nodes are only allowed to attend to their direct parent or child node.

       [+ (root)]             Parent Attention Mask Matrix:
        /      \              
     [x]       [3]            [ ] [+ (root)] [x] [3]
      |                       [+ (root)]   1    1   1
    [sin]                     [x]          1    1   0
                              [3]          1    0   1

2. Sibling Horizontal Mask

Focuses horizontal attention across operands of identical scopes (e.g., connecting operands inside an addition sequence, $a$ and $b$ and $c$, without parent noise).

3. Subtree Scope Mask

A highly effective block mask for mathematical reasoning. Restricts attention strictly within a subtree, isolating independent sub-expressions during reasoning loops.

🎯 Codebase Evaluation & Recommendations

Key Strengths

Outstanding Structural Integrity: Modularity is excellent. Clear abstraction separation (canonicalization, tokenization, serialization, and vocabulary grouping) makes codebase expansion extremely straightforward.
HuggingFace Compatibility: Subclassing/wrapping the standard tokenizer class ensures immediate, zero-friction integration with existing libraries like PyTorch and HuggingFace.
Rigorous Validation: The inclusion of validator.py and the round-trip checking logic demonstrates high development standards.
Reliability Guards: LRU caches, concurrency thread pools, and recursion limits make this pipeline safe for server-side deployment.

Recommended Enhancements

Vocabulary Extension: Dynamically augment _VAR_MAP in ast_generator.py to natively support multi-character variables (e.g., physics variables like $v_{\text{init}}$ or matrix names) without splitting them into generic token placeholders.
SymPy Parser Customisation: SymPy's LaTeX parser can occasionally fail on non-standard, custom LaTeX macros. Adding pre-processing ASCII/LaTeX regex cleaners in lexer.py prior to passing them to SymPy will improve the parse success rate of dirty online forum data.
TDF Precision: In case of multi-nested subtrees (e.g., highly deeply-nested fractions), customize the tree depth calculation in metrics.py to evaluate structural depths on custom mathematical representations rather than internal SymPy structures.

Citation Reference

@article{mathtok2026,
  title   = {MathTok: A Hybrid Canonicalized AST-Based Tokenization Framework
             for Mathematical Language Modeling},
  author  = {Anonymous},
  year    = {2026},
  note    = {Under review}
}