mathtok / review.md
SurweeshSP's picture
Initial clean MathTok release
edede4c

🌟 MathTok: Canonicalized AST-Based Mathematical Tokenizer Codebase Review

An in-depth structural and architectural analysis of the MathTok pipeline located at c:\Users\surwe\Project\math_token. This document serves as a comprehensive system review, detailing the mathematical foundations, the 7-layer pipeline design, system components, evaluation metrics, empirical results, and downstream application patterns of MathTok.


πŸ“– Executive Summary

Standard natural language tokenizers (like Byte-Pair Encoding or SentencePiece) treat mathematical expressions as plain text sequences. This results in structural fragmentation (e.g., splitting a variable VAR_THETA or operator OP_ADD into arbitrary character chunks) and semantic blindness (failing to recognize algebraic equivalences like $x + 2 \equiv 2 + x$).

MathTok solves this by introducing a hybrid, structure-aware tokenization framework for mathematical language modeling. By constructing an Abstract Syntax Tree (AST) from mathematical expressions, normalizing algebraic equivalences via symbolic mathematics (SymPy), and serializing the tree using Depth-First Search (DFS) preorder traversal, MathTok preserves full mathematical syntax and hierarchy.

Additionally, MathTok automatically emits structural attention metadata for every token position, enabling downstream transformer models to implement tree-based or graph-structured attention patterns without architectural modifications.

graph TD
    A[Raw Input: Mixed Text + Math] --> B[Layer 2: Hybrid Lexer]
    B -->|TEXT Spans| C[Layer 7: BPE Text Sub-Vocab]
    B -->|MATH Spans| D[Layer 1: Canonicalizer Engine]
    D -->|SymPy Expression| E[Layer 3: AST Generator]
    E -->|Typed AST Tree| F[Layer 4: Semantic Operator Registry]
    F -->|Enriched Nodes| G[Layer 5: Structural Serializer]
    G -->|DFS Preorder Stream| H[Layer 6: Attention Metadata Gen]
    H -->|Attention Masks & Hints| I[Final Merged Token Stream]
    C --> I

πŸ› οΈ The 7-Layer Processing Pipeline

MathTok's core engine is structured into seven distinct modular layers. Every component resides in the mathtok/ package.

Layer 1: Canonicalizer Engine

  • Location: canonicalizer.py
  • Role: Algebraic normalisation and format conversion (LaTeX $\to$ ASCII $\to$ SymPy).
  • Implementation Details:
    • Heuristic Format Detection: Inspects the input for LaTeX syntax (e.g., \frac, \sqrt, \sin, {, math delimiters like $ or \().
    • Parsing: Utilizes sympy.parsing.latex.parse_latex (with ANTLR4) for LaTeX, falling back to sympy.parsing.sympy_parser.parse_expr with standard and implicit multiplication transformations for ASCII.
    • Normalisation: Leverages SymPy's symbolic engine to expand() products over sums and simplify() algebraic expressions. It normalizes operations internally (e.g., transforming subtractions $a - b$ into additions of products $\text{Add}(a, \text{Mul}(-1, b))$, and divisions $a / b$ into multiplications of powers $\text{Mul}(a, \text{Pow}(b, -1))$).
    • Robustness & Performance: Employs an LRU cache (default: 512 entries) to prevent redundant parsing and wraps expensive SymPy calls in a ThreadPoolExecutor with configurable parsing timeouts (default: 5.0 seconds) to prevent infinite loops on malicious, highly-complex inputs.

Layer 2: Hybrid Mathematical Lexer

  • Location: lexer.py
  • Role: Alternating segment segmentation (TEXT spans vs. MATH spans).
  • Implementation Details:
    • Stage 1 (Unambiguous Delimiters): Extracts LaTeX math environments (double dollar $$...$$, inline dollar $...\$, bracket \[...\], or parenthesis \(...\)).
    • Stage 2 (ASCII Heuristics): Parses remaining text regions using pre-compiled regular expressions matching mathematical patterns (e.g., function calls sin(...), exponents x^2, arithmetic boundaries 2*x+1, relational equations a+b=c, and spelled-out Greek variables).
    • Region Expansion: Expands detected math seeds backwards to include leading unary operators and digits, and forwards to match balanced braces/parentheses and continuous math characters. Adjacent spans of identical types are merged.

Layer 3: AST Generator

  • Location: ast_generator.py
  • Role: SymPy AST conversion to typed, abstract vocabulary trees.
  • Implementation Details:
    • Walks the SymPy internal expression tree recursively.
    • Maps generic SymPy types into the vocabulary of MathTok:
      • Variables: Standard letters map to VAR_X, VAR_Y, etc. Spelled-out Greek names map to VAR_THETA, VAR_LAMBDA, etc.
      • Constants: Values between $-10$ and $100$ receive dedicated tokens (e.g., CONST_3, CONST_12), large integers map to placeholders (e.g., NUM_145), floats map to string-encoded float tokens (e.g., FLOAT_3p14), and special constants map to CONST_PI, CONST_E, CONST_I, and CONST_INF.
      • Unary Operations: Converts negative numbers or multiplication by $-1$ to explicit OP_NEG nodes, and division inverses to OP_RECIP nodes.
      • Fractions: Converts Rational(p, q) into explicit binary FRAC(numerator, denominator) nodes.
    • Recursion Guard: Enforces max_depth limits (default: 20) to truncate overly-nested expressions, replacing them with a special SUBTREE_TRUNCATED node to avoid Python stack overflows.

Layer 4: Semantic Operator Registry

  • Location: operator_registry.py
  • Role: Rich metadata storage and categorisation for mathematical operators.
  • Implementation Details:
    • Maintains an immutable registry of OperatorMeta instances mapping token strings to mathematical properties:
      • Properties: arity ($-1$ for variadic, or fixed integers like 1 or 2), precedence, associativity (left, right, or none), semantic_role (e.g., aggregation for addition, periodic_oscillation for sine), latex_repr, ascii_repr, category, and is_commutative.
      • Domains: Spans multiple mathematical branches: Arithmetic, Relational, Calculus, Trigonometry, Exponential/Logarithmic, Logic, Set Theory, Geometry, and Statistics.
      • Inverses: Declares explicit mathematical inverses in INVERSE_PAIRS (e.g., FUNC_SIN $\leftrightarrow$ FUNC_ASIN, FUNC_EXP $\leftrightarrow$ FUNC_LOG).

Layer 5: Structural Serializer

  • Location: serializer.py
  • Role: Flattening the 2D tree structure into a 1-D stream using DFS preorder traversal.
  • Implementation Details:
    • Emits nodes starting from the root down to the leaves, producing a flat sequence of SerializedToken objects carrying: depth, node_id, parent_id, child_index, num_children, is_leaf, and subtree_size.
    • Scope Delineation: Emits [SCOPE_OPEN] and [SCOPE_CLOSE] boundary tokens to explicitly group parameters for functions (e.g., FUNC_SIN [SCOPE_OPEN] VAR_X [SCOPE_CLOSE]).
    • Subtree Deduplication: Integrates MD5 structural hashing (dedup_subtrees) to replace duplicated structures (e.g., repeating sub-formulas) with a pointer reference (e.g., SUBTREE_REF_ae34df51), improving sequence compression.

Layer 6: Structural Attention Metadata Generator

  • Location: metadata.py
  • Role: Calculating positional contexts and binary attention mask matrices.
  • Implementation Details:
    • Classifies tokens into categories: operator, function, variable, constant, structural, boundary, or text.
    • Generates a dot-separated positional hierarchy string for each node in tree_position_key (e.g., 0.1.2 denotes root $\to$ 2nd child $\to$ 3rd child), which is useful for hierarchical positional encodings.
    • Attention Mask Matrix Synthesis: Dynamically compiles four $N \times N$ binary attention mask matrices:
      • parent_mask: Direct dependency attention.
      • children_mask: Inverse dependency attention.
      • sibling_mask: Horizontal syntactic context attention.
      • subtree_mask: Complete structural scope attention.

Layer 7: Vocabulary & BPE Compression

  • Location: vocabulary.py
  • Role: Merging deterministic structural math vocabularies with Byte-Pair Encoding (BPE) text sub-vocabularies.
  • Implementation Details:
    • Two-Tier Architecture:
      • Tier 1 (Fixed Math Vocabulary): Reservoirs of deterministic, immutable IDs for standard operators, Greek/Latin variables, constants, boundaries, and placeholders. BPE is completely bypassed for math terms.
      • Tier 2 (BPE Text Vocabulary): Natural language regions are processed via HuggingFace's tokenizers library, trained on corpus-specific text spans.
    • HuggingFace Wrapper: Under the hood, MathTokHFTokenizer acts as a drop-in subclass wrapper for PreTrainedTokenizer, enabling immediate integration into standard pipelines such as transformers.Trainer, datasets.map, and PyTorch collators.

πŸ”„ Verification & Streaming Sub-systems

Beyond the core layers, MathTok implements crucial sub-systems to guarantee mathematical correctess and scale.

Round-Trip Validation

  • Location: validator.py
  • Role: Guaranteeing zero semantic information loss during tokenization.
  • Implementation Details:
    • Uses the emitted TokenMetadata sequence to mathematically reconstruct the original SymPy expression.
    • Rebuilds leaf nodes based on their category (constants, variables, truncations) and moves upwards to reconstruct complex nodes (FRAC, operators, custom functions).
    • Performs formal validation by checking if the algebraic difference between the original and reconstructed expressions simplifies to zero (sp.simplify(original - reconstructed) == 0).

Streaming Pipeline

  • Location: streaming.py
  • Role: Corpus-scale processing of large datasets without exhausting system memory.
  • Implementation Details:
    • Wraps MathTokPipeline inside a lazy Python generator (yield).
    • Supports encoding custom iterators and streams line-delimited files sequentially, ensuring constant memory ($O(1)$ RAM) overhead during dataset processing.

πŸ“ˆ Evaluation Suite & Benchmark Metrics

The evaluation/ package defines five core evaluation metrics (residing in metrics.py) to assess tokenizer quality, benchmarked in comparison.py.

Core Metrics

Metric Symbol Definition & Formula Mathematical Value
Structural Compression Ratio SCR $\text{mean}\left(\frac{\text{Structural Score}}{\text{Token Count}}\right)$ Quantifies structural information density. Higher is better (more structure packed into fewer tokens).
Canonical Consistency Score CCS $\text{mean}\left( \text{Jaccard}(S_A, S_B) \right)$ over equivalent pairs Evaluates algebraic invariance. A score of $1.0$ represents perfect semantic convergence.
Operator Preservation Score OPS $%$ of expressions containing all expected operators Measures robustness; ensures mathematical operations are never lost or corrupted.
Token Stability TS $1 - \text{Coefficient of Variation}(\text{length})$ Assesses syntactic variance stability under re-writings. Higher is more stable.
Tree Depth Fidelity TDF $1 - \text{mean}\left( \frac{\vert d_{\text{actual}} - d_{\text{ground}} \vert}{d_{\text{ground}}} \right)$ Measures max metadata depth accuracy against the ground truth SymPy height.

Semantic Compression Ratio (SCR) is evaluated at three hierarchical levels in comparison.py:

  • Level 1 β€” Structural Score to Token Ratio: structural_score / token_count
  • Level 2 β€” Semantic Density: math_tokens / total_tokens
  • Level 3 β€” Structural Efficiency: parent_child_relations / token_count

πŸ”¬ Empirical Benchmark Results

Empirical comparisons of MathTok against a standard subword tokenizer (GPT-2 BPE), a custom-trained SentencePiece (unigram) tokenizer, and character-level baselines over 70 complex test expressions across multiple disciplines reveal substantial improvements.

1. 3-Level Semantic Comparison (Aggregated)

Across the entire evaluation suite, the aggregated results illustrate MathTok's efficiency:

Metric MathTok GPT-2 SentencePiece Character-Level
Level 1 β€” SCR (struct_score / tokens) 0.9161 0.4251 0.3696 0.4005
Level 2 β€” Semantic Density (math / total) 0.5633 0.1838 0.1499 β€”
Level 3 β€” Structural Efficiency (relations / tokens) 0.2492 N/A N/A β€”
SCR Improvement Factor (MathTok vs. Baseline) β€” 2.16x 2.48x 2.29x

2. Canonical Convergence & Consistency (Jaccard Overlap)

For mathematically equivalent pairs, MathTok achieves perfect Jaccard alignment (Jaccard = 1.0), whereas standard text-based tokenizers suffer significant fragmentation:

Expression Pair MathTok Jaccard GPT-2 Jaccard SentencePiece Jaccard Convergence Status
x + 2 vs. 2 + x 1.000 0.200 1.000 CONVERGED (100%)
a*b + a*c vs. a*(b+c) 1.000 0.444 0.625 CONVERGED (100%)
(x+1)^2 vs. x^2+2x+1 1.000 0.273 0.222 CONVERGED (100%)
x^2 - y^2 vs. (x+y)*(x-y) 1.000 0.091 0.300 CONVERGED (100%)
sin(x)^2 + cos(x)^2 vs. 1 1.000 0.000 0.000 CONVERGED (100%)
2*x + 2*y vs. 2*(x+y) 1.000 0.444 0.571 CONVERGED (100%)
x*y + x*z vs. x*(y+z) 1.000 0.444 0.625 CONVERGED (100%)
a^2 + 2*a*b + b^2 vs. (a+b)^2 1.000 0.364 0.455 CONVERGED (100%)

3. LaTeX vs. ASCII Format Invariance

MathTok perfectly converges inputs in differing representations to identical structural sequences, while subword tokenizers have severe variance:

ASCII Expression LaTeX Expression MathTok same? MT tokens A/L GPT-2 tokens A/L SP tokens A/L
sin(x^2) \sin(x^2) YES (1.00) 8 / 8 6 / 7 6 / 6
sqrt(x^2 + 1) \sqrt{x^2 + 1} YES (1.00) 11 / 11 9 / 10 9 / 9
log(x) \ln(x) YES (1.00) 6 / 6 4 / 5 6 / 6
exp(x) e^x YES (1.00) 6 / 6 4 / 3 6 / 3
x/y \frac{x}{y} YES (1.00) 6 / 6 3 / 7 3 / 9
int(x^2, x) \int x^2 dx NO (~/fallback) 1 / 10 8 / 6 8 / 7
diff(sin(x), x) \frac{d}{dx}\sin(x) YES (1.00) 6 / 6 8 / 11 14 / 16
factorial(n) n! YES (1.00) 6 / 6 5 / 2 11 / 3

πŸš€ Custom Attention Integration Patterns

The core value of MathTok for downstream machine learning practitioners is the Layer 6 Attention Hints. By translating tree relationships into standard masking shapes, model creators can train structure-aware networks natively.

Below are three attention mask designs that can be constructed directly from the outputs of to_attention_mask_hints():

1. Parent-Child Hierarchical Mask

Encourages top-down syntactic attention. Nodes are only allowed to attend to their direct parent or child node.

       [+ (root)]             Parent Attention Mask Matrix:
        /      \              
     [x]       [3]            [ ] [+ (root)] [x] [3]
      |                       [+ (root)]   1    1   1
    [sin]                     [x]          1    1   0
                              [3]          1    0   1

2. Sibling Horizontal Mask

Focuses horizontal attention across operands of identical scopes (e.g., connecting operands inside an addition sequence, $a$ and $b$ and $c$, without parent noise).

3. Subtree Scope Mask

A highly effective block mask for mathematical reasoning. Restricts attention strictly within a subtree, isolating independent sub-expressions during reasoning loops.


🎯 Codebase Evaluation & Recommendations

Key Strengths

  1. Outstanding Structural Integrity: Modularity is excellent. Clear abstraction separation (canonicalization, tokenization, serialization, and vocabulary grouping) makes codebase expansion extremely straightforward.
  2. HuggingFace Compatibility: Subclassing/wrapping the standard tokenizer class ensures immediate, zero-friction integration with existing libraries like PyTorch and HuggingFace.
  3. Rigorous Validation: The inclusion of validator.py and the round-trip checking logic demonstrates high development standards.
  4. Reliability Guards: LRU caches, concurrency thread pools, and recursion limits make this pipeline safe for server-side deployment.

Recommended Enhancements

  • Vocabulary Extension: Dynamically augment _VAR_MAP in ast_generator.py to natively support multi-character variables (e.g., physics variables like $v_{\text{init}}$ or matrix names) without splitting them into generic token placeholders.
  • SymPy Parser Customisation: SymPy's LaTeX parser can occasionally fail on non-standard, custom LaTeX macros. Adding pre-processing ASCII/LaTeX regex cleaners in lexer.py prior to passing them to SymPy will improve the parse success rate of dirty online forum data.
  • TDF Precision: In case of multi-nested subtrees (e.g., highly deeply-nested fractions), customize the tree depth calculation in metrics.py to evaluate structural depths on custom mathematical representations rather than internal SymPy structures.

Citation Reference

@article{mathtok2026,
  title   = {MathTok: A Hybrid Canonicalized AST-Based Tokenization Framework
             for Mathematical Language Modeling},
  author  = {Anonymous},
  year    = {2026},
  note    = {Under review}
}