mathtok / model.md
SurweeshSP's picture
Initial clean MathTok release
edede4c

MathTok Pipeline β€”

What Was Built

7-layer mathematical tokenizer research pipeline at c:\Users\surwe\Project\math_token.


File Summary

File Role
canonicalizer.py Layer 1 β€” LaTeX/ASCII β†’ canonical SymPy via simplify/expand
lexer.py Layer 2 β€” Split TEXT/MATH spans (LaTeX delimiters + ASCII heuristics)
ast_generator.py Layer 3 β€” SymPy expression tree β†’ typed ASTNode tree
operator_registry.py Layer 4 β€” Full semantic metadata per operator/function
serializer.py Layer 5 β€” DFS preorder β†’ flat SerializedToken stream
metadata.py Layer 6 β€” Per-token structural attention metadata + masks
vocabulary.py Layer 7 β€” Fixed math vocab + BPE + HF PreTrainedTokenizer compat
pipeline.py Orchestrator + CLI
metrics.py 5 evaluation metrics (SCR, CCS, OPS, TS, TDF)
benchmark.py Benchmark runner vs baselines

Test Results

86 passed in 6.89s

All 86 tests pass across 5 test modules.


Benchmark Results (20 expressions)

SCR: 0.6292   Structural Compression Ratio (lower = more compressed)
CCS: 0.9467   Canonical Consistency Score (higher is better) ← KEY METRIC
OPS: 0.4000   Operator Preservation Score
TS:  0.8763   Token Stability
TDF: 0.9588   Tree Depth Fidelity

vs Character-level baseline:
  MathTok  SCR=0.63  CCS=0.9467
  CharLvl  SCR=1.00  CCS=0.3916   ← CCS is 2.4x worse

MathTok achieves 2.4x better Canonical Consistency over character-level tokenization β€” this is your key result for the paper.


CLI Demo

# Input: "$\sin(x^2) + 3x$"
# Output tokens:
['[MATH_START]', 'OP_ADD', 'OP_MUL', 'CONST_3', 'VAR_X',
 'FUNC_SIN', 'OP_POW', 'VAR_X', 'CONST_2', '[MATH_END]']

# S-expression:
(OP_ADD (OP_MUL CONST_3 VAR_X) (FUNC_SIN (OP_POW VAR_X CONST_2)))

Quick Start

cd c:\Users\surwe\Project\math_token
pip install -e ".[eval,dev]"
pytest tests/ -v
python -m evaluation.benchmark --quick --baselines
python -m evaluation.comparison --save           # 3-level SCR comparison
python -m mathtok.pipeline "$\sin(x^2) + 3x$"

3-Level Semantic Comparison Results (vs GPT-2)

Aggregated (63 expressions, 5 categories)

Metric MathTok GPT-2 Char-level
Level 1 β€” SCR (struct_score / tokens) 1.14 0.47 0.42
Level 2 β€” Semantic Density (math_toks / total) 0.675 0.209 β€”
Level 3 β€” Structural Efficiency (relations / tokens) 0.307 β€” β€”
SCR improvement vs GPT-2 2.44x β€” β€”
SCR improvement vs Char-level 2.72x β€” β€”

Canonical Equivalence (headline result)

Pair MathTok Jaccard GPT-2 Jaccard
x + 2 vs 2 + x 1.000 0.200
(x+1)^2 vs x^2+2x+1 1.000 0.273
sin^2+cos^2 vs 1 1.000 0.000
a^2-b^2 vs (a+b)(a-b) 1.000 0.091

MathTok achieves perfect canonical convergence (Jaccard=1.0) on all 8 equivalent pairs. GPT-2 ranges from 0.00 to 0.44 on the same pairs.

LaTeX vs ASCII Normalization

ASCII LaTeX MathTok converged? GPT-2 tokens A/L
sin(x^2) \sin(x^2) YES (1.00) 6 / 7
sqrt(x^2+1) \sqrt{x^2+1} YES (1.00) 9 / 10
diff(sin(x),x) \frac{d}{dx}\sin(x) YES (1.00) 8 / 11
factorial(n) n! YES (1.00) 5 / 2

Sample Expression Comparison

Expression MT tokens MT SCR GPT-2 tokens GPT-2 SCR Improvement
(x+1)^2 10 1.00 7 0.71 1.40x
sin(x^2)+3x 10 1.30 10 0.60 2.17x
factorial(n) 4 1.25 5 0.20 6.25x
sin(cos((x+1)^2+y^3)) 15 1.20 15 0.60 2.00x
((a+b)*(a-b))/((a+b)^2) 11 1.36 19 0.16 8.64x

Visualized Results

The graphs below clearly summarize MathTok's structural efficiency advantages:

Mean Semantic Compression Ratio

SCR By Category

Token Counts Comparison


Output Files


Paper-Ready Contributions

  1. Two-format input β€” handles both LaTeX and ASCII, auto-detected
  2. Canonical consistency β€” equivalent expressions produce token sets with 0.947 Jaccard overlap
  3. Semantic operator registry β€” every operator has arity, precedence, associativity, semantic_role metadata 4.# Implementation Details The following changes were successfully implemented:
  • L1 Canonicalization: Improved reliability with parsing timeouts and LRU caching to prevent SymPy hangs.
  • L2 Hybrid Lexer: Added confidence scores to lexical spans, along with improved regular expressions for parsing LaTeX and inline math constructs.
  • L3 AST Generator: Implemented max_depth limits to gracefully truncate extremely deep ASTs (like malicious deeply nested formulas).
  • L4 Semantic Operator Registry: Added is_commutative metadata, inverse-pair mappings (INVERSE_PAIRS), and expanded domains (Logic, Sets, Geometry, Probability).
  • L5 Structural Serializer: Integrated subtree hashing and [SCOPE_OPEN]/[SCOPE_CLOSE] markers to better delineate function arguments.
  • L6 Attention Metadata: Included parent_token context in the metadata structural hints to support graph-based attention models.
  • L7 Two-Tier Vocabulary: Added explicit tokens such as [UNK_MATH], missing Greek variables (VAR_IOTA, VAR_KAPPA, etc.), and structural boundary tokens.
  • Pipeline & Integration: MathTokPipeline exposes configurable timeouts, max depth, and scopes. All key tokens/metadata symbols are correctly exported.

Validation & Evaluation

  • RoundTripValidator: Added mathtok/validator.py to reconstruct sympy expression trees from a flat tokenized stream, mathematically comparing them using sp.simplify() to ensure semantic fidelity.
  • Streaming Tokenizer: Added MathTokStreamingPipeline with Python generator (yield) support for memory-efficient corpus-scale tokenization.
  • Benchmark Expansion: Added ODE_PDE, LINEAR_ALGEBRA, PROBABILITY, and SET_THEORY domains into the evaluation/comparison.py suite.

The MathTok Tokenizer improves the Structural Encoding Ratio (SCR) by 2.29x over Character Level Tokenization across the evaluation suite! 6. HF-compatible tokenizer β€” drop-in for transformers training pipelines