	# MathTok Pipeline

	## What Was Built

	A 7-layer mathematical tokenizer research pipeline, located at `c:\Users\surwe\Project\math_token`.

	---

	## File Summary

	| File | Role |
	|------|------|
	| [canonicalizer.py](file:///c:/Users/surwe/Project/math_token/mathtok/canonicalizer.py) | Layer 1: LaTeX/ASCII → canonical SymPy via simplify/expand |
	| [lexer.py](file:///c:/Users/surwe/Project/math_token/mathtok/lexer.py) | Layer 2: Split TEXT/MATH spans (LaTeX delimiters + ASCII heuristics) |
	| [ast_generator.py](file:///c:/Users/surwe/Project/math_token/mathtok/ast_generator.py) | Layer 3: SymPy expression tree → typed ASTNode tree |
	| [operator_registry.py](file:///c:/Users/surwe/Project/math_token/mathtok/operator_registry.py) | Layer 4: Full semantic metadata per operator/function |
	| [serializer.py](file:///c:/Users/surwe/Project/math_token/mathtok/serializer.py) | Layer 5: DFS preorder → flat SerializedToken stream |
	| [metadata.py](file:///c:/Users/surwe/Project/math_token/mathtok/metadata.py) | Layer 6: Per-token structural attention metadata + masks |
	| [vocabulary.py](file:///c:/Users/surwe/Project/math_token/mathtok/vocabulary.py) | Layer 7: Fixed math vocab + BPE + HF PreTrainedTokenizer compat |
	| [pipeline.py](file:///c:/Users/surwe/Project/math_token/mathtok/pipeline.py) | Orchestrator + CLI |
	| [metrics.py](file:///c:/Users/surwe/Project/math_token/evaluation/metrics.py) | 5 evaluation metrics (SCR, CCS, OPS, TS, TDF) |
	| [benchmark.py](file:///c:/Users/surwe/Project/math_token/evaluation/benchmark.py) | Benchmark runner vs baselines |
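
As a rough illustration of the Layer 2 idea (splitting TEXT and MATH spans on LaTeX delimiters), here is a minimal sketch. It handles only `$...$` and `$$...$$` and omits the ASCII heuristics. The `Span` type and `split_spans` function are hypothetical names chosen for this example, not the actual `lexer.py` API.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    kind: str  # "TEXT" or "MATH"
    text: str


# $$...$$ is tried before $...$ so display math is not split in two.
_MATH_RE = re.compile(r"\$\$(.+?)\$\$|\$(.+?)\$", re.DOTALL)


def split_spans(source: str) -> List[Span]:
    """Split a string into alternating TEXT and MATH spans."""
    spans: List[Span] = []
    pos = 0
    for match in _MATH_RE.finditer(source):
        if match.start() > pos:
            spans.append(Span("TEXT", source[pos:match.start()]))
        body = match.group(1) if match.group(1) is not None else match.group(2)
        spans.append(Span("MATH", body.strip()))
        pos = match.end()
    if pos < len(source):
        spans.append(Span("TEXT", source[pos:]))
    return spans


if __name__ == "__main__":
    for span in split_spans(r"Let $\sin(x^2) + 3x$ be given."):
        print(span)
```

Only the MATH spans would then be handed to the later layers; TEXT spans could go to an ordinary subword tokenizer.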

	---

	## Test Results

	```
	86 passed in 6.89s
	```

	All 86 tests pass across 5 test modules.

	---

	## Benchmark Results (20 expressions)

	```
	SCR: 0.6292 Structural Compression Ratio (lower = more compressed)
	CCS: 0.9467 Canonical Consistency Score (higher is better) ← KEY METRIC
	OPS: 0.4000 Operator Preservation Score
	TS: 0.8763 Token Stability
	TDF: 0.9588 Tree Depth Fidelity

	vs Character-level baseline:
	MathTok SCR=0.63 CCS=0.9467
	CharLvl SCR=1.00 CCS=0.3916 ← CCS is 2.4x worse
	```

	MathTok's Canonical Consistency Score is about 2.4x that of character-level tokenization (0.9467 vs 0.3916). This is the key result for the paper.

	---

	## CLI Demo

	```text
	# Input: "$\sin(x^2) + 3x$"
	# Output tokens:
	['[MATH_START]', 'OP_ADD', 'OP_MUL', 'CONST_3', 'VAR_X',
	'FUNC_SIN', 'OP_POW', 'VAR_X', 'CONST_2', '[MATH_END]']

	# S-expression:
	(OP_ADD (OP_MUL CONST_3 VAR_X) (FUNC_SIN (OP_POW VAR_X CONST_2)))
	```
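
The token list above is the preorder traversal (parent first, then children left to right) of the S-expression. A minimal sketch of that traversal follows, assuming the tree is given as nested tuples of the form `(label, child, ...)`. The `preorder` and `serialize` helpers are hypothetical and only stand in for what `serializer.py` does.

```python
from typing import List, Tuple, Union

Node = Union[str, Tuple]  # a leaf label, or (label, child, child, ...)

# Tree for sin(x^2) + 3x, matching the S-expression in the demo.
TREE: Node = (
    "OP_ADD",
    ("OP_MUL", "CONST_3", "VAR_X"),
    ("FUNC_SIN", ("OP_POW", "VAR_X", "CONST_2")),
)


def preorder(node: Node) -> List[str]:
    """Emit the node label, then each child subtree left to right."""
    if isinstance(node, str):
        return [node]
    label, *children = node
    tokens = [label]
    for child in children:
        tokens.extend(preorder(child))
    return tokens


def serialize(tree: Node) -> List[str]:
    """Wrap the preorder stream in the math boundary tokens."""
    return ["[MATH_START]", *preorder(tree), "[MATH_END]"]


if __name__ == "__main__":
    print(serialize(TREE))
```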

	---

	## Quick Start

	```bash
	cd c:\Users\surwe\Project\math_token
	pip install -e ".[eval,dev]"
	pytest tests/ -v
	python -m evaluation.benchmark --quick --baselines
	python -m evaluation.comparison --save # 3-level SCR comparison
	python -m mathtok.pipeline "$\sin(x^2) + 3x$"
	```

	---

	## 3-Level Semantic Comparison Results (vs GPT-2)

	### Aggregated (63 expressions, 5 categories)

	| Metric | MathTok | GPT-2 | Char-level |
	|--------|---------|-------|------------|
	| Level 1: SCR (struct_score / tokens) | 1.14 | 0.47 | 0.42 |
	| Level 2: Semantic Density (math_toks / total) | 0.675 | 0.209 | n/a |
	| Level 3: Structural Efficiency (relations / tokens) | 0.307 | n/a | n/a |
	| SCR improvement vs GPT-2 | 2.44x | n/a | n/a |
	| SCR improvement vs Char-level | 2.72x | n/a | n/a |

	### Canonical Equivalence (headline result)

	| Pair | MathTok Jaccard | GPT-2 Jaccard |
	|------|----------------|---------------|
	| `x + 2` vs `2 + x` | 1.000 | 0.200 |
	| `(x+1)^2` vs `x^2+2x+1` | 1.000 | 0.273 |
	| `sin^2+cos^2` vs `1` | 1.000 | 0.000 |
	| `a^2-b^2` vs `(a+b)(a-b)` | 1.000 | 0.091 |

	> MathTok achieves perfect canonical convergence (Jaccard = 1.0) on all 8 equivalent pairs tested; the table shows four of them.
	> GPT-2 ranges from 0.00 to 0.44 across the same 8 pairs.
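
For reference, the Jaccard overlap of two token sets is the size of their intersection divided by the size of their union. The sketch below computes it for `x + 2` vs `2 + x`. Both token lists are hand-written assumptions: the first pair assumes a space-prefixed subword split of the kind byte-level BPE tends to produce, and the second assumes both inputs canonicalize to the same tree with token names in the style of the CLI demo. Neither was produced by running GPT-2 or MathTok.

```python
from typing import Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Return |A & B| / |A | B| over the two token sets (1.0 if both are empty)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0
    return len(set_a & set_b) / len(union)


# Assumed space-prefixed subword split (hypothetical, for illustration).
bpe_a = ["x", " +", " 2"]  # "x + 2"
bpe_b = ["2", " +", " x"]  # "2 + x"

# Assumed canonical tokens: both inputs reduce to the same tree.
canon_a = ["OP_ADD", "VAR_X", "CONST_2"]
canon_b = ["OP_ADD", "VAR_X", "CONST_2"]

if __name__ == "__main__":
    print(jaccard(bpe_a, bpe_b))      # 0.2
    print(jaccard(canon_a, canon_b))  # 1.0
```

Under the assumed subword split only `" +"` is shared, so the overlap is 1 of 5 distinct tokens. Reordering the operands changes which tokens carry the leading space, which is why a surface-level tokenizer scores low on pairs that are mathematically identical.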

	### LaTeX vs ASCII Normalization

	| ASCII | LaTeX | MathTok converged? | GPT-2 tokens (ASCII / LaTeX) |
	|-------|-------|--------------------|------------------|
	| `sin(x^2)` | `\sin(x^2)` | YES (1.00) | 6 / 7 |
	| `sqrt(x^2+1)` | `\sqrt{x^2+1}` | YES (1.00) | 9 / 10 |
	| `diff(sin(x),x)` | `\frac{d}{dx}\sin(x)` | YES (1.00) | 8 / 11 |
	| `factorial(n)` | `n!` | YES (1.00) | 5 / 2 |

	### Sample Expression Comparison

	| Expression | MathTok tokens | MathTok SCR | GPT-2 tokens | GPT-2 SCR | Improvement |
	|-----------|-----------|--------|-------------|-----------|-------------|
	| `(x+1)^2` | 10 | 1.00 | 7 | 0.71 | 1.40x |
	| `sin(x^2)+3x` | 10 | 1.30 | 10 | 0.60 | 2.17x |
	| `factorial(n)` | 4 | 1.25 | 5 | 0.20 | 6.25x |
	| `sin(cos((x+1)^2+y^3))` | 15 | 1.20 | 15 | 0.60 | 2.00x |
	| `((a+b)(a-b))/((a+b)^2)` | 11 | 1.36 | 19 | 0.16 | 8.64x* |

	---

	## Visualized Results

	The graphs below summarize MathTok's structural efficiency advantages:

	![Mean Semantic Compression Ratio](C:/Users/surwe/.gemini/antigravity/brain/01eb059f-3020-404d-8978-3a0d15b17392/scr_comparison.png)

	![SCR By Category](C:/Users/surwe/.gemini/antigravity/brain/01eb059f-3020-404d-8978-3a0d15b17392/scr_by_category.png)

	![Token Counts Comparison](C:/Users/surwe/.gemini/antigravity/brain/01eb059f-3020-404d-8978-3a0d15b17392/token_counts_sample.png)

	---

	## Output Files

	- [comparison_results.jsonl](file:///c:/Users/surwe/Project/math_token/evaluation/results/comparison_results.jsonl) — one JSONL record per expression
	- [comparison_summary.json](file:///c:/Users/surwe/Project/math_token/evaluation/results/comparison_summary.json) — aggregated metrics

	---

	## Paper-Ready Contributions

	1. Two-format input — handles both LaTeX and ASCII, auto-detected
	2. Canonical consistency — equivalent expressions produce token sets with 0.947 Jaccard overlap
	3. Semantic operator registry — every operator has `arity`, `precedence`, `associativity`, `semantic_role` metadata

	## Implementation Details
	The following changes were implemented:
	- L1 Canonicalization: Improved reliability with parsing timeouts and LRU caching to prevent SymPy hangs.
	- L2 Hybrid Lexer: Added confidence scores to lexical spans, along with improved regular expressions for parsing LaTeX and inline math constructs.
	- L3 AST Generator: Added a `max_depth` limit that truncates excessively deep ASTs, such as maliciously nested formulas.
	- L4 Semantic Operator Registry: Added `is_commutative` metadata, inverse-pair mappings (`INVERSE_PAIRS`), and expanded domains (Logic, Sets, Geometry, Probability).
	- L5 Structural Serializer: Integrated subtree hashing and `[SCOPE_OPEN]`/`[SCOPE_CLOSE]` markers to better delineate function arguments.
	- L6 Attention Metadata: Included `parent_token` context in the metadata structural hints to support graph-based attention models.
	- L7 Two-Tier Vocabulary: Added explicit tokens such as `[UNK_MATH]`, missing Greek variables (`VAR_IOTA`, `VAR_KAPPA`, etc.), and structural boundary tokens.
	- Pipeline & Integration: `MathTokPipeline` exposes configurable timeouts, max depth, and scopes. All key tokens/metadata symbols are correctly exported.
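
To make the registry and subtree-hashing items more concrete, here is a minimal sketch of one plausible way they could combine: a hash that ignores child order whenever the registry marks an operator as commutative. The `OperatorInfo` fields mirror the metadata names listed in this document, but the entries, the precedence values, the binary arities, and the `subtree_hash` function itself are assumptions made for illustration, not the contents of `operator_registry.py` or `serializer.py`.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Tuple, Union

Node = Union[str, Tuple]  # a leaf label, or (label, child, child, ...)


@dataclass(frozen=True)
class OperatorInfo:
    arity: int
    precedence: int
    associativity: str  # "left", "right", or "none"
    semantic_role: str
    is_commutative: bool


# Hypothetical entries; the real registry covers many more operators.
REGISTRY: Dict[str, OperatorInfo] = {
    "OP_ADD": OperatorInfo(2, 10, "left", "arithmetic", True),
    "OP_SUB": OperatorInfo(2, 10, "left", "arithmetic", False),
    "OP_MUL": OperatorInfo(2, 20, "left", "arithmetic", True),
    "OP_POW": OperatorInfo(2, 30, "right", "arithmetic", False),
}

# Hypothetical inverse-pair mapping, stored in both directions.
INVERSE_PAIRS: Dict[str, str] = {"OP_ADD": "OP_SUB", "OP_SUB": "OP_ADD"}


def subtree_hash(node: Node) -> str:
    """Hash a subtree; child order is ignored for commutative operators."""
    if isinstance(node, str):
        payload = node
    else:
        label, *children = node
        child_hashes = [subtree_hash(child) for child in children]
        info = REGISTRY.get(label)
        if info is not None and info.is_commutative:
            child_hashes.sort()  # x + 2 and 2 + x hash identically
        payload = label + "(" + ",".join(child_hashes) + ")"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

With this scheme `("OP_ADD", "VAR_X", "CONST_2")` and `("OP_ADD", "CONST_2", "VAR_X")` share a hash, while swapping the operands of `OP_POW` produces a different one.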

	## Validation & Evaluation
	- RoundTripValidator: Added `mathtok/validator.py` to reconstruct `sympy` expression trees from a flat tokenized stream, mathematically comparing them using `sp.simplify()` to ensure semantic fidelity.
	- Streaming Tokenizer: Added `MathTokStreamingPipeline` with Python generator (`yield`) support for memory-efficient corpus-scale tokenization.
	- Benchmark Expansion: Added `ODE_PDE`, `LINEAR_ALGEBRA`, `PROBABILITY`, and `SET_THEORY` domains into the `evaluation/comparison.py` suite.
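
The round-trip idea can be sketched without SymPy: a preorder token stream is enough to rebuild the tree as long as each operator's arity is known. The example below rebuilds the CLI demo expression and prints it back as an S-expression. The `ARITY` table, `rebuild`, and `to_sexpr` are hypothetical helpers. Unlike the validator described above, this sketch only checks structure and does not compare expressions with `sp.simplify()`.

```python
from typing import Dict, Iterator, List, Tuple, Union

Node = Union[str, Tuple]  # a leaf label, or (label, child, child, ...)

# Assumed arities for the operators in the CLI demo; other tokens are leaves.
ARITY: Dict[str, int] = {"OP_ADD": 2, "OP_MUL": 2, "OP_POW": 2, "FUNC_SIN": 1}
BOUNDARY = {"[MATH_START]", "[MATH_END]"}


def rebuild(tokens: List[str]) -> Node:
    """Rebuild a nested-tuple tree from a preorder token stream."""
    stream: Iterator[str] = iter([t for t in tokens if t not in BOUNDARY])

    def parse() -> Node:
        label = next(stream, None)
        if label is None:
            raise ValueError("token stream ended before the tree was complete")
        arity = ARITY.get(label, 0)
        if arity == 0:
            return label
        children = [parse() for _ in range(arity)]
        return (label, *children)

    tree = parse()
    if next(stream, None) is not None:
        raise ValueError("trailing tokens after a complete expression")
    return tree


def to_sexpr(node: Node) -> str:
    """Render a nested-tuple tree as an S-expression string."""
    if isinstance(node, str):
        return node
    return "(" + " ".join(to_sexpr(part) for part in node) + ")"


if __name__ == "__main__":
    demo = ["[MATH_START]", "OP_ADD", "OP_MUL", "CONST_3", "VAR_X",
            "FUNC_SIN", "OP_POW", "VAR_X", "CONST_2", "[MATH_END]"]
    print(to_sexpr(rebuild(demo)))
```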

	> [!NOTE]
	> MathTok improves SCR by 2.29x over character-level tokenization across the evaluation suite.

	6. HF-compatible tokenizer: drop-in for `transformers` training pipelines