File size: 7,821 Bytes

edede4c

# MathTok Pipeline — 

## What Was Built

7-layer mathematical tokenizer research pipeline at `c:\Users\surwe\Project\math_token`.

---

## File Summary

| File | Role |
|------|------|
| [canonicalizer.py](file:///c:/Users/surwe/Project/math_token/mathtok/canonicalizer.py) | Layer 1 — LaTeX/ASCII → canonical SymPy via simplify/expand |
| [lexer.py](file:///c:/Users/surwe/Project/math_token/mathtok/lexer.py) | Layer 2 — Split TEXT/MATH spans (LaTeX delimiters + ASCII heuristics) |
| [ast_generator.py](file:///c:/Users/surwe/Project/math_token/mathtok/ast_generator.py) | Layer 3 — SymPy expression tree → typed ASTNode tree |
| [operator_registry.py](file:///c:/Users/surwe/Project/math_token/mathtok/operator_registry.py) | Layer 4 — Full semantic metadata per operator/function |
| [serializer.py](file:///c:/Users/surwe/Project/math_token/mathtok/serializer.py) | Layer 5 — DFS preorder → flat SerializedToken stream |
| [metadata.py](file:///c:/Users/surwe/Project/math_token/mathtok/metadata.py) | Layer 6 — Per-token structural attention metadata + masks |
| [vocabulary.py](file:///c:/Users/surwe/Project/math_token/mathtok/vocabulary.py) | Layer 7 — Fixed math vocab + BPE + HF PreTrainedTokenizer compat |
| [pipeline.py](file:///c:/Users/surwe/Project/math_token/mathtok/pipeline.py) | Orchestrator + CLI |
| [metrics.py](file:///c:/Users/surwe/Project/math_token/evaluation/metrics.py) | 5 evaluation metrics (SCR, CCS, OPS, TS, TDF) |
| [benchmark.py](file:///c:/Users/surwe/Project/math_token/evaluation/benchmark.py) | Benchmark runner vs baselines |

---

## Test Results

```
86 passed in 6.89s
```

All 86 tests pass across 5 test modules.

---

## Benchmark Results (20 expressions)

```
SCR: 0.6292   Structural Compression Ratio (lower = more compressed)
CCS: 0.9467   Canonical Consistency Score (higher is better) ← KEY METRIC
OPS: 0.4000   Operator Preservation Score
TS:  0.8763   Token Stability
TDF: 0.9588   Tree Depth Fidelity

vs Character-level baseline:
  MathTok  SCR=0.63  CCS=0.9467
  CharLvl  SCR=1.00  CCS=0.3916   ← CCS is 2.4x worse
```

**MathTok achieves 2.4x better Canonical Consistency over character-level tokenization** — this is your key result for the paper.

---

## CLI Demo

```bash
# Input: "$\sin(x^2) + 3x$"
# Output tokens:
['[MATH_START]', 'OP_ADD', 'OP_MUL', 'CONST_3', 'VAR_X',
 'FUNC_SIN', 'OP_POW', 'VAR_X', 'CONST_2', '[MATH_END]']

# S-expression:
(OP_ADD (OP_MUL CONST_3 VAR_X) (FUNC_SIN (OP_POW VAR_X CONST_2)))
```

---

## Quick Start

```bash
cd c:\Users\surwe\Project\math_token
pip install -e ".[eval,dev]"
pytest tests/ -v
python -m evaluation.benchmark --quick --baselines
python -m evaluation.comparison --save           # 3-level SCR comparison
python -m mathtok.pipeline "$\sin(x^2) + 3x$"
```

---

## 3-Level Semantic Comparison Results (vs GPT-2)

### Aggregated (63 expressions, 5 categories)

| Metric | MathTok | GPT-2 | Char-level |
|--------|---------|-------|------------|
| **Level 1 — SCR** (struct_score / tokens) | **1.14** | 0.47 | 0.42 |
| **Level 2 — Semantic Density** (math_toks / total) | **0.675** | 0.209 | — |
| **Level 3 — Structural Efficiency** (relations / tokens) | **0.307** | — | — |
| **SCR improvement vs GPT-2** | **2.44x** | — | — |
| **SCR improvement vs Char-level** | **2.72x** | — | — |

### Canonical Equivalence (headline result)

| Pair | MathTok Jaccard | GPT-2 Jaccard |
|------|----------------|---------------|
| `x + 2` vs `2 + x` | **1.000** | 0.200 |
| `(x+1)^2` vs `x^2+2x+1` | **1.000** | 0.273 |
| `sin^2+cos^2` vs `1` | **1.000** | 0.000 |
| `a^2-b^2` vs `(a+b)(a-b)` | **1.000** | 0.091 |

> MathTok achieves **perfect canonical convergence (Jaccard=1.0)** on all 8 equivalent pairs.
> GPT-2 ranges from 0.00 to 0.44 on the same pairs.

### LaTeX vs ASCII Normalization

| ASCII | LaTeX | MathTok converged? | GPT-2 tokens A/L |
|-------|-------|--------------------|------------------|
| `sin(x^2)` | `\sin(x^2)` | **YES (1.00)** | 6 / 7 |
| `sqrt(x^2+1)` | `\sqrt{x^2+1}` | **YES (1.00)** | 9 / 10 |
| `diff(sin(x),x)` | `\frac{d}{dx}\sin(x)` | **YES (1.00)** | 8 / 11 |
| `factorial(n)` | `n!` | **YES (1.00)** | 5 / 2 |

### Sample Expression Comparison

| Expression | MT tokens | MT SCR | GPT-2 tokens | GPT-2 SCR | Improvement |
|-----------|-----------|--------|-------------|-----------|-------------|
| `(x+1)^2` | 10 | 1.00 | 7 | 0.71 | **1.40x** |
| `sin(x^2)+3x` | 10 | 1.30 | 10 | 0.60 | **2.17x** |
| `factorial(n)` | 4 | 1.25 | 5 | 0.20 | **6.25x** |
| `sin(cos((x+1)^2+y^3))` | 15 | 1.20 | 15 | 0.60 | **2.00x** |
| `((a+b)*(a-b))/((a+b)^2)` | 11 | 1.36 | 19 | 0.16 | **8.64x** |

---

## Visualized Results

The graphs below clearly summarize MathTok's structural efficiency advantages:

![Mean Semantic Compression Ratio](C:/Users/surwe/.gemini/antigravity/brain/01eb059f-3020-404d-8978-3a0d15b17392/scr_comparison.png)

![SCR By Category](C:/Users/surwe/.gemini/antigravity/brain/01eb059f-3020-404d-8978-3a0d15b17392/scr_by_category.png)

![Token Counts Comparison](C:/Users/surwe/.gemini/antigravity/brain/01eb059f-3020-404d-8978-3a0d15b17392/token_counts_sample.png)

---

## Output Files

- [comparison_results.jsonl](file:///c:/Users/surwe/Project/math_token/evaluation/results/comparison_results.jsonl) — one JSONL record per expression
- [comparison_summary.json](file:///c:/Users/surwe/Project/math_token/evaluation/results/comparison_summary.json) — aggregated metrics

---

## Paper-Ready Contributions

1. **Two-format input** — handles both LaTeX and ASCII, auto-detected
2. **Canonical consistency** — equivalent expressions produce token sets with 0.947 Jaccard overlap
3. **Semantic operator registry** — every operator has `arity`, `precedence`, `associativity`, `semantic_role` metadata
4.# Implementation Details
The following changes were successfully implemented:
- **L1 Canonicalization**: Improved reliability with parsing timeouts and LRU caching to prevent SymPy hangs.
- **L2 Hybrid Lexer**: Added confidence scores to lexical spans, along with improved regular expressions for parsing LaTeX and inline math constructs.
- **L3 AST Generator**: Implemented `max_depth` limits to gracefully truncate extremely deep ASTs (like malicious deeply nested formulas).
- **L4 Semantic Operator Registry**: Added `is_commutative` metadata, inverse-pair mappings (`INVERSE_PAIRS`), and expanded domains (Logic, Sets, Geometry, Probability).
- **L5 Structural Serializer**: Integrated subtree hashing and `[SCOPE_OPEN]`/`[SCOPE_CLOSE]` markers to better delineate function arguments.
- **L6 Attention Metadata**: Included `parent_token` context in the metadata structural hints to support graph-based attention models.
- **L7 Two-Tier Vocabulary**: Added explicit tokens such as `[UNK_MATH]`, missing Greek variables (`VAR_IOTA`, `VAR_KAPPA`, etc.), and structural boundary tokens.
- **Pipeline & Integration**: `MathTokPipeline` exposes configurable timeouts, max depth, and scopes. All key tokens/metadata symbols are correctly exported.

# Validation & Evaluation
- **RoundTripValidator**: Added `mathtok/validator.py` to reconstruct `sympy` expression trees from a flat tokenized stream, mathematically comparing them using `sp.simplify()` to ensure semantic fidelity.
- **Streaming Tokenizer**: Added `MathTokStreamingPipeline` with Python generator (`yield`) support for memory-efficient corpus-scale tokenization.
- **Benchmark Expansion**: Added `ODE_PDE`, `LINEAR_ALGEBRA`, `PROBABILITY`, and `SET_THEORY` domains into the `evaluation/comparison.py` suite.

> [!NOTE]
> The MathTok Tokenizer improves the Structural Encoding Ratio (SCR) by **2.29x** over Character Level Tokenization across the evaluation suite!
6. **HF-compatible tokenizer** — drop-in for transformers training pipelines