File size: 7,821 Bytes
edede4c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
# MathTok Pipeline β€” 

## What Was Built

7-layer mathematical tokenizer research pipeline at `c:\Users\surwe\Project\math_token`.

---

## File Summary

| File | Role |
|------|------|
| [canonicalizer.py](file:///c:/Users/surwe/Project/math_token/mathtok/canonicalizer.py) | Layer 1 β€” LaTeX/ASCII β†’ canonical SymPy via simplify/expand |
| [lexer.py](file:///c:/Users/surwe/Project/math_token/mathtok/lexer.py) | Layer 2 β€” Split TEXT/MATH spans (LaTeX delimiters + ASCII heuristics) |
| [ast_generator.py](file:///c:/Users/surwe/Project/math_token/mathtok/ast_generator.py) | Layer 3 β€” SymPy expression tree β†’ typed ASTNode tree |
| [operator_registry.py](file:///c:/Users/surwe/Project/math_token/mathtok/operator_registry.py) | Layer 4 β€” Full semantic metadata per operator/function |
| [serializer.py](file:///c:/Users/surwe/Project/math_token/mathtok/serializer.py) | Layer 5 β€” DFS preorder β†’ flat SerializedToken stream |
| [metadata.py](file:///c:/Users/surwe/Project/math_token/mathtok/metadata.py) | Layer 6 β€” Per-token structural attention metadata + masks |
| [vocabulary.py](file:///c:/Users/surwe/Project/math_token/mathtok/vocabulary.py) | Layer 7 β€” Fixed math vocab + BPE + HF PreTrainedTokenizer compat |
| [pipeline.py](file:///c:/Users/surwe/Project/math_token/mathtok/pipeline.py) | Orchestrator + CLI |
| [metrics.py](file:///c:/Users/surwe/Project/math_token/evaluation/metrics.py) | 5 evaluation metrics (SCR, CCS, OPS, TS, TDF) |
| [benchmark.py](file:///c:/Users/surwe/Project/math_token/evaluation/benchmark.py) | Benchmark runner vs baselines |

---

## Test Results

```
86 passed in 6.89s
```

All 86 tests pass across 5 test modules.

---

## Benchmark Results (20 expressions)

```
SCR: 0.6292   Structural Compression Ratio (lower = more compressed)
CCS: 0.9467   Canonical Consistency Score (higher is better) ← KEY METRIC
OPS: 0.4000   Operator Preservation Score
TS:  0.8763   Token Stability
TDF: 0.9588   Tree Depth Fidelity

vs Character-level baseline:
  MathTok  SCR=0.63  CCS=0.9467
  CharLvl  SCR=1.00  CCS=0.3916   ← CCS is 2.4x worse
```

**MathTok achieves 2.4x better Canonical Consistency over character-level tokenization** β€” this is your key result for the paper.

---

## CLI Demo

```bash
# Input: "$\sin(x^2) + 3x$"
# Output tokens:
['[MATH_START]', 'OP_ADD', 'OP_MUL', 'CONST_3', 'VAR_X',
 'FUNC_SIN', 'OP_POW', 'VAR_X', 'CONST_2', '[MATH_END]']

# S-expression:
(OP_ADD (OP_MUL CONST_3 VAR_X) (FUNC_SIN (OP_POW VAR_X CONST_2)))
```

---

## Quick Start

```bash
cd c:\Users\surwe\Project\math_token
pip install -e ".[eval,dev]"
pytest tests/ -v
python -m evaluation.benchmark --quick --baselines
python -m evaluation.comparison --save           # 3-level SCR comparison
python -m mathtok.pipeline "$\sin(x^2) + 3x$"
```

---

## 3-Level Semantic Comparison Results (vs GPT-2)

### Aggregated (63 expressions, 5 categories)

| Metric | MathTok | GPT-2 | Char-level |
|--------|---------|-------|------------|
| **Level 1 β€” SCR** (struct_score / tokens) | **1.14** | 0.47 | 0.42 |
| **Level 2 β€” Semantic Density** (math_toks / total) | **0.675** | 0.209 | β€” |
| **Level 3 β€” Structural Efficiency** (relations / tokens) | **0.307** | β€” | β€” |
| **SCR improvement vs GPT-2** | **2.44x** | β€” | β€” |
| **SCR improvement vs Char-level** | **2.72x** | β€” | β€” |

### Canonical Equivalence (headline result)

| Pair | MathTok Jaccard | GPT-2 Jaccard |
|------|----------------|---------------|
| `x + 2` vs `2 + x` | **1.000** | 0.200 |
| `(x+1)^2` vs `x^2+2x+1` | **1.000** | 0.273 |
| `sin^2+cos^2` vs `1` | **1.000** | 0.000 |
| `a^2-b^2` vs `(a+b)(a-b)` | **1.000** | 0.091 |

> MathTok achieves **perfect canonical convergence (Jaccard=1.0)** on all 8 equivalent pairs.
> GPT-2 ranges from 0.00 to 0.44 on the same pairs.

### LaTeX vs ASCII Normalization

| ASCII | LaTeX | MathTok converged? | GPT-2 tokens A/L |
|-------|-------|--------------------|------------------|
| `sin(x^2)` | `\sin(x^2)` | **YES (1.00)** | 6 / 7 |
| `sqrt(x^2+1)` | `\sqrt{x^2+1}` | **YES (1.00)** | 9 / 10 |
| `diff(sin(x),x)` | `\frac{d}{dx}\sin(x)` | **YES (1.00)** | 8 / 11 |
| `factorial(n)` | `n!` | **YES (1.00)** | 5 / 2 |

### Sample Expression Comparison

| Expression | MT tokens | MT SCR | GPT-2 tokens | GPT-2 SCR | Improvement |
|-----------|-----------|--------|-------------|-----------|-------------|
| `(x+1)^2` | 10 | 1.00 | 7 | 0.71 | **1.40x** |
| `sin(x^2)+3x` | 10 | 1.30 | 10 | 0.60 | **2.17x** |
| `factorial(n)` | 4 | 1.25 | 5 | 0.20 | **6.25x** |
| `sin(cos((x+1)^2+y^3))` | 15 | 1.20 | 15 | 0.60 | **2.00x** |
| `((a+b)*(a-b))/((a+b)^2)` | 11 | 1.36 | 19 | 0.16 | **8.64x** |

---

## Visualized Results

The graphs below clearly summarize MathTok's structural efficiency advantages:

![Mean Semantic Compression Ratio](C:/Users/surwe/.gemini/antigravity/brain/01eb059f-3020-404d-8978-3a0d15b17392/scr_comparison.png)

![SCR By Category](C:/Users/surwe/.gemini/antigravity/brain/01eb059f-3020-404d-8978-3a0d15b17392/scr_by_category.png)

![Token Counts Comparison](C:/Users/surwe/.gemini/antigravity/brain/01eb059f-3020-404d-8978-3a0d15b17392/token_counts_sample.png)

---

## Output Files

- [comparison_results.jsonl](file:///c:/Users/surwe/Project/math_token/evaluation/results/comparison_results.jsonl) β€” one JSONL record per expression
- [comparison_summary.json](file:///c:/Users/surwe/Project/math_token/evaluation/results/comparison_summary.json) β€” aggregated metrics

---

## Paper-Ready Contributions

1. **Two-format input** β€” handles both LaTeX and ASCII, auto-detected
2. **Canonical consistency** β€” equivalent expressions produce token sets with 0.947 Jaccard overlap
3. **Semantic operator registry** β€” every operator has `arity`, `precedence`, `associativity`, `semantic_role` metadata
4.# Implementation Details
The following changes were successfully implemented:
- **L1 Canonicalization**: Improved reliability with parsing timeouts and LRU caching to prevent SymPy hangs.
- **L2 Hybrid Lexer**: Added confidence scores to lexical spans, along with improved regular expressions for parsing LaTeX and inline math constructs.
- **L3 AST Generator**: Implemented `max_depth` limits to gracefully truncate extremely deep ASTs (like malicious deeply nested formulas).
- **L4 Semantic Operator Registry**: Added `is_commutative` metadata, inverse-pair mappings (`INVERSE_PAIRS`), and expanded domains (Logic, Sets, Geometry, Probability).
- **L5 Structural Serializer**: Integrated subtree hashing and `[SCOPE_OPEN]`/`[SCOPE_CLOSE]` markers to better delineate function arguments.
- **L6 Attention Metadata**: Included `parent_token` context in the metadata structural hints to support graph-based attention models.
- **L7 Two-Tier Vocabulary**: Added explicit tokens such as `[UNK_MATH]`, missing Greek variables (`VAR_IOTA`, `VAR_KAPPA`, etc.), and structural boundary tokens.
- **Pipeline & Integration**: `MathTokPipeline` exposes configurable timeouts, max depth, and scopes. All key tokens/metadata symbols are correctly exported.

# Validation & Evaluation
- **RoundTripValidator**: Added `mathtok/validator.py` to reconstruct `sympy` expression trees from a flat tokenized stream, mathematically comparing them using `sp.simplify()` to ensure semantic fidelity.
- **Streaming Tokenizer**: Added `MathTokStreamingPipeline` with Python generator (`yield`) support for memory-efficient corpus-scale tokenization.
- **Benchmark Expansion**: Added `ODE_PDE`, `LINEAR_ALGEBRA`, `PROBABILITY`, and `SET_THEORY` domains into the `evaluation/comparison.py` suite.

> [!NOTE]
> The MathTok Tokenizer improves the Structural Encoding Ratio (SCR) by **2.29x** over Character Level Tokenization across the evaluation suite!
6. **HF-compatible tokenizer** β€” drop-in for transformers training pipelines