Text Generation
Transformers
English
custom
tokenizer
symbolic-ai
mathematics
llm
reasoning
ast
compiler
nlp
deep-learning
machine-learning
mathematical-reasoning
symbolic-reasoning
tokenization
parser
artificial-intelligence
Eval Results (legacy)
Instructions to use SurweeshSP/mathtok with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use SurweeshSP/mathtok with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="SurweeshSP/mathtok")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("SurweeshSP/mathtok", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use SurweeshSP/mathtok with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "SurweeshSP/mathtok" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "SurweeshSP/mathtok", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/SurweeshSP/mathtok
- SGLang
How to use SurweeshSP/mathtok with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "SurweeshSP/mathtok" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "SurweeshSP/mathtok", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "SurweeshSP/mathtok" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "SurweeshSP/mathtok", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use SurweeshSP/mathtok with Docker Model Runner:
docker model run hf.co/SurweeshSP/mathtok
File size: 7,821 Bytes
edede4c | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 | # MathTok Pipeline β
## What Was Built
7-layer mathematical tokenizer research pipeline at `c:\Users\surwe\Project\math_token`.
---
## File Summary
| File | Role |
|------|------|
| [canonicalizer.py](file:///c:/Users/surwe/Project/math_token/mathtok/canonicalizer.py) | Layer 1 β LaTeX/ASCII β canonical SymPy via simplify/expand |
| [lexer.py](file:///c:/Users/surwe/Project/math_token/mathtok/lexer.py) | Layer 2 β Split TEXT/MATH spans (LaTeX delimiters + ASCII heuristics) |
| [ast_generator.py](file:///c:/Users/surwe/Project/math_token/mathtok/ast_generator.py) | Layer 3 β SymPy expression tree β typed ASTNode tree |
| [operator_registry.py](file:///c:/Users/surwe/Project/math_token/mathtok/operator_registry.py) | Layer 4 β Full semantic metadata per operator/function |
| [serializer.py](file:///c:/Users/surwe/Project/math_token/mathtok/serializer.py) | Layer 5 β DFS preorder β flat SerializedToken stream |
| [metadata.py](file:///c:/Users/surwe/Project/math_token/mathtok/metadata.py) | Layer 6 β Per-token structural attention metadata + masks |
| [vocabulary.py](file:///c:/Users/surwe/Project/math_token/mathtok/vocabulary.py) | Layer 7 β Fixed math vocab + BPE + HF PreTrainedTokenizer compat |
| [pipeline.py](file:///c:/Users/surwe/Project/math_token/mathtok/pipeline.py) | Orchestrator + CLI |
| [metrics.py](file:///c:/Users/surwe/Project/math_token/evaluation/metrics.py) | 5 evaluation metrics (SCR, CCS, OPS, TS, TDF) |
| [benchmark.py](file:///c:/Users/surwe/Project/math_token/evaluation/benchmark.py) | Benchmark runner vs baselines |
---
## Test Results
```
86 passed in 6.89s
```
All 86 tests pass across 5 test modules.
---
## Benchmark Results (20 expressions)
```
SCR: 0.6292 Structural Compression Ratio (lower = more compressed)
CCS: 0.9467 Canonical Consistency Score (higher is better) β KEY METRIC
OPS: 0.4000 Operator Preservation Score
TS: 0.8763 Token Stability
TDF: 0.9588 Tree Depth Fidelity
vs Character-level baseline:
MathTok SCR=0.63 CCS=0.9467
CharLvl SCR=1.00 CCS=0.3916 β CCS is 2.4x worse
```
**MathTok achieves 2.4x better Canonical Consistency over character-level tokenization** β this is your key result for the paper.
---
## CLI Demo
```bash
# Input: "$\sin(x^2) + 3x$"
# Output tokens:
['[MATH_START]', 'OP_ADD', 'OP_MUL', 'CONST_3', 'VAR_X',
'FUNC_SIN', 'OP_POW', 'VAR_X', 'CONST_2', '[MATH_END]']
# S-expression:
(OP_ADD (OP_MUL CONST_3 VAR_X) (FUNC_SIN (OP_POW VAR_X CONST_2)))
```
---
## Quick Start
```bash
cd c:\Users\surwe\Project\math_token
pip install -e ".[eval,dev]"
pytest tests/ -v
python -m evaluation.benchmark --quick --baselines
python -m evaluation.comparison --save # 3-level SCR comparison
python -m mathtok.pipeline "$\sin(x^2) + 3x$"
```
---
## 3-Level Semantic Comparison Results (vs GPT-2)
### Aggregated (63 expressions, 5 categories)
| Metric | MathTok | GPT-2 | Char-level |
|--------|---------|-------|------------|
| **Level 1 β SCR** (struct_score / tokens) | **1.14** | 0.47 | 0.42 |
| **Level 2 β Semantic Density** (math_toks / total) | **0.675** | 0.209 | β |
| **Level 3 β Structural Efficiency** (relations / tokens) | **0.307** | β | β |
| **SCR improvement vs GPT-2** | **2.44x** | β | β |
| **SCR improvement vs Char-level** | **2.72x** | β | β |
### Canonical Equivalence (headline result)
| Pair | MathTok Jaccard | GPT-2 Jaccard |
|------|----------------|---------------|
| `x + 2` vs `2 + x` | **1.000** | 0.200 |
| `(x+1)^2` vs `x^2+2x+1` | **1.000** | 0.273 |
| `sin^2+cos^2` vs `1` | **1.000** | 0.000 |
| `a^2-b^2` vs `(a+b)(a-b)` | **1.000** | 0.091 |
> MathTok achieves **perfect canonical convergence (Jaccard=1.0)** on all 8 equivalent pairs.
> GPT-2 ranges from 0.00 to 0.44 on the same pairs.
### LaTeX vs ASCII Normalization
| ASCII | LaTeX | MathTok converged? | GPT-2 tokens A/L |
|-------|-------|--------------------|------------------|
| `sin(x^2)` | `\sin(x^2)` | **YES (1.00)** | 6 / 7 |
| `sqrt(x^2+1)` | `\sqrt{x^2+1}` | **YES (1.00)** | 9 / 10 |
| `diff(sin(x),x)` | `\frac{d}{dx}\sin(x)` | **YES (1.00)** | 8 / 11 |
| `factorial(n)` | `n!` | **YES (1.00)** | 5 / 2 |
### Sample Expression Comparison
| Expression | MT tokens | MT SCR | GPT-2 tokens | GPT-2 SCR | Improvement |
|-----------|-----------|--------|-------------|-----------|-------------|
| `(x+1)^2` | 10 | 1.00 | 7 | 0.71 | **1.40x** |
| `sin(x^2)+3x` | 10 | 1.30 | 10 | 0.60 | **2.17x** |
| `factorial(n)` | 4 | 1.25 | 5 | 0.20 | **6.25x** |
| `sin(cos((x+1)^2+y^3))` | 15 | 1.20 | 15 | 0.60 | **2.00x** |
| `((a+b)*(a-b))/((a+b)^2)` | 11 | 1.36 | 19 | 0.16 | **8.64x** |
---
## Visualized Results
The graphs below clearly summarize MathTok's structural efficiency advantages:



---
## Output Files
- [comparison_results.jsonl](file:///c:/Users/surwe/Project/math_token/evaluation/results/comparison_results.jsonl) β one JSONL record per expression
- [comparison_summary.json](file:///c:/Users/surwe/Project/math_token/evaluation/results/comparison_summary.json) β aggregated metrics
---
## Paper-Ready Contributions
1. **Two-format input** β handles both LaTeX and ASCII, auto-detected
2. **Canonical consistency** β equivalent expressions produce token sets with 0.947 Jaccard overlap
3. **Semantic operator registry** β every operator has `arity`, `precedence`, `associativity`, `semantic_role` metadata
4.# Implementation Details
The following changes were successfully implemented:
- **L1 Canonicalization**: Improved reliability with parsing timeouts and LRU caching to prevent SymPy hangs.
- **L2 Hybrid Lexer**: Added confidence scores to lexical spans, along with improved regular expressions for parsing LaTeX and inline math constructs.
- **L3 AST Generator**: Implemented `max_depth` limits to gracefully truncate extremely deep ASTs (like malicious deeply nested formulas).
- **L4 Semantic Operator Registry**: Added `is_commutative` metadata, inverse-pair mappings (`INVERSE_PAIRS`), and expanded domains (Logic, Sets, Geometry, Probability).
- **L5 Structural Serializer**: Integrated subtree hashing and `[SCOPE_OPEN]`/`[SCOPE_CLOSE]` markers to better delineate function arguments.
- **L6 Attention Metadata**: Included `parent_token` context in the metadata structural hints to support graph-based attention models.
- **L7 Two-Tier Vocabulary**: Added explicit tokens such as `[UNK_MATH]`, missing Greek variables (`VAR_IOTA`, `VAR_KAPPA`, etc.), and structural boundary tokens.
- **Pipeline & Integration**: `MathTokPipeline` exposes configurable timeouts, max depth, and scopes. All key tokens/metadata symbols are correctly exported.
# Validation & Evaluation
- **RoundTripValidator**: Added `mathtok/validator.py` to reconstruct `sympy` expression trees from a flat tokenized stream, mathematically comparing them using `sp.simplify()` to ensure semantic fidelity.
- **Streaming Tokenizer**: Added `MathTokStreamingPipeline` with Python generator (`yield`) support for memory-efficient corpus-scale tokenization.
- **Benchmark Expansion**: Added `ODE_PDE`, `LINEAR_ALGEBRA`, `PROBABILITY`, and `SET_THEORY` domains into the `evaluation/comparison.py` suite.
> [!NOTE]
> The MathTok Tokenizer improves the Structural Encoding Ratio (SCR) by **2.29x** over Character Level Tokenization across the evaluation suite!
6. **HF-compatible tokenizer** β drop-in for transformers training pipelines
|