Instructions to use SurweeshSP/mathtok with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use SurweeshSP/mathtok with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="SurweeshSP/mathtok")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("SurweeshSP/mathtok", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use SurweeshSP/mathtok with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "SurweeshSP/mathtok" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "SurweeshSP/mathtok", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/SurweeshSP/mathtok
- SGLang
How to use SurweeshSP/mathtok with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "SurweeshSP/mathtok" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "SurweeshSP/mathtok", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "SurweeshSP/mathtok" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "SurweeshSP/mathtok", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use SurweeshSP/mathtok with Docker Model Runner:
docker model run hf.co/SurweeshSP/mathtok
MathTok Pipeline β
What Was Built
7-layer mathematical tokenizer research pipeline at c:\Users\surwe\Project\math_token.
File Summary
| File | Role |
|---|---|
| canonicalizer.py | Layer 1 β LaTeX/ASCII β canonical SymPy via simplify/expand |
| lexer.py | Layer 2 β Split TEXT/MATH spans (LaTeX delimiters + ASCII heuristics) |
| ast_generator.py | Layer 3 β SymPy expression tree β typed ASTNode tree |
| operator_registry.py | Layer 4 β Full semantic metadata per operator/function |
| serializer.py | Layer 5 β DFS preorder β flat SerializedToken stream |
| metadata.py | Layer 6 β Per-token structural attention metadata + masks |
| vocabulary.py | Layer 7 β Fixed math vocab + BPE + HF PreTrainedTokenizer compat |
| pipeline.py | Orchestrator + CLI |
| metrics.py | 5 evaluation metrics (SCR, CCS, OPS, TS, TDF) |
| benchmark.py | Benchmark runner vs baselines |
Test Results
86 passed in 6.89s
All 86 tests pass across 5 test modules.
Benchmark Results (20 expressions)
SCR: 0.6292 Structural Compression Ratio (lower = more compressed)
CCS: 0.9467 Canonical Consistency Score (higher is better) β KEY METRIC
OPS: 0.4000 Operator Preservation Score
TS: 0.8763 Token Stability
TDF: 0.9588 Tree Depth Fidelity
vs Character-level baseline:
MathTok SCR=0.63 CCS=0.9467
CharLvl SCR=1.00 CCS=0.3916 β CCS is 2.4x worse
MathTok achieves 2.4x better Canonical Consistency over character-level tokenization β this is your key result for the paper.
CLI Demo
# Input: "$\sin(x^2) + 3x$"
# Output tokens:
['[MATH_START]', 'OP_ADD', 'OP_MUL', 'CONST_3', 'VAR_X',
'FUNC_SIN', 'OP_POW', 'VAR_X', 'CONST_2', '[MATH_END]']
# S-expression:
(OP_ADD (OP_MUL CONST_3 VAR_X) (FUNC_SIN (OP_POW VAR_X CONST_2)))
Quick Start
cd c:\Users\surwe\Project\math_token
pip install -e ".[eval,dev]"
pytest tests/ -v
python -m evaluation.benchmark --quick --baselines
python -m evaluation.comparison --save # 3-level SCR comparison
python -m mathtok.pipeline "$\sin(x^2) + 3x$"
3-Level Semantic Comparison Results (vs GPT-2)
Aggregated (63 expressions, 5 categories)
| Metric | MathTok | GPT-2 | Char-level |
|---|---|---|---|
| Level 1 β SCR (struct_score / tokens) | 1.14 | 0.47 | 0.42 |
| Level 2 β Semantic Density (math_toks / total) | 0.675 | 0.209 | β |
| Level 3 β Structural Efficiency (relations / tokens) | 0.307 | β | β |
| SCR improvement vs GPT-2 | 2.44x | β | β |
| SCR improvement vs Char-level | 2.72x | β | β |
Canonical Equivalence (headline result)
| Pair | MathTok Jaccard | GPT-2 Jaccard |
|---|---|---|
x + 2 vs 2 + x |
1.000 | 0.200 |
(x+1)^2 vs x^2+2x+1 |
1.000 | 0.273 |
sin^2+cos^2 vs 1 |
1.000 | 0.000 |
a^2-b^2 vs (a+b)(a-b) |
1.000 | 0.091 |
MathTok achieves perfect canonical convergence (Jaccard=1.0) on all 8 equivalent pairs. GPT-2 ranges from 0.00 to 0.44 on the same pairs.
LaTeX vs ASCII Normalization
| ASCII | LaTeX | MathTok converged? | GPT-2 tokens A/L |
|---|---|---|---|
sin(x^2) |
\sin(x^2) |
YES (1.00) | 6 / 7 |
sqrt(x^2+1) |
\sqrt{x^2+1} |
YES (1.00) | 9 / 10 |
diff(sin(x),x) |
\frac{d}{dx}\sin(x) |
YES (1.00) | 8 / 11 |
factorial(n) |
n! |
YES (1.00) | 5 / 2 |
Sample Expression Comparison
| Expression | MT tokens | MT SCR | GPT-2 tokens | GPT-2 SCR | Improvement |
|---|---|---|---|---|---|
(x+1)^2 |
10 | 1.00 | 7 | 0.71 | 1.40x |
sin(x^2)+3x |
10 | 1.30 | 10 | 0.60 | 2.17x |
factorial(n) |
4 | 1.25 | 5 | 0.20 | 6.25x |
sin(cos((x+1)^2+y^3)) |
15 | 1.20 | 15 | 0.60 | 2.00x |
((a+b)*(a-b))/((a+b)^2) |
11 | 1.36 | 19 | 0.16 | 8.64x |
Visualized Results
The graphs below clearly summarize MathTok's structural efficiency advantages:
Output Files
- comparison_results.jsonl β one JSONL record per expression
- comparison_summary.json β aggregated metrics
Paper-Ready Contributions
- Two-format input β handles both LaTeX and ASCII, auto-detected
- Canonical consistency β equivalent expressions produce token sets with 0.947 Jaccard overlap
- Semantic operator registry β every operator has
arity,precedence,associativity,semantic_rolemetadata 4.# Implementation Details The following changes were successfully implemented:
- L1 Canonicalization: Improved reliability with parsing timeouts and LRU caching to prevent SymPy hangs.
- L2 Hybrid Lexer: Added confidence scores to lexical spans, along with improved regular expressions for parsing LaTeX and inline math constructs.
- L3 AST Generator: Implemented
max_depthlimits to gracefully truncate extremely deep ASTs (like malicious deeply nested formulas). - L4 Semantic Operator Registry: Added
is_commutativemetadata, inverse-pair mappings (INVERSE_PAIRS), and expanded domains (Logic, Sets, Geometry, Probability). - L5 Structural Serializer: Integrated subtree hashing and
[SCOPE_OPEN]/[SCOPE_CLOSE]markers to better delineate function arguments. - L6 Attention Metadata: Included
parent_tokencontext in the metadata structural hints to support graph-based attention models. - L7 Two-Tier Vocabulary: Added explicit tokens such as
[UNK_MATH], missing Greek variables (VAR_IOTA,VAR_KAPPA, etc.), and structural boundary tokens. - Pipeline & Integration:
MathTokPipelineexposes configurable timeouts, max depth, and scopes. All key tokens/metadata symbols are correctly exported.
Validation & Evaluation
- RoundTripValidator: Added
mathtok/validator.pyto reconstructsympyexpression trees from a flat tokenized stream, mathematically comparing them usingsp.simplify()to ensure semantic fidelity. - Streaming Tokenizer: Added
MathTokStreamingPipelinewith Python generator (yield) support for memory-efficient corpus-scale tokenization. - Benchmark Expansion: Added
ODE_PDE,LINEAR_ALGEBRA,PROBABILITY, andSET_THEORYdomains into theevaluation/comparison.pysuite.
The MathTok Tokenizer improves the Structural Encoding Ratio (SCR) by 2.29x over Character Level Tokenization across the evaluation suite! 6. HF-compatible tokenizer β drop-in for transformers training pipelines