Instructions to use SurweeshSP/mathtok with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use SurweeshSP/mathtok with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="SurweeshSP/mathtok")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("SurweeshSP/mathtok", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use SurweeshSP/mathtok with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "SurweeshSP/mathtok" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "SurweeshSP/mathtok", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/SurweeshSP/mathtok
- SGLang
How to use SurweeshSP/mathtok with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "SurweeshSP/mathtok" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "SurweeshSP/mathtok", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "SurweeshSP/mathtok" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "SurweeshSP/mathtok", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use SurweeshSP/mathtok with Docker Model Runner:
docker model run hf.co/SurweeshSP/mathtok
π MathTok: Canonicalized AST-Based Mathematical Tokenizer Codebase Review
An in-depth structural and architectural analysis of the MathTok pipeline located at c:\Users\surwe\Project\math_token. This document serves as a comprehensive system review, detailing the mathematical foundations, the 7-layer pipeline design, system components, evaluation metrics, empirical results, and downstream application patterns of MathTok.
π Executive Summary
Standard natural language tokenizers (like Byte-Pair Encoding or SentencePiece) treat mathematical expressions as plain text sequences. This results in structural fragmentation (e.g., splitting a variable VAR_THETA or operator OP_ADD into arbitrary character chunks) and semantic blindness (failing to recognize algebraic equivalences like $x + 2 \equiv 2 + x$).
MathTok solves this by introducing a hybrid, structure-aware tokenization framework for mathematical language modeling. By constructing an Abstract Syntax Tree (AST) from mathematical expressions, normalizing algebraic equivalences via symbolic mathematics (SymPy), and serializing the tree using Depth-First Search (DFS) preorder traversal, MathTok preserves full mathematical syntax and hierarchy.
Additionally, MathTok automatically emits structural attention metadata for every token position, enabling downstream transformer models to implement tree-based or graph-structured attention patterns without architectural modifications.
graph TD
A[Raw Input: Mixed Text + Math] --> B[Layer 2: Hybrid Lexer]
B -->|TEXT Spans| C[Layer 7: BPE Text Sub-Vocab]
B -->|MATH Spans| D[Layer 1: Canonicalizer Engine]
D -->|SymPy Expression| E[Layer 3: AST Generator]
E -->|Typed AST Tree| F[Layer 4: Semantic Operator Registry]
F -->|Enriched Nodes| G[Layer 5: Structural Serializer]
G -->|DFS Preorder Stream| H[Layer 6: Attention Metadata Gen]
H -->|Attention Masks & Hints| I[Final Merged Token Stream]
C --> I
π οΈ The 7-Layer Processing Pipeline
MathTok's core engine is structured into seven distinct modular layers. Every component resides in the mathtok/ package.
Layer 1: Canonicalizer Engine
- Location:
canonicalizer.py - Role: Algebraic normalisation and format conversion (LaTeX $\to$ ASCII $\to$ SymPy).
- Implementation Details:
- Heuristic Format Detection: Inspects the input for LaTeX syntax (e.g.,
\frac,\sqrt,\sin,{, math delimiters like$or\(). - Parsing: Utilizes
sympy.parsing.latex.parse_latex(with ANTLR4) for LaTeX, falling back tosympy.parsing.sympy_parser.parse_exprwith standard and implicit multiplication transformations for ASCII. - Normalisation: Leverages SymPy's symbolic engine to
expand()products over sums andsimplify()algebraic expressions. It normalizes operations internally (e.g., transforming subtractions $a - b$ into additions of products $\text{Add}(a, \text{Mul}(-1, b))$, and divisions $a / b$ into multiplications of powers $\text{Mul}(a, \text{Pow}(b, -1))$). - Robustness & Performance: Employs an LRU cache (default: 512 entries) to prevent redundant parsing and wraps expensive SymPy calls in a
ThreadPoolExecutorwith configurable parsing timeouts (default: 5.0 seconds) to prevent infinite loops on malicious, highly-complex inputs.
- Heuristic Format Detection: Inspects the input for LaTeX syntax (e.g.,
Layer 2: Hybrid Mathematical Lexer
- Location:
lexer.py - Role: Alternating segment segmentation (TEXT spans vs. MATH spans).
- Implementation Details:
- Stage 1 (Unambiguous Delimiters): Extracts LaTeX math environments (double dollar
$$...$$, inline dollar$...\$, bracket\[...\], or parenthesis\(...\)). - Stage 2 (ASCII Heuristics): Parses remaining text regions using pre-compiled regular expressions matching mathematical patterns (e.g., function calls
sin(...), exponentsx^2, arithmetic boundaries2*x+1, relational equationsa+b=c, and spelled-out Greek variables). - Region Expansion: Expands detected math seeds backwards to include leading unary operators and digits, and forwards to match balanced braces/parentheses and continuous math characters. Adjacent spans of identical types are merged.
- Stage 1 (Unambiguous Delimiters): Extracts LaTeX math environments (double dollar
Layer 3: AST Generator
- Location:
ast_generator.py - Role: SymPy AST conversion to typed, abstract vocabulary trees.
- Implementation Details:
- Walks the SymPy internal expression tree recursively.
- Maps generic SymPy types into the vocabulary of MathTok:
- Variables: Standard letters map to
VAR_X,VAR_Y, etc. Spelled-out Greek names map toVAR_THETA,VAR_LAMBDA, etc. - Constants: Values between $-10$ and $100$ receive dedicated tokens (e.g.,
CONST_3,CONST_12), large integers map to placeholders (e.g.,NUM_145), floats map to string-encoded float tokens (e.g.,FLOAT_3p14), and special constants map toCONST_PI,CONST_E,CONST_I, andCONST_INF. - Unary Operations: Converts negative numbers or multiplication by $-1$ to explicit
OP_NEGnodes, and division inverses toOP_RECIPnodes. - Fractions: Converts
Rational(p, q)into explicit binaryFRAC(numerator, denominator)nodes.
- Variables: Standard letters map to
- Recursion Guard: Enforces
max_depthlimits (default: 20) to truncate overly-nested expressions, replacing them with a specialSUBTREE_TRUNCATEDnode to avoid Python stack overflows.
Layer 4: Semantic Operator Registry
- Location:
operator_registry.py - Role: Rich metadata storage and categorisation for mathematical operators.
- Implementation Details:
- Maintains an immutable registry of
OperatorMetainstances mapping token strings to mathematical properties:- Properties:
arity($-1$ for variadic, or fixed integers like 1 or 2),precedence,associativity(left, right, or none),semantic_role(e.g.,aggregationfor addition,periodic_oscillationfor sine),latex_repr,ascii_repr,category, andis_commutative. - Domains: Spans multiple mathematical branches: Arithmetic, Relational, Calculus, Trigonometry, Exponential/Logarithmic, Logic, Set Theory, Geometry, and Statistics.
- Inverses: Declares explicit mathematical inverses in
INVERSE_PAIRS(e.g.,FUNC_SIN$\leftrightarrow$FUNC_ASIN,FUNC_EXP$\leftrightarrow$FUNC_LOG).
- Properties:
- Maintains an immutable registry of
Layer 5: Structural Serializer
- Location:
serializer.py - Role: Flattening the 2D tree structure into a 1-D stream using DFS preorder traversal.
- Implementation Details:
- Emits nodes starting from the root down to the leaves, producing a flat sequence of
SerializedTokenobjects carrying:depth,node_id,parent_id,child_index,num_children,is_leaf, andsubtree_size. - Scope Delineation: Emits
[SCOPE_OPEN]and[SCOPE_CLOSE]boundary tokens to explicitly group parameters for functions (e.g.,FUNC_SIN [SCOPE_OPEN] VAR_X [SCOPE_CLOSE]). - Subtree Deduplication: Integrates MD5 structural hashing (
dedup_subtrees) to replace duplicated structures (e.g., repeating sub-formulas) with a pointer reference (e.g.,SUBTREE_REF_ae34df51), improving sequence compression.
- Emits nodes starting from the root down to the leaves, producing a flat sequence of
Layer 6: Structural Attention Metadata Generator
- Location:
metadata.py - Role: Calculating positional contexts and binary attention mask matrices.
- Implementation Details:
- Classifies tokens into categories:
operator,function,variable,constant,structural,boundary, ortext. - Generates a dot-separated positional hierarchy string for each node in
tree_position_key(e.g.,0.1.2denotes root $\to$ 2nd child $\to$ 3rd child), which is useful for hierarchical positional encodings. - Attention Mask Matrix Synthesis: Dynamically compiles four $N \times N$ binary attention mask matrices:
parent_mask: Direct dependency attention.children_mask: Inverse dependency attention.sibling_mask: Horizontal syntactic context attention.subtree_mask: Complete structural scope attention.
- Classifies tokens into categories:
Layer 7: Vocabulary & BPE Compression
- Location:
vocabulary.py - Role: Merging deterministic structural math vocabularies with Byte-Pair Encoding (BPE) text sub-vocabularies.
- Implementation Details:
- Two-Tier Architecture:
- Tier 1 (Fixed Math Vocabulary): Reservoirs of deterministic, immutable IDs for standard operators, Greek/Latin variables, constants, boundaries, and placeholders. BPE is completely bypassed for math terms.
- Tier 2 (BPE Text Vocabulary): Natural language regions are processed via HuggingFace's
tokenizerslibrary, trained on corpus-specific text spans.
- HuggingFace Wrapper: Under the hood,
MathTokHFTokenizeracts as a drop-in subclass wrapper forPreTrainedTokenizer, enabling immediate integration into standard pipelines such astransformers.Trainer,datasets.map, and PyTorch collators.
- Two-Tier Architecture:
π Verification & Streaming Sub-systems
Beyond the core layers, MathTok implements crucial sub-systems to guarantee mathematical correctess and scale.
Round-Trip Validation
- Location:
validator.py - Role: Guaranteeing zero semantic information loss during tokenization.
- Implementation Details:
- Uses the emitted
TokenMetadatasequence to mathematically reconstruct the original SymPy expression. - Rebuilds leaf nodes based on their category (constants, variables, truncations) and moves upwards to reconstruct complex nodes (
FRAC, operators, custom functions). - Performs formal validation by checking if the algebraic difference between the original and reconstructed expressions simplifies to zero (
sp.simplify(original - reconstructed) == 0).
- Uses the emitted
Streaming Pipeline
- Location:
streaming.py - Role: Corpus-scale processing of large datasets without exhausting system memory.
- Implementation Details:
- Wraps
MathTokPipelineinside a lazy Python generator (yield). - Supports encoding custom iterators and streams line-delimited files sequentially, ensuring constant memory ($O(1)$ RAM) overhead during dataset processing.
- Wraps
π Evaluation Suite & Benchmark Metrics
The evaluation/ package defines five core evaluation metrics (residing in metrics.py) to assess tokenizer quality, benchmarked in comparison.py.
Core Metrics
| Metric | Symbol | Definition & Formula | Mathematical Value |
|---|---|---|---|
| Structural Compression Ratio | SCR | $\text{mean}\left(\frac{\text{Structural Score}}{\text{Token Count}}\right)$ | Quantifies structural information density. Higher is better (more structure packed into fewer tokens). |
| Canonical Consistency Score | CCS | $\text{mean}\left( \text{Jaccard}(S_A, S_B) \right)$ over equivalent pairs | Evaluates algebraic invariance. A score of $1.0$ represents perfect semantic convergence. |
| Operator Preservation Score | OPS | $%$ of expressions containing all expected operators | Measures robustness; ensures mathematical operations are never lost or corrupted. |
| Token Stability | TS | $1 - \text{Coefficient of Variation}(\text{length})$ | Assesses syntactic variance stability under re-writings. Higher is more stable. |
| Tree Depth Fidelity | TDF | $1 - \text{mean}\left( \frac{\vert d_{\text{actual}} - d_{\text{ground}} \vert}{d_{\text{ground}}} \right)$ | Measures max metadata depth accuracy against the ground truth SymPy height. |
Semantic Compression Ratio (SCR) is evaluated at three hierarchical levels in
comparison.py:
- Level 1 β Structural Score to Token Ratio:
structural_score / token_count- Level 2 β Semantic Density:
math_tokens / total_tokens- Level 3 β Structural Efficiency:
parent_child_relations / token_count
π¬ Empirical Benchmark Results
Empirical comparisons of MathTok against a standard subword tokenizer (GPT-2 BPE), a custom-trained SentencePiece (unigram) tokenizer, and character-level baselines over 70 complex test expressions across multiple disciplines reveal substantial improvements.
1. 3-Level Semantic Comparison (Aggregated)
Across the entire evaluation suite, the aggregated results illustrate MathTok's efficiency:
| Metric | MathTok | GPT-2 | SentencePiece | Character-Level |
|---|---|---|---|---|
| Level 1 β SCR (struct_score / tokens) | 0.9161 | 0.4251 | 0.3696 | 0.4005 |
| Level 2 β Semantic Density (math / total) | 0.5633 | 0.1838 | 0.1499 | β |
| Level 3 β Structural Efficiency (relations / tokens) | 0.2492 | N/A | N/A | β |
| SCR Improvement Factor (MathTok vs. Baseline) | β | 2.16x | 2.48x | 2.29x |
2. Canonical Convergence & Consistency (Jaccard Overlap)
For mathematically equivalent pairs, MathTok achieves perfect Jaccard alignment (Jaccard = 1.0), whereas standard text-based tokenizers suffer significant fragmentation:
| Expression Pair | MathTok Jaccard | GPT-2 Jaccard | SentencePiece Jaccard | Convergence Status |
|---|---|---|---|---|
x + 2 vs. 2 + x |
1.000 | 0.200 | 1.000 | CONVERGED (100%) |
a*b + a*c vs. a*(b+c) |
1.000 | 0.444 | 0.625 | CONVERGED (100%) |
(x+1)^2 vs. x^2+2x+1 |
1.000 | 0.273 | 0.222 | CONVERGED (100%) |
x^2 - y^2 vs. (x+y)*(x-y) |
1.000 | 0.091 | 0.300 | CONVERGED (100%) |
sin(x)^2 + cos(x)^2 vs. 1 |
1.000 | 0.000 | 0.000 | CONVERGED (100%) |
2*x + 2*y vs. 2*(x+y) |
1.000 | 0.444 | 0.571 | CONVERGED (100%) |
x*y + x*z vs. x*(y+z) |
1.000 | 0.444 | 0.625 | CONVERGED (100%) |
a^2 + 2*a*b + b^2 vs. (a+b)^2 |
1.000 | 0.364 | 0.455 | CONVERGED (100%) |
3. LaTeX vs. ASCII Format Invariance
MathTok perfectly converges inputs in differing representations to identical structural sequences, while subword tokenizers have severe variance:
| ASCII Expression | LaTeX Expression | MathTok same? | MT tokens A/L | GPT-2 tokens A/L | SP tokens A/L |
|---|---|---|---|---|---|
sin(x^2) |
\sin(x^2) |
YES (1.00) | 8 / 8 | 6 / 7 | 6 / 6 |
sqrt(x^2 + 1) |
\sqrt{x^2 + 1} |
YES (1.00) | 11 / 11 | 9 / 10 | 9 / 9 |
log(x) |
\ln(x) |
YES (1.00) | 6 / 6 | 4 / 5 | 6 / 6 |
exp(x) |
e^x |
YES (1.00) | 6 / 6 | 4 / 3 | 6 / 3 |
x/y |
\frac{x}{y} |
YES (1.00) | 6 / 6 | 3 / 7 | 3 / 9 |
int(x^2, x) |
\int x^2 dx |
NO (~/fallback) | 1 / 10 | 8 / 6 | 8 / 7 |
diff(sin(x), x) |
\frac{d}{dx}\sin(x) |
YES (1.00) | 6 / 6 | 8 / 11 | 14 / 16 |
factorial(n) |
n! |
YES (1.00) | 6 / 6 | 5 / 2 | 11 / 3 |
π Custom Attention Integration Patterns
The core value of MathTok for downstream machine learning practitioners is the Layer 6 Attention Hints. By translating tree relationships into standard masking shapes, model creators can train structure-aware networks natively.
Below are three attention mask designs that can be constructed directly from the outputs of to_attention_mask_hints():
1. Parent-Child Hierarchical Mask
Encourages top-down syntactic attention. Nodes are only allowed to attend to their direct parent or child node.
[+ (root)] Parent Attention Mask Matrix:
/ \
[x] [3] [ ] [+ (root)] [x] [3]
| [+ (root)] 1 1 1
[sin] [x] 1 1 0
[3] 1 0 1
2. Sibling Horizontal Mask
Focuses horizontal attention across operands of identical scopes (e.g., connecting operands inside an addition sequence, $a$ and $b$ and $c$, without parent noise).
3. Subtree Scope Mask
A highly effective block mask for mathematical reasoning. Restricts attention strictly within a subtree, isolating independent sub-expressions during reasoning loops.
π― Codebase Evaluation & Recommendations
Key Strengths
- Outstanding Structural Integrity: Modularity is excellent. Clear abstraction separation (canonicalization, tokenization, serialization, and vocabulary grouping) makes codebase expansion extremely straightforward.
- HuggingFace Compatibility: Subclassing/wrapping the standard tokenizer class ensures immediate, zero-friction integration with existing libraries like PyTorch and HuggingFace.
- Rigorous Validation: The inclusion of
validator.pyand the round-trip checking logic demonstrates high development standards. - Reliability Guards: LRU caches, concurrency thread pools, and recursion limits make this pipeline safe for server-side deployment.
Recommended Enhancements
- Vocabulary Extension: Dynamically augment
_VAR_MAPinast_generator.pyto natively support multi-character variables (e.g., physics variables like $v_{\text{init}}$ or matrix names) without splitting them into generic token placeholders. - SymPy Parser Customisation: SymPy's LaTeX parser can occasionally fail on non-standard, custom LaTeX macros. Adding pre-processing ASCII/LaTeX regex cleaners in
lexer.pyprior to passing them to SymPy will improve the parse success rate of dirty online forum data. - TDF Precision: In case of multi-nested subtrees (e.g., highly deeply-nested fractions), customize the tree depth calculation in
metrics.pyto evaluate structural depths on custom mathematical representations rather than internal SymPy structures.
Citation Reference
@article{mathtok2026,
title = {MathTok: A Hybrid Canonicalized AST-Based Tokenization Framework
for Mathematical Language Modeling},
author = {Anonymous},
year = {2026},
note = {Under review}
}