Text Generation
Transformers
English
custom
tokenizer
symbolic-ai
mathematics
llm
reasoning
ast
compiler
nlp
deep-learning
machine-learning
mathematical-reasoning
symbolic-reasoning
tokenization
parser
artificial-intelligence
Eval Results (legacy)
Instructions to use SurweeshSP/mathtok with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use SurweeshSP/mathtok with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="SurweeshSP/mathtok")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("SurweeshSP/mathtok", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use SurweeshSP/mathtok with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "SurweeshSP/mathtok" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "SurweeshSP/mathtok", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/SurweeshSP/mathtok
- SGLang
How to use SurweeshSP/mathtok with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "SurweeshSP/mathtok" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "SurweeshSP/mathtok", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "SurweeshSP/mathtok" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "SurweeshSP/mathtok", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use SurweeshSP/mathtok with Docker Model Runner:
docker model run hf.co/SurweeshSP/mathtok
| # π MathTok: Canonicalized AST-Based Mathematical Tokenizer Codebase Review | |
| An in-depth structural and architectural analysis of the **MathTok** pipeline located at `c:\Users\surwe\Project\math_token`. This document serves as a comprehensive system review, detailing the mathematical foundations, the 7-layer pipeline design, system components, evaluation metrics, empirical results, and downstream application patterns of MathTok. | |
| --- | |
| ## π Executive Summary | |
| Standard natural language tokenizers (like Byte-Pair Encoding or SentencePiece) treat mathematical expressions as plain text sequences. This results in **structural fragmentation** (e.g., splitting a variable `VAR_THETA` or operator `OP_ADD` into arbitrary character chunks) and **semantic blindness** (failing to recognize algebraic equivalences like $x + 2 \equiv 2 + x$). | |
| **MathTok** solves this by introducing a **hybrid, structure-aware tokenization framework** for mathematical language modeling. By constructing an Abstract Syntax Tree (AST) from mathematical expressions, normalizing algebraic equivalences via symbolic mathematics (SymPy), and serializing the tree using Depth-First Search (DFS) preorder traversal, MathTok preserves full mathematical syntax and hierarchy. | |
| Additionally, MathTok automatically emits **structural attention metadata** for every token position, enabling downstream transformer models to implement tree-based or graph-structured attention patterns without architectural modifications. | |
| ```mermaid | |
| graph TD | |
| A[Raw Input: Mixed Text + Math] --> B[Layer 2: Hybrid Lexer] | |
| B -->|TEXT Spans| C[Layer 7: BPE Text Sub-Vocab] | |
| B -->|MATH Spans| D[Layer 1: Canonicalizer Engine] | |
| D -->|SymPy Expression| E[Layer 3: AST Generator] | |
| E -->|Typed AST Tree| F[Layer 4: Semantic Operator Registry] | |
| F -->|Enriched Nodes| G[Layer 5: Structural Serializer] | |
| G -->|DFS Preorder Stream| H[Layer 6: Attention Metadata Gen] | |
| H -->|Attention Masks & Hints| I[Final Merged Token Stream] | |
| C --> I | |
| ``` | |
| --- | |
| ## π οΈ The 7-Layer Processing Pipeline | |
| MathTok's core engine is structured into seven distinct modular layers. Every component resides in the [`mathtok/`](file:///c:/Users/surwe/Project/math_token/mathtok) package. | |
| ### Layer 1: Canonicalizer Engine | |
| * **Location**: [`canonicalizer.py`](file:///c:/Users/surwe/Project/math_token/mathtok/canonicalizer.py) | |
| * **Role**: Algebraic normalisation and format conversion (LaTeX $\to$ ASCII $\to$ SymPy). | |
| * **Implementation Details**: | |
| * **Heuristic Format Detection**: Inspects the input for LaTeX syntax (e.g., `\frac`, `\sqrt`, `\sin`, `{`, math delimiters like `$` or `\(`). | |
| * **Parsing**: Utilizes `sympy.parsing.latex.parse_latex` (with ANTLR4) for LaTeX, falling back to `sympy.parsing.sympy_parser.parse_expr` with standard and implicit multiplication transformations for ASCII. | |
| * **Normalisation**: Leverages SymPy's symbolic engine to `expand()` products over sums and `simplify()` algebraic expressions. It normalizes operations internally (e.g., transforming subtractions $a - b$ into additions of products $\text{Add}(a, \text{Mul}(-1, b))$, and divisions $a / b$ into multiplications of powers $\text{Mul}(a, \text{Pow}(b, -1))$). | |
| * **Robustness & Performance**: Employs an LRU cache (default: 512 entries) to prevent redundant parsing and wraps expensive SymPy calls in a `ThreadPoolExecutor` with configurable parsing timeouts (default: 5.0 seconds) to prevent infinite loops on malicious, highly-complex inputs. | |
| ### Layer 2: Hybrid Mathematical Lexer | |
| * **Location**: [`lexer.py`](file:///c:/Users/surwe/Project/math_token/mathtok/lexer.py) | |
| * **Role**: Alternating segment segmentation (TEXT spans vs. MATH spans). | |
| * **Implementation Details**: | |
| * **Stage 1 (Unambiguous Delimiters)**: Extracts LaTeX math environments (double dollar `$$...$$`, inline dollar `$...\$`, bracket `\[...\]`, or parenthesis `\(...\)`). | |
| * **Stage 2 (ASCII Heuristics)**: Parses remaining text regions using pre-compiled regular expressions matching mathematical patterns (e.g., function calls `sin(...)`, exponents `x^2`, arithmetic boundaries `2*x+1`, relational equations `a+b=c`, and spelled-out Greek variables). | |
| * **Region Expansion**: Expands detected math seeds backwards to include leading unary operators and digits, and forwards to match balanced braces/parentheses and continuous math characters. Adjacent spans of identical types are merged. | |
| ### Layer 3: AST Generator | |
| * **Location**: [`ast_generator.py`](file:///c:/Users/surwe/Project/math_token/mathtok/ast_generator.py) | |
| * **Role**: SymPy AST conversion to typed, abstract vocabulary trees. | |
| * **Implementation Details**: | |
| * Walks the SymPy internal expression tree recursively. | |
| * Maps generic SymPy types into the vocabulary of MathTok: | |
| * **Variables**: Standard letters map to `VAR_X`, `VAR_Y`, etc. Spelled-out Greek names map to `VAR_THETA`, `VAR_LAMBDA`, etc. | |
| * **Constants**: Values between $-10$ and $100$ receive dedicated tokens (e.g., `CONST_3`, `CONST_12`), large integers map to placeholders (e.g., `NUM_145`), floats map to string-encoded float tokens (e.g., `FLOAT_3p14`), and special constants map to `CONST_PI`, `CONST_E`, `CONST_I`, and `CONST_INF`. | |
| * **Unary Operations**: Converts negative numbers or multiplication by $-1$ to explicit `OP_NEG` nodes, and division inverses to `OP_RECIP` nodes. | |
| * **Fractions**: Converts `Rational(p, q)` into explicit binary `FRAC(numerator, denominator)` nodes. | |
| * **Recursion Guard**: Enforces `max_depth` limits (default: 20) to truncate overly-nested expressions, replacing them with a special `SUBTREE_TRUNCATED` node to avoid Python stack overflows. | |
| ### Layer 4: Semantic Operator Registry | |
| * **Location**: [`operator_registry.py`](file:///c:/Users/surwe/Project/math_token/mathtok/operator_registry.py) | |
| * **Role**: Rich metadata storage and categorisation for mathematical operators. | |
| * **Implementation Details**: | |
| * Maintains an immutable registry of `OperatorMeta` instances mapping token strings to mathematical properties: | |
| * **Properties**: `arity` ($-1$ for variadic, or fixed integers like 1 or 2), `precedence`, `associativity` (left, right, or none), `semantic_role` (e.g., `aggregation` for addition, `periodic_oscillation` for sine), `latex_repr`, `ascii_repr`, `category`, and `is_commutative`. | |
| * **Domains**: Spans multiple mathematical branches: Arithmetic, Relational, Calculus, Trigonometry, Exponential/Logarithmic, Logic, Set Theory, Geometry, and Statistics. | |
| * **Inverses**: Declares explicit mathematical inverses in `INVERSE_PAIRS` (e.g., `FUNC_SIN` $\leftrightarrow$ `FUNC_ASIN`, `FUNC_EXP` $\leftrightarrow$ `FUNC_LOG`). | |
| ### Layer 5: Structural Serializer | |
| * **Location**: [`serializer.py`](file:///c:/Users/surwe/Project/math_token/mathtok/serializer.py) | |
| * **Role**: Flattening the 2D tree structure into a 1-D stream using DFS preorder traversal. | |
| * **Implementation Details**: | |
| * Emits nodes starting from the root down to the leaves, producing a flat sequence of `SerializedToken` objects carrying: `depth`, `node_id`, `parent_id`, `child_index`, `num_children`, `is_leaf`, and `subtree_size`. | |
| * **Scope Delineation**: Emits `[SCOPE_OPEN]` and `[SCOPE_CLOSE]` boundary tokens to explicitly group parameters for functions (e.g., `FUNC_SIN [SCOPE_OPEN] VAR_X [SCOPE_CLOSE]`). | |
| * **Subtree Deduplication**: Integrates MD5 structural hashing (`dedup_subtrees`) to replace duplicated structures (e.g., repeating sub-formulas) with a pointer reference (e.g., `SUBTREE_REF_ae34df51`), improving sequence compression. | |
| ### Layer 6: Structural Attention Metadata Generator | |
| * **Location**: [`metadata.py`](file:///c:/Users/surwe/Project/math_token/mathtok/metadata.py) | |
| * **Role**: Calculating positional contexts and binary attention mask matrices. | |
| * **Implementation Details**: | |
| * Classifies tokens into categories: `operator`, `function`, `variable`, `constant`, `structural`, `boundary`, or `text`. | |
| * Generates a dot-separated positional hierarchy string for each node in `tree_position_key` (e.g., `0.1.2` denotes root $\to$ 2nd child $\to$ 3rd child), which is useful for hierarchical positional encodings. | |
| * **Attention Mask Matrix Synthesis**: Dynamically compiles four $N \times N$ binary attention mask matrices: | |
| * `parent_mask`: Direct dependency attention. | |
| * `children_mask`: Inverse dependency attention. | |
| * `sibling_mask`: Horizontal syntactic context attention. | |
| * `subtree_mask`: Complete structural scope attention. | |
| ### Layer 7: Vocabulary & BPE Compression | |
| * **Location**: [`vocabulary.py`](file:///c:/Users/surwe/Project/math_token/mathtok/vocabulary.py) | |
| * **Role**: Merging deterministic structural math vocabularies with Byte-Pair Encoding (BPE) text sub-vocabularies. | |
| * **Implementation Details**: | |
| * **Two-Tier Architecture**: | |
| * **Tier 1 (Fixed Math Vocabulary)**: Reservoirs of deterministic, immutable IDs for standard operators, Greek/Latin variables, constants, boundaries, and placeholders. BPE is completely bypassed for math terms. | |
| * **Tier 2 (BPE Text Vocabulary)**: Natural language regions are processed via HuggingFace's `tokenizers` library, trained on corpus-specific text spans. | |
| * **HuggingFace Wrapper**: Under the hood, `MathTokHFTokenizer` acts as a drop-in subclass wrapper for `PreTrainedTokenizer`, enabling immediate integration into standard pipelines such as `transformers.Trainer`, `datasets.map`, and PyTorch collators. | |
| --- | |
| ## π Verification & Streaming Sub-systems | |
| Beyond the core layers, MathTok implements crucial sub-systems to guarantee mathematical correctess and scale. | |
| ### Round-Trip Validation | |
| * **Location**: [`validator.py`](file:///c:/Users/surwe/Project/math_token/mathtok/validator.py) | |
| * **Role**: Guaranteeing zero semantic information loss during tokenization. | |
| * **Implementation Details**: | |
| * Uses the emitted `TokenMetadata` sequence to mathematically reconstruct the original SymPy expression. | |
| * Rebuilds leaf nodes based on their category (constants, variables, truncations) and moves upwards to reconstruct complex nodes (`FRAC`, operators, custom functions). | |
| * Performs formal validation by checking if the algebraic difference between the original and reconstructed expressions simplifies to zero (`sp.simplify(original - reconstructed) == 0`). | |
| ### Streaming Pipeline | |
| * **Location**: [`streaming.py`](file:///c:/Users/surwe/Project/math_token/mathtok/streaming.py) | |
| * **Role**: Corpus-scale processing of large datasets without exhausting system memory. | |
| * **Implementation Details**: | |
| * Wraps `MathTokPipeline` inside a lazy Python generator (`yield`). | |
| * Supports encoding custom iterators and streams line-delimited files sequentially, ensuring constant memory ($O(1)$ RAM) overhead during dataset processing. | |
| --- | |
| ## π Evaluation Suite & Benchmark Metrics | |
| The [`evaluation/`](file:///c:/Users/surwe/Project/math_token/evaluation) package defines five core evaluation metrics (residing in [`metrics.py`](file:///c:/Users/surwe/Project/math_token/evaluation/metrics.py)) to assess tokenizer quality, benchmarked in [`comparison.py`](file:///c:/Users/surwe/Project/math_token/evaluation/comparison.py). | |
| ### Core Metrics | |
| | Metric | Symbol | Definition & Formula | Mathematical Value | | |
| | :--- | :---: | :--- | :--- | | |
| | **Structural Compression Ratio** | **SCR** | $\text{mean}\left(\frac{\text{Structural Score}}{\text{Token Count}}\right)$ | Quantifies structural information density. Higher is better (more structure packed into fewer tokens). | | |
| | **Canonical Consistency Score** | **CCS** | $\text{mean}\left( \text{Jaccard}(S_A, S_B) \right)$ over equivalent pairs | Evaluates algebraic invariance. A score of $1.0$ represents perfect semantic convergence. | | |
| | **Operator Preservation Score** | **OPS** | $\%$ of expressions containing all expected operators | Measures robustness; ensures mathematical operations are never lost or corrupted. | | |
| | **Token Stability** | **TS** | $1 - \text{Coefficient of Variation}(\text{length})$ | Assesses syntactic variance stability under re-writings. Higher is more stable. | | |
| | **Tree Depth Fidelity** | **TDF** | $1 - \text{mean}\left( \frac{\vert d_{\text{actual}} - d_{\text{ground}} \vert}{d_{\text{ground}}} \right)$ | Measures max metadata depth accuracy against the ground truth SymPy height. | | |
| > [!NOTE] | |
| > **Semantic Compression Ratio (SCR)** is evaluated at three hierarchical levels in `comparison.py`: | |
| > * **Level 1 β Structural Score to Token Ratio**: `structural_score / token_count` | |
| > * **Level 2 β Semantic Density**: `math_tokens / total_tokens` | |
| > * **Level 3 β Structural Efficiency**: `parent_child_relations / token_count` | |
| --- | |
| ## π¬ Empirical Benchmark Results | |
| Empirical comparisons of MathTok against a standard subword tokenizer (GPT-2 BPE), a custom-trained SentencePiece (unigram) tokenizer, and character-level baselines over 70 complex test expressions across multiple disciplines reveal substantial improvements. | |
| ### 1. 3-Level Semantic Comparison (Aggregated) | |
| Across the entire evaluation suite, the aggregated results illustrate MathTok's efficiency: | |
| | Metric | MathTok | GPT-2 | SentencePiece | Character-Level | | |
| | :--- | :---: | :---: | :---: | :---: | | |
| | **Level 1 β SCR** (struct_score / tokens) | **0.9161** | 0.4251 | 0.3696 | 0.4005 | | |
| | **Level 2 β Semantic Density** (math / total) | **0.5633** | 0.1838 | 0.1499 | β | | |
| | **Level 3 β Structural Efficiency** (relations / tokens) | **0.2492** | *N/A* | *N/A* | β | | |
| | **SCR Improvement Factor** (MathTok vs. Baseline) | **β** | **2.16x** | **2.48x** | **2.29x** | | |
| ### 2. Canonical Convergence & Consistency (Jaccard Overlap) | |
| For mathematically equivalent pairs, MathTok achieves perfect Jaccard alignment (Jaccard = 1.0), whereas standard text-based tokenizers suffer significant fragmentation: | |
| | Expression Pair | MathTok Jaccard | GPT-2 Jaccard | SentencePiece Jaccard | Convergence Status | | |
| | :--- | :---: | :---: | :---: | :---: | | |
| | `x + 2` vs. `2 + x` | **1.000** | 0.200 | 1.000 | **CONVERGED (100%)** | | |
| | `a*b + a*c` vs. `a*(b+c)` | **1.000** | 0.444 | 0.625 | **CONVERGED (100%)** | | |
| | `(x+1)^2` vs. `x^2+2x+1` | **1.000** | 0.273 | 0.222 | **CONVERGED (100%)** | | |
| | `x^2 - y^2` vs. `(x+y)*(x-y)` | **1.000** | 0.091 | 0.300 | **CONVERGED (100%)** | | |
| | `sin(x)^2 + cos(x)^2` vs. `1` | **1.000** | 0.000 | 0.000 | **CONVERGED (100%)** | | |
| | `2*x + 2*y` vs. `2*(x+y)` | **1.000** | 0.444 | 0.571 | **CONVERGED (100%)** | | |
| | `x*y + x*z` vs. `x*(y+z)` | **1.000** | 0.444 | 0.625 | **CONVERGED (100%)** | | |
| | `a^2 + 2*a*b + b^2` vs. `(a+b)^2` | **1.000** | 0.364 | 0.455 | **CONVERGED (100%)** | | |
| ### 3. LaTeX vs. ASCII Format Invariance | |
| MathTok perfectly converges inputs in differing representations to identical structural sequences, while subword tokenizers have severe variance: | |
| | ASCII Expression | LaTeX Expression | MathTok same? | MT tokens A/L | GPT-2 tokens A/L | SP tokens A/L | | |
| | :--- | :--- | :---: | :---: | :---: | :---: | | |
| | `sin(x^2)` | `\sin(x^2)` | **YES (1.00)** | **8 / 8** | 6 / 7 | 6 / 6 | | |
| | `sqrt(x^2 + 1)` | `\sqrt{x^2 + 1}` | **YES (1.00)** | **11 / 11** | 9 / 10 | 9 / 9 | | |
| | `log(x)` | `\ln(x)` | **YES (1.00)** | **6 / 6** | 4 / 5 | 6 / 6 | | |
| | `exp(x)` | `e^x` | **YES (1.00)** | **6 / 6** | 4 / 3 | 6 / 3 | | |
| | `x/y` | `\frac{x}{y}` | **YES (1.00)** | **6 / 6** | 3 / 7 | 3 / 9 | | |
| | `int(x^2, x)` | `\int x^2 dx` | **NO (~/fallback)** | **1 / 10** | 8 / 6 | 8 / 7 | | |
| | `diff(sin(x), x)` | `\frac{d}{dx}\sin(x)` | **YES (1.00)** | **6 / 6** | 8 / 11 | 14 / 16 | | |
| | `factorial(n)` | `n!` | **YES (1.00)** | **6 / 6** | 5 / 2 | 11 / 3 | | |
| --- | |
| ## π Custom Attention Integration Patterns | |
| The core value of MathTok for downstream machine learning practitioners is the **Layer 6 Attention Hints**. By translating tree relationships into standard masking shapes, model creators can train structure-aware networks natively. | |
| Below are three attention mask designs that can be constructed directly from the outputs of `to_attention_mask_hints()`: | |
| ### 1. Parent-Child Hierarchical Mask | |
| Encourages top-down syntactic attention. Nodes are only allowed to attend to their direct parent or child node. | |
| ``` | |
| [+ (root)] Parent Attention Mask Matrix: | |
| / \ | |
| [x] [3] [ ] [+ (root)] [x] [3] | |
| | [+ (root)] 1 1 1 | |
| [sin] [x] 1 1 0 | |
| [3] 1 0 1 | |
| ``` | |
| ### 2. Sibling Horizontal Mask | |
| Focuses horizontal attention across operands of identical scopes (e.g., connecting operands inside an addition sequence, $a$ and $b$ and $c$, without parent noise). | |
| ### 3. Subtree Scope Mask | |
| A highly effective block mask for mathematical reasoning. Restricts attention strictly within a subtree, isolating independent sub-expressions during reasoning loops. | |
| --- | |
| ## π― Codebase Evaluation & Recommendations | |
| ### Key Strengths | |
| 1. **Outstanding Structural Integrity**: Modularity is excellent. Clear abstraction separation (canonicalization, tokenization, serialization, and vocabulary grouping) makes codebase expansion extremely straightforward. | |
| 2. **HuggingFace Compatibility**: Subclassing/wrapping the standard tokenizer class ensures immediate, zero-friction integration with existing libraries like PyTorch and HuggingFace. | |
| 3. **Rigorous Validation**: The inclusion of `validator.py` and the round-trip checking logic demonstrates high development standards. | |
| 4. **Reliability Guards**: LRU caches, concurrency thread pools, and recursion limits make this pipeline safe for server-side deployment. | |
| ### Recommended Enhancements | |
| * **Vocabulary Extension**: Dynamically augment `_VAR_MAP` in `ast_generator.py` to natively support multi-character variables (e.g., physics variables like $v_{\text{init}}$ or matrix names) without splitting them into generic token placeholders. | |
| * **SymPy Parser Customisation**: SymPy's LaTeX parser can occasionally fail on non-standard, custom LaTeX macros. Adding pre-processing ASCII/LaTeX regex cleaners in `lexer.py` prior to passing them to SymPy will improve the parse success rate of dirty online forum data. | |
| * **TDF Precision**: In case of multi-nested subtrees (e.g., highly deeply-nested fractions), customize the tree depth calculation in `metrics.py` to evaluate structural depths on custom mathematical representations rather than internal SymPy structures. | |
| --- | |
| ### Citation Reference | |
| ```bibtex | |
| @article{mathtok2026, | |
| title = {MathTok: A Hybrid Canonicalized AST-Based Tokenization Framework | |
| for Mathematical Language Modeling}, | |
| author = {Anonymous}, | |
| year = {2026}, | |
| note = {Under review} | |
| } | |
| ``` | |