| language: | |
| - en | |
| license: mit | |
| library_name: custom | |
| tags: | |
| - tokenizer | |
| - symbolic-ai | |
| - mathematics | |
| - llm | |
| - reasoning | |
| - ast | |
| - compiler | |
| - nlp | |
| - deep-learning | |
| - machine-learning | |
| - mathematical-reasoning | |
| - symbolic-reasoning | |
| - tokenization | |
| - parser | |
| - transformers | |
| - artificial-intelligence | |
| pipeline_tag: text-generation | |
| datasets: | |
| - custom-mathematical-dataset | |
| metrics: | |
| - semantic-density | |
| - structural-efficiency | |
| - symbolic-compression-ratio | |
| model-index: | |
| - name: MathTok | |
|   results: | |
|   - task: | |
|       type: tokenization | |
|       name: Mathematical Tokenization | |
|     dataset: | |
|       name: Custom Mathematical Benchmark | |
|       type: symbolic-math | |
|     metrics: | |
|     - type: semantic-density | |
|       value: Improved | |
|       name: Semantic Density | |
|     - type: structural-efficiency | |
|       value: Optimized | |
|       name: Structural Efficiency | |
|     - type: symbolic-compression-ratio | |
|       value: Enhanced | |
|       name: SCR | |
| co2_eq_emissions: | |
|   emissions: 0 | |
| license_name: mit | |
| pretty_name: MathTok | |
| thumbnail: assets/mathtok_architecture_improvements.svg | |
| # MathTok | |
| **A Hybrid Canonicalized AST-Based Tokenization Framework for Mathematical Language Modeling** | |
|  | |
|  | |
|  | |
|  | |
|  | |
| --- | |
| ## Why MathTok? | |
| Traditional tokenizers such as BPE and SentencePiece treat mathematical | |
| expressions as plain text sequences, fragmenting semantic structure and | |
| discarding operator hierarchy. | |
| MathTok introduces a structure-aware tokenization pipeline that: | |
| - canonicalizes equivalent mathematical expressions, | |
| - preserves AST hierarchy, | |
| - encodes operator semantics explicitly, | |
| - improves symbolic compression efficiency, | |
| - and enables future tree-aware transformer architectures. | |
| --- | |
| ## Overview | |
| MathTok is a research tokenizer pipeline that converts raw mathematical expressions (LaTeX or ASCII) into a structured, semantically rich token stream. Unlike standard BPE or SentencePiece tokenizers, MathTok is *structure-aware*: it builds an Abstract Syntax Tree (AST) from each expression and serializes it via depth-first (DFS) preorder traversal, so the operator hierarchy is carried by the token order. | |
| ``` | |
| Raw Mathematical Expression | |
| ↓ | |
| Canonicalization Layer (sympy: simplify, expand, normalize) | |
| ↓ | |
| Hybrid Mathematical Lexer (split TEXT / MATH spans) | |
| ↓ | |
| AST Generator (SymPy tree → typed ASTNode tree) | |
| ↓ | |
| Operator-Aware Semantic Encoder (rich metadata per operator) | |
| ↓ | |
| Structural Serialization (DFS preorder → flat token stream) | |
| ↓ | |
| Structural Attention Metadata (per-token tree context) | |
| ↓ | |
| Vocabulary Mapping + BPE (fixed math vocab + HF BPE for text) | |
| ↓ | |
| Compressed Token Stream | |
| ``` | |
| --- | |
| ## Architecture | |
|  | |
| --- | |
| ## Installation | |
| Clone the repository and install the package in editable mode: | |
| ```bash | |
| git clone https://github.com/SurweeshSP/mathtok.git | |
| cd mathtok | |
| pip install -e ".[eval,dev]" | |
| ``` | |
| --- | |
| ## Quick Start | |
| ### Tokenize a Mathematical Expression | |
| Run the tokenizer pipeline directly from the command line: | |
| ```bash | |
| python -m mathtok.pipeline "The derivative of sin(x^2) + 3x" | |
| ``` | |
| Example output (illustrative; exact token names and order may differ from those shown in the Python API section): | |
| ```text | |
| [ | |
| FUNCTION_SIN, | |
| VARIABLE_x, | |
| POWER, | |
| NUMBER_2, | |
| OP_ADD, | |
| NUMBER_3, | |
| VARIABLE_x | |
| ] | |
| ``` | |
| --- | |
| ## Running the Test Suite | |
| Execute the comprehensive unit and integration test suite: | |
| ```bash | |
| pytest tests/ -v | |
| ``` | |
| Current coverage includes: | |
| - AST generation | |
| - Canonicalization | |
| - Lexer validation | |
| - Pipeline integration | |
| - Serialization consistency | |
| - Structural comparison metrics | |
| --- | |
| ## Comparative Tokenizer Evaluation | |
| Run the full benchmark evaluation pipeline: | |
| ```bash | |
| python -m evaluation.comparison | |
| ``` | |
| This benchmark compares: | |
| - MathTok (Hybrid AST Tokenizer) | |
| - GPT-2 BPE | |
| - SentencePiece Unigram | |
| - Character-Level Tokenization | |
| Evaluation metrics include: | |
| - Symbolic Compression Ratio (SCR) | |
| - Semantic Density | |
| - Structural Efficiency | |
| - Token Fragmentation | |
| - Sequence Compactness | |
| --- | |
| ## Visualization Dashboard | |
| Generate benchmark plots and the unified evaluation dashboard: | |
| ```bash | |
| python -m evaluation.visualize | |
| ``` | |
| Generated outputs include: | |
| - Semantic Density Comparison | |
| - SCR Comparison | |
| - Structural Efficiency Comparison | |
| - Token Count Analysis | |
| - Unified Metrics Dashboard | |
| All generated figures are stored in: | |
| ```text | |
| evaluation/results/ | |
| ``` | |
| --- | |
| ## Repository Structure | |
| ```text | |
| mathtok/ | |
| ├── mathtok/ # Core tokenizer framework | |
| ├── evaluation/ # Benchmarking and evaluation | |
| ├── tests/ # Comprehensive test suite | |
| ├── assets/ # Architecture diagrams | |
| ├── README.md | |
| ├── setup.py | |
| └── pyproject.toml | |
| ``` | |
| --- | |
| ## Python API | |
| ```python | |
| from mathtok import MathTokPipeline | |
| pipeline = MathTokPipeline() | |
| # Encode mixed text + math (supporting LaTeX or ASCII syntax) | |
| out = pipeline.encode("The derivative of $\\sin(x^2)$ is $2x\\cos(x^2)$.") | |
| print(out.tokens) # ['[MATH_START]', 'FUNC_SIN', 'OP_POW', 'VAR_X', 'CONST_2', '[MATH_END]', ...] | |
| print(out.sexp) # (FUNC_SIN (OP_POW VAR_X CONST_2)) | |
| print(out.input_ids) # [4, 27, 10, 45, 12, 5, ...] | |
| # Access structural metadata (for tree-aware attention masking) | |
| for meta in out.metadata: | |
|     print(meta.token, meta.depth, meta.tree_position_key) | |
| # Pure math expression serialization | |
| out = pipeline.encode_math_only("(x+1)^2") | |
| print(out.sexp) # (OP_POW (OP_ADD VAR_X CONST_1) CONST_2) | |
| # HuggingFace-compatible tokenizer export | |
| hf_tok = pipeline.get_hf_tokenizer() | |
| hf_tok.save_pretrained("./mathtok-tokenizer") | |
| result = hf_tok("x^2 + 2*x + 1", return_tensors="pt") | |
| ``` | |
| --- | |
| ## Research Contributions | |
| ### 1. Hybrid Lexer | |
| Separates natural language from mathematical content using LaTeX delimiter detection (`$...$`, `\(...\)`, `\[...\]`) and ASCII math heuristics. | |
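The delimiter-detection half of this step can be sketched with a single regular expression. This is a minimal illustration and not the code in `mathtok/lexer.py`: the names `MATH_SPAN` and `split_spans` are hypothetical, and the ASCII math heuristics are not covered.

```python
import re

# Matches $...$, \(...\), or \[...\]; non-greedy so adjacent spans stay separate.
MATH_SPAN = re.compile(r"\$(.+?)\$|\\\((.+?)\\\)|\\\[(.+?)\\\]", re.DOTALL)


def split_spans(text):
    """Return a list of (kind, content) pairs, where kind is 'TEXT' or 'MATH'."""
    spans = []
    cursor = 0
    for match in MATH_SPAN.finditer(text):
        if match.start() > cursor:
            spans.append(("TEXT", text[cursor:match.start()]))
        # Exactly one of the three capture groups is set for any match.
        content = next(g for g in match.groups() if g is not None)
        spans.append(("MATH", content))
        cursor = match.end()
    if cursor < len(text):
        spans.append(("TEXT", text[cursor:]))
    return spans


print(split_spans(r"The derivative of $\sin(x^2)$ is $2x\cos(x^2)$."))
```

Each `MATH` span would then be handed to the math pipeline, while `TEXT` spans go to the text tokenizer.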
| ### 2. Canonicalization Engine | |
| Normalizes mathematically equivalent expressions to a single form using SymPy's `simplify()` and `expand()`, together with SymPy's internal representation, which rewrites subtraction as addition plus negation and division as multiplication by a reciprocal. | |
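The two rewrites named above can be illustrated without SymPy on a toy tree of nested tuples. This is a sketch under assumptions: the tuple format `(op, child, ...)`, the operator names, and the `canonicalize` function are hypothetical stand-ins, and operand sorting is added here only to show how commutative forms can be made to coincide.

```python
def canonicalize(node):
    """Rewrite a nested-tuple expression tree into one canonical form.

    Leaves are strings or numbers; internal nodes are (op, child, ...).
    """
    if not isinstance(node, tuple):
        return node
    op, *args = node
    args = [canonicalize(a) for a in args]
    if op == "sub":  # a - b  ->  a + (-1 * b)
        return canonicalize(("add", args[0], ("mul", -1, args[1])))
    if op == "div":  # a / b  ->  a * b**-1
        return canonicalize(("mul", args[0], ("pow", args[1], -1)))
    if op in ("add", "mul"):
        # Commutative operators: sort operands so x+y and y+x coincide.
        args = sorted(args, key=repr)
    return (op, *args)


# Both spellings of the same expression map to the same tree.
print(canonicalize(("sub", "x", "y")) == canonicalize(("add", ("mul", "y", -1), "x")))
```

After this step, equivalent inputs produce identical token streams downstream, which is what the Token Stability metric is meant to capture.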
| ### 3. AST-Based Structural Serialization | |
| Maps SymPy's expression tree to a typed token vocabulary with semantic metadata per operator. Serializes via DFS preorder traversal. | |
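A DFS preorder serializer can be sketched with the standard library alone. Note the substitution: Python's built-in `ast` module stands in for the SymPy-based parser, so input must use Python syntax (`**` for powers) and no canonicalization is applied. The token names mimic those in the Python API section, and `to_sexp` and `serialize` are hypothetical names.

```python
import ast

# Hypothetical token names, chosen to resemble those in the Python API section.
BINOPS = {
    ast.Add: "OP_ADD",
    ast.Sub: "OP_SUB",
    ast.Mult: "OP_MUL",
    ast.Div: "OP_DIV",
    ast.Pow: "OP_POW",
}


def to_sexp(node):
    """Serialize an expression node in DFS preorder as an S-expression string."""
    if isinstance(node, ast.BinOp):
        op = BINOPS[type(node.op)]
        return f"({op} {to_sexp(node.left)} {to_sexp(node.right)})"
    if isinstance(node, ast.Call):  # e.g. sin(x) -> (FUNC_SIN VAR_X)
        args = " ".join(to_sexp(a) for a in node.args)
        return f"(FUNC_{node.func.id.upper()} {args})"
    if isinstance(node, ast.Name):
        return f"VAR_{node.id.upper()}"
    if isinstance(node, ast.Constant):
        return f"CONST_{node.value}"
    raise ValueError(f"unsupported node: {type(node).__name__}")


def serialize(expr):
    """Parse Python-syntax math and return (S-expression, flat token list)."""
    sexp = to_sexp(ast.parse(expr, mode="eval").body)
    tokens = sexp.replace("(", "").replace(")", "").split()
    return sexp, tokens


print(serialize("(x + 1) ** 2"))
```

Because the parent is emitted before its children, the flat token list is enough to rebuild the tree once each operator's arity is known.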
| ### 4. Operator Semantic Registry | |
| Every operator and function carries an explicit metadata record: `arity`, `precedence`, `associativity`, `semantic_role`. This is the primary novelty over standard tokenization. | |
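One plausible shape for such a registry is a frozen dataclass per operator. The four field names come from the description above; everything else here (the class name `OperatorInfo`, the precedence numbers, the role labels, and the helper `binds_tighter`) is an assumption for illustration, not the contents of `mathtok/operator_registry.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatorInfo:
    arity: int          # number of operands
    precedence: int     # higher binds tighter (illustrative scale)
    associativity: str  # "left", "right", or "none"
    semantic_role: str  # free-form label; values here are illustrative


# Hypothetical entries keyed by token name.
REGISTRY = {
    "OP_ADD": OperatorInfo(2, 10, "left", "additive"),
    "OP_MUL": OperatorInfo(2, 20, "left", "multiplicative"),
    "OP_POW": OperatorInfo(2, 30, "right", "exponentiation"),
    "FUNC_SIN": OperatorInfo(1, 40, "none", "trigonometric"),
}


def binds_tighter(a, b):
    """True if operator token `a` has higher precedence than `b`."""
    return REGISTRY[a].precedence > REGISTRY[b].precedence


print(REGISTRY["OP_POW"], binds_tighter("OP_MUL", "OP_ADD"))
```

Keeping the record immutable means a token's metadata can be shared safely across every occurrence of that token.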
| ### 5. Structural Attention Metadata | |
| Per-token records encode `depth`, `parent_id`, `children_ids`, `tree_position_key`, and `sibling_count`, which a future structure-aware attention mechanism could consume. | |
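These fields can be computed in a single preorder walk. The sketch below assumes a nested-tuple tree `(op, child, ...)`, uses `-1` as the root's `parent_id`, builds `tree_position_key` as a dotted path of child indices, and counts siblings excluding the node itself. All of those conventions, and the names `TokenMeta` and `build_metadata`, are assumptions rather than MathTok's actual format.

```python
from dataclasses import dataclass, field


@dataclass
class TokenMeta:
    token: str
    depth: int
    parent_id: int          # -1 for the root
    tree_position_key: str  # path of child indices from the root, e.g. "0.1"
    sibling_count: int      # siblings excluding the node itself
    children_ids: list = field(default_factory=list)


def build_metadata(tree):
    """Walk a nested-tuple tree in DFS preorder and emit one record per token."""
    records = []

    def visit(node, depth, parent_id, key, siblings):
        token = node[0] if isinstance(node, tuple) else node
        node_id = len(records)  # preorder index doubles as the token ID
        records.append(TokenMeta(token, depth, parent_id, key, siblings))
        if parent_id >= 0:
            records[parent_id].children_ids.append(node_id)
        children = node[1:] if isinstance(node, tuple) else ()
        for i, child in enumerate(children):
            visit(child, depth + 1, node_id, f"{key}.{i}", len(children) - 1)

    visit(tree, 0, -1, "0", 0)
    return records


for meta in build_metadata(("OP_POW", ("OP_ADD", "VAR_X", "CONST_1"), "CONST_2")):
    print(meta)
```

An attention mask could then, for example, allow a token to attend only to records whose `tree_position_key` is a prefix of its own, which restricts it to its ancestors.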
| ### 6. Two-Tier Vocabulary | |
| - **Fixed math vocabulary**: deterministic IDs for all operators, functions, variables, constants. | |
| - **BPE text vocabulary**: HuggingFace `tokenizers` BPE for natural language spans. | |
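The split between the two tiers can be sketched as two ID ranges. This is a simplified stand-in: the text tier below is a plain dictionary that assigns IDs on first sight, not a trained HuggingFace `tokenizers` BPE model, and the class name `TwoTierVocab`, the special tokens other than `[MATH_START]` and `[MATH_END]`, and the `text_offset` parameter are hypothetical.

```python
class TwoTierVocab:
    """Fixed IDs for math tokens; a separate ID range for text pieces."""

    SPECIALS = ["[PAD]", "[UNK]", "[MATH_START]", "[MATH_END]"]

    def __init__(self, math_tokens, text_offset=1000):
        # Sorting makes the math IDs independent of insertion order.
        fixed = self.SPECIALS + sorted(set(math_tokens))
        if len(fixed) > text_offset:
            raise ValueError("text_offset too small for the math vocabulary")
        self.math_ids = {tok: i for i, tok in enumerate(fixed)}
        self.text_ids = {}
        self.text_offset = text_offset

    def encode_math(self, token):
        """Look up a math token, falling back to [UNK]."""
        return self.math_ids.get(token, self.math_ids["[UNK]"])

    def encode_text(self, piece):
        # Stand-in for a trained BPE model: assign IDs on first sight.
        if piece not in self.text_ids:
            self.text_ids[piece] = self.text_offset + len(self.text_ids)
        return self.text_ids[piece]


vocab = TwoTierVocab(["OP_ADD", "OP_POW", "VAR_X", "CONST_2"])
print(vocab.encode_math("OP_POW"), vocab.encode_text("derivative"))
```

Keeping the math IDs below `text_offset` means the two tiers can never collide, so the math vocabulary can grow without retraining the text tier.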
| --- | |
| ## Evaluation Metrics & Benchmarks | |
| ### Core Metrics | |
| | Metric | Symbol | Meaning | | |
| |--------|--------|---------| | |
| | **Symbolic Compression Ratio** | SCR | `structural_score / token_count` (higher is better; measures how much parsed structure each token carries) | | |
| | **Semantic Density** | SD | `math_tokens / total_tokens` (share of tokens that are math tokens) | | |
| | **Structural Efficiency** | SE | `parent_child_relations / token_count` (hierarchy relationships encoded per token) | | |
| | **Token Stability** | TS | `1 - CoV(token count across rewritings)`, where CoV is the coefficient of variation (stability of the token count across equivalent rewritings) | | |
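Three of these formulas translate directly into code. The sketch below is not `evaluation/metrics.py`: the function names are hypothetical, the caller decides what counts as a math token via `is_math`, a parent-child relation is counted once per non-root token, and CoV uses the population standard deviation. SCR is left out because `structural_score` is not defined in this section.

```python
from statistics import mean, pstdev


def semantic_density(tokens, is_math):
    """SD = math_tokens / total_tokens."""
    return sum(1 for t in tokens if is_math(t)) / len(tokens)


def structural_efficiency(parent_ids):
    """SE = parent_child_relations / token_count; a root has parent_id -1."""
    return sum(1 for p in parent_ids if p >= 0) / len(parent_ids)


def token_stability(counts):
    """TS = 1 - CoV, with CoV = population std / mean of the token counts."""
    return 1 - pstdev(counts) / mean(counts)


tokens = ["the", "OP_ADD", "VAR_X", "CONST_1"]
print(semantic_density(tokens, lambda t: t.startswith(("OP_", "VAR_", "CONST_"))))
print(structural_efficiency([-1, 0, 0]))
print(token_stability([5, 5, 6]))
```

A tokenizer that emits the same number of tokens for every rewriting of an expression gets a TS of exactly 1.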
| ### Empirical Benchmarks (4-Way Comparison) | |
| The table below reports mean scores over the suite of 70 mathematical test expressions: | |
| | Tokenizer | Mean SCR (↑ Better) | Semantic Density (↑ Better) | Structural Efficiency (↑ Better) | | |
| |:---|:---:|:---:|:---:| | |
| | **MathTok (Ours)** | **0.8501** | **0.5285** | **0.2339** | | |
| | **GPT-2 BPE** | 0.4251 | 0.1838 | 0.1491 | | |
| | **SentencePiece Unigram** | 0.3696 | 0.1499 | 0.1403 | | |
| | **Character-Level** | 0.3708 | 0.1518 | 0.1518 | | |
| > [!NOTE] | |
| > * MathTok's mean SCR is **2.30x** that of SentencePiece Unigram (**0.8501** vs **0.3696**). | |
| > * MathTok's semantic density is **3.53x** that of SentencePiece Unigram (**0.5285** vs **0.1499**), meaning a larger share of its tokens are math tokens. | |
| > * MathTok's structural efficiency is **1.67x** that of SentencePiece Unigram (**0.2339** vs **0.1403**), meaning it encodes more AST parent-child relations per token. | |
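The improvement factors quoted in the note are ratios of table entries, so they can be recomputed against any baseline. The short script below copies the means from the benchmark table; the dictionary layout and the `improvement` helper are illustrative only.

```python
# Mean scores copied from the benchmark table above.
scores = {
    "MathTok": {"SCR": 0.8501, "SD": 0.5285, "SE": 0.2339},
    "GPT-2 BPE": {"SCR": 0.4251, "SD": 0.1838, "SE": 0.1491},
    "SentencePiece Unigram": {"SCR": 0.3696, "SD": 0.1499, "SE": 0.1403},
    "Character-Level": {"SCR": 0.3708, "SD": 0.1518, "SE": 0.1518},
}


def improvement(metric, baseline, ours="MathTok"):
    """Ratio of our mean score to the baseline's mean score."""
    return scores[ours][metric] / scores[baseline][metric]


for baseline in ("GPT-2 BPE", "SentencePiece Unigram", "Character-Level"):
    ratios = {m: round(improvement(m, baseline), 2) for m in ("SCR", "SD", "SE")}
    print(baseline, ratios)
```

Reporting the ratio against every baseline, rather than only the weakest one, gives a fuller picture of the gap.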
| ### High-Impact Visualizations | |
| Running `python -m evaluation.visualize` writes the following figures to `evaluation/results/`: | |
| - **Unified Evaluation Dashboard** (`metrics_dashboard.png`): 3-panel side-by-side display of SCR, Semantic Density, and Structural Efficiency. | |
| - **Overall SCR Comparison** (`scr_comparison.png`): Comparative summary bar chart. | |
| - **Category-Level Breakdowns** (`scr_by_category.png`): SCR analyzed by nested/standard categories. | |
| - **Semantic Density Summary** (`semantic_density_comparison.png`): Ratio of math structure to total tokens. | |
| --- | |
| ## Project Structure | |
| ``` | |
| math_token/ | |
| ├── mathtok/ | |
| │   ├── canonicalizer.py # Layer 1: Canonicalization Engine | |
| │   ├── lexer.py # Layer 2: Hybrid Mathematical Lexer | |
| │   ├── ast_generator.py # Layer 3: AST Generator | |
| │   ├── operator_registry.py # Layer 4: Operator Semantic Registry | |
| │   ├── serializer.py # Layer 5: Structural Traversal & Serialization | |
| │   ├── metadata.py # Layer 6: Structural Attention Metadata | |
| │   ├── vocabulary.py # Layer 7: Two-Tier Vocabulary | |
| │   └── pipeline.py # Orchestrator Pipeline | |
| ├── evaluation/ | |
| │   ├── metrics.py # Definition of core evaluation metrics | |
| │   ├── benchmark.py # Quick benchmarking scripts | |
| │   ├── comparison.py # Full 4-way comparative framework (SentencePiece integrated) | |
| │   ├── visualize.py # Custom dashboard visualization engine | |
| │   └── results/ # JSON/JSONL reports & visual plots | |
| └── tests/ # 110+ passing unit tests | |
| ``` | |
| --- | |
| ## Future Work | |
| - Tree-aware transformer attention integration | |
| - Native mathematical pretraining corpus | |
| - Symbolic reasoning benchmarks | |
| - Neural theorem proving interfaces | |
| - Equation graph embeddings | |
| - Mathematical multimodal tokenization | |
| - Integration with Lean/Coq theorem systems | |
| --- | |
| ## Citation | |
| ```bibtex | |
| @article{surweesh2026mathtok, | |
| title = {MathTok: A Hybrid Canonicalized AST-Based Tokenization Framework for Mathematical Language Modeling and Symbolic Reasoning}, | |
| author = {Surweesh SP}, | |
| year = {2026}, | |
| journal = {Preprint}, | |
| note = {Open-source research framework available on GitHub and Hugging Face}, | |
| keywords = {Mathematical Tokenization, Symbolic AI, Abstract Syntax Trees, LLMs, NLP, Mathematical Reasoning, Canonicalization}, | |
| url = {https://huggingface.co/Surweesh/MathTok} | |
| } | |
| ``` | |
| --- | |
| ## Links | |
| - GitHub: https://github.com/SurweeshSP/mathtok | |
| - Hugging Face: https://huggingface.co/Surweesh/MathTok | |