---
language:
- en
license: mit
library_name: custom
tags:
- tokenizer
- symbolic-ai
- mathematics
- llm
- reasoning
- ast
- compiler
- nlp
- deep-learning
- machine-learning
- mathematical-reasoning
- symbolic-reasoning
- tokenization
- parser
- transformers
- artificial-intelligence
pipeline_tag: text-generation
datasets:
- custom-mathematical-dataset
metrics:
- semantic-density
- structural-efficiency
- symbolic-compression-ratio
model-index:
- name: MathTok
  results:
  - task:
      type: tokenization
      name: Mathematical Tokenization
    dataset:
      name: Custom Mathematical Benchmark
      type: symbolic-math
    metrics:
    - type: semantic-density
      value: Improved
      name: Semantic Density
    - type: structural-efficiency
      value: Optimized
      name: Structural Efficiency
    - type: symbolic-compression-ratio
      value: Enhanced
      name: SCR
co2_eq_emissions:
  emissions: 0
license_name: mit
pretty_name: MathTok
thumbnail: assets/mathtok_architecture_improvements.svg
---
# MathTok
**A Hybrid Canonicalized AST-Based Tokenization Framework for Mathematical Language Modeling**
![Python](https://img.shields.io/badge/Python-3.10-blue)
![License](https://img.shields.io/badge/License-MIT-green)
![HuggingFace](https://img.shields.io/badge/HuggingFace-Live-yellow)
![Tests](https://img.shields.io/badge/Tests-110%2B-success)
![Research](https://img.shields.io/badge/Focus-Symbolic%20AI-purple)
---
## Why MathTok?
Traditional tokenizers such as BPE and SentencePiece treat mathematical
expressions as plain text sequences, fragmenting semantic structure and
discarding operator hierarchy.
MathTok introduces a structure-aware tokenization pipeline that:
- canonicalizes equivalent mathematical expressions,
- preserves AST hierarchy,
- encodes operator semantics explicitly,
- improves symbolic compression efficiency,
- and enables future tree-aware transformer architectures.
---
## Overview
MathTok is a research-grade tokenizer pipeline that converts raw mathematical expressions (LaTeX or ASCII) into a structured, semantically rich token stream. Unlike standard BPE or SentencePiece tokenizers, MathTok is *structure-aware*: it builds an Abstract Syntax Tree (AST) from each expression and serializes it via depth-first (DFS) preorder traversal, so the token order reflects the expression's operator hierarchy.
```
Raw Mathematical Expression
↓
Canonicalization Layer (SymPy: simplify, expand, normalize)
↓
Hybrid Mathematical Lexer (split TEXT / MATH spans)
↓
AST Generator (SymPy tree → typed ASTNode tree)
↓
Operator-Aware Semantic Encoder (rich metadata per operator)
↓
Structural Serialization (DFS preorder → flat token stream)
↓
Structural Attention Metadata (per-token tree context)
↓
Vocabulary Mapping + BPE (fixed math vocab + HF BPE for text)
↓
Compressed Token Stream
```
---
## Architecture
![MathTok Architecture](assets/mathtok_architecture_improvements.svg)
---
## Installation
Clone the repository and install the package in editable mode:
```bash
git clone https://github.com/SurweeshSP/mathtok.git
cd mathtok
pip install -e ".[eval,dev]"
```
---
## Quick Start
### Tokenize a Mathematical Expression
Run the tokenizer pipeline directly from the command line:
```bash
python -m mathtok.pipeline "The derivative of sin(x^2) + 3x"
```
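Example output (illustrative; token names may differ from the `FUNC_*` / `VAR_*` / `CONST_*` vocabulary shown in the Python API section):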
```text
[
FUNCTION_SIN,
VARIABLE_x,
POWER,
NUMBER_2,
OP_ADD,
NUMBER_3,
VARIABLE_x
]
```
---
## Running the Test Suite
Execute the comprehensive unit and integration test suite:
```bash
pytest tests/ -v
```
Current coverage includes:
- AST generation
- Canonicalization
- Lexer validation
- Pipeline integration
- Serialization consistency
- Structural comparison metrics
---
## Comparative Tokenizer Evaluation
Run the full benchmark evaluation pipeline:
```bash
python -m evaluation.comparison
```
This benchmark compares:
- MathTok (Hybrid AST Tokenizer)
- GPT-2 BPE
- SentencePiece Unigram
- Character-Level Tokenization
Evaluation metrics include:
- Symbolic Compression Ratio (SCR)
- Semantic Density
- Structural Efficiency
- Token Fragmentation
- Sequence Compactness
---
## Visualization Dashboard
Generate benchmark plots and the unified evaluation dashboard:
```bash
python -m evaluation.visualize
```
Generated outputs include:
- Semantic Density Comparison
- SCR Comparison
- Structural Efficiency Comparison
- Token Count Analysis
- Unified Metrics Dashboard
All generated figures are stored in:
```text
evaluation/results/
```
---
## Repository Structure
```text
mathtok/
├── mathtok/ # Core tokenizer framework
├── evaluation/ # Benchmarking and evaluation
├── tests/ # Comprehensive test suite
├── assets/ # Architecture diagrams
├── README.md
├── setup.py
└── pyproject.toml
```
---
## Python API
```python
from mathtok import MathTokPipeline
pipeline = MathTokPipeline()
# Encode mixed text + math (supporting LaTeX or ASCII syntax)
out = pipeline.encode("The derivative of $\\sin(x^2)$ is $2x\\cos(x^2)$.")
print(out.tokens) # ['[MATH_START]', 'FUNC_SIN', 'OP_POW', 'VAR_X', 'CONST_2', '[MATH_END]', ...]
print(out.sexp) # (FUNC_SIN (OP_POW VAR_X CONST_2))
print(out.input_ids) # [4, 27, 10, 45, 12, 5, ...]
# Access structural metadata (for tree-aware attention masking)
for meta in out.metadata:
print(meta.token, meta.depth, meta.tree_position_key)
# Pure math expression serialization
out = pipeline.encode_math_only("(x+1)^2")
print(out.sexp) # (OP_POW (OP_ADD VAR_X CONST_1) CONST_2)
# HuggingFace-compatible tokenizer export
hf_tok = pipeline.get_hf_tokenizer()
hf_tok.save_pretrained("./mathtok-tokenizer")
result = hf_tok("x^2 + 2*x + 1", return_tensors="pt")
```
---
## Research Contributions
### 1. Hybrid Lexer
Separates natural language from mathematical content using LaTeX delimiter detection (`$...$`, `\(...\)`, `\[...\]`) and ASCII math heuristics.
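
The delimiter-detection half of this step can be sketched with a single regular expression. This is a minimal illustration, not MathTok's lexer: the name `split_spans` and the `TEXT` / `MATH` labels are assumptions, and the ASCII math heuristics are not covered.

```python
import re

# Hypothetical span splitter: matches \[...\], \(...\), or $...$ (non-greedy).
MATH_SPAN = re.compile(r"\\\[(.+?)\\\]|\\\((.+?)\\\)|\$(.+?)\$", re.DOTALL)

def split_spans(text):
    """Return a list of (kind, content) pairs, kind being 'TEXT' or 'MATH'."""
    spans, pos = [], 0
    for m in MATH_SPAN.finditer(text):
        if m.start() > pos:
            spans.append(("TEXT", text[pos:m.start()]))
        # Exactly one of the three groups is non-None for any match.
        body = next(g for g in m.groups() if g is not None)
        spans.append(("MATH", body))
        pos = m.end()
    if pos < len(text):
        spans.append(("TEXT", text[pos:]))
    return spans

for kind, content in split_spans(r"The derivative of $\sin(x^2)$ is $2x\cos(x^2)$."):
    print(kind, repr(content))
```

Delimiters are stripped from the `MATH` spans so that a downstream parser sees only the expression body.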
### 2. Canonicalization Engine
Normalizes mathematically equivalent expressions via SymPy's `simplify()`, `expand()`, and internal representation (subtraction → addition + negation, division → multiplication + reciprocal).
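
To make the two rewrites concrete without depending on SymPy, the sketch below applies them to a toy tuple-based tree. The tuple encoding, the `canonicalize` name, and the use of `repr` as a sort key are all assumptions for illustration; MathTok itself delegates this work to SymPy, which also flattens and simplifies in ways this sketch does not.

```python
# Expressions are nested tuples such as ("add", a, b) or ("sub", a, b).
# Leaves are strings (variables) or ints (constants). Toy encoding only.
COMMUTATIVE = {"add", "mul"}

def canonicalize(node):
    """Rewrite sub/div into add/mul form and fix the order of commutative operands."""
    if not isinstance(node, tuple):
        return node
    op, *args = node
    args = [canonicalize(a) for a in args]
    if op == "sub":   # a - b  ->  a + (-1 * b)
        return canonicalize(("add", args[0], ("mul", -1, args[1])))
    if op == "div":   # a / b  ->  a * b**-1
        return canonicalize(("mul", args[0], ("pow", args[1], -1)))
    if op in COMMUTATIVE:
        # Any deterministic total order works; repr() is used here for brevity.
        args = sorted(args, key=repr)
    return (op, *args)

a = canonicalize(("sub", "x", "y"))               # x - y
b = canonicalize(("add", ("mul", -1, "y"), "x"))  # -y + x
print(a == b)
```

Because both spellings reduce to the same tree, they would serialize to the same token sequence downstream.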
### 3. AST-Based Structural Serialization
Maps SymPy's expression tree to a typed token vocabulary with semantic metadata per operator. Serializes via DFS preorder traversal.
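
The preorder serialization can be sketched with Python's standard `ast` module standing in for SymPy's expression tree. The `BINOPS` mapping and the helper names are hypothetical; the token names follow the `OP_*` / `FUNC_*` / `VAR_*` / `CONST_*` pattern used in the Python API examples. No canonicalization is applied here.

```python
import ast

# Hypothetical mapping from Python AST operator types to MathTok-style names.
BINOPS = {ast.Add: "OP_ADD", ast.Sub: "OP_SUB", ast.Mult: "OP_MUL",
          ast.Div: "OP_DIV", ast.Pow: "OP_POW"}

def to_tree(node):
    """Convert a Python expression AST into nested (token, children) pairs."""
    if isinstance(node, ast.BinOp):
        return (BINOPS[type(node.op)], [to_tree(node.left), to_tree(node.right)])
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        return ("FUNC_" + node.func.id.upper(), [to_tree(a) for a in node.args])
    if isinstance(node, ast.Name):
        return ("VAR_" + node.id.upper(), [])
    if isinstance(node, ast.Constant):
        return (f"CONST_{node.value}", [])
    raise ValueError(f"unsupported node: {ast.dump(node)}")

def preorder(tree):
    """Flatten the tree parent-first, then children left to right."""
    token, children = tree
    out = [token]
    for child in children:
        out.extend(preorder(child))
    return out

def sexp(tree):
    """Render the tree as an S-expression string."""
    token, children = tree
    if not children:
        return token
    return "(" + " ".join([token] + [sexp(c) for c in children]) + ")"

def parse(expr):
    # '^' means exponentiation in math notation but XOR in Python, so rewrite it.
    return to_tree(ast.parse(expr.replace("^", "**"), mode="eval").body)

tree = parse("(x+1)^2")
print(preorder(tree))  # ['OP_POW', 'OP_ADD', 'VAR_X', 'CONST_1', 'CONST_2']
print(sexp(tree))      # (OP_POW (OP_ADD VAR_X CONST_1) CONST_2)
```

Preorder places each operator before its operands, so the flat stream can be parsed back into a tree whenever every token's arity is known.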
### 4. Operator Semantic Registry
Every operator and function carries an explicit metadata record: `arity`, `precedence`, `associativity`, `semantic_role`. This is the primary novelty over standard tokenization.
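
A registry of this shape might look like the sketch below. The four field names come from the description above; the class name, the numeric precedence scale, the role labels, and the `needs_parens` helper are assumptions chosen to show one thing the metadata makes possible.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OperatorInfo:
    arity: int           # number of operands
    precedence: int      # higher binds tighter (illustrative scale)
    associativity: str   # "left", "right", or "none"
    semantic_role: str   # hypothetical free-form label

REGISTRY = {
    "OP_ADD": OperatorInfo(2, 10, "left", "additive"),
    "OP_MUL": OperatorInfo(2, 20, "left", "multiplicative"),
    "OP_POW": OperatorInfo(2, 30, "right", "exponentiation"),
    "FUNC_SIN": OperatorInfo(1, 40, "none", "trigonometric"),
}

def needs_parens(parent, child, child_is_right):
    """For binary operators: must `child` be parenthesized under `parent` in infix form?"""
    p, c = REGISTRY[parent], REGISTRY[child]
    if c.precedence != p.precedence:
        return c.precedence < p.precedence
    # Equal precedence: parenthesize the side that associativity does not cover.
    return (p.associativity == "left") == child_is_right

print(needs_parens("OP_POW", "OP_ADD", child_is_right=False))  # (x+1)^2 needs them
print(needs_parens("OP_ADD", "OP_MUL", child_is_right=True))   # x + 2*x does not
```

Storing arity alongside each token is also what allows a flat preorder stream to be decoded back into a tree.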
### 5. Structural Attention Metadata
Per-token records encoding `depth`, `parent_id`, `children_ids`, `tree_position_key`, and `sibling_count`, enabling future structure-aware attention.
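
The sketch below derives these five fields from a small `(token, children)` tree. Several details are assumptions rather than MathTok's actual conventions: token ids are preorder indices, `tree_position_key` is a dotted path of child indices, and `sibling_count` counts the other children of the same parent.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TokenMeta:
    token: str
    depth: int
    parent_id: Optional[int]
    tree_position_key: str
    sibling_count: int
    children_ids: List[int] = field(default_factory=list)

def annotate(tree):
    """Walk a (token, children) tree in preorder and return one record per token."""
    records = []

    def visit(node, depth, parent_id, key, sibling_count):
        token, children = node
        token_id = len(records)  # preorder index doubles as the token id
        records.append(TokenMeta(token, depth, parent_id, key, sibling_count))
        if parent_id is not None:
            records[parent_id].children_ids.append(token_id)
        for index, child in enumerate(children):
            visit(child, depth + 1, token_id, f"{key}.{index}", len(children) - 1)

    visit(tree, 0, None, "0", 0)
    return records

# Tree for (x+1)^2
tree = ("OP_POW", [("OP_ADD", [("VAR_X", []), ("CONST_1", [])]), ("CONST_2", [])])
for meta in annotate(tree):
    print(meta.token, meta.depth, meta.parent_id, meta.tree_position_key)
```

With `parent_id` available for every token, an ancestor or sibling attention mask can be built by following parent links, which is the kind of use the metadata is intended to support.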
### 6. Two-Tier Vocabulary
- **Fixed math vocabulary**: deterministic IDs for all operators, functions, variables, constants.
- **BPE text vocabulary**: HuggingFace `tokenizers` BPE for natural language spans.
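
The split between the two tiers can be sketched as a fixed lookup table plus an offset range. Everything concrete here is illustrative: the token lists, the `[PAD]` and `[UNK]` entries, the resulting ids, and the plain dictionary standing in for a trained BPE vocabulary.

```python
SPECIAL = ["[PAD]", "[UNK]", "[MATH_START]", "[MATH_END]"]
MATH_TOKENS = ["OP_ADD", "OP_MUL", "OP_POW", "FUNC_SIN", "FUNC_COS",
               "VAR_X", "VAR_Y", "CONST_1", "CONST_2"]

# Tier 1: ids are assigned by list position, so they are identical on every run.
MATH_VOCAB = {tok: i for i, tok in enumerate(SPECIAL + MATH_TOKENS)}
TEXT_OFFSET = len(MATH_VOCAB)  # tier 2 ids start after the math block

def encode(tokens, text_vocab):
    """Map math tokens through the fixed table and text tokens through an offset table."""
    ids = []
    for tok in tokens:
        if tok in MATH_VOCAB:
            ids.append(MATH_VOCAB[tok])
        elif tok in text_vocab:
            ids.append(TEXT_OFFSET + text_vocab[tok])
        else:
            ids.append(MATH_VOCAB["[UNK]"])
    return ids

text_vocab = {"is": 0, "the": 1}  # stand-in for a learned BPE vocabulary
print(encode(["[MATH_START]", "FUNC_SIN", "VAR_X", "[MATH_END]", "is"], text_vocab))
```

Keeping the math ids in a reserved low range means retraining the text vocabulary never renumbers an operator or function token.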
---
## Evaluation Metrics & Benchmarks
### Core Metrics
| Metric | Symbol | Meaning |
|--------|--------|---------|
| **Symbolic Compression Ratio** | SCR | `structural_score / token_count` (higher is better; measures the density of parsed semantic content) |
| **Semantic Density** | SD | `math_tokens / total_tokens` (Ratio of high-value math tokens, measures information density) |
| **Structural Efficiency** | SE | `parent_child_relations / token_count` (Ratio of hierarchy relationships encoded per token) |
| **Token Stability** | TS | `1 - CoV(token count across rewritings)` (Fidelity and stability across representations) |
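
Three of these formulas can be computed directly from their definitions, as sketched below. SCR is omitted because `structural_score` is not defined in this README. The prefix-based `is_math` test and the use of the population standard deviation for the coefficient of variation are assumptions; `evaluation/metrics.py` may differ on both.

```python
from statistics import mean, pstdev

MATH_PREFIXES = ("OP_", "FUNC_", "VAR_", "CONST_")

def is_math(token):
    """Hypothetical classifier: a token counts as math if it carries a math prefix."""
    return token.startswith(MATH_PREFIXES)

def semantic_density(tokens):
    """SD = math_tokens / total_tokens."""
    return sum(1 for t in tokens if is_math(t)) / len(tokens)

def structural_efficiency(parent_child_relations, token_count):
    """SE = parent_child_relations / token_count."""
    return parent_child_relations / token_count

def token_stability(token_counts):
    """TS = 1 - CoV of token counts across equivalent rewritings."""
    return 1 - pstdev(token_counts) / mean(token_counts)

tokens = ["[MATH_START]", "OP_POW", "OP_ADD", "VAR_X", "CONST_1", "CONST_2", "[MATH_END]"]
print(semantic_density(tokens))                # 5 math tokens out of 7
print(structural_efficiency(4, len(tokens)))   # a 5-node tree has 4 parent-child edges
print(token_stability([5, 5, 5]))              # identical counts give perfect stability
```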
### Empirical Benchmarks (4-Way Comparison)
The table below reports averages over the benchmark suite of 70 mathematical test expressions:
| Tokenizer | Mean SCR (↑ Better) | Semantic Density (↑ Better) | Structural Efficiency (↑ Better) |
|:---|:---:|:---:|:---:|
| **MathTok (Ours)** | **0.8501** | **0.5285** | **0.2339** |
| **GPT-2 BPE** | 0.4251 | 0.1838 | 0.1491 |
| **SentencePiece Unigram** | 0.3696 | 0.1499 | 0.1403 |
| **Character-Level** | 0.3708 | 0.1518 | 0.1518 |
> [!NOTE]
> * MathTok's mean SCR is **2.30x** that of SentencePiece Unigram (**0.8501** vs **0.3696**).
> * MathTok's Semantic Density is **3.53x** that of SentencePiece Unigram (**0.5285** vs **0.1499**), meaning a larger share of its tokens are math tokens.
> * MathTok's Structural Efficiency is **1.67x** that of SentencePiece Unigram (**0.2339** vs **0.1403**), meaning it encodes more AST parent-child relations per token.
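
The ratios in the note can be recomputed from the rounded values in the table. The snippet below is only a convenience for checking that arithmetic; the dictionary layout and the `ratios` helper are not part of the evaluation code.

```python
# (mean SCR, Semantic Density, Structural Efficiency), copied from the table above.
scores = {
    "MathTok": (0.8501, 0.5285, 0.2339),
    "SentencePiece Unigram": (0.3696, 0.1499, 0.1403),
}

def ratios(ours, baseline):
    """Element-wise ratio of two metric tuples, rounded to two decimals."""
    return tuple(round(a / b, 2) for a, b in zip(ours, baseline))

print(ratios(scores["MathTok"], scores["SentencePiece Unigram"]))
```

Since the inputs are themselves rounded to four decimals, the last digit of each ratio can differ slightly from one computed on the unrounded benchmark results.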
### High-Impact Visualizations
The visualization system runs via `python -m evaluation.visualize` and exports the following figures to [`evaluation/results/`](evaluation/results/):
- **Unified Evaluation Dashboard** (`metrics_dashboard.png`): 3-panel side-by-side display of SCR, Semantic Density, and Structural Efficiency.
- **Overall SCR Comparison** (`scr_comparison.png`): Comparative summary bar chart.
- **Category-Level Breakdowns** (`scr_by_category.png`): SCR analyzed by nested/standard categories.
- **Semantic Density Summary** (`semantic_density_comparison.png`): Ratio of math structure to total tokens.
---
## Project Structure
```
math_token/
├── mathtok/
│   ├── canonicalizer.py # Layer 1: Canonicalization Engine
│   ├── lexer.py # Layer 2: Hybrid Mathematical Lexer
│   ├── ast_generator.py # Layer 3: AST Generator
│   ├── operator_registry.py # Layer 4: Operator Semantic Registry
│   ├── serializer.py # Layer 5: Structural Traversal & Serialization
│   ├── metadata.py # Layer 6: Structural Attention Metadata
│   ├── vocabulary.py # Layer 7: Two-Tier Vocabulary
│   └── pipeline.py # Orchestrator Pipeline
├── evaluation/
│   ├── metrics.py # Definition of core evaluation metrics
│   ├── benchmark.py # Quick benchmarking scripts
│   ├── comparison.py # Full 4-way comparative framework (SentencePiece integrated)
│   ├── visualize.py # Custom dashboard visualization engine
│   └── results/ # JSON/JSONL reports & visual plots
└── tests/ # 110+ passing unit tests
```
---
## Future Work
- Tree-aware transformer attention integration
- Native mathematical pretraining corpus
- Symbolic reasoning benchmarks
- Neural theorem proving interfaces
- Equation graph embeddings
- Mathematical multimodal tokenization
- Integration with Lean/Coq theorem systems
---
## Citation
```bibtex
@article{surweesh2026mathtok,
title = {MathTok: A Hybrid Canonicalized AST-Based Tokenization Framework for Mathematical Language Modeling and Symbolic Reasoning},
author = {Surweesh SP},
year = {2026},
journal = {Preprint},
note = {Open-source research framework available on GitHub and Hugging Face},
keywords = {Mathematical Tokenization, Symbolic AI, Abstract Syntax Trees, LLMs, NLP, Mathematical Reasoning, Canonicalization},
url = {https://huggingface.co/Surweesh/MathTok}
}
```
---
## Links
- GitHub: https://github.com/SurweeshSP/mathtok
- Hugging Face: https://huggingface.co/Surweesh/MathTok