	---
	language:
	- en

	license: mit

	library_name: custom

	tags:
	- tokenizer
	- symbolic-ai
	- mathematics
	- llm
	- reasoning
	- ast
	- compiler
	- nlp
	- deep-learning
	- machine-learning
	- mathematical-reasoning
	- symbolic-reasoning
	- tokenization
	- parser
	- transformers
	- artificial-intelligence

	pipeline_tag: text-generation

	datasets:
	- custom-mathematical-dataset

	metrics:
	- semantic-density
	- structural-efficiency
	- symbolic-compression-ratio

	model-index:
	- name: MathTok
	  results:
	  - task:
	      type: tokenization
	      name: Mathematical Tokenization
	    dataset:
	      name: Custom Mathematical Benchmark
	      type: symbolic-math
	    metrics:
	    - type: semantic-density
	      value: Improved
	      name: Semantic Density
	    - type: structural-efficiency
	      value: Optimized
	      name: Structural Efficiency
	    - type: symbolic-compression-ratio
	      value: Enhanced
	      name: SCR

	co2_eq_emissions:
	  emissions: 0

	license_name: mit

	pretty_name: MathTok

	thumbnail: assets/mathtok_architecture_improvements.svg
	---

	# MathTok

	A Hybrid Canonicalized AST-Based Tokenization Framework for Mathematical Language Modeling

	![Python](https://img.shields.io/badge/Python-3.10-blue)
	![License](https://img.shields.io/badge/License-MIT-green)
	![HuggingFace](https://img.shields.io/badge/HuggingFace-Live-yellow)
	![Tests](https://img.shields.io/badge/Tests-110%2B-success)
	![Research](https://img.shields.io/badge/Focus-Symbolic%20AI-purple)

	---

	## Why MathTok?

	Traditional tokenizers such as BPE and SentencePiece treat mathematical
	expressions as plain text sequences, fragmenting semantic structure and
	discarding operator hierarchy.

	MathTok introduces a structure-aware tokenization pipeline that:
	- canonicalizes equivalent mathematical expressions,
	- preserves AST hierarchy,
	- encodes operator semantics explicitly,
	- improves symbolic compression efficiency,
	- and enables future tree-aware transformer architectures.

	---

	## Overview

	MathTok is a research-grade tokenizer pipeline that converts raw mathematical expressions (LaTeX or ASCII) into a structured, semantically rich token stream. Unlike standard BPE or SentencePiece tokenizers, MathTok is structure-aware: it builds an Abstract Syntax Tree (AST) from each expression and serializes it with a depth-first (DFS) preorder traversal, so the operator hierarchy survives in the token sequence.

	```
	Raw Mathematical Expression
	↓
	Canonicalization Layer (sympy: simplify, expand, normalize)
	↓
	Hybrid Mathematical Lexer (split TEXT / MATH spans)
	↓
	AST Generator (SymPy tree → typed ASTNode tree)
	↓
	Operator-Aware Semantic Encoder (rich metadata per operator)
	↓
	Structural Serialization (DFS preorder → flat token stream)
	↓
	Structural Attention Metadata (per-token tree context)
	↓
	Vocabulary Mapping + BPE (fixed math vocab + HF BPE for text)
	↓
	Compressed Token Stream
	```

	---

	## Architecture

	![MathTok Architecture](assets/mathtok_architecture_improvements.svg)

	---

	## Installation

	Clone the repository and install the package in editable mode:

	```bash
	git clone https://github.com/SurweeshSP/mathtok.git

	cd mathtok

	pip install -e ".[eval,dev]"
	```

	---

	## Quick Start

	### Tokenize a Mathematical Expression

	Run the tokenizer pipeline directly from the command line:

	```bash
	python -m mathtok.pipeline "The derivative of sin(x^2) + 3x"
	```

	Example output:

	```text
	[
	FUNCTION_SIN,
	VARIABLE_x,
	POWER,
	NUMBER_2,
	OP_ADD,
	NUMBER_3,
	VARIABLE_x
	]
	```

	---

	## Running the Test Suite

	Execute the comprehensive unit and integration test suite:

	```bash
	pytest tests/ -v
	```

	Current coverage includes:

	- AST generation
	- Canonicalization
	- Lexer validation
	- Pipeline integration
	- Serialization consistency
	- Structural comparison metrics

	---

	## Comparative Tokenizer Evaluation

	Run the full benchmark evaluation pipeline:

	```bash
	python -m evaluation.comparison
	```

	This benchmark compares:

	- MathTok (Hybrid AST Tokenizer)
	- GPT-2 BPE
	- SentencePiece Unigram
	- Character-Level Tokenization

	Evaluation metrics include:

	- Symbolic Compression Ratio (SCR)
	- Semantic Density
	- Structural Efficiency
	- Token Fragmentation
	- Sequence Compactness

	---

	## Visualization Dashboard

	Generate benchmark plots and the unified evaluation dashboard:

	```bash
	python -m evaluation.visualize
	```

	Generated outputs include:

	- Semantic Density Comparison
	- SCR Comparison
	- Structural Efficiency Comparison
	- Token Count Analysis
	- Unified Metrics Dashboard

	All generated figures are stored in:

	```text
	evaluation/results/
	```

	---

	## Repository Structure

	```text
	mathtok/
	├── mathtok/          # Core tokenizer framework
	├── evaluation/       # Benchmarking and evaluation
	├── tests/            # Comprehensive test suite
	├── assets/           # Architecture diagrams
	├── README.md
	├── setup.py
	└── pyproject.toml
	```

	---

	## Python API

	```python
	from mathtok import MathTokPipeline

	pipeline = MathTokPipeline()

	# Encode mixed text + math (supporting LaTeX or ASCII syntax)
	out = pipeline.encode("The derivative of $\\sin(x^2)$ is $2x\\cos(x^2)$.")
	print(out.tokens) # ['[MATH_START]', 'FUNC_SIN', 'OP_POW', 'VAR_X', 'CONST_2', '[MATH_END]', ...]
	print(out.sexp) # (FUNC_SIN (OP_POW VAR_X CONST_2))
	print(out.input_ids) # [4, 27, 10, 45, 12, 5, ...]

	# Access structural metadata (for tree-aware attention masking)
	for meta in out.metadata:
	    print(meta.token, meta.depth, meta.tree_position_key)

	# Pure math expression serialization
	out = pipeline.encode_math_only("(x+1)^2")
	print(out.sexp) # (OP_POW (OP_ADD VAR_X CONST_1) CONST_2)

	# HuggingFace-compatible tokenizer export
	hf_tok = pipeline.get_hf_tokenizer()
	hf_tok.save_pretrained("./mathtok-tokenizer")
	result = hf_tok("x^2 + 2*x + 1", return_tensors="pt")
	```

	---

	## Research Contributions

	### 1. Hybrid Lexer
	Separates natural language from mathematical content using LaTeX delimiter detection (`$...$`, `$$...$$`, `\[...\]`) and ASCII math heuristics.
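
A minimal sketch of the delimiter-detection half of such a lexer is shown below. It is not MathTok's actual `lexer.py`: the function name `split_spans`, the `TEXT`/`MATH` labels as tuples, and the exact delimiter set are assumptions made for illustration, and the ASCII math heuristics are not covered.

```python
import re

# Assumed delimiter set: $$...$$, $...$, and \[...\].
# The $$ alternative comes first so display math is not split into two $ spans.
MATH_SPAN = re.compile(r"\$\$(.+?)\$\$|\$(.+?)\$|\\\[(.+?)\\\]", re.DOTALL)


def split_spans(text):
    """Split text into (kind, content) pairs, where kind is 'TEXT' or 'MATH'."""
    spans = []
    pos = 0
    for match in MATH_SPAN.finditer(text):
        if match.start() > pos:
            spans.append(("TEXT", text[pos:match.start()]))
        # Exactly one capture group is set, depending on which delimiter matched.
        body = next(group for group in match.groups() if group is not None)
        spans.append(("MATH", body))
        pos = match.end()
    if pos < len(text):
        spans.append(("TEXT", text[pos:]))
    return spans


for kind, content in split_spans(r"The derivative of $\sin(x^2)$ is $2x\cos(x^2)$."):
    print(kind, repr(content))
```

Only the `MATH` spans would then be handed to the symbolic stages of the pipeline; the `TEXT` spans would go to the BPE tier.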

	### 2. Canonicalization Engine
	Normalizes mathematically equivalent expressions via SymPy's `simplify()` and `expand()`, and relies on SymPy's internal representation, in which subtraction is stored as addition of a negated term and division as multiplication by a reciprocal.
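
The SymPy behavior this layer builds on can be seen directly with plain SymPy calls. The snippet below uses only SymPy itself, not MathTok's `canonicalizer.py`, and the variable names are illustrative.

```python
import sympy as sp

x, y = sp.symbols("x y")

# Subtraction is stored as addition of a negated term.
diff_repr = sp.srepr(x - y)

# Division is stored as multiplication by a reciprocal power.
quot_repr = sp.srepr(x / y)

# Two spellings of the same polynomial expand to structurally equal objects.
same_poly = sp.expand((x + 1) ** 2) == sp.expand(x**2 + 2 * x + 1)

print(diff_repr)
print(quot_repr)
print(same_poly)
```

Because the tree never contains a dedicated subtraction or division node, a serializer built on it presumably needs tokens only for addition, multiplication, and powers to cover all four arithmetic operations.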

	### 3. AST-Based Structural Serialization
	Maps SymPy's expression tree to a typed token vocabulary with semantic metadata per operator. Serializes via DFS preorder traversal.
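
As a rough sketch of what DFS preorder serialization over a SymPy tree looks like, the code below emits a node's token before visiting its children. The token names are modelled on the examples in this README, but the mapping table and the helper names (`node_token`, `preorder_tokens`, `to_sexp`) are assumptions, not MathTok's `serializer.py`.

```python
import sympy as sp

# Hypothetical mapping from SymPy node types to token names.
HEAD_TOKENS = {
    sp.Add: "OP_ADD",
    sp.Mul: "OP_MUL",
    sp.Pow: "OP_POW",
    sp.sin: "FUNC_SIN",
    sp.cos: "FUNC_COS",
}


def node_token(node):
    """Return the token for a single node, ignoring its children."""
    if node.is_Symbol:
        return f"VAR_{node.name.upper()}"
    if node.is_Number:
        return f"CONST_{node}"
    # Fallback for node types missing from the table above.
    return HEAD_TOKENS.get(node.func, f"FUNC_{node.func.__name__.upper()}")


def preorder_tokens(node):
    """Flat token stream: parent first, then each child subtree left to right."""
    tokens = [node_token(node)]
    for child in node.args:
        tokens.extend(preorder_tokens(child))
    return tokens


def to_sexp(node):
    """Parenthesized form of the same preorder walk."""
    if not node.args:
        return node_token(node)
    children = " ".join(to_sexp(child) for child in node.args)
    return f"({node_token(node)} {children})"


x = sp.Symbol("x")
expr = sp.sin(x**2)
print(preorder_tokens(expr))
print(to_sexp(expr))
```

The order of children under `Add` and `Mul` follows SymPy's own argument ordering, so a real serializer may need an explicit sort if a different canonical order is wanted.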

	### 4. Operator Semantic Registry
	Every operator and function carries an explicit metadata record: `arity`, `precedence`, `associativity`, `semantic_role`. This is the primary novelty over standard tokenization.
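
One way such a registry could be laid out is a frozen dataclass per operator, keyed by token name. The four field names come from the description above; the class name, the concrete precedence numbers, and the role strings are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OperatorInfo:
    arity: Optional[int]   # None marks a variadic operator (assumption)
    precedence: int        # larger binds tighter (assumption)
    associativity: str     # "left", "right", or "none"
    semantic_role: str


# Illustrative entries only; the real registry's contents are not shown in this README.
OPERATOR_REGISTRY = {
    "OP_ADD": OperatorInfo(None, 10, "left", "additive_combination"),
    "OP_MUL": OperatorInfo(None, 20, "left", "multiplicative_combination"),
    "OP_POW": OperatorInfo(2, 30, "right", "exponentiation"),
    "FUNC_SIN": OperatorInfo(1, 40, "none", "trigonometric_function"),
}


def binds_tighter(token_a, token_b):
    """True if token_a has strictly higher precedence than token_b."""
    return OPERATOR_REGISTRY[token_a].precedence > OPERATOR_REGISTRY[token_b].precedence


print(OPERATOR_REGISTRY["OP_POW"])
print(binds_tighter("OP_MUL", "OP_ADD"))
```

Freezing the records keeps the registry read-only at runtime, which fits metadata that is meant to be a fixed property of each operator.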

	### 5. Structural Attention Metadata
	Per-token records encoding `depth`, `parent_id`, `children_ids`, `tree_position_key`, and `sibling_count` — enabling future structure-aware attention.
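
The sketch below shows how records with these fields might be computed in a single preorder walk over a SymPy tree. The field names follow the list above, but the dotted path format of `tree_position_key`, the reading of `sibling_count` as "other children of the same parent", and the `NodeMeta` and `build_metadata` names are assumptions rather than MathTok's `metadata.py`.

```python
from dataclasses import dataclass, field
from typing import List, Optional

import sympy as sp


@dataclass
class NodeMeta:
    token: str
    depth: int
    parent_id: Optional[int]
    tree_position_key: str   # assumed format: dotted path of child indices from the root
    sibling_count: int       # assumed meaning: other children of the same parent
    children_ids: List[int] = field(default_factory=list)


def build_metadata(expr):
    """Return one NodeMeta per node, indexed by preorder position."""
    records = []

    def visit(node, depth, parent_id, key, sibling_count):
        node_id = len(records)
        label = node.func.__name__ if node.args else str(node)
        records.append(NodeMeta(label, depth, parent_id, key, sibling_count))
        if parent_id is not None:
            records[parent_id].children_ids.append(node_id)
        for index, child in enumerate(node.args):
            visit(child, depth + 1, node_id, f"{key}.{index}", len(node.args) - 1)

    visit(expr, 0, None, "0", 0)
    return records


x = sp.Symbol("x")
for node_id, meta in enumerate(build_metadata(sp.sin(x**2))):
    print(node_id, meta)
```

Since record indices coincide with preorder token positions, `parent_id` and `children_ids` can be used directly to build a token-to-token mask for structure-aware attention.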

	### 6. Two-Tier Vocabulary
	- Fixed math vocabulary: deterministic IDs for all operators, functions, variables, constants.
	- BPE text vocabulary: HuggingFace `tokenizers` BPE for natural language spans.
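
The ID-space split behind a two-tier vocabulary can be illustrated as follows. This is a simplified sketch: the math token list and the `TwoTierVocab` class are hypothetical, the resulting IDs do not match MathTok's real IDs, and the text tier here is a plain first-seen lookup table standing in for a trained BPE model.

```python
# Tier 1: fixed math vocabulary with deterministic IDs (illustrative subset).
MATH_TOKENS = [
    "[PAD]", "[UNK]", "[MATH_START]", "[MATH_END]",
    "OP_ADD", "OP_MUL", "OP_POW", "FUNC_SIN", "VAR_X", "CONST_2",
]
MATH_VOCAB = {token: index for index, token in enumerate(MATH_TOKENS)}

# Tier 2 IDs start after the math block so the two ranges never collide.
TEXT_OFFSET = len(MATH_VOCAB)


class TwoTierVocab:
    def __init__(self):
        # Stand-in for a BPE vocabulary: IDs are assigned in first-seen order.
        self.text_vocab = {}

    def encode(self, tokens):
        ids = []
        for token in tokens:
            if token in MATH_VOCAB:
                ids.append(MATH_VOCAB[token])
            else:
                local_id = self.text_vocab.setdefault(token, len(self.text_vocab))
                ids.append(TEXT_OFFSET + local_id)
        return ids


vocab = TwoTierVocab()
print(vocab.encode(["the", "value", "[MATH_START]", "FUNC_SIN", "VAR_X", "[MATH_END]"]))
```

Keeping math IDs in a fixed low range means they stay stable even if the text tier is retrained on a different corpus.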

	---

	## Evaluation Metrics & Benchmarks

	### Core Metrics

	| Metric | Symbol | Meaning |
	|--------|--------|---------|
	| Symbolic Compression Ratio | SCR | `structural_score / token_count`. Higher is better; measures how much parsed semantic content each token carries. |
	| Semantic Density | SD | `math_tokens / total_tokens`. Share of tokens that are math tokens; measures information density. |
	| Structural Efficiency | SE | `parent_child_relations / token_count`. Number of hierarchy relationships encoded per token. |
	| Token Stability | TS | `1 - CoV(token count across rewritings)`, where CoV is the coefficient of variation. Measures how stable the token count is across equivalent rewritings of an expression. |
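
The four formulas translate directly into code. The sketch below is not `evaluation/metrics.py`: the function names are illustrative, `structural_score` is taken as an input because its definition is not given here, and using the population standard deviation for the coefficient of variation is an assumption.

```python
import statistics


def symbolic_compression_ratio(structural_score, token_count):
    return structural_score / token_count


def semantic_density(math_tokens, total_tokens):
    return math_tokens / total_tokens


def structural_efficiency(parent_child_relations, token_count):
    return parent_child_relations / token_count


def token_stability(token_counts):
    """1 - CoV, with CoV = population standard deviation / mean (assumption)."""
    mean = statistics.fmean(token_counts)
    return 1.0 - statistics.pstdev(token_counts) / mean


# Hypothetical inputs: 4 math tokens out of 6, 3 parent-child edges,
# and token counts of 8 and 12 for two rewritings of the same expression.
print(semantic_density(4, 6))
print(structural_efficiency(3, 6))
print(token_stability([8, 12]))
```

A tokenizer that produces the same number of tokens for every rewriting has a CoV of zero and therefore a Token Stability of exactly 1.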

	### Empirical Benchmarks (4-Way Comparison)

	Below are the empirical averages computed over our comprehensive suite of 70 mathematical test expressions:

	| Tokenizer | Mean SCR (↑ Better) | Semantic Density (↑ Better) | Structural Efficiency (↑ Better) |
	|:---|:---:|:---:|:---:|
	| MathTok (Ours) | 0.8501 | 0.5285 | 0.2339 |
	| GPT-2 BPE | 0.4251 | 0.1838 | 0.1491 |
	| SentencePiece Unigram | 0.3696 | 0.1499 | 0.1403 |
	| Character-Level | 0.3708 | 0.1518 | 0.1518 |

	> [!NOTE]
	> * MathTok's mean SCR is 2.30x that of SentencePiece Unigram (0.8501 vs 0.3696).
	> * MathTok's semantic density is 3.53x that of SentencePiece Unigram (0.5285 vs 0.1499), meaning a larger share of its tokens are math tokens.
	> * MathTok's structural efficiency is 1.67x that of SentencePiece Unigram (0.2339 vs 0.1403), meaning it encodes more parent-child AST relationships per token.
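
The improvement factors can be recomputed from the table. The snippet below copies the reported averages into a dictionary (the short keys `scr`, `sd`, `se` and the `ratio` helper are illustrative names) and divides MathTok's value by a chosen baseline.

```python
# Averages copied from the benchmark table above.
RESULTS = {
    "MathTok": {"scr": 0.8501, "sd": 0.5285, "se": 0.2339},
    "GPT-2 BPE": {"scr": 0.4251, "sd": 0.1838, "se": 0.1491},
    "SentencePiece Unigram": {"scr": 0.3696, "sd": 0.1499, "se": 0.1403},
    "Character-Level": {"scr": 0.3708, "sd": 0.1518, "se": 0.1518},
}


def ratio(metric, baseline="SentencePiece Unigram"):
    """MathTok's value divided by the baseline's value for one metric."""
    return RESULTS["MathTok"][metric] / RESULTS[baseline][metric]


for metric in ("scr", "sd", "se"):
    print(f"{metric}: {ratio(metric):.2f}x vs SentencePiece Unigram")
```

Passing `baseline="GPT-2 BPE"` gives the corresponding factors against the other subword tokenizer.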

	### High-Impact Visualizations

	The visualization system runs via `python -m evaluation.visualize` and writes the following figures to `evaluation/results/`:
	- Unified Evaluation Dashboard (`metrics_dashboard.png`): 3-panel side-by-side display of SCR, Semantic Density, and Structural Efficiency.
	- Overall SCR Comparison (`scr_comparison.png`): Comparative summary bar chart.
	- Category-Level Breakdowns (`scr_by_category.png`): SCR analyzed by nested/standard categories.
	- Semantic Density Summary (`semantic_density_comparison.png`): Ratio of math structure to total tokens.

	---

	## Project Structure

	```
	mathtok/
	├── mathtok/
	│   ├── canonicalizer.py       # Layer 1: Canonicalization Engine
	│   ├── lexer.py               # Layer 2: Hybrid Mathematical Lexer
	│   ├── ast_generator.py       # Layer 3: AST Generator
	│   ├── operator_registry.py   # Layer 4: Operator Semantic Registry
	│   ├── serializer.py          # Layer 5: Structural Traversal & Serialization
	│   ├── metadata.py            # Layer 6: Structural Attention Metadata
	│   ├── vocabulary.py          # Layer 7: Two-Tier Vocabulary
	│   └── pipeline.py            # Orchestrator Pipeline
	├── evaluation/
	│   ├── metrics.py             # Definition of core evaluation metrics
	│   ├── benchmark.py           # Quick benchmarking scripts
	│   ├── comparison.py          # Full 4-way comparative framework (SentencePiece integrated)
	│   ├── visualize.py           # Custom dashboard visualization engine
	│   └── results/               # JSON/JSONL reports & visual plots
	└── tests/                     # 110+ passing unit tests
	```

	---

	## Future Work

	- Tree-aware transformer attention integration
	- Native mathematical pretraining corpus
	- Symbolic reasoning benchmarks
	- Neural theorem proving interfaces
	- Equation graph embeddings
	- Mathematical multimodal tokenization
	- Integration with Lean/Coq theorem systems

	---

	## Citation

	```bibtex
	@article{surweesh2026mathtok,
	  title    = {MathTok: A Hybrid Canonicalized AST-Based Tokenization Framework for Mathematical Language Modeling and Symbolic Reasoning},
	  author   = {Surweesh SP},
	  year     = {2026},
	  journal  = {Preprint},
	  note     = {Open-source research framework available on GitHub and Hugging Face},
	  keywords = {Mathematical Tokenization, Symbolic AI, Abstract Syntax Trees, LLMs, NLP, Mathematical Reasoning, Canonicalization},
	  url      = {https://huggingface.co/Surweesh/MathTok}
	}
	```

	---

	## Links

	- GitHub: https://github.com/SurweeshSP/mathtok
	- Hugging Face: https://huggingface.co/Surweesh/MathTok