SurweeshSP/mathtok

SurweeshSP committed 4 days ago

Commit d56d262 (verified) · 1 parent: 8c9459c

Update README.md

Files changed (1)

README.md +106 -7

@@ -94,27 +94,126 @@ Compressed Token Stream
 ```
 ---
-## Quick Start
 ```bash
-# Install dependencies and package in editable mode
 pip install -e ".[eval,dev]"
-# Tokenize an expression using the CLI pipeline
 python -m mathtok.pipeline "The derivative of sin(x^2) + 3x"
-# Run the comprehensive 110+ test suite
 pytest tests/ -v
-# Run the 4-way comparative tokenizer evaluation benchmark
-# (MathTok vs GPT-2 BPE vs SentencePiece Unigram vs Char-level)
 python -m evaluation.comparison
-# Generate visual plots and the unified metrics dashboard
 python -m evaluation.visualize
 ```
 ---
 ## Python API

 ```
 ---
+## Installation
+Clone the repository and install the package in editable mode:
 ```bash
+git clone https://github.com/SurweeshSP/mathtok.git
+cd mathtok
 pip install -e ".[eval,dev]"
+```
+---
+## Quick Start
+### Tokenize a Mathematical Expression
+Run the tokenizer pipeline directly from the command line:
+```bash
 python -m mathtok.pipeline "The derivative of sin(x^2) + 3x"
+```
+Example output:
+```text
+[
+  FUNCTION_SIN,
+  VARIABLE_x,
+  POWER,
+  NUMBER_2,
+  OP_ADD,
+  NUMBER_3,
+  VARIABLE_x
+]
+```
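To make the shape of that token list concrete, here is a minimal sketch of a regex lexer that emits labels in the same style. It is not MathTok's implementation: `toy_tokenize` and its rule table are hypothetical names invented for illustration, and unlike the real pipeline this sketch does no AST construction or canonicalization.

```python
import re

# Hypothetical rule table; only mimics the label style of the example output.
TOKEN_RULES = [
    ("FUNCTION", r"sin|cos|tan|log|exp"),
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("VARIABLE", r"[a-z]"),
    ("POWER", r"\^"),
    ("OP_ADD", r"\+"),
    ("SKIP", r"[\s()]+"),  # whitespace and parentheses carry no token here
]
PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_RULES))


def toy_tokenize(expr):
    """Return a flat list of token labels for a small math expression."""
    tokens = []
    pos = 0
    while pos < len(expr):
        match = PATTERN.match(expr, pos)
        if match is None:
            raise ValueError(f"unexpected character {expr[pos]!r} at {pos}")
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind == "SKIP":
            continue
        if kind == "FUNCTION":
            tokens.append(f"FUNCTION_{text.upper()}")
        elif kind in ("NUMBER", "VARIABLE"):
            tokens.append(f"{kind}_{text}")
        else:
            tokens.append(kind)
    return tokens


print(toy_tokenize("sin(x^2) + 3x"))
```

The sketch handles only the symbolic part of the input. Natural-language text such as "The derivative of" would need separate handling, which this toy lexer does not attempt.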
+---
+## Running the Test Suite
+Execute the comprehensive unit and integration test suite:
+```bash
 pytest tests/ -v
+```
+Current coverage includes:
+- AST generation
+- Canonicalization
+- Lexer validation
+- Pipeline integration
+- Serialization consistency
+- Structural comparison metrics
+---
+## Comparative Tokenizer Evaluation
+Run the full benchmark evaluation pipeline:
+```bash
 python -m evaluation.comparison
+```
+This benchmark compares:
+- MathTok (Hybrid AST Tokenizer)
+- GPT-2 BPE
+- SentencePiece Unigram
+- Character-Level Tokenization
+Evaluation metrics include:
+- Symbolic Compression Ratio (SCR)
+- Semantic Density
+- Structural Efficiency
+- Token Fragmentation
+- Sequence Compactness
+---
+## Visualization Dashboard
+Generate benchmark plots and the unified evaluation dashboard:
+```bash
 python -m evaluation.visualize
 ```
+Generated outputs include:
+- Semantic Density Comparison
+- SCR Comparison
+- Structural Efficiency Comparison
+- Token Count Analysis
+- Unified Metrics Dashboard
+All generated figures are stored in:
+```text
+evaluation/results/
+```
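One quick way to check what the visualization step produced is to list that directory. The sketch below assumes it is run from the repository root. The helper name `list_generated_outputs` is hypothetical, and nothing is assumed about the file names or formats the dashboard writes.

```python
from pathlib import Path


def list_generated_outputs(results_dir="evaluation/results"):
    """Return the sorted names of files found directly in results_dir."""
    root = Path(results_dir)
    if not root.is_dir():
        # Nothing generated yet, or run from a different working directory.
        return []
    return sorted(p.name for p in root.iterdir() if p.is_file())


for name in list_generated_outputs():
    print(name)
```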
+---
+## Repository Structure
+```text
+mathtok/
+├── mathtok/                 # Core tokenizer framework
+├── evaluation/              # Benchmarking and evaluation
+├── tests/                   # Comprehensive test suite
+├── assets/                  # Architecture diagrams
+├── README.md
+├── setup.py
+└── pyproject.toml
+```
 ---
 ## Python API