SurweeshSP committed on
Commit
d56d262
·
verified ·
1 Parent(s): 8c9459c

Update README.md

Files changed (1)
  1. README.md +106 -7
README.md CHANGED
@@ -94,27 +94,126 @@ Compressed Token Stream
94
  ```
95
 
96
  ---
97
 
98
- ## Quick Start
99
 
100
  ```bash
101
- # Install dependencies and package in editable mode
102
  pip install -e ".[eval,dev]"
103
 
104
- # Tokenize an expression using the CLI pipeline
105
  python -m mathtok.pipeline "The derivative of sin(x^2) + 3x"
 
106
 
107
- # Run the comprehensive 110+ test suite
108
  pytest tests/ -v
109
 
110
- # Run the 4-way comparative tokenizer evaluation benchmark
111
- # (MathTok vs GPT-2 BPE vs SentencePiece Unigram vs Char-level)
112
  python -m evaluation.comparison
113
 
114
- # Generate visual plots and the unified metrics dashboard
115
  python -m evaluation.visualize
116
  ```
117
 
118
  ---
119
 
120
  ## Python API
 
94
  ```
95
 
96
  ---
97
+ ## Installation
98
 
99
+ Clone the repository and install the package in editable mode:
100
 
101
  ```bash
102
+ git clone https://github.com/SurweeshSP/mathtok.git
103
+
104
+ cd mathtok
105
+
106
  pip install -e ".[eval,dev]"
107
+ ```
108
+
109
+ ---
110
+ ## Quick Start
111
+
112
+ ### Tokenize a Mathematical Expression
113
+
114
+ Run the tokenizer pipeline directly from the command line:
115
 
116
+ ```bash
117
  python -m mathtok.pipeline "The derivative of sin(x^2) + 3x"
118
+ ```
119
 
120
+ Example output:
121
+
122
+ ```text
123
+ [
124
+ FUNCTION_SIN,
125
+ VARIABLE_x,
126
+ POWER,
127
+ NUMBER_2,
128
+ OP_ADD,
129
+ NUMBER_3,
130
+ VARIABLE_x
131
+ ]
132
+ ```
133
+
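The same command can also be driven from a Python script. The following is a minimal sketch, not part of the MathTok API: `build_command` and `tokenize_via_cli` are hypothetical helper names, and the sketch assumes the package is installed in the active environment and that the pipeline module prints its result to standard output.

```python
import subprocess
import sys


def build_command(expression: str) -> list[str]:
    """Return the argv list for the CLI call shown in the README."""
    # sys.executable keeps the call inside the current (virtual) environment.
    return [sys.executable, "-m", "mathtok.pipeline", expression]


def tokenize_via_cli(expression: str) -> str:
    """Run the pipeline module and return whatever it writes to stdout.

    Assumption: the module prints its token stream to stdout; the exact
    output format is not specified here, so it is returned unparsed.
    """
    result = subprocess.run(
        build_command(expression),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Once the package is installed, calling `tokenize_via_cli("The derivative of sin(x^2) + 3x")` would run the same command as above. Passing the arguments as a list avoids shell quoting issues with characters such as `^` and parentheses.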
134
+ ---
135
+
136
+ ## Running the Test Suite
137
+
138
+ Execute the comprehensive unit and integration test suite:
139
+
140
+ ```bash
141
  pytest tests/ -v
142
+ ```
143
+
144
+ Current coverage includes:
145
+
146
+ - AST generation
147
+ - Canonicalization
148
+ - Lexer validation
149
+ - Pipeline integration
150
+ - Serialization consistency
151
+ - Structural comparison metrics
152
+
153
+ ---
154
 
155
+ ## Comparative Tokenizer Evaluation
156
+
157
+ Run the full benchmark evaluation pipeline:
158
+
159
+ ```bash
160
  python -m evaluation.comparison
161
+ ```
162
+
163
+ This benchmark compares:
164
+
165
+ - MathTok (Hybrid AST Tokenizer)
166
+ - GPT-2 BPE
167
+ - SentencePiece Unigram
168
+ - Character-Level Tokenization
169
+
170
+ Evaluation metrics include:
171
+
172
+ - Symbolic Compression Ratio (SCR)
173
+ - Semantic Density
174
+ - Structural Efficiency
175
+ - Token Fragmentation
176
+ - Sequence Compactness
177
+
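To give a feel for what a token-count comparison measures, here is a small illustrative sketch. It is not the project's SCR definition or its evaluation code: `char_level_tokens` and `token_count_ratio` are hypothetical helpers, the ratio is a plain count ratio chosen for illustration, and the structured token list is copied from the example output in the Quick Start section (which covers only the mathematical part of the input).

```python
def char_level_tokens(expression: str) -> list[str]:
    """Character-level baseline: one token per non-whitespace character."""
    return [ch for ch in expression if not ch.isspace()]


def token_count_ratio(baseline: list[str], candidate: list[str]) -> float:
    """Baseline token count divided by candidate token count.

    A value above 1 means the candidate sequence is shorter.
    This is an illustrative ratio, not MathTok's SCR formula.
    """
    if not candidate:
        raise ValueError("candidate token sequence is empty")
    return len(baseline) / len(candidate)


# Mathematical part of the Quick Start input string.
expression = "sin(x^2) + 3x"

# Token sequence copied from the example output in the Quick Start section.
structured_tokens = [
    "FUNCTION_SIN", "VARIABLE_x", "POWER", "NUMBER_2",
    "OP_ADD", "NUMBER_3", "VARIABLE_x",
]

baseline_tokens = char_level_tokens(expression)
ratio = token_count_ratio(baseline_tokens, structured_tokens)

print(f"char-level: {len(baseline_tokens)} tokens")
print(f"structured: {len(structured_tokens)} tokens")
print(f"ratio: {ratio:.2f}")
```

For this expression the character-level baseline yields 11 tokens against 7 structured tokens. The metrics reported by `evaluation.comparison` may be defined differently.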
178
+ ---
179
+
180
+ ## Visualization Dashboard
181
 
182
+ Generate benchmark plots and the unified evaluation dashboard:
183
+
184
+ ```bash
185
  python -m evaluation.visualize
186
  ```
187
 
188
+ Generated outputs include:
189
+
190
+ - Semantic Density Comparison
191
+ - SCR Comparison
192
+ - Structural Efficiency Comparison
193
+ - Token Count Analysis
194
+ - Unified Metrics Dashboard
195
+
196
+ All generated figures are stored in:
197
+
198
+ ```text
199
+ evaluation/results/
200
+ ```
201
+
202
+ ---
203
+
204
+ ## Repository Structure
205
+
206
+ ```text
207
+ mathtok/
208
+ ├── mathtok/ # Core tokenizer framework
208
+ ├── evaluation/ # Benchmarking and evaluation
209
+ ├── tests/ # Comprehensive test suite
210
+ ├── assets/ # Architecture diagrams
211
+ ├── README.md
212
+ ├── setup.py
214
+ └── pyproject.toml
215
+ ```
216
+
217
  ---
218
 
219
  ## Python API