MVM2 DATASETS AND UNIFIED SCHEMA
================================

Goal
----
Use real public datasets for math reasoning (text) and math OCR (image).
Create a unified dataset format for training and evaluation.
Proposed public datasets (text)
-------------------------------
1) GSM8K
   - Format: JSONL with question and answer fields.
   - Size: ~7.5k training, ~1.3k test problems.
   - Suitability: Grade-school word problems with step-by-step reasoning and final answers.
2) MATH (Hendrycks et al.)
   - Format: JSON with problem, step-by-step solution, and final answer.
   - Size: ~12.5k competition problems.
   - Suitability: Higher difficulty; good for generalization and error analysis.
3) SVAMP
   - Format: JSON with structured fields.
   - Size: ~1k problems.
   - Suitability: Simple arithmetic word problems; good for early testing.
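As a concrete starting point, the GSM8K JSONL records above can be parsed with a few lines of Python. The sketch below assumes the standard `question`/`answer` fields and GSM8K's convention of ending each solution with a `#### <answer>` marker; the sample record is made up for illustration.

```python
import json

def parse_gsm8k_line(line):
    """Split one GSM8K JSONL record into (question, reasoning, final_answer).

    GSM8K solutions end with a '#### <answer>' marker; everything
    before the marker is the step-by-step reasoning.
    """
    record = json.loads(line)
    reasoning, _, final = record["answer"].rpartition("####")
    return record["question"], reasoning.strip(), final.strip()

# Illustrative GSM8K-style record (not from the real dataset):
sample = json.dumps({
    "question": "Tom has 3 apples and buys 4 more. How many does he have?",
    "answer": "He starts with 3 apples and buys 4 more. 3 + 4 = 7\n#### 7",
})
question, steps, answer = parse_gsm8k_line(sample)
print(answer)  # -> 7
```

The same split of reasoning vs. final answer is what the unified schema's `ground_truth_answer` field expects, so this parser doubles as a conversion front end.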
Proposed public datasets (image / OCR)
--------------------------------------
1) CROHME
   - Format: InkML (handwritten math expressions).
   - Size: Thousands of handwritten expressions.
   - Suitability: OCR pipeline evaluation.
2) Im2LaTeX-100K
   - Format: Image + LaTeX pairs.
   - Size: ~100k samples.
   - Suitability: Printed math OCR and text alignment.
3) MathVerse (image + question)
   - Format: Images + problems + answers.
   - Size: Varies by split.
   - Suitability: Multimodal math reasoning evaluation.
Unified dataset schema
----------------------
Each example in the unified JSON should follow:

{
  "problem_id": "...",
  "input_type": "text" | "image",
  "input_text": "...",           // for text problems
  "image_path": "...",           // for image problems
  "ground_truth_answer": "...",
  "split": "train" | "val" | "test"
}
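A small helper can build and sanity-check records in this schema. The sketch below is illustrative: `to_unified` and `validate` are hypothetical names, not part of any existing codebase, and the validation rules (exactly these six keys; `input_text` required for text problems, `image_path` for image problems) are one reasonable reading of the schema above.

```python
REQUIRED_KEYS = {"problem_id", "input_type", "input_text", "image_path",
                 "ground_truth_answer", "split"}

def to_unified(problem_id, question, final_answer, split):
    """Build a unified-schema record for a text problem."""
    return {
        "problem_id": problem_id,
        "input_type": "text",
        "input_text": question,
        "image_path": None,  # unused for text problems
        "ground_truth_answer": final_answer,
        "split": split,
    }

def validate(example):
    """Minimal schema check: required keys present, input_type consistent."""
    missing = REQUIRED_KEYS - set(example)
    assert not missing, f"missing keys: {missing}"
    assert example["input_type"] in ("text", "image")
    if example["input_type"] == "text":
        assert example["input_text"], "text problems need input_text"
    else:
        assert example["image_path"], "image problems need image_path"
    return example

ex = validate(to_unified("gsm8k-0001", "What is 2 + 2?", "4", "train"))
print(ex["problem_id"])  # -> gsm8k-0001
```

Running every converted record through a check like this before training catches missing fields early, when they are cheap to fix.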
Notes
-----
1) Use small slices (100-300 samples) for development.
2) Keep images local and store their paths in image_path.
3) Use separate train/val/test files for training and evaluation.
4) The learned classifier is trained only on features derived from pipeline outputs.
5) LLM and OCR components are evaluated, not trained, here.
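Notes 1 and 3 can be combined into a small utility that writes a reproducible development slice to its own JSONL file. This is a sketch under assumptions: `write_dev_slice` is a hypothetical helper, the output filename is arbitrary, and seeded sampling is one way to make repeated runs pick the same examples.

```python
import json
import random

def write_dev_slice(examples, path, n=200, seed=0):
    """Write a small, seeded random slice of unified-schema records
    to a JSONL file, one record per line."""
    rng = random.Random(seed)  # fixed seed -> same slice every run
    subset = rng.sample(examples, min(n, len(examples)))
    with open(path, "w", encoding="utf-8") as f:
        for ex in subset:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")
    return subset

# Hypothetical usage with synthetic unified-schema records:
examples = [{"problem_id": f"train-{i}", "input_type": "text",
             "input_text": f"Question {i}", "image_path": None,
             "ground_truth_answer": str(i), "split": "train"}
            for i in range(1000)]
dev = write_dev_slice(examples, "dev_slice.jsonl", n=100)
print(len(dev))  # -> 100
```

Keeping the slice in its own file (rather than filtering at load time) means every experiment during development reads exactly the same examples.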