Varshith dharmaj committed on
Commit 7f9b3c2 · verified · 1 Parent(s): bd44d08

Upload docs/datasets.txt with huggingface_hub

Files changed (1)
  1. docs/datasets.txt +62 -62
docs/datasets.txt CHANGED
@@ -1,62 +1,62 @@
- MVM2 DATASETS AND UNIFIED SCHEMA
- ================================
-
- Goal
- ----
- Use real public datasets for math reasoning (text) and OCR math (image).
- Create a unified dataset format for training and evaluation.
-
- Proposed public datasets (text)
- -------------------------------
- 1) GSM8K
- - Format: JSONL with question, answer.
- - Size: ~7.5k training, ~1.3k test (~8.5k total).
- - Suitability: Word problems with step-by-step reasoning and final answers.
-
- 2) MATH (by Hendrycks)
- - Format: JSON with problem, level, type, solution (final answer boxed inside the solution).
- - Size: ~12.5k problems (7.5k train, 5k test).
- - Suitability: Higher difficulty; good for generalization and error analysis.
-
- 3) SVAMP
- - Format: JSON with structured fields (e.g. Body, Question, Equation, Answer).
- - Size: ~1k problems.
- - Suitability: Simple arithmetic word problems; good for early testing.
-
- Proposed public datasets (image / OCR)
- --------------------------------------
- 1) CROHME
- - Format: InkML (handwritten math).
- - Size: Thousands of handwritten expressions.
- - Suitability: OCR pipeline evaluation.
-
- 2) Im2LaTeX-100K
- - Format: Image + LaTeX pairs.
- - Size: ~100k samples.
- - Suitability: Printed math OCR and text alignment.
-
- 3) MathVerse (image + question)
- - Format: Images + problems + answers.
- - Size: Varies by split.
- - Suitability: Multimodal math reasoning evaluation.
-
- Unified dataset schema
- ----------------------
- Each example in unified JSON should follow:
-
- {
- "problem_id": "...",
- "input_type": "text" | "image",
- "input_text": "...", // for text problems
- "image_path": "...", // for image problems
- "ground_truth_answer": "...",
- "split": "train" | "val" | "test"
- }
-
- Notes
- -----
- 1) Use small slices for development (100-300 samples).
- 2) Keep images local and store their paths in image_path.
- 3) Use separate train/val/test files for evaluation and training.
- 4) The learned classifier is trained only on the features derived from pipeline outputs.
- 5) LLM and OCR components are evaluated, not trained here.
 
+ MVM2 DATASETS AND UNIFIED SCHEMA
+ ================================
+
+ Goal
+ ----
+ Use real public datasets for math reasoning (text) and OCR math (image).
+ Create a unified dataset format for training and evaluation.
+
+ Proposed public datasets (text)
+ -------------------------------
+ 1) GSM8K
+ - Format: JSONL with question, answer.
+ - Size: ~7.5k training, ~1.3k test (~8.5k total).
+ - Suitability: Word problems with step-by-step reasoning and final answers.
+
+ 2) MATH (by Hendrycks)
+ - Format: JSON with problem, level, type, solution (final answer boxed inside the solution).
+ - Size: ~12.5k problems (7.5k train, 5k test).
+ - Suitability: Higher difficulty; good for generalization and error analysis.
+
+ 3) SVAMP
+ - Format: JSON with structured fields (e.g. Body, Question, Equation, Answer).
+ - Size: ~1k problems.
+ - Suitability: Simple arithmetic word problems; good for early testing.
+
+ Proposed public datasets (image / OCR)
+ --------------------------------------
+ 1) CROHME
+ - Format: InkML (handwritten math).
+ - Size: Thousands of handwritten expressions.
+ - Suitability: OCR pipeline evaluation.
+
+ 2) Im2LaTeX-100K
+ - Format: Image + LaTeX pairs.
+ - Size: ~100k samples.
+ - Suitability: Printed math OCR and text alignment.
+
+ 3) MathVerse (image + question)
+ - Format: Images + problems + answers.
+ - Size: Varies by split.
+ - Suitability: Multimodal math reasoning evaluation.
+
+ Unified dataset schema
+ ----------------------
+ Each example in unified JSON should follow:
+
+ {
+ "problem_id": "...",
+ "input_type": "text" | "image",
+ "input_text": "...", // for text problems
+ "image_path": "...", // for image problems
+ "ground_truth_answer": "...",
+ "split": "train" | "val" | "test"
+ }
+
+ Notes
+ -----
+ 1) Use small slices for development (100-300 samples).
+ 2) Keep images local and store their paths in image_path.
+ 3) Use separate train/val/test files for evaluation and training.
+ 4) The learned classifier is trained only on the features derived from pipeline outputs.
+ 5) LLM and OCR components are evaluated, not trained here.
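
The unified schema in docs/datasets.txt could be exercised with a small converter. The sketch below is illustrative only and is not part of this commit: the names `UnifiedExample`, `gsm8k_to_unified`, and `to_jsonl_line`, and the `gsm8k-<split>-<index>` id scheme, are assumptions made for the example. It relies on the GSM8K convention that each `answer` string ends with a `#### <final answer>` line, and it assumes that `input_text` is required for text examples and `image_path` for image examples.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

VALID_INPUT_TYPES = {"text", "image"}
VALID_SPLITS = {"train", "val", "test"}


@dataclass
class UnifiedExample:
    """One record in the unified schema (hypothetical helper class)."""
    problem_id: str
    input_type: str
    ground_truth_answer: str
    split: str
    input_text: Optional[str] = None   # for text problems
    image_path: Optional[str] = None   # for image problems

    def validate(self) -> None:
        """Raise ValueError if the record violates the schema."""
        if self.input_type not in VALID_INPUT_TYPES:
            raise ValueError(f"bad input_type: {self.input_type!r}")
        if self.split not in VALID_SPLITS:
            raise ValueError(f"bad split: {self.split!r}")
        # Assumption: each input type requires its matching payload field.
        if self.input_type == "text" and not self.input_text:
            raise ValueError("text example needs input_text")
        if self.input_type == "image" and not self.image_path:
            raise ValueError("image example needs image_path")


def gsm8k_to_unified(record: dict, index: int, split: str) -> UnifiedExample:
    """Map one GSM8K record (question, answer) to the unified schema."""
    # GSM8K puts the final answer after a '####' marker at the end.
    final_answer = record["answer"].split("####")[-1].strip()
    example = UnifiedExample(
        problem_id=f"gsm8k-{split}-{index}",  # hypothetical id scheme
        input_type="text",
        input_text=record["question"],
        ground_truth_answer=final_answer,
        split=split,
    )
    example.validate()
    return example


def to_jsonl_line(example: UnifiedExample) -> str:
    """Serialize one example as a JSON line, omitting unused fields."""
    return json.dumps({k: v for k, v in asdict(example).items() if v is not None})
```

For the development slices mentioned in the notes, one option is to convert only the first 100-300 records of each source file and write each split to its own JSONL file.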