ayushm98 committed on
Commit b7cedbd · 1 Parent(s): 1133321

docs: add README for ML training pipeline

Files changed (1)
  1. ml/README.md +40 -0
ml/README.md ADDED
@@ -0,0 +1,40 @@
# ML Pipeline

This directory contains the machine learning pipeline for training the complexity classifier.

## Structure

```
ml/
├── data/                  # Dataset loading and preprocessing
│   └── load_dataset.py
├── training/              # Model training and evaluation
│   ├── train.py           # DistilBERT fine-tuning
│   └── evaluate.py        # Model evaluation
├── export/                # Model export
│   └── convert_to_onnx.py
└── artifacts/             # Saved models and metrics
    ├── model.onnx
    └── metrics.json
```

## Training

```bash
# Train the complexity classifier
python -m ml.training.train --dataset arc --epochs 5

# Evaluate the model
python -m ml.training.evaluate --model-dir ml/artifacts/complexity-classifier

# Export to ONNX
python -m ml.export.convert_to_onnx --model-dir ml/artifacts/complexity-classifier
```
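
The three commands above can also be chained from a single script so that a failing step stops the run. The following is a minimal sketch, not part of this commit: the helper names `build_pipeline_commands` and `run_pipeline` are hypothetical, and the only flags used are the ones shown in the commands above.

```python
import subprocess
import sys

# Model directory taken from the README commands above.
MODEL_DIR = "ml/artifacts/complexity-classifier"


def build_pipeline_commands(dataset="arc", epochs=5, model_dir=MODEL_DIR):
    """Return the train, evaluate, and export commands as argument lists."""
    py = sys.executable  # run the modules with the current interpreter
    return [
        [py, "-m", "ml.training.train", "--dataset", dataset, "--epochs", str(epochs)],
        [py, "-m", "ml.training.evaluate", "--model-dir", model_dir],
        [py, "-m", "ml.export.convert_to_onnx", "--model-dir", model_dir],
    ]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order; check=True raises on a non-zero exit code."""
    for cmd in commands:
        runner(cmd, check=True)


if __name__ == "__main__":
    # Must be started from the repository root so that `ml` is importable.
    run_pipeline(build_pipeline_commands())
```

Passing `runner` as a parameter keeps the sequencing testable without launching a real training job.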

## Dataset

The classifier is trained on the ARC (AI2 Reasoning Challenge) dataset, which provides:
- **Easy examples**: Simple questions that can be handled by smaller models
- **Challenge examples**: Complex questions requiring more capable models

Alternatively, Easy2Hard-Bench can be used for continuous difficulty scores.
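
As an illustration of how the two ARC subsets could become binary complexity labels, here is a minimal sketch. It is not the contents of `load_dataset.py`: the label values, the `label_examples` and `score_to_label` helpers, and the 0.5 threshold are all assumptions made for the example. It only assumes that each record carries its text in a `question` field, as the ARC release on the Hugging Face Hub does.

```python
# Hypothetical label scheme: 0 = easy, 1 = challenge.
LABELS = {"easy": 0, "challenge": 1}


def label_examples(easy_records, challenge_records):
    """Turn ARC Easy / Challenge records into {"text", "label"} rows."""
    rows = [{"text": r["question"], "label": LABELS["easy"]} for r in easy_records]
    rows += [{"text": r["question"], "label": LABELS["challenge"]} for r in challenge_records]
    return rows


def score_to_label(score, threshold=0.5):
    """Bucket a continuous difficulty score into the same two labels.

    The threshold is an assumed value, not one taken from this repository.
    """
    return LABELS["challenge"] if score >= threshold else LABELS["easy"]


if __name__ == "__main__":
    easy = [{"question": "Which object is attracted to a magnet?"}]
    hard = [{"question": "Which factor most limits the rate of this reaction?"}]
    for row in label_examples(easy, hard):
        print(row["label"], row["text"])
```

Keeping the labelling as plain functions over dictionaries means the same code can be fed from either data source; with continuous scores, `score_to_label` reduces them to the two classes, or the classifier could instead be trained as a regressor.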