# ML Pipeline
This directory contains the machine learning pipeline for training the complexity classifier.
## Structure
```
ml/
├── data/                  # Dataset loading and preprocessing
│   └── load_dataset.py
├── training/              # Model training and evaluation
│   ├── train.py           # DistilBERT fine-tuning
│   └── evaluate.py        # Model evaluation
├── export/                # Model export
│   └── convert_to_onnx.py
└── artifacts/             # Saved models and metrics
    ├── model.onnx
    └── metrics.json
```
## Training
```bash
# Train the complexity classifier
python -m ml.training.train --dataset arc --epochs 5
# Evaluate the model
python -m ml.training.evaluate --model-dir ml/artifacts/complexity-classifier
# Export to ONNX
python -m ml.export.convert_to_onnx --model-dir ml/artifacts/complexity-classifier
```
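The three commands above depend on each other in order, so it can be convenient to chain them and stop at the first failure. The following is a minimal sketch of one way to do that, not part of this repository: `build_pipeline_commands` and `run_pipeline` are hypothetical helper names, and only the module paths and flags shown above are used.

```python
import subprocess
import sys

# Default model directory, taken from the commands above.
MODEL_DIR = "ml/artifacts/complexity-classifier"


def build_pipeline_commands(dataset="arc", epochs=5, model_dir=MODEL_DIR):
    """Return the train, evaluate, and export commands in execution order."""
    python = sys.executable  # reuse the interpreter running this script
    return [
        [python, "-m", "ml.training.train", "--dataset", dataset, "--epochs", str(epochs)],
        [python, "-m", "ml.training.evaluate", "--model-dir", model_dir],
        [python, "-m", "ml.export.convert_to_onnx", "--model-dir", model_dir],
    ]


def run_pipeline(**kwargs):
    """Run each stage in turn; check=True raises CalledProcessError on failure."""
    for command in build_pipeline_commands(**kwargs):
        subprocess.run(command, check=True)
```

Calling `run_pipeline()` from the repository root would run all three stages with the defaults shown above; `run_pipeline(epochs=3)` overrides a single setting.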
## Dataset
The classifier is trained on the ARC dataset (AI2 Reasoning Challenge), which is split into two partitions that serve as complexity labels:
- **Easy examples**: Simple questions that can be handled by smaller models
- **Challenge examples**: Complex questions requiring more capable models
Alternatively, Easy2Hard-Bench can be used when continuous difficulty scores are preferred over binary labels.
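As an illustration of how the two data sources could map onto classifier targets, here is a minimal sketch. It is not necessarily what `load_dataset.py` does: the label ids, the function names, and the 0.5 threshold are all assumptions, and the threshold further assumes difficulty scores normalized to the range 0 to 1.

```python
EASY, CHALLENGE = 0, 1  # hypothetical label ids for the two classes


def label_partitions(easy_questions, challenge_questions):
    """Attach a binary complexity label to each question based on its ARC partition."""
    records = [{"text": q, "label": EASY} for q in easy_questions]
    records += [{"text": q, "label": CHALLENGE} for q in challenge_questions]
    return records


def label_from_score(score, threshold=0.5):
    """Binarize a continuous difficulty score (assumed to lie in [0, 1]).

    Scores at or above the threshold are treated as complex.
    """
    return CHALLENGE if score >= threshold else EASY
```

For example, `label_partitions(["What color is the sky?"], ["Why do tides vary?"])` yields two records labeled `0` and `1` respectively. With a continuous source, the threshold becomes a tunable trade-off between sending too many questions to the larger model and too few.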