
ML Pipeline

This directory contains the machine learning pipeline for training, evaluating, and exporting the complexity classifier, a fine-tuned DistilBERT model.

Structure

ml/
├── data/              # Dataset loading and preprocessing
│   └── load_dataset.py
├── training/          # Model training and evaluation
│   ├── train.py       # DistilBERT fine-tuning
│   └── evaluate.py    # Model evaluation
├── export/            # Model export
│   └── convert_to_onnx.py
└── artifacts/         # Saved models and metrics
    ├── model.onnx
    └── metrics.json

Training

# Train the complexity classifier
python -m ml.training.train --dataset arc --epochs 5

# Evaluate the model
python -m ml.training.evaluate --model-dir ml/artifacts/complexity-classifier

# Export to ONNX
python -m ml.export.convert_to_onnx --model-dir ml/artifacts/complexity-classifier

Dataset

The classifier is trained on the ARC dataset (AI2 Reasoning Challenge), which provides two subsets:

  • Easy examples: Simple questions that can be handled by smaller models
  • Challenge examples: Complex questions requiring more capable models
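
As a rough illustration of how the two subsets could become classifier targets, the sketch below attaches a binary label to each question. The label convention (0 for easy, 1 for challenge), the helper names, and the output row shape are assumptions made for this example; they are not taken from `load_dataset.py`. Only the `question` field of an ARC record is used.

```python
from typing import Iterable

# Assumed label convention (not confirmed by the pipeline code):
# 0 = easy (a smaller model suffices), 1 = challenge (needs a more capable model).
LABELS = {"ARC-Easy": 0, "ARC-Challenge": 1}


def label_examples(subset: str, records: Iterable[dict]) -> list[dict]:
    """Attach the subset's complexity label to every question in it."""
    label = LABELS[subset]
    return [{"text": r["question"], "label": label} for r in records]


def build_training_set(easy: Iterable[dict], challenge: Iterable[dict]) -> list[dict]:
    """Merge both subsets into one list of {"text", "label"} rows."""
    return label_examples("ARC-Easy", easy) + label_examples("ARC-Challenge", challenge)


# Tiny in-memory stand-ins for real ARC records.
easy_rows = [{"question": "Which animal lays eggs?"}]
hard_rows = [{"question": "Which factor best explains the change in orbital period?"}]
rows = build_training_set(easy_rows, hard_rows)
```

With real data, the two inputs would be the loaded Easy and Challenge splits; the resulting rows can then be shuffled and tokenized for fine-tuning.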

Alternatively, Easy2Hard-Bench can be used for continuous difficulty scores.
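
If continuous difficulty scores are used with a classifier that expects discrete labels, one option is to threshold them. The following is a minimal sketch that assumes scores are normalized to the range [0, 1]; the function name and the 0.5 cutoff are hypothetical and would need tuning against the actual score distribution.

```python
def score_to_label(score: float, threshold: float = 0.5) -> int:
    """Map a continuous difficulty score to 0 (easy) or 1 (challenge).

    Assumes `score` is normalized to [0, 1]; the default threshold is
    an illustrative choice, not a value taken from the pipeline.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    return int(score >= threshold)


# Example scores (made up for illustration).
scores = [0.12, 0.5, 0.87]
labels = [score_to_label(s) for s in scores]
```

Keeping the raw score alongside the thresholded label makes it possible to revisit the cutoff later without reloading the dataset.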