# Multi-Evaluation Orchestrator

A Python script for running multiple evaluation scripts in parallel with organized logging and configurable parameters.
## Overview
This orchestrator runs multiple evaluation scripts (AIMO, AIME, COPA, ART, GoEmotion, GSM8K) against both raw and fine-tuned models, with support for parallel execution, real-time log streaming, and comprehensive result tracking.
## Features

- **Parallel Execution**: Run multiple evaluations simultaneously on different GPUs
- **Real-time Logging**: Stream output to the console and log files simultaneously
- **Organized Results**: Timestamped directories with individual and master logs
- **Flexible Configuration**: Override settings via the command line or the configuration section
- **Path Injection**: Automatically inject model and output paths into evaluation scripts
- **Comprehensive Reporting**: Master log with an execution summary and individual results
## Installation

No additional dependencies beyond the Python standard library and your evaluation scripts.

```bash
chmod +x run_evaluations.py
```
## Configuration

### Setting Up Paths

Edit the configuration section at the top of the script:

```python
# Get the directory where this script is located
SCRIPT_DIR = Path(__file__).resolve().parent

# Model and training paths
RAW_MODEL_PATH = "/path/to/your/raw/model"
TRAINING_DIR = "/path/to/your/training/results"
BASE_OUTPUT_DIR = "/path/to/output/directory"
```
### Configuring CUDA Devices

Specify which GPUs to use for parallel execution:

```python
# GPUs will be used in order (cycling through if needed)
CUDA_DEVICES = ['2', '3']  # Use GPUs 2 and 3
```
### Adding/Removing Evaluation Scripts

Modify the `EVALUATION_SCRIPTS` list (comment out entries to evaluate only a specific subset):

```python
EVALUATION_SCRIPTS = [
    {
        'script': str(SCRIPT_DIR / 'evaluate_aimo_raw_vs_finetuned.py'),
        'name': 'AIMO Dataset Evaluation',
        'output_subdir': 'aimo_evaluation_results',
        'params': {
            'split': 'test',
        },
        'override_terminal': False  # If True, script params override CLI args
    },
    # Add more scripts here...
]
```
## Usage

### Basic Usage

Run all evaluations sequentially with default settings:

```bash
python run_evaluations.py
```

### Parallel Execution

Run two evaluations in parallel (cycling through `CUDA_DEVICES`):

```bash
python run_evaluations.py --parallel 2
```
### Using Checkpoints

#### Option 1: Specific Checkpoint Path

```bash
python run_evaluations.py --checkpoint_path /path/to/checkpoint-640
```

This uses the exact checkpoint you specify for all evaluations.

#### Option 2: Checkpoint Directory

```bash
python run_evaluations.py --checkpoint_dir /path/to/checkpoints
```

This passes the directory to the evaluation scripts, which may select checkpoints based on their own logic.
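A common selection heuristic for such scripts is to pick the checkpoint with the highest step number. A minimal sketch of that idea (the helper name `pick_latest_checkpoint` is illustrative, not part of the orchestrator):

```python
import re
from pathlib import Path
from typing import Optional

def pick_latest_checkpoint(checkpoint_dir: str) -> Optional[Path]:
    """Return the checkpoint-<N> entry with the highest step N, or None."""
    candidates = []
    for entry in Path(checkpoint_dir).glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", entry.name)
        if match:
            candidates.append((int(match.group(1)), entry))
    # Tuples sort by the numeric step first, so max() finds the latest.
    return max(candidates)[1] if candidates else None
```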
### Setting the Training Directory

The training directory is determined by the `TRAINING_DIR` variable:

```bash
# Override via command line
python run_evaluations.py --training_dir /path/to/specific/training/run
```

Or edit it in the configuration section:

```python
TRAINING_DIR = "/home/user/training/results/abductive_dt10.25.17:43_e20_..."
```
### Common Parameters

```bash
# Limit samples for faster testing
python run_evaluations.py --max_samples 100

# Change batch size
python run_evaluations.py --batch_size 4

# Use a specific dataset split
python run_evaluations.py --split test

# Skip raw model evaluation
python run_evaluations.py --skip_raw

# Skip fine-tuned model evaluation
python run_evaluations.py --skip_finetuned
```
### Advanced Examples

```bash
# Run 3 evaluations in parallel with custom settings
python run_evaluations.py --parallel 3 --batch_size 4 --max_samples 200

# Use a custom checkpoint and disable real-time console output
python run_evaluations.py --checkpoint_path /path/to/checkpoint-1280 --no_realtime

# Override all paths
python run_evaluations.py \
    --raw_model_path /custom/raw/model \
    --training_dir /custom/training/results \
    --base_output_dir /custom/output
```
## Command Line Arguments

### Orchestrator Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| `--parallel` | int | 2 | Number of scripts to run in parallel |
| `--output_dir` | str | `./multi_evaluation_results` | Output directory for results |
| `--no_realtime` | flag | False | Disable real-time log streaming to console |
### Path Override Arguments

| Argument | Type | Description |
|---|---|---|
| `--raw_model_path` | str | Override raw model path |
| `--training_dir` | str | Override training directory |
| `--base_output_dir` | str | Override base output directory |
### Evaluation Arguments

| Argument | Type | Description |
|---|---|---|
| `--max_samples` | int | Maximum samples to evaluate |
| `--cuda_device` | str | CUDA device (sequential mode only) |
| `--batch_size` | int | Batch size for evaluation |
| `--split` | str | Dataset split (train, test, validation) |
| `--skip_raw` | flag | Skip raw model evaluation |
| `--skip_finetuned` | flag | Skip fine-tuned model evaluation |
| `--checkpoint_path` | str | Path to a specific checkpoint |
| `--checkpoint_dir` | str | Directory containing checkpoints |
## Output Structure

```
multi_evaluation_results/
└── run_20251105_143022/
    ├── master_log.txt                          # Consolidated summary
    ├── 01_evaluate_aimo_raw_vs_finetuned.txt   # Individual logs
    ├── 02_evaluate_aime_raw_vs_finetuned.txt
    ├── 03_evaluate_copa_raw_vs_finetuned_guess_cause.txt
    └── ...
```
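The timestamped run directory follows a `run_YYYYMMDD_HHMMSS` naming scheme, which can be sketched in a few lines (the helper name `make_run_dir` is illustrative):

```python
from datetime import datetime
from pathlib import Path

def make_run_dir(output_dir: str) -> Path:
    """Create a timestamped run directory like run_20251105_143022."""
    run_dir = Path(output_dir) / f"run_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```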
### Master Log Contents

- Execution summary (success/failure counts)
- Path configuration used
- Individual script results with:
  - Duration
  - Status
  - Error messages (if any)
  - Output directories
  - Full command executed
## How It Works

### Path Injection

The orchestrator injects paths into evaluation scripts via environment variables:

- `EVAL_RAW_MODEL_PATH`: Raw model path
- `EVAL_TRAINING_DIR`: Training directory
- `EVAL_OUTPUT_DIR`: Output directory for each evaluation

Your evaluation scripts should read these variables:
```python
import os

raw_model_path = os.getenv('EVAL_RAW_MODEL_PATH', 'default/path')
training_dir = os.getenv('EVAL_TRAINING_DIR', 'default/path')
output_dir = os.getenv('EVAL_OUTPUT_DIR', 'default/path')
```
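On the orchestrator side, the injection amounts to launching each script with an augmented environment. A minimal sketch of that pattern (function name and paths are illustrative, not the orchestrator's actual internals):

```python
import os
import subprocess
import sys

def run_with_injected_paths(script, raw_model_path, training_dir, output_dir):
    """Launch an evaluation script with the EVAL_* variables set in its environment."""
    env = os.environ.copy()
    env["EVAL_RAW_MODEL_PATH"] = raw_model_path
    env["EVAL_TRAINING_DIR"] = training_dir
    env["EVAL_OUTPUT_DIR"] = output_dir
    return subprocess.run([sys.executable, script], env=env,
                          capture_output=True, text=True)
```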
### Parameter Priority

For each script, parameters are merged in this order (later overrides earlier):

1. `DEFAULT_PARAMS` (in the configuration section)
2. Script-specific `params` (in `EVALUATION_SCRIPTS`)
3. Command line arguments (terminal args)

Exception: if `override_terminal: True`, script params take highest priority.
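Taking the `override_terminal` flag into account, the merge can be sketched with plain dict updates (the function name and argument names here are illustrative, not the orchestrator's actual internals):

```python
def merge_params(default_params, script_params, terminal_args, override_terminal=False):
    """Merge parameter sources; later updates win."""
    merged = dict(default_params)
    if override_terminal:
        # Script params take highest priority over terminal args.
        merged.update(terminal_args)
        merged.update(script_params)
    else:
        # Default: terminal args override script params.
        merged.update(script_params)
        merged.update(terminal_args)
    return merged
```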
### Parallel Execution

- Scripts are distributed across `CUDA_DEVICES` in round-robin fashion
- Each script gets exclusive access to one GPU
- Real-time logs are thread-safe and properly attributed
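The round-robin distribution over `CUDA_DEVICES` can be sketched as follows (a simplified model of the scheduling, not the orchestrator's exact code):

```python
from itertools import cycle

def assign_devices(scripts, cuda_devices):
    """Pair each script with a GPU, cycling through the device list as needed."""
    return list(zip(scripts, cycle(cuda_devices)))
```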
## Troubleshooting

### Script Fails Immediately

Check that:

- Evaluation scripts exist and are executable
- Paths in the configuration section are correct
- Required models and datasets are accessible
### GPU Out of Memory

- Reduce `--batch_size`
- Reduce the `--parallel` count
- Use fewer/different GPUs in `CUDA_DEVICES`
### Missing Checkpoint

Ensure that either:

- `--checkpoint_path` points to a valid checkpoint, or
- `--checkpoint_dir` contains checkpoint files, or
- Evaluation scripts can find checkpoints from `TRAINING_DIR`
### Logs Not Appearing in Console

- Remove the `--no_realtime` flag
- Check that scripts are actually producing output
## Exit Codes

- `0`: All evaluations succeeded
- `1`: One or more evaluations failed
## Notes

- Each evaluation's output directory is created under `BASE_OUTPUT_DIR/<output_subdir>`
- Logs are buffered line-by-line for real-time streaming
- The master log provides a complete audit trail of all executions
- Failed evaluations don't stop other evaluations from running