| # Multi-Evaluation Orchestrator | |
| A Python script for running multiple evaluation scripts in parallel with organized logging and configurable parameters. | |
| ## Overview | |
| This orchestrator runs multiple evaluation scripts (AIMO, AIME, COPA, ART, GoEmotion, GSM8K) against both raw and fine-tuned models, with support for parallel execution, real-time log streaming, and comprehensive result tracking. | |
| ## Features | |
| - **Parallel Execution**: Run multiple evaluations simultaneously on different GPUs | |
| - **Real-time Logging**: Stream output to console and log files simultaneously | |
| - **Organized Results**: Timestamped directories with individual and master logs | |
| - **Flexible Configuration**: Override settings via command line or configuration section | |
| - **Path Injection**: Automatically inject model and output paths into evaluation scripts | |
| - **Comprehensive Reporting**: Master log with execution summary and individual results | |
| ## Installation | |
| The orchestrator itself needs only the Python standard library; the evaluation scripts bring their own dependencies. | |
| ```bash | |
| chmod +x run_evaluations.py | |
| ``` | |
| ## Configuration | |
| ### Setting Up Paths | |
| Edit the configuration section at the top of the script: | |
| ```python | |
| from pathlib import Path | |
| # Directory containing this script; used to locate the evaluation scripts | |
| SCRIPT_DIR = Path(__file__).resolve().parent | |
| # Model and training paths | |
| RAW_MODEL_PATH = "/path/to/your/raw/model" | |
| TRAINING_DIR = "/path/to/your/training/results" | |
| BASE_OUTPUT_DIR = "/path/to/output/directory" | |
| ``` | |
| ### Configuring CUDA Devices | |
| Specify which GPUs to use for parallel execution: | |
| ```python | |
| # GPUs will be used in order (cycling through if needed) | |
| CUDA_DEVICES = ['2', '3'] # Use GPUs 2 and 3 | |
| ``` | |
| ### Adding/Removing Evaluation Scripts | |
| Modify the `EVALUATION_SCRIPTS` list. Comment out entries to evaluate only a specific subset: | |
| ```python | |
| EVALUATION_SCRIPTS = [ | |
|     { | |
|         'script': str(SCRIPT_DIR / 'evaluate_aimo_raw_vs_finetuned.py'), | |
|         'name': 'AIMO Dataset Evaluation', | |
|         'output_subdir': 'aimo_evaluation_results', | |
|         'params': { | |
|             'split': 'test', | |
|         }, | |
|         'override_terminal': False  # If True, script params override CLI args | |
|     }, | |
|     # Add more scripts here... | |
| ] | |
| ``` | |
| ## Usage | |
| ### Basic Usage | |
| Run all evaluations with default settings. The arguments table lists a default of 2 for `--parallel`; pass `--parallel 1` to force sequential execution: | |
| ```bash | |
| python run_evaluations.py | |
| ``` | |
| ### Parallel Execution | |
| Run 2 evaluations in parallel (will cycle through `CUDA_DEVICES`): | |
| ```bash | |
| python run_evaluations.py --parallel 2 | |
| ``` | |
| ### Using Checkpoints | |
| #### Option 1: Specific Checkpoint Path | |
| ```bash | |
| python run_evaluations.py --checkpoint_path /path/to/checkpoint-640 | |
| ``` | |
| This will use the exact checkpoint you specify for all evaluations. | |
| #### Option 2: Checkpoint Directory | |
| ```bash | |
| python run_evaluations.py --checkpoint_dir /path/to/checkpoints | |
| ``` | |
| This passes the directory to evaluation scripts, which may select checkpoints based on their own logic. | |
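How a script picks a checkpoint from the directory is up to that script. As one illustration only, a script might choose the `checkpoint-<step>` subdirectory with the highest step number. The helper `latest_checkpoint` below is hypothetical, and the `checkpoint-<step>` naming is assumed from the example paths in this README (`checkpoint-640`, `checkpoint-1280`).

```python
import re
from pathlib import Path


def latest_checkpoint(checkpoint_dir):
    """Return the `checkpoint-<step>` subdirectory with the highest step, or None.

    Hypothetical selection rule; real evaluation scripts may choose differently.
    """
    candidates = []
    for path in Path(checkpoint_dir).iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if match and path.is_dir():
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare by numeric step so that checkpoint-1280 beats checkpoint-640.
    return max(candidates, key=lambda item: item[0])[1]
```

Comparing the step as an integer matters here: a plain string sort would rank `checkpoint-640` above `checkpoint-1280`.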
| ### Setting the Training Directory | |
| The training directory is determined by the `TRAINING_DIR` variable: | |
| ```bash | |
| # Override via command line | |
| python run_evaluations.py --training_dir /path/to/specific/training/run | |
| ``` | |
| Or edit in the configuration section: | |
| ```python | |
| TRAINING_DIR = "/home/user/training/results/abductive_dt10.25.17:43_e20_..." | |
| ``` | |
| ### Common Parameters | |
| ```bash | |
| # Limit samples for faster testing | |
| python run_evaluations.py --max_samples 100 | |
| # Change batch size | |
| python run_evaluations.py --batch_size 4 | |
| # Use specific dataset split | |
| python run_evaluations.py --split test | |
| # Skip raw model evaluation | |
| python run_evaluations.py --skip_raw | |
| # Skip fine-tuned model evaluation | |
| python run_evaluations.py --skip_finetuned | |
| ``` | |
| ### Advanced Examples | |
| ```bash | |
| # Run 3 evaluations in parallel with custom settings | |
| python run_evaluations.py --parallel 3 --batch_size 4 --max_samples 200 | |
| # Use custom checkpoint and disable real-time console output | |
| python run_evaluations.py --checkpoint_path /path/to/checkpoint-1280 --no_realtime | |
| # Override all paths | |
| python run_evaluations.py \ | |
| --raw_model_path /custom/raw/model \ | |
| --training_dir /custom/training/results \ | |
| --base_output_dir /custom/output | |
| ``` | |
| ## Command Line Arguments | |
| ### Orchestrator Arguments | |
| | Argument | Type | Default | Description | | |
| |----------|------|---------|-------------| | |
| | `--parallel` | int | 2 | Number of scripts to run in parallel | | |
| | `--output_dir` | str | `./multi_evaluation_results` | Output directory for results | | |
| | `--no_realtime` | flag | False | Disable real-time log streaming to console | | |
| ### Path Override Arguments | |
| | Argument | Type | Description | | |
| |----------|------|-------------| | |
| | `--raw_model_path` | str | Override raw model path | | |
| | `--training_dir` | str | Override training directory | | |
| | `--base_output_dir` | str | Override base output directory | | |
| ### Evaluation Arguments | |
| | Argument | Type | Description | | |
| |----------|------|-------------| | |
| | `--max_samples` | int | Maximum samples to evaluate | | |
| | `--cuda_device` | str | CUDA device (used only for sequential execution) | | |
| | `--batch_size` | int | Batch size for evaluation | | |
| | `--split` | str | Dataset split (`train`, `test`, `validation`) | | |
| | `--skip_raw` | flag | Skip raw model evaluation | | |
| | `--skip_finetuned` | flag | Skip fine-tuned model evaluation | | |
| | `--checkpoint_path` | str | Path to specific checkpoint | | |
| | `--checkpoint_dir` | str | Directory containing checkpoints | | |
| ## Output Structure | |
| multi_evaluation_results/ | |
| └── run_20251105_143022/ | |
|     ├── master_log.txt # Consolidated summary | |
|     ├── 01_evaluate_aimo_raw_vs_finetuned.txt # Individual logs | |
|     ├── 02_evaluate_aime_raw_vs_finetuned.txt | |
|     ├── 03_evaluate_copa_raw_vs_finetuned_guess_cause.txt | |
|     └── ... | |
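A layout like the one above could be produced along these lines. This is a minimal sketch, not the orchestrator's actual code: `make_run_dir` and `log_name` are hypothetical helpers, and the naming rules (a `run_YYYYMMDD_HHMMSS` directory, a two-digit index followed by the script's file stem) are inferred from the example tree.

```python
from datetime import datetime
from pathlib import Path


def make_run_dir(output_dir, now=None):
    """Create and return <output_dir>/run_YYYYMMDD_HHMMSS (hypothetical helper)."""
    now = now or datetime.now()
    run_dir = Path(output_dir) / f"run_{now:%Y%m%d_%H%M%S}"
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


def log_name(index, script_path):
    """Name an individual log after its 1-based position and the script's stem."""
    return f"{index:02d}_{Path(script_path).stem}.txt"
```

For example, `log_name(1, "evaluate_aimo_raw_vs_finetuned.py")` gives `01_evaluate_aimo_raw_vs_finetuned.txt`, matching the first individual log in the tree.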
| ### Master Log Contents | |
| - Execution summary (success/failure counts) | |
| - Path configuration used | |
| - Individual script results with: | |
|   - Duration | |
|   - Status | |
|   - Error messages (if any) | |
|   - Output directories | |
|   - Full command executed | |
| ## How It Works | |
| ### Path Injection | |
| The orchestrator injects paths into evaluation scripts via environment variables: | |
| - `EVAL_RAW_MODEL_PATH`: Raw model path | |
| - `EVAL_TRAINING_DIR`: Training directory | |
| - `EVAL_OUTPUT_DIR`: Output directory for each evaluation | |
| Your evaluation scripts should read these variables: | |
| ```python | |
| import os | |
| raw_model_path = os.getenv('EVAL_RAW_MODEL_PATH', 'default/path') | |
| training_dir = os.getenv('EVAL_TRAINING_DIR', 'default/path') | |
| output_dir = os.getenv('EVAL_OUTPUT_DIR', 'default/path') | |
| ``` | |
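On the orchestrator side, the injection might look roughly like the following. It is a sketch under assumptions: `build_env` and `run_with_paths` are hypothetical helpers, and only the three `EVAL_*` variable names come from this README.

```python
import os
import subprocess
import sys


def build_env(raw_model_path, training_dir, output_dir, base_env=None):
    """Return a copy of the environment with the three EVAL_* variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["EVAL_RAW_MODEL_PATH"] = str(raw_model_path)
    env["EVAL_TRAINING_DIR"] = str(training_dir)
    env["EVAL_OUTPUT_DIR"] = str(output_dir)
    return env


def run_with_paths(command, raw_model_path, training_dir, output_dir):
    """Run `command` as a child process that sees the injected paths."""
    env = build_env(raw_model_path, training_dir, output_dir)
    return subprocess.run(command, env=env).returncode
```

A call such as `run_with_paths([sys.executable, "evaluate_aimo_raw_vs_finetuned.py"], ...)` would leave the parent's environment untouched, since the variables are set on a copy.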
| ### Parameter Priority | |
| For each script, parameters are merged in this order (later overrides earlier): | |
| 1. `DEFAULT_PARAMS` (in configuration) | |
| 2. Script-specific `params` (in `EVALUATION_SCRIPTS`) | |
| 3. Command line arguments (terminal args) | |
| **Exception**: If a script sets `override_terminal: True`, its script-specific `params` override command line arguments. | |
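The merge can be pictured with a short sketch. The function `merge_params` is hypothetical, and it rests on one assumption: by default command line arguments win over script-specific `params`, and `override_terminal: True` flips that, which is what the comment on `override_terminal` in the configuration example ("If True, script params override CLI args") implies. It also assumes that options not given on the command line arrive as `None`.

```python
def merge_params(default_params, cli_args, script_entry):
    """Merge parameters for one script; later updates override earlier ones."""
    script_params = script_entry.get("params", {})
    # Only options the user actually supplied take part in the merge.
    cli_params = {key: value for key, value in cli_args.items() if value is not None}

    merged = dict(default_params)
    if script_entry.get("override_terminal", False):
        merged.update(cli_params)
        merged.update(script_params)  # script params have the last word
    else:
        merged.update(script_params)
        merged.update(cli_params)  # command line has the last word
    return merged
```

With defaults `{"split": "train"}`, a script entry that sets `'split': 'test'`, and `--split validation` on the command line, this yields `validation` when `override_terminal` is `False` and `test` when it is `True`.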
| ### Parallel Execution | |
| - Scripts are distributed across `CUDA_DEVICES` in round-robin fashion | |
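| - Each script is assigned a single GPU; that GPU is exclusive to the script only while `--parallel` does not exceed the number of entries in `CUDA_DEVICES` | |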
| - Real-time logs are thread-safe and properly attributed | |
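The round-robin assignment can be sketched in a few lines. `assign_devices` is a hypothetical helper written for illustration; only the `CUDA_DEVICES` list and the cycling behaviour come from this README.

```python
from itertools import cycle

CUDA_DEVICES = ["2", "3"]


def assign_devices(script_names, devices):
    """Pair each script with a device, cycling through `devices` in order."""
    if not devices:
        raise ValueError("at least one CUDA device is required")
    return list(zip(script_names, cycle(devices)))
```

With three scripts and two devices, the third script lands on the first device again, so two scripts can share a GPU if more run concurrently than there are devices.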
| ## Troubleshooting | |
| ### Script Fails Immediately | |
| Check that: | |
| - Evaluation scripts exist and are executable | |
| - Paths in configuration section are correct | |
| - Required models and datasets are accessible | |
| ### GPU Out of Memory | |
| - Reduce `--batch_size` | |
| - Reduce `--parallel` count | |
| - Use fewer/different GPUs in `CUDA_DEVICES` | |
| ### Missing Checkpoint | |
| Ensure either: | |
| - `--checkpoint_path` points to a valid checkpoint, or | |
| - `--checkpoint_dir` contains checkpoint files, or | |
| - Evaluation scripts can find checkpoints from `TRAINING_DIR` | |
| ### Logs Not Appearing in Console | |
| - Remove the `--no_realtime` flag | |
| - Check that scripts are actually producing output | |
| ## Exit Codes | |
| - `0`: All evaluations succeeded | |
| - `1`: One or more evaluations failed | |
| ## Notes | |
| - Each evaluation's output directory is created under `BASE_OUTPUT_DIR/<output_subdir>` | |
| - Logs are buffered line-by-line for real-time streaming | |
| - Master log provides complete audit trail of all executions | |
| - Failed evaluations don't stop other evaluations from running |
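The streaming and failure-handling behaviour in these notes can be sketched as follows. This is an illustration under assumptions, not the orchestrator's code: `run_and_tee` and `run_all` are hypothetical names, the jobs run one after another for brevity, and the `[name]` prefix on console lines is an invented attribution format.

```python
import subprocess
import sys
import threading

_print_lock = threading.Lock()


def run_and_tee(name, command, log_path, realtime=True):
    """Run `command`, copying each output line to `log_path` and optionally the console."""
    with open(log_path, "w", encoding="utf-8") as log:
        with subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # interleave stderr with stdout
            text=True,
            bufsize=1,  # line-buffered reads on the parent's side
        ) as proc:
            for line in proc.stdout:
                log.write(line)
                log.flush()
                if realtime:
                    with _print_lock:  # keep lines from concurrent scripts intact
                        print(f"[{name}] {line}", end="")
            return proc.wait()


def run_all(jobs):
    """Run every (name, command, log_path) job; return 0 only if all succeeded."""
    codes = [run_and_tee(name, command, log_path) for name, command, log_path in jobs]
    return 0 if all(code == 0 for code in codes) else 1
```

Because `run_all` collects every return code before deciding, one failing job does not prevent the rest from running. A child Python process may still buffer its own output when writing to a pipe; launching it with `python -u` or `PYTHONUNBUFFERED=1` avoids that.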