# Multi-Evaluation Orchestrator

A Python script for running multiple evaluation scripts in parallel with organized logging and configurable parameters.

## Overview

This orchestrator runs multiple evaluation scripts (AIMO, AIME, COPA, ART, GoEmotion, GSM8K) against both raw and fine-tuned models, with support for parallel execution, real-time log streaming, and comprehensive result tracking.

## Features

- **Parallel Execution**: Run multiple evaluations simultaneously on different GPUs
- **Real-time Logging**: Stream output to the console and log files simultaneously
- **Organized Results**: Timestamped directories with individual and master logs
- **Flexible Configuration**: Override settings via the command line or the configuration section
- **Path Injection**: Automatically inject model and output paths into evaluation scripts
- **Comprehensive Reporting**: Master log with an execution summary and individual results

## Installation

No additional dependencies beyond the standard Python library and your evaluation scripts.
Make the script executable:

```bash
chmod +x run_evaluations.py
```

## Configuration

All configuration lives at the top of the script. Script-relative paths are resolved from `SCRIPT_DIR`:

```python
# Get the directory where this script is located
SCRIPT_DIR = Path(__file__).resolve().parent
```

### Setting Up Paths

Edit the configuration section at the top of the script:

```python
# Model and training paths
RAW_MODEL_PATH = "/path/to/your/raw/model"
TRAINING_DIR = "/path/to/your/training/results"
BASE_OUTPUT_DIR = "/path/to/output/directory"
```

### Configuring CUDA Devices

Specify which GPUs to use for parallel execution:

```python
# GPUs will be used in order (cycling through if needed)
CUDA_DEVICES = ['2', '3']  # Use GPUs 2 and 3
```

### Adding/Removing Evaluation Scripts

Modify the `EVALUATION_SCRIPTS` list (comment out entries if you want to evaluate just a specific subset):

```python
EVALUATION_SCRIPTS = [
    {
        'script': str(SCRIPT_DIR / 'evaluate_aimo_raw_vs_finetuned.py'),
        'name': 'AIMO Dataset Evaluation',
        'output_subdir': 'aimo_evaluation_results',
        'params': {
            'split': 'test',
        },
        'override_terminal': False  # If True, script params override CLI args
    },
    # Add more scripts here...
]
```

## Usage

### Basic Usage

Run all evaluations sequentially with default settings:

```bash
python run_evaluations.py
```

### Parallel Execution

Run 2 evaluations in parallel (cycling through `CUDA_DEVICES`):

```bash
python run_evaluations.py --parallel 2
```

### Using Checkpoints

#### Option 1: Specific Checkpoint Path

```bash
python run_evaluations.py --checkpoint_path /path/to/checkpoint-640
```

This uses the exact checkpoint you specify for all evaluations.

#### Option 2: Checkpoint Directory

```bash
python run_evaluations.py --checkpoint_dir /path/to/checkpoints
```

This passes the directory to the evaluation scripts, which may select checkpoints based on their own logic.
### Setting the Training Directory

The training directory is determined by the `TRAINING_DIR` variable:

```bash
# Override via command line
python run_evaluations.py --training_dir /path/to/specific/training/run
```

Or edit it in the configuration section:

```python
TRAINING_DIR = "/home/user/training/results/abductive_dt10.25.17:43_e20_..."
```

### Common Parameters

```bash
# Limit samples for faster testing
python run_evaluations.py --max_samples 100

# Change batch size
python run_evaluations.py --batch_size 4

# Use a specific dataset split
python run_evaluations.py --split test

# Skip raw model evaluation
python run_evaluations.py --skip_raw

# Skip fine-tuned model evaluation
python run_evaluations.py --skip_finetuned
```

### Advanced Examples

```bash
# Run 3 evaluations in parallel with custom settings
python run_evaluations.py --parallel 3 --batch_size 4 --max_samples 200

# Use a custom checkpoint and disable real-time console output
python run_evaluations.py --checkpoint_path /path/to/checkpoint-1280 --no_realtime

# Override all paths
python run_evaluations.py \
    --raw_model_path /custom/raw/model \
    --training_dir /custom/training/results \
    --base_output_dir /custom/output
```

## Command Line Arguments

### Orchestrator Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--parallel` | int | 2 | Number of scripts to run in parallel |
| `--output_dir` | str | `./multi_evaluation_results` | Output directory for results |
| `--no_realtime` | flag | False | Disable real-time log streaming to console |

### Path Override Arguments

| Argument | Type | Description |
|----------|------|-------------|
| `--raw_model_path` | str | Override raw model path |
| `--training_dir` | str | Override training directory |
| `--base_output_dir` | str | Override base output directory |

### Evaluation Arguments

| Argument | Type | Description |
|----------|------|-------------|
| `--max_samples` | int | Maximum samples to evaluate |
| `--cuda_device` | str | CUDA device (sequential mode only) |
| `--batch_size` | int | Batch size for evaluation |
| `--split` | str | Dataset split (`train`, `test`, `validation`) |
| `--skip_raw` | flag | Skip raw model evaluation |
| `--skip_finetuned` | flag | Skip fine-tuned model evaluation |
| `--checkpoint_path` | str | Path to a specific checkpoint |
| `--checkpoint_dir` | str | Directory containing checkpoints |

## Output Structure

```
multi_evaluation_results/
└── run_20251105_143022/
    ├── master_log.txt                                     # Consolidated summary
    ├── 01_evaluate_aimo_raw_vs_finetuned.txt              # Individual logs
    ├── 02_evaluate_aime_raw_vs_finetuned.txt
    ├── 03_evaluate_copa_raw_vs_finetuned_guess_cause.txt
    └── ...
```

### Master Log Contents

- Execution summary (success/failure counts)
- Path configuration used
- Individual script results with:
  - Duration
  - Status
  - Error messages (if any)
  - Output directories
  - Full command executed

## How It Works

### Path Injection

The orchestrator injects paths into evaluation scripts via environment variables:

- `EVAL_RAW_MODEL_PATH`: Raw model path
- `EVAL_TRAINING_DIR`: Training directory
- `EVAL_OUTPUT_DIR`: Output directory for each evaluation

Your evaluation scripts should read these variables:

```python
raw_model_path = os.getenv('EVAL_RAW_MODEL_PATH', 'default/path')
training_dir = os.getenv('EVAL_TRAINING_DIR', 'default/path')
output_dir = os.getenv('EVAL_OUTPUT_DIR', 'default/path')
```

### Parameter Priority

For each script, parameters are merged in this order (later overrides earlier):

1. `DEFAULT_PARAMS` (in the configuration section)
2. Script-specific `params` (in `EVALUATION_SCRIPTS`)
3. Command line arguments (terminal args)

**Exception**: If `override_terminal: True`, script params take highest priority instead.
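The merge described above amounts to successive dictionary updates. A sketch of that logic (function and argument names are illustrative, not the orchestrator's actual identifiers):

```python
def merge_params(default_params: dict, cli_args: dict, script_entry: dict) -> dict:
    """Merge parameter sources; later updates override earlier ones."""
    merged = dict(default_params)        # 1. configuration defaults
    script_params = script_entry.get('params', {})
    if script_entry.get('override_terminal', False):
        merged.update(cli_args)          # 2. CLI args
        merged.update(script_params)     # 3. script params win
    else:
        merged.update(script_params)     # 2. script-specific params
        merged.update(cli_args)          # 3. CLI args win
    return merged
```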
### Parallel Execution

- Scripts are distributed across `CUDA_DEVICES` in round-robin fashion
- Each script gets exclusive access to one GPU
- Real-time logs are thread-safe and properly attributed

## Troubleshooting

### Script Fails Immediately

Check that:

- Evaluation scripts exist and are executable
- Paths in the configuration section are correct
- Required models and datasets are accessible

### GPU Out of Memory

- Reduce `--batch_size`
- Reduce the `--parallel` count
- Use fewer/different GPUs in `CUDA_DEVICES`

### Missing Checkpoint

Ensure that either:

- `--checkpoint_path` points to a valid checkpoint, or
- `--checkpoint_dir` contains checkpoint files, or
- The evaluation scripts can find checkpoints under `TRAINING_DIR`

### Logs Not Appearing in Console

- Remove the `--no_realtime` flag
- Check that the scripts are actually producing output

## Exit Codes

- `0`: All evaluations succeeded
- `1`: One or more evaluations failed

## Notes

- Each evaluation's output directory is created under `BASE_OUTPUT_DIR/`
- Logs are buffered line by line for real-time streaming
- The master log provides a complete audit trail of all executions
- Failed evaluations don't stop other evaluations from running
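The round-robin GPU distribution described under Parallel Execution can be sketched as follows. This is a simplified illustration of the pattern, not the orchestrator's exact implementation; `run_on_gpu` and `run_all` are hypothetical names:

```python
import itertools
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

CUDA_DEVICES = ['2', '3']  # mirrors the configuration section

def run_on_gpu(command: list[str], gpu: str) -> int:
    """Run one evaluation pinned to a single GPU via CUDA_VISIBLE_DEVICES."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    return subprocess.run(command, env=env).returncode

def run_all(commands: list[list[str]], parallel: int = 2) -> list[int]:
    """Distribute commands across CUDA_DEVICES round-robin, `parallel` at a time."""
    gpus = itertools.cycle(CUDA_DEVICES)  # cycle through GPUs if needed
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = [pool.submit(run_on_gpu, cmd, next(gpus)) for cmd in commands]
        return [f.result() for f in futures]
```

Setting `CUDA_VISIBLE_DEVICES` in each child's environment is what gives every script exclusive access to its assigned GPU.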