# Multi-Evaluation Orchestrator
A Python script for running multiple evaluation scripts in parallel with organized logging and configurable parameters.
## Overview
This orchestrator runs multiple evaluation scripts (AIMO, AIME, COPA, ART, GoEmotion, GSM8K) against both raw and fine-tuned models, with support for parallel execution, real-time log streaming, and comprehensive result tracking.
## Features
- **Parallel Execution**: Run multiple evaluations simultaneously on different GPUs
- **Real-time Logging**: Stream output to console and log files simultaneously
- **Organized Results**: Timestamped directories with individual and master logs
- **Flexible Configuration**: Override settings via command line or configuration section
- **Path Injection**: Automatically inject model and output paths into evaluation scripts
- **Comprehensive Reporting**: Master log with execution summary and individual results
## Installation
No additional dependencies beyond standard Python libraries and your evaluation scripts.
```bash
chmod +x run_evaluations.py
```
## Configuration
The script resolves its own location so that evaluation scripts can be referenced relative to it:
```python
from pathlib import Path

# Get the directory where this script is located
SCRIPT_DIR = Path(__file__).resolve().parent
```
### Setting Up Paths
Edit the configuration section at the top of the script:
```python
# Model and training paths
RAW_MODEL_PATH = "/path/to/your/raw/model"
TRAINING_DIR = "/path/to/your/training/results"
BASE_OUTPUT_DIR = "/path/to/output/directory"
```
### Configuring CUDA Devices
Specify which GPUs to use for parallel execution:
```python
# GPUs will be used in order (cycling through if needed)
CUDA_DEVICES = ['2', '3'] # Use GPUs 2 and 3
```
### Adding/Removing Evaluation Scripts
Modify the `EVALUATION_SCRIPTS` list (comment out entries to evaluate only a specific subset):
```python
EVALUATION_SCRIPTS = [
    {
        'script': str(SCRIPT_DIR / 'evaluate_aimo_raw_vs_finetuned.py'),
        'name': 'AIMO Dataset Evaluation',
        'output_subdir': 'aimo_evaluation_results',
        'params': {
            'split': 'test',
        },
        'override_terminal': False,  # If True, script params override CLI args
    },
    # Add more scripts here...
]
```
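To illustrate how such an entry might be consumed, here is a minimal sketch that turns one entry into a subprocess command. The helper name `build_command` is hypothetical; the actual orchestrator may build its commands differently:

```python
def build_command(entry):
    """Turn one EVALUATION_SCRIPTS entry into an argv list (illustrative)."""
    cmd = ['python', str(entry['script'])]
    # Each key/value in 'params' becomes a --key value pair
    for key, value in sorted(entry.get('params', {}).items()):
        cmd.extend([f'--{key}', str(value)])
    return cmd

entry = {'script': 'evaluate_aimo_raw_vs_finetuned.py',
         'params': {'split': 'test'}}
print(build_command(entry))
# ['python', 'evaluate_aimo_raw_vs_finetuned.py', '--split', 'test']
```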
## Usage
### Basic Usage
Run all evaluations with the default settings (pass `--parallel 1` to force sequential execution):
```bash
python run_evaluations.py
```
### Parallel Execution
Run 2 evaluations in parallel (will cycle through `CUDA_DEVICES`):
```bash
python run_evaluations.py --parallel 2
```
### Using Checkpoints
#### Option 1: Specific Checkpoint Path
```bash
python run_evaluations.py --checkpoint_path /path/to/checkpoint-640
```
This will use the exact checkpoint you specify for all evaluations.
#### Option 2: Checkpoint Directory
```bash
python run_evaluations.py --checkpoint_dir /path/to/checkpoints
```
This passes the directory to evaluation scripts, which may select checkpoints based on their own logic.
### Setting the Training Directory
The training directory is determined by the `TRAINING_DIR` variable:
```bash
# Override via command line
python run_evaluations.py --training_dir /path/to/specific/training/run
```
Or edit in the configuration section:
```python
TRAINING_DIR = "/home/user/training/results/abductive_dt10.25.17:43_e20_..."
```
### Common Parameters
```bash
# Limit samples for faster testing
python run_evaluations.py --max_samples 100
# Change batch size
python run_evaluations.py --batch_size 4
# Use specific dataset split
python run_evaluations.py --split test
# Skip raw model evaluation
python run_evaluations.py --skip_raw
# Skip fine-tuned model evaluation
python run_evaluations.py --skip_finetuned
```
### Advanced Examples
```bash
# Run 3 evaluations in parallel with custom settings
python run_evaluations.py --parallel 3 --batch_size 4 --max_samples 200
# Use custom checkpoint and disable real-time console output
python run_evaluations.py --checkpoint_path /path/to/checkpoint-1280 --no_realtime
# Override all paths
python run_evaluations.py \
--raw_model_path /custom/raw/model \
--training_dir /custom/training/results \
--base_output_dir /custom/output
```
## Command Line Arguments
### Orchestrator Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--parallel` | int | 2 | Number of scripts to run in parallel |
| `--output_dir` | str | `./multi_evaluation_results` | Output directory for results |
| `--no_realtime` | flag | False | Disable real-time log streaming to console |
### Path Override Arguments
| Argument | Type | Description |
|----------|------|-------------|
| `--raw_model_path` | str | Override raw model path |
| `--training_dir` | str | Override training directory |
| `--base_output_dir` | str | Override base output directory |
### Evaluation Arguments
| Argument | Type | Description |
|----------|------|-------------|
| `--max_samples` | int | Maximum samples to evaluate |
| `--cuda_device` | str | CUDA device to use (sequential mode only) |
| `--batch_size` | int | Batch size for evaluation |
| `--split` | str | Dataset split (`train`, `test`, `validation`) |
| `--skip_raw` | flag | Skip raw model evaluation |
| `--skip_finetuned` | flag | Skip fine-tuned model evaluation |
| `--checkpoint_path` | str | Path to specific checkpoint |
| `--checkpoint_dir` | str | Directory containing checkpoints |
## Output Structure
```
multi_evaluation_results/
└── run_20251105_143022/
    ├── master_log.txt                                     # Consolidated summary
    ├── 01_evaluate_aimo_raw_vs_finetuned.txt              # Individual logs
    ├── 02_evaluate_aime_raw_vs_finetuned.txt
    ├── 03_evaluate_copa_raw_vs_finetuned_guess_cause.txt
    └── ...
```
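The timestamped run directory can be created with a small helper along these lines (`make_run_dir` is a hypothetical name; the actual script may differ):

```python
from datetime import datetime
from pathlib import Path

def make_run_dir(base='multi_evaluation_results'):
    """Create a timestamped run directory such as run_20251105_143022 (sketch)."""
    stamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    run_dir = Path(base) / f'run_{stamp}'
    # parents=True creates the base directory on first use
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```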
### Master Log Contents
- Execution summary (success/failure counts)
- Path configuration used
- Individual script results with:
- Duration
- Status
- Error messages (if any)
- Output directories
- Full command executed
## How It Works
### Path Injection
The orchestrator injects paths into evaluation scripts via environment variables:
- `EVAL_RAW_MODEL_PATH`: Raw model path
- `EVAL_TRAINING_DIR`: Training directory
- `EVAL_OUTPUT_DIR`: Output directory for each evaluation
Your evaluation scripts should read these variables:
```python
import os

raw_model_path = os.getenv('EVAL_RAW_MODEL_PATH', 'default/path')
training_dir = os.getenv('EVAL_TRAINING_DIR', 'default/path')
output_dir = os.getenv('EVAL_OUTPUT_DIR', 'default/path')
```
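On the orchestrator side, the injection amounts to launching each subprocess with an extended environment. A minimal sketch (the helper name `run_with_paths` is hypothetical):

```python
import os
import subprocess
import sys

def run_with_paths(cmd, raw_model_path, training_dir, output_dir):
    """Launch one evaluation subprocess with the three paths injected
    as environment variables (illustrative sketch)."""
    env = dict(os.environ,
               EVAL_RAW_MODEL_PATH=raw_model_path,
               EVAL_TRAINING_DIR=training_dir,
               EVAL_OUTPUT_DIR=output_dir)
    return subprocess.run(cmd, env=env, capture_output=True, text=True)

# Example: a child process that echoes one of the injected variables
result = run_with_paths(
    [sys.executable, '-c', "import os; print(os.environ['EVAL_OUTPUT_DIR'])"],
    '/raw/model', '/training/dir', '/output/dir')
print(result.stdout.strip())  # /output/dir
```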
### Parameter Priority
For each script, parameters are merged in this order (later entries override earlier ones):
1. `DEFAULT_PARAMS` (in the configuration section)
2. Script-specific `params` (in `EVALUATION_SCRIPTS`)
3. Command line arguments (terminal args)
**Exception**: If `override_terminal: True`, the script's `params` take highest priority, overriding any terminal arguments.
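One way to implement these priority rules, honoring the `override_terminal` exception, is sketched below (the helper name and the example `DEFAULT_PARAMS` values are illustrative only):

```python
DEFAULT_PARAMS = {'batch_size': 8, 'split': 'train'}  # illustrative defaults

def merge_params(terminal_args, script_params, override_terminal=False):
    """Merge parameter sources per the priority rules above (sketch)."""
    merged = dict(DEFAULT_PARAMS)
    if override_terminal:
        merged.update(terminal_args)
        merged.update(script_params)   # script params take highest priority
    else:
        merged.update(script_params)
        merged.update(terminal_args)   # terminal args take highest priority
    return merged

print(merge_params({'split': 'test'}, {'split': 'validation'}))
# {'batch_size': 8, 'split': 'test'}
```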
### Parallel Execution
- Scripts are distributed across `CUDA_DEVICES` in round-robin fashion
- Each script gets exclusive access to one GPU
- Real-time logs are thread-safe and properly attributed
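The round-robin distribution can be sketched with `itertools.cycle` (the helper name `assign_devices` is hypothetical):

```python
from itertools import cycle

CUDA_DEVICES = ['2', '3']  # from the configuration section above

def assign_devices(scripts, devices=CUDA_DEVICES):
    """Pair each script with a GPU id in round-robin order (sketch)."""
    return list(zip(scripts, cycle(devices)))

print(assign_devices(['aimo.py', 'aime.py', 'copa.py']))
# [('aimo.py', '2'), ('aime.py', '3'), ('copa.py', '2')]
```

Each subprocess is then launched with `CUDA_VISIBLE_DEVICES` set to its assigned id, which is what gives it exclusive use of that GPU.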
## Troubleshooting
### Script Fails Immediately
Check that:
- Evaluation scripts exist and are executable
- Paths in configuration section are correct
- Required models and datasets are accessible
### GPU Out of Memory
- Reduce `--batch_size`
- Reduce `--parallel` count
- Use fewer/different GPUs in `CUDA_DEVICES`
### Missing Checkpoint
Ensure either:
- `--checkpoint_path` points to a valid checkpoint, or
- `--checkpoint_dir` contains checkpoint files, or
- Evaluation scripts can find checkpoints from `TRAINING_DIR`
### Logs Not Appearing in Console
- Remove `--no_realtime` flag
- Check that scripts are actually producing output
## Exit Codes
- `0`: All evaluations succeeded
- `1`: One or more evaluations failed
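The exit-code rule reduces to a one-liner (the helper name `overall_exit_code` is hypothetical):

```python
def overall_exit_code(successes):
    """Return 0 only if every evaluation succeeded (successes: list of bools)."""
    return 0 if all(successes) else 1

print(overall_exit_code([True, True]))   # 0
print(overall_exit_code([True, False]))  # 1
```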
## Notes
- Each evaluation's output directory is created under `BASE_OUTPUT_DIR/<output_subdir>`
- Logs are buffered line-by-line for real-time streaming
- Master log provides complete audit trail of all executions
- Failed evaluations don't stop other evaluations from running