	# Multi-Evaluation Orchestrator

	A Python script for running multiple evaluation scripts in parallel with organized logging and configurable parameters.

	## Overview

	This orchestrator runs multiple evaluation scripts (AIMO, AIME, COPA, ART, GoEmotion, GSM8K) against both raw and fine-tuned models, with support for parallel execution, real-time log streaming, and comprehensive result tracking.

	## Features

	- Parallel Execution: Run multiple evaluations simultaneously on different GPUs
	- Real-time Logging: Stream output to console and log files simultaneously
	- Organized Results: Timestamped directories with individual and master logs
	- Flexible Configuration: Override settings via command line or configuration section
	- Path Injection: Automatically inject model and output paths into evaluation scripts
	- Comprehensive Reporting: Master log with execution summary and individual results

	## Installation

	The orchestrator itself needs nothing beyond the Python standard library and your evaluation scripts. Optionally, make the script executable:

	```bash
	chmod +x run_evaluations.py
	```

	## Configuration

	```python
	from pathlib import Path

	# Get the directory where this script is located
	SCRIPT_DIR = Path(__file__).resolve().parent
	```


	### Setting Up Paths

	Edit the configuration section at the top of the script:

	```python
	# Model and training paths
	RAW_MODEL_PATH = "/path/to/your/raw/model"
	TRAINING_DIR = "/path/to/your/training/results"
	BASE_OUTPUT_DIR = "/path/to/output/directory"
	```

	### Configuring CUDA Devices

	Specify which GPUs to use for parallel execution:

	```python
	# GPUs will be used in order (cycling through if needed)
	CUDA_DEVICES = ['2', '3'] # Use GPUs 2 and 3
	```

	### Adding/Removing Evaluation Scripts

	Modify the `EVALUATION_SCRIPTS` list. To evaluate only a specific set, comment out the entries you do not need:

	```python
	EVALUATION_SCRIPTS = [
	    {
	        'script': str(SCRIPT_DIR / 'evaluate_aimo_raw_vs_finetuned.py'),
	        'name': 'AIMO Dataset Evaluation',
	        'output_subdir': 'aimo_evaluation_results',
	        'params': {
	            'split': 'test',
	        },
	        'override_terminal': False,  # If True, script params override CLI args
	    },
	    # Add more scripts here...
	]
	```
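
Each entry has to be turned into a command line before it can be launched. The following is a minimal sketch of one way to do that, not the orchestrator's actual code: `build_command` is a hypothetical helper, and it assumes that each key in the merged parameters maps to a `--key value` option, with `True` meaning a bare flag.

```python
import sys


def build_command(entry, params):
    """Turn a script entry and a params dict into an argv list (hypothetical helper)."""
    cmd = [sys.executable, entry["script"]]
    for key, value in params.items():
        if value is None or value is False:
            continue  # unset option or disabled flag: emit nothing
        cmd.append(f"--{key}")
        if value is not True:  # bare flags such as --skip_raw carry no value
            cmd.append(str(value))
    return cmd


entry = {"script": "evaluate_aimo_raw_vs_finetuned.py", "params": {"split": "test"}}
params = {**entry["params"], "max_samples": 100, "skip_raw": True, "batch_size": None}
print(build_command(entry, params)[1:])
# ['evaluate_aimo_raw_vs_finetuned.py', '--split', 'test', '--max_samples', '100', '--skip_raw']
```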

	## Usage

	### Basic Usage

	Run all evaluations with the default settings:

	```bash
	python run_evaluations.py
	```

	### Parallel Execution

	Run two evaluations at a time; scripts are assigned to the GPUs in `CUDA_DEVICES` in turn, cycling when needed:

	```bash
	python run_evaluations.py --parallel 2
	```

	### Using Checkpoints

	#### Option 1: Specific Checkpoint Path

	```bash
	python run_evaluations.py --checkpoint_path /path/to/checkpoint-640
	```

	This will use the exact checkpoint you specify for all evaluations.

	#### Option 2: Checkpoint Directory

	```bash
	python run_evaluations.py --checkpoint_dir /path/to/checkpoints
	```

	This passes the directory to evaluation scripts, which may select checkpoints based on their own logic.
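
What "their own logic" looks like depends on each evaluation script. As an illustration only, the sketch below picks the checkpoint with the highest step number. It assumes entries are named `checkpoint-<step>`, as in the `checkpoint-640` example above; `latest_checkpoint` is a hypothetical helper, not a function from the evaluation scripts.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(checkpoint_dir):
    """Return the checkpoint-<step> entry with the highest step, or None if there is none."""
    pattern = re.compile(r"checkpoint-(\d+)$")
    candidates = []
    for child in Path(checkpoint_dir).iterdir():
        match = pattern.match(child.name)
        if match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare steps numerically: as strings, "checkpoint-640" sorts after "checkpoint-1280".
    return max(candidates, key=lambda item: item[0])[1]


with tempfile.TemporaryDirectory() as tmp:
    for step in (320, 640, 1280):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    print(latest_checkpoint(tmp).name)  # checkpoint-1280
```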

	### Setting the Training Directory

	The training directory is determined by the `TRAINING_DIR` variable:

	```bash
	# Override via command line
	python run_evaluations.py --training_dir /path/to/specific/training/run
	```

	Or edit in the configuration section:

	```python
	TRAINING_DIR = "/home/user/training/results/abductive_dt10.25.17:43_e20_..."
	```

	### Common Parameters

	```bash
	# Limit samples for faster testing
	python run_evaluations.py --max_samples 100

	# Change batch size
	python run_evaluations.py --batch_size 4

	# Use specific dataset split
	python run_evaluations.py --split test

	# Skip raw model evaluation
	python run_evaluations.py --skip_raw

	# Skip fine-tuned model evaluation
	python run_evaluations.py --skip_finetuned
	```

	### Advanced Examples

	```bash
	# Run 3 evaluations in parallel with custom settings
	python run_evaluations.py --parallel 3 --batch_size 4 --max_samples 200

	# Use custom checkpoint and disable real-time console output
	python run_evaluations.py --checkpoint_path /path/to/checkpoint-1280 --no_realtime

	# Override all paths
	python run_evaluations.py \
	    --raw_model_path /custom/raw/model \
	    --training_dir /custom/training/results \
	    --base_output_dir /custom/output
	```

	## Command Line Arguments

	### Orchestrator Arguments

	| Argument | Type | Default | Description |
	|----------|------|---------|-------------|
	| `--parallel` | int | 2 | Number of scripts to run in parallel |
	| `--output_dir` | str | `./multi_evaluation_results` | Output directory for results |
	| `--no_realtime` | flag | False | Disable real-time log streaming to console |

	### Path Override Arguments

	| Argument | Type | Description |
	|----------|------|-------------|
	| `--raw_model_path` | str | Override raw model path |
	| `--training_dir` | str | Override training directory |
	| `--base_output_dir` | str | Override base output directory |

	### Evaluation Arguments

	| Argument | Type | Description |
	|----------|------|-------------|
	| `--max_samples` | int | Maximum samples to evaluate |
	| `--cuda_device` | str | CUDA device (sequential runs only) |
	| `--batch_size` | int | Batch size for evaluation |
	| `--split` | str | Dataset split (`train`, `test`, `validation`) |
	| `--skip_raw` | flag | Skip raw model evaluation |
	| `--skip_finetuned` | flag | Skip fine-tuned model evaluation |
	| `--checkpoint_path` | str | Path to specific checkpoint |
	| `--checkpoint_dir` | str | Directory containing checkpoints |
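
For orientation, the arguments listed above could be declared with `argparse` roughly as follows. This is a sketch, not the script's actual parser: the two defaults for `--parallel` and `--output_dir` come from the first table, while the `None` defaults (meaning "not given on the command line") and the `build_parser` name are assumptions.

```python
import argparse


def build_parser():
    """Declare the documented arguments (sketch; None defaults are an assumption)."""
    parser = argparse.ArgumentParser(description="Multi-evaluation orchestrator (sketch)")
    # Orchestrator arguments
    parser.add_argument("--parallel", type=int, default=2)
    parser.add_argument("--output_dir", type=str, default="./multi_evaluation_results")
    parser.add_argument("--no_realtime", action="store_true")
    # Path overrides: None means "keep the value from the configuration section"
    for name in ("--raw_model_path", "--training_dir", "--base_output_dir"):
        parser.add_argument(name, type=str, default=None)
    # Evaluation arguments
    parser.add_argument("--max_samples", type=int, default=None)
    parser.add_argument("--cuda_device", type=str, default=None)
    parser.add_argument("--batch_size", type=int, default=None)
    parser.add_argument("--split", choices=["train", "test", "validation"], default=None)
    parser.add_argument("--skip_raw", action="store_true")
    parser.add_argument("--skip_finetuned", action="store_true")
    parser.add_argument("--checkpoint_path", type=str, default=None)
    parser.add_argument("--checkpoint_dir", type=str, default=None)
    return parser


args = build_parser().parse_args(["--parallel", "3", "--batch_size", "4", "--skip_raw"])
print(args.parallel, args.batch_size, args.skip_raw, args.split)  # 3 4 True None
```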

	## Output Structure

	```
	multi_evaluation_results/
	└── run_20251105_143022/
	    ├── master_log.txt                                       # Consolidated summary
	    ├── 01_evaluate_aimo_raw_vs_finetuned.txt                # Individual logs
	    ├── 02_evaluate_aime_raw_vs_finetuned.txt
	    ├── 03_evaluate_copa_raw_vs_finetuned_guess_cause.txt
	    └── ...
	```
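
The layout above could be produced along these lines. This is a minimal sketch that assumes the run directory is named from the start time and that each log file is named from a running index plus the script's file name; `make_run_dir` and `log_path_for` are hypothetical helpers.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def make_run_dir(output_dir, now=None):
    """Create and return <output_dir>/run_YYYYMMDD_HHMMSS."""
    now = now or datetime.now()
    run_dir = Path(output_dir) / f"run_{now:%Y%m%d_%H%M%S}"
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


def log_path_for(run_dir, index, script_path):
    """Build a numbered log path such as 01_evaluate_aimo_raw_vs_finetuned.txt."""
    return Path(run_dir) / f"{index:02d}_{Path(script_path).stem}.txt"


with tempfile.TemporaryDirectory() as tmp:
    run_dir = make_run_dir(tmp, now=datetime(2025, 11, 5, 14, 30, 22))
    print(run_dir.name)  # run_20251105_143022
    print(log_path_for(run_dir, 1, "evaluate_aimo_raw_vs_finetuned.py").name)
    # 01_evaluate_aimo_raw_vs_finetuned.txt
```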


	### Master Log Contents

	- Execution summary (success/failure counts)
	- Path configuration used
	- Individual script results with:
	  - Duration
	  - Status
	  - Error messages (if any)
	  - Output directories
	  - Full command executed

	## How It Works

	### Path Injection

	The orchestrator injects paths into evaluation scripts via environment variables:

	- `EVAL_RAW_MODEL_PATH`: Raw model path
	- `EVAL_TRAINING_DIR`: Training directory
	- `EVAL_OUTPUT_DIR`: Output directory for each evaluation

	Your evaluation scripts should read these variables:

	```python
	raw_model_path = os.getenv('EVAL_RAW_MODEL_PATH', 'default/path')
	training_dir = os.getenv('EVAL_TRAINING_DIR', 'default/path')
	output_dir = os.getenv('EVAL_OUTPUT_DIR', 'default/path')
	```
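
On the orchestrator side, injection can be as simple as copying the current environment and adding the three variables before the child process starts. The sketch below assumes the scripts are launched with `subprocess`; `build_eval_env` is a hypothetical helper, and the child command is a stand-in for a real evaluation script.

```python
import os
import subprocess
import sys


def build_eval_env(raw_model_path, training_dir, output_dir):
    """Return a copy of the current environment with the EVAL_* paths added."""
    env = os.environ.copy()  # keep PATH, CUDA settings, and everything else
    env["EVAL_RAW_MODEL_PATH"] = str(raw_model_path)
    env["EVAL_TRAINING_DIR"] = str(training_dir)
    env["EVAL_OUTPUT_DIR"] = str(output_dir)
    return env


env = build_eval_env("/path/to/raw/model", "/path/to/training", "/path/to/out")
# Stand-in child process that reads one injected variable, as an evaluation script would.
child = "import os; print(os.getenv('EVAL_OUTPUT_DIR', 'default/path'))"
subprocess.run([sys.executable, "-c", child], env=env, check=True)
```

Because the variables are set only in the copy passed to the child, each evaluation can receive its own `EVAL_OUTPUT_DIR` even when several run at once.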

	### Parameter Priority

	For each script, parameters are merged in this order (later overrides earlier):

	1. `DEFAULT_PARAMS` (in configuration)
	2. Script-specific `params` (in `EVALUATION_SCRIPTS`)
	3. Command line arguments (terminal args)

	Exception: if a script sets `'override_terminal': True`, its `params` are applied last and override the command line arguments.
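
A minimal sketch of such a merge, following the `override_terminal` comment in the configuration example ("If True, script params override CLI args"): defaults are applied first, and the flag decides whether the script's `params` or the command line arguments are applied last. `merge_params` and the example default values are hypothetical, and treating `None` as "not passed" is an assumption.

```python
def merge_params(default_params, cli_args, script_params, override_terminal=False):
    """Merge parameter dicts; whatever is applied later wins."""
    # Ignore CLI options the user did not pass (assumed to be left as None).
    cli_given = {k: v for k, v in cli_args.items() if v is not None}

    merged = dict(default_params)
    if override_terminal:
        merged.update(cli_given)
        merged.update(script_params)  # script params win
    else:
        merged.update(script_params)
        merged.update(cli_given)  # command line wins
    return merged


defaults = {"batch_size": 8, "split": "validation"}  # hypothetical DEFAULT_PARAMS
cli = {"batch_size": 4, "split": "train", "max_samples": None}
script = {"split": "test"}

print(merge_params(defaults, cli, script))
# {'batch_size': 4, 'split': 'train'}
print(merge_params(defaults, cli, script, override_terminal=True))
# {'batch_size': 4, 'split': 'test'}
```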

	### Parallel Execution

	- Scripts are distributed across `CUDA_DEVICES` in round-robin fashion
	- Each script is assigned a single GPU; access is exclusive only while `--parallel` does not exceed the number of entries in `CUDA_DEVICES`
	- Real-time logs are thread-safe and properly attributed
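
Round-robin assignment can be sketched with `itertools.cycle`. The `assign_devices` helper and the short script names are hypothetical; the device list mirrors the `CUDA_DEVICES` example from the configuration section.

```python
from itertools import cycle

CUDA_DEVICES = ["2", "3"]


def assign_devices(script_names, devices):
    """Pair each script with a device, wrapping around when scripts outnumber devices."""
    return list(zip(script_names, cycle(devices)))


for name, device in assign_devices(["aimo", "aime", "copa"], CUDA_DEVICES):
    print(f"{name} -> GPU {device}")
# aimo -> GPU 2
# aime -> GPU 3
# copa -> GPU 2
```

With three scripts and two devices, the first and third script land on the same GPU, which is worth keeping in mind when choosing `--parallel`.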

	## Troubleshooting

	### Script Fails Immediately

	Check that:
	- Evaluation scripts exist and are executable
	- Paths in configuration section are correct
	- Required models and datasets are accessible

	### GPU Out of Memory

	- Reduce `--batch_size`
	- Reduce `--parallel` count
	- Use fewer/different GPUs in `CUDA_DEVICES`

	### Missing Checkpoint

	Ensure either:
	- `--checkpoint_path` points to a valid checkpoint, or
	- `--checkpoint_dir` contains checkpoint files, or
	- Evaluation scripts can find checkpoints from `TRAINING_DIR`

	### Logs Not Appearing in Console

	- Remove `--no_realtime` flag
	- Check that scripts are actually producing output

	## Exit Codes

	- `0`: All evaluations succeeded
	- `1`: One or more evaluations failed
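
A minimal sketch of how such an exit code can be derived from the individual results. `run_all` and `overall_exit_code` are hypothetical names, and the two child commands stand in for real evaluation scripts; an actual orchestrator would pass the final value to `sys.exit`.

```python
import subprocess
import sys


def run_all(commands):
    """Run every command, continuing after failures, and collect the return codes."""
    return [subprocess.run(cmd).returncode for cmd in commands]


def overall_exit_code(return_codes):
    """Return 0 only if every evaluation exited with 0, otherwise 1."""
    return 0 if all(code == 0 for code in return_codes) else 1


commands = [
    [sys.executable, "-c", "print('ok')"],
    [sys.executable, "-c", "raise SystemExit(3)"],  # simulated failing evaluation
]
codes = run_all(commands)
print(codes, "->", overall_exit_code(codes))  # [0, 3] -> 1
```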

	## Notes

	- Each evaluation's output directory is created under `BASE_OUTPUT_DIR/<output_subdir>`
	- Logs are buffered line-by-line for real-time streaming
	- Master log provides complete audit trail of all executions
	- Failed evaluations don't stop other evaluations from running
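
Line-by-line streaming to both the console and a log file can be sketched as below. This illustrates the general technique and is not the orchestrator's actual implementation: `stream_to_log`, the `[label]` prefix, and the shared `print_lock` are assumptions.

```python
import os
import subprocess
import sys
import tempfile
import threading

print_lock = threading.Lock()  # keeps console lines from parallel scripts intact


def stream_to_log(cmd, log_path, label, realtime=True):
    """Run cmd, copying each output line to log_path and optionally to the console."""
    with open(log_path, "w", encoding="utf-8") as log, subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into the same stream
        text=True,
        bufsize=1,  # line-buffered reads on the orchestrator side
    ) as proc:
        for line in proc.stdout:
            log.write(line)
            log.flush()
            if realtime:
                with print_lock:
                    print(f"[{label}] {line}", end="", flush=True)
    return proc.returncode  # set once the Popen context manager has waited


with tempfile.TemporaryDirectory() as tmp:
    log_file = os.path.join(tmp, "01_demo.txt")
    child = [sys.executable, "-c", "print('hello'); print('done')"]
    print("exit code:", stream_to_log(child, log_file, "demo"))
```

A Python child process block-buffers its stdout when it writes to a pipe, so output may arrive in bursts unless the child is started with `python -u` or with `PYTHONUNBUFFERED=1` in its environment.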