File size: 7,968 Bytes
e6fad38
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
 # Multi-Evaluation Orchestrator

A Python script for running multiple evaluation scripts in parallel with organized logging and configurable parameters.

## Overview

This orchestrator runs multiple evaluation scripts (AIMO, AIME, COPA, ART, GoEmotion, GSM8K) against both raw and fine-tuned models, with support for parallel execution, real-time log streaming, and comprehensive result tracking.

## Features

- **Parallel Execution**: Run multiple evaluations simultaneously on different GPUs
- **Real-time Logging**: Stream output to console and log files simultaneously
- **Organized Results**: Timestamped directories with individual and master logs
- **Flexible Configuration**: Override settings via command line or configuration section
- **Path Injection**: Automatically inject model and output paths into evaluation scripts
- **Comprehensive Reporting**: Master log with execution summary and individual results

## Installation

No additional dependencies beyond standard Python libraries and your evaluation scripts.

```bash
chmod +x run_evaluations.py
```

## Configuration

# Get the directory where this script is located
SCRIPT_DIR = Path(__file__).resolve().parent


### Setting Up Paths

Edit the configuration section at the top of the script:

```python
# Model and training paths
RAW_MODEL_PATH = "/path/to/your/raw/model"
TRAINING_DIR = "/path/to/your/training/results"
BASE_OUTPUT_DIR = "/path/to/output/directory"
```

### Configuring CUDA Devices

Specify which GPUs to use for parallel execution:

```python
# GPUs will be used in order (cycling through if needed)
CUDA_DEVICES = ['2', '3']  # Use GPUs 2 and 3
```

### Adding/Removing Evaluation Scripts

Modify the `EVALUATION_SCRIPTS` list (you can comment others if you want to evaluate just a specific set):

```python
EVALUATION_SCRIPTS = [
    {
        'script': str(SCRIPT_DIR / 'evaluate_aimo_raw_vs_finetuned.py'),
        'name': 'AIMO Dataset Evaluation',
        'output_subdir': 'aimo_evaluation_results',
        'params': {
            'split': 'test',
        },
        'override_terminal': False  # If True, script params override CLI args
    },
    # Add more scripts here...
]
```

## Usage

### Basic Usage

Run all evaluations sequentially with default settings:

```bash
python run_evaluations.py
```

### Parallel Execution

Run 2 evaluations in parallel (will cycle through `CUDA_DEVICES`):

```bash
python run_evaluations.py --parallel 2
```

### Using Checkpoints

#### Option 1: Specific Checkpoint Path

```bash
python run_evaluations.py --checkpoint_path /path/to/checkpoint-640
```

This will use the exact checkpoint you specify for all evaluations.

#### Option 2: Checkpoint Directory

```bash
python run_evaluations.py --checkpoint_dir /path/to/checkpoints
```

This passes the directory to evaluation scripts, which may select checkpoints based on their own logic.

### Setting Training Direction

The training direction is determined by the `TRAINING_DIR` variable:

```bash
# Override via command line
python run_evaluations.py --training_dir /path/to/specific/training/run
```

Or edit in the configuration section:

```python
TRAINING_DIR = "/home/user/training/results/abductive_dt10.25.17:43_e20_..."
```

### Common Parameters

```bash
# Limit samples for faster testing
python run_evaluations.py --max_samples 100

# Change batch size
python run_evaluations.py --batch_size 4

# Use specific dataset split
python run_evaluations.py --split test

# Skip raw model evaluation
python run_evaluations.py --skip_raw

# Skip fine-tuned model evaluation
python run_evaluations.py --skip_finetuned
```

### Advanced Examples

```bash
# Run 3 evaluations in parallel with custom settings
python run_evaluations.py --parallel 3 --batch_size 4 --max_samples 200

# Use custom checkpoint and disable real-time console output
python run_evaluations.py --checkpoint_path /path/to/checkpoint-1280 --no_realtime

# Override all paths
python run_evaluations.py \
  --raw_model_path /custom/raw/model \
  --training_dir /custom/training/results \
  --base_output_dir /custom/output
```

## Command Line Arguments

### Orchestrator Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--parallel` | int | 2 | Number of scripts to run in parallel |
| `--output_dir` | str | `./multi_evaluation_results` | Output directory for results |
| `--no_realtime` | flag | False | Disable real-time log streaming to console |

### Path Override Arguments

| Argument | Type | Description |
|----------|------|-------------|
| `--raw_model_path` | str | Override raw model path |
| `--training_dir` | str | Override training directory |
| `--base_output_dir` | str | Override base output directory |

### Evaluation Arguments

| Argument | Type | Description |
|----------|------|-------------|
| `--max_samples` | int | Maximum samples to evaluate |
| `--cuda_device` | str | CUDA device (only for sequential) |
| `--batch_size` | int | Batch size for evaluation |
| `--split` | str | Dataset split (`train`, `test`, `validation`) |
| `--skip_raw` | flag | Skip raw model evaluation |
| `--skip_finetuned` | flag | Skip fine-tuned model evaluation |
| `--checkpoint_path` | str | Path to specific checkpoint |
| `--checkpoint_dir` | str | Directory containing checkpoints |

## Output Structure

multi_evaluation_results/
└── run_20251105_143022/
    β”œβ”€β”€ master_log.txt                          # Consolidated summary
    β”œβ”€β”€ 01_evaluate_aimo_raw_vs_finetuned.txt  # Individual logs
    β”œβ”€β”€ 02_evaluate_aime_raw_vs_finetuned.txt
    β”œβ”€β”€ 03_evaluate_copa_raw_vs_finetuned_guess_cause.txt
    └── ...


### Master Log Contents

- Execution summary (success/failure counts)
- Path configuration used
- Individual script results with:
  - Duration
  - Status
  - Error messages (if any)
  - Output directories
  - Full command executed

## How It Works

### Path Injection

The orchestrator injects paths into evaluation scripts via environment variables:

- `EVAL_RAW_MODEL_PATH`: Raw model path
- `EVAL_TRAINING_DIR`: Training directory
- `EVAL_OUTPUT_DIR`: Output directory for each evaluation

Your evaluation scripts should read these variables:

```python
raw_model_path = os.getenv('EVAL_RAW_MODEL_PATH', 'default/path')
training_dir = os.getenv('EVAL_TRAINING_DIR', 'default/path')
output_dir = os.getenv('EVAL_OUTPUT_DIR', 'default/path')
```

### Parameter Priority

For each script, parameters are merged in this order (later overrides earlier):

1. `DEFAULT_PARAMS` (in configuration)
2. Command line arguments (terminal args)
3. Script-specific `params` (in `EVALUATION_SCRIPTS`)

**Exception**: If `override_terminal: True`, script params take highest priority.

### Parallel Execution

- Scripts are distributed across `CUDA_DEVICES` in round-robin fashion
- Each script gets exclusive access to one GPU
- Real-time logs are thread-safe and properly attributed

## Troubleshooting

### Script Fails Immediately

Check that:
- Evaluation scripts exist and are executable
- Paths in configuration section are correct
- Required models and datasets are accessible

### GPU Out of Memory

- Reduce `--batch_size`
- Reduce `--parallel` count
- Use fewer/different GPUs in `CUDA_DEVICES`

### Missing Checkpoint

Ensure either:
- `--checkpoint_path` points to valid checkpoint file, or
- `--checkpoint_dir` contains checkpoint files, or
- Evaluation scripts can find checkpoints from `TRAINING_DIR`

### Logs Not Appearing in Console

- Remove `--no_realtime` flag
- Check that scripts are actually producing output

## Exit Codes

- `0`: All evaluations succeeded
- `1`: One or more evaluations failed

## Notes

- Each evaluation's output directory is created under `BASE_OUTPUT_DIR/<output_subdir>`
- Logs are buffered line-by-line for real-time streaming
- Master log provides complete audit trail of all executions
- Failed evaluations don't stop other evaluations from running