# Large Language Model Analysis Framework for High Energy Physics

A framework for testing and evaluating Large Language Models (LLMs) on ATLAS H→γγ analysis tasks using a supervisor-coder architecture.

## Table of Contents
- [Setup](#setup)
- [Data and Solution](#data-and-solution)
- [Running Tests](#running-tests)
- [Analysis and Visualization](#analysis-and-visualization)
- [Project Structure](#project-structure)
- [Advanced Usage](#advanced-usage)

---

## Setup

### Prerequisites

**CBORG API Access Required**

This framework uses Lawrence Berkeley National Laboratory's CBORG API to access a range of LLMs. To use this code, you will need:

1. Access to the CBORG API (contact LBL for access)
2. A CBORG API key
3. Network access to the CBORG API endpoint

**Note for External Users:** CBORG is an internal LBL system. External users may need to:
- Request guest access through LBL collaborations
- Adapt the code to use OpenAI API directly (requires code modifications)
- Contact the repository maintainers for alternative deployment options

### Environment Setup
Create and activate the Conda environment:
```bash
mamba env create -f environment.yml
conda activate llm_env
```

### API Configuration
Create a script `~/.apikeys.sh` that exports your CBORG API key:
```bash
export CBORG_API_KEY="INSERT_API_KEY"
```

Then source it before running tests:
```bash
source ~/.apikeys.sh
```
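
Python scripts in the repo can then pick the key up from the environment. A minimal sketch of that lookup (the helper name is hypothetical; the actual code in `supervisor_coder.py` may differ):

```python
import os

def get_cborg_key(env=os.environ):
    """Hypothetical helper: fetch the key exported by ~/.apikeys.sh."""
    key = env.get("CBORG_API_KEY")
    if not key:
        raise RuntimeError("CBORG_API_KEY not set; run `source ~/.apikeys.sh` first")
    return key
```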

### Initial Configuration

Before running tests, set up your configuration files:

```bash
# Copy example configuration files
cp config.example.yml config.yml
cp models.example.txt models.txt

# Edit config.yml to set your preferred models and parameters
# Edit models.txt to list models you want to test
```

**Important:** The `models.txt` file must end with a blank line.

---

## Data and Solution

### ATLAS Open Data Samples
All four data samples and Monte Carlo Higgs→γγ samples (including ttH) from the 2020 ATLAS Open Data diphoton campaign are available at:
```
/global/cfs/projectdirs/atlas/eligd/llm_for_analysis_copy/data/
```

**Important:** If copying data elsewhere, make the directory read-only to prevent LLM-generated code from modifying files:
```bash
chmod -R a-w /path/to/data/directory
```

### Reference Solution
- Navigate to `solution/` directory and run `python soln.py`
- Use flags: `--step1`, `--step2`, `--step3`, `--plot` to control execution

### Reference Arrays for Validation
Large `.npy` reference arrays are not committed to Git (see `.gitignore`).

**Quick fetch from repo root:**
```bash
bash scripts/fetch_solution_arrays.sh
```

**Or copy from NERSC shared path:**
```
/global/cfs/projectdirs/atlas/dwkim/llm_test_dev_cxyang/llm_for_analysis/solution/arrays
```

---

## Running Tests

### Model Configuration

Three model list files control testing:
- **`models.txt`**: Models for sequential testing
- **`models_supervisor.txt`**: Supervisor models for paired testing
- **`models_coder.txt`**: Coder models for paired testing

**Important formatting rules:**
- One model per line
- File must end with a blank line
- Repeat model names for multiple trials
- Use CBORG aliases (e.g., `anthropic/claude-sonnet:latest`)

See `CBORG_MODEL_MAPPINGS.md` for available models and their actual versions.
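
For example, a `models.txt` that runs two trials of one model and a single trial of another might look like this (aliases are illustrative; check `CBORG_MODEL_MAPPINGS.md` for valid names). Note the trailing blank line:

```
anthropic/claude-sonnet:latest
anthropic/claude-sonnet:latest
lbl/cborg-deepthought

```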

### Testing Workflows

#### 1. Sequential Testing (Single Model at a Time)
```bash
bash test_models.sh output_dir_name
```
Tests all models in `models.txt` sequentially.

#### 2. Parallel Testing (Multiple Models)
```bash
# Basic parallel execution
bash test_models_parallel.sh output_dir_name

# GNU Parallel (recommended for large-scale testing)
bash test_models_parallel_gnu.sh output_dir_name [max_models] [tasks_per_model]

# Examples:
bash test_models_parallel_gnu.sh experiment1        # Default: 5 models, 5 tasks each
bash test_models_parallel_gnu.sh test 3 5           # 3 models, 5 tasks per model
bash test_models_parallel_gnu.sh large_test 10 5    # 10 models, 5 tasks each
```

**GNU Parallel features:**
- Scales to 20-30 models with 200-300 total parallel jobs
- Automatic resource management
- Fast I/O using `/dev/shm` temporary workspace
- Comprehensive error handling and logging

#### 3. Step-by-Step Testing with Validation
```bash
# Run all 5 steps with validation
./run_smk_sequential.sh --validate

# Run specific steps
./run_smk_sequential.sh --step2 --step3 --validate --job-id 002

# Run individual steps
./run_smk_sequential.sh --step1 --validate  # Step 1: Summarize ROOT
./run_smk_sequential.sh --step2 --validate  # Step 2: Create NumPy arrays
./run_smk_sequential.sh --step3 --validate  # Step 3: Preprocess
./run_smk_sequential.sh --step4 --validate  # Step 4: Compute scores
./run_smk_sequential.sh --step5 --validate  # Step 5: Categorization

# Timestamped output directory
./run_smk_sequential.sh --step1 --validate --auto-dir  # Creates results_YYYYMMDD_HHMMSS/
```

**Directory naming options:**
- `--job-id ID`: Creates `results_job_ID/`
- `--auto-dir`: Creates `results_YYYYMMDD_HHMMSS/`
- `--out-dir DIR`: Custom directory name

### Validation

**Automatic validation (during execution):**
```bash
./run_smk_sequential.sh --step1 --step2 --validate
```
Validation logs saved to `{output_dir}/logs/*_validation.log`

**Manual validation (after execution):**
```bash
# Validate all steps
python check_soln.py --out_dir results_job_002

# Validate specific step
python check_soln.py --out_dir results_job_002 --step 2
```

**Validation features:**
- βœ… Adaptive tolerance with 4 significant digit precision
- πŸ“Š Column-by-column difference analysis
- πŸ“‹ Side-by-side value comparison
- 🎯 Clear, actionable error messages
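
The 4-significant-digit check amounts to a relative tolerance of about 5e-4 (half a unit in the fourth significant digit). A sketch of the idea, not the actual `check_soln.py` implementation:

```python
import math

def agrees_to_sig_figs(generated, reference, sig=4):
    """Sketch: values match when they agree to `sig` significant
    digits, i.e. within half a unit in the last kept digit."""
    rtol = 0.5 * 10 ** (1 - sig)  # sig=4 -> 5e-4
    return all(math.isclose(g, r, rel_tol=rtol, abs_tol=0.0)
               for g, r in zip(generated, reference))
```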

### Speed Optimization

Reduce iteration counts in `config.yml`:
```yaml
# Limit LLM coder attempts (default 10)
max_iterations: 3
```

---

## Analysis and Visualization

### Results Summary
All test results are aggregated in:
```
results_summary.csv
```

**Columns include:** supervisor, coder, step, success, iterations, duration, API_calls, tokens, errors, error_descriptions
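
For quick checks outside the notebooks, the summary can be aggregated in a few lines of Python. A sketch (the inline CSV rows are synthetic; point `csv.DictReader` at the real `results_summary.csv` instead):

```python
import csv
import io
from collections import defaultdict

# Synthetic rows mirroring a few results_summary.csv columns
sample = """supervisor,coder,step,success
modelA,coderX,1,1
modelA,coderX,2,0
modelB,coderY,1,1
modelB,coderY,2,1
"""

def success_rates(rows):
    """Fraction of successful runs per (supervisor, coder) pair."""
    totals = defaultdict(lambda: [0, 0])  # pair -> [successes, runs]
    for row in rows:
        key = (row["supervisor"], row["coder"])
        totals[key][0] += int(row["success"])
        totals[key][1] += 1
    return {pair: wins / runs for pair, (wins, runs) in totals.items()}

rates = success_rates(csv.DictReader(io.StringIO(sample)))
```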

### Error Analysis and Categorization

**Automated error analysis:**
```bash
python error_analysis.py --results_dirs <dir1> <dir2> ... --output results_summary.csv --model <model_name>
```

Uses LLM to analyze comprehensive logs and categorize errors into:
- Semantic errors
- Function-calling errors  
- Intermediate file not found
- Incorrect branch name
- OpenAI API errors
- Data quality issues (all weights = 0)
- Other/uncategorized

### Interactive Analysis Notebooks

#### 1. Five-Step Performance Analysis (`five_step_analysis.ipynb`)
Comprehensive analysis of model performance across all 5 workflow steps:
- **Success rate heatmap** (models Γ— steps)
- **Agent work progression** (iterations over steps)
- **API call statistics** (by step and model)
- **Cost analysis** (input/output tokens, estimated pricing)

**Output plots:**
- `plots/1_success_rate_heatmap.pdf`
- `plots/2_agent_work_line_plot.pdf`
- `plots/3_api_calls_line_plot.pdf`
- `plots/4_cost_per_step.pdf`
- `plots/five_step_summary_stats.csv`

#### 2. Error Category Analysis (`error_analysis.ipynb`)
Deep dive into error patterns and failure modes:
- **Normalized error distribution** (stacked bar chart with percentages)
- **Error type heatmap** (models Γ— error categories)
- **Top model breakdowns** (faceted plots for top 9 models)
- **Error trends across steps** (stacked area chart)

**Output plots:**
- `plots/error_distribution_by_model.pdf`
- `plots/error_heatmap_by_model.pdf`
- `plots/error_categories_top_models.pdf`
- `plots/errors_by_step.pdf`

#### 3. Quick Statistics (`plot_stats.ipynb`)
Legacy notebook for basic statistics visualization.

### Log Interpretation

**Automated log analysis:**
```bash
python logs_interpreter.py --log_dir <output_dir> --model lbl/cborg-deepthought --output analysis.txt
```

Analyzes comprehensive supervisor-coder logs to identify:
- Root causes of failures
- Responsible parties (user, supervisor, coder, external)
- Error patterns across iterations

---

## Project Structure

### Core Scripts
- **`supervisor_coder.py`**: Supervisor-coder framework implementation
- **`check_soln.py`**: Solution validation with enhanced comparison
- **`write_prompt.py`**: Prompt management and context chaining
- **`update_stats.py`**: Statistics tracking and CSV updates
- **`error_analysis.py`**: LLM-powered error categorization

### Test Runners
- **`test_models.sh`**: Sequential model testing
- **`test_models_parallel.sh`**: Parallel testing (basic)
- **`test_models_parallel_gnu.sh`**: GNU Parallel testing (recommended)
- **`test_stats.sh`**: Individual model statistics
- **`test_stats_parallel.sh`**: Parallel step execution
- **`run_smk_sequential.sh`**: Step-by-step workflow runner

### Snakemake Workflows (`workflow/`)
The analysis workflow is divided into 5 sequential steps:

1. **`summarize_root.smk`**: Extract ROOT file structure and branch information
2. **`create_numpy.smk`**: Convert ROOT β†’ NumPy arrays
3. **`preprocess.smk`**: Apply preprocessing transformations
4. **`scores.smk`**: Compute signal/background classification scores
5. **`categorization.smk`**: Final categorization and statistical analysis

**Note:** Later steps use solution outputs to enable testing even when earlier steps fail.
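
As an illustration of that fallback, a later-step rule can take the reference solution's arrays as input instead of the previous step's generated output. The rule and path names below are hypothetical, not the actual workflow files:

```
rule preprocess:
    input:
        # Read the reference solution's step-2 arrays so step 3
        # can be exercised even if the LLM failed step 2.
        "solution/arrays/step2.npy"
    output:
        "arrays/step3.npy"
    shell:
        "python generated_code/preprocess.py {input} {output}"
```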

### Prompts (`prompts/`)
- `summarize_root.txt`: Step 1 task description
- `create_numpy.txt`: Step 2 task description
- `preprocess.txt`: Step 3 task description
- `scores.txt`: Step 4 task description
- `categorization.txt`: Step 5 task description
- `supervisor_first_call.txt`: Initial supervisor instructions
- `supervisor_call.txt`: Subsequent supervisor instructions

### Utility Scripts (`util/`)
- **`inspect_root.py`**: ROOT file inspection tools
- **`analyze_particles.py`**: Particle-level analysis
- **`compare_arrays.py`**: NumPy array comparison utilities

### Model Documentation
- **`CBORG_MODEL_MAPPINGS.md`**: CBORG alias β†’ actual model mappings
- **`COMPLETE_MODEL_VERSIONS.md`**: Full version information for all tested models
- **`MODEL_NAME_UPDATES.md`**: Model name standardization notes
- **`O3_MODEL_COMPARISON.md`**: OpenAI O3 model variant comparison

### Analysis Notebooks
- **`five_step_analysis.ipynb`**: Comprehensive 5-step performance analysis
- **`error_analysis.ipynb`**: Error categorization and pattern analysis
- **`error_analysis_plotting.ipynb`**: Additional error visualizations
- **`plot_stats.ipynb`**: Legacy statistics plots

### Output Structure
Each test run creates:
```
output_name/
β”œβ”€β”€ model_timestamp/
β”‚   β”œβ”€β”€ generated_code/     # LLM-generated Python scripts
β”‚   β”œβ”€β”€ logs/               # Execution logs and supervisor records
β”‚   β”œβ”€β”€ arrays/             # NumPy arrays produced by generated code
β”‚   β”œβ”€β”€ plots/              # Comparison plots (generated vs. solution)
β”‚   β”œβ”€β”€ prompt_pairs/       # User + supervisor prompts
β”‚   β”œβ”€β”€ results/            # Temporary ROOT files (job-scoped)
β”‚   └── snakemake_log/      # Snakemake execution logs
```

**Job-scoped ROOT outputs:**
- Step 5 uses temporary ROOT files (`signal.root`, `bkgd.root`)
- Written to `${OUTPUT_DIR}/results/` to prevent cross-run interference
- Automatically cleaned after significance calculation

---

## Advanced Usage

### Supervisor-Coder Configuration

Control iteration limits in `config.yml`:
```yaml
model: 'anthropic/claude-sonnet:latest'
name: 'experiment_name'
out_dir: 'results/experiment_name'
max_iterations: 10  # Maximum supervisor-coder iterations per step
```
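
Conceptually, `max_iterations` bounds the supervisor-coder retry loop for each step. A minimal sketch of that control flow (the function names are hypothetical, not the `supervisor_coder.py` API):

```python
def run_step(coder_write, supervisor_review, max_iterations=10):
    """Sketch: the coder drafts code, the supervisor critiques it,
    and the loop stops at success or after max_iterations attempts."""
    feedback = None
    for attempt in range(1, max_iterations + 1):
        code = coder_write(feedback)            # coder LLM drafts or revises
        ok, feedback = supervisor_review(code)  # supervisor runs and reviews
        if ok:
            return attempt  # iterations used
    return None  # step failed within the iteration budget
```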

### Parallel Execution Tuning

For `test_models_parallel_gnu.sh`:
```bash
# Syntax:
bash test_models_parallel_gnu.sh <output> <max_models> <tasks_per_model>

# Conservative (safe for shared systems):
bash test_models_parallel_gnu.sh test 3 5    # 15 total jobs

# Aggressive (dedicated nodes):
bash test_models_parallel_gnu.sh test 10 10  # 100 total jobs
```

### Custom Validation

Run validation on specific steps or with custom tolerances:
```bash
# Validate only data conversion step
python check_soln.py --out_dir results/ --step 2

# Check multiple specific steps
python check_soln.py --out_dir results/ --step 2 --step 3 --step 4
```

### Log Analysis Pipeline

```bash
# 1. Run tests
bash test_models_parallel_gnu.sh experiment1 5 5

# 2. Analyze logs with LLM
python logs_interpreter.py --log_dir experiment1/model_timestamp/ --output analysis.txt

# 3. Categorize errors
python error_analysis.py --results_dirs experiment1/*/ --output summary.csv

# 4. Generate visualizations
jupyter notebook error_analysis.ipynb
```

---

## Roadmap and Future Directions

### Planned Improvements

**Prompt Engineering:**
- Auto-load context (file lists, logs) at step start
- Provide comprehensive inputs/outputs/summaries upfront
- Develop prompt-management layer for cross-analysis reuse

**Validation & Monitoring:**
- Embed validation in workflows for immediate error detection
- Record input/output and state transitions for reproducibility
- Enhanced situation awareness through comprehensive logging

**Multi-Analysis Extension:**
- Rerun H→γγ with improved system prompts
- Extend to H→4ℓ and other Higgs+X channels
- Provide lessons learned from previous analyses as reference material

**Self-Improvement:**
- Reinforcement learning–style feedback loops
- Agent-driven prompt refinement
- Automatic generalization across HEP analyses

---

## Citation and Acknowledgments

This framework tests LLM agents on ATLAS Open Data from:
- 2020 ATLAS Open Data diphoton samples: https://opendata.cern.ch/record/15006

Models tested via CBORG API (Lawrence Berkeley National Laboratory).

---

## Support and Contributing

For questions or issues:
1. Check existing documentation in `*.md` files
2. Review example configurations in `config.yml`
3. Examine validation logs in output directories

For contributions, please ensure:
- Model list files end with a blank line
- Prompts follow established format
- Validation passes for all test cases