# Large Language Model Analysis Framework for High Energy Physics
A framework for testing and evaluating Large Language Models (LLMs) on ATLAS H→γγ analysis tasks using a supervisor-coder architecture.
## Table of Contents
- [Setup](#setup)
- [Data and Solution](#data-and-solution)
- [Running Tests](#running-tests)
- [Analysis and Visualization](#analysis-and-visualization)
- [Project Structure](#project-structure)
- [Advanced Usage](#advanced-usage)
---
## Setup
### Prerequisites
**CBORG API Access Required**
This framework uses Lawrence Berkeley National Laboratory's CBORG API to access a range of LLMs. To use this code, you will need:
1. Access to the CBORG API (contact LBL for access)
2. A CBORG API key
3. Network access to the CBORG API endpoint
**Note for External Users:** CBORG is an internal LBL system. External users may need to:
- Request guest access through LBL collaborations
- Adapt the code to use OpenAI API directly (requires code modifications)
- Contact the repository maintainers for alternative deployment options
### Environment Setup
Create and activate the Conda environment:
```bash
mamba env create -f environment.yml
conda activate llm_env
```
### API Configuration
Create a script `~/.apikeys.sh` that exports your CBORG API key:
```bash
export CBORG_API_KEY="INSERT_API_KEY"
```
Then source it before running tests:
```bash
source ~/.apikeys.sh
```
### Initial Configuration
Before running tests, set up your configuration files:
```bash
# Copy example configuration files
cp config.example.yml config.yml
cp models.example.txt models.txt
# Edit config.yml to set your preferred models and parameters
# Edit models.txt to list models you want to test
```
**Important:** The `models.txt` file must end with a blank line.
---
## Data and Solution
### ATLAS Open Data Samples
All four data samples and Monte Carlo Higgs→γγ samples (including ttH) from the 2020 ATLAS Open Data diphoton campaign are available at:
```
/global/cfs/projectdirs/atlas/eligd/llm_for_analysis_copy/data/
```
**Important:** If copying data elsewhere, make the directory read-only to prevent LLM-generated code from modifying files:
```bash
chmod -R a-w /path/to/data/directory
```
### Reference Solution
- Navigate to the `solution/` directory and run `python soln.py`
- Use the flags `--step1`, `--step2`, `--step3`, and `--plot` to control which parts of the solution run
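For example, a full regeneration pass might look like the following sketch (assuming the flags can be combined in a single invocation; they can also be run one at a time):
```bash
cd solution
# Run all three solution steps, then produce the reference plots
python soln.py --step1 --step2 --step3 --plot
```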
### Reference Arrays for Validation
Large `.npy` reference arrays are not committed to Git (see `.gitignore`).
**Quick fetch from repo root:**
```bash
bash scripts/fetch_solution_arrays.sh
```
**Or copy from NERSC shared path:**
```
/global/cfs/projectdirs/atlas/dwkim/llm_test_dev_cxyang/llm_for_analysis/solution/arrays
```
---
## Running Tests
### Model Configuration
Three model list files control testing:
- **`models.txt`**: Models for sequential testing
- **`models_supervisor.txt`**: Supervisor models for paired testing
- **`models_coder.txt`**: Coder models for paired testing
**Important formatting rules:**
- One model per line
- File must end with a blank line
- Repeat model names for multiple trials
- Use CBORG aliases (e.g., `anthropic/claude-sonnet:latest`)
See `CBORG_MODEL_MAPPINGS.md` for available models and their actual versions.
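As a minimal sketch, a `models.txt` that runs two trials of Claude Sonnet (the alias quoted above) would contain the repeated alias followed by the required trailing blank line:
```
anthropic/claude-sonnet:latest
anthropic/claude-sonnet:latest

```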
### Testing Workflows
#### 1. Sequential Testing (Single Model at a Time)
```bash
bash test_models.sh output_dir_name
```
Tests all models in `models.txt` sequentially.
#### 2. Parallel Testing (Multiple Models)
```bash
# Basic parallel execution
bash test_models_parallel.sh output_dir_name
# GNU Parallel (recommended for large-scale testing)
bash test_models_parallel_gnu.sh output_dir_name [max_models] [tasks_per_model]
# Examples:
bash test_models_parallel_gnu.sh experiment1 # Default: 5 models, 5 tasks each
bash test_models_parallel_gnu.sh test 3 5 # 3 models, 5 tasks per model
bash test_models_parallel_gnu.sh large_test 10 5 # 10 models, 5 tasks each
```
**GNU Parallel features:**
- Scales to 20-30 models with 200-300 total parallel jobs
- Automatic resource management
- Fast I/O using `/dev/shm` temporary workspace
- Comprehensive error handling and logging
#### 3. Step-by-Step Testing with Validation
```bash
# Run all 5 steps with validation
./run_smk_sequential.sh --validate
# Run specific steps
./run_smk_sequential.sh --step2 --step3 --validate --job-id 002
# Run individual steps
./run_smk_sequential.sh --step1 --validate # Step 1: Summarize ROOT
./run_smk_sequential.sh --step2 --validate # Step 2: Create NumPy arrays
./run_smk_sequential.sh --step3 --validate # Step 3: Preprocess
./run_smk_sequential.sh --step4 --validate # Step 4: Compute scores
./run_smk_sequential.sh --step5 --validate # Step 5: Categorization
# Auto-generated (timestamped) output directory
./run_smk_sequential.sh --step1 --validate --auto-dir # Creates results_YYYYMMDD_HHMMSS/
```
**Directory naming options:**
- `--job-id ID`: Creates `results_job_ID/`
- `--auto-dir`: Creates `results_YYYYMMDD_HHMMSS/`
- `--out-dir DIR`: Custom directory name
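For instance, the same step can be written to differently named directories using the flags above (the job ID and timestamp shown are illustrative):
```bash
./run_smk_sequential.sh --step1 --validate --job-id 007      # -> results_job_007/
./run_smk_sequential.sh --step1 --validate --auto-dir        # -> results_20250101_120000/
./run_smk_sequential.sh --step1 --validate --out-dir my_run  # -> my_run/
```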
### Validation
**Automatic validation (during execution):**
```bash
./run_smk_sequential.sh --step1 --step2 --validate
```
Validation logs saved to `{output_dir}/logs/*_validation.log`
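To inspect them after a run (here using a hypothetical job directory created with `--job-id 002`):
```bash
# Concatenate the per-step validation logs from one run
cat results_job_002/logs/*_validation.log
```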
**Manual validation (after execution):**
```bash
# Validate all steps
python check_soln.py --out_dir results_job_002
# Validate specific step
python check_soln.py --out_dir results_job_002 --step 2
```
**Validation features:**
- Adaptive tolerance with 4-significant-digit precision
- Column-by-column difference analysis
- Side-by-side value comparison
- Clear, actionable error messages
### Speed Optimization
Reduce iteration counts in `config.yml`:
```yaml
# Limit LLM coder attempts (default 10)
max_iterations: 3
```
---
## Analysis and Visualization
### Results Summary
All test results are aggregated in:
```
results_summary.csv
```
**Columns include:** supervisor, coder, step, success, iterations, duration, API_calls, tokens, errors, error_descriptions
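For a quick look at the table from the command line (a simple sketch; any CSV viewer works, and fields containing embedded commas may not align perfectly):
```bash
# Print the header plus the first few rows, aligned into columns
head -n 5 results_summary.csv | column -s, -t
```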
### Error Analysis and Categorization
**Automated error analysis:**
```bash
python error_analysis.py --results_dirs <dir1> <dir2> ... --output results_summary.csv --model <model_name>
```
Uses an LLM to analyze the comprehensive logs and categorize errors into:
- Semantic errors
- Function-calling errors
- Intermediate file not found
- Incorrect branch name
- OpenAI API errors
- Data quality issues (all weights = 0)
- Other/uncategorized
### Interactive Analysis Notebooks
#### 1. Five-Step Performance Analysis (`five_step_analysis.ipynb`)
Comprehensive analysis of model performance across all 5 workflow steps:
- **Success rate heatmap** (models Γ steps)
- **Agent work progression** (iterations over steps)
- **API call statistics** (by step and model)
- **Cost analysis** (input/output tokens, estimated pricing)
**Output plots:**
- `plots/1_success_rate_heatmap.pdf`
- `plots/2_agent_work_line_plot.pdf`
- `plots/3_api_calls_line_plot.pdf`
- `plots/4_cost_per_step.pdf`
- `plots/five_step_summary_stats.csv`
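The notebook is launched in the usual way (same pattern as the error-analysis notebook below), assuming the aggregated `results_summary.csv` described above already exists:
```bash
jupyter notebook five_step_analysis.ipynb
```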
#### 2. Error Category Analysis (`error_analysis.ipynb`)
Deep dive into error patterns and failure modes:
- **Normalized error distribution** (stacked bar chart with percentages)
- **Error type heatmap** (models Γ error categories)
- **Top model breakdowns** (faceted plots for top 9 models)
- **Error trends across steps** (stacked area chart)
**Output plots:**
- `plots/error_distribution_by_model.pdf`
- `plots/error_heatmap_by_model.pdf`
- `plots/error_categories_top_models.pdf`
- `plots/errors_by_step.pdf`
#### 3. Quick Statistics (`plot_stats.ipynb`)
Legacy notebook for basic statistics visualization.
### Log Interpretation
**Automated log analysis:**
```bash
python logs_interpreter.py --log_dir <output_dir> --model lbl/cborg-deepthought --output analysis.txt
```
Analyzes comprehensive supervisor-coder logs to identify:
- Root causes of failures
- Responsible parties (user, supervisor, coder, external)
- Error patterns across iterations
---
## Project Structure
### Core Scripts
- **`supervisor_coder.py`**: Supervisor-coder framework implementation
- **`check_soln.py`**: Solution validation with enhanced comparison
- **`write_prompt.py`**: Prompt management and context chaining
- **`update_stats.py`**: Statistics tracking and CSV updates
- **`error_analysis.py`**: LLM-powered error categorization
### Test Runners
- **`test_models.sh`**: Sequential model testing
- **`test_models_parallel.sh`**: Parallel testing (basic)
- **`test_models_parallel_gnu.sh`**: GNU Parallel testing (recommended)
- **`test_stats.sh`**: Individual model statistics
- **`test_stats_parallel.sh`**: Parallel step execution
- **`run_smk_sequential.sh`**: Step-by-step workflow runner
### Snakemake Workflows (`workflow/`)
The analysis workflow is divided into 5 sequential steps:
1. **`summarize_root.smk`**: Extract ROOT file structure and branch information
2. **`create_numpy.smk`**: Convert ROOT → NumPy arrays
3. **`preprocess.smk`**: Apply preprocessing transformations
4. **`scores.smk`**: Compute signal/background classification scores
5. **`categorization.smk`**: Final categorization and statistical analysis
**Note:** Later steps use solution outputs to enable testing even when earlier steps fail.
### Prompts (`prompts/`)
- `summarize_root.txt`: Step 1 task description
- `create_numpy.txt`: Step 2 task description
- `preprocess.txt`: Step 3 task description
- `scores.txt`: Step 4 task description
- `categorization.txt`: Step 5 task description
- `supervisor_first_call.txt`: Initial supervisor instructions
- `supervisor_call.txt`: Subsequent supervisor instructions
### Utility Scripts (`util/`)
- **`inspect_root.py`**: ROOT file inspection tools
- **`analyze_particles.py`**: Particle-level analysis
- **`compare_arrays.py`**: NumPy array comparison utilities
### Model Documentation
- **`CBORG_MODEL_MAPPINGS.md`**: CBORG alias → actual model mappings
- **`COMPLETE_MODEL_VERSIONS.md`**: Full version information for all tested models
- **`MODEL_NAME_UPDATES.md`**: Model name standardization notes
- **`O3_MODEL_COMPARISON.md`**: OpenAI O3 model variant comparison
### Analysis Notebooks
- **`five_step_analysis.ipynb`**: Comprehensive 5-step performance analysis
- **`error_analysis.ipynb`**: Error categorization and pattern analysis
- **`error_analysis_plotting.ipynb`**: Additional error visualizations
- **`plot_stats.ipynb`**: Legacy statistics plots
### Output Structure
Each test run creates:
```
output_name/
├── model_timestamp/
│   ├── generated_code/   # LLM-generated Python scripts
│   ├── logs/             # Execution logs and supervisor records
│   ├── arrays/           # NumPy arrays produced by generated code
│   ├── plots/            # Comparison plots (generated vs. solution)
│   ├── prompt_pairs/     # User + supervisor prompts
│   ├── results/          # Temporary ROOT files (job-scoped)
│   └── snakemake_log/    # Snakemake execution logs
```
**Job-scoped ROOT outputs:**
- Step 5 uses temporary ROOT files (`signal.root`, `bkgd.root`)
- Written to `${OUTPUT_DIR}/results/` to prevent cross-run interference
- Automatically cleaned after significance calculation
---
## Advanced Usage
### Supervisor-Coder Configuration
Control iteration limits in `config.yml`:
```yaml
model: 'anthropic/claude-sonnet:latest'
name: 'experiment_name'
out_dir: 'results/experiment_name'
max_iterations: 10 # Maximum supervisor-coder iterations per step
```
### Parallel Execution Tuning
For `test_models_parallel_gnu.sh`:
```bash
# Syntax:
bash test_models_parallel_gnu.sh <output> <max_models> <tasks_per_model>
# Conservative (safe for shared systems):
bash test_models_parallel_gnu.sh test 3 5 # 15 total jobs
# Aggressive (dedicated nodes):
bash test_models_parallel_gnu.sh test 10 10 # 100 total jobs
```
### Custom Validation
Run validation on specific steps or with custom tolerances:
```bash
# Validate only data conversion step
python check_soln.py --out_dir results/ --step 2
# Check multiple specific steps
python check_soln.py --out_dir results/ --step 2 --step 3 --step 4
```
### Log Analysis Pipeline
```bash
# 1. Run tests
bash test_models_parallel_gnu.sh experiment1 5 5
# 2. Analyze logs with LLM
python logs_interpreter.py --log_dir experiment1/model_timestamp/ --output analysis.txt
# 3. Categorize errors
python error_analysis.py --results_dirs experiment1/*/ --output summary.csv
# 4. Generate visualizations
jupyter notebook error_analysis.ipynb
```
---
## Roadmap and Future Directions
### Planned Improvements
**Prompt Engineering:**
- Auto-load context (file lists, logs) at step start
- Provide comprehensive inputs/outputs/summaries upfront
- Develop prompt-management layer for cross-analysis reuse
**Validation & Monitoring:**
- Embed validation in workflows for immediate error detection
- Record input/output and state transitions for reproducibility
- Enhanced situation awareness through comprehensive logging
**Multi-Analysis Extension:**
- Rerun H→γγ with improved system prompts
- Extend to H→4ℓ and other Higgs+X channels
- Provide learned materials from previous analyses as reference
**Self-Improvement:**
- Reinforcement learning-style feedback loops
- Agent-driven prompt refinement
- Automatic generalization across HEP analyses
---
## Citation and Acknowledgments
This framework tests LLM agents on ATLAS Open Data from:
- 2020 ATLAS Open Data diphoton samples: https://opendata.cern.ch/record/15006
Models tested via CBORG API (Lawrence Berkeley National Laboratory).
---
## Support and Contributing
For questions or issues:
1. Check existing documentation in `*.md` files
2. Review example configurations in `config.yml`
3. Examine validation logs in output directories
For contributions, please ensure:
- Model list files end with a blank line
- Prompts follow established format
- Validation passes for all test cases |