# Large Language Model Analysis Framework for High Energy Physics
A framework for testing and evaluating Large Language Models (LLMs) on ATLAS H→γγ analysis tasks using a supervisor-coder architecture.
## Table of Contents
- [Setup](#setup)
- [Data and Solution](#data-and-solution)
- [Running Tests](#running-tests)
- [Analysis and Visualization](#analysis-and-visualization)
- [Project Structure](#project-structure)
- [Advanced Usage](#advanced-usage)
---
## Setup
### Prerequisites
**CBORG API Access Required**
This framework uses Lawrence Berkeley National Laboratory's CBORG API to access various LLMs. To use this code, you will need:
1. Access to the CBORG API (contact LBL for access)
2. A CBORG API key
3. Network access to the CBORG API endpoint
**Note for External Users:** CBORG is an internal LBL system. External users may need to:
- Request guest access through LBL collaborations
- Adapt the code to use OpenAI API directly (requires code modifications)
- Contact the repository maintainers for alternative deployment options
### Environment Setup
Create the Conda environment:
```bash
mamba env create -f environment.yml
conda activate llm_env
```
### API Configuration
Create a script `~/.apikeys.sh` that exports your CBORG API key:
```bash
export CBORG_API_KEY="INSERT_API_KEY"
```
Then source it before running tests:
```bash
source ~/.apikeys.sh
```
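To confirm the key is visible to subsequent scripts, a quick check with shell built-ins can help (a minimal sketch; it only verifies that the variable is non-empty):
```bash
# Load the key and confirm it is exported into the current shell
source ~/.apikeys.sh
if [ -z "${CBORG_API_KEY}" ]; then
    echo "CBORG_API_KEY is not set; check ~/.apikeys.sh" >&2
else
    echo "CBORG_API_KEY is set"
fi
```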
### Initial Configuration
Before running tests, set up your configuration files:
```bash
# Copy example configuration files
cp config.example.yml config.yml
cp models.example.txt models.txt
# Edit config.yml to set your preferred models and parameters
# Edit models.txt to list models you want to test
```
**Important:** The `models.txt` file must end with a blank line.
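If you are unsure whether `models.txt` ends correctly, a small shell check can verify and fix it (a sketch: it appends a final newline only when the last byte is not already one; adjust if the test scripts require a fully empty last line):
```bash
# Append a trailing newline to models.txt only if the last byte is not one
if [ -n "$(tail -c 1 models.txt)" ]; then
    echo >> models.txt
fi
```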
---
## Data and Solution
### ATLAS Open Data Samples
All four data samples and Monte Carlo Higgs→γγ samples (including ttH) from the 2020 ATLAS Open Data diphoton campaign are available at:
```
/global/cfs/projectdirs/atlas/eligd/llm_for_analysis_copy/data/
```
**Important:** If copying data elsewhere, make the directory read-only to prevent LLM-generated code from modifying files:
```bash
chmod -R a-w /path/to/data/directory
```
### Reference Solution
- Navigate to the `solution/` directory and run `python soln.py`
- Use the flags `--step1`, `--step2`, `--step3`, and `--plot` to control which parts run (see the example below)
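For example, to regenerate the full reference solution and its plots in one invocation (assuming the flags can be combined):
```bash
cd solution/
python soln.py --step1 --step2 --step3 --plot
```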
### Reference Arrays for Validation
Large `.npy` reference arrays are not committed to Git (see `.gitignore`).
**Quick fetch from repo root:**
```bash
bash scripts/fetch_solution_arrays.sh
```
**Or copy from NERSC shared path:**
```
/global/cfs/projectdirs/atlas/dwkim/llm_test_dev_cxyang/llm_for_analysis/solution/arrays
```
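For example, copying the arrays into the repository might look like this (a sketch assuming the expected destination is `solution/arrays/`, mirroring the shared layout):
```bash
# Copy the reference arrays from the NERSC shared area into the repo
cp -r /global/cfs/projectdirs/atlas/dwkim/llm_test_dev_cxyang/llm_for_analysis/solution/arrays solution/
```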
---
## Running Tests
### Model Configuration
Three model list files control testing:
- **`models.txt`**: Models for sequential testing
- **`models_supervisor.txt`**: Supervisor models for paired testing
- **`models_coder.txt`**: Coder models for paired testing
**Important formatting rules:**
- One model per line
- File must end with a blank line
- Repeat model names for multiple trials
- Use CBORG aliases (e.g., `anthropic/claude-sonnet:latest`)
See `CBORG_MODEL_MAPPINGS.md` for available models and their actual versions.
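For example, a `models.txt` that runs two trials of one model and one trial of another could be written like this (illustrative aliases; the trailing blank line is preserved by the here-document):
```bash
# Write an example models.txt: two trials of claude-sonnet, one of cborg-deepthought
cat > models.txt <<'EOF'
anthropic/claude-sonnet:latest
anthropic/claude-sonnet:latest
lbl/cborg-deepthought

EOF
```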
### Testing Workflows
#### 1. Sequential Testing (Single Model at a Time)
```bash
bash test_models.sh output_dir_name
```
Tests all models in `models.txt` sequentially.
#### 2. Parallel Testing (Multiple Models)
```bash
# Basic parallel execution
bash test_models_parallel.sh output_dir_name
# GNU Parallel (recommended for large-scale testing)
bash test_models_parallel_gnu.sh output_dir_name [max_models] [tasks_per_model]
# Examples:
bash test_models_parallel_gnu.sh experiment1 # Default: 5 models, 5 tasks each
bash test_models_parallel_gnu.sh test 3 5 # 3 models, 5 tasks per model
bash test_models_parallel_gnu.sh large_test 10 5 # 10 models, 5 tasks each
```
**GNU Parallel features:**
- Scales to 20-30 models with 200-300 total parallel jobs
- Automatic resource management
- Fast I/O using a `/dev/shm` temporary workspace
- Comprehensive error handling and logging
#### 3. Step-by-Step Testing with Validation
```bash
# Run all 5 steps with validation
./run_smk_sequential.sh --validate
# Run specific steps
./run_smk_sequential.sh --step2 --step3 --validate --job-id 002
# Run individual steps
./run_smk_sequential.sh --step1 --validate # Step 1: Summarize ROOT
./run_smk_sequential.sh --step2 --validate # Step 2: Create NumPy arrays
./run_smk_sequential.sh --step3 --validate # Step 3: Preprocess
./run_smk_sequential.sh --step4 --validate # Step 4: Compute scores
./run_smk_sequential.sh --step5 --validate # Step 5: Categorization
# Custom output directory
./run_smk_sequential.sh --step1 --validate --auto-dir # Creates timestamped dir
```
**Directory naming options:**
- `--job-id ID`: Creates `results_job_ID/`
- `--auto-dir`: Creates `results_YYYYMMDD_HHMMSS/`
- `--out-dir DIR`: Custom directory name
### Validation
**Automatic validation (during execution):**
```bash
./run_smk_sequential.sh --step1 --step2 --validate
```
Validation logs are saved to `{output_dir}/logs/*_validation.log`.
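These logs can be inspected directly after a run, for example (`results_job_002` is an illustrative output directory following the `--job-id` naming):
```bash
# Show all per-step validation logs for a given run
cat results_job_002/logs/*_validation.log
```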
**Manual validation (after execution):**
```bash
# Validate all steps
python check_soln.py --out_dir results_job_002
# Validate specific step
python check_soln.py --out_dir results_job_002 --step 2
```
**Validation features:**
- βœ… Adaptive tolerance with 4 significant digit precision
- πŸ“Š Column-by-column difference analysis
- πŸ“‹ Side-by-side value comparison
- 🎯 Clear, actionable error messages
### Speed Optimization
Reduce iteration counts in `config.yml`:
```yaml
# Limit LLM coder attempts (default 10)
max_iterations: 3
```
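The same change can be made from the command line, for instance (a sketch assuming `max_iterations` is a top-level key in `config.yml`):
```bash
# Lower the coder iteration limit to 3 (replaces the whole max_iterations line)
sed -i 's/^max_iterations:.*/max_iterations: 3  # Limit LLM coder attempts (default 10)/' config.yml
```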
---
## Analysis and Visualization
### Results Summary
All test results are aggregated in:
```
results_summary.csv
```
**Columns include:** supervisor, coder, step, success, iterations, duration, API_calls, tokens, errors, error_descriptions
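For a quick look at the aggregated results from the shell before opening the notebooks, standard utilities are enough (a sketch; quoted fields containing commas are not handled):
```bash
# Print the header and first few rows of the results as an aligned table
head -n 5 results_summary.csv | column -t -s,
```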
### Error Analysis and Categorization
**Automated error analysis:**
```bash
python error_analysis.py --results_dirs <dir1> <dir2> ... --output results_summary.csv --model <model_name>
```
Uses an LLM to analyze the comprehensive logs and categorize errors into:
- Semantic errors
- Function-calling errors
- Intermediate file not found
- Incorrect branch name
- OpenAI API errors
- Data quality issues (all weights = 0)
- Other/uncategorized
### Interactive Analysis Notebooks
#### 1. Five-Step Performance Analysis (`five_step_analysis.ipynb`)
Comprehensive analysis of model performance across all 5 workflow steps:
- **Success rate heatmap** (models Γ— steps)
- **Agent work progression** (iterations over steps)
- **API call statistics** (by step and model)
- **Cost analysis** (input/output tokens, estimated pricing)
**Output plots:**
- `plots/1_success_rate_heatmap.pdf`
- `plots/2_agent_work_line_plot.pdf`
- `plots/3_api_calls_line_plot.pdf`
- `plots/4_cost_per_step.pdf`
- `plots/five_step_summary_stats.csv`
#### 2. Error Category Analysis (`error_analysis.ipynb`)
Deep dive into error patterns and failure modes:
- **Normalized error distribution** (stacked bar chart with percentages)
- **Error type heatmap** (models Γ— error categories)
- **Top model breakdowns** (faceted plots for top 9 models)
- **Error trends across steps** (stacked area chart)
**Output plots:**
- `plots/error_distribution_by_model.pdf`
- `plots/error_heatmap_by_model.pdf`
- `plots/error_categories_top_models.pdf`
- `plots/errors_by_step.pdf`
#### 3. Quick Statistics (`plot_stats.ipynb`)
Legacy notebook for basic statistics visualization.
### Log Interpretation
**Automated log analysis:**
```bash
python logs_interpreter.py --log_dir <output_dir> --model lbl/cborg-deepthought --output analysis.txt
```
Analyzes comprehensive supervisor-coder logs to identify:
- Root causes of failures
- Responsible parties (user, supervisor, coder, external)
- Error patterns across iterations
---
## Project Structure
### Core Scripts
- **`supervisor_coder.py`**: Supervisor-coder framework implementation
- **`check_soln.py`**: Solution validation with enhanced comparison
- **`write_prompt.py`**: Prompt management and context chaining
- **`update_stats.py`**: Statistics tracking and CSV updates
- **`error_analysis.py`**: LLM-powered error categorization
### Test Runners
- **`test_models.sh`**: Sequential model testing
- **`test_models_parallel.sh`**: Parallel testing (basic)
- **`test_models_parallel_gnu.sh`**: GNU Parallel testing (recommended)
- **`test_stats.sh`**: Individual model statistics
- **`test_stats_parallel.sh`**: Parallel step execution
- **`run_smk_sequential.sh`**: Step-by-step workflow runner
### Snakemake Workflows (`workflow/`)
The analysis workflow is divided into 5 sequential steps:
1. **`summarize_root.smk`**: Extract ROOT file structure and branch information
2. **`create_numpy.smk`**: Convert ROOT β†’ NumPy arrays
3. **`preprocess.smk`**: Apply preprocessing transformations
4. **`scores.smk`**: Compute signal/background classification scores
5. **`categorization.smk`**: Final categorization and statistical analysis
**Note:** Later steps use solution outputs to enable testing even when earlier steps fail.
### Prompts (`prompts/`)
- `summarize_root.txt`: Step 1 task description
- `create_numpy.txt`: Step 2 task description
- `preprocess.txt`: Step 3 task description
- `scores.txt`: Step 4 task description
- `categorization.txt`: Step 5 task description
- `supervisor_first_call.txt`: Initial supervisor instructions
- `supervisor_call.txt`: Subsequent supervisor instructions
### Utility Scripts (`util/`)
- **`inspect_root.py`**: ROOT file inspection tools
- **`analyze_particles.py`**: Particle-level analysis
- **`compare_arrays.py`**: NumPy array comparison utilities
### Model Documentation
- **`CBORG_MODEL_MAPPINGS.md`**: CBORG alias β†’ actual model mappings
- **`COMPLETE_MODEL_VERSIONS.md`**: Full version information for all tested models
- **`MODEL_NAME_UPDATES.md`**: Model name standardization notes
- **`O3_MODEL_COMPARISON.md`**: OpenAI O3 model variant comparison
### Analysis Notebooks
- **`five_step_analysis.ipynb`**: Comprehensive 5-step performance analysis
- **`error_analysis.ipynb`**: Error categorization and pattern analysis
- **`error_analysis_plotting.ipynb`**: Additional error visualizations
- **`plot_stats.ipynb`**: Legacy statistics plots
### Output Structure
Each test run creates:
```
output_name/
β”œβ”€β”€ model_timestamp/
β”‚   β”œβ”€β”€ generated_code/   # LLM-generated Python scripts
β”‚   β”œβ”€β”€ logs/             # Execution logs and supervisor records
β”‚   β”œβ”€β”€ arrays/           # NumPy arrays produced by generated code
β”‚   β”œβ”€β”€ plots/            # Comparison plots (generated vs. solution)
β”‚   β”œβ”€β”€ prompt_pairs/     # User + supervisor prompts
β”‚   β”œβ”€β”€ results/          # Temporary ROOT files (job-scoped)
β”‚   └── snakemake_log/    # Snakemake execution logs
```
**Job-scoped ROOT outputs:**
- Step 5 uses temporary ROOT files (`signal.root`, `bkgd.root`)
- Written to `${OUTPUT_DIR}/results/` to prevent cross-run interference
- Automatically cleaned after significance calculation
---
## Advanced Usage
### Supervisor-Coder Configuration
Control iteration limits in `config.yml`:
```yaml
model: 'anthropic/claude-sonnet:latest'
name: 'experiment_name'
out_dir: 'results/experiment_name'
max_iterations: 10 # Maximum supervisor-coder iterations per step
```
### Parallel Execution Tuning
For `test_models_parallel_gnu.sh`:
```bash
# Syntax:
bash test_models_parallel_gnu.sh <output> <max_models> <tasks_per_model>
# Conservative (safe for shared systems):
bash test_models_parallel_gnu.sh test 3 5 # 15 total jobs
# Aggressive (dedicated nodes):
bash test_models_parallel_gnu.sh test 10 10 # 100 total jobs
```
### Custom Validation
Run validation on specific steps or with custom tolerances:
```bash
# Validate only data conversion step
python check_soln.py --out_dir results/ --step 2
# Check multiple specific steps
python check_soln.py --out_dir results/ --step 2 --step 3 --step 4
```
### Log Analysis Pipeline
```bash
# 1. Run tests
bash test_models_parallel_gnu.sh experiment1 5 5
# 2. Analyze logs with LLM
python logs_interpreter.py --log_dir experiment1/model_timestamp/ --output analysis.txt
# 3. Categorize errors
python error_analysis.py --results_dirs experiment1/*/ --output summary.csv
# 4. Generate visualizations
jupyter notebook error_analysis.ipynb
```
---
## Roadmap and Future Directions
### Planned Improvements
**Prompt Engineering:**
- Auto-load context (file lists, logs) at step start
- Provide comprehensive inputs/outputs/summaries upfront
- Develop prompt-management layer for cross-analysis reuse
**Validation & Monitoring:**
- Embed validation in workflows for immediate error detection
- Record input/output and state transitions for reproducibility
- Enhanced situation awareness through comprehensive logging
**Multi-Analysis Extension:**
- Rerun H→γγ with improved system prompts
- Extend to H→4ℓ and other Higgs+X channels
- Provide learned materials from previous analyses as reference
**Self-Improvement:**
- Reinforcement learning–style feedback loops
- Agent-driven prompt refinement
- Automatic generalization across HEP analyses
---
## Citation and Acknowledgments
This framework tests LLM agents on ATLAS Open Data from:
- 2020 ATLAS Open Data diphoton samples: https://opendata.cern.ch/record/15006
Models tested via CBORG API (Lawrence Berkeley National Laboratory).
---
## Support and Contributing
For questions or issues:
1. Check existing documentation in `*.md` files
2. Review example configurations in `config.yml`
3. Examine validation logs in output directories
For contributions, please ensure:
- Model lists end with blank lines
- Prompts follow established format
- Validation passes for all test cases