# Large Language Model Analysis Framework for High Energy Physics
A framework for testing and evaluating Large Language Models (LLMs) on ATLAS H→γγ analysis tasks using a supervisor-coder architecture.
## Table of Contents
- [Setup](#setup)
- [Data and Solution](#data-and-solution)
- [Running Tests](#running-tests)
- [Analysis and Visualization](#analysis-and-visualization)
- [Project Structure](#project-structure)
- [Advanced Usage](#advanced-usage)
---
## Setup
### Prerequisites
**CBORG API Access Required**
This framework uses Lawrence Berkeley National Laboratory's CBORG API to access various LLMs. To use this code, you will need:
1. Access to the CBORG API (contact LBL for access)
2. A CBORG API key
3. Network access to the CBORG API endpoint
**Note for External Users:** CBORG is an internal LBL system. External users may need to:
- Request guest access through LBL collaborations
- Adapt the code to use OpenAI API directly (requires code modifications)
- Contact the repository maintainers for alternative deployment options
### Environment Setup
Create the Conda environment:
```bash
mamba env create -f environment.yml
conda activate llm_env
```
### API Configuration
Create a script `~/.apikeys.sh` that exports your CBORG API key:
```bash
export CBORG_API_KEY="INSERT_API_KEY"
```
Then source it before running tests:
```bash
source ~/.apikeys.sh
```
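To confirm the key is visible to subsequent scripts, a quick check with shell built-ins can help (a minimal sketch; it only verifies that the variable is non-empty):
```bash
# Load the key and confirm it is exported into the current shell
source ~/.apikeys.sh
if [ -z "${CBORG_API_KEY}" ]; then
    echo "CBORG_API_KEY is not set; check ~/.apikeys.sh" >&2
else
    echo "CBORG_API_KEY is set"
fi
```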
### Initial Configuration
Before running tests, set up your configuration files:
```bash
# Copy example configuration files
cp config.example.yml config.yml
cp models.example.txt models.txt
# Edit config.yml to set your preferred models and parameters
# Edit models.txt to list models you want to test
```
**Important:** The `models.txt` file must end with a blank line.
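If you are unsure whether `models.txt` ends correctly, a small shell check can verify and fix it (a sketch: it appends a final newline only when the last byte is not already one; adjust if the test scripts require a fully empty last line):
```bash
# Append a trailing newline to models.txt only if the last byte is not one
if [ -n "$(tail -c 1 models.txt)" ]; then
    echo >> models.txt
fi
```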
---
## Data and Solution
### ATLAS Open Data Samples
All four data samples and Monte Carlo Higgs→γγ samples (including ttH) from the 2020 ATLAS Open Data diphoton campaign are available at:
```
/global/cfs/projectdirs/atlas/eligd/llm_for_analysis_copy/data/
```
**Important:** If copying data elsewhere, make the directory read-only to prevent LLM-generated code from modifying files:
```bash
chmod -R a-w /path/to/data/directory
```
### Reference Solution
- Navigate to the `solution/` directory and run `python soln.py`
- Use the flags `--step1`, `--step2`, `--step3`, and `--plot` to control which parts run (see the example below)
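For example, to regenerate the full reference solution and its plots in one invocation (assuming the flags can be combined):
```bash
cd solution/
python soln.py --step1 --step2 --step3 --plot
```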
### Reference Arrays for Validation
Large `.npy` reference arrays are not committed to Git (see `.gitignore`).
**Quick fetch from repo root:**
```bash
bash scripts/fetch_solution_arrays.sh
```
**Or copy from NERSC shared path:**
```
/global/cfs/projectdirs/atlas/dwkim/llm_test_dev_cxyang/llm_for_analysis/solution/arrays
```
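For example, copying the arrays into the repository might look like this (a sketch assuming the expected destination is `solution/arrays/`, mirroring the shared layout):
```bash
# Copy the reference arrays from the NERSC shared area into the repo
cp -r /global/cfs/projectdirs/atlas/dwkim/llm_test_dev_cxyang/llm_for_analysis/solution/arrays solution/
```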
---
## Running Tests
### Model Configuration
Three model list files control testing:
- **`models.txt`**: Models for sequential testing
- **`models_supervisor.txt`**: Supervisor models for paired testing
- **`models_coder.txt`**: Coder models for paired testing
**Important formatting rules:**
- One model per line
- File must end with a blank line
- Repeat model names for multiple trials
- Use CBORG aliases (e.g., `anthropic/claude-sonnet:latest`)
See `CBORG_MODEL_MAPPINGS.md` for available models and their actual versions.
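For example, a `models.txt` that runs two trials of one model and one trial of another could be written like this (illustrative aliases; the trailing blank line is preserved by the here-document):
```bash
# Write an example models.txt: two trials of claude-sonnet, one of cborg-deepthought
cat > models.txt <<'EOF'
anthropic/claude-sonnet:latest
anthropic/claude-sonnet:latest
lbl/cborg-deepthought

EOF
```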
### Testing Workflows
#### 1. Sequential Testing (Single Model at a Time)
```bash
bash test_models.sh output_dir_name
```
Tests all models in `models.txt` sequentially.
#### 2. Parallel Testing (Multiple Models)
```bash
# Basic parallel execution
bash test_models_parallel.sh output_dir_name
# GNU Parallel (recommended for large-scale testing)
bash test_models_parallel_gnu.sh output_dir_name [max_models] [tasks_per_model]
# Examples:
bash test_models_parallel_gnu.sh experiment1 # Default: 5 models, 5 tasks each
bash test_models_parallel_gnu.sh test 3 5 # 3 models, 5 tasks per model
bash test_models_parallel_gnu.sh large_test 10 5 # 10 models, 5 tasks each
```
**GNU Parallel features:**
- Scales to 20-30 models with 200-300 total parallel jobs
- Automatic resource management
- Fast I/O using a `/dev/shm` temporary workspace
- Comprehensive error handling and logging
#### 3. Step-by-Step Testing with Validation
```bash
# Run all 5 steps with validation
./run_smk_sequential.sh --validate
# Run specific steps
./run_smk_sequential.sh --step2 --step3 --validate --job-id 002
# Run individual steps
./run_smk_sequential.sh --step1 --validate # Step 1: Summarize ROOT
./run_smk_sequential.sh --step2 --validate # Step 2: Create NumPy arrays
./run_smk_sequential.sh --step3 --validate # Step 3: Preprocess
./run_smk_sequential.sh --step4 --validate # Step 4: Compute scores
./run_smk_sequential.sh --step5 --validate # Step 5: Categorization
# Custom output directory
./run_smk_sequential.sh --step1 --validate --auto-dir # Creates timestamped dir
```
**Directory naming options:**
- `--job-id ID`: Creates `results_job_ID/`
- `--auto-dir`: Creates `results_YYYYMMDD_HHMMSS/`
- `--out-dir DIR`: Custom directory name
### Validation
**Automatic validation (during execution):**
```bash
./run_smk_sequential.sh --step1 --step2 --validate
```
Validation logs are saved to `{output_dir}/logs/*_validation.log`.
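These logs can be inspected directly after a run, for example (`results_job_002` is an illustrative output directory following the `--job-id` naming):
```bash
# Show all per-step validation logs for a given run
cat results_job_002/logs/*_validation.log
```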
**Manual validation (after execution):**
```bash
# Validate all steps
python check_soln.py --out_dir results_job_002
# Validate specific step
python check_soln.py --out_dir results_job_002 --step 2
```
**Validation features:**
- βœ… Adaptive tolerance with 4 significant digit precision
- πŸ“Š Column-by-column difference analysis
- πŸ“‹ Side-by-side value comparison
- 🎯 Clear, actionable error messages
### Speed Optimization
Reduce iteration counts in `config.yml`:
```yaml
# Limit LLM coder attempts (default 10)
max_iterations: 3
```
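The same change can be made from the command line, for instance (a sketch assuming `max_iterations` is a top-level key in `config.yml`):
```bash
# Lower the coder iteration limit to 3 (replaces the whole max_iterations line)
sed -i 's/^max_iterations:.*/max_iterations: 3  # Limit LLM coder attempts (default 10)/' config.yml
```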
---
## Analysis and Visualization
### Results Summary
All test results are aggregated in:
```
results_summary.csv
```
**Columns include:** supervisor, coder, step, success, iterations, duration, API_calls, tokens, errors, error_descriptions
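For a quick look at the aggregated results from the shell before opening the notebooks, standard utilities are enough (a sketch; quoted fields containing commas are not handled):
```bash
# Print the header and first few rows of the results as an aligned table
head -n 5 results_summary.csv | column -t -s,
```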
### Error Analysis and Categorization
**Automated error analysis:**
```bash
python error_analysis.py --results_dirs <dir1> <dir2> ... --output results_summary.csv --model <model_name>
```
Uses an LLM to analyze the comprehensive logs and categorize errors into:
- Semantic errors
- Function-calling errors
- Intermediate file not found
- Incorrect branch name
- OpenAI API errors
- Data quality issues (all weights = 0)
- Other/uncategorized
### Interactive Analysis Notebooks
#### 1. Five-Step Performance Analysis (`five_step_analysis.ipynb`)
Comprehensive analysis of model performance across all 5 workflow steps:
- **Success rate heatmap** (models Γ— steps)
- **Agent work progression** (iterations over steps)
- **API call statistics** (by step and model)
- **Cost analysis** (input/output tokens, estimated pricing)
**Output plots:**
- `plots/1_success_rate_heatmap.pdf`
- `plots/2_agent_work_line_plot.pdf`
- `plots/3_api_calls_line_plot.pdf`
- `plots/4_cost_per_step.pdf`
- `plots/five_step_summary_stats.csv`
#### 2. Error Category Analysis (`error_analysis.ipynb`)
Deep dive into error patterns and failure modes:
- **Normalized error distribution** (stacked bar chart with percentages)
- **Error type heatmap** (models Γ— error categories)
- **Top model breakdowns** (faceted plots for top 9 models)
- **Error trends across steps** (stacked area chart)
**Output plots:**
- `plots/error_distribution_by_model.pdf`
- `plots/error_heatmap_by_model.pdf`
- `plots/error_categories_top_models.pdf`
- `plots/errors_by_step.pdf`
#### 3. Quick Statistics (`plot_stats.ipynb`)
Legacy notebook for basic statistics visualization.
### Log Interpretation
**Automated log analysis:**
```bash
python logs_interpreter.py --log_dir <output_dir> --model lbl/cborg-deepthought --output analysis.txt
```
Analyzes comprehensive supervisor-coder logs to identify:
- Root causes of failures
- Responsible parties (user, supervisor, coder, external)
- Error patterns across iterations
---
## Project Structure
### Core Scripts
- **`supervisor_coder.py`**: Supervisor-coder framework implementation
- **`check_soln.py`**: Solution validation with enhanced comparison
- **`write_prompt.py`**: Prompt management and context chaining
- **`update_stats.py`**: Statistics tracking and CSV updates
- **`error_analysis.py`**: LLM-powered error categorization
### Test Runners
- **`test_models.sh`**: Sequential model testing
- **`test_models_parallel.sh`**: Parallel testing (basic)
- **`test_models_parallel_gnu.sh`**: GNU Parallel testing (recommended)
- **`test_stats.sh`**: Individual model statistics
- **`test_stats_parallel.sh`**: Parallel step execution
- **`run_smk_sequential.sh`**: Step-by-step workflow runner
### Snakemake Workflows (`workflow/`)
The analysis workflow is divided into 5 sequential steps:
1. **`summarize_root.smk`**: Extract ROOT file structure and branch information
2. **`create_numpy.smk`**: Convert ROOT β†’ NumPy arrays
3. **`preprocess.smk`**: Apply preprocessing transformations
4. **`scores.smk`**: Compute signal/background classification scores
5. **`categorization.smk`**: Final categorization and statistical analysis
**Note:** Later steps use solution outputs to enable testing even when earlier steps fail.
### Prompts (`prompts/`)
- `summarize_root.txt`: Step 1 task description
- `create_numpy.txt`: Step 2 task description
- `preprocess.txt`: Step 3 task description
- `scores.txt`: Step 4 task description
- `categorization.txt`: Step 5 task description
- `supervisor_first_call.txt`: Initial supervisor instructions
- `supervisor_call.txt`: Subsequent supervisor instructions
### Utility Scripts (`util/`)
- **`inspect_root.py`**: ROOT file inspection tools
- **`analyze_particles.py`**: Particle-level analysis
- **`compare_arrays.py`**: NumPy array comparison utilities
### Model Documentation
- **`CBORG_MODEL_MAPPINGS.md`**: CBORG alias β†’ actual model mappings
- **`COMPLETE_MODEL_VERSIONS.md`**: Full version information for all tested models
- **`MODEL_NAME_UPDATES.md`**: Model name standardization notes
- **`O3_MODEL_COMPARISON.md`**: OpenAI O3 model variant comparison
### Analysis Notebooks
- **`five_step_analysis.ipynb`**: Comprehensive 5-step performance analysis
- **`error_analysis.ipynb`**: Error categorization and pattern analysis
- **`error_analysis_plotting.ipynb`**: Additional error visualizations
- **`plot_stats.ipynb`**: Legacy statistics plots
### Output Structure
Each test run creates:
```
output_name/
β”œβ”€β”€ model_timestamp/
β”‚   β”œβ”€β”€ generated_code/   # LLM-generated Python scripts
β”‚   β”œβ”€β”€ logs/             # Execution logs and supervisor records
β”‚   β”œβ”€β”€ arrays/           # NumPy arrays produced by generated code
β”‚   β”œβ”€β”€ plots/            # Comparison plots (generated vs. solution)
β”‚   β”œβ”€β”€ prompt_pairs/     # User + supervisor prompts
β”‚   β”œβ”€β”€ results/          # Temporary ROOT files (job-scoped)
β”‚   └── snakemake_log/    # Snakemake execution logs
```
**Job-scoped ROOT outputs:**
- Step 5 uses temporary ROOT files (`signal.root`, `bkgd.root`)
- Written to `${OUTPUT_DIR}/results/` to prevent cross-run interference
- Automatically cleaned after significance calculation
---
## Advanced Usage
### Supervisor-Coder Configuration
Control iteration limits in `config.yml`:
```yaml
model: 'anthropic/claude-sonnet:latest'
name: 'experiment_name'
out_dir: 'results/experiment_name'
max_iterations: 10 # Maximum supervisor-coder iterations per step
```
### Parallel Execution Tuning
For `test_models_parallel_gnu.sh`:
```bash
# Syntax:
bash test_models_parallel_gnu.sh <output> <max_models> <tasks_per_model>
# Conservative (safe for shared systems):
bash test_models_parallel_gnu.sh test 3 5 # 15 total jobs
# Aggressive (dedicated nodes):
bash test_models_parallel_gnu.sh test 10 10 # 100 total jobs
```
### Custom Validation
Run validation on specific steps or with custom tolerances:
```bash
# Validate only data conversion step
python check_soln.py --out_dir results/ --step 2
# Check multiple specific steps
python check_soln.py --out_dir results/ --step 2 --step 3 --step 4
```
### Log Analysis Pipeline
```bash
# 1. Run tests
bash test_models_parallel_gnu.sh experiment1 5 5
# 2. Analyze logs with LLM
python logs_interpreter.py --log_dir experiment1/model_timestamp/ --output analysis.txt
# 3. Categorize errors
python error_analysis.py --results_dirs experiment1/*/ --output summary.csv
# 4. Generate visualizations
jupyter notebook error_analysis.ipynb
```
---
## Roadmap and Future Directions
### Planned Improvements
**Prompt Engineering:**
- Auto-load context (file lists, logs) at step start
- Provide comprehensive inputs/outputs/summaries upfront
- Develop prompt-management layer for cross-analysis reuse
**Validation & Monitoring:**
- Embed validation in workflows for immediate error detection
- Record input/output and state transitions for reproducibility
- Enhanced situation awareness through comprehensive logging
**Multi-Analysis Extension:**
- Rerun H→γγ with improved system prompts
- Extend to H→4ℓ and other Higgs+X channels
- Provide learned materials from previous analyses as reference
**Self-Improvement:**
- Reinforcement learning–style feedback loops
- Agent-driven prompt refinement
- Automatic generalization across HEP analyses
---
## Citation and Acknowledgments
This framework tests LLM agents on ATLAS Open Data from:
- 2020 ATLAS Open Data diphoton samples: https://opendata.cern.ch/record/15006
Models tested via CBORG API (Lawrence Berkeley National Laboratory).
---
## Support and Contributing
For questions or issues:
1. Check existing documentation in `*.md` files
2. Review example configurations in `config.yml`
3. Examine validation logs in output directories
For contributions, please ensure:
- Model lists end with blank lines
- Prompts follow established format
- Validation passes for all test cases