# Large Language Model Analysis Framework for High Energy Physics

A framework for testing and evaluating Large Language Models (LLMs) on ATLAS H→γγ analysis tasks using a supervisor-coder architecture.
## Table of Contents

- [Setup](#setup)
- [Data and Solution](#data-and-solution)
- [Running Tests](#running-tests)
- [Analysis and Visualization](#analysis-and-visualization)
- [Project Structure](#project-structure)
- [Advanced Usage](#advanced-usage)
- [Roadmap and Future Directions](#roadmap-and-future-directions)
- [Citation and Acknowledgments](#citation-and-acknowledgments)
- [Support and Contributing](#support-and-contributing)

---
## Setup

### Prerequisites

**CBORG API Access Required**

This framework uses Lawrence Berkeley National Laboratory's CBORG API to access various LLMs. To use this code, you will need:

1. Access to the CBORG API (contact LBL for access)
2. A CBORG API key
3. Network access to the CBORG API endpoint
**Note for External Users:** CBORG is an internal LBL system. External users may need to:

- Request guest access through LBL collaborations
- Adapt the code to use the OpenAI API directly (requires code modifications; see the sketch below)
- Contact the repository maintainers for alternative deployment options
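A minimal sketch of that adaptation, assuming the framework's LLM calls go through the `openai` Python package (the model name and environment variable below are illustrative, not the framework's actual configuration):

```python
import os
from openai import OpenAI

# Read the key from the environment rather than hard-coding it.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # instead of CBORG_API_KEY

# Replace CBORG aliases such as 'anthropic/claude-sonnet:latest'
# with the provider's own model names.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this ROOT file's branches."}],
)
print(response.choices[0].message.content)
```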
### Environment Setup

Create and activate the Conda environment:

```bash
mamba env create -f environment.yml
conda activate llm_env
```
### API Configuration

Create a script `~/.apikeys.sh` that exports your CBORG API key:

```bash
export CBORG_API_KEY="INSERT_API_KEY"
```

Then source it before running tests:

```bash
source ~/.apikeys.sh
```
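To confirm the key is visible in your current shell before launching a run:

```bash
# Prints "key is set" only if CBORG_API_KEY is exported and non-empty
echo "${CBORG_API_KEY:+key is set}"
```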
### Initial Configuration

Before running tests, set up your configuration files:

```bash
# Copy example configuration files
cp config.example.yml config.yml
cp models.example.txt models.txt
# Edit config.yml to set your preferred models and parameters
# Edit models.txt to list models you want to test
```

**Important:** The `models.txt` file must end with a blank line.

---
## Data and Solution

### ATLAS Open Data Samples

All four data samples and Monte Carlo Higgs→γγ samples (including ttH) from the 2020 ATLAS Open Data diphoton campaign are available at:

```
/global/cfs/projectdirs/atlas/eligd/llm_for_analysis_copy/data/
```

**Important:** If copying data elsewhere, make the directory read-only to prevent LLM-generated code from modifying files:

```bash
chmod -R a-w /path/to/data/directory
```
### Reference Solution

- Navigate to the `solution/` directory and run `python soln.py`
- Use the flags `--step1`, `--step2`, `--step3`, and `--plot` to control execution
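For example, to run all three steps and produce the plots:

```bash
cd solution
python soln.py --step1 --step2 --step3 --plot
```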
### Reference Arrays for Validation

Large `.npy` reference arrays are not committed to Git (see `.gitignore`).

**Quick fetch from repo root:**

```bash
bash scripts/fetch_solution_arrays.sh
```

**Or copy from the NERSC shared path:**

```
/global/cfs/projectdirs/atlas/dwkim/llm_test_dev_cxyang/llm_for_analysis/solution/arrays
```
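For example, assuming the arrays belong under `solution/arrays/` in your checkout:

```bash
cp -r /global/cfs/projectdirs/atlas/dwkim/llm_test_dev_cxyang/llm_for_analysis/solution/arrays solution/
```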
---
## Running Tests

### Model Configuration

Three model list files control testing:

- **`models.txt`**: Models for sequential testing
- **`models_supervisor.txt`**: Supervisor models for paired testing
- **`models_coder.txt`**: Coder models for paired testing

**Important formatting rules:**

- One model per line
- File must end with a blank line
- Repeat model names for multiple trials
- Use CBORG aliases (e.g., `anthropic/claude-sonnet:latest`)

See `CBORG_MODEL_MAPPINGS.md` for available models and their actual versions.
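For example, a `models.txt` that runs two trials of one model and one trial of another (note the trailing blank line; aliases as used elsewhere in this README):

```text
anthropic/claude-sonnet:latest
anthropic/claude-sonnet:latest
lbl/cborg-deepthought

```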
### Testing Workflows

#### 1. Sequential Testing (Single Model at a Time)

```bash
bash test_models.sh output_dir_name
```

Tests all models in `models.txt` sequentially.

#### 2. Parallel Testing (Multiple Models)

```bash
# Basic parallel execution
bash test_models_parallel.sh output_dir_name

# GNU Parallel (recommended for large-scale testing)
bash test_models_parallel_gnu.sh output_dir_name [max_models] [tasks_per_model]

# Examples:
bash test_models_parallel_gnu.sh experiment1     # Default: 5 models, 5 tasks each
bash test_models_parallel_gnu.sh test 3 5        # 3 models, 5 tasks per model
bash test_models_parallel_gnu.sh large_test 10 5 # 10 models, 5 tasks each
```

**GNU Parallel features:**

- Scales to 20-30 models with 200-300 total parallel jobs
- Automatic resource management
- Fast I/O using a `/dev/shm` temporary workspace
- Comprehensive error handling and logging
#### 3. Step-by-Step Testing with Validation

```bash
# Run all 5 steps with validation
./run_smk_sequential.sh --validate

# Run specific steps
./run_smk_sequential.sh --step2 --step3 --validate --job-id 002

# Run individual steps
./run_smk_sequential.sh --step1 --validate # Step 1: Summarize ROOT
./run_smk_sequential.sh --step2 --validate # Step 2: Create NumPy arrays
./run_smk_sequential.sh --step3 --validate # Step 3: Preprocess
./run_smk_sequential.sh --step4 --validate # Step 4: Compute scores
./run_smk_sequential.sh --step5 --validate # Step 5: Categorization

# Custom output directory
./run_smk_sequential.sh --step1 --validate --auto-dir # Creates timestamped dir
```

**Directory naming options:**

- `--job-id ID`: Creates `results_job_ID/`
- `--auto-dir`: Creates `results_YYYYMMDD_HHMMSS/`
- `--out-dir DIR`: Custom directory name
### Validation

**Automatic validation (during execution):**

```bash
./run_smk_sequential.sh --step1 --step2 --validate
```

Validation logs are saved to `{output_dir}/logs/*_validation.log`.

**Manual validation (after execution):**

```bash
# Validate all steps
python check_soln.py --out_dir results_job_002

# Validate specific step
python check_soln.py --out_dir results_job_002 --step 2
```
**Validation features:**

- Adaptive tolerance with 4-significant-digit precision (see the sketch below)
- Column-by-column difference analysis
- Side-by-side value comparison
- Clear, actionable error messages
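A minimal sketch of the 4-significant-digit comparison (illustrative only, not the actual adaptive logic in `check_soln.py`):

```python
import numpy as np

def agrees_to_4_significant_digits(generated: np.ndarray, reference: np.ndarray) -> bool:
    """A relative tolerance of 5e-4 roughly corresponds to agreement
    in the first four significant digits of each value."""
    return np.allclose(generated, reference, rtol=5e-4, atol=0.0)
```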
### Speed Optimization

Reduce iteration counts in `config.yml`:

```yaml
# Limit LLM coder attempts (default 10)
max_iterations: 3
```

---
## Analysis and Visualization

### Results Summary

All test results are aggregated in:

```
results_summary.csv
```

**Columns include:** supervisor, coder, step, success, iterations, duration, API_calls, tokens, errors, error_descriptions
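For quick inspection with pandas (assuming the column names above, with `success` recorded as 0/1):

```python
import pandas as pd

df = pd.read_csv("results_summary.csv")
# Success rate per coder model and workflow step
print(df.groupby(["coder", "step"])["success"].mean().unstack())
```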
### Error Analysis and Categorization

**Automated error analysis:**

```bash
python error_analysis.py --results_dirs <dir1> <dir2> ... --output results_summary.csv --model <model_name>
```

Uses an LLM to analyze comprehensive logs and categorize errors into:

- Semantic errors
- Function-calling errors
- Intermediate file not found
- Incorrect branch name
- OpenAI API errors
- Data quality issues (all weights = 0)
- Other/uncategorized
### Interactive Analysis Notebooks

#### 1. Five-Step Performance Analysis (`five_step_analysis.ipynb`)

Comprehensive analysis of model performance across all 5 workflow steps:

- **Success rate heatmap** (models × steps)
- **Agent work progression** (iterations over steps)
- **API call statistics** (by step and model)
- **Cost analysis** (input/output tokens, estimated pricing)

**Output plots:**

- `plots/1_success_rate_heatmap.pdf`
- `plots/2_agent_work_line_plot.pdf`
- `plots/3_api_calls_line_plot.pdf`
- `plots/4_cost_per_step.pdf`
- `plots/five_step_summary_stats.csv`

#### 2. Error Category Analysis (`error_analysis.ipynb`)

Deep dive into error patterns and failure modes:

- **Normalized error distribution** (stacked bar chart with percentages)
- **Error type heatmap** (models × error categories)
- **Top model breakdowns** (faceted plots for the top 9 models)
- **Error trends across steps** (stacked area chart)

**Output plots:**

- `plots/error_distribution_by_model.pdf`
- `plots/error_heatmap_by_model.pdf`
- `plots/error_categories_top_models.pdf`
- `plots/errors_by_step.pdf`

#### 3. Quick Statistics (`plot_stats.ipynb`)

Legacy notebook for basic statistics visualization.
### Log Interpretation

**Automated log analysis:**

```bash
python logs_interpreter.py --log_dir <output_dir> --model lbl/cborg-deepthought --output analysis.txt
```

Analyzes comprehensive supervisor-coder logs to identify:

- Root causes of failures
- Responsible parties (user, supervisor, coder, external)
- Error patterns across iterations

---
## Project Structure

### Core Scripts

- **`supervisor_coder.py`**: Supervisor-coder framework implementation
- **`check_soln.py`**: Solution validation with enhanced comparison
- **`write_prompt.py`**: Prompt management and context chaining
- **`update_stats.py`**: Statistics tracking and CSV updates
- **`error_analysis.py`**: LLM-powered error categorization

### Test Runners

- **`test_models.sh`**: Sequential model testing
- **`test_models_parallel.sh`**: Parallel testing (basic)
- **`test_models_parallel_gnu.sh`**: GNU Parallel testing (recommended)
- **`test_stats.sh`**: Individual model statistics
- **`test_stats_parallel.sh`**: Parallel step execution
- **`run_smk_sequential.sh`**: Step-by-step workflow runner

### Snakemake Workflows (`workflow/`)

The analysis workflow is divided into 5 sequential steps:

1. **`summarize_root.smk`**: Extract ROOT file structure and branch information
2. **`create_numpy.smk`**: Convert ROOT → NumPy arrays
3. **`preprocess.smk`**: Apply preprocessing transformations
4. **`scores.smk`**: Compute signal/background classification scores
5. **`categorization.smk`**: Final categorization and statistical analysis

**Note:** Later steps use solution outputs to enable testing even when earlier steps fail.
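A hypothetical sketch of that fallback pattern in Snakemake syntax (rule and file names here are illustrative, not the actual workflow):

```python
# Illustrative only: step 3 reads the *reference* step-2 arrays, so it can
# run and be validated even if the LLM's own step-2 output is missing or wrong.
rule preprocess:
    input:
        "solution/arrays/step2_arrays.npy"   # reference output, not the LLM's
    output:
        "arrays/step3_preprocessed.npy"
    shell:
        "python generated_code/preprocess.py {input} {output}"
```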
### Prompts (`prompts/`)

- `summarize_root.txt`: Step 1 task description
- `create_numpy.txt`: Step 2 task description
- `preprocess.txt`: Step 3 task description
- `scores.txt`: Step 4 task description
- `categorization.txt`: Step 5 task description
- `supervisor_first_call.txt`: Initial supervisor instructions
- `supervisor_call.txt`: Subsequent supervisor instructions

### Utility Scripts (`util/`)

- **`inspect_root.py`**: ROOT file inspection tools
- **`analyze_particles.py`**: Particle-level analysis
- **`compare_arrays.py`**: NumPy array comparison utilities

### Model Documentation

- **`CBORG_MODEL_MAPPINGS.md`**: CBORG alias → actual model mappings
- **`COMPLETE_MODEL_VERSIONS.md`**: Full version information for all tested models
- **`MODEL_NAME_UPDATES.md`**: Model name standardization notes
- **`O3_MODEL_COMPARISON.md`**: OpenAI o3 model variant comparison

### Analysis Notebooks

- **`five_step_analysis.ipynb`**: Comprehensive 5-step performance analysis
- **`error_analysis.ipynb`**: Error categorization and pattern analysis
- **`error_analysis_plotting.ipynb`**: Additional error visualizations
- **`plot_stats.ipynb`**: Legacy statistics plots
### Output Structure

Each test run creates:

```
output_name/
└── model_timestamp/
    ├── generated_code/ # LLM-generated Python scripts
    ├── logs/           # Execution logs and supervisor records
    ├── arrays/         # NumPy arrays produced by generated code
    ├── plots/          # Comparison plots (generated vs. solution)
    ├── prompt_pairs/   # User + supervisor prompts
    ├── results/        # Temporary ROOT files (job-scoped)
    └── snakemake_log/  # Snakemake execution logs
```
**Job-scoped ROOT outputs:**

- Step 5 uses temporary ROOT files (`signal.root`, `bkgd.root`)
- Written to `${OUTPUT_DIR}/results/` to prevent cross-run interference
- Automatically cleaned after the significance calculation

---
## Advanced Usage

### Supervisor-Coder Configuration

Control iteration limits in `config.yml`:

```yaml
model: 'anthropic/claude-sonnet:latest'
name: 'experiment_name'
out_dir: 'results/experiment_name'
max_iterations: 10 # Maximum supervisor-coder iterations per step
```

### Parallel Execution Tuning

For `test_models_parallel_gnu.sh`:

```bash
# Syntax:
bash test_models_parallel_gnu.sh <output> <max_models> <tasks_per_model>

# Conservative (safe for shared systems):
bash test_models_parallel_gnu.sh test 3 5 # 15 total jobs

# Aggressive (dedicated nodes):
bash test_models_parallel_gnu.sh test 10 10 # 100 total jobs
```
### Custom Validation

Run validation on specific steps or with custom tolerances:

```bash
# Validate only the data conversion step
python check_soln.py --out_dir results/ --step 2

# Check multiple specific steps
python check_soln.py --out_dir results/ --step 2 --step 3 --step 4
```
### Log Analysis Pipeline

```bash
# 1. Run tests
bash test_models_parallel_gnu.sh experiment1 5 5

# 2. Analyze logs with an LLM
python logs_interpreter.py --log_dir experiment1/model_timestamp/ --output analysis.txt

# 3. Categorize errors
python error_analysis.py --results_dirs experiment1/*/ --output summary.csv

# 4. Generate visualizations
jupyter notebook error_analysis.ipynb
```

---
## Roadmap and Future Directions

### Planned Improvements

**Prompt Engineering:**

- Auto-load context (file lists, logs) at step start
- Provide comprehensive inputs/outputs/summaries upfront
- Develop a prompt-management layer for cross-analysis reuse

**Validation & Monitoring:**

- Embed validation in workflows for immediate error detection
- Record inputs/outputs and state transitions for reproducibility
- Enhanced situation awareness through comprehensive logging
**Multi-Analysis Extension:**

- Rerun H→γγ with improved system prompts
- Extend to H→4ℓ and other Higgs+X channels
- Provide learned materials from previous analyses as reference

**Self-Improvement:**

- Reinforcement-learning-style feedback loops
- Agent-driven prompt refinement
- Automatic generalization across HEP analyses

---
## Citation and Acknowledgments

This framework tests LLM agents on ATLAS Open Data from:

- 2020 ATLAS Open Data diphoton samples: https://opendata.cern.ch/record/15006

Models tested via the CBORG API (Lawrence Berkeley National Laboratory).

---
## Support and Contributing

For questions or issues:

1. Check existing documentation in `*.md` files
2. Review example configurations in `config.yml`
3. Examine validation logs in output directories

For contributions, please ensure:

- Model lists end with blank lines
- Prompts follow the established format
- Validation passes for all test cases