# Pre-Release Checklist for llm4hep Repository
## βœ… Ready for Public Release
### Documentation
- [x] Comprehensive README.md with all 5 steps documented
- [x] Model mapping documentation (CBORG_MODEL_MAPPINGS.md)
- [x] Analysis notebooks documented
- [x] Installation instructions clear
- [x] Example usage provided
### Core Functionality
- [x] All 5 workflow steps (Snakemake files present)
- [x] Supervisor-coder framework
- [x] Validation system
- [x] Error analysis tools
- [x] Log interpretation
## ⚠️ Issues to Address Before Public Release
### 1. **CRITICAL: API Key Setup**
**Issue:** Users won't have CBORG API access
**Current state:** Code expects `CBORG_API_KEY` from LBL's CBORG system
**Impact:** External users cannot run the code without CBORG access
**Solutions:**
- [x] Add clear notice in README that CBORG access is required
- [x] Provide instructions for requesting CBORG access
- [x] Document how to get CBORG credentials
- [ ] OR: Add alternative OpenAI API support as fallback (optional enhancement)
**Status:** βœ… README now includes Prerequisites section with CBORG access requirements
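The optional OpenAI fallback could be sketched as a small key-resolution shim. Note that `resolve_api_backend`, `LLM_API_KEY`, `LLM_API_BASE`, and both endpoint URLs are illustrative assumptions, not the repository's actual interface:

```bash
#!/bin/sh
# Hypothetical key-resolution shim: prefer CBORG, fall back to OpenAI.
# Variable names and endpoints are illustrative, not the repo's real API.
resolve_api_backend() {
  if [ -n "${CBORG_API_KEY:-}" ]; then
    LLM_API_KEY="$CBORG_API_KEY"
    LLM_API_BASE="https://api.cborg.lbl.gov"   # assumed CBORG endpoint
  elif [ -n "${OPENAI_API_KEY:-}" ]; then
    LLM_API_KEY="$OPENAI_API_KEY"
    LLM_API_BASE="https://api.openai.com/v1"
  else
    echo "error: set CBORG_API_KEY or OPENAI_API_KEY" >&2
    return 1
  fi
  echo "backend: $LLM_API_BASE"
}

# Demo with a dummy key so the output is deterministic:
CBORG_API_KEY="dummy" resolve_api_backend
```

Scripts that source this shim would then read only `LLM_API_KEY`/`LLM_API_BASE`, keeping the CBORG-vs-OpenAI choice in one place.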
### 2. **Data Access**
**Issue:** Reference data paths are NERSC-specific
**Current paths:** `/global/cfs/projectdirs/atlas/...`
**Impact:** External users cannot access data
**Solutions:**
- [x] Already documented in README (users can download from ATLAS Open Data)
- [ ] Add explicit download links for ATLAS Open Data
- [ ] Provide script to download data automatically
- [ ] Document expected directory structure
**Suggested addition (for the README):**
````markdown
### Downloading ATLAS Open Data

```bash
# Download script example
wget https://opendata.cern.ch/record/15006/files/...
# Or provide helper script
bash scripts/download_atlas_data.sh
```
````
### 3. **Reference Solution Arrays**
**Status:** βœ… Partially addressed
- [x] `.gitignore` properly excludes large .npy files
- [x] `solution/arrays/README.md` explains missing files
- [x] `scripts/fetch_solution_arrays.sh` exists
- [ ] Script is hard-coded to a NERSC path and won't work externally
**Fix needed:**
```bash
# In fetch_solution_arrays.sh, line 7:
# Current:
SRC_DIR=${REF_SOLN_DIR:-/global/cfs/projectdirs/atlas/dwkim/llm4hep/solution/arrays}
# Should be:
SRC_DIR=${REF_SOLN_DIR:-./solution_reference}
# And add instructions to generate arrays or download them
```
### 4. **Configuration Files**
**Status:** βœ… COMPLETED
**config.example.yml:**
- [x] Created comprehensive example config with all options
- [x] Added comments explaining each field
- [x] Listed all available CBORG models
- [x] Documented supervisor/coder roles, temperature, max_iterations, out_dir
**models.example.txt:**
- [x] Created example file with clear formatting
- [x] Added examples for major model families (Anthropic, OpenAI, Google, xAI, AWS)
- [x] Emphasized blank line requirement
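Based on the fields listed above, a `config.yml` might look like the following sketch (model identifiers and values are illustrative placeholders only):

```yaml
# Illustrative config sketch - field names from the documented options,
# values are placeholders
supervisor: anthropic/claude-sonnet
coder: openai/gpt-4o
temperature: 0.2
max_iterations: 5
out_dir: results/
```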
### 5. **Model Lists**
**Status:** βœ… COMPLETED
**models.example.txt:**
- [x] Created clean example with proper formatting
- [x] Added clear comments and instructions
- [x] Included examples for all major model families
- [x] Emphasized blank line requirement with warning
**Note:** Actual `models.txt` and `config.yml` are user-specific and properly excluded from git
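For reference, a hypothetical `models.txt` might look like this (model identifiers are illustrative; note the required trailing blank line):

```text
anthropic/claude-sonnet
openai/gpt-4o
google/gemini-pro

```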
### 6. **Dependencies and Environment**
**environment.yml:**
- [x] Looks complete
- [ ] Should test on fresh environment to verify
- [ ] Some packages may have version conflicts (ROOT + latest Python)
**Missing:**
- [ ] No `requirements.txt` for pip-only users
- [ ] No Docker/container option for reproducibility
**Suggestions:**
```bash
# Add requirements.txt
pip freeze > requirements.txt
# Add Dockerfile
# Or at minimum, document tested versions
```
### 7. **Unused/Testing Files**
**Status:** βœ… COMPLETED
**Cleaned up:**
- [x] `testing_area/` - Deleted by user
- [x] `model_test_output.txt` - Added to .gitignore
- [x] `tmp_results/` - Added to .gitignore
- [x] `all_stats.csv` - Added to .gitignore
- [x] `solution/arrays_incorrect/` - Deleted (unused development files)
- [x] `solution/results/` - Deleted (redundant ROOT files)
- [x] `solution/__pycache__/` - Deleted
- [x] `jobs/slurm/*.out` - Old SLURM outputs deleted, added to .gitignore
**Action:** βœ… All test artifacts cleaned up and properly ignored
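Taken together, the ignored artifacts above correspond to `.gitignore` entries along these lines (the `.npy` pattern is an assumption based on item 3):

```gitignore
# Test artifacts (see list above)
model_test_output.txt
tmp_results/
all_stats.csv
jobs/slurm/*.out
# Large reference arrays (see item 3)
solution/arrays/*.npy
```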
### 8. **Licensing**
**Status:** βœ… COMPLETED
**CRITICAL for public release:**
- [x] LICENSE file added (MIT License)
- [x] Copyright notice includes UC Berkeley and all contributors
- [x] Proper legal protection for public repository
**Copyright:** The Regents of the University of California, on behalf of its Berkeley campus, and contributors
### 9. **Citation and Attribution**
**Should add:**
- [ ] CITATION.cff file
- [ ] BibTeX entry in README
- [ ] Acknowledgments section
- [ ] Links to papers (if applicable)
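A `CITATION.cff` skeleton could look like the following (authors and version are placeholders to be filled in by the maintainers):

```yaml
cff-version: 1.2.0
message: "If you use llm4hep in your research, please cite it."
title: "llm4hep"
type: software
version: 0.1.0            # placeholder
authors:
  - family-names: "Doe"   # placeholder
    given-names: "Jane"
```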
### 10. **Testing and Examples**
**Should provide:**
- [ ] Quick start example (5-minute test)
- [ ] Full workflow example
- [ ] Expected output examples
- [ ] Sample results for validation
**Suggested: Add `examples/` directory:**
```
examples/
quick_start.sh # 1-step test
full_workflow.sh # All 5 steps
expected_output/ # What users should see
```
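A minimal `quick_start.sh` could simply verify prerequisites before handing off to Snakemake; the tool names and the commented-out invocation are assumptions about the workflow entry point:

```bash
#!/bin/sh
# Hypothetical quick_start.sh sketch: check prerequisites, then (in real
# use) launch the first workflow step. Tool names are assumptions.
check_prereqs() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || { echo "missing: $cmd"; return 1; }
  done
  echo "prerequisites OK"
}

check_prereqs sh        # 'sh' always exists, so this prints "prerequisites OK"
# Real usage (assumed entry point):
# check_prereqs snakemake python && snakemake --cores 1
```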
## πŸ“‹ Recommended File Additions
### 1. LICENSE
βœ… Done: MIT License added (see item 8 above), recommended for maximum reuse
### 2. CONTRIBUTING.md
Guidelines for external contributors
### 3. CHANGELOG.md
Track versions and changes
### 4. .github/workflows/
- [ ] CI/CD for testing
- [ ] Automated documentation builds
### 5. scripts/setup.sh
One-command setup script:
```bash
#!/bin/bash
# Complete setup for llm4hep
set -euo pipefail
# 1. Check prerequisites
command -v conda >/dev/null || { echo "conda is required" >&2; exit 1; }
# 2. Set up conda environment
conda env create -f environment.yml
# 3. Configure API keys
: "${CBORG_API_KEY:?export CBORG_API_KEY before running}"
# 4. Download reference data
bash scripts/fetch_solution_arrays.sh
# 5. Validate installation
python -c "import snakemake" && echo "Setup complete"
```
## πŸ” Code Quality Issues
### Fixed Issues:
1. **SLURM output path:** βœ… Fixed in `jobs/run_tests.sh` to use relative path `jobs/slurm/%j.out`
2. **Test file cleanup:** βœ… All temporary files removed and ignored
### Minor Issues Remaining:
1. **Commented-out code:** `test_models.sh` has `# source ~/.apikeys.sh` commented
- Should either uncomment or remove
2. **Inconsistent error handling:** Some scripts check for API key, others don't
- Not critical for initial release
3. **Hard-coded paths:** Several scripts have NERSC-specific paths
- Documented in README as institutional limitation
## βœ… Action Items Summary
**High Priority (blocking release):**
1. βœ… Add LICENSE file - **COMPLETED (MIT License)**
2. βœ… Document CBORG API access requirements clearly - **COMPLETED in README**
3. βœ… Fix/remove NERSC-specific paths - **DOCUMENTED as institutional limitation**
4. βœ… Clean up test files or add to .gitignore - **COMPLETED**
5. βœ… Add external data download instructions - **PARTIALLY DONE** (documented in README)
**Medium Priority (improve usability):**
6. βœ… Create config.example.yml with documentation - **COMPLETED**
7. βœ… Create models.example.txt - **COMPLETED**
8. [ ] Add quick-start example
9. [ ] Add CITATION.cff
10. [ ] Create setup script
11. [ ] Test environment.yml on fresh install
**Low Priority (nice to have):**
12. [ ] Add requirements.txt
13. [ ] Add Docker option
14. [ ] Add CI/CD
15. [ ] Add CONTRIBUTING.md
## 🎯 Minimal Viable Public Release
**Status: βœ… READY FOR PUBLIC RELEASE**
All minimal viable release requirements completed:
1. βœ… **LICENSE** - MIT License added with UC Berkeley copyright
2. βœ… **Updated README** - Comprehensive documentation with CBORG access notice and Prerequisites section
3. βœ… **Clean up** - testing_area/, temp files, and old SLURM outputs removed; .gitignore updated
4. βœ… **config.example.yml** and **models.example.txt** - Created with full documentation
5. βœ… **Data download instructions** - Documented in README with reference to ATLAS Open Data
**Additional improvements made:**
- βœ… Fixed SLURM output path in jobs/run_tests.sh
- βœ… Cleaned solution/ directory (removed arrays_incorrect/, results/, __pycache__/)
- βœ… Updated .gitignore comprehensively
- βœ… All critical paths and dependencies documented
**The repository is now ready to be made public with clear expectations and proper documentation.**