# Pre-Release Checklist for llm4hep Repository
## βœ… Ready for Public Release
### Documentation
- [x] Comprehensive README.md with all 5 steps documented
- [x] Model mapping documentation (CBORG_MODEL_MAPPINGS.md)
- [x] Analysis notebooks documented
- [x] Installation instructions clear
- [x] Example usage provided
### Core Functionality
- [x] All 5 workflow steps (Snakemake files present)
- [x] Supervisor-coder framework
- [x] Validation system
- [x] Error analysis tools
- [x] Log interpretation
## ⚠️ Issues to Address Before Public Release
### 1. **CRITICAL: API Key Setup**
**Issue:** Users won't have CBORG API access
**Current state:** Code expects `CBORG_API_KEY` from LBL's CBORG system
**Impact:** External users cannot run the code without CBORG access
**Solutions:**
- [x] Add clear notice in README that CBORG access is required
- [x] Provide instructions for requesting CBORG access
- [x] Document how to get CBORG credentials
- [ ] OR: Add alternative OpenAI API support as fallback (optional enhancement)
**Status:** βœ… README now includes Prerequisites section with CBORG access requirements
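The optional OpenAI fallback could be sketched as a small key-resolution shim. Note that `resolve_api_backend`, `LLM_API_KEY`, `LLM_API_BASE`, and both endpoint URLs are illustrative assumptions, not the repository's actual interface:

```bash
#!/bin/sh
# Hypothetical key-resolution shim: prefer CBORG, fall back to OpenAI.
# Variable names and endpoints are illustrative, not the repo's real API.
resolve_api_backend() {
  if [ -n "${CBORG_API_KEY:-}" ]; then
    LLM_API_KEY="$CBORG_API_KEY"
    LLM_API_BASE="https://api.cborg.lbl.gov"   # assumed CBORG endpoint
  elif [ -n "${OPENAI_API_KEY:-}" ]; then
    LLM_API_KEY="$OPENAI_API_KEY"
    LLM_API_BASE="https://api.openai.com/v1"
  else
    echo "error: set CBORG_API_KEY or OPENAI_API_KEY" >&2
    return 1
  fi
  echo "backend: $LLM_API_BASE"
}

# Demo with a dummy key so the output is deterministic:
CBORG_API_KEY="dummy" resolve_api_backend
```

Scripts that source this shim would then read only `LLM_API_KEY`/`LLM_API_BASE`, keeping the CBORG-vs-OpenAI choice in one place.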
### 2. **Data Access**
**Issue:** Reference data paths are NERSC-specific
**Current paths:** `/global/cfs/projectdirs/atlas/...`
**Impact:** External users cannot access data
**Solutions:**
- [x] Already documented in README (users can download from ATLAS Open Data)
- [ ] Add explicit download links for ATLAS Open Data
- [ ] Provide script to download data automatically
- [ ] Document expected directory structure
**Suggested addition (for the README):**
````markdown
### Downloading ATLAS Open Data

```bash
# Download script example
wget https://opendata.cern.ch/record/15006/files/...
# Or provide helper script
bash scripts/download_atlas_data.sh
```
````
### 3. **Reference Solution Arrays**
**Status:** βœ… Partially addressed
- [x] `.gitignore` properly excludes large .npy files
- [x] `solution/arrays/README.md` explains missing files
- [x] `scripts/fetch_solution_arrays.sh` exists
- [ ] Script is hard-coded to a NERSC path and won't work externally
**Fix needed:**
```bash
# In fetch_solution_arrays.sh, line 7:
# Current:
SRC_DIR=${REF_SOLN_DIR:-/global/cfs/projectdirs/atlas/dwkim/llm4hep/solution/arrays}
# Should be:
SRC_DIR=${REF_SOLN_DIR:-./solution_reference}
# And add instructions to generate arrays or download them
```
### 4. **Configuration Files**
**Status:** βœ… COMPLETED
**config.example.yml:**
- [x] Created comprehensive example config with all options
- [x] Added comments explaining each field
- [x] Listed all available CBORG models
- [x] Documented supervisor/coder roles, temperature, max_iterations, out_dir
**models.example.txt:**
- [x] Created example file with clear formatting
- [x] Added examples for major model families (Anthropic, OpenAI, Google, xAI, AWS)
- [x] Emphasized blank line requirement
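Based on the fields listed above, a `config.yml` might look like the following sketch (model identifiers and values are illustrative placeholders only):

```yaml
# Illustrative config sketch - field names from the documented options,
# values are placeholders
supervisor: anthropic/claude-sonnet
coder: openai/gpt-4o
temperature: 0.2
max_iterations: 5
out_dir: results/
```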
### 5. **Model Lists**
**Status:** βœ… COMPLETED
**models.example.txt:**
- [x] Created clean example with proper formatting
- [x] Added clear comments and instructions
- [x] Included examples for all major model families
- [x] Emphasized blank line requirement with warning
**Note:** Actual `models.txt` and `config.yml` are user-specific and properly excluded from git
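For reference, a hypothetical `models.txt` might look like this (model identifiers are illustrative; note the required trailing blank line):

```text
anthropic/claude-sonnet
openai/gpt-4o
google/gemini-pro

```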
### 6. **Dependencies and Environment**
**environment.yml:**
- [x] Looks complete
- [ ] Should test on fresh environment to verify
- [ ] Some packages may have version conflicts (ROOT + latest Python)
**Missing:**
- [ ] No `requirements.txt` for pip-only users
- [ ] No Docker/container option for reproducibility
**Suggestions:**
```bash
# Add requirements.txt
pip freeze > requirements.txt
# Add Dockerfile
# Or at minimum, document tested versions
```
### 7. **Unused/Testing Files**
**Status:** βœ… COMPLETED
**Cleaned up:**
- [x] `testing_area/` - Deleted by user
- [x] `model_test_output.txt` - Added to .gitignore
- [x] `tmp_results/` - Added to .gitignore
- [x] `all_stats.csv` - Added to .gitignore
- [x] `solution/arrays_incorrect/` - Deleted (unused development files)
- [x] `solution/results/` - Deleted (redundant ROOT files)
- [x] `solution/__pycache__/` - Deleted
- [x] `jobs/slurm/*.out` - Old SLURM outputs deleted, added to .gitignore
**Action:** βœ… All test artifacts cleaned up and properly ignored
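Taken together, the ignored artifacts above correspond to `.gitignore` entries along these lines (the `.npy` pattern is an assumption based on item 3):

```gitignore
# Test artifacts (see list above)
model_test_output.txt
tmp_results/
all_stats.csv
jobs/slurm/*.out
# Large reference arrays (see item 3)
solution/arrays/*.npy
```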
### 8. **Licensing**
**Status:** βœ… COMPLETED
**CRITICAL for public release:**
- [x] LICENSE file added (MIT License)
- [x] Copyright notice includes UC Berkeley and all contributors
- [x] Proper legal protection for public repository
**Copyright:** The Regents of the University of California, on behalf of its Berkeley campus, and contributors
### 9. **Citation and Attribution**
**Should add:**
- [ ] CITATION.cff file
- [ ] BibTeX entry in README
- [ ] Acknowledgments section
- [ ] Links to papers (if applicable)
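A `CITATION.cff` skeleton could look like the following (authors and version are placeholders to be filled in by the maintainers):

```yaml
cff-version: 1.2.0
message: "If you use llm4hep in your research, please cite it."
title: "llm4hep"
type: software
version: 0.1.0            # placeholder
authors:
  - family-names: "Doe"   # placeholder
    given-names: "Jane"
```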
### 10. **Testing and Examples**
**Should provide:**
- [ ] Quick start example (5-minute test)
- [ ] Full workflow example
- [ ] Expected output examples
- [ ] Sample results for validation
**Suggested: Add `examples/` directory:**
```
examples/
quick_start.sh # 1-step test
full_workflow.sh # All 5 steps
expected_output/ # What users should see
```
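A minimal `quick_start.sh` could simply verify prerequisites before handing off to Snakemake; the tool names and the commented-out invocation are assumptions about the workflow entry point:

```bash
#!/bin/sh
# Hypothetical quick_start.sh sketch: check prerequisites, then (in real
# use) launch the first workflow step. Tool names are assumptions.
check_prereqs() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || { echo "missing: $cmd"; return 1; }
  done
  echo "prerequisites OK"
}

check_prereqs sh        # 'sh' always exists, so this prints "prerequisites OK"
# Real usage (assumed entry point):
# check_prereqs snakemake python && snakemake --cores 1
```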
## πŸ“‹ Recommended File Additions
### 1. LICENSE
βœ… Done: MIT License added (see item 8 above), recommended for maximum reuse
### 2. CONTRIBUTING.md
Guidelines for external contributors
### 3. CHANGELOG.md
Track versions and changes
### 4. .github/workflows/
- [ ] CI/CD for testing
- [ ] Automated documentation builds
### 5. scripts/setup.sh
One-command setup script:
```bash
#!/bin/bash
# Complete setup for llm4hep
set -euo pipefail
# 1. Check prerequisites
command -v conda >/dev/null || { echo "conda is required" >&2; exit 1; }
# 2. Set up conda environment
conda env create -f environment.yml
# 3. Configure API keys
: "${CBORG_API_KEY:?export CBORG_API_KEY before running}"
# 4. Download reference data
bash scripts/fetch_solution_arrays.sh
# 5. Validate installation
python -c "import snakemake" && echo "Setup complete"
```
## πŸ” Code Quality Issues
### Fixed Issues:
1. **SLURM output path:** βœ… Fixed in `jobs/run_tests.sh` to use relative path `jobs/slurm/%j.out`
2. **Test file cleanup:** βœ… All temporary files removed and ignored
### Minor Issues Remaining:
1. **Commented-out code:** `test_models.sh` has `# source ~/.apikeys.sh` commented
- Should either uncomment or remove
2. **Inconsistent error handling:** Some scripts check for API key, others don't
- Not critical for initial release
3. **Hard-coded paths:** Several scripts have NERSC-specific paths
- Documented in README as institutional limitation
## βœ… Action Items Summary
**High Priority (blocking release):**
1. βœ… Add LICENSE file - **COMPLETED (MIT License)**
2. βœ… Document CBORG API access requirements clearly - **COMPLETED in README**
3. βœ… Fix/remove NERSC-specific paths - **DOCUMENTED as institutional limitation**
4. βœ… Clean up test files or add to .gitignore - **COMPLETED**
5. βœ… Add external data download instructions - **PARTIALLY DONE** (documented in README)
**Medium Priority (improve usability):**
6. βœ… Create config.example.yml with documentation - **COMPLETED**
7. βœ… Create models.example.txt - **COMPLETED**
8. [ ] Add quick-start example
9. [ ] Add CITATION.cff
10. [ ] Create setup script
11. [ ] Test environment.yml on fresh install
**Low Priority (nice to have):**
12. [ ] Add requirements.txt
13. [ ] Add Docker option
14. [ ] Add CI/CD
15. [ ] Add CONTRIBUTING.md
## 🎯 Minimal Viable Public Release
**Status: βœ… READY FOR PUBLIC RELEASE**
All minimal viable release requirements completed:
1. βœ… **LICENSE** - MIT License added with UC Berkeley copyright
2. βœ… **Updated README** - Comprehensive documentation with CBORG access notice and Prerequisites section
3. βœ… **Clean up** - testing_area/, temp files, and old SLURM outputs removed; .gitignore updated
4. βœ… **config.example.yml** and **models.example.txt** - Created with full documentation
5. βœ… **Data download instructions** - Documented in README with reference to ATLAS Open Data
**Additional improvements made:**
- βœ… Fixed SLURM output path in jobs/run_tests.sh
- βœ… Cleaned solution/ directory (removed arrays_incorrect/, results/, __pycache__/)
- βœ… Updated .gitignore comprehensively
- βœ… All critical paths and dependencies documented
**The repository is now ready to be made public with clear expectations and proper documentation.**