# LA4SR TI-inclusive algaGPT Distribution

This distribution contains the TI-inclusive algaGPT model packaged for easy use with Singularity, along with datasets, inference scripts, and evaluation tools.

## Directory Structure

* **`la4sr_sp2.sif`** (6.8 GB): Singularity container with the complete computational environment.
* **Model Files:**
  * `ckpt.pt`: Model checkpoint file (990 MB).
  * `meta.pkl`: Metadata for the model.
  * `model.py`: Python script defining the model architecture.
* **Datasets:**
  * FASTA files (`generated_prompts_*_headed.fa`) for various taxa (algae, archaea, bacteria, fungi, viruses).
* **Scripts:**
  * `run_la4sr_TI-inc-algaGPT.sh`: Main inference and metrics generation script.
  * `run_la4sr_loop.sbatch`: SLURM batch script to run multiple inference jobs.
  * `infer_TI-inc-algaGPT.py`: Python inference script.
  * `llm-metrics-two-files.py`: Generates classification metrics and visualizations.
* **Utility Files:**
  * `filelist.txt`, `contam-filelist.txt`, `algae-filelist.txt`: Lists of FASTA files to analyze.
  * `la4sr_sp2.sif.md5`: Checksum for container verification.

### Subdirectories

* **`cache/`**: Cache directory for Hugging Face models and tokenizers.
* **`results-archive/`**: Archived results from previous runs.
* **`algaGPT_fungi-algae2x-update_cleaned/`**: Fine-tuned algaGPT variant optimized for fungi.
* **`TI-free-la4sr/`**: Pythia-based TI-free flagship model.
* **`slurm-logs/`**: SLURM job output logs.
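
Before first use, it may be worth confirming that the 6.8 GB container transferred intact, using the `la4sr_sp2.sif.md5` file listed under Utility Files. The sketch below assumes that file is in the standard `md5sum` format (hash, two spaces, file name) and names the image by a path relative to its own directory. The `verify_container` helper is illustrative and not part of the distribution.

```bash
# Hypothetical helper: check a file against an md5sum-format checksum file.
# Assumption: the checksum file refers to the image by a relative file name.
verify_container() {
    local checksum_file="${1:-la4sr_sp2.sif.md5}"
    # Run in a subshell from the checksum file's directory so that the
    # relative file name inside it resolves, without changing the caller's cwd.
    ( cd "$(dirname "$checksum_file")" && md5sum -c "$(basename "$checksum_file")" )
}
```

Calling `verify_container la4sr_sp2.sif.md5` returns a nonzero exit status if the hash does not match, so it can gate later steps in a script.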

## Quick Start

1. **Setup:** Ensure Singularity is installed on your HPC or local system.
2. **Inference (no scheduler):**
   ```bash
   ./run_la4sr_TI-inc-algaGPT.sh resume <algal_fasta> <contaminant_fasta>
   ```
   Replace `<algal_fasta>` and `<contaminant_fasta>` with your FASTA file paths.
3. **Inference (with SLURM scheduler):**
   * Update `algae-filelist.txt` and `contam-filelist.txt` with paths to your FASTA files.
   * Submit the SLURM job array:
     ```bash
     sbatch run_la4sr_loop.sbatch
     ```
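
The file lists used in the SLURM step can be generated rather than edited by hand. The following is a minimal sketch, assuming each list is plain text with one FASTA path per line; the exact format, and whether the two lists are paired line by line, is determined by `run_la4sr_loop.sbatch` and should be checked there. The `make_filelist` helper and its arguments are hypothetical.

```bash
# Hypothetical helper: write the sorted paths of matching FASTA files,
# one per line, to a list file.
# Assumption: the list files are plain text with one FASTA path per line.
make_filelist() {
    local fasta_dir="$1" pattern="$2" out_file="$3"
    # -maxdepth 1 restricts the search to files directly inside fasta_dir.
    find "$fasta_dir" -maxdepth 1 -type f -name "$pattern" | sort > "$out_file"
}
```

For example, `make_filelist ./algal_fastas '*.fa' algae-filelist.txt` would list every `.fa` file directly under an `./algal_fastas` directory (a placeholder name).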

## Outputs

Results, including TSV files, metrics reports, misclassification reports, and visualizations, are stored in the `results/` directory.

## Additional Information

* To manually run the inference script:
  ```bash
  singularity exec --nv la4sr_sp2.sif python3 infer_TI-inc-algaGPT.py --init_from resume input.fasta -o output.tsv
  ```
* To generate metrics independently:
  ```bash
  singularity exec la4sr_sp2.sif python3 llm-metrics-two-files.py algal_results.tsv contaminant_results.tsv -o metrics_report.txt -m misclassified_report.txt -v
  ```
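
The two manual commands above can be chained for a single algal/contaminant pair. The sketch below is not a reproduction of `run_la4sr_TI-inc-algaGPT.sh`; it assumes the TSV written by the inference script is the input the metrics script expects, and the `run_pair` function, the `DRY_RUN` variable, and the output file names are all hypothetical.

```bash
# Hypothetical wrapper chaining the two manual commands for one FASTA pair.
# Set DRY_RUN=1 to print the commands instead of executing them.
run_pair() {
    local algal_fasta="$1" contam_fasta="$2" out_dir="${3:-results}"
    local sif="la4sr_sp2.sif"
    local prefix=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        prefix="echo"
    else
        mkdir -p "$out_dir"
    fi
    # Step 1: inference on each FASTA file (--nv enables NVIDIA GPU support).
    # $prefix is intentionally unquoted so an empty value expands to nothing.
    $prefix singularity exec --nv "$sif" python3 infer_TI-inc-algaGPT.py \
        --init_from resume "$algal_fasta" -o "$out_dir/algal_results.tsv" || return 1
    $prefix singularity exec --nv "$sif" python3 infer_TI-inc-algaGPT.py \
        --init_from resume "$contam_fasta" -o "$out_dir/contaminant_results.tsv" || return 1
    # Step 2: metrics and misclassification reports from the two TSV files.
    $prefix singularity exec "$sif" python3 llm-metrics-two-files.py \
        "$out_dir/algal_results.tsv" "$out_dir/contaminant_results.tsv" \
        -o "$out_dir/metrics_report.txt" -m "$out_dir/misclassified_report.txt" -v
}

# Example (placeholder file names), previewing the commands without running them:
#   DRY_RUN=1 run_pair algae.fa contaminants.fa results
```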

---

For further assistance, contact the maintainers.

The algaGPT model was trained by Kourosh Salehi-Ashtiani, Ph.D., with help in training data preparation from Ashish Kumar Jaiswal, M.Sc.