---
title: Agllm Public
emoji: 🦀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.28.3
app_file: app.py
pinned: false
license: apache-2.0
---

## PestIDBot - Quick Reference

### Environment

```bash
source ~/miniconda3/etc/profile.d/conda.sh && conda activate agllm-june-15
```

### Key Commands

| Task | Command |
|------|---------|
| Build DB | `python app_database_prep.py` |
| Run Eval | `python retrieval_evaluation.py` |
| Run App | `python app.py` |
| Deploy Dev | `git push space3 fresh-start:main` |
| Deploy Prod | `git push space2 fresh-start:main` |

### Git Remotes

- `space2` → `git@hf.co:spaces/arbabarshad/agllm2` (production)
- `space3` → `git@hf.co:spaces/arbabarshad/agllm2-dev` (dev)

### Project Structure

```
├── app.py                              # Main Gradio app (deployed)
├── app_database_prep.py                # Builds ChromaDB from PDFs + Excel
├── retrieval_evaluation.py             # Runs 4-filter evaluation
├── retrieval_evaluation_results.json   # Eval metrics output
│
├── agllm-data/
│   ├── agllm-data-isu-field-insects-all-species/
│   │   ├── *.pdf                               # Insect IPM documents
│   │   └── matched_species_results_v2.csv      # Species metadata
│   ├── agllm-data-isu-field-weeds-all-species/
│   │   ├── *.pdf                               # Weed IPM documents
│   │   └── matched_species_results_v2.csv      # Species metadata
│   └── PestID Species.xlsx                     # India & Africa data (sheets)
│
├── vector-databases-deployed/
│   └── db5-agllm-data-isu-field-insects-all-species/  # ChromaDB output
│
├── species-organized/                  # Analysis scripts & outputs
│   ├── species_analysis.py             # Generates paper Figure 3
│   └── species_table.tex               # LaTeX species table
│
└── writing/                            # Paper drafts
```

### Database Build Flow

**US Data (80 species):**

1. PDFs loaded from `agllm-data/.../raw-pdfs/` (content source)
2. `matched_species_results_v2.csv` maps PDF filename → species name (metadata)

**Africa/India Data (35 + 11 species):**

3. Excel `species-organized/PestID Species - Organized.xlsx` provides both content (IPM Info) and metadata

**All Data:**

4. Documents chunked (512 tokens, 10 overlap)
5. Tagged with `matched_specie_X` + `region` metadata
6. Stored in ChromaDB at `vector-databases-deployed/db5-*/`

### Evaluation Filters (retrieval_evaluation.py)

| Filter | P@5 | nDCG@5 |
|--------|-----|--------|
| No Filter | 0.82 | 0.72 |
| Species Only | 0.99 | 0.89 |
| Region Only | 0.83 | 0.73 |
| Species + Region | **1.00** | **0.90** |

---

## Git LFS Troubleshooting Notes

This repository encountered several Git LFS issues during setup. Here's a summary for future reference:

1. **Missing LFS Objects in History:** Initial pushes failed because the branch history contained references to LFS objects (specifically `a11f8941...` related to `db5/.../data_level0.bin`) that were no longer available locally or on the remote LFS store. Attempts to rewrite history using `git filter-branch` also failed because the rewrite process itself required fetching *other* missing LFS objects.
    * **Resolution:** We created a clean base branch (`fresh-start`) with no history (`git checkout --orphan fresh-start`), committed a placeholder file, and pushed it forcefully to the remote (`git push -u space3 fresh-start:main --force`). This reset the remote `main` branch.

2. **Importing State & Untracked Binaries:** We copied the desired file state from the old branch (`git checkout -- .`) into the clean `fresh-start` branch. However, the subsequent push failed because some binary files (e.g., `.png`) were included but weren't tracked by LFS according to the `.gitattributes` file *at that time*.
    * **Resolution:**
        * Added the necessary file patterns (e.g., `*.png filter=lfs ...`) to `.gitattributes`.
        * Crucially, we had to ensure the commit correctly reflected this change. Amending wasn't sufficient. We used:
            ```bash
            # Reset the commit but keep files in working dir
            git reset HEAD~1
            # Re-stage files, forcing re-evaluation based on current .gitattributes
            git add --renormalize .
            # Commit the properly processed files
            git commit -m "Commit message"
            # Force push the corrected commit
            git push --force
            ```

3. **Ignoring Necessary Directories:** A required directory (`vector-databases-deployed`) was unintentionally ignored via `.gitignore`.
    * **Resolution:**
        * Removed the corresponding line from `.gitignore`.
        * Staged the `.gitignore` file and the previously ignored directory (`git add .gitignore vector-databases-deployed/`).
        * Committed and pushed the changes.

**Key Takeaways:**

* Pushing branches with problematic LFS history to a fresh remote can fail. Starting the remote with a clean, history-free branch is a workaround.
* When adding LFS tracking for existing binary files via `.gitattributes`, ensure the commit correctly converts files to LFS pointers. `git add --renormalize .` after updating `.gitattributes` and *before* committing is often necessary.
* Double-check `.gitignore` if expected files or directories are missing after a `git add .`.
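
One way to confirm that a commit really stores LFS pointers, as the second takeaway recommends, is to inspect the committed blob instead of the working-tree file. The following is a minimal sketch, not part of this repository: the helper names `is_lfs_pointer` and `committed_blob` are hypothetical, and the check relies only on the Git LFS v1 pointer format (a small text file whose first line is `version https://git-lfs.github.com/spec/v1`, with `oid sha256:` and `size` lines).

```python
import re
import subprocess

# First line of every Git LFS v1 pointer file.
LFS_VERSION_LINE = b"version https://git-lfs.github.com/spec/v1"
OID_RE = re.compile(rb"^oid sha256:[0-9a-f]{64}$", re.MULTILINE)
SIZE_RE = re.compile(rb"^size \d+$", re.MULTILINE)


def is_lfs_pointer(blob: bytes) -> bool:
    """Return True if `blob` looks like a Git LFS pointer, not raw file content."""
    # Pointer files are tiny text files; anything this large is real content.
    if len(blob) >= 1024:
        return False
    if not blob.startswith(LFS_VERSION_LINE + b"\n"):
        return False
    return bool(OID_RE.search(blob)) and bool(SIZE_RE.search(blob))


def committed_blob(path: str, rev: str = "HEAD") -> bytes:
    """Hypothetical helper: read the blob exactly as stored in the commit.

    `git show <rev>:<path>` prints the stored blob without running the
    LFS smudge filter, so a correctly tracked file shows up as a pointer.
    """
    result = subprocess.run(
        ["git", "show", f"{rev}:{path}"],
        capture_output=True,
        check=True,
    )
    return result.stdout
```

For example, after `git add --renormalize .` and a commit, `is_lfs_pointer(committed_blob("some/image.png"))` (the path is a placeholder) should return `True` for any file matched by an LFS pattern in `.gitattributes`. A `False` result suggests the raw binary was committed and the push may fail again.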