| title: Agllm Public | |
| emoji: π¦ | |
| colorFrom: blue | |
| colorTo: purple | |
| sdk: gradio | |
| sdk_version: 4.28.3 | |
| app_file: app.py | |
| pinned: false | |
| license: apache-2.0 | |
| ## PestIDBot - Quick Reference | |
| ### Environment | |
| ```bash | |
| source ~/miniconda3/etc/profile.d/conda.sh && conda activate agllm-june-15 | |
| ``` | |
| ### Key Commands | |
| | Task | Command | | |
| |------|---------| | |
| | Build DB | `python app_database_prep.py` | | |
| | Run Eval | `python retrieval_evaluation.py` | | |
| | Run App | `python app.py` | | |
| | Deploy Dev | `git push space3 fresh-start:main` | | |
| | Deploy Prod | `git push space2 fresh-start:main` | | |
| ### Git Remotes | |
| - `space2` → `git@hf.co:spaces/arbabarshad/agllm2` (production) | |
| - `space3` → `git@hf.co:spaces/arbabarshad/agllm2-dev` (dev) | |
| ### Project Structure | |
| ``` | |
| ├── app.py # Main Gradio app (deployed) | |
| ├── app_database_prep.py # Builds ChromaDB from PDFs + Excel | |
| ├── retrieval_evaluation.py # Runs 4-filter evaluation | |
| ├── retrieval_evaluation_results.json # Eval metrics output | |
| │ | |
| ├── agllm-data/ | |
| │   ├── agllm-data-isu-field-insects-all-species/ | |
| │   │   ├── *.pdf # Insect IPM documents | |
| │   │   └── matched_species_results_v2.csv # Species metadata | |
| │   ├── agllm-data-isu-field-weeds-all-species/ | |
| │   │   ├── *.pdf # Weed IPM documents | |
| │   │   └── matched_species_results_v2.csv # Species metadata | |
| │   └── PestID Species.xlsx # India & Africa data (sheets) | |
| │ | |
| ├── vector-databases-deployed/ | |
| │   └── db5-agllm-data-isu-field-insects-all-species/ # ChromaDB output | |
| │ | |
| ├── species-organized/ # Analysis scripts & outputs | |
| │   ├── species_analysis.py # Generates paper Figure 3 | |
| │   └── species_table.tex # LaTeX species table | |
| │ | |
| └── writing/ # Paper drafts | |
| ``` | |
| ### Database Build Flow | |
| **US Data (80 species):** | |
| 1. PDFs loaded from `agllm-data/.../raw-pdfs/` (content source) | |
| 2. `matched_species_results_v2.csv` maps PDF filename → species name (metadata) | |
| **Africa/India Data (35 + 11 species):** | |
| 3. Excel `species-organized/PestID Species - Organized.xlsx` provides both content (IPM Info) and metadata | |
| **All Data:** | |
| 4. Documents are chunked (512 tokens, 10-token overlap) | |
| 5. Tagged with `matched_specie_X` + `region` metadata | |
| 6. Stored in ChromaDB at `vector-databases-deployed/db5-*/` | |
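
The chunking step can be illustrated with a minimal sliding-window sketch. This is not the code in `app_database_prep.py`: the function name `chunk_tokens` is hypothetical, and tokens are stood in for by a plain list, since the tokenizer and splitter the build script actually uses are not shown here. It assumes the overlap of 10 is measured in the same units as the chunk size.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=10):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens.

    Hypothetical helper for illustration; the real build script may use a
    library splitter with different boundary rules.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each new window starts this many tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


# Example: a 1200-token document yields two full chunks and a shorter tail.
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With a 10-token overlap, the last 10 tokens of one chunk are repeated at the start of the next, so a sentence that straddles a boundary is not lost entirely from either chunk.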
| ### Evaluation Filters (retrieval_evaluation.py) | |
| | Filter | P@5 | nDCG@5 | | |
| |--------|-----|--------| | |
| | No Filter | 0.82 | 0.72 | | |
| | Species Only | 0.99 | 0.89 | | |
| | Region Only | 0.83 | 0.73 | | |
| | Species + Region | **1.00** | **0.90** | | |
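
For readers unfamiliar with the two metrics, here is a minimal sketch of P@5 and nDCG@5 under one common convention: relevance is a per-result score (binary in the example), and the ideal ranking is taken over the retrieved list itself. The function names are hypothetical, and `retrieval_evaluation.py` may use graded relevance or a different normalization, so this sketch is not expected to reproduce the table above.

```python
import math


def precision_at_k(relevances, k=5):
    """Fraction of the top-k results that are relevant (relevance > 0)."""
    return sum(1 for r in relevances[:k] if r > 0) / k


def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain: each result's gain is divided by log2(rank + 1)."""
    return sum(r / math.log2(rank + 1)
               for rank, r in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k=5):
    """DCG normalized by the DCG of the same results sorted best-first."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Example: relevance of the top five retrieved chunks, in ranked order.
ranked = [1, 0, 1, 1, 0]
print(precision_at_k(ranked))
print(round(ndcg_at_k(ranked), 3))
```

P@5 only counts how many of the top five results are relevant, while nDCG@5 also rewards placing relevant results earlier in the ranking.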
| --- | |
| ## Git LFS Troubleshooting Notes | |
| This repository encountered several Git LFS issues during setup. Here's a summary for future reference: | |
| 1. **Missing LFS Objects in History:** Initial pushes failed because the branch history contained references to LFS objects (specifically `a11f8941...` related to `db5/.../data_level0.bin`) that were no longer available locally or on the remote LFS store. Attempts to rewrite history using `git filter-branch` also failed because the rewrite process itself required fetching *other* missing LFS objects. | |
| * **Resolution:** We created a clean base branch (`fresh-start`) with no history (`git checkout --orphan fresh-start`), committed a placeholder file, and force-pushed it to the remote (`git push -u space3 fresh-start:main --force`). This reset the remote `main` branch. | |
| 2. **Importing State & Untracked Binaries:** We copied the desired file state from the old branch (`git checkout <old-branch> -- .`) into the clean `fresh-start` branch. However, the subsequent push failed because some binary files (e.g., `.png`) were included but weren't tracked by LFS according to the `.gitattributes` file *at that time*. | |
| * **Resolution:** | |
| * Added the necessary file patterns (e.g., `*.png filter=lfs ...`) to `.gitattributes`. | |
| * Amending the existing commit was not sufficient; the commit had to be rebuilt so that the files were re-staged under the updated `.gitattributes`. We used: | |
| ```bash | |
| # Reset the commit but keep files in working dir | |
| git reset HEAD~1 | |
| # Re-stage files, forcing re-evaluation based on current .gitattributes | |
| git add --renormalize . | |
| # Commit the properly processed files | |
| git commit -m "Commit message" | |
| # Force push the corrected commit | |
| git push --force | |
| ``` | |
| 3. **Ignoring Necessary Directories:** A required directory (`vector-databases-deployed`) was unintentionally ignored via `.gitignore`. | |
| * **Resolution:** | |
| * Removed the corresponding line from `.gitignore`. | |
| * Staged the `.gitignore` file and the previously ignored directory (`git add .gitignore vector-databases-deployed/`). | |
| * Committed and pushed the changes. | |
| **Key Takeaways:** | |
| * Pushing branches with problematic LFS history to a fresh remote can fail. Starting the remote with a clean, history-free branch is a workaround. | |
| * When adding LFS tracking for existing binary files via `.gitattributes`, ensure the commit correctly converts files to LFS pointers. `git add --renormalize .` after updating `.gitattributes` and *before* committing is often necessary. | |
| * Double-check `.gitignore` if expected files or directories are missing after a `git add .`. | |
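
As a pre-commit sanity check for the second takeaway, the sketch below lists binary files that no LFS pattern in `.gitattributes` would cover. It is a simplification under stated assumptions: the helper names and the list of binary suffixes are hypothetical, and matching is done on the file name with `fnmatch`, which handles simple patterns such as `*.png` but not the full gitattributes pattern rules. Running `git check-attr filter -- <path>` remains the authoritative check.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def lfs_patterns(gitattributes_text):
    """Return the path patterns whose attributes include filter=lfs."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attrs = line.split()
        if "filter=lfs" in attrs:
            patterns.append(pattern)
    return patterns


def untracked_binaries(paths, patterns, binary_suffixes=(".png", ".bin", ".pdf")):
    """List paths that look binary (by suffix) but match no LFS pattern."""
    missing = []
    for path in paths:
        name = PurePosixPath(path).name
        is_binary = name.lower().endswith(binary_suffixes)
        if is_binary and not any(fnmatch(name, p) for p in patterns):
            missing.append(path)
    return missing


# Example: only *.bin is tracked by LFS, so the PNG is reported.
attrs = "*.bin filter=lfs diff=lfs merge=lfs -text\n# comment\n*.txt text\n"
paths = ["db/data_level0.bin", "figures/plot.png", "app.py"]
print(untracked_binaries(paths, lfs_patterns(attrs)))
```

Any path this reports needs a matching pattern in `.gitattributes`, followed by `git add --renormalize .`, before the commit is created.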