title: Agllm Public
emoji: π¦
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.28.3
app_file: app.py
pinned: false
license: apache-2.0
PestIDBot - Quick Reference
Environment
source ~/miniconda3/etc/profile.d/conda.sh && conda activate agllm-june-15
Key Commands
| Task | Command |
|---|---|
| Build DB | python app_database_prep.py |
| Run Eval | python retrieval_evaluation.py |
| Run App | python app.py |
| Deploy Dev | git push space3 fresh-start:main |
| Deploy Prod | git push space2 fresh-start:main |
Git Remotes
- space2 → git@hf.co:spaces/arbabarshad/agllm2 (production)
- space3 → git@hf.co:spaces/arbabarshad/agllm2-dev (dev)
Project Structure
├── app.py                                  # Main Gradio app (deployed)
├── app_database_prep.py                    # Builds ChromaDB from PDFs + Excel
├── retrieval_evaluation.py                 # Runs 4-filter evaluation
├── retrieval_evaluation_results.json       # Eval metrics output
│
├── agllm-data/
│   ├── agllm-data-isu-field-insects-all-species/
│   │   ├── *.pdf                           # Insect IPM documents
│   │   └── matched_species_results_v2.csv  # Species metadata
│   ├── agllm-data-isu-field-weeds-all-species/
│   │   ├── *.pdf                           # Weed IPM documents
│   │   └── matched_species_results_v2.csv  # Species metadata
│   └── PestID Species.xlsx                 # India & Africa data (sheets)
│
├── vector-databases-deployed/
│   └── db5-agllm-data-isu-field-insects-all-species/  # ChromaDB output
│
├── species-organized/                      # Analysis scripts & outputs
│   ├── species_analysis.py                 # Generates paper Figure 3
│   └── species_table.tex                   # LaTeX species table
│
└── writing/                                # Paper drafts
Database Build Flow
US Data (80 species):
1. PDFs loaded from agllm-data/.../raw-pdfs/ (content source)
2. matched_species_results_v2.csv maps PDF filename → species name (metadata)
Africa/India Data (35 + 11 species):
3. Excel species-organized/PestID Species - Organized.xlsx provides both content (IPM Info) and metadata
All Data:
4. Documents chunked (512 tokens, 10 overlap)
5. Tagged with matched_specie_X + region metadata
6. Stored in ChromaDB at vector-databases-deployed/db5-*/
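The chunking and tagging steps above can be sketched roughly as follows. This is a minimal illustration, not the code in app_database_prep.py: the names `Chunk`, `chunk_tokens`, and `build_chunks` are hypothetical, whitespace-separated words stand in for whatever tokenizer the real pipeline uses, the metadata keys `matched_specie_0` and `region` are assumed from the `matched_specie_X` naming, and the sketch stops short of writing anything to ChromaDB.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A piece of document text plus the metadata used for filtering."""
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_tokens(tokens, chunk_size=512, overlap=10):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return windows


def build_chunks(text, species, region, chunk_size=512, overlap=10):
    """Chunk one document and attach species and region metadata to each chunk."""
    # Assumption: whitespace splitting approximates the real tokenizer.
    tokens = text.split()
    return [
        # Assumption: "matched_specie_0" and "region" are illustrative key names.
        Chunk(" ".join(window), {"matched_specie_0": species, "region": region})
        for window in chunk_tokens(tokens, chunk_size, overlap)
    ]
```

With these defaults, consecutive chunks share their boundary tokens, so a sentence that straddles a chunk boundary is still partly visible in both neighbours.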
Evaluation Filters (retrieval_evaluation.py)
| Filter | P@5 | nDCG@5 |
|---|---|---|
| No Filter | 0.82 | 0.72 |
| Species Only | 0.99 | 0.89 |
| Region Only | 0.83 | 0.73 |
| Species + Region | 1.00 | 0.90 |
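For readers unfamiliar with the two metrics in the table, the sketch below shows one common way to compute them for a single query. It is an assumption-laden illustration rather than the code in retrieval_evaluation.py: the function names are hypothetical, relevance is taken to be a list of 0/1 judgments in ranked order, and the ideal ranking for nDCG is derived from the retrieved list itself (some evaluations instead use every relevant document in the corpus).

```python
import math


def precision_at_k(relevance, k=5):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for rel in relevance[:k] if rel > 0) / k


def dcg_at_k(relevance, k=5):
    """Discounted cumulative gain: each hit is discounted by log2 of its rank."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevance[:k]))


def ndcg_at_k(relevance, k=5):
    """DCG normalized by the DCG of the best possible ordering of the same list."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0


# Example: hits at ranks 1, 3, and 4 out of the top 5.
judgments = [1, 0, 1, 1, 0]
print(precision_at_k(judgments), ndcg_at_k(judgments))
```

Averaging these per-query values over the whole query set would give figures comparable in kind to the P@5 and nDCG@5 columns.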
Git LFS Troubleshooting Notes
This repository encountered several Git LFS issues during setup. Here's a summary for future reference:
Missing LFS Objects in History: Initial pushes failed because the branch history contained references to LFS objects (specifically `a11f8941...`, related to `db5/.../data_level0.bin`) that were no longer available locally or on the remote LFS store. Attempts to rewrite history using `git filter-branch` also failed because the rewrite process itself required fetching other missing LFS objects.
- Resolution: We created a clean base branch (`fresh-start`) with no history (`git checkout --orphan fresh-start`), committed a placeholder file, and force-pushed it to the remote (`git push -u space3 fresh-start:main --force`). This reset the remote `main` branch.
Importing State & Untracked Binaries: We copied the desired file state from the old branch (`git checkout <old-branch> -- .`) into the clean `fresh-start` branch. However, the subsequent push failed because some binary files (e.g., `.png`) were included but weren't tracked by LFS according to the `.gitattributes` file at that time.
- Resolution:
  - Added the necessary file patterns (e.g., `*.png filter=lfs ...`) to `.gitattributes`.
  - Crucially, we had to ensure the commit correctly reflected this change. Amending wasn't sufficient. We used:

        # Reset the commit but keep files in working dir
        git reset HEAD~1
        # Re-stage files, forcing re-evaluation based on current .gitattributes
        git add --renormalize .
        # Commit the properly processed files
        git commit -m "Commit message"
        # Force push the corrected commit
        git push --force
Ignoring Necessary Directories: A required directory (`vector-databases-deployed`) was unintentionally ignored via `.gitignore`.
- Resolution:
  - Removed the corresponding line from `.gitignore`.
  - Staged the `.gitignore` file and the previously ignored directory (`git add .gitignore vector-databases-deployed/`).
  - Committed and pushed the changes.
Key Takeaways:
- Pushing branches with problematic LFS history to a fresh remote can fail. Starting the remote with a clean, history-free branch is a workaround.
- When adding LFS tracking for existing binary files via `.gitattributes`, ensure the commit correctly converts files to LFS pointers. Running `git add --renormalize .` after updating `.gitattributes` and before committing is often necessary.
- Double-check `.gitignore` if expected files or directories are missing after a `git add .`.
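A pre-push check along the lines of the second takeaway might look like the sketch below. It is a hypothetical helper, not part of this repository: `lfs_patterns` and `untracked_binaries` are illustrative names, the list of binary suffixes is an assumption, and `fnmatch` only approximates Git's own attribute pattern matching, so treat the result as a hint rather than a guarantee.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def lfs_patterns(gitattributes_text):
    """Return the path patterns on .gitattributes lines that set filter=lfs."""
    patterns = []
    for line in gitattributes_text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        if "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns


def untracked_binaries(paths, gitattributes_text, binary_suffixes=(".png", ".bin")):
    """List binary-looking paths that no LFS pattern appears to cover."""
    patterns = lfs_patterns(gitattributes_text)
    missing = []
    for path in paths:
        name = PurePosixPath(path).name
        if not name.lower().endswith(binary_suffixes):
            continue
        # Approximation: match against both the file name and the full path.
        if not any(fnmatch(name, p) or fnmatch(path, p) for p in patterns):
            missing.append(path)
    return missing
```

Feeding it the output of `git ls-files` and the contents of `.gitattributes` before committing would flag files such as a `.png` that is staged but not yet covered by an LFS pattern.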