title: Agllm Public
emoji: 🦀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.28.3
app_file: app.py
pinned: false
license: apache-2.0

PestIDBot - Quick Reference

Environment

source ~/miniconda3/etc/profile.d/conda.sh && conda activate agllm-june-15

Key Commands

| Task | Command |
| --- | --- |
| Build DB | python app_database_prep.py |
| Run Eval | python retrieval_evaluation.py |
| Run App | python app.py |
| Deploy Dev | git push space3 fresh-start:main |
| Deploy Prod | git push space2 fresh-start:main |

Git Remotes

  • space2 → git@hf.co:spaces/arbabarshad/agllm2 (production)
  • space3 → git@hf.co:spaces/arbabarshad/agllm2-dev (dev)

Project Structure

├── app.py                      # Main Gradio app (deployed)
├── app_database_prep.py        # Builds ChromaDB from PDFs + Excel
├── retrieval_evaluation.py     # Runs 4-filter evaluation
├── retrieval_evaluation_results.json  # Eval metrics output
│
├── agllm-data/
│   ├── agllm-data-isu-field-insects-all-species/
│   │   ├── *.pdf               # Insect IPM documents
│   │   └── matched_species_results_v2.csv  # Species metadata
│   ├── agllm-data-isu-field-weeds-all-species/
│   │   ├── *.pdf               # Weed IPM documents
│   │   └── matched_species_results_v2.csv  # Species metadata
│   └── PestID Species.xlsx     # India & Africa data (sheets)
│
├── vector-databases-deployed/
│   └── db5-agllm-data-isu-field-insects-all-species/  # ChromaDB output
│
├── species-organized/          # Analysis scripts & outputs
│   ├── species_analysis.py     # Generates paper Figure 3
│   └── species_table.tex       # LaTeX species table
│
└── writing/                    # Paper drafts

Database Build Flow

US Data (80 species):

  1. PDFs loaded from agllm-data/.../raw-pdfs/ (content source)
  2. matched_species_results_v2.csv maps PDF filename → species name (metadata)

Africa/India Data (35 + 11 species):

  3. Excel species-organized/PestID Species - Organized.xlsx provides both content (IPM Info) and metadata

All Data:

  4. Documents chunked (512 tokens, 10 overlap)
  5. Tagged with matched_specie_X + region metadata
  6. Stored in ChromaDB at vector-databases-deployed/db5-*/
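
The chunking and tagging steps can be illustrated with a small sketch. This is not the code in app_database_prep.py: it assumes whitespace-separated words stand in for tokens, and the helper names chunk_tokens and tag_chunks, the "region" key, and the example values are hypothetical. The species key is passed in by the caller because the source only names it as matched_specie_X.

```python
from typing import Dict, List


def chunk_tokens(tokens: List[str], chunk_size: int = 512, overlap: int = 10) -> List[List[str]]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap  # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def tag_chunks(chunks: List[List[str]], species_key: str, species: str, region: str) -> List[Dict]:
    """Attach species and region metadata to every chunk (key names are assumptions)."""
    return [
        {"text": " ".join(chunk), "metadata": {species_key: species, "region": region}}
        for chunk in chunks
    ]


if __name__ == "__main__":
    # Stand-in document: 1200 dummy "tokens".
    tokens = [f"w{i}" for i in range(1200)]
    chunks = chunk_tokens(tokens)
    records = tag_chunks(chunks, "matched_specie_X", "example species", "US")
    print(len(records), [len(c) for c in chunks])
```

With a window of 512 and an overlap of 10, each chunk starts 502 tokens after the previous one, so the last 10 tokens of one chunk reappear at the start of the next.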

Evaluation Filters (retrieval_evaluation.py)

| Filter | P@5 | nDCG@5 |
| --- | --- | --- |
| No Filter | 0.82 | 0.72 |
| Species Only | 0.99 | 0.89 |
| Region Only | 0.83 | 0.73 |
| Species + Region | 1.00 | 0.90 |
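
For readers unfamiliar with the two metrics, here is a minimal sketch of how P@5 and nDCG@5 are commonly computed for a single query. It is an illustration under stated assumptions, not the logic of retrieval_evaluation.py: the function names are hypothetical, relevance is given as one score per retrieved result (0 meaning not relevant), and the ideal ranking is derived from the retrieved list itself. The script may use a different relevance definition or normalization, so the numbers above should not be expected to follow from this sketch.

```python
import math
from typing import Sequence


def precision_at_k(relevance: Sequence[float], k: int = 5) -> float:
    """Fraction of the top-k results with a relevance score above zero."""
    return sum(1 for rel in relevance[:k] if rel > 0) / k


def ndcg_at_k(relevance: Sequence[float], k: int = 5) -> float:
    """DCG of the top-k ranking divided by the DCG of the best possible ordering."""

    def dcg(scores: Sequence[float]) -> float:
        # Rank 0 is discounted by log2(2) = 1, rank 1 by log2(3), and so on.
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(scores))

    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # One query: results 1, 3 and 4 are relevant, results 2 and 5 are not.
    ranking = [1, 0, 1, 1, 0]
    print(precision_at_k(ranking), round(ndcg_at_k(ranking), 3))
```

Per-query values like these are typically averaged over all evaluation queries to produce one row of the table.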

Git LFS Troubleshooting Notes

This repository encountered several Git LFS issues during setup. Here's a summary for future reference:

  1. Missing LFS Objects in History: Initial pushes failed because the branch history contained references to LFS objects (specifically a11f8941... related to db5/.../data_level0.bin) that were no longer available locally or on the remote LFS store. Attempts to rewrite history using git filter-branch also failed because the rewrite process itself required fetching other missing LFS objects.

    • Resolution: We created a clean base branch (fresh-start) with no history (git checkout --orphan fresh-start), committed a placeholder file, and pushed it forcefully to the remote (git push -u space3 fresh-start:main --force). This reset the remote main branch.
  2. Importing State & Untracked Binaries: We copied the desired file state from the old branch (git checkout <old-branch> -- .) into the clean fresh-start branch. However, the subsequent push failed because some binary files (e.g., .png) were included but weren't tracked by LFS according to the .gitattributes file at that time.

    • Resolution:
      • Added the necessary file patterns (e.g., *.png filter=lfs ...) to .gitattributes.
      • Crucially, the commit itself had to be rebuilt so that the affected files were stored as LFS pointers. Amending the commit wasn't sufficient, so we used:
        # Reset the commit but keep files in working dir
        git reset HEAD~1
        # Re-stage files, forcing re-evaluation based on current .gitattributes
        git add --renormalize .
        # Commit the properly processed files
        git commit -m "Commit message"
        # Force push the corrected commit
        git push --force
        
  3. Ignoring Necessary Directories: A required directory (vector-databases-deployed) was unintentionally ignored via .gitignore.

    • Resolution:
      • Removed the corresponding line from .gitignore.
      • Staged the .gitignore file and the previously ignored directory (git add .gitignore vector-databases-deployed/).
      • Committed and pushed the changes.

Key Takeaways:

  • Pushing branches with problematic LFS history to a fresh remote can fail. Starting the remote with a clean, history-free branch is a workaround.
  • When adding LFS tracking for existing binary files via .gitattributes, ensure the commit correctly converts files to LFS pointers. git add --renormalize . after updating .gitattributes and before committing is often necessary.
  • If git add . leaves out expected files or directories, double-check .gitignore.
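
A pre-push check for the second issue (binaries not covered by .gitattributes) could look roughly like the sketch below. It is a simplified illustration, not a replacement for git check-attr or git lfs ls-files: the function names are hypothetical, and the matching rule (patterns without a slash are compared against the file name only) covers just the simplest gitattributes cases.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath
from typing import Iterable, List


def lfs_patterns(gitattributes_text: str) -> List[str]:
    """Return the path patterns whose attribute list includes filter=lfs."""
    patterns = []
    for line in gitattributes_text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        if "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns


def untracked_by_lfs(paths: Iterable[str], patterns: List[str]) -> List[str]:
    """List the paths that match none of the LFS patterns.

    Simplification: a pattern without '/' is matched against the file name,
    a pattern with '/' against the whole path. Real gitattributes matching
    has more rules than this.
    """
    missing = []
    for path in paths:
        name = PurePosixPath(path).name
        if not any(fnmatch(path if "/" in pat else name, pat) for pat in patterns):
            missing.append(path)
    return missing


if __name__ == "__main__":
    # Example .gitattributes content and candidate files (both made up).
    attrs = "*.bin filter=lfs diff=lfs merge=lfs -text\n"
    files = ["db/data_level0.bin", "figures/plot.png"]
    print(untracked_by_lfs(files, lfs_patterns(attrs)))
```

Any path the sketch reports is a candidate for a new pattern in .gitattributes, followed by git add --renormalize . as described above.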