---
title: Agllm Public
emoji: 🦀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.28.3
app_file: app.py
pinned: false
license: apache-2.0
---
## PestIDBot - Quick Reference
### Environment
```bash
source ~/miniconda3/etc/profile.d/conda.sh && conda activate agllm-june-15
```
### Key Commands
| Task | Command |
|------|---------|
| Build DB | `python app_database_prep.py` |
| Run Eval | `python retrieval_evaluation.py` |
| Run App | `python app.py` |
| Deploy Dev | `git push space3 fresh-start:main` |
| Deploy Prod | `git push space2 fresh-start:main` |
### Git Remotes
- `space2` → `git@hf.co:spaces/arbabarshad/agllm2` (production)
- `space3` → `git@hf.co:spaces/arbabarshad/agllm2-dev` (dev)
### Project Structure
```
├── app.py                              # Main Gradio app (deployed)
├── app_database_prep.py                # Builds ChromaDB from PDFs + Excel
├── retrieval_evaluation.py             # Runs 4-filter evaluation
├── retrieval_evaluation_results.json   # Eval metrics output
│
├── agllm-data/
│   ├── agllm-data-isu-field-insects-all-species/
│   │   ├── *.pdf                               # Insect IPM documents
│   │   └── matched_species_results_v2.csv      # Species metadata
│   ├── agllm-data-isu-field-weeds-all-species/
│   │   ├── *.pdf                               # Weed IPM documents
│   │   └── matched_species_results_v2.csv      # Species metadata
│   └── PestID Species.xlsx                     # India & Africa data (sheets)
│
├── vector-databases-deployed/
│   └── db5-agllm-data-isu-field-insects-all-species/   # ChromaDB output
│
├── species-organized/                  # Analysis scripts & outputs
│   ├── species_analysis.py             # Generates paper Figure 3
│   └── species_table.tex               # LaTeX species table
│
└── writing/                            # Paper drafts
```
### Database Build Flow
**US Data (80 species):**
1. PDFs loaded from `agllm-data/.../raw-pdfs/` (content source)
2. `matched_species_results_v2.csv` maps PDF filename → species name (metadata)
**Africa/India Data (35 + 11 species):**
3. Excel `species-organized/PestID Species - Organized.xlsx` provides both content (IPM Info) and metadata
**All Data:**
4. Documents are chunked (512 tokens per chunk, 10-token overlap)
5. Tagged with `matched_specie_X` + `region` metadata
6. Stored in ChromaDB at `vector-databases-deployed/db5-*/`
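
Steps 4 and 5 can be sketched roughly as follows. This is an illustration only, not the code in `app_database_prep.py`: whitespace-separated words stand in for tokens (the real tokenizer is not documented here), `chunk_tokens` and `tag_chunks` are hypothetical helper names, and the metadata key `matched_specie_0` assumes one species per document under the `matched_specie_X` pattern.

```python
from typing import Dict, List


def chunk_tokens(tokens: List[str], size: int = 512, overlap: int = 10) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def tag_chunks(text: str, species: str, region: str,
               size: int = 512, overlap: int = 10) -> List[Dict]:
    """Chunk a document and attach species/region metadata to every chunk."""
    tokens = text.split()  # assumption: whitespace words approximate tokens
    return [
        {
            "text": " ".join(chunk),
            "metadata": {"matched_specie_0": species, "region": region},
        }
        for chunk in chunk_tokens(tokens, size, overlap)
    ]
```

With the default settings each window starts 502 tokens after the previous one, so consecutive chunks share 10 tokens.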
### Evaluation Filters (retrieval_evaluation.py)
| Filter | P@5 | nDCG@5 |
|--------|-----|--------|
| No Filter | 0.82 | 0.72 |
| Species Only | 0.99 | 0.89 |
| Region Only | 0.83 | 0.73 |
| Species + Region | **1.00** | **0.90** |
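
For reference, the two metrics in the table follow the standard definitions sketched below. The sketch assumes binary relevance labels per rank and computes the ideal ranking from the retrieved list itself; `retrieval_evaluation.py` may use graded relevance or a different ideal ranking, so treat the function names and labeling scheme as illustrative rather than as the script's implementation.

```python
import math
from typing import Sequence


def precision_at_k(relevance: Sequence[int], k: int = 5) -> float:
    """Fraction of the top-k results that are relevant (one 0/1 label per rank)."""
    return sum(relevance[:k]) / k


def ndcg_at_k(relevance: Sequence[int], k: int = 5) -> float:
    """DCG of the top-k ranking divided by the DCG of the best ordering of the same labels."""
    def dcg(labels: Sequence[int]) -> float:
        # Rank 0 gets discount log2(2) = 1, rank 1 gets log2(3), and so on.
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(labels))

    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal > 0 else 0.0
```
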
---
## Git LFS Troubleshooting Notes
This repository encountered several Git LFS issues during setup. Here's a summary for future reference:
1. **Missing LFS Objects in History:** Initial pushes failed because the branch history contained references to LFS objects (specifically `a11f8941...` related to `db5/.../data_level0.bin`) that were no longer available locally or on the remote LFS store. Attempts to rewrite history using `git filter-branch` also failed because the rewrite process itself required fetching *other* missing LFS objects.
    * **Resolution:** We created a clean base branch (`fresh-start`) with no history (`git checkout --orphan fresh-start`), committed a placeholder file, and pushed it forcefully to the remote (`git push -u space3 fresh-start:main --force`). This reset the remote `main` branch.
2. **Importing State & Untracked Binaries:** We copied the desired file state from the old branch (`git checkout <old-branch> -- .`) into the clean `fresh-start` branch. However, the subsequent push failed because some binary files (e.g., `.png`) were included but weren't tracked by LFS according to the `.gitattributes` file *at that time*.
    * **Resolution:**
        * Added the necessary file patterns (e.g., `*.png filter=lfs ...`) to `.gitattributes`.
        * Crucially, we had to ensure the commit correctly reflected this change. Amending wasn't sufficient. We used:
          ```bash
          # Reset the commit but keep files in working dir
          git reset HEAD~1
          # Re-stage files, forcing re-evaluation based on current .gitattributes
          git add --renormalize .
          # Commit the properly processed files
          git commit -m "Commit message"
          # Force push the corrected commit
          git push --force
          ```
3. **Ignoring Necessary Directories:** A required directory (`vector-databases-deployed`) was unintentionally ignored via `.gitignore`.
    * **Resolution:**
        * Removed the corresponding line from `.gitignore`.
        * Staged the `.gitignore` file and the previously ignored directory (`git add .gitignore vector-databases-deployed/`).
        * Committed and pushed the changes.
**Key Takeaways:**
* Pushing branches with problematic LFS history to a fresh remote can fail. Starting the remote with a clean, history-free branch is a workaround.
* When adding LFS tracking for existing binary files via `.gitattributes`, ensure the commit correctly converts files to LFS pointers. `git add --renormalize .` after updating `.gitattributes` and *before* committing is often necessary.
* Double-check `.gitignore` if expected files or directories are missing after a `git add .`.
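
The second takeaway can be checked before pushing. The sketch below is a rough helper, not part of this repository: the function names are hypothetical, and pattern matching uses `fnmatch` on the file's base name, which only approximates Git's own attribute matching. The pointer prefix is the first line of a Git LFS pointer file.

```python
import fnmatch
from typing import List

# First line of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(blob: bytes) -> bool:
    """True if the content looks like an LFS pointer rather than raw binary data."""
    return blob.startswith(LFS_POINTER_PREFIX)


def lfs_patterns(gitattributes_text: str) -> List[str]:
    """Collect the path patterns that .gitattributes routes through the LFS filter."""
    patterns = []
    for line in gitattributes_text.splitlines():
        parts = line.split()
        if parts and not parts[0].startswith("#") and "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns


def is_lfs_tracked(filename: str, patterns: List[str]) -> bool:
    """Approximate check: does the file's base name match any LFS pattern?"""
    base = filename.rsplit("/", 1)[-1]
    return any(fnmatch.fnmatch(base, pattern) for pattern in patterns)
```

One way to use it: read `.gitattributes`, call `lfs_patterns` on its text, and flag any binary file for which `is_lfs_tracked` returns `False`. Running `is_lfs_pointer` on the output of `git show HEAD:<path>` is one way to confirm that a committed file was actually converted.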