Spaces:
Sleeping
Sleeping
File size: 6,599 Bytes
de952e4 42f7194 03b8643 8302c34 03b8643 42f7194 03b8643 8302c34 03b8643 8302c34 03b8643 8302c34 03b8643 42f7194 3912de6 f7bad94 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 |
---
title: Agllm Public
emoji: π¦
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.28.3
app_file: app.py
pinned: false
license: apache-2.0
---
## PestIDBot - Quick Reference
### Environment
```bash
# Conda environment: agllm-june-15
source ~/miniconda3/etc/profile.d/conda.sh && conda activate agllm-june-15
# Required env vars (in .env file)
OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-... # optional, for Claude models
OPENROUTER_API_KEY=... # optional, for Llama/Gemini
```
### Key Commands
| Task | Command |
|------|---------|
| Build DB | `python app_database_prep.py` |
| Run Eval | `python retrieval_evaluation.py` |
| Run App | `python app.py` |
| Deploy Dev | `git push space3 fresh-start:main` |
| Deploy Prod | `git push space2 fresh-start:main` |
### Git Remotes
- `space2` β `git@hf.co:spaces/arbabarshad/agllm2` (production)
- `space3` β `git@hf.co:spaces/arbabarshad/agllm2-dev` (dev)
### Project Structure
```
βββ app.py # Main Gradio app (deployed)
βββ app_database_prep.py # Builds ChromaDB from PDFs + Excel
βββ retrieval_evaluation.py # Runs 4-filter evaluation
βββ retrieval_evaluation_results.json # Eval metrics output
β
βββ agllm-data/
β βββ agllm-data-isu-field-insects-all-species/
β β βββ *.pdf # Insect IPM documents
β β βββ matched_species_results_v2.csv # Species metadata
β βββ agllm-data-isu-field-weeds-all-species/
β β βββ *.pdf # Weed IPM documents
β β βββ matched_species_results_v2.csv # Species metadata
β βββ PestID Species.xlsx # India & Africa data (sheets)
β
βββ vector-databases-deployed/
β βββ db5-agllm-data-isu-field-insects-all-species/ # ChromaDB output
β
βββ species-organized/ # Analysis scripts & outputs
β βββ species_analysis.py # Generates paper Figure 3
β βββ species_table.tex # LaTeX species table
β
βββ writing/ # Paper drafts
```
### Database Build Flow (4 Geographic Tiers)
| Tier | Species | Source |
|------|---------|--------|
| Midwest USA | 80 | ISU Handbook PDFs |
| USA | 110 | GPT-4o generated IPM |
| Africa | 35 | Expert-curated Excel |
| India | 11 | Expert-curated Excel |
**Midwest USA Data (80 species):**
1. PDFs loaded from `agllm-data/.../raw-pdfs/` (content source)
2. `matched_species_results_v2.csv` maps PDF filename β species name (metadata)
**USA Data (110 species - LLM generated):**
3. Run `generate_usa_ipm_info.py` to query GPT-4o for all species
4. Creates "USA" sheet in Excel with IPM info for all US-present species
**Africa/India Data (35 + 11 species):**
5. Excel `species-organized/PestID Species - Organized.xlsx` provides both content (IPM Info) and metadata
**All Data:**
6. Documents chunked (512 tokens, 10 overlap)
7. Tagged with `matched_specie_X` + `region` metadata
8. Stored in ChromaDB at `vector-databases-deployed/db5-*/`
### Generate USA IPM Info (GPT-4o)
```bash
# Full run (prepare β process β parse)
export OPENAI_API_KEY="your-key-here"
python generate_usa_ipm_info.py --force
# Or run steps individually:
python generate_usa_ipm_info.py --step prepare # Create JSONL requests
python generate_usa_ipm_info.py --step process # Call GPT-4o API
python generate_usa_ipm_info.py --step parse # Create Excel sheet
```
**Output:** Updates `species-organized/PestID Species - Organized.xlsx` with "USA" sheet containing 110 species present in the United States (pests + beneficials).
### Evaluation Filters (retrieval_evaluation.py)
| Filter | P@5 | nDCG@5 |
|--------|-----|--------|
| No Filter | 0.82 | 0.72 |
| Species Only | 0.99 | 0.89 |
| Region Only | 0.83 | 0.73 |
| Species + Region | **1.00** | **0.90** |
---
## Git LFS Troubleshooting Notes
This repository encountered several Git LFS issues during setup. Here's a summary for future reference:
1. **Missing LFS Objects in History:** Initial pushes failed because the branch history contained references to LFS objects (specifically `a11f8941...` related to `db5/.../data_level0.bin`) that were no longer available locally or on the remote LFS store. Attempts to rewrite history using `git filter-branch` also failed because the rewrite process itself required fetching *other* missing LFS objects.
* **Resolution:** We created a clean base branch (`fresh-start`) with no history (`git checkout --orphan fresh-start`), committed a placeholder file, and pushed it forcefully to the remote (`git push -u space3 fresh-start:main --force`). This reset the remote `main` branch.
2. **Importing State & Untracked Binaries:** We copied the desired file state from the old branch (`git checkout <old-branch> -- .`) into the clean `fresh-start` branch. However, the subsequent push failed because some binary files (e.g., `.png`) were included but weren't tracked by LFS according to the `.gitattributes` file *at that time*.
* **Resolution:**
* Added the necessary file patterns (e.g., `*.png filter=lfs ...`) to `.gitattributes`.
* Crucially, we had to ensure the commit correctly reflected this change. Amending wasn't sufficient. We used:
```bash
# Reset the commit but keep files in working dir
git reset HEAD~1
# Re-stage files, forcing re-evaluation based on current .gitattributes
git add --renormalize .
# Commit the properly processed files
git commit -m "Commit message"
# Force push the corrected commit
git push --force
```
3. **Ignoring Necessary Directories:** A required directory (`vector-databases-deployed`) was unintentionally ignored via `.gitignore`.
* **Resolution:**
* Removed the corresponding line from `.gitignore`.
* Staged the `.gitignore` file and the previously ignored directory (`git add .gitignore vector-databases-deployed/`).
* Committed and pushed the changes.
**Key Takeaways:**
* Pushing branches with problematic LFS history to a fresh remote can fail. Starting the remote with a clean, history-free branch is a workaround.
* When adding LFS tracking for existing binary files via `.gitattributes`, ensure the commit correctly converts files to LFS pointers. `git add --renormalize .` after updating `.gitattributes` and *before* committing is often necessary.
* Double-check `.gitignore` if expected files or directories are missing after a `git add .`.
|