---
title: Agllm Public
emoji: 🦀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.28.3
app_file: app.py
pinned: false
license: apache-2.0
---

## PestIDBot - Quick Reference

### Environment
```bash
# Conda environment: agllm-june-15
source ~/miniconda3/etc/profile.d/conda.sh && conda activate agllm-june-15

# Required env vars (in .env file)
OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-...  # optional, for Claude models
OPENROUTER_API_KEY=...        # optional, for Llama/Gemini
```
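
A missing key is easier to catch before launching the app than in the middle of a run. The following is a minimal sketch, not code from this repository: the helper name `check_api_keys` is hypothetical, and the required/optional split simply mirrors the comments in the block above.

```python
import os

# Assumption: only the OpenAI key is mandatory, per the comments above.
REQUIRED_KEYS = ("OPENAI_API_KEY",)
OPTIONAL_KEYS = ("ANTHROPIC_API_KEY", "OPENROUTER_API_KEY")


def check_api_keys(env=None):
    """Return (missing_required, missing_optional) key names.

    `env` is any mapping of variable names to values; it defaults to
    the current process environment.
    """
    env = os.environ if env is None else env
    missing_required = [k for k in REQUIRED_KEYS if not env.get(k)]
    missing_optional = [k for k in OPTIONAL_KEYS if not env.get(k)]
    return missing_required, missing_optional


if __name__ == "__main__":
    required, optional = check_api_keys()
    if required:
        raise SystemExit(f"Missing required keys: {', '.join(required)}")
    if optional:
        print(f"Optional keys not set: {', '.join(optional)}")
```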

### Key Commands
| Task | Command |
|------|---------|
| Build DB | `python app_database_prep.py` |
| Run Eval | `python retrieval_evaluation.py` |
| Run App | `python app.py` |
| Deploy Dev | `git push space3 fresh-start:main` |
| Deploy Prod | `git push space2 fresh-start:main` |

### Git Remotes
- `space2` → `git@hf.co:spaces/arbabarshad/agllm2` (production)
- `space3` → `git@hf.co:spaces/arbabarshad/agllm2-dev` (dev)

### Project Structure
```
├── app.py                      # Main Gradio app (deployed)
├── app_database_prep.py        # Builds ChromaDB from PDFs + Excel
├── retrieval_evaluation.py     # Runs 4-filter evaluation
├── retrieval_evaluation_results.json  # Eval metrics output
│
├── agllm-data/
│   ├── agllm-data-isu-field-insects-all-species/
│   │   ├── *.pdf               # Insect IPM documents
│   │   └── matched_species_results_v2.csv  # Species metadata
│   ├── agllm-data-isu-field-weeds-all-species/
│   │   ├── *.pdf               # Weed IPM documents
│   │   └── matched_species_results_v2.csv  # Species metadata
│   └── PestID Species.xlsx     # India & Africa data (sheets)
│
├── vector-databases-deployed/
│   └── db5-agllm-data-isu-field-insects-all-species/  # ChromaDB output
│
├── species-organized/          # Analysis scripts & outputs
│   ├── species_analysis.py     # Generates paper Figure 3
│   └── species_table.tex       # LaTeX species table
│
└── writing/                    # Paper drafts
```

### Database Build Flow (4 Geographic Tiers)

| Tier | Species | Source |
|------|---------|--------|
| Midwest USA | 80 | ISU Handbook PDFs |
| USA | 110 | GPT-4o generated IPM |
| Africa | 35 | Expert-curated Excel |
| India | 11 | Expert-curated Excel |

**Midwest USA Data (80 species):**
1. PDFs loaded from `agllm-data/.../raw-pdfs/` (content source)
2. `matched_species_results_v2.csv` maps PDF filename → species name (metadata)

**USA Data (110 species - LLM generated):**
3. Run `generate_usa_ipm_info.py` to query GPT-4o for all species
4. Creates "USA" sheet in Excel with IPM info for all US-present species

**Africa/India Data (35 + 11 species):**
5. Excel `species-organized/PestID Species - Organized.xlsx` provides both content (IPM Info) and metadata

**All Data:**
6. Documents chunked (512 tokens, 10 overlap)
7. Tagged with `matched_specie_X` + `region` metadata
8. Stored in ChromaDB at `vector-databases-deployed/db5-*/`
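
The chunking and tagging steps can be sketched as below. This is an illustration under stated assumptions rather than the code in `app_database_prep.py`: whitespace-separated words stand in for tokenizer tokens, the function names are hypothetical, and `matched_specie_0` is one assumed instance of the `matched_specie_X` key pattern.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=10):
    """Split a token list into windows of `chunk_size` that share
    `overlap` tokens with the previous window."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


def build_records(text, species, region, chunk_size=512, overlap=10):
    """Chunk `text` and attach species/region metadata to each chunk.

    Assumption: whitespace splitting approximates tokenization here.
    """
    tokens = text.split()
    return [
        {
            "text": " ".join(chunk),
            # Hypothetical key name following the `matched_specie_X` pattern.
            "metadata": {"matched_specie_0": species, "region": region},
        }
        for chunk in chunk_tokens(tokens, chunk_size, overlap)
    ]
```

With the default settings each window advances by 502 tokens, so a 1,000-token document yields two chunks whose boundary shares 10 tokens.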

### Generate USA IPM Info (GPT-4o)
```bash
# Full run (prepare → process → parse)
export OPENAI_API_KEY="your-key-here"
python generate_usa_ipm_info.py --force

# Or run steps individually:
python generate_usa_ipm_info.py --step prepare   # Create JSONL requests
python generate_usa_ipm_info.py --step process   # Call GPT-4o API
python generate_usa_ipm_info.py --step parse     # Create Excel sheet
```

**Output:** Updates `species-organized/PestID Species - Organized.xlsx` with a "USA" sheet containing the 110 species present in the United States (pests + beneficials).

### Evaluation Filters (retrieval_evaluation.py)
| Filter | P@5 | nDCG@5 |
|--------|-----|--------|
| No Filter | 0.82 | 0.72 |
| Species Only | 0.99 | 0.89 |
| Region Only | 0.83 | 0.73 |
| Species + Region | **1.00** | **0.90** |
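
For readers unfamiliar with the two metrics, here is a minimal sketch of how P@5 and nDCG@5 are commonly computed for a single query. It is not taken from `retrieval_evaluation.py`: the function names are hypothetical, relevance is assumed to be a list of per-rank labels (1 = relevant, 0 = not), and the ideal ranking is assumed to be a re-sort of the retrieved list.

```python
import math


def precision_at_k(relevance, k=5):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for rel in relevance[:k] if rel > 0) / k


def dcg_at_k(relevance, k=5):
    """Discounted cumulative gain: each label divided by log2(rank + 1),
    with ranks starting at 1."""
    return sum(
        rel / math.log2(position + 2)
        for position, rel in enumerate(relevance[:k])
    )


def ndcg_at_k(relevance, k=5):
    """DCG normalized by the DCG of the best possible ordering.

    Assumption: the ideal ordering is built from the retrieved labels only.
    """
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0
```

Per-query scores like these would then be averaged over the evaluation set to give one number per filter setting.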

---

## Git LFS Troubleshooting Notes

This repository encountered several Git LFS issues during setup. Here's a summary for future reference:

1.  **Missing LFS Objects in History:** Initial pushes failed because the branch history contained references to LFS objects (specifically `a11f8941...` related to `db5/.../data_level0.bin`) that were no longer available locally or on the remote LFS store. Attempts to rewrite history using `git filter-branch` also failed because the rewrite process itself required fetching *other* missing LFS objects.
    *   **Resolution:** We created a clean base branch (`fresh-start`) with no history (`git checkout --orphan fresh-start`), committed a placeholder file, and pushed it forcefully to the remote (`git push -u space3 fresh-start:main --force`). This reset the remote `main` branch.

2.  **Importing State & Untracked Binaries:** We copied the desired file state from the old branch (`git checkout <old-branch> -- .`) into the clean `fresh-start` branch. However, the subsequent push failed because some binary files (e.g., `.png`) were included but weren't tracked by LFS according to the `.gitattributes` file *at that time*.
    *   **Resolution:**
        *   Added the necessary file patterns (e.g., `*.png filter=lfs ...`) to `.gitattributes`.
        *   Crucially, we had to ensure the commit correctly reflected this change. Amending wasn't sufficient. We used:
            ```bash
            # Reset the commit but keep files in working dir
            git reset HEAD~1
            # Re-stage files, forcing re-evaluation based on current .gitattributes
            git add --renormalize .
            # Commit the properly processed files
            git commit -m "Commit message"
            # Force push the corrected commit
            git push --force
            ```

3.  **Ignoring Necessary Directories:** A required directory (`vector-databases-deployed`) was unintentionally ignored via `.gitignore`.
    *   **Resolution:**
        *   Removed the corresponding line from `.gitignore`.
        *   Staged the `.gitignore` file and the previously ignored directory (`git add .gitignore vector-databases-deployed/`).
        *   Committed and pushed the changes.

**Key Takeaways:**

*   Pushing branches with problematic LFS history to a fresh remote can fail. Starting the remote with a clean, history-free branch is a workaround.
*   When adding LFS tracking for existing binary files via `.gitattributes`, ensure the commit correctly converts files to LFS pointers. `git add --renormalize .` after updating `.gitattributes` and *before* committing is often necessary.
*   Double-check `.gitignore` if expected files or directories are missing after a `git add .`.
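
The second takeaway can be checked before pushing. The sketch below lists files with binary-looking extensions that no LFS pattern in `.gitattributes` covers. It is an approximation under stated assumptions: the function names and the default suffix list are hypothetical, and `fnmatch` only approximates Git's own attribute pattern matching.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def lfs_patterns(gitattributes_text):
    """Return the path patterns that carry `filter=lfs`."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns


def untracked_binaries(paths, patterns, binary_suffixes=(".png", ".bin", ".pdf")):
    """Return paths with a binary suffix that match no LFS pattern.

    Assumption: matching against the file name or the full path with
    fnmatch is close enough for simple patterns such as `*.png`.
    """
    missing = []
    for path in paths:
        pure = PurePosixPath(path)
        if pure.suffix.lower() not in binary_suffixes:
            continue
        if not any(fnmatch(pure.name, p) or fnmatch(path, p) for p in patterns):
            missing.append(path)
    return missing
```

A typical use would be to feed it the contents of `.gitattributes` and the output of `git ls-files`; any path it returns is a candidate for a new LFS pattern followed by `git add --renormalize .`.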