Commit · 03b8643
1 Parent(s): 4fa67e1
Update README.md
README.md
CHANGED
|
@@ -14,7 +14,13 @@ license: apache-2.0
|
| 14 |
|
| 15 |
### Environment
|
| 16 |
```bash
|
| 17 |
source ~/miniconda3/etc/profile.d/conda.sh && conda activate agllm-june-15
|
| 18 |
```
|
| 19 |
|
| 20 |
### Key Commands
|
@@ -56,18 +62,44 @@ source ~/miniconda3/etc/profile.d/conda.sh && conda activate agllm-june-15
|
| 56 |
└── writing/ # Paper drafts
|
| 57 |
```
|
| 58 |
|
| 59 |
-
### Database Build Flow
|
| 60 |
-
|
| 61 |
1. PDFs loaded from `agllm-data/.../raw-pdfs/` (content source)
|
| 62 |
2. `matched_species_results_v2.csv` maps PDF filename → species name (metadata)
|
| 63 |
|
| 64 |
**Africa/India Data (35 + 11 species):**
|
| 65 |
-
|
| 66 |
|
| 67 |
**All Data:**
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
|
| 72 |
### Evaluation Filters (retrieval_evaluation.py)
|
| 73 |
| Filter | P@5 | nDCG@5 |
|
| 14 |
|
| 15 |
### Environment
|
| 16 |
```bash
|
| 17 |
+
# Conda environment: agllm-june-15
|
| 18 |
source ~/miniconda3/etc/profile.d/conda.sh && conda activate agllm-june-15
|
| 19 |
+
|
| 20 |
+
# Required env vars (in .env file)
|
| 21 |
+
OPENAI_API_KEY=sk-proj-...
|
| 22 |
+
ANTHROPIC_API_KEY=sk-ant-... # optional, for Claude models
|
| 23 |
+
OPENROUTER_API_KEY=... # optional, for Llama/Gemini
|
| 24 |
```
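The `.env` keys listed above can be checked before any API call is made. The following is a minimal sketch, assuming a plain `KEY=VALUE` file with optional trailing `# ...` comments; `parse_env` and `missing_required` are hypothetical helpers, not functions from this repository, and the project may well load the file differently (for example with a dotenv library).

```python
from typing import Dict, List

# Key names come from the README; treating only the OpenAI key as required
# follows the "optional" comments next to the other two.
REQUIRED = ["OPENAI_API_KEY"]
OPTIONAL = ["ANTHROPIC_API_KEY", "OPENROUTER_API_KEY"]


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines; skip blanks, comment lines, and trailing ' # ...' comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment, then surrounding whitespace and quotes.
        value = value.split(" #", 1)[0].strip().strip('"').strip("'")
        values[key.strip()] = value
    return values


def missing_required(env: Dict[str, str]) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED if not env.get(key)]


# Placeholder values only; real keys belong in the untracked .env file.
sample = """
OPENAI_API_KEY=sk-proj-placeholder
ANTHROPIC_API_KEY=sk-ant-placeholder  # optional, for Claude models
"""
env = parse_env(sample)
print("missing required:", missing_required(env))
print("optional present:", [key for key in OPTIONAL if env.get(key)])
```

Failing early on a missing required key is cheaper than discovering it partway through a batch of model requests.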
|
| 25 |
|
| 26 |
### Key Commands
|
| 62 |
└── writing/ # Paper drafts
|
| 63 |
```
|
| 64 |
|
| 65 |
+
### Database Build Flow (4 Geographic Tiers)
|
| 66 |
+
|
| 67 |
+
| Tier | Species | Source |
|
| 68 |
+
|------|---------|--------|
|
| 69 |
+
| Midwest USA | 80 | ISU Handbook PDFs |
|
| 70 |
+
| USA | 110 | GPT-4o generated IPM |
|
| 71 |
+
| Africa | 35 | Expert-curated Excel |
|
| 72 |
+
| India | 11 | Expert-curated Excel |
|
| 73 |
+
|
| 74 |
+
**Midwest USA Data (80 species):**
|
| 75 |
1. PDFs loaded from `agllm-data/.../raw-pdfs/` (content source)
|
| 76 |
2. `matched_species_results_v2.csv` maps PDF filename → species name (metadata)
|
| 77 |
|
| 78 |
+
**USA Data (110 species - LLM generated):**
|
| 79 |
+
3. Run `generate_usa_ipm_info.py` to query GPT-4o for all species
|
| 80 |
+
4. Creates "USA" sheet in Excel with IPM info for all US-present species
|
| 81 |
+
|
| 82 |
**Africa/India Data (35 + 11 species):**
|
| 83 |
+
5. Excel `species-organized/PestID Species - Organized.xlsx` provides both content (IPM Info) and metadata
|
| 84 |
|
| 85 |
**All Data:**
|
| 86 |
+
6. Documents chunked (512 tokens, 10 overlap)
|
| 87 |
+
7. Tagged with `matched_specie_X` + `region` metadata
|
| 88 |
+
8. Stored in ChromaDB at `vector-databases-deployed/db5-*/`
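Steps 6 and 7 above can be sketched as a sliding window over tokens. This is only an illustration: the README does not say which tokenizer or splitter the project uses, so plain list items stand in for tokens here, `chunk_tokens` and `tag_chunk` are hypothetical names, and using index `0` for the `X` in `matched_specie_X` is an assumption.

```python
from typing import Dict, List


def chunk_tokens(tokens: List[str], size: int = 512, overlap: int = 10) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each new window starts `step` tokens after the last one
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the final window already reaches the end of the document
    return chunks


def tag_chunk(chunk: List[str], species: str, region: str) -> Dict[str, object]:
    """Attach species and region metadata to one chunk."""
    # Key names mirror the README's `matched_specie_X` + `region`; the index 0 is assumed.
    return {
        "text": " ".join(chunk),
        "metadata": {"matched_specie_0": species, "region": region},
    }


# Stand-in document of 1200 "tokens"; the species name is a placeholder.
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
docs = [tag_chunk(c, species="example species", region="Midwest USA") for c in chunks]
print(len(docs), [len(c) for c in chunks])
```

With 512-token windows and a 10-token overlap, each window advances by 502 tokens, so a 1200-token document yields three chunks, the last one shorter than the rest.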
|
| 89 |
+
|
| 90 |
+
### Generate USA IPM Info (GPT-4o)
|
| 91 |
+
```bash
|
| 92 |
+
# Full run (prepare → process → parse)
|
| 93 |
+
export OPENAI_API_KEY="your-key-here"
|
| 94 |
+
python generate_usa_ipm_info.py --force
|
| 95 |
+
|
| 96 |
+
# Or run steps individually:
|
| 97 |
+
python generate_usa_ipm_info.py --step prepare # Create JSONL requests
|
| 98 |
+
python generate_usa_ipm_info.py --step process # Call GPT-4o API
|
| 99 |
+
python generate_usa_ipm_info.py --step parse # Create Excel sheet
|
| 100 |
+
```
|
| 101 |
+
|
| 102 |
+
**Output:** Updates `species-organized/PestID Species - Organized.xlsx` with "USA" sheet containing 110 species present in the United States (pests + beneficials).
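The internals of `generate_usa_ipm_info.py` are not shown in this commit. As a rough sketch of how a `--step` / `--force` interface like the one above could dispatch its three stages, assuming `--force` means "redo work even if outputs exist" (the README does not define it) and using hypothetical handler functions:

```python
import argparse
from typing import Callable, Dict, List

STEPS = ["prepare", "process", "parse"]  # stage order taken from the README


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of a three-stage pipeline CLI")
    parser.add_argument("--step", choices=STEPS, default=None,
                        help="run a single stage; omit to run every stage in order")
    parser.add_argument("--force", action="store_true",
                        help="assumed meaning: redo stages even if outputs already exist")
    return parser


def run(argv: List[str], handlers: Dict[str, Callable[[bool], None]]) -> None:
    """Run the selected stage, or all stages in order when --step is omitted."""
    args = build_parser().parse_args(argv)
    selected = [args.step] if args.step else STEPS
    for name in selected:
        handlers[name](args.force)


# Hypothetical handlers that only record what was called; real stages would
# write the JSONL requests, call the model API, and build the Excel sheet.
log = []
handlers = {name: (lambda force, name=name: log.append((name, force))) for name in STEPS}

run(["--force"], handlers)          # full run
run(["--step", "parse"], handlers)  # single stage
print(log)
```

Keeping the stages separately runnable means the parse stage can be repeated without calling the API again, which matches the individual-step commands listed above.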
|
| 103 |
|
| 104 |
### Evaluation Filters (retrieval_evaluation.py)
|
| 105 |
| Filter | P@5 | nDCG@5 |
|