arbabarshad commited on
Commit
03b8643
Β·
1 Parent(s): 4fa67e1

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +38 -6
README.md CHANGED
@@ -14,7 +14,13 @@ license: apache-2.0
14
 
15
  ### Environment
16
  ```bash
 
17
  source ~/miniconda3/etc/profile.d/conda.sh && conda activate agllm-june-15
 
 
 
 
 
18
  ```
19
 
20
  ### Key Commands
@@ -56,18 +62,44 @@ source ~/miniconda3/etc/profile.d/conda.sh && conda activate agllm-june-15
56
  └── writing/ # Paper drafts
57
  ```
58
 
59
- ### Database Build Flow
60
- **US Data (80 species):**
 
 
 
 
 
 
 
 
61
  1. PDFs loaded from `agllm-data/.../raw-pdfs/` (content source)
62
  2. `matched_species_results_v2.csv` maps PDF filename β†’ species name (metadata)
63
 
 
 
 
 
64
  **Africa/India Data (35 + 11 species):**
65
- 3. Excel `species-organized/PestID Species - Organized.xlsx` provides both content (IPM Info) and metadata
66
 
67
  **All Data:**
68
- 4. Documents chunked (512 tokens, 10 overlap)
69
- 5. Tagged with `matched_specie_X` + `region` metadata
70
- 6. Stored in ChromaDB at `vector-databases-deployed/db5-*/`
 
 
 
 
 
 
 
 
 
 
 
 
 
 
71
 
72
  ### Evaluation Filters (retrieval_evaluation.py)
73
  | Filter | P@5 | nDCG@5 |
 
14
 
15
  ### Environment
16
  ```bash
17
+ # Conda environment: agllm-june-15
18
  source ~/miniconda3/etc/profile.d/conda.sh && conda activate agllm-june-15
19
+
20
+ # Required env vars (in .env file)
21
+ OPENAI_API_KEY=sk-proj-...
22
+ ANTHROPIC_API_KEY=sk-ant-... # optional, for Claude models
23
+ OPENROUTER_API_KEY=... # optional, for Llama/Gemini
24
  ```
25
 
26
  ### Key Commands
 
62
  └── writing/ # Paper drafts
63
  ```
64
 
65
+ ### Database Build Flow (4 Geographic Tiers)
66
+
67
+ | Tier | Species | Source |
68
+ |------|---------|--------|
69
+ | Midwest USA | 80 | ISU Handbook PDFs |
70
+ | USA | 110 | GPT-4o generated IPM |
71
+ | Africa | 35 | Expert-curated Excel |
72
+ | India | 11 | Expert-curated Excel |
73
+
74
+ **Midwest USA Data (80 species):**
75
  1. PDFs loaded from `agllm-data/.../raw-pdfs/` (content source)
76
  2. `matched_species_results_v2.csv` maps PDF filename β†’ species name (metadata)
77
 
78
+ **USA Data (110 species - LLM generated):**
79
+ 3. Run `generate_usa_ipm_info.py` to query GPT-4o for all species
80
+ 4. Creates "USA" sheet in Excel with IPM info for all US-present species
81
+
82
  **Africa/India Data (35 + 11 species):**
83
+ 5. Excel `species-organized/PestID Species - Organized.xlsx` provides both content (IPM Info) and metadata
84
 
85
  **All Data:**
86
+ 6. Documents chunked (512 tokens, 10 overlap)
87
+ 7. Tagged with `matched_specie_X` + `region` metadata
88
+ 8. Stored in ChromaDB at `vector-databases-deployed/db5-*/`
89
+
90
+ ### Generate USA IPM Info (GPT-4o)
91
+ ```bash
92
+ # Full run (prepare β†’ process β†’ parse)
93
+ export OPENAI_API_KEY="your-key-here"
94
+ python generate_usa_ipm_info.py --force
95
+
96
+ # Or run steps individually:
97
+ python generate_usa_ipm_info.py --step prepare # Create JSONL requests
98
+ python generate_usa_ipm_info.py --step process # Call GPT-4o API
99
+ python generate_usa_ipm_info.py --step parse # Create Excel sheet
100
+ ```
101
+
102
+ **Output:** Updates `species-organized/PestID Species - Organized.xlsx` with "USA" sheet containing 110 species present in the United States (pests + beneficials).
103
 
104
  ### Evaluation Filters (retrieval_evaluation.py)
105
  | Filter | P@5 | nDCG@5 |