atharvthite05 committed on
Commit 2cb3200 · verified · 1 Parent(s): 8546295

Update agent.py

Files changed (1)
  1. agent.py +42 -42
agent.py CHANGED
@@ -189,15 +189,15 @@ Golden thread: CSV → Sentences → Vectors → Clusters → Topics
189   Tool 1: load_scopus_csv(filepath)
190   Load CSV, show columns, estimate sentence count.
191
192 - Tool 2: run_bertopic_discovery(run_key, threshold)
193 - Split → embed → AgglomerativeClustering cosine → centroid nearest 5 → Plotly charts.
194
195   Tool 3: label_topics_with_llm(run_key)
196   5 nearest centroid sentences → Mistral only → initial topic labels.
197
198   Tool 4: verify_topic_labels_with_groq(run_key)
199   Run only when researcher types VERIFY at STOP GATE 1.
200 - Return Mistral vs Groq comparison in chat for manual verification.
201
202   Tool 5: consolidate_into_themes(run_key, theme_map)
203   Merge researcher-approved topic groups → recompute centroids → new evidence.
@@ -234,22 +234,20 @@ Golden thread: CSV → Sentences → Vectors → Clusters → Topics
234   - Researcher is active interpreter, not passive receiver of themes
235
236   Grootendorst (2022), arXiv:2203.05794 — BERTopic:
237 - - Modular: any embedding, any clustering, any dim reduction
238 - - Supports AgglomerativeClustering as alternative to HDBSCAN
239 - - c-TF-IDF extracts distinguishing words per cluster
240 - - BERTopic uses AgglomerativeClustering internally for topic reduction
241 -
242 - Ward (1963), JASA + Lance & Williams (1967) — Agglomerative Clustering:
243 - - Groups by pairwise cosine similarity threshold
244 - - No density estimation needed — works in ANY dimension (384d)
245 - - distance_threshold controls granularity (lower = more topics)
246 - - Every sentence assigned to a cluster (no outliers)
247 - - 62-year-old algorithm, gold standard for hierarchical grouping
248 -
249 - Reimers & Gurevych (2019), EMNLP — Sentence-BERT:
250 - - all-MiniLM-L6-v2 produces 384d normalized vectors
251 - - Cosine similarity = semantic relatedness
252 - - Same meaning clusters together regardless of exact wording
253
254   PACIS/ICIS Research Categories:
255   IS Design Science, HCI, E-Commerce, Knowledge Management,
@@ -281,8 +279,8 @@ When researcher uploads CSV or says "analyze":
281   Loaded [N] papers (~[M] sentences estimated)
282   Columns: Title ✅ | Abstract ✅ | Author Keywords (optional) ✅
283
284 - Sentence-level approach: each abstract splits into ~10
285 - sentences, each becomes a 384d vector. One paper can
286   contribute to MULTIPLE topics.
287
288   I can run 3 configurations:
@@ -290,15 +288,15 @@ When researcher uploads CSV or says "analyze":
290   2️⃣ **Title only** — what papers CLAIM to be about (author's framing)
291   3️⃣ **Keywords only** — author-declared focus areas (author keywords)
292
293 - ⚙️ Defaults: threshold=0.7, cosine AgglomerativeClustering, 5 nearest
294
295   **Ready to proceed to Phase 2?**
296   • `run` — execute BERTopic discovery
297   • `run abstract` — single config
298   • `run title` — single config
299   • `run keywords` — single config
300 - • `change threshold to 0.65` — more topics (stricter grouping)
301 - • `change threshold to 0.8` — fewer topics (looser grouping)"
302
303   3. WAIT for researcher confirmation before proceeding.
304
 
@@ -310,11 +308,11 @@ When researcher uploads CSV or says "analyze":
310
311   After researcher confirms:
312
313 - 1. Call run_bertopic_discovery(run_key, threshold)
314   → Splits papers into sentences (regex, min 30 chars)
315   → Filters publisher boilerplate (copyright, license text)
316 - → Embeds with all-MiniLM-L6-v2 (384d, L2-normalized)
317 - → AgglomerativeClustering cosine (no UMAP, no dimension reduction)
318   → Finds 5 nearest centroid sentences per topic
319   → Saves Plotly HTML visualizations
320   → Saves embeddings + summaries checkpoints
@@ -325,7 +323,7 @@ After researcher confirms:
325   → Writes review table with Mistral labels by default
326   OPTIONAL: if researcher types `VERIFY` at STOP GATE 1,
327   call verify_topic_labels_with_groq(run_key) and present side-by-side
328 - Mistral vs Groq label comparison directly in chat.
329   NOTE: NO PACIS categories in Phase 2. PACIS comparison comes in Phase 5.5.
330
331   3. Present CODED data with EVIDENCE under each topic:
@@ -347,10 +345,10 @@ After researcher confirms:
347   📊 4 Plotly visualizations saved (download below)
348
349   **Review these codes. Ready for Phase 3 (theme search)?**
350 - • `VERIFY` — run Groq labels and compare with Mistral in chat output
351   • `approve` — codes look good, move to theme grouping
352 - • `re-run 0.65` — re-run with stricter threshold (more topics)
353 - • `re-run 0.8` — re-run with looser threshold (fewer topics)
354   • `show topic 4 papers` — see all paper titles in topic 4
355   • `code 2 looks wrong` — I will show why it was labeled that way
356
 
@@ -388,9 +386,10 @@ After researcher confirms:
388
389   7. If researcher questions a code:
390   → Show the 5 sentences that generated the label
391 - → Explain reasoning: "AgglomerativeClustering groups sentences
392 - where cosine distance < threshold. These sentences share
393 - semantic proximity in 384d space even if keywords differ."
394   → Offer re-run with adjusted parameters
395
396   ═══════════════════════════════════════════════════════════════
@@ -614,9 +613,9 @@ After all requested run configs have finalized themes:
614   - ONLY call verify_taxonomy_mapping_with_groq when user explicitly says VERIFY
615   and the workflow is at STOP GATE 4 (post-Phase 5.5 mapping).
616   - ALWAYS call compare_with_taxonomy before claiming PAJAIS mappings.
617 - - Use threshold=0.7 as default (lower = more topics, higher = fewer).
618 - - If too many topics (>200), suggest increasing threshold to 0.8.
619 - - If too few topics (<20), suggest decreasing threshold to 0.6.
620   - NEVER skip Phase 4 saturation check or Phase 5.5 taxonomy comparison.
621   - NEVER proceed to Phase 6 unless every run that was executed has completed Phase 5.5.
622   - NEVER invent topic labels — only present labels returned by Tool 3.
@@ -1032,14 +1031,15 @@ def _build_verify_chat_report(rows: list[dict]) -> str:
1032
1033       shown = rows[:VERIFY_CHAT_MAX_ROWS]
1034       header = [
1035 -         "| # | Mistral Label | Groq Label |",
1036 -         "|---|---|---|",
1037       ]
1038       lines = list(map(
1039           lambda r: (
1040               f"| {int(r.get('cluster_id', 0))} "
1041               f"| {_sanitize_markdown_cell(r.get('mistral_label') or r.get('label', ''))} "
1042 -             f"| {_sanitize_markdown_cell(r.get('groq_label', ''))} |"
1043           ),
1044           shown,
1045       ))
@@ -1120,9 +1120,9 @@ def _handle_verify_command(state: dict) -> tuple[str, dict]:
1120       report = _build_verify_chat_report(labels_rows)
1121
1122       reply = (
1123 -         "VERIFY complete. Groq topic labeling has been added for Phase 2 topics.\n\n"
1124           f"Verified topics: {verified_count}/{labelled_count}\n"
1125 -         "Mistral vs Groq comparison is shown below in chat.\n\n"
1126           f"{report}\n\n"
1127           "Compare labels, edit Rename To/Approve, then click Submit Review to continue.\n\n"
1128           "[STOP GATE 1 — AWAITING REVIEW TABLE SUBMISSION]"
 
189   Tool 1: load_scopus_csv(filepath)
190   Load CSV, show columns, estimate sentence count.
191
192 + Tool 2: run_bertopic_discovery(run_key, min_cluster_size, max_cluster_size)
193 + Split → embed → UMAP + HDBSCAN → centroid nearest 5 → Plotly charts.
194
195   Tool 3: label_topics_with_llm(run_key)
196   5 nearest centroid sentences → Mistral only → initial topic labels.
197
198   Tool 4: verify_topic_labels_with_groq(run_key)
199   Run only when researcher types VERIFY at STOP GATE 1.
200 + Return Mistral vs Groq-Ollama vs Groq-GPT comparison in chat for manual verification.
201
202   Tool 5: consolidate_into_themes(run_key, theme_map)
203   Merge researcher-approved topic groups → recompute centroids → new evidence.
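
The "centroid nearest 5" step named in the tool descriptions can be illustrated with a minimal, standard-library sketch. This is not the code in agent.py, which the diff does not show; the function names (`l2_normalize`, `centroid`, `cosine`, `nearest_to_centroid`) are hypothetical, and vectors are assumed to be plain Python lists.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity, computed as the dot product of unit vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def nearest_to_centroid(sentences, vectors, k=5):
    """Return the k sentences whose vectors are most similar to the cluster centroid."""
    c = centroid(vectors)
    ranked = sorted(
        zip(sentences, vectors),
        key=lambda pair: cosine(pair[1], c),
        reverse=True,
    )
    return [sentence for sentence, _ in ranked[:k]]
```

After topics are merged into a theme, calling the same function on the combined sentence set would yield the "new evidence" for the recomputed centroid, assuming the merge simply pools the member vectors.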
 
234   - Researcher is active interpreter, not passive receiver of themes
235
236   Grootendorst (2022), arXiv:2203.05794 — BERTopic:
237 + - Modular: any embedding, any clustering, any dim reduction
238 + - UMAP + HDBSCAN is a common discovery stack for density-based topics
239 + - c-TF-IDF extracts distinguishing words per cluster
240 +
241 + McInnes et al. (2017) — HDBSCAN:
242 + - Density-based clustering with variable-density support
243 + - Allows noise points (unassigned sentences)
244 + - min_cluster_size controls granularity (lower = more topics)
245 + - max_cluster_size caps oversized clusters
246 +
247 + Cohan et al. (2020) — SPECTER2:
248 + - SPECTER2 produces semantically aligned embeddings for scientific text
249 + - Cosine similarity = semantic relatedness
250 + - Same meaning clusters together regardless of exact wording
251
252   PACIS/ICIS Research Categories:
253   IS Design Science, HCI, E-Commerce, Knowledge Management,
 
279   Loaded [N] papers (~[M] sentences estimated)
280   Columns: Title ✅ | Abstract ✅ | Author Keywords (optional) ✅
281
282 + Sentence-level approach: each abstract splits into ~10
283 + sentences, each becomes a SPECTER2 vector. One paper can
284   contribute to MULTIPLE topics.
285
286   I can run 3 configurations:
 
288   2️⃣ **Title only** — what papers CLAIM to be about (author's framing)
289   3️⃣ **Keywords only** — author-declared focus areas (author keywords)
290
291 + ⚙️ Defaults: UMAP + HDBSCAN (min_cluster_size=20, max_cluster_size=120), 5 nearest
292
293   **Ready to proceed to Phase 2?**
294   • `run` — execute BERTopic discovery
295   • `run abstract` — single config
296   • `run title` — single config
297   • `run keywords` — single config
298 + • `change min_cluster_size to 4` — more topics (smaller groups)
299 + • `change max_cluster_size to 100` — cap oversized clusters"
300
301   3. WAIT for researcher confirmation before proceeding.
302
 
 
308
309   After researcher confirms:
310
311 + 1. Call run_bertopic_discovery(run_key, min_cluster_size, max_cluster_size)
312   → Splits papers into sentences (regex, min 30 chars)
313   → Filters publisher boilerplate (copyright, license text)
314 + → Embeds with SPECTER2 (L2-normalized)
315 + → UMAP reduces dimensions for HDBSCAN clustering
316   → Finds 5 nearest centroid sentences per topic
317   → Saves Plotly HTML visualizations
318   → Saves embeddings + summaries checkpoints
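
The first two pipeline steps (regex sentence splitting with a 30-character minimum, then boilerplate filtering) could look roughly like the sketch below. The diff only names these steps; the actual regex and filter list in agent.py are not shown, so `SENTENCE_BOUNDARY_RE`, `BOILERPLATE_RE`, and `split_sentences` are illustrative assumptions.

```python
import re

# Stated in the prompt text: sentences shorter than 30 characters are dropped.
MIN_SENTENCE_CHARS = 30

# Assumed boundary rule: whitespace that follows ., ! or ?
SENTENCE_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+")

# Assumed boilerplate markers (copyright and license text); the real list is unknown.
BOILERPLATE_RE = re.compile(
    r"(©|\bcopyright\b|all rights reserved|\blicen[cs]e\b)",
    re.IGNORECASE,
)


def split_sentences(abstract):
    """Split an abstract into sentences, dropping short fragments and boilerplate."""
    parts = (p.strip() for p in SENTENCE_BOUNDARY_RE.split(abstract.strip()))
    return [
        p for p in parts
        if len(p) >= MIN_SENTENCE_CHARS and not BOILERPLATE_RE.search(p)
    ]
```

A simple boundary regex like this one will mis-split on abbreviations such as "et al."; a production splitter would likely need extra handling for those cases.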
 
323   → Writes review table with Mistral labels by default
324   OPTIONAL: if researcher types `VERIFY` at STOP GATE 1,
325   call verify_topic_labels_with_groq(run_key) and present side-by-side
326 + Mistral vs Groq-Ollama vs Groq-GPT label comparison directly in chat.
327   NOTE: NO PACIS categories in Phase 2. PACIS comparison comes in Phase 5.5.
328
329   3. Present CODED data with EVIDENCE under each topic:
 
345   📊 4 Plotly visualizations saved (download below)
346
347   **Review these codes. Ready for Phase 3 (theme search)?**
348 + • `VERIFY` — run Groq-Ollama + Groq-GPT labels and compare with Mistral in chat output
349   • `approve` — codes look good, move to theme grouping
350 + • `re-run min_cluster_size=4` — more topics (smaller groups)
351 + • `re-run max_cluster_size=100` — cap oversized clusters
352   • `show topic 4 papers` — see all paper titles in topic 4
353   • `code 2 looks wrong` — I will show why it was labeled that way
354
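
The prompt advertises two command spellings for adjusting parameters: `re-run min_cluster_size=4` and `change max_cluster_size to 100`. How agent.py actually interprets them is not visible in this diff (the LLM may handle them directly), so the parser below is only a sketch of one possible approach. `DEFAULTS` mirrors the values stated in the prompt; `apply_param_command` and its validation rule are hypothetical.

```python
import re

# Defaults as stated in the updated prompt text.
DEFAULTS = {"min_cluster_size": 20, "max_cluster_size": 120}

# Accepts "re-run <name>=<int>" and "change <name> to <int>".
_PARAM_RE = re.compile(
    r"^(?:re-run|change)\s+(min_cluster_size|max_cluster_size)"
    r"\s*(?:=|\s+to\s+)\s*(\d+)$",
    re.IGNORECASE,
)


def apply_param_command(command, params=None):
    """Return a new parameter dict with the command applied.

    Commands that do not match either spelling leave the parameters unchanged.
    """
    updated = dict(params or DEFAULTS)
    match = _PARAM_RE.match(command.strip())
    if match is None:
        return updated
    updated[match.group(1).lower()] = int(match.group(2))
    # Assumed sanity check: the lower bound must not exceed the upper bound.
    if updated["min_cluster_size"] > updated["max_cluster_size"]:
        raise ValueError("min_cluster_size must not exceed max_cluster_size")
    return updated
```

Returning a fresh dict rather than mutating the input keeps each run configuration independent, which matters when several configs (abstract, title, keywords) are run side by side.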
 
 
386
387   7. If researcher questions a code:
388   → Show the 5 sentences that generated the label
389 + → Explain reasoning: "UMAP preserves semantic neighborhoods,
390 + and HDBSCAN finds dense groups without forcing every point
391 + into a cluster. These sentences share semantic proximity even
392 + if keywords differ."
393   → Offer re-run with adjusted parameters
394
395   ═══════════════════════════════════════════════════════════════
 
613   - ONLY call verify_taxonomy_mapping_with_groq when user explicitly says VERIFY
614   and the workflow is at STOP GATE 4 (post-Phase 5.5 mapping).
615   - ALWAYS call compare_with_taxonomy before claiming PAJAIS mappings.
616 + - Use min_cluster_size=20, max_cluster_size=120 as default.
617 + - If too many topics (>200), suggest increasing min_cluster_size.
618 + - If too few topics (<20), suggest decreasing min_cluster_size.
619   - NEVER skip Phase 4 saturation check or Phase 5.5 taxonomy comparison.
620   - NEVER proceed to Phase 6 unless every run that was executed has completed Phase 5.5.
621   - NEVER invent topic labels — only present labels returned by Tool 3.
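
The tuning rules above are prompt instructions for the LLM, not code. Written as a plain function, the same heuristic would read as follows; the thresholds (200 and 20) come from the prompt, while the function name `suggest_adjustment` and the message wording are assumptions.

```python
def suggest_adjustment(n_topics, min_cluster_size=20, too_many=200, too_few=20):
    """Return a tuning suggestion for the given topic count, or None if it is in range."""
    if n_topics > too_many:
        return (
            f"{n_topics} topics found (more than {too_many}): "
            f"consider increasing min_cluster_size above {min_cluster_size}."
        )
    if n_topics < too_few:
        return (
            f"{n_topics} topics found (fewer than {too_few}): "
            f"consider decreasing min_cluster_size below {min_cluster_size}."
        )
    return None
```

The direction of the adjustment follows from how HDBSCAN treats `min_cluster_size`: a larger value dissolves small clusters, so fewer topics survive.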
 
1031
1032       shown = rows[:VERIFY_CHAT_MAX_ROWS]
1033       header = [
1034 +         "| # | Mistral Label | Groq-Ollama Label | Groq-GPT Label |",
1035 +         "|---|---|---|---|",
1036       ]
1037       lines = list(map(
1038           lambda r: (
1039               f"| {int(r.get('cluster_id', 0))} "
1040               f"| {_sanitize_markdown_cell(r.get('mistral_label') or r.get('label', ''))} "
1041 +             f"| {_sanitize_markdown_cell(r.get('groq_ollama_label') or r.get('groq_label', ''))} "
1042 +             f"| {_sanitize_markdown_cell(r.get('groq_gpt_label', ''))} |"
1043           ),
1044           shown,
1045       ))
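
To see what the widened table produces, here is a self-contained sketch of the same row-building logic. It reuses the dictionary keys from the diff, including the fallback from `groq_ollama_label` to the legacy `groq_label` key. `sanitize_cell` is a hypothetical stand-in for `_sanitize_markdown_cell`, whose body is not shown, and `max_rows=50` is an assumed value for `VERIFY_CHAT_MAX_ROWS`.

```python
def sanitize_cell(value):
    """Hypothetical stand-in: escape pipes and flatten newlines for a table cell."""
    return str(value or "").replace("|", "\\|").replace("\n", " ").strip()


def build_report(rows, max_rows=50):
    """Render label rows as a four-column Markdown comparison table."""
    header = [
        "| # | Mistral Label | Groq-Ollama Label | Groq-GPT Label |",
        "|---|---|---|---|",
    ]
    lines = [
        f"| {int(r.get('cluster_id', 0))} "
        f"| {sanitize_cell(r.get('mistral_label') or r.get('label', ''))} "
        # Older rows may only carry 'groq_label'; fall back to it.
        f"| {sanitize_cell(r.get('groq_ollama_label') or r.get('groq_label', ''))} "
        f"| {sanitize_cell(r.get('groq_gpt_label', ''))} |"
        for r in rows[:max_rows]
    ]
    return "\n".join(header + lines)
```

With this fallback, rows written before the commit still render: the old Groq label appears in the Groq-Ollama column and the Groq-GPT column stays empty.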
 
1120       report = _build_verify_chat_report(labels_rows)
1121
1122       reply = (
1123 +         "VERIFY complete. Groq-Ollama and Groq-GPT topic labeling has been added for Phase 2 topics.\n\n"
1124           f"Verified topics: {verified_count}/{labelled_count}\n"
1125 +         "Mistral vs Groq-Ollama vs Groq-GPT comparison is shown below in chat.\n\n"
1126           f"{report}\n\n"
1127           "Compare labels, edit Rename To/Approve, then click Submit Review to continue.\n\n"
1128           "[STOP GATE 1 — AWAITING REVIEW TABLE SUBMISSION]"