Daksh C Jain committed on
Commit
d2a404d
·
0 Parent(s):

Initial commit (Clean)

Files changed (6)
  1. .gitignore +5 -0
  2. README.md +88 -0
  3. agent.py +470 -0
  4. app.py +173 -0
  5. requirements.txt +13 -0
  6. tools.py +182 -0
.gitignore ADDED
@@ -0,0 +1,5 @@
+ .env
+ outputs/
+ checkpoints/
+ __pycache__/
+ *.pyc
README.md ADDED
@@ -0,0 +1,88 @@
+ # 🔬 Topic Modelling Agentic AI
+
+ A professional, agent-driven platform for automated **Reflexive Thematic Analysis** (Braun & Clarke, 2006) using state-of-the-art Natural Language Processing. Built with LangGraph, BERTopic, and Mistral AI, this agent automates the discovery, labeling, and synthesis of research topics from large-scale academic datasets (e.g., Scopus CSV exports).
+
+ ---
+
+ ## 🚀 Overview
+
+ This project implements a sophisticated "Golden Thread" pipeline for qualitative research. It moves beyond traditional keyword extraction by using sentence-level embeddings and LLM-powered context awareness to identify nuanced themes.
+
+ ### Key Features
+ - **Agentic Workflow**: Powered by **LangGraph**, the agent autonomously decides when to load data, run clustering, or call the LLM for labeling.
+ - **Precision Clustering**: Uses **BERTopic** with Agglomerative Clustering (Cosine similarity) on 384d sentence embeddings (`all-MiniLM-L6-v2`).
+ - **Human-in-the-Loop**: An interactive Gradio UI allows researchers to review, rename, or reject agent-generated topics before final synthesis.
+ - **Automated Synthesis**: Generates a 500-word research narrative and maps themes to established taxonomies (e.g., PAJAIS).
+ - **Rich Visualizations**: Interactive Plotly charts including Intertopic Distance Maps, Hierarchical Clustering, and Heatmaps.
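The "Precision Clustering" feature above can be illustrated with a small, dependency-free sketch. This is not the project's code: the README states that the pipeline uses BERTopic with Agglomerative Clustering, whereas `agglomerative_clusters` below is a hypothetical pure-Python stand-in. Average linkage and the toy 2-d vectors are assumptions made for illustration only; real inputs would be the 384d sentence embeddings.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative_clusters(vectors, distance_threshold):
    """Toy average-linkage clustering (hypothetical helper, not the repo's code).

    Starts with one cluster per vector and repeatedly merges the closest pair
    until the closest pair is at or above `distance_threshold`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def average_distance(c1, c2):
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] >= distance_threshold:
            break  # nothing left that is close enough to merge
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Illustrative stand-ins for sentence embeddings (made-up values).
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
print(agglomerative_clusters(vectors, 0.5))  # [[0, 1], [2, 3]]
```

Lowering the threshold leaves more, smaller clusters and raising it merges more aggressively, which matches the "lower = more topics" guidance that appears in the agent prompt.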
+
+ ---
+
+ ## 🛠️ Technology Stack
+
+ - **Framework**: [LangGraph](https://github.com/langchain-ai/langgraph) (Agentic logic & state management)
+ - **Engine**: [BERTopic](https://github.com/MaartenGr/BERTopic) (Topic Modeling pipeline)
+ - **LLM**: [Mistral AI](https://mistral.ai/) (`mistral-small-latest`)
+ - **Embeddings**: `sentence-transformers/all-MiniLM-L6-v2`
+ - **UI**: [Gradio 5.x](https://gradio.app/)
+ - **Data**: Pandas, NumPy, Scikit-Learn
+
+ ---
+
+ ## 📋 Methodology
+
+ The agent follows the **Braun & Clarke (2006)** six-phase thematic analysis framework:
+
+ 1. **Familiarization**: Loading and preprocessing Scopus CSV metadata.
+ 2. **Initial Coding**: Sentence-level clustering to identify "semantic atoms."
+ 3. **Searching for Themes**: Aggregating clusters into broader research themes.
+ 4. **Reviewing Themes**: Researcher validation via the Review Table.
+ 5. **Defining and Naming**: Refined LLM labeling based on centroid-nearest evidence.
+ 6. **Producing the Report**: Exporting narrative sections and comparison matrices.
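The "centroid-nearest evidence" mentioned in step 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: `nearest_to_centroid` is a hypothetical helper, the vectors are made-up 2-d stand-ins for sentence embeddings, and ranking by cosine similarity to the mean vector is one plausible reading of "centroid-nearest".

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def nearest_to_centroid(sentences, vectors, k=5):
    """Return the k sentences whose vectors are most similar to the cluster centroid."""
    center = centroid(vectors)
    ranked = sorted(
        zip(sentences, vectors),
        key=lambda pair: cosine_similarity(pair[1], center),
        reverse=True,
    )
    return [sentence for sentence, _ in ranked[:k]]


# Three sentences from one hypothetical cluster; "b" lies closest to the mean direction.
sentences = ["a", "b", "c"]
vectors = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0)]
print(nearest_to_centroid(sentences, vectors, k=2))  # ['b', 'a']
```

Sending only the few most central sentences to the LLM keeps the labeling prompt short while still grounding each label in text from the cluster.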
+
+ ---
+
+ ## 💻 Setup & Installation
+
+ ### Prerequisites
+ - Python 3.10+
+ - Mistral AI API Key
+
+ ### Installation
+
+ 1. **Clone the repository**:
+ ```bash
+ git clone https://github.com/your-repo/topic-modelling-agent.git
+ cd topic-modelling-agent
+ ```
+
+ 2. **Install dependencies**:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 3. **Configure environment**:
+ Create a `.env` file in the root directory:
+ ```env
+ MISTRAL_API_KEY=your_api_key_here
+ ```
+
+ 4. **Run the application**:
+ ```bash
+ python app.py
+ ```
+
+ ---
+
+ ## 📖 Usage
+
+ 1. **Upload Data**: Drag and drop a Scopus CSV export.
+ 2. **Initialize**: Type `Analyze my CSV` or `run abstract only` in the chat.
+ 3. **Iterate**: Use the chat to refine topics (e.g., `group topics 5 and 10 into "Sustainability"`).
+ 4. **Review**: Use the **Review Table** tab to approve or rename topics.
+ 5. **Export**: Download the generated Narrative and Comparison CSV from the **Download** tab.
+
+ ---
+
+ ## 📄 License
+
+ This project is licensed under the MIT License - see the LICENSE file for details.
agent.py ADDED
@@ -0,0 +1,470 @@
+ from datetime import datetime
+
+ # Define the system prompt for the BERTopic agent
+ SYSTEM_PROMPT = """
+ ═══════════════════════════════════════════════════════════════
+ 🔬 BERTOPIC THEMATIC DISCOVERY AGENT
+ Sentence-Level Topic Modeling with Researcher-in-the-Loop
+ ═══════════════════════════════════════════════════════════════
+
+ You are a research assistant that performs thematic analysis on
+ Scopus academic paper exports using BERTopic + Mistral LLM.
+
+ Your workflow follows Braun & Clarke's (2006) six-phase Reflexive
+ Thematic Analysis framework — the gold standard for qualitative
+ research — enhanced with computational NLP at scale.
+
+ Golden thread: CSV → Sentences → Vectors → Clusters → Topics
+ → Themes → Saturation → Taxonomy Check → Synthesis → Report
+
+ ═══════════════════════════════════════════════════════════════
+ ⛔ CRITICAL RULES
+ ═══════════════════════════════════════════════════════════════
+
+ RULE 1: ONE PHASE PER MESSAGE
+ NEVER combine multiple phases in one response.
+ Present ONE phase → STOP → wait for approval → next phase.
+
+ RULE 2: ALL APPROVALS VIA REVIEW TABLE
+ The researcher approves/rejects/renames using the Results
+ Table below the chat — NOT by typing in chat.
+
+ Your workflow for EVERY phase:
+ 1. Call the tool (saves JSON → table auto-refreshes)
+ 2. Briefly explain what you did in chat (2-3 sentences)
+ 3. End with: "**Review the table below. Edit Approve/Rename
+ columns, then click Submit Review to Agent.**"
+ 4. STOP. Wait for the researcher's Submit Review.
+
+ NEVER present large tables or topic lists in chat text.
+ NEVER ask researcher to type "approve" in chat.
+ The table IS the approval interface.
+
+ ═══════════════════════════════════════════════════════════════
+ YOUR 7 TOOLS
+ ═══════════════════════════════════════════════════════════════
+
+ Tool 1: load_scopus_csv(filepath)
+ Load CSV, show columns, estimate sentence count.
+
+ Tool 2: run_bertopic_discovery(run_key, threshold)
+ Split → embed → AgglomerativeClustering cosine → centroid nearest 5 → Plotly charts.
+
+ Tool 3: label_topics_with_llm(run_key)
+ 5 nearest centroid sentences → Mistral → label + research area + confidence.
+
+ Tool 4: consolidate_into_themes(run_key, theme_map)
+ Merge researcher-approved topic groups → recompute centroids → new evidence.
+
+ Tool 5: compare_with_taxonomy(run_key)
+ Compare themes against PAJAIS taxonomy (Jiang et al., 2019) → mapped vs NOVEL.
+
+ Tool 6: generate_comparison_csv()
+ Compare themes across abstract vs title runs.
+
+ Tool 7: export_narrative(run_key)
+ 500-word Section 7 draft via Mistral.
+
+ ═══════════════════════════════════════════════════════════════
+ RUN CONFIGURATIONS
+ ═══════════════════════════════════════════════════════════════
+
+ "abstract" — Abstract sentences only (~10 per paper)
+ "title" — Title only (1 per paper, 1,390 total)
+
+ ═══════════════════════════════════════════════════════════════
+ METHODOLOGY KNOWLEDGE (cite in conversation when relevant)
+ ═══════════════════════════════════════════════════════════════
+
+ Braun & Clarke (2006), Qualitative Research in Psychology, 3(2), 77-101:
+ - 6-phase reflexive thematic analysis (the framework we follow)
+ - "Phases are not linear — move back and forth as required"
+ - "When refinements are not adding anything substantial, stop"
+ - Researcher is active interpreter, not passive receiver of themes
+
+ Grootendorst (2022), arXiv:2203.05794 — BERTopic:
+ - Modular: any embedding, any clustering, any dim reduction
+ - Supports AgglomerativeClustering as alternative to HDBSCAN
+ - c-TF-IDF extracts distinguishing words per cluster
+ - BERTopic uses AgglomerativeClustering internally for topic reduction
+
+ Ward (1963), JASA + Lance & Williams (1967) — Agglomerative Clustering:
+ - Groups by pairwise cosine similarity threshold
+ - No density estimation needed — works in ANY dimension (384d)
+ - distance_threshold controls granularity (lower = more topics)
+ - Every sentence assigned to a cluster (no outliers)
+ - 62-year-old algorithm, gold standard for hierarchical grouping
+
+ Reimers & Gurevych (2019), EMNLP — Sentence-BERT:
+ - all-MiniLM-L6-v2 produces 384d normalized vectors
+ - Cosine similarity = semantic relatedness
+ - Same meaning clusters together regardless of exact wording
+
+ PACIS/ICIS Research Categories:
+ IS Design Science, HCI, E-Commerce, Knowledge Management,
+ IT Governance, Digital Innovation, Social Computing, Analytics,
+ IS Security, Green IS, Health IS, IS Education, IT Strategy
+
+ ═══════════════════════════════════════════════════════════════
+ B&C PHASE 1: FAMILIARIZATION WITH THE DATA
+ "Reading and re-reading, noting initial ideas"
+ Tool: load_scopus_csv
+ ═══════════════════════════════════════════════════════════════
+
+ CRITICAL ERROR HANDLING:
+ - If message says "[No CSV uploaded yet]" → respond:
+ "📂 Please upload your Scopus CSV file first using the upload
+ button at the top. Then type 'Run abstract only' to begin."
+ DO NOT call any tools. DO NOT guess filenames.
+ - If a tool returns an error → explain the error clearly and
+ suggest what the researcher should do next.
+
+ When researcher uploads CSV or says "analyze":
+
+ 1. Call load_scopus_csv(filepath) to inspect the data.
+
+ 2. DO NOT run BERTopic yet. Present the data landscape:
+
+ "📂 **Phase 1: Familiarization** (Braun & Clarke, 2006)
+
+ Loaded [N] papers (~[M] sentences estimated)
+ Columns: Title ✅ | Abstract ✅
+
+ Sentence-level approach: each abstract splits into ~10
+ sentences, each becomes a 384d vector. One paper can
+ contribute to MULTIPLE topics.
+
+ I will run 2 configurations:
+ 1️⃣ **Abstract only** — what papers FOUND (findings, methods, results)
+ 2️⃣ **Title only** — what papers CLAIM to be about (author's framing)
+
+ ⚙️ Defaults: threshold=0.7, cosine AgglomerativeClustering, 5 nearest
+
+ **Ready to proceed to Phase 2?**
+ • `run` — execute BERTopic discovery
+ • `run abstract` — single config
+ • `change threshold to 0.65` — more topics (stricter grouping)
+ • `change threshold to 0.8` — fewer topics (looser grouping)"
+
+ 3. WAIT for researcher confirmation before proceeding.
+
+ ═══════════════════════════════════════════════════════════════
+ B&C PHASE 2: GENERATING INITIAL CODES
+ "Systematically coding interesting features across the dataset"
+ Tools: run_bertopic_discovery → label_topics_with_llm
+ ═══════════════════════════════════════════════════════════════
+
+ After researcher confirms:
+
+ 1. Call run_bertopic_discovery(run_key, threshold)
+ → Splits papers into sentences (regex, min 30 chars)
+ → Filters publisher boilerplate (copyright, license text)
+ → Embeds with all-MiniLM-L6-v2 (384d, L2-normalized)
+ → AgglomerativeClustering cosine (no UMAP, no dimension reduction)
+ → Finds 5 nearest centroid sentences per topic
+ → Saves Plotly HTML visualizations
+ → Saves embeddings + summaries checkpoints
+
+ 2. Immediately call label_topics_with_llm(run_key)
+ → Sends ALL topics with 5 evidence sentences to Mistral
+ → Returns: label + research area + confidence.
+ NOTE: NO PACIS categories in Phase 2. PACIS comparison comes in Phase 5.5.
+
+ 3. Present CODED data with EVIDENCE under each topic:
+
+ "📋 **Phase 2: Initial Codes** — [N] codes from [M] sentences
+
+ **Code 0: Smart Tourism AI** [IS Design, high, 150 sent, 45 papers]
+ Evidence (5 nearest centroid sentences):
+ → "Neural networks predict tourist behavior..." — _Paper #42_
+ → "AI-powered systems optimize resource allocation..." — _Paper #156_
+ → "Deep learning models demonstrate superior accuracy..." — _Paper #78_
+ → "Machine learning classifies visitor patterns..." — _Paper #201_
+ → "ANN achieves 92% accuracy in demand forecasting..." — _Paper #89_
+
+ **Code 1: VR Destination Marketing** [HCI, high, 67 sent, 18 papers]
+ Evidence:
+ → ...
+
+ 📊 4 Plotly visualizations saved (download below)
+
+ **Review these codes. Ready for Phase 3 (theme search)?**
+ • `approve` — codes look good, move to theme grouping
+ • `re-run 0.65` — re-run with stricter threshold (more topics)
+ • `re-run 0.8` — re-run with looser threshold (fewer topics)
+ • `show topic 4 papers` — see all paper titles in topic 4
+ • `code 2 looks wrong` — I will show why it was labeled that way
+
+ 📋 **Review Table columns explained:**
+ | Column | Meaning |
+ |--------|---------|
+ | # | Topic number |
+ | Topic Label | AI-generated name from 5 nearest sentences |
+ | Research Area | General research area (NOT PACIS — that comes later in Phase 5.5) |
+ | Confidence | How well the 5 sentences match the label |
+ | Sentences | Number of sentences clustered here |
+ | Papers | Number of unique papers contributing sentences |
+ | Approve | Edit: yes/no — keep or reject this topic |
+ | Rename To | Edit: type new name if label is wrong |
+ | Your Reasoning | Edit: why you renamed/rejected |"
+
+ 4. ⛔ STOP HERE. Do NOT auto-proceed.
+ Say: "Codes generated. Review the table below.
+ Edit Approve/Rename columns, then click Submit Review to Agent."
+
+ 5. If researcher types "show topic X papers":
+ → Load summaries.json from checkpoint
+ → Find topic X
+ → List ALL paper titles in that topic (from paper_titles field)
+ → Format as numbered list:
+ "📄 **Topic 4: AI in Tourism** — 64 papers:
+ 1. Neural networks predict tourist behavior...
+ 2. Deep learning for hotel revenue management...
+ 3. AI-powered recommendation systems...
+ ...
+ Want to see the 5 key evidence sentences? Type `show topic 4`"
+
+ 6. If researcher types "show topic X":
+ → Show the 5 nearest centroid sentences with full paper titles
+
+ 7. If researcher questions a code:
+ → Show the 5 sentences that generated the label
+ → Explain reasoning: "AgglomerativeClustering groups sentences
+ where cosine distance < threshold. These sentences share
+ semantic proximity in 384d space even if keywords differ."
+ → Offer re-run with adjusted parameters
+
+ ═══════════════════════════════════════════════════════════════
+ B&C PHASE 3: SEARCHING FOR THEMES
+ "Collating codes into potential themes"
+ Tool: consolidate_into_themes
+ ═══════════════════════════════════════════════════════════════
+
+ After researcher approves Phase 2 codes:
+
+ 1. ANALYZE the labeled codes yourself. Look for:
+ → Codes with the SAME research area → likely one theme
+ → Codes with overlapping keywords in evidence → related
+ → Codes with shared papers across clusters → connected
+ → Codes that are sub-aspects of a broader concept → merge
+ → Codes that are niche/distinct → keep standalone
+
+ 2. Present MAPPING TABLE with reasoning:
+
+ "🔍 **Phase 3: Searching for Themes** (Braun & Clarke, 2006)
+
+ I analyzed [N] codes and propose [M] themes:
+
+ | Code (Phase 2) | → | Proposed Theme | Reasoning |
+ |---------------------------------|---|-----------------------|------------------------------|
+ | Code 0: Neural Network Tourism | → | AI & ML in Tourism | Same research area, |
+ | Code 1: Deep Learning Predict. | → | AI & ML in Tourism | shared methodology, |
+ | Code 5: ML Revenue Management | → | AI & ML in Tourism | Papers #42,#78 in all 3 |
+ | Code 2: VR Destination Mktg | → | VR & Metaverse | Both HCI category, |
+ | Code 3: Metaverse Experiences | → | VR & Metaverse | 'virtual reality' overlap |
+ | Code 4: Instagram Tourism | → | Social Media (alone) | Distinct platform focus |
+ | Code 8: Green Tourism | → | Sustainability (alone)| Niche, no overlap |
+
+ **Do you agree?**
+ • `agree` — consolidate as shown
+ • `group 4 6 call it Digital Marketing` — custom grouping
+ • `move code 5 to standalone` — adjust
+ • `split AI theme into two` — more granular"
+
+ 3. ⛔ STOP HERE. Do NOT proceed to Phase 4.
+ Say: "Review the consolidated themes in the table below.
+ Edit Approve/Rename columns, then click Submit Review to Agent."
+ WAIT for the researcher's Submit Review.
+
+ 4. ONLY after explicit approval, call:
+ consolidate_into_themes(run_key, {"AI & ML": [0,1,5], "VR": [2,3], ...})
+
+ 5. Present consolidated themes with NEW centroid evidence:
+
+ "🎯 **Themes consolidated** (new centroids computed)
+
+ **Theme: AI & ML in Tourism** (294 sent, 83 papers)
+ Merged from: Codes 0, 1, 5
+ New evidence (recalculated after merge):
+ → "Neural networks predict tourist behavior..." — _Paper #42_
+ → "Deep learning optimizes hotel pricing..." — _Paper #78_
+ → ...
+
+ ✅ Themes look correct? Or adjust?"
+
+ ═══════════════════════════════════════════════════════════════
+ B&C PHASE 4: REVIEWING THEMES
+ "Checking if themes work in relation to coded extracts
+ and the entire data set"
+ Tool: (conversation — no tool call, agent reasons)
+ ═══════════════════════════════════════════════════════════════
+
+ After consolidation, perform SATURATION CHECK:
+
+ 1. Analyze ALL theme pairs for remaining merge potential:
+
+ "🔍 **Phase 4: Reviewing Themes** — Saturation Analysis
+
+ | Theme A | Theme B | Overlap | Merge? | Why |
+ |-------------|-------------|---------|--------|--------------------|
+ | AI & ML | VR Tourism | None | ❌ | Different domains |
+ | AI & ML | ChatGPT | Low | ❌ | GenAI ≠ predictive |
+ | Social Media| VR Tourism | None | ❌ | Different channels |"
+
+ 2. If NO themes can merge:
+ "⛔ **Saturation reached** (per Braun & Clarke, 2006:
+ 'when refinements are not adding anything substantial, stop')
+
+ Reasoning:
+ 1. No remaining themes share a research area
+ 2. No keyword overlap between any theme pair
+ 3. Evidence sentences are semantically distinct
+ 4. Further merging would lose research distinctions
+
+ **Do you agree iteration is complete?**
+ • `agree` — finalize, move to Phase 5
+ • `try merging X and Y` — override my recommendation"
+
+ 3. If themes CAN still merge:
+ "🔄 **Further consolidation possible:**
+ Themes 'Social Media' and 'Digital Marketing' share 3 keywords.
+ Suggest merging. Want me to consolidate?"
+
+ 4. ⛔ STOP HERE. Do NOT proceed to Phase 5.
+ Say: "Saturation analysis complete. Review themes in the table.
+ Edit Approve/Rename columns, then click Submit Review to Agent."
+
+ ═══════════════════════════════════════════════════════════════
+ B&C PHASE 5: DEFINING AND NAMING THEMES
+ "Generating clear definitions and names"
+ Tool: (conversation — agent + researcher co-create)
+ ═══════════════════════════════════════════════════════════════
+
+ After saturation confirmed:
+
+ 1. Present final theme definitions:
+
+ "📝 **Phase 5: Theme Definitions**
+
+ **Theme 1: AI & Machine Learning in Tourism**
+ Definition: Research applying predictive ML/DL methods
+ (neural networks, random forests, deep learning) to tourism
+ problems including demand forecasting, pricing optimization,
+ and visitor behavior classification.
+ Scope: 294 sentences across 83 papers.
+ Research area: technology adoption. Confidence: High.
+
+ **Theme 2: Virtual Reality & Metaverse Tourism**
+ Definition: ...
+
+ **Want to rename any theme? Adjust any definition?**"
+
+ 2. ⛔ STOP HERE. Do NOT proceed to Phase 5.5 or second run.
+ Say: "Final theme names ready. Review in the table below.
+ Edit Rename To column if any names need changing, then click Submit Review."
+
+ 3. ONLY after approval: repeat ALL of Phase 2-5 for the SECOND run config.
+ (If first run was "abstract", now run "title" — or vice versa)
+
+ ═══════════════════════════════════════════════════════════════
+ PHASE 5.5: TAXONOMY COMPARISON
+ "Grounding themes against established IS research categories"
+ Tool: compare_with_taxonomy
+ ═══════════════════════════════════════════════════════════════
+
+ After BOTH runs have finalized themes (Phase 5 complete for each):
+
+ 1. Call compare_with_taxonomy(run_key) for each completed run.
+ → Mistral maps each theme to PAJAIS taxonomy (Jiang et al., 2019)
+ → Flags themes as MAPPED (known category) or NOVEL (emerging)
+
+ 2. Present the mapping with researcher review:
+
+ "📚 **Phase 5.5: Taxonomy Comparison** (Jiang et al., 2019)
+
+ **Mapped to established PAJAIS categories:**
+
+ | Your Theme | → | PAJAIS Category | Confidence | Reasoning |
+ |---|---|---|---|---|
+ | AI & ML in Tourism | → | Business Intelligence & Analytics | high | ML/DL methods for prediction |
+ | VR & Metaverse | → | Human Behavior & HCI | high | Immersive technology interaction |
+ | Social Media Tourism | → | Social Media & Business Impact | high | Direct category match |
+
+ **🆕 NOVEL themes (not in existing PAJAIS taxonomy):**
+
+ | Your Theme | Status | Reasoning |
+ |---|---|---|
+ | ChatGPT in Tourism | 🆕 NOVEL | Generative AI is post-2019, not in taxonomy |
+ | Sustainable AI Tourism | 🆕 NOVEL | Cross-cuts Green IT + Analytics |
+
+ These NOVEL themes represent **emerging research areas** that
+ extend beyond the established PAJAIS classification.
+
+ **Researcher: Review this mapping.**
+ • `approve` — mapping is correct
+ • `theme X should map to Y instead` — adjust
+ • `merge novel themes into one` — consolidate emerging themes
+ • `this novel theme is actually part of [category]` — reclassify"
+
+ 3. ⛔ STOP HERE. Do NOT proceed to Phase 6.
+ Say: "PAJAIS taxonomy mapping complete. Review in the table below.
+ Edit Approve column for any mappings you disagree with, then click Submit Review."
+
+ 4. ONLY after approval, ask:
+ "Want me to consolidate any novel themes with existing ones?
+ Or keep them separate as evidence of emerging research areas?"
+
+ 5. ⛔ STOP AGAIN. WAIT for this answer before generating report.
+
+ ═══════════════════════════════════════════════════════════════
+ B&C PHASE 6: PRODUCING THE REPORT
+ "Selection of vivid, compelling extract examples"
+ Tools: generate_comparison_csv → export_narrative
+ ═══════════════════════════════════════════════════════════════
+
+ After BOTH run configs have finalized themes:
+
+ 1. Call generate_comparison_csv()
+ → Compares themes across abstract vs title configs
+
+ 2. Say briefly in chat:
+ "Cross-run comparison complete. Check the Download tab for:
+ • comparison.csv — abstract vs title themes side by side
+ Review the themes in the table below.
+ Click Submit Review to confirm, then I'll generate the narrative."
+
+ 3. ⛔ STOP. Wait for Submit Review.
+
+ 4. After approval, call export_narrative(run_key)
+ → Mistral writes 500-word paper section referencing:
+ methodology, B&C phases, key themes, limitations
+
+ ═══════════════════════════════════════════════════════════════
+ CRITICAL RULES
+ ═══════════════════════════════════════════════════════════════
+
+ - ALWAYS follow B&C phases in order. Name each phase explicitly.
+ - ALWAYS wait for researcher confirmation between phases.
+ - ALWAYS show evidence sentences with paper metadata.
+ - ALWAYS cite B&C (2006) when discussing iteration or saturation.
+ - ALWAYS cite Grootendorst (2022) when explaining cluster behavior.
+ - ALWAYS call label_topics_with_llm before presenting topic labels.
+ - ALWAYS call compare_with_taxonomy before claiming PAJAIS mappings.
+ - Use threshold=0.7 as default (lower = more topics, higher = fewer).
+ - If too many topics (>200), suggest increasing threshold to 0.8.
+ - If too few topics (<20), suggest decreasing threshold to 0.6.
+ - NEVER skip Phase 4 saturation check or Phase 5.5 taxonomy comparison.
+ - NEVER proceed to Phase 6 without both runs completing Phase 5.5.
+ - NEVER invent topic labels — only present labels returned by Tool 3.
+ - NEVER cite paper IDs, titles, or sentences from memory — only from tool output.
+ - NEVER claim a theme is NOVEL or MAPPED without calling Tool 5 first.
+ - NEVER fabricate sentence counts or paper counts — only use tool-reported numbers.
+ - If a tool returns an error, explain clearly and continue.
+ - Keep responses concise. Tables + evidence, not paragraphs.
+
+ Current date: """ + datetime.now().strftime("%Y-%m-%d")
+
+ # Tool loader
+ def get_local_tools():
+     from tools import get_all_tools
+     return get_all_tools()
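The Phase 2 description in the prompt above says sentences are split with a regex, kept only if they have at least 30 characters, and stripped of publisher boilerplate. `tools.py` is not shown in this part of the diff, so the following is only a sketch of how such a step might look. `split_sentences`, the splitting regex, and the boilerplate patterns are all assumptions; only the 30-character minimum and the copyright/license filter come from the prompt.

```python
import re

MIN_CHARS = 30  # minimum sentence length stated in the system prompt

# Assumed boilerplate markers; the real filter in tools.py may differ.
BOILERPLATE = re.compile(
    r"©|\bcopyright\b|all rights reserved|\blicen[cs]e", re.IGNORECASE
)


def split_sentences(abstract):
    """Split an abstract into sentences, dropping short and boilerplate ones."""
    # Split on whitespace that follows sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return [
        part.strip()
        for part in parts
        if len(part.strip()) >= MIN_CHARS and not BOILERPLATE.search(part)
    ]


text = (
    "Neural networks predict tourist behavior in smart cities. "
    "Short one. "
    "© 2023 Elsevier Ltd. "
    "All rights reserved worldwide by the publisher."
)
print(split_sentences(text))
# ['Neural networks predict tourist behavior in smart cities.']
```

A punctuation-based split like this will mis-handle abbreviations such as "e.g." or "et al.", so a production pipeline might prefer a dedicated sentence tokenizer.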
app.py ADDED
@@ -0,0 +1,173 @@
+ import os
+ import glob
+ import json
+ import plotly.io as pio
+ import gradio as gr
+ from dotenv import load_dotenv
+ from langchain_mistralai import ChatMistralAI
+ from langgraph.prebuilt import create_react_agent
+ from langgraph.checkpoint.memory import MemorySaver
+ from agent import SYSTEM_PROMPT, get_local_tools
+
+ os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
+ load_dotenv()
+
+ OUTPUT_DIR = "outputs"
+ CHECKPOINT_DIR = os.path.join(OUTPUT_DIR, "checkpoints")
+ os.makedirs(CHECKPOINT_DIR, exist_ok=True)
+
+ llm = ChatMistralAI(model="mistral-small-latest", temperature=0, timeout=300)
+ agent = create_react_agent(model=llm, tools=get_local_tools(), prompt=SYSTEM_PROMPT, checkpointer=MemorySaver())
+ _msg_count = 0
+ _uploaded = {"path": ""}
+
+ theme = gr.themes.Soft(
+     primary_hue="indigo",
+     secondary_hue="violet",
+     neutral_hue="slate",
+     font=gr.themes.GoogleFont("Outfit"),
+     font_mono=gr.themes.GoogleFont("JetBrains Mono"),
+ ).set(
+     body_background_fill="*neutral_50",
+     block_title_text_weight="700",
+     button_primary_background_fill="*primary_600",
+ )
+
+ def _latest_output():
+     # Rank output files by pipeline stage; `order` avoids shadowing the built-in `ord`.
+     order = {"summaries": 1, "labels": 2, "themes": 3, "taxonomy": 4, "comparison": 9, "narrative": 10}
+     fs = glob.glob(f"{OUTPUT_DIR}/rq4_*.csv") + glob.glob(f"{CHECKPOINT_DIR}/rq4_*.json")
+     scored = sorted([(sum(v * (k in f) for k, v in order.items()), f) for f in fs], key=lambda x: x[0])
+     return [x[1] for x in scored] or None
+
+ def _build_progress():
+     # Each stage counts as done once its checkpoint or output file exists.
+     ps = [
+         ("Load", bool(glob.glob(f"{CHECKPOINT_DIR}/rq4_*_summaries.json"))),
+         ("Codes", bool(glob.glob(f"{CHECKPOINT_DIR}/rq4_*_labels.json"))),
+         ("Themes", bool(glob.glob(f"{CHECKPOINT_DIR}/rq4_*_themes.json"))),
+         ("PAJAIS", bool(glob.glob(f"{CHECKPOINT_DIR}/rq4_*_taxonomy_map.json"))),
+         ("Report", bool(glob.glob(f"{OUTPUT_DIR}/rq4_comparison.csv"))),
+     ]
+     return " → ".join(f"{'✅' if d else '⬜'} {n}" for n, d in ps)
+
+ def respond(message, chat_history, uploaded_file):
+     global _msg_count
+     _msg_count += 1
+     _uploaded["path"] = uploaded_file or _uploaded.get("path", "")
+     # SYSTEM_PROMPT checks for the literal marker "[No CSV uploaded yet]", so send exactly that.
+     text = (message or "Analyze") + (f"\n[CSV: {_uploaded['path']}]" if _uploaded["path"] else "\n[No CSV uploaded yet]")
+
+     chat_history.append({"role": "user", "content": message or "Analyze"})
+     chat_history.append({"role": "assistant", "content": "🔬 **Working...**"})
+     yield chat_history, "", _latest_output()
+
+     res = agent.invoke({"messages": [("human", text)]}, config={"configurable": {"thread_id": "session"}})
+     chat_history[-1] = {"role": "assistant", "content": res["messages"][-1].content}
+     yield chat_history, "", _latest_output()
+
+ def _load_chart(name):
+     path = os.path.join(OUTPUT_DIR, name) if name else ""
+     if not path or not os.path.exists(path):
+         return None
+     # Use a context manager so the file handle is closed promptly.
+     with open(path, encoding="utf-8") as fh:
+         return pio.from_json(fh.read())
+
+ def _get_chart_choices():
+     return [os.path.basename(f) for f in sorted(glob.glob(f"{OUTPUT_DIR}/rq4_*.json"))]
+
+ def _load_review_table():
+     ps = sorted(glob.glob(f"{CHECKPOINT_DIR}/rq4_*.json"))
+     if not ps:
+         return [[0, "No data", "", 0, 0, False, "", ""]]
+     with open(ps[-1], encoding="utf-8") as fh:
+         data = json.load(fh)
+     # `or [{}]` also covers an empty "nearest" list, which would otherwise raise IndexError.
+     return [[i, d.get("label", d.get("top_words", ""))[:60], (d.get("nearest") or [{}])[0].get("sentence", "")[:120], d.get("sentence_count", 0), d.get("paper_count", 0), True, "", ""] for i, d in enumerate(data)]
+
+ def _show_papers_by_select(table_data, evt: gr.SelectData):
+     # The table may arrive as a DataFrame or as a list of rows; column 0 holds the topic number.
+     idx = int(table_data.iloc[evt.index[0], 0]) if hasattr(table_data, 'iloc') else int(table_data[evt.index[0]][0])
+     fs = sorted(glob.glob(f"{CHECKPOINT_DIR}/rq4_*_labels.json")) or sorted(glob.glob(f"{CHECKPOINT_DIR}/rq4_*_summaries.json"))
+     for f in fs:
+         with open(f, encoding="utf-8") as fh:
+             topics = json.load(fh)
+         for t in topics:
+             if t.get("topic_id") == idx:
+                 return f"Topic {idx}: {t.get('label', '')}\n\n" + "\n".join(f"- {p}" for p in t.get("paper_titles", []))
+     return "Not found"
+
+ def _submit_review(table_data, chat_history):
+     # Row layout: r[5] is the Approve flag and r[6] is the "Rename To" value.
+     ls = [f"Topic {int(r[0])}: {'RENAME to '+r[6] if r[6] else ('APPROVE' if r[5] else 'REJECT')}" for r in table_data.values.tolist()]
+     msg = "Review decisions:\n" + "\n".join(ls)
+     chat_history.append({"role": "user", "content": "Submitted review"})
+     chat_history.append({"role": "assistant", "content": "🔬 **Processing...**"})
+     yield chat_history, _latest_output(), gr.update(), gr.update(), _build_progress()
+
+     res = agent.invoke({"messages": [("human", msg)]}, config={"configurable": {"thread_id": "session"}})
+     chat_history[-1] = {"role": "assistant", "content": res["messages"][-1].content}
+     yield chat_history, _latest_output(), gr.update(choices=_get_chart_choices()), _load_review_table(), _build_progress()
+
+ CSS = """
+ .gradio-container { background: #fcfcfc !important; }
+ .sidebar { background: #ffffff !important; border-right: 1px solid #e2e8f0 !important; }
+ .header-text { font-family: 'Outfit', sans-serif; color: #1e293b; letter-spacing: -0.02em; }
+ .tab-nav { border-bottom: 2px solid #f1f5f9 !important; }
+ .chatbot-container { border-radius: 12px !important; border: 1px solid #e2e8f0 !important; overflow: hidden; }
+ .primary-btn { background: #4f46e5 !important; color: white !important; border-radius: 8px !important; font-weight: 600 !important; }
+ .secondary-btn { background: #f8fafc !important; color: #475569 !important; border: 1px solid #e2e8f0 !important; border-radius: 8px !important; }
+ """
+
+ theme = gr.themes.Soft(
+     primary_hue="indigo",
+     secondary_hue="violet",
+     neutral_hue="slate",
+     font=gr.themes.GoogleFont("Outfit"),
+     font_mono=gr.themes.GoogleFont("JetBrains Mono"),
+ ).set(
+     body_background_fill="*neutral_50",
+     block_title_text_weight="700",
+     button_primary_background_fill="*primary_600",
+     button_primary_text_color="white",
+ )
+
+ with gr.Blocks(title="Thematic Analysis AI", theme=theme, css=CSS) as demo:
+     with gr.Sidebar(label="Data Hub", open=True):
+         gr.HTML("<h2 class='header-text'>📁 Resource Center</h2>")
+         upload = gr.File(label="Dataset (Scopus CSV)", file_types=[".csv"], elem_id="file-upload")
+         progress = gr.Markdown(value=_build_progress(), elem_id="progress-display")
+         # Gradio has no Divider component; a Markdown horizontal rule gives the same visual break.
+         gr.Markdown("---")
+         gr.Markdown("### 🛠️ Configuration\nModel: `mistral-small-latest`\nPipeline: `BERTopic + Agglomerative`")
+
+     gr.HTML("<h1 class='header-text' style='margin-bottom: 20px;'>🔬 Topic Modelling Agentic AI</h1>")
+
+     with gr.Tabs():
+         with gr.Tab("💬 Agent Chat"):
+             chatbot = gr.Chatbot(height=450, show_label=False, elem_classes="chatbot-container")
+             with gr.Row():
+                 msg = gr.Textbox(placeholder="Ask the agent to analyze, group, or export...", show_label=False, scale=9)
+                 send = gr.Button("Send", variant="primary", scale=1, elem_classes="primary-btn")
+
+         with gr.Tab("📋 Review & Refine"):
+             gr.Markdown("### 🔍 Topic Validation Table\nReview the identified themes and rename or reject as needed.")
+             table = gr.Dataframe(headers=["#", "Label", "Key Evidence", "Sents", "Papers", "Approve", "Rename", "Reasoning"], datatype=["number", "str", "str", "number", "number", "bool", "str", "str"], interactive=True)
+             with gr.Row():
+                 submit = gr.Button("Submit Review Decisions", variant="primary", scale=2, elem_classes="primary-btn")
+                 clear = gr.Button("Refresh Table", variant="secondary", scale=1, elem_classes="secondary-btn")
+             papers = gr.Textbox(label="Full Context: Papers in Selected Topic", lines=6, interactive=False)
+
+         with gr.Tab("📊 Visual Analytics"):
+             gr.Markdown("### 📈 Interactive Topic Visualizations")
+             with gr.Row():
+                 selector = gr.Dropdown(choices=[], label="Select Visualization Type", scale=7)
+                 refresh_viz = gr.Button("Refresh Charts", variant="secondary", scale=1)
+             display = gr.Plot()
+
+         with gr.Tab("📥 Export Control"):
+             gr.Markdown("### 💾 Final Outputs\nDownload generated papers, narratives, and comparison matrices.")
+             download = gr.File(label="Available Exports", file_count="multiple")
+
+     def respond_with_viz(m, h, u):
+         g = respond(m, h, u)
+         for hist, _, dl in g:
+             cs = _get_chart_choices()
+             yield hist, "", dl, gr.update(choices=cs, value=cs[-1] if cs else None), _load_chart(cs[-1]) if cs else None, _load_review_table(), _build_progress()
+
+     def _on_upload(f, h):
+         # Use a real generator function (not a lambda that returns a generator) so the handler streams its updates.
+         yield from respond_with_viz("Analyze CSV", h, f)
+
+     msg.submit(respond_with_viz, [msg, chatbot, upload], [chatbot, msg, download, selector, display, table, progress])
+     send.click(respond_with_viz, [msg, chatbot, upload], [chatbot, msg, download, selector, display, table, progress])
+     selector.change(_load_chart, [selector], [display])
+     table.select(_show_papers_by_select, [table], [papers])
+     submit.click(_submit_review, [table, chatbot], [chatbot, download, selector, table, progress])
+     upload.change(_on_upload, [upload, chatbot], [chatbot, msg, download, selector, display, table, progress])
+
+
+ if __name__ == "__main__":
+     demo.launch(server_name="0.0.0.0", server_port=7860, ssr_mode=False)
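
The precedence that `_submit_review` applies to each table row (a non-empty Rename wins, otherwise the Approve checkbox decides) can be sketched as a standalone function. This is a minimal illustration rather than the app's own code: `format_review_decisions` is a hypothetical name, and stripping whitespace from the rename field is an added assumption.

```python
def format_review_decisions(rows):
    """Turn review-table rows into one instruction line per topic.

    Each row follows the table's column order:
    [#, Label, Key Evidence, Sents, Papers, Approve, Rename, Reasoning].
    """
    lines = []
    for row in rows:
        topic_id = int(row[0])
        approve = bool(row[5])
        rename = str(row[6] or "").strip()  # assumption: blank or whitespace-only means "no rename"
        if rename:
            decision = f"RENAME to {rename}"
        elif approve:
            decision = "APPROVE"
        else:
            decision = "REJECT"
        lines.append(f"Topic {topic_id}: {decision}")
    return "Review decisions:\n" + "\n".join(lines)


# Toy rows, invented for illustration only.
rows = [
    [0, "Adoption", "", 12, 4, True, "", ""],
    [1, "Misc", "", 3, 2, False, "", ""],
    [2, "Trust", "", 9, 5, True, "Trust in AI", ""],
]
print(format_review_decisions(rows))
```

Keeping this logic in a pure function makes the message sent to the agent easy to check without starting the Gradio UI.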
requirements.txt ADDED
@@ -0,0 +1,13 @@
+ # requirements.txt v2.0 | 4 April 2026
+ # BERTopic + Mistral LLM (French, Apache 2.0, GDPR-safe)
+ langchain
+ langchain-mistralai
+ langgraph
+ langchain-core
+ bertopic
+ sentence-transformers
+ numpy
+ pandas
+ plotly
+ kaleido
+ gradio
tools.py ADDED
@@ -0,0 +1,182 @@
+ from langchain_core.tools import tool
+ import os
+ import json
+ import re
+ import numpy as np
+ import pandas as pd
+
+ CHECKPOINT_DIR = "/tmp/checkpoints"
+ os.makedirs(CHECKPOINT_DIR, exist_ok=True)
+
+ NEAREST_K = 5
+ SENT_SPLIT_RE = r'(?<=[.!?])\s+(?=[A-Z])'
+ MIN_SENT_LEN = 30
+ RUN_CONFIGS = {"abstract": ["Abstract"], "title": ["Title"]}
+ _data = {}
+
+ def _split_sentences(text):
+     raw = re.split(SENT_SPLIT_RE, str(text))
+     return list(filter(lambda s: len(s.strip()) >= MIN_SENT_LEN, raw))
+
+ @tool
+ def load_scopus_csv(filepath: str) -> str:
+     """Load a Scopus CSV export and report basic dataset statistics."""
+     df = pd.read_csv(filepath, encoding="utf-8-sig")
+     _data["df"] = df
+     cols = [c for c in ["Title", "Abstract", "Author Keywords"] if c in df.columns]
+     sample = df[cols].head(3).to_string(max_colwidth=80)
+     nulls = ", ".join([f"{c}: {df[c].notna().sum()}/{len(df)}" for c in cols])
+
+     avg_sents = df["Abstract"].head(5).apply(_split_sentences).apply(len).mean()
+     est = int(avg_sents * len(df))
+
+     return (f"📊 **Dataset Statistics:**\n"
+             f"- **Papers:** {len(df)}\n"
+             f"- **Abstract sentences:** ~{est}\n"
+             f"- **Title sentences:** {int(df['Title'].notna().sum())}\n"
+             f"- **Non-null:** {nulls}\n\n"
+             f"Columns: {', '.join(list(df.columns)[:15])}\n\n"
+             f"Sample:\n{sample}")
+
+ @tool
+ def run_bertopic_discovery(run_key: str, threshold: float = 0.7) -> str:
+     """Split the chosen column into sentences, embed them, and cluster them into topics with BERTopic."""
+     from bertopic import BERTopic
+     from sentence_transformers import SentenceTransformer
+     from sklearn.preprocessing import FunctionTransformer
+     from sklearn.cluster import AgglomerativeClustering
+
+     df = _data["df"].copy()
+     available = [c for c in RUN_CONFIGS[run_key] if c in df.columns]
+     df["_text"] = df[available].fillna("").agg(" ".join, axis=1)
+     df["_paper_id"] = df.index
+     df["_sentences"] = df["_text"].apply(_split_sentences)
+
+     meta = [c for c in ["_paper_id", "Title", "Author Keywords", "_sentences"] if c in df.columns]
+     sent_df = df[meta].explode("_sentences").rename(columns={"_sentences": "text"}).dropna(subset=["text"]).reset_index(drop=True)
+     sent_df["sent_id"] = sent_df.groupby("_paper_id").cumcount()
+
+     patterns = r"Licensee MDPI|Published by Informa|Published by Elsevier|Taylor & Francis|Copyright ©|Creative Commons|open access article|Inderscience Enterprises|All rights reserved|Springer Nature|Emerald Publishing|limitations and (future|implications|discussed)|implications (are|were) (discussed|presented)|concludes with .* implications"
+     sent_df = sent_df[~sent_df["text"].str.contains(patterns, case=False, regex=True, na=False)].reset_index(drop=True)
+
+     embedder = SentenceTransformer("all-MiniLM-L6-v2")
+     embs = embedder.encode(sent_df["text"].tolist(), show_progress_bar=False, normalize_embeddings=True)
+     np.save(f"{CHECKPOINT_DIR}/rq4_{run_key}_emb.npy", embs)
+
+     cluster = AgglomerativeClustering(n_clusters=None, metric="cosine", linkage="average", distance_threshold=threshold)
+     model = BERTopic(hdbscan_model=cluster, umap_model=FunctionTransformer())
+     topics, _ = model.fit_transform(sent_df["text"].tolist(), embs)
+
+     _data[f"{run_key}_model"] = model
+     _data[f"{run_key}_topics"] = np.array(topics)
+     _data[f"{run_key}_embeddings"] = embs
+     _data[f"{run_key}_sent_df"] = sent_df
+
+     n = len(set(topics)) - int(-1 in topics)
+     # Write each chart only when there are enough topics for it to render.
+     if n >= 3:
+         model.visualize_topics().write_html(f"/tmp/rq4_{run_key}_intertopic.html")
+     if n >= 1:
+         model.visualize_barchart(top_n_topics=min(10, n)).write_html(f"/tmp/rq4_{run_key}_bars.html")
+     if n >= 2:
+         model.visualize_hierarchy().write_html(f"/tmp/rq4_{run_key}_hierarchy.html")
+         model.visualize_heatmap().write_html(f"/tmp/rq4_{run_key}_heatmap.html")
+
+     t_arr = np.array(topics)
+     valid = [r for r in model.get_topic_info().to_dict("records") if r["Topic"] != -1]
+
+     def _centroid(row):
+         mask = t_arr == row["Topic"]
+         m_idx = np.where(mask)[0]
+         m_embs = embs[mask]
+         cent = m_embs.mean(axis=0)
+         dists = 1 - (m_embs @ cent) / (np.linalg.norm(m_embs, axis=1) * np.linalg.norm(cent) + 1e-10)
+         near = np.argsort(dists)[:NEAREST_K]
+
+         evidence = [{"sentence": str(sent_df.iloc[m_idx[i]]["text"])[:250], "paper_id": int(sent_df.iloc[m_idx[i]]["_paper_id"]), "title": str(sent_df.iloc[m_idx[i]].get("Title", ""))[:150], "keywords": str(sent_df.iloc[m_idx[i]].get("Author Keywords", ""))[:150]} for i in near]
+         p_df = sent_df.iloc[m_idx].drop_duplicates(subset=["_paper_id"])
+         titles = [str(p_df.iloc[i].get("Title", ""))[:200] for i in range(min(50, len(p_df)))]
+
+         return {"topic_id": int(row["Topic"]), "sentence_count": int(row["Count"]), "paper_count": len(p_df), "top_words": str(row.get("Name", ""))[:100], "nearest": evidence, "paper_titles": titles}
+
+     sums = list(map(_centroid, valid))
+     json.dump(sums, open(f"{CHECKPOINT_DIR}/rq4_{run_key}_summaries.json", "w"), indent=2, default=str)
+
+     lines = [f" Topic {s['topic_id']} ({s['sentence_count']} sents, {s['paper_count']} papers): {s['top_words']}" for s in sums]
+     return f"[{run_key}] {n} topics from {len(sent_df)} sentences.\n\n" + "\n".join(lines)
+
+ @tool
+ def label_topics_with_llm(run_key: str) -> str:
+     """Ask the LLM to label the largest topics of a run from their nearest-to-centroid sentences."""
+     from langchain_mistralai import ChatMistralAI
+     from langchain_core.prompts import PromptTemplate
+     from langchain_core.output_parsers import JsonOutputParser
+
+     sums = json.load(open(f"{CHECKPOINT_DIR}/rq4_{run_key}_summaries.json"))
+     to_label = sorted(sums, key=lambda s: s.get("sentence_count", 0), reverse=True)[:100]
+
+     block = "\n\n".join([f"Topic {s['topic_id']} ({s['sentence_count']} sents):\n{NEAREST_K} entries:\n" + "\n".join([f"- {e['sentence']}\n Paper: {e['title']}" for e in s["nearest"]]) for s in to_label])
+
+     prompt = PromptTemplate.from_template("Return JSON ARRAY of objects with topic_id, label, category, confidence, reasoning, niche for:\n{topics}")
+     llm = ChatMistralAI(model="mistral-small-latest", temperature=0)
+     labels = (prompt | llm | JsonOutputParser()).invoke({"topics": block})
+
+     # Join labels to summaries by topic_id: a positional zip misaligns them because to_label is sorted by size,
+     # and padding with `sums` would overwrite unlabeled topics with other topics' data.
+     by_id = {int(l["topic_id"]): l for l in labels if isinstance(l, dict) and "topic_id" in l}
+     labeled = [{**s, **by_id.get(s["topic_id"], {})} for s in sums]
+     json.dump(labeled, open(f"{CHECKPOINT_DIR}/rq4_{run_key}_labels.json", "w"), indent=2, default=str)
+
+     lines = [f" **Topic {l.get('topic_id')}: {l.get('label')}** [{l.get('category')}] ({l.get('sentence_count')} sents)" for l in labeled]
+     return f"[{run_key}] {len(labeled)} topics labeled.\n\n" + "\n\n".join(lines)
+
+ @tool
+ def generate_comparison_csv() -> str:
+     """Combine the labeled topics of every completed run into one comparison CSV."""
+     done = [k for k in RUN_CONFIGS.keys() if os.path.exists(f"{CHECKPOINT_DIR}/rq4_{k}_labels.json")]
+     rows = []
+     for k in done:
+         ls = json.load(open(f"{CHECKPOINT_DIR}/rq4_{k}_labels.json"))
+         rows.extend([{"run": k, "topic_id": l.get("topic_id"), "label": l.get("label"), "category": l.get("category"), "sentences": l.get("sentence_count"), "papers": l.get("paper_count")} for l in ls])
+
+     df = pd.DataFrame(rows)
+     df.to_csv("/tmp/rq4_comparison.csv", index=False)
+     return f"Saved to /tmp/rq4_comparison.csv\n\n{df.to_string(index=False)}"
+
+ @tool
+ def export_narrative(run_key: str) -> str:
+     """Generate a 500-word results narrative for a run from its labeled topics and save it to disk."""
+     from langchain_mistralai import ChatMistralAI
+     ls = json.load(open(f"{CHECKPOINT_DIR}/rq4_{run_key}_labels.json"))
+     txt = "\n".join([f"- {l.get('label')} ({l.get('sentence_count')} sents)" for l in ls])
+     llm = ChatMistralAI(model="mistral-small-latest", temperature=0.3)
+     res = llm.invoke(f"Write a 500-word Section 7 'Topic Modeling Results' for {run_key} run:\n{txt}")
+     open("/tmp/rq4_narrative.txt", "w", encoding="utf-8").write(res.content)
+     return f"Saved to /tmp/rq4_narrative.txt\n\n{res.content}"
+
+ @tool
+ def consolidate_into_themes(run_key: str, theme_map: dict) -> str:
+     """Merge discovered topics into named themes; theme_map maps each theme name to a list of topic ids."""
+     t_arr, embs, s_df = _data[f"{run_key}_topics"], _data[f"{run_key}_embeddings"], _data[f"{run_key}_sent_df"]
+
+     def _build(name, ids):
+         mask = np.isin(t_arr, ids)
+         m_idx, m_embs = np.where(mask)[0], embs[mask]
+         cent = m_embs.mean(axis=0)
+         dists = 1 - (m_embs @ cent) / (np.linalg.norm(m_embs, axis=1) * np.linalg.norm(cent) + 1e-10)
+         near = np.argsort(dists)[:NEAREST_K]
+         evidence = [{"sentence": str(s_df.iloc[m_idx[i]]["text"])[:250], "title": str(s_df.iloc[m_idx[i]].get("Title", ""))[:150]} for i in near]
+         return {"label": name, "merged_topics": list(ids), "sentence_count": int(mask.sum()), "paper_count": int(s_df.iloc[m_idx]["_paper_id"].nunique()), "nearest": evidence}
+
+     themes = [{"topic_id": i, **_build(n, ids)} for i, (n, ids) in enumerate(theme_map.items())]
+     json.dump(themes, open(f"{CHECKPOINT_DIR}/rq4_{run_key}_themes.json", "w"), indent=2, default=str)
+     lines = [f" **{t['label']}** ({t['sentence_count']} sents)" for t in themes]
+     return f"[{run_key}] {len(themes)} themes.\n\n" + "\n".join(lines)
+
+ PAJAIS = ["Electronic Business", "HCI", "IS Strategy", "Business Intelligence", "Design Science", "Enterprise Systems", "Adoption", "Social Media", "Cultural Issues", "Security", "Smart/IoT", "Knowledge Management", "Digital Platform", "Healthcare", "Project Management", "Service Science", "Social/Org Aspects", "Research Methods", "E-Finance", "E-Government", "Education", "Sustainability"]
+
+ @tool
+ def compare_with_taxonomy(run_key: str) -> str:
+     """Map a run's themes (or its labeled topics, if no themes exist) onto the PAJAIS taxonomy with the LLM."""
+     from langchain_mistralai import ChatMistralAI
+     from langchain_core.prompts import PromptTemplate
+     from langchain_core.output_parsers import JsonOutputParser
+
+     # Prefer consolidated themes; fall back to the labeled topics.
+     themes_path = f"{CHECKPOINT_DIR}/rq4_{run_key}_themes.json"
+     src = themes_path if os.path.exists(themes_path) else f"{CHECKPOINT_DIR}/rq4_{run_key}_labels.json"
+     ts = json.load(open(src))
+     prompt = PromptTemplate.from_template("Map themes to PAJAIS taxonomy or mark 'NOVEL'. Return JSON array for:\nThemes:\n{ts}\nTaxonomy:\n{tax}")
+     llm = ChatMistralAI(model="mistral-small-latest", temperature=0)
+     ms = (prompt | llm | JsonOutputParser()).invoke({"ts": "\n".join([t['label'] for t in ts]), "tax": "\n".join(PAJAIS)})
+     json.dump(ms, open(f"{CHECKPOINT_DIR}/rq4_{run_key}_taxonomy_map.json", "w"), indent=2, default=str)
+     return f"[{run_key}] Mapping complete."
+
+ def get_all_tools():
+     ts = [load_scopus_csv, run_bertopic_discovery, label_topics_with_llm, consolidate_into_themes, compare_with_taxonomy, generate_comparison_csv, export_narrative]
+     for t in ts: setattr(t, 'handle_tool_error', True)
+     return ts
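
The evidence selection used in `_centroid` and `_build` (pick the `NEAREST_K` sentences with the smallest cosine distance to the topic's mean embedding) can be sketched in isolation. `nearest_to_centroid` and the toy 2-D vectors below are hypothetical, introduced only for illustration; the pipeline itself works on sentence embeddings from `all-MiniLM-L6-v2`.

```python
import numpy as np


def nearest_to_centroid(embeddings, k=5):
    """Return indices of the k rows closest (by cosine distance) to the mean embedding."""
    embeddings = np.asarray(embeddings, dtype=float)
    centroid = embeddings.mean(axis=0)
    # The small epsilon guards against division by zero for all-zero vectors.
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(centroid) + 1e-10
    distances = 1.0 - (embeddings @ centroid) / norms
    return np.argsort(distances)[:k]


# Toy vectors: three point roughly along the x-axis, one along the y-axis.
points = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]])
print(nearest_to_centroid(points, k=2))
```

Because the centroid is pulled toward the majority direction, the outlier along the y-axis ranks last, which is why these nearest sentences serve as representative evidence for a topic.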