Initial commit (Clean)
Daksh C Jain committed · Commit d2a404d · 0 Parent(s)
- .gitignore +5 -0
- README.md +88 -0
- agent.py +470 -0
- app.py +173 -0
- requirements.txt +13 -0
- tools.py +182 -0
.gitignore
ADDED
@@ -0,0 +1,5 @@
.env
outputs/
checkpoints/
__pycache__/
*.pyc
README.md
ADDED
@@ -0,0 +1,88 @@
# 🔬 Topic Modelling Agentic AI

A professional, agent-driven platform for automated **Reflexive Thematic Analysis** (Braun & Clarke, 2006) using state-of-the-art Natural Language Processing. Built with LangGraph, BERTopic, and Mistral AI, this agent automates the discovery, labeling, and synthesis of research topics from large-scale academic datasets (e.g., Scopus CSV exports).

---

## 🚀 Overview

This project implements a sophisticated "Golden Thread" pipeline for qualitative research. It moves beyond traditional keyword extraction by using sentence-level embeddings and LLM-powered context awareness to identify nuanced themes.
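
As a rough illustration of the sentence-level step, the sketch below splits an abstract into sentences with a simple regex and drops very short fragments. The `split_sentences` helper and its regex are assumptions made for this example; the 30-character minimum mirrors the "regex, min 30 chars" note in the agent prompt, and the actual splitting logic in `tools.py` may differ.

```python
import re

# Hypothetical boundary rule: split after ., ! or ? when followed by whitespace.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str, min_chars: int = 30) -> list[str]:
    """Return the sentences in `text` that have at least `min_chars` characters."""
    parts = _SENTENCE_BOUNDARY.split(text.strip())
    return [p.strip() for p in parts if len(p.strip()) >= min_chars]


abstract = (
    "Neural networks predict tourist behavior from booking data. "
    "Results improve. "
    "We evaluate the approach on a large set of hotel reviews."
)
sentences = split_sentences(abstract)
print(sentences)  # the short middle fragment is filtered out
```

Each surviving sentence would then be embedded on its own, which is why one paper can contribute to several topics.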

### Key Features
- **Agentic Workflow**: Powered by **LangGraph**, the agent autonomously decides when to load data, run clustering, or call the LLM for labeling.
- **Precision Clustering**: Uses **BERTopic** with Agglomerative Clustering (Cosine similarity) on 384d sentence embeddings (`all-MiniLM-L6-v2`).
- **Human-in-the-Loop**: An interactive Gradio UI allows researchers to review, rename, or reject agent-generated topics before final synthesis.
- **Automated Synthesis**: Generates a 500-word research narrative and maps themes to established taxonomies (e.g., PAJAIS).
- **Rich Visualizations**: Interactive Plotly charts including Intertopic Distance Maps, Hierarchical Clustering, and Heatmaps.
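
To make the clustering feature more concrete, here is a toy, pure-Python sketch of threshold-based agglomerative clustering on cosine distance. It is not the scikit-learn or BERTopic implementation used by this project; the `agglomerate` function, the average-linkage choice, and the 2-D vectors (standing in for 384d embeddings) are all assumptions for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def agglomerate(vectors, threshold):
    """Merge the closest pair of clusters until none is closer than `threshold`."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Average linkage: mean pairwise distance between two clusters.
        dists = [cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        dist, i, j = min(
            (
                (linkage(clusters[i], clusters[j]), i, j)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
            ),
            key=lambda t: t[0],
        )
        if dist >= threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (0.7, 0.7)]
for threshold in (0.05, 0.5):
    print(threshold, agglomerate(vectors, threshold))
```

With these toy vectors the lower threshold leaves three clusters and the higher one leaves two, which matches the rule of thumb that a lower distance threshold produces more topics. Every vector ends up in some cluster, so there is no outlier bucket.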

---

## 🛠️ Technology Stack

- **Framework**: [LangGraph](https://github.com/langchain-ai/langgraph) (Agentic logic & state management)
- **Engine**: [BERTopic](https://github.com/MaartenGr/BERTopic) (Topic Modeling pipeline)
- **LLM**: [Mistral AI](https://mistral.ai/) (`mistral-small-latest`)
- **Embeddings**: `sentence-transformers/all-MiniLM-L6-v2`
- **UI**: [Gradio 5.x](https://gradio.app/)
- **Data**: Pandas, NumPy, Scikit-Learn

---

## 📋 Methodology

The agent follows the **Braun & Clarke (2006)** six-phase thematic analysis framework:

1. **Familiarization**: Loading and preprocessing Scopus CSV metadata.
2. **Initial Coding**: Sentence-level clustering to identify "semantic atoms."
3. **Searching for Themes**: Aggregating clusters into broader research themes.
4. **Reviewing Themes**: Researcher validation via the Review Table.
5. **Defining and Naming**: Refined LLM labeling based on centroid-nearest evidence.
6. **Producing the Report**: Exporting narrative sections and comparison matrices.
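
The "centroid-nearest evidence" used for labeling can be sketched as follows: average a cluster's normalized vectors, then rank its sentences by cosine similarity to that centroid. The `nearest_to_centroid` helper and the toy 2-D vectors are hypothetical and only illustrate the idea; the project's own selection code may be organized differently.

```python
import math


def normalize(v):
    """Scale a vector to unit length; assumes it is not all zeros."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def nearest_to_centroid(sentences, vectors, k=5):
    """Return the k sentences whose vectors are most similar to the cluster centroid."""
    unit = [normalize(v) for v in vectors]
    dims = len(unit[0])
    centroid = normalize([sum(v[d] for v in unit) / len(unit) for d in range(dims)])
    scored = [
        (sum(a * b for a, b in zip(v, centroid)), s)  # cosine similarity of unit vectors
        for v, s in zip(unit, sentences)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]


evidence = nearest_to_centroid(
    ["A", "B", "C"],
    [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0)],
    k=2,
)
print(evidence)
```

Here sentence "B" sits between the other two, so it is closest to the centroid and would be the first piece of evidence passed to the LLM.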

---

## 💻 Setup & Installation

### Prerequisites
- Python 3.10+
- Mistral AI API Key

### Installation

1. **Clone the repository**:
   ```bash
   git clone https://github.com/your-repo/topic-modelling-agent.git
   cd topic-modelling-agent
   ```

2. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

3. **Configure environment**:
   Create a `.env` file in the root directory:
   ```env
   MISTRAL_API_KEY=your_api_key_here
   ```

4. **Run the application**:
   ```bash
   python app.py
   ```
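
If the app fails at startup, a missing or placeholder key is a likely cause. The check below is a minimal sketch, assuming the `.env` values have already been loaded into the process environment (for example by `load_dotenv`, which `app.py` imports); the `require_api_key` helper is hypothetical and not part of this repository.

```python
import os


def require_api_key(env=os.environ, name="MISTRAL_API_KEY"):
    """Return the key from `env`, or raise with a hint if it is missing or a placeholder."""
    key = env.get(name, "").strip()
    if not key or key == "your_api_key_here":
        raise RuntimeError(
            f"{name} is not set; add it to .env or export it before running app.py"
        )
    return key


# Example with an explicit mapping instead of the real environment.
print(require_api_key({"MISTRAL_API_KEY": "example-key"}))
```

Passing the mapping as a parameter keeps the check easy to exercise without touching real credentials.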

---

## 📖 Usage

1. **Upload Data**: Drag and drop a Scopus CSV export.
2. **Initialize**: Type `Analyze my CSV` or `run abstract only` in the chat.
3. **Iterate**: Use the chat to refine topics (e.g., `group topics 5 and 10 into "Sustainability"`).
4. **Review**: Use the **Review Table** tab to approve or rename topics.
5. **Export**: Download the generated Narrative and Comparison CSV from the **Download** tab.
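
A grouping request such as the one in step 3 boils down to a mapping from theme name to topic numbers, similar to the `theme_map` argument shown for `consolidate_into_themes` in the agent prompt. The sketch below shows how such a mapping could relabel per-sentence topic assignments. `apply_theme_map` is a hypothetical helper, not the repository's tool, which also recomputes centroids and evidence.

```python
from collections import Counter


def apply_theme_map(topic_ids, theme_map):
    """Relabel per-sentence topic ids with theme names; unmapped topics keep a 'Topic N' label."""
    topic_to_theme = {}
    for theme, topics in theme_map.items():
        for t in topics:
            if t in topic_to_theme:
                raise ValueError(f"topic {t} is assigned to two themes")
            topic_to_theme[t] = theme
    return [topic_to_theme.get(t, f"Topic {t}") for t in topic_ids]


sentence_topics = [5, 10, 3, 5, 10, 10]
themes = apply_theme_map(sentence_topics, {"Sustainability": [5, 10]})
print(Counter(themes))
```

Rejecting a topic that appears under two themes keeps the grouping unambiguous before any counts are reported.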

---

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
agent.py
ADDED
@@ -0,0 +1,470 @@
from datetime import datetime

# Define the system prompt for the BERTopic agent
SYSTEM_PROMPT = """
═══════════════════════════════════════════════════════════════
🔬 BERTOPIC THEMATIC DISCOVERY AGENT
Sentence-Level Topic Modeling with Researcher-in-the-Loop
═══════════════════════════════════════════════════════════════

You are a research assistant that performs thematic analysis on
Scopus academic paper exports using BERTopic + Mistral LLM.

Your workflow follows Braun & Clarke's (2006) six-phase Reflexive
Thematic Analysis framework — the gold standard for qualitative
research — enhanced with computational NLP at scale.

Golden thread: CSV → Sentences → Vectors → Clusters → Topics
→ Themes → Saturation → Taxonomy Check → Synthesis → Report

═══════════════════════════════════════════════════════════════
⛔ CRITICAL RULES
═══════════════════════════════════════════════════════════════

RULE 1: ONE PHASE PER MESSAGE
NEVER combine multiple phases in one response.
Present ONE phase → STOP → wait for approval → next phase.

RULE 2: ALL APPROVALS VIA REVIEW TABLE
The researcher approves/rejects/renames using the Results
Table below the chat — NOT by typing in chat.

Your workflow for EVERY phase:
1. Call the tool (saves JSON → table auto-refreshes)
2. Briefly explain what you did in chat (2-3 sentences)
3. End with: "**Review the table below. Edit Approve/Rename
columns, then click Submit Review to Agent.**"
4. STOP. Wait for the researcher's Submit Review.

NEVER present large tables or topic lists in chat text.
NEVER ask researcher to type "approve" in chat.
The table IS the approval interface.

═══════════════════════════════════════════════════════════════
YOUR 7 TOOLS
═══════════════════════════════════════════════════════════════

Tool 1: load_scopus_csv(filepath)
Load CSV, show columns, estimate sentence count.

Tool 2: run_bertopic_discovery(run_key, threshold)
Split → embed → AgglomerativeClustering cosine → centroid nearest 5 → Plotly charts.

Tool 3: label_topics_with_llm(run_key)
5 nearest centroid sentences → Mistral → label + research area + confidence.

Tool 4: consolidate_into_themes(run_key, theme_map)
Merge researcher-approved topic groups → recompute centroids → new evidence.

Tool 5: compare_with_taxonomy(run_key)
Compare themes against PAJAIS taxonomy (Jiang et al., 2019) → mapped vs NOVEL.

Tool 6: generate_comparison_csv()
Compare themes across abstract vs title runs.

Tool 7: export_narrative(run_key)
500-word Section 7 draft via Mistral.

═══════════════════════════════════════════════════════════════
RUN CONFIGURATIONS
═══════════════════════════════════════════════════════════════

"abstract" — Abstract sentences only (~10 per paper)
"title" — Title only (1 per paper, 1,390 total)

═══════════════════════════════════════════════════════════════
METHODOLOGY KNOWLEDGE (cite in conversation when relevant)
═══════════════════════════════════════════════════════════════

Braun & Clarke (2006), Qualitative Research in Psychology, 3(2), 77-101:
- 6-phase reflexive thematic analysis (the framework we follow)
- "Phases are not linear — move back and forth as required"
- "When refinements are not adding anything substantial, stop"
- Researcher is active interpreter, not passive receiver of themes

Grootendorst (2022), arXiv:2203.05794 — BERTopic:
- Modular: any embedding, any clustering, any dim reduction
- Supports AgglomerativeClustering as alternative to HDBSCAN
- c-TF-IDF extracts distinguishing words per cluster
- BERTopic uses AgglomerativeClustering internally for topic reduction

Ward (1963), JASA + Lance & Williams (1967) — Agglomerative Clustering:
- Groups by pairwise cosine similarity threshold
- No density estimation needed — works in ANY dimension (384d)
- distance_threshold controls granularity (lower = more topics)
- Every sentence assigned to a cluster (no outliers)
- 62-year-old algorithm, gold standard for hierarchical grouping

Reimers & Gurevych (2019), EMNLP — Sentence-BERT:
- all-MiniLM-L6-v2 produces 384d normalized vectors
- Cosine similarity = semantic relatedness
- Same meaning clusters together regardless of exact wording

PACIS/ICIS Research Categories:
IS Design Science, HCI, E-Commerce, Knowledge Management,
IT Governance, Digital Innovation, Social Computing, Analytics,
IS Security, Green IS, Health IS, IS Education, IT Strategy

═══════════════════════════════════════════════════════════════
B&C PHASE 1: FAMILIARIZATION WITH THE DATA
"Reading and re-reading, noting initial ideas"
Tool: load_scopus_csv
═══════════════════════════════════════════════════════════════

CRITICAL ERROR HANDLING:
- If message says "[No CSV uploaded yet]" → respond:
"📂 Please upload your Scopus CSV file first using the upload
button at the top. Then type 'Run abstract only' to begin."
DO NOT call any tools. DO NOT guess filenames.
- If a tool returns an error → explain the error clearly and
suggest what the researcher should do next.

When researcher uploads CSV or says "analyze":

1. Call load_scopus_csv(filepath) to inspect the data.

2. DO NOT run BERTopic yet. Present the data landscape:

"📂 **Phase 1: Familiarization** (Braun & Clarke, 2006)

Loaded [N] papers (~[M] sentences estimated)
Columns: Title ✅ | Abstract ✅

Sentence-level approach: each abstract splits into ~10
sentences, each becomes a 384d vector. One paper can
contribute to MULTIPLE topics.

I will run 2 configurations:
1️⃣ **Abstract only** — what papers FOUND (findings, methods, results)
2️⃣ **Title only** — what papers CLAIM to be about (author's framing)

⚙️ Defaults: threshold=0.7, cosine AgglomerativeClustering, 5 nearest

**Ready to proceed to Phase 2?**
• `run` — execute BERTopic discovery
• `run abstract` — single config
• `change threshold to 0.65` — more topics (stricter grouping)
• `change threshold to 0.8` — fewer topics (looser grouping)"

3. WAIT for researcher confirmation before proceeding.

═══════════════════════════════════════════════════════════════
B&C PHASE 2: GENERATING INITIAL CODES
"Systematically coding interesting features across the dataset"
Tools: run_bertopic_discovery → label_topics_with_llm
═══════════════════════════════════════════════════════════════

After researcher confirms:

1. Call run_bertopic_discovery(run_key, threshold)
→ Splits papers into sentences (regex, min 30 chars)
→ Filters publisher boilerplate (copyright, license text)
→ Embeds with all-MiniLM-L6-v2 (384d, L2-normalized)
→ AgglomerativeClustering cosine (no UMAP, no dimension reduction)
→ Finds 5 nearest centroid sentences per topic
→ Saves Plotly HTML visualizations
→ Saves embeddings + summaries checkpoints

2. Immediately call label_topics_with_llm(run_key)
→ Sends ALL topics with 5 evidence sentences to Mistral
→ Returns: label + research area + confidence.
NOTE: NO PACIS categories in Phase 2. PACIS comparison comes in Phase 5.5.

3. Present CODED data with EVIDENCE under each topic:

"📋 **Phase 2: Initial Codes** — [N] codes from [M] sentences

**Code 0: Smart Tourism AI** [IS Design, high, 150 sent, 45 papers]
Evidence (5 nearest centroid sentences):
→ "Neural networks predict tourist behavior..." — _Paper #42_
→ "AI-powered systems optimize resource allocation..." — _Paper #156_
→ "Deep learning models demonstrate superior accuracy..." — _Paper #78_
→ "Machine learning classifies visitor patterns..." — _Paper #201_
→ "ANN achieves 92% accuracy in demand forecasting..." — _Paper #89_

**Code 1: VR Destination Marketing** [HCI, high, 67 sent, 18 papers]
Evidence:
→ ...

📊 4 Plotly visualizations saved (download below)

**Review these codes. Ready for Phase 3 (theme search)?**
• `approve` — codes look good, move to theme grouping
• `re-run 0.65` — re-run with stricter threshold (more topics)
• `re-run 0.8` — re-run with looser threshold (fewer topics)
• `show topic 4 papers` — see all paper titles in topic 4
• `code 2 looks wrong` — I will show why it was labeled that way

📋 **Review Table columns explained:**
| Column | Meaning |
|--------|---------|
| # | Topic number |
| Topic Label | AI-generated name from 5 nearest sentences |
| Research Area | General research area (NOT PACIS — that comes later in Phase 5.5) |
| Confidence | How well the 5 sentences match the label |
| Sentences | Number of sentences clustered here |
| Papers | Number of unique papers contributing sentences |
| Approve | Edit: yes/no — keep or reject this topic |
| Rename To | Edit: type new name if label is wrong |
| Your Reasoning | Edit: why you renamed/rejected |"

4. ⛔ STOP HERE. Do NOT auto-proceed.
Say: "Codes generated. Review the table below.
Edit Approve/Rename columns, then click Submit Review to Agent."

5. If researcher types "show topic X papers":
→ Load summaries.json from checkpoint
→ Find topic X
→ List ALL paper titles in that topic (from paper_titles field)
→ Format as numbered list:
"📄 **Topic 4: AI in Tourism** — 64 papers:
1. Neural networks predict tourist behavior...
2. Deep learning for hotel revenue management...
3. AI-powered recommendation systems...
...
Want to see the 5 key evidence sentences? Type `show topic 4`"

6. If researcher types "show topic X":
→ Show the 5 nearest centroid sentences with full paper titles

7. If researcher questions a code:
→ Show the 5 sentences that generated the label
→ Explain reasoning: "AgglomerativeClustering groups sentences
where cosine distance < threshold. These sentences share
semantic proximity in 384d space even if keywords differ."
→ Offer re-run with adjusted parameters

═══════════════════════════════════════════════════════════════
B&C PHASE 3: SEARCHING FOR THEMES
"Collating codes into potential themes"
Tool: consolidate_into_themes
═══════════════════════════════════════════════════════════════

After researcher approves Phase 2 codes:

1. ANALYZE the labeled codes yourself. Look for:
→ Codes with the SAME research area → likely one theme
→ Codes with overlapping keywords in evidence → related
→ Codes with shared papers across clusters → connected
→ Codes that are sub-aspects of a broader concept → merge
→ Codes that are niche/distinct → keep standalone

2. Present MAPPING TABLE with reasoning:

"🔍 **Phase 3: Searching for Themes** (Braun & Clarke, 2006)

I analyzed [N] codes and propose [M] themes:

| Code (Phase 2) | → | Proposed Theme | Reasoning |
|---------------------------------|---|-----------------------|------------------------------|
| Code 0: Neural Network Tourism | → | AI & ML in Tourism | Same research area, |
| Code 1: Deep Learning Predict. | → | AI & ML in Tourism | shared methodology, |
| Code 5: ML Revenue Management | → | AI & ML in Tourism | Papers #42,#78 in all 3 |
| Code 2: VR Destination Mktg | → | VR & Metaverse | Both HCI category, |
| Code 3: Metaverse Experiences | → | VR & Metaverse | 'virtual reality' overlap |
| Code 4: Instagram Tourism | → | Social Media (alone) | Distinct platform focus |
| Code 8: Green Tourism | → | Sustainability (alone)| Niche, no overlap |

**Do you agree?**
• `agree` — consolidate as shown
• `group 4 6 call it Digital Marketing` — custom grouping
• `move code 5 to standalone` — adjust
• `split AI theme into two` — more granular"

3. ⛔ STOP HERE. Do NOT proceed to Phase 4.
Say: "Review the consolidated themes in the table below.
Edit Approve/Rename columns, then click Submit Review to Agent."
WAIT for the researcher's Submit Review.

4. ONLY after explicit approval, call:
consolidate_into_themes(run_key, {"AI & ML": [0,1,5], "VR": [2,3], ...})

5. Present consolidated themes with NEW centroid evidence:

"🎯 **Themes consolidated** (new centroids computed)

**Theme: AI & ML in Tourism** (294 sent, 83 papers)
Merged from: Codes 0, 1, 5
New evidence (recalculated after merge):
→ "Neural networks predict tourist behavior..." — _Paper #42_
→ "Deep learning optimizes hotel pricing..." — _Paper #78_
→ ...

✅ Themes look correct? Or adjust?"

═══════════════════════════════════════════════════════════════
B&C PHASE 4: REVIEWING THEMES
"Checking if themes work in relation to coded extracts
and the entire data set"
Tool: (conversation — no tool call, agent reasons)
═══════════════════════════════════════════════════════════════

After consolidation, perform SATURATION CHECK:

1. Analyze ALL theme pairs for remaining merge potential:

"🔍 **Phase 4: Reviewing Themes** — Saturation Analysis

| Theme A | Theme B | Overlap | Merge? | Why |
|-------------|-------------|---------|--------|--------------------|
| AI & ML | VR Tourism | None | ❌ | Different domains |
| AI & ML | ChatGPT | Low | ❌ | GenAI ≠ predictive |
| Social Media| VR Tourism | None | ❌ | Different channels |

2. If NO themes can merge:
"⛔ **Saturation reached** (per Braun & Clarke, 2006:
'when refinements are not adding anything substantial, stop')

Reasoning:
1. No remaining themes share a research area
2. No keyword overlap between any theme pair
3. Evidence sentences are semantically distinct
4. Further merging would lose research distinctions

**Do you agree iteration is complete?**
• `agree` — finalize, move to Phase 5
• `try merging X and Y` — override my recommendation"

3. If themes CAN still merge:
"🔄 **Further consolidation possible:**
Themes 'Social Media' and 'Digital Marketing' share 3 keywords.
Suggest merging. Want me to consolidate?"

4. ⛔ STOP HERE. Do NOT proceed to Phase 5.
Say: "Saturation analysis complete. Review themes in the table.
Edit Approve/Rename columns, then click Submit Review to Agent."

═══════════════════════════════════════════════════════════════
B&C PHASE 5: DEFINING AND NAMING THEMES
"Generating clear definitions and names"
Tool: (conversation — agent + researcher co-create)
═══════════════════════════════════════════════════════════════

After saturation confirmed:

1. Present final theme definitions:

"📝 **Phase 5: Theme Definitions**

**Theme 1: AI & Machine Learning in Tourism**
Definition: Research applying predictive ML/DL methods
(neural networks, random forests, deep learning) to tourism
problems including demand forecasting, pricing optimization,
and visitor behavior classification.
Scope: 294 sentences across 83 papers.
Research area: technology adoption. Confidence: High.

**Theme 2: Virtual Reality & Metaverse Tourism**
Definition: ...

**Want to rename any theme? Adjust any definition?**"

2. ⛔ STOP HERE. Do NOT proceed to Phase 5.5 or second run.
Say: "Final theme names ready. Review in the table below.
Edit Rename To column if any names need changing, then click Submit Review."

3. ONLY after approval: repeat ALL of Phase 2-5 for the SECOND run config.
(If first run was "abstract", now run "title" — or vice versa)

═══════════════════════════════════════════════════════════════
PHASE 5.5: TAXONOMY COMPARISON
"Grounding themes against established IS research categories"
Tool: compare_with_taxonomy
═══════════════════════════════════════════════════════════════

After BOTH runs have finalized themes (Phase 5 complete for each):

1. Call compare_with_taxonomy(run_key) for each completed run.
→ Mistral maps each theme to PAJAIS taxonomy (Jiang et al., 2019)
→ Flags themes as MAPPED (known category) or NOVEL (emerging)

2. Present the mapping with researcher review:

"📚 **Phase 5.5: Taxonomy Comparison** (Jiang et al., 2019)

**Mapped to established PAJAIS categories:**

| Your Theme | → | PAJAIS Category | Confidence | Reasoning |
|---|---|---|---|---|
| AI & ML in Tourism | → | Business Intelligence & Analytics | high | ML/DL methods for prediction |
| VR & Metaverse | → | Human Behavior & HCI | high | Immersive technology interaction |
| Social Media Tourism | → | Social Media & Business Impact | high | Direct category match |

**🆕 NOVEL themes (not in existing PAJAIS taxonomy):**

| Your Theme | Status | Reasoning |
|---|---|---|
| ChatGPT in Tourism | 🆕 NOVEL | Generative AI is post-2019, not in taxonomy |
| Sustainable AI Tourism | 🆕 NOVEL | Cross-cuts Green IT + Analytics |

These NOVEL themes represent **emerging research areas** that
extend beyond the established PAJAIS classification.

**Researcher: Review this mapping.**
• `approve` — mapping is correct
• `theme X should map to Y instead` — adjust
• `merge novel themes into one` — consolidate emerging themes
• `this novel theme is actually part of [category]` — reclassify"

3. ⛔ STOP HERE. Do NOT proceed to Phase 6.
Say: "PAJAIS taxonomy mapping complete. Review in the table below.
Edit Approve column for any mappings you disagree with, then click Submit Review."

4. ONLY after approval, ask:
"Want me to consolidate any novel themes with existing ones?
Or keep them separate as evidence of emerging research areas?"

5. ⛔ STOP AGAIN. WAIT for this answer before generating report.

═══════════════════════════════════════════════════════════════
B&C PHASE 6: PRODUCING THE REPORT
"Selection of vivid, compelling extract examples"
Tools: generate_comparison_csv → export_narrative
═══════════════════════════════════════════════════════════════

After BOTH run configs have finalized themes:

1. Call generate_comparison_csv()
→ Compares themes across abstract vs title configs

2. Say briefly in chat:
"Cross-run comparison complete. Check the Download tab for:
• comparison.csv — abstract vs title themes side by side
Review the themes in the table below.
Click Submit Review to confirm, then I'll generate the narrative."

3. ⛔ STOP. Wait for Submit Review.

4. After approval, call export_narrative(run_key)
→ Mistral writes 500-word paper section referencing:
methodology, B&C phases, key themes, limitations

| 442 |
+
═══════════════════════════════════════════════════════════════
|
| 443 |
+
CRITICAL RULES
|
| 444 |
+
═══════════════════════════════════════════════════════════════
|
| 445 |
+
|
| 446 |
+
- ALWAYS follow B&C phases in order. Name each phase explicitly.
|
| 447 |
+
- ALWAYS wait for researcher confirmation between phases.
|
| 448 |
+
- ALWAYS show evidence sentences with paper metadata.
|
| 449 |
+
- ALWAYS cite B&C (2006) when discussing iteration or saturation.
|
| 450 |
+
- ALWAYS cite Grootendorst (2022) when explaining cluster behavior.
|
| 451 |
+
- ALWAYS call label_topics_with_llm before presenting topic labels.
|
| 452 |
+
- ALWAYS call compare_with_taxonomy before claiming PAJAIS mappings.
|
| 453 |
+
- Use threshold=0.7 as default (lower = more topics, higher = fewer).
|
| 454 |
+
- If too many topics (>200), suggest increasing threshold to 0.8.
|
| 455 |
+
- If too few topics (<20), suggest decreasing threshold to 0.6.
|
| 456 |
+
- NEVER skip Phase 4 saturation check or Phase 5.5 taxonomy comparison.
|
| 457 |
+
- NEVER proceed to Phase 6 without both runs completing Phase 5.5.
|
| 458 |
+
- NEVER invent topic labels — only present labels returned by Tool 3.
|
| 459 |
+
- NEVER cite paper IDs, titles, or sentences from memory — only from tool output.
|
| 460 |
+
- NEVER claim a theme is NOVEL or MAPPED without calling Tool 5 first.
|
| 461 |
+
- NEVER fabricate sentence counts or paper counts — only use tool-reported numbers.
|
| 462 |
+
- If a tool returns an error, explain clearly and continue.
|
| 463 |
+
- Keep responses concise. Tables + evidence, not paragraphs.
|
| 464 |
+
|
| 465 |
+
Current date: """ + datetime.now().strftime("%Y-%m-%d")
|
| 466 |
+
|
| 467 |
+
# Tool loader
|
| 468 |
+
def get_local_tools():
|
| 469 |
+
from tools import get_all_tools
|
| 470 |
+
return get_all_tools()
|
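The threshold rules in the CRITICAL RULES block of the system prompt above (default 0.7, suggest 0.8 when there are more than 200 topics, suggest 0.6 when there are fewer than 20) can be sketched as a small helper. This is a minimal sketch and not part of agent.py: the name `suggest_threshold` and its return convention are assumptions made for illustration only.

```python
from typing import Optional

DEFAULT_THRESHOLD = 0.7  # default named in the system prompt
MAX_TOPICS = 200         # above this, the prompt suggests raising the threshold
MIN_TOPICS = 20          # below this, the prompt suggests lowering it


def suggest_threshold(n_topics: int) -> Optional[float]:
    """Return a suggested new threshold, or None if the topic count is acceptable.

    Hypothetical helper: a higher distance threshold merges more clusters
    (fewer topics), a lower one keeps them apart (more topics).
    """
    if n_topics > MAX_TOPICS:
        return 0.8
    if n_topics < MIN_TOPICS:
        return 0.6
    return None


if __name__ == "__main__":
    for n in (350, 120, 8):
        print(n, suggest_threshold(n))
```

In the repository itself this decision is left to the LLM via the prompt; a helper like this would only be useful if the rule were to be enforced in code instead.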
app.py
ADDED
|
@@ -0,0 +1,173 @@
|
| 1 |
+
import os
|
| 2 |
+
import glob
|
| 3 |
+
import json
|
| 4 |
+
import plotly.io as pio
|
| 5 |
+
import gradio as gr
|
| 6 |
+
from dotenv import load_dotenv
|
| 7 |
+
from langchain_mistralai import ChatMistralAI
|
| 8 |
+
from langgraph.prebuilt import create_react_agent
|
| 9 |
+
from langgraph.checkpoint.memory import MemorySaver
|
| 10 |
+
from agent import SYSTEM_PROMPT, get_local_tools
|
| 11 |
+
|
| 12 |
+
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
|
| 13 |
+
load_dotenv()
|
| 14 |
+
|
| 15 |
+
OUTPUT_DIR = "/tmp"  # tools.py writes its checkpoints and exports under /tmp, so read from the same place
|
| 16 |
+
CHECKPOINT_DIR = os.path.join(OUTPUT_DIR, "checkpoints")
|
| 17 |
+
os.makedirs(CHECKPOINT_DIR, exist_ok=True)
|
| 18 |
+
|
| 19 |
+
llm = ChatMistralAI(model="mistral-small-latest", temperature=0, timeout=300)
|
| 20 |
+
agent = create_react_agent(model=llm, tools=get_local_tools(), prompt=SYSTEM_PROMPT, checkpointer=MemorySaver())
|
| 21 |
+
_msg_count = 0
|
| 22 |
+
_uploaded = {"path": ""}
|
| 23 |
+
|
| 24 |
+
theme = gr.themes.Soft(
|
| 25 |
+
primary_hue="indigo",
|
| 26 |
+
secondary_hue="violet",
|
| 27 |
+
neutral_hue="slate",
|
| 28 |
+
font=gr.themes.GoogleFont("Outfit"),
|
| 29 |
+
font_mono=gr.themes.GoogleFont("JetBrains Mono"),
|
| 30 |
+
).set(
|
| 31 |
+
body_background_fill="*neutral_50",
|
| 32 |
+
block_title_text_weight="700",
|
| 33 |
+
button_primary_background_fill="*primary_600",
|
| 34 |
+
)
|
| 35 |
+
|
| 36 |
+
def _latest_output():
|
| 37 |
+
order = {"summaries": 1, "labels": 2, "themes": 3, "taxonomy": 4, "comparison": 9, "narrative": 10}
|
| 38 |
+
fs = glob.glob(f"{OUTPUT_DIR}/rq4_*.csv") + glob.glob(f"{CHECKPOINT_DIR}/rq4_*.json")
|
| 39 |
+
scored = sorted([(sum(v * (k in f) for k, v in order.items()), f) for f in fs], key=lambda x: x[0])
|
| 40 |
+
return [x[1] for x in scored] or None
|
| 41 |
+
|
| 42 |
+
def _build_progress():
|
| 43 |
+
ps = [
|
| 44 |
+
("Load", bool(glob.glob(f"{CHECKPOINT_DIR}/rq4_*_summaries.json"))),
|
| 45 |
+
("Codes", bool(glob.glob(f"{CHECKPOINT_DIR}/rq4_*_labels.json"))),
|
| 46 |
+
("Themes", bool(glob.glob(f"{CHECKPOINT_DIR}/rq4_*_themes.json"))),
|
| 47 |
+
("PAJAIS", bool(glob.glob(f"{CHECKPOINT_DIR}/rq4_*_taxonomy_map.json"))),
|
| 48 |
+
("Report", bool(glob.glob(f"{OUTPUT_DIR}/rq4_comparison.csv"))),
|
| 49 |
+
]
|
| 50 |
+
return " → ".join(f"{'✅' if d else '⬜'} {n}" for n, d in ps)
|
| 51 |
+
|
| 52 |
+
def respond(message, chat_history, uploaded_file):
|
| 53 |
+
global _msg_count
|
| 54 |
+
_msg_count += 1
|
| 55 |
+
_uploaded["path"] = uploaded_file or _uploaded.get("path", "")
|
| 56 |
+
text = (message or "Analyze") + (f"\n[CSV: {_uploaded['path']}]" if _uploaded["path"] else "\n[No CSV]")
|
| 57 |
+
|
| 58 |
+
chat_history.append({"role": "user", "content": message or "Analyze"})
|
| 59 |
+
chat_history.append({"role": "assistant", "content": "🔬 **Working...**"})
|
| 60 |
+
yield chat_history, "", _latest_output()
|
| 61 |
+
|
| 62 |
+
res = agent.invoke({"messages": [("human", text)]}, config={"configurable": {"thread_id": "session"}})
|
| 63 |
+
chat_history[-1] = {"role": "assistant", "content": res["messages"][-1].content}
|
| 64 |
+
yield chat_history, "", _latest_output()
|
| 65 |
+
|
| 66 |
+
def _load_chart(name):
|
| 67 |
+
if not name or not os.path.exists(os.path.join(OUTPUT_DIR, name)): return None
|
| 68 |
+
return pio.read_json(os.path.join(OUTPUT_DIR, name))  # read_json opens and closes the file itself
|
| 69 |
+
|
| 70 |
+
def _get_chart_choices():
|
| 71 |
+
return [os.path.basename(f) for f in sorted(glob.glob(f"{OUTPUT_DIR}/rq4_*.json"))]
|
| 72 |
+
|
| 73 |
+
def _load_review_table():
|
| 74 |
+
ps = sorted(glob.glob(f"{CHECKPOINT_DIR}/rq4_*.json"))
|
| 75 |
+
if not ps: return [[0, "No data", "", 0, 0, False, "", ""]]
|
| 76 |
+
data = json.load(open(ps[-1]))
|
| 77 |
+
return [[d.get("topic_id", i), str(d.get("label") or d.get("top_words", ""))[:60], str((d.get("nearest") or [{}])[0].get("sentence", ""))[:120], d.get("sentence_count", 0), d.get("paper_count", 0), True, "", ""] for i, d in enumerate(data)]
|
| 78 |
+
|
| 79 |
+
def _show_papers_by_select(table_data, evt: gr.SelectData):
|
| 80 |
+
idx = int(table_data.iloc[evt.index[0], 0]) if hasattr(table_data, 'iloc') else int(table_data[evt.index[0]][0])
|
| 81 |
+
fs = sorted(glob.glob(f"{CHECKPOINT_DIR}/rq4_*_labels.json")) or sorted(glob.glob(f"{CHECKPOINT_DIR}/rq4_*_summaries.json"))
|
| 82 |
+
for f in fs:
|
| 83 |
+
for t in json.load(open(f)):
|
| 84 |
+
if t.get("topic_id") == idx:
|
| 85 |
+
return f"Topic {idx}: {t.get('label', '')}\n\n" + "\n".join(f"- {p}" for p in t.get("paper_titles", []))
|
| 86 |
+
return "Not found"
|
| 87 |
+
|
| 88 |
+
def _submit_review(table_data, chat_history):
|
| 89 |
+
ls = [f"Topic {int(r[0])}: {'RENAME to '+r[6] if r[6] else ('APPROVE' if r[5] else 'REJECT')}" for r in table_data.values.tolist()]
|
| 90 |
+
msg = "Review decisions:\n" + "\n".join(ls)
|
| 91 |
+
chat_history.append({"role": "user", "content": "Submitted review"})
|
| 92 |
+
chat_history.append({"role": "assistant", "content": "🔬 **Processing...**"})
|
| 93 |
+
yield chat_history, _latest_output(), gr.update(), gr.update(), _build_progress()
|
| 94 |
+
|
| 95 |
+
res = agent.invoke({"messages": [("human", msg)]}, config={"configurable": {"thread_id": "session"}})
|
| 96 |
+
chat_history[-1] = {"role": "assistant", "content": res["messages"][-1].content}
|
| 97 |
+
yield chat_history, _latest_output(), gr.update(choices=_get_chart_choices()), _load_review_table(), _build_progress()
|
| 98 |
+
|
| 99 |
+
CSS = """
|
| 100 |
+
.gradio-container { background: #fcfcfc !important; }
|
| 101 |
+
.sidebar { background: #ffffff !important; border-right: 1px solid #e2e8f0 !important; }
|
| 102 |
+
.header-text { font-family: 'Outfit', sans-serif; color: #1e293b; letter-spacing: -0.02em; }
|
| 103 |
+
.tab-nav { border-bottom: 2px solid #f1f5f9 !important; }
|
| 104 |
+
.chatbot-container { border-radius: 12px !important; border: 1px solid #e2e8f0 !important; overflow: hidden; }
|
| 105 |
+
.primary-btn { background: #4f46e5 !important; color: white !important; border-radius: 8px !important; font-weight: 600 !important; }
|
| 106 |
+
.secondary-btn { background: #f8fafc !important; color: #475569 !important; border: 1px solid #e2e8f0 !important; border-radius: 8px !important; }
|
| 107 |
+
"""
|
| 108 |
+
|
| 109 |
+
theme = gr.themes.Soft(
|
| 110 |
+
primary_hue="indigo",
|
| 111 |
+
secondary_hue="violet",
|
| 112 |
+
neutral_hue="slate",
|
| 113 |
+
font=gr.themes.GoogleFont("Outfit"),
|
| 114 |
+
font_mono=gr.themes.GoogleFont("JetBrains Mono"),
|
| 115 |
+
).set(
|
| 116 |
+
body_background_fill="*neutral_50",
|
| 117 |
+
block_title_text_weight="700",
|
| 118 |
+
button_primary_background_fill="*primary_600",
|
| 119 |
+
button_primary_text_color="white",
|
| 120 |
+
)
|
| 121 |
+
|
| 122 |
+
with gr.Blocks(title="Thematic Analysis AI", theme=theme, css=CSS) as demo:
|
| 123 |
+
with gr.Sidebar(label="Data Hub", open=True):
|
| 124 |
+
gr.HTML("<h2 class='header-text'>📁 Resource Center</h2>")
|
| 125 |
+
upload = gr.File(label="Dataset (Scopus CSV)", file_types=[".csv"], elem_id="file-upload")
|
| 126 |
+
progress = gr.Markdown(value=_build_progress(), elem_id="progress-display")
|
| 127 |
+
gr.Markdown("---")
|
| 128 |
+
gr.Markdown("### 🛠️ Configuration\nModel: `mistral-small-latest`\nPipeline: `BERTopic + Agglomerative`")
|
| 129 |
+
|
| 130 |
+
gr.HTML("<h1 class='header-text' style='margin-bottom: 20px;'>🔬 Topic Modelling Agentic AI</h1>")
|
| 131 |
+
|
| 132 |
+
with gr.Tabs():
|
| 133 |
+
with gr.Tab("💬 Agent Chat"):
|
| 134 |
+
chatbot = gr.Chatbot(height=450, show_label=False, elem_classes="chatbot-container")
|
| 135 |
+
with gr.Row():
|
| 136 |
+
msg = gr.Textbox(placeholder="Ask the agent to analyze, group, or export...", show_label=False, scale=9)
|
| 137 |
+
send = gr.Button("Send", variant="primary", scale=1, elem_classes="primary-btn")
|
| 138 |
+
|
| 139 |
+
with gr.Tab("📋 Review & Refine"):
|
| 140 |
+
gr.Markdown("### 🔍 Topic Validation Table\nReview the identified themes and rename or reject as needed.")
|
| 141 |
+
table = gr.Dataframe(headers=["#", "Label", "Key Evidence", "Sents", "Papers", "Approve", "Rename", "Reasoning"], datatype=["number", "str", "str", "number", "number", "bool", "str", "str"], interactive=True)
|
| 142 |
+
with gr.Row():
|
| 143 |
+
submit = gr.Button("Submit Review Decisions", variant="primary", scale=2, elem_classes="primary-btn")
|
| 144 |
+
clear = gr.Button("Refresh Table", variant="secondary", scale=1, elem_classes="secondary-btn")
|
| 145 |
+
papers = gr.Textbox(label="Full Context: Papers in Selected Topic", lines=6, interactive=False)
|
| 146 |
+
|
| 147 |
+
with gr.Tab("📊 Visual Analytics"):
|
| 148 |
+
gr.Markdown("### 📈 Interactive Topic Visualizations")
|
| 149 |
+
with gr.Row():
|
| 150 |
+
selector = gr.Dropdown(choices=[], label="Select Visualization Type", scale=7)
|
| 151 |
+
refresh_viz = gr.Button("Refresh Charts", variant="secondary", scale=1)
|
| 152 |
+
display = gr.Plot()
|
| 153 |
+
|
| 154 |
+
with gr.Tab("📥 Export Control"):
|
| 155 |
+
gr.Markdown("### 💾 Final Outputs\nDownload generated papers, narratives, and comparison matrices.")
|
| 156 |
+
download = gr.File(label="Available Exports", file_count="multiple")
|
| 157 |
+
|
| 158 |
+
def respond_with_viz(m, h, u):
|
| 159 |
+
g = respond(m, h, u)
|
| 160 |
+
for hist, _, dl in g:
|
| 161 |
+
cs = _get_chart_choices()
|
| 162 |
+
yield hist, "", dl, gr.update(choices=cs, value=cs[-1] if cs else None), _load_chart(cs[-1]) if cs else None, _load_review_table(), _build_progress()
|
| 163 |
+
|
| 164 |
+
msg.submit(respond_with_viz, [msg, chatbot, upload], [chatbot, msg, download, selector, display, table, progress])
|
| 165 |
+
send.click(respond_with_viz, [msg, chatbot, upload], [chatbot, msg, download, selector, display, table, progress])
|
| 166 |
+
selector.change(_load_chart, [selector], [display])
|
| 167 |
+
table.select(_show_papers_by_select, [table], [papers])
|
| 168 |
+
submit.click(_submit_review, [table, chatbot], [chatbot, download, selector, table, progress])
|
| 169 |
+
upload.change(lambda f, h: (yield from respond_with_viz("Analyze CSV", h, f)), [upload, chatbot], [chatbot, msg, download, selector, display, table, progress])
|
| 170 |
+
|
| 171 |
+
|
| 172 |
+
if __name__ == "__main__":
|
| 173 |
+
demo.launch(server_name="0.0.0.0", server_port=7860, ssr_mode=False)
|
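The ordering used by `_latest_output` in app.py above can be sketched in isolation: each file path gets a score equal to the sum of the weights of the stage names it contains, and files are returned in ascending score order so later pipeline stages appear last. This is a minimal sketch under stated assumptions: `rank_outputs` and the sample paths are hypothetical and only illustrate the scoring, without touching the filesystem.

```python
from typing import Dict, List

# Same stage weights as the dictionary in _latest_output.
STAGE_WEIGHTS: Dict[str, int] = {
    "summaries": 1, "labels": 2, "themes": 3,
    "taxonomy": 4, "comparison": 9, "narrative": 10,
}


def rank_outputs(paths: List[str]) -> List[str]:
    """Sort paths by the summed weight of every stage name found in the path."""
    def score(path: str) -> int:
        return sum(weight for key, weight in STAGE_WEIGHTS.items() if key in path)
    # sorted() is stable, so paths with equal scores keep their input order.
    return sorted(paths, key=score)


if __name__ == "__main__":
    sample = [
        "outputs/rq4_comparison.csv",
        "outputs/checkpoints/rq4_abstract_labels.json",
        "outputs/checkpoints/rq4_abstract_summaries.json",
    ]
    for path in rank_outputs(sample):
        print(path)
```

Because matching is by substring, a path that happens to contain two stage names would receive the sum of both weights; the file names written by tools.py each contain only one.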
requirements.txt
ADDED
|
@@ -0,0 +1,13 @@
|
| 1 |
+
# requirements.txt v2.0 | 4 April 2026
|
| 2 |
+
# BERTopic + Mistral LLM (French, Apache 2.0, GDPR-safe)
|
| 3 |
+
langchain
|
| 4 |
+
langchain-mistralai
|
| 5 |
+
langgraph
|
| 6 |
+
langchain-core
|
| 7 |
+
bertopic
|
| 8 |
+
sentence-transformers
|
| 9 |
+
numpy
|
| 10 |
+
pandas
|
| 11 |
+
plotly
|
| 12 |
+
kaleido
|
| 13 |
+
gradio
|
tools.py
ADDED
|
@@ -0,0 +1,182 @@
|
| 1 |
+
from langchain_core.tools import tool
|
| 2 |
+
import os
|
| 3 |
+
import json
|
| 4 |
+
import re
|
| 5 |
+
import numpy as np
|
| 6 |
+
import pandas as pd
|
| 7 |
+
|
| 8 |
+
CHECKPOINT_DIR = "/tmp/checkpoints"
|
| 9 |
+
os.makedirs(CHECKPOINT_DIR, exist_ok=True)
|
| 10 |
+
|
| 11 |
+
NEAREST_K = 5
|
| 12 |
+
SENT_SPLIT_RE = r'(?<=[.!?])\s+(?=[A-Z])'
|
| 13 |
+
MIN_SENT_LEN = 30
|
| 14 |
+
RUN_CONFIGS = {"abstract": ["Abstract"], "title": ["Title"]}
|
| 15 |
+
_data = {}
|
| 16 |
+
|
| 17 |
+
def _split_sentences(text):
|
| 18 |
+
raw = re.split(SENT_SPLIT_RE, str(text))
|
| 19 |
+
return list(filter(lambda s: len(s.strip()) >= MIN_SENT_LEN, raw))
|
| 20 |
+
|
| 21 |
+
@tool
|
| 22 |
+
def load_scopus_csv(filepath: str) -> str:
"""Load a Scopus CSV export from filepath and return dataset statistics with a short sample."""
|
| 23 |
+
df = pd.read_csv(filepath, encoding="utf-8-sig")
|
| 24 |
+
_data["df"] = df
|
| 25 |
+
cols = [c for c in ["Title", "Abstract", "Author Keywords"] if c in df.columns]
|
| 26 |
+
sample = df[cols].head(3).to_string(max_colwidth=80)
|
| 27 |
+
nulls = ", ".join([f"{c}: {df[c].notna().sum()}/{len(df)}" for c in cols])
|
| 28 |
+
|
| 29 |
+
avg_sents = df["Abstract"].head(5).apply(_split_sentences).apply(len).mean()
|
| 30 |
+
est = int(avg_sents * len(df))
|
| 31 |
+
|
| 32 |
+
return (f"📊 **Dataset Statistics:**\n"
|
| 33 |
+
f"- **Papers:** {len(df)}\n"
|
| 34 |
+
f"- **Abstract sentences:** ~{est}\n"
|
| 35 |
+
f"- **Title sentences:** {int(df['Title'].notna().sum())}\n"
|
| 36 |
+
f"- **Non-null:** {nulls}\n\n"
|
| 37 |
+
f"Columns: {', '.join(list(df.columns)[:15])}\n\n"
|
| 38 |
+
f"Sample:\n{sample}")
|
| 39 |
+
|
| 40 |
+
@tool
|
| 41 |
+
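def run_bertopic_discovery(run_key: str, threshold: float = 0.7) -> str:
"""Split the text for run_key ('abstract' or 'title') into sentences, embed and cluster them, and save per-topic summaries. A lower threshold yields more topics."""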
|
| 42 |
+
from bertopic import BERTopic
|
| 43 |
+
from sentence_transformers import SentenceTransformer
|
| 44 |
+
from sklearn.preprocessing import FunctionTransformer
|
| 45 |
+
from sklearn.cluster import AgglomerativeClustering
|
| 46 |
+
|
| 47 |
+
df = _data["df"].copy()
|
| 48 |
+
available = [c for c in RUN_CONFIGS[run_key] if c in df.columns]
|
| 49 |
+
df["_text"] = df[available].fillna("").agg(" ".join, axis=1)
|
| 50 |
+
df["_paper_id"] = df.index
|
| 51 |
+
df["_sentences"] = df["_text"].apply(_split_sentences)
|
| 52 |
+
|
| 53 |
+
meta = [c for c in ["_paper_id", "Title", "Author Keywords", "_sentences"] if c in df.columns]
|
| 54 |
+
sent_df = df[meta].explode("_sentences").rename(columns={"_sentences": "text"}).dropna(subset=["text"]).reset_index(drop=True)
|
| 55 |
+
sent_df["sent_id"] = sent_df.groupby("_paper_id").cumcount()
|
| 56 |
+
|
| 57 |
+
patterns = r"Licensee MDPI|Published by Informa|Published by Elsevier|Taylor & Francis|Copyright ©|Creative Commons|open access article|Inderscience Enterprises|All rights reserved|Springer Nature|Emerald Publishing|limitations and (future|implications|discussed)|implications (are|were) (discussed|presented)|concludes with .* implications"
|
| 58 |
+
sent_df = sent_df[~sent_df["text"].str.contains(patterns, case=False, regex=True, na=False)].reset_index(drop=True)
|
| 59 |
+
|
| 60 |
+
embedder = SentenceTransformer("all-MiniLM-L6-v2")
|
| 61 |
+
embs = embedder.encode(sent_df["text"].tolist(), show_progress_bar=False, normalize_embeddings=True)
|
| 62 |
+
np.save(f"{CHECKPOINT_DIR}/rq4_{run_key}_emb.npy", embs)
|
| 63 |
+
|
| 64 |
+
cluster = AgglomerativeClustering(n_clusters=None, metric="cosine", linkage="average", distance_threshold=threshold)
|
| 65 |
+
model = BERTopic(hdbscan_model=cluster, umap_model=FunctionTransformer())
|
| 66 |
+
topics, _ = model.fit_transform(sent_df["text"].tolist(), embs)
|
| 67 |
+
|
| 68 |
+
_data[f"{run_key}_model"] = model
|
| 69 |
+
_data[f"{run_key}_topics"] = np.array(topics)
|
| 70 |
+
_data[f"{run_key}_embeddings"] = embs
|
| 71 |
+
_data[f"{run_key}_sent_df"] = sent_df
|
| 72 |
+
|
| 73 |
+
n = len(set(topics)) - int(-1 in topics)
|
| 74 |
+
(n >= 3) and model.visualize_topics().write_html(f"/tmp/rq4_{run_key}_intertopic.html")
|
| 75 |
+
(n >= 1) and model.visualize_barchart(top_n_topics=min(10, n)).write_html(f"/tmp/rq4_{run_key}_bars.html")
|
| 76 |
+
(n >= 2) and model.visualize_hierarchy().write_html(f"/tmp/rq4_{run_key}_hierarchy.html")
|
| 77 |
+
(n >= 2) and model.visualize_heatmap().write_html(f"/tmp/rq4_{run_key}_heatmap.html")
|
| 78 |
+
|
| 79 |
+
t_arr = np.array(topics)
|
| 80 |
+
valid = [r for r in model.get_topic_info().to_dict("records") if r["Topic"] != -1]
|
| 81 |
+
|
| 82 |
+
def _centroid(row):
|
| 83 |
+
mask = t_arr == row["Topic"]
|
| 84 |
+
m_idx = np.where(mask)[0]
|
| 85 |
+
m_embs = embs[mask]
|
| 86 |
+
cent = m_embs.mean(axis=0)
|
| 87 |
+
dists = 1 - (m_embs @ cent) / (np.linalg.norm(m_embs, axis=1) * np.linalg.norm(cent) + 1e-10)
|
| 88 |
+
near = np.argsort(dists)[:NEAREST_K]
|
| 89 |
+
|
| 90 |
+
evidence = [{"sentence": str(sent_df.iloc[m_idx[i]]["text"])[:250], "paper_id": int(sent_df.iloc[m_idx[i]]["_paper_id"]), "title": str(sent_df.iloc[m_idx[i]].get("Title", ""))[:150], "keywords": str(sent_df.iloc[m_idx[i]].get("Author Keywords", ""))[:150]} for i in near]
|
| 91 |
+
p_df = sent_df.iloc[m_idx].drop_duplicates(subset=["_paper_id"])
|
| 92 |
+
titles = [str(p_df.iloc[i].get("Title", ""))[:200] for i in range(min(50, len(p_df)))]
|
| 93 |
+
|
| 94 |
+
return {"topic_id": int(row["Topic"]), "sentence_count": int(row["Count"]), "paper_count": len(p_df), "top_words": str(row.get("Name", ""))[:100], "nearest": evidence, "paper_titles": titles}
|
| 95 |
+
|
| 96 |
+
sums = list(map(_centroid, valid))
|
| 97 |
+
json.dump(sums, open(f"{CHECKPOINT_DIR}/rq4_{run_key}_summaries.json", "w"), indent=2, default=str)
|
| 98 |
+
|
| 99 |
+
lines = [f" Topic {s['topic_id']} ({s['sentence_count']} sents, {s['paper_count']} papers): {s['top_words']}" for s in sums]
|
| 100 |
+
return f"[{run_key}] {n} topics from {len(sent_df)} sentences.\n\n" + "\n".join(lines)
|
| 101 |
+
|
| 102 |
+
@tool
|
| 103 |
+
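def label_topics_with_llm(run_key: str) -> str:
"""Label the largest topics of a run (up to 100) with the Mistral LLM and save the labels checkpoint."""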
|
| 104 |
+
from langchain_mistralai import ChatMistralAI
|
| 105 |
+
from langchain_core.prompts import PromptTemplate
|
| 106 |
+
from langchain_core.output_parsers import JsonOutputParser
|
| 107 |
+
|
| 108 |
+
sums = json.load(open(f"{CHECKPOINT_DIR}/rq4_{run_key}_summaries.json"))
|
| 109 |
+
to_label = sorted(sums, key=lambda s: s.get("sentence_count", 0), reverse=True)[:100]
|
| 110 |
+
|
| 111 |
+
block = "\n\n".join([f"Topic {s['topic_id']} ({s['sentence_count']} sents):\n{NEAREST_K} entries:\n" + "\n".join([f"- {e['sentence']}\n Paper: {e['title']}" for e in s["nearest"]]) for s in to_label])
|
| 112 |
+
|
| 113 |
+
prompt = PromptTemplate.from_template("Return JSON ARRAY of objects with topic_id, label, category, confidence, reasoning, niche for:\n{topics}")
|
| 114 |
+
llm = ChatMistralAI(model="mistral-small-latest", temperature=0)
|
| 115 |
+
labels = (prompt | llm | JsonOutputParser()).invoke({"topics": block})
|
| 116 |
+
|
| 117 |
+
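labeled = [{**next((l for l in labels if str(l.get("topic_id")) == str(s["topic_id"])), {}), **s} for s in sums]  # match labels by topic_id, not by list position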
|
| 118 |
+
json.dump(labeled, open(f"{CHECKPOINT_DIR}/rq4_{run_key}_labels.json", "w"), indent=2, default=str)
|
| 119 |
+
|
| 120 |
+
lines = [f" **Topic {l.get('topic_id')}: {l.get('label')}** [{l.get('category')}] ({l.get('sentence_count')} sents)" for l in labeled]
|
| 121 |
+
return f"[{run_key}] {len(labeled)} topics labeled.\n\n" + "\n\n".join(lines)
|
| 122 |
+
|
| 123 |
+
@tool
|
| 124 |
+
def generate_comparison_csv() -> str:
"""Combine the labelled topics of every completed run into a single comparison CSV."""
|
| 125 |
+
done = [k for k in RUN_CONFIGS.keys() if os.path.exists(f"{CHECKPOINT_DIR}/rq4_{k}_labels.json")]
|
| 126 |
+
rows = []
|
| 127 |
+
for k in done:
|
| 128 |
+
ls = json.load(open(f"{CHECKPOINT_DIR}/rq4_{k}_labels.json"))
|
| 129 |
+
rows.extend([{"run": k, "topic_id": l.get("topic_id"), "label": l.get("label"), "category": l.get("category"), "sentences": l.get("sentence_count"), "papers": l.get("paper_count")} for l in ls])
|
| 130 |
+
|
| 131 |
+
df = pd.DataFrame(rows)
|
| 132 |
+
df.to_csv("/tmp/rq4_comparison.csv", index=False)
|
| 133 |
+
return f"Saved to /tmp/rq4_comparison.csv\n\n{df.to_string(index=False)}"
|
| 134 |
+
|
| 135 |
+
@tool
|
| 136 |
+
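def export_narrative(run_key: str) -> str:
"""Ask the LLM for a 500-word results narrative based on a run's labelled topics and save it as a text file."""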
|
| 137 |
+
from langchain_mistralai import ChatMistralAI
|
| 138 |
+
ls = json.load(open(f"{CHECKPOINT_DIR}/rq4_{run_key}_labels.json"))
|
| 139 |
+
txt = "\n".join([f"- {l.get('label')} ({l.get('sentence_count')} sents)" for l in ls])
|
| 140 |
+
llm = ChatMistralAI(model="mistral-small-latest", temperature=0.3)
|
| 141 |
+
res = llm.invoke(f"Write a 500-word Section 7 'Topic Modeling Results' for {run_key} run:\n{txt}")
|
| 142 |
+
open("/tmp/rq4_narrative.txt", "w", encoding="utf-8").write(res.content)
|
| 143 |
+
return f"Saved to /tmp/rq4_narrative.txt\n\n{res.content}"
|
| 144 |
+
|
| 145 |
+
@tool
|
| 146 |
+
def consolidate_into_themes(run_key: str, theme_map: dict) -> str:
"""Merge topics into named themes. theme_map maps each theme name to a list of topic ids."""
|
| 147 |
+
t_arr, embs, s_df = _data[f"{run_key}_topics"], _data[f"{run_key}_embeddings"], _data[f"{run_key}_sent_df"]
|
| 148 |
+
|
| 149 |
+
def _build(name, ids):
|
| 150 |
+
mask = np.isin(t_arr, ids)
|
| 151 |
+
m_idx, m_embs = np.where(mask)[0], embs[mask]
|
| 152 |
+
cent = m_embs.mean(axis=0)
|
| 153 |
+
dists = 1 - (m_embs @ cent) / (np.linalg.norm(m_embs, axis=1) * np.linalg.norm(cent) + 1e-10)
|
| 154 |
+
near = np.argsort(dists)[:NEAREST_K]
|
| 155 |
+
evidence = [{"sentence": str(s_df.iloc[m_idx[i]]["text"])[:250], "title": str(s_df.iloc[m_idx[i]].get("Title", ""))[:150]} for i in near]
|
| 156 |
+
return {"label": name, "merged_topics": list(ids), "sentence_count": int(mask.sum()), "paper_count": int(s_df.iloc[m_idx]["_paper_id"].nunique()), "nearest": evidence}
|
| 157 |
+
|
| 158 |
+
themes = [{"topic_id": i, **_build(n, ids)} for i, (n, ids) in enumerate(theme_map.items())]
|
| 159 |
+
json.dump(themes, open(f"{CHECKPOINT_DIR}/rq4_{run_key}_themes.json", "w"), indent=2, default=str)
|
| 160 |
+
lines = [f" **{t['label']}** ({t['sentence_count']} sents)" for t in themes]
|
| 161 |
+
return f"[{run_key}] {len(themes)} themes.\n\n" + "\n".join(lines)
|
| 162 |
+
|
| 163 |
+
PAJAIS = ["Electronic Business", "HCI", "IS Strategy", "Business Intelligence", "Design Science", "Enterprise Systems", "Adoption", "Social Media", "Cultural Issues", "Security", "Smart/IoT", "Knowledge Management", "Digital Platform", "Healthcare", "Project Management", "Service Science", "Social/Org Aspects", "Research Methods", "E-Finance", "E-Government", "Education", "Sustainability"]
|
| 164 |
+
|
| 165 |
+
@tool
|
| 166 |
+
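def compare_with_taxonomy(run_key: str) -> str:
"""Map a run's themes (or its labels when no themes exist) to PAJAIS categories, or mark them NOVEL."""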
|
| 167 |
+
from langchain_mistralai import ChatMistralAI
|
| 168 |
+
from langchain_core.prompts import PromptTemplate
|
| 169 |
+
from langchain_core.output_parsers import JsonOutputParser
|
| 170 |
+
|
| 171 |
+
src = (os.path.exists(f"{CHECKPOINT_DIR}/rq4_{run_key}_themes.json") and f"{CHECKPOINT_DIR}/rq4_{run_key}_themes.json") or f"{CHECKPOINT_DIR}/rq4_{run_key}_labels.json"
|
| 172 |
+
ts = json.load(open(src))
|
| 173 |
+
prompt = PromptTemplate.from_template("Map themes to PAJAIS taxonomy or mark 'NOVEL'. Return JSON array for:\nThemes:\n{ts}\nTaxonomy:\n{tax}")
|
| 174 |
+
llm = ChatMistralAI(model="mistral-small-latest", temperature=0)
|
| 175 |
+
ms = (prompt | llm | JsonOutputParser()).invoke({"ts": "\n".join([str(t.get("label") or t.get("top_words", "")) for t in ts]), "tax": "\n".join(PAJAIS)})
|
| 176 |
+
json.dump(ms, open(f"{CHECKPOINT_DIR}/rq4_{run_key}_taxonomy_map.json", "w"), indent=2, default=str)
|
| 177 |
+
return f"[{run_key}] Mapping complete."
|
| 178 |
+
|
| 179 |
+
def get_all_tools():
|
| 180 |
+
ts = [load_scopus_csv, run_bertopic_discovery, label_topics_with_llm, consolidate_into_themes, compare_with_taxonomy, generate_comparison_csv, export_narrative]
|
| 181 |
+
for t in ts: setattr(t, 'handle_tool_error', True)
|
| 182 |
+
return ts
|
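The sentence splitting that feeds the clustering in tools.py above can be tried on its own. The sketch below reuses the same regular expression and minimum length as `_split_sentences`; the function name `split_sentences` and the sample text are illustrative assumptions, not code from the repository.

```python
import re
from typing import List

SENT_SPLIT_RE = r'(?<=[.!?])\s+(?=[A-Z])'  # same pattern as tools.py
MIN_SENT_LEN = 30                           # same cutoff as tools.py


def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows ., ! or ? and precedes a capital letter,
    then drop fragments shorter than MIN_SENT_LEN characters."""
    raw = re.split(SENT_SPLIT_RE, str(text))
    return [s for s in raw if len(s.strip()) >= MIN_SENT_LEN]


if __name__ == "__main__":
    sample = (
        "Topic models summarise large collections of abstracts. "
        "Short one. "
        "We cluster sentence embeddings with agglomerative clustering."
    )
    for sentence in split_sentences(sample):
        print(sentence)
```

Because the split is purely pattern-based, an abbreviation followed by a capitalised word (for example "et al. Smith") is also treated as a sentence boundary, and the length filter means most titles shorter than 30 characters never reach the clustering step.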