Spaces: Dash10107/topic-modelling-agent


Daksh C Jain committed on Apr 14

Commit

d2a404d

0 Parent(s):

Initial commit (Clean)

Files changed (6)

.gitignore +5 -0
README.md +88 -0
agent.py +470 -0
app.py +173 -0
requirements.txt +13 -0
tools.py +182 -0

.gitignore ADDED

	@@ -0,0 +1,5 @@

+.env
+outputs/
+checkpoints/
+__pycache__/
+*.pyc

README.md ADDED

	@@ -0,0 +1,88 @@

+# 🔬 Topic Modelling Agentic AI
+A professional, agent-driven platform for automated **Reflexive Thematic Analysis** (Braun & Clarke, 2006) using state-of-the-art Natural Language Processing. Built with LangGraph, BERTopic, and Mistral AI, this agent automates the discovery, labeling, and synthesis of research topics from large-scale academic datasets (e.g., Scopus CSV exports).
+---
+## 🚀 Overview
+This project implements a sophisticated "Golden Thread" pipeline for qualitative research. It moves beyond traditional keyword extraction by using sentence-level embeddings and LLM-powered context awareness to identify nuanced themes.
+### Key Features
+- **Agentic Workflow**: Powered by **LangGraph**, the agent autonomously decides when to load data, run clustering, or call the LLM for labeling.
+- **Precision Clustering**: Uses **BERTopic** with Agglomerative Clustering (cosine distance) on 384-dimensional sentence embeddings (`all-MiniLM-L6-v2`).
+- **Human-in-the-Loop**: An interactive Gradio UI allows researchers to review, rename, or reject agent-generated topics before final synthesis.
+- **Automated Synthesis**: Generates a 500-word research narrative and maps themes to established taxonomies (e.g., PAJAIS).
+- **Rich Visualizations**: Interactive Plotly charts including Intertopic Distance Maps, Hierarchical Clustering, and Heatmaps.
+---
+## 🛠️ Technology Stack
+- **Framework**: [LangGraph](https://github.com/langchain-ai/langgraph) (Agentic logic & state management)
+- **Engine**: [BERTopic](https://github.com/MaartenGr/BERTopic) (Topic Modeling pipeline)
+- **LLM**: [Mistral AI](https://mistral.ai/) (`mistral-small-latest`)
+- **Embeddings**: `sentence-transformers/all-MiniLM-L6-v2`
+- **UI**: [Gradio 5.x](https://gradio.app/)
+- **Data**: Pandas, NumPy, Scikit-Learn
+---
+## 📋 Methodology
+The agent follows the **Braun & Clarke (2006)** six-phase thematic analysis framework:
+1. **Familiarization**: Loading and preprocessing Scopus CSV metadata.
+2. **Initial Coding**: Sentence-level clustering to identify "semantic atoms."
+3. **Searching for Themes**: Aggregating clusters into broader research themes.
+4. **Reviewing Themes**: Researcher validation via the Review Table.
+5. **Defining and Naming**: Refined LLM labeling based on centroid-nearest evidence.
+6. **Producing the Report**: Exporting narrative sections and comparison matrices.
+---
+## 💻 Setup & Installation
+### Prerequisites
+- Python 3.10+
+- Mistral AI API Key
+### Installation
+1.  **Clone the repository**:
+    ```bash
+    git clone https://github.com/your-repo/topic-modelling-agent.git
+    cd topic-modelling-agent
+    ```
+2.  **Install dependencies**:
+    ```bash
+    pip install -r requirements.txt
+    ```
+3.  **Configure environment**:
+    Create a `.env` file in the root directory:
+    ```env
+    MISTRAL_API_KEY=your_api_key_here
+    ```
+4.  **Run the application**:
+    ```bash
+    python app.py
+    ```
+---
+## 📖 Usage
+1.  **Upload Data**: Drag and drop a Scopus CSV export.
+2.  **Initialize**: Type `Analyze my CSV` or `run abstract only` in the chat.
+3.  **Iterate**: Use the chat to refine topics (e.g., `group topics 5 and 10 into "Sustainability"`).
+4.  **Review**: Use the **Review & Refine** tab to approve or rename topics.
+5.  **Export**: Download the generated Narrative and Comparison CSV from the **Export Control** tab.
+---
+## 📄 License
+This project is licensed under the MIT License - see the LICENSE file for details.
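
Reviewer sketch (not part of the commit): the README describes threshold-based agglomerative clustering on cosine distance, followed by "centroid-nearest evidence". The toy below illustrates both ideas in pure Python on 2-d vectors. It is a minimal sketch under assumptions: average linkage is assumed (the commit does not show which linkage `tools.py` uses), and the names `cluster` and `nearest_to_centroid` are hypothetical. It shows why a lower threshold yields more topics: merging stops as soon as the closest pair of clusters is at least `threshold` apart.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def cluster(vectors, threshold):
    """Average-linkage agglomerative clustering (assumed linkage).

    Clusters are merged greedily until the closest pair is at least
    `threshold` apart. Returns a list of index lists.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        if linkage(clusters[a], clusters[b]) >= threshold:
            break
        clusters[a] += clusters[b]  # a < b, so deleting b keeps a valid
        del clusters[b]
    return clusters


def nearest_to_centroid(vectors, members, k=5):
    """Return up to k member indices, closest to the cluster centroid first."""
    dim = len(vectors[0])
    centroid = [sum(vectors[i][d] for i in members) / len(members) for d in range(dim)]
    return sorted(members, key=lambda i: cosine_distance(vectors[i], centroid))[:k]


# Toy data: two tight pairs pointing in nearly orthogonal directions.
toy = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
for threshold in (0.001, 0.7, 0.95):
    print(threshold, cluster(toy, threshold))
```

With real 384-dimensional embeddings the same stopping rule applies; a library implementation would replace the naive pairwise loop, which is only practical for small inputs.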

agent.py ADDED

	@@ -0,0 +1,470 @@

+from datetime import datetime
+# Define the system prompt for the BERTopic agent
+SYSTEM_PROMPT = """
+═══════════════════════════════════════════════════════════════
+ 🔬 BERTOPIC THEMATIC DISCOVERY AGENT
+    Sentence-Level Topic Modeling with Researcher-in-the-Loop
+═══════════════════════════════════════════════════════════════
+You are a research assistant that performs thematic analysis on
+Scopus academic paper exports using BERTopic + Mistral LLM.
+Your workflow follows Braun & Clarke's (2006) six-phase Reflexive
+Thematic Analysis framework — the gold standard for qualitative
+research — enhanced with computational NLP at scale.
+Golden thread: CSV → Sentences → Vectors → Clusters → Topics
+→ Themes → Saturation → Taxonomy Check → Synthesis → Report
+═══════════════════════════════════════════════════════════════
+ ⛔ CRITICAL RULES
+═══════════════════════════════════════════════════════════════
+ RULE 1: ONE PHASE PER MESSAGE
+   NEVER combine multiple phases in one response.
+   Present ONE phase → STOP → wait for approval → next phase.
+ RULE 2: ALL APPROVALS VIA REVIEW TABLE
+   The researcher approves/rejects/renames using the Results
+   Table below the chat — NOT by typing in chat.
+   Your workflow for EVERY phase:
+   1. Call the tool (saves JSON → table auto-refreshes)
+   2. Briefly explain what you did in chat (2-3 sentences)
+   3. End with: "**Review the table below. Edit Approve/Rename
+      columns, then click Submit Review to Agent.**"
+   4. STOP. Wait for the researcher's Submit Review.
+   NEVER present large tables or topic lists in chat text.
+   NEVER ask researcher to type "approve" in chat.
+   The table IS the approval interface.
+═══════════════════════════════════════════════════════════════
+ YOUR 7 TOOLS
+═══════════════════════════════════════════════════════════════
+ Tool 1: load_scopus_csv(filepath)
+         Load CSV, show columns, estimate sentence count.
+ Tool 2: run_bertopic_discovery(run_key, threshold)
+         Split → embed → AgglomerativeClustering cosine → centroid nearest 5 → Plotly charts.
+ Tool 3: label_topics_with_llm(run_key)
+         5 nearest centroid sentences → Mistral → label + research area + confidence.
+ Tool 4: consolidate_into_themes(run_key, theme_map)
+         Merge researcher-approved topic groups → recompute centroids → new evidence.
+ Tool 5: compare_with_taxonomy(run_key)
+         Compare themes against PAJAIS taxonomy (Jiang et al., 2019) → mapped vs NOVEL.
+ Tool 6: generate_comparison_csv()
+         Compare themes across abstract vs title runs.
+ Tool 7: export_narrative(run_key)
+         500-word Section 7 draft via Mistral.
+═══════════════════════════════════════════════════════════════
+ RUN CONFIGURATIONS
+═══════════════════════════════════════════════════════════════
+ "abstract"  — Abstract sentences only (~10 per paper)
+ "title"     — Title only (1 per paper, 1,390 total)
+═══════════════════════════════════════════════════════════════
+ METHODOLOGY KNOWLEDGE (cite in conversation when relevant)
+═══════════════════════════════════════════════════════════════
+ Braun & Clarke (2006), Qualitative Research in Psychology, 3(2), 77-101:
+   - 6-phase reflexive thematic analysis (the framework we follow)
+   - "Phases are not linear — move back and forth as required"
+   - "When refinements are not adding anything substantial, stop"
+   - Researcher is active interpreter, not passive receiver of themes
+ Grootendorst (2022), arXiv:2203.05794 — BERTopic:
+   - Modular: any embedding, any clustering, any dim reduction
+   - Supports AgglomerativeClustering as alternative to HDBSCAN
+   - c-TF-IDF extracts distinguishing words per cluster
+   - BERTopic uses AgglomerativeClustering internally for topic reduction
+ Ward (1963), JASA + Lance & Williams (1967) — Agglomerative Clustering:
+   - Groups by pairwise cosine similarity threshold
+   - No density estimation needed — works in ANY dimension (384d)
+   - distance_threshold controls granularity (lower = more topics)
+   - Every sentence assigned to a cluster (no outliers)
+   - 62-year-old algorithm, gold standard for hierarchical grouping
+ Reimers & Gurevych (2019), EMNLP — Sentence-BERT:
+   - all-MiniLM-L6-v2 produces 384d normalized vectors
+   - Cosine similarity = semantic relatedness
+   - Same meaning clusters together regardless of exact wording
+ PACIS/ICIS Research Categories:
+   IS Design Science, HCI, E-Commerce, Knowledge Management,
+   IT Governance, Digital Innovation, Social Computing, Analytics,
+   IS Security, Green IS, Health IS, IS Education, IT Strategy
+═══════════════════════════════════════════════════════════════
+ B&C PHASE 1: FAMILIARIZATION WITH THE DATA
+ "Reading and re-reading, noting initial ideas"
+ Tool: load_scopus_csv
+═══════════════════════════════════════════════════════════════
+CRITICAL ERROR HANDLING:
+- If message says "[No CSV uploaded yet]" → respond:
+  "📂 Please upload your Scopus CSV file first using the upload
+   button at the top. Then type 'Run abstract only' to begin."
+  DO NOT call any tools. DO NOT guess filenames.
+- If a tool returns an error → explain the error clearly and
+  suggest what the researcher should do next.
+When researcher uploads CSV or says "analyze":
+1. Call load_scopus_csv(filepath) to inspect the data.
+2. DO NOT run BERTopic yet. Present the data landscape:
+   "📂 **Phase 1: Familiarization** (Braun & Clarke, 2006)
+   Loaded [N] papers (~[M] sentences estimated)
+   Columns: Title ✅ | Abstract ✅
+   Sentence-level approach: each abstract splits into ~10
+   sentences, each becomes a 384d vector. One paper can
+   contribute to MULTIPLE topics.
+   I will run 2 configurations:
+   1️⃣ **Abstract only** — what papers FOUND (findings, methods, results)
+   2️⃣ **Title only** — what papers CLAIM to be about (author's framing)
+   ⚙️ Defaults: threshold=0.7, cosine AgglomerativeClustering, 5 nearest
+   **Ready to proceed to Phase 2?**
+   • `run` — execute BERTopic discovery
+   • `run abstract` — single config
+   • `change threshold to 0.65` — more topics (stricter grouping)
+   • `change threshold to 0.8` — fewer topics (looser grouping)"
+3. WAIT for researcher confirmation before proceeding.
+═══════════════════════════════════════════════════════════════
+ B&C PHASE 2: GENERATING INITIAL CODES
+ "Systematically coding interesting features across the dataset"
+ Tools: run_bertopic_discovery → label_topics_with_llm
+═══════════════════════════════════════════════════════════════
+After researcher confirms:
+1. Call run_bertopic_discovery(run_key, threshold)
+   → Splits papers into sentences (regex, min 30 chars)
+   → Filters publisher boilerplate (copyright, license text)
+   → Embeds with all-MiniLM-L6-v2 (384d, L2-normalized)
+   → AgglomerativeClustering cosine (no UMAP, no dimension reduction)
+   → Finds 5 nearest centroid sentences per topic
+   → Saves Plotly HTML visualizations
+   → Saves embeddings + summaries checkpoints
+2. Immediately call label_topics_with_llm(run_key)
+   → Sends ALL topics with 5 evidence sentences to Mistral
+   → Returns: label + research area + confidence.
+   NOTE: NO PACIS categories in Phase 2. PACIS comparison comes in Phase 5.5.
+3. Present CODED data with EVIDENCE under each topic:
+   "📋 **Phase 2: Initial Codes** — [N] codes from [M] sentences
+   **Code 0: Smart Tourism AI** [IS Design, high, 150 sent, 45 papers]
+    Evidence (5 nearest centroid sentences):
+     → "Neural networks predict tourist behavior..." — _Paper #42_
+     → "AI-powered systems optimize resource allocation..." — _Paper #156_
+     → "Deep learning models demonstrate superior accuracy..." — _Paper #78_
+     → "Machine learning classifies visitor patterns..." — _Paper #201_
+     → "ANN achieves 92% accuracy in demand forecasting..." — _Paper #89_
+   **Code 1: VR Destination Marketing** [HCI, high, 67 sent, 18 papers]
+    Evidence:
+     → ...
+   📊 4 Plotly visualizations saved (download below)
+   **Review these codes. Ready for Phase 3 (theme search)?**
+   • `approve` — codes look good, move to theme grouping
+   • `re-run 0.65` — re-run with stricter threshold (more topics)
+   • `re-run 0.8` — re-run with looser threshold (fewer topics)
+   • `show topic 4 papers` — see all paper titles in topic 4
+   • `code 2 looks wrong` — I will show why it was labeled that way
+   📋 **Review Table columns explained:**
+   | Column | Meaning |
+   |--------|---------|
+   | # | Topic number |
+   | Topic Label | AI-generated name from 5 nearest sentences |
+   | Research Area | General research area (NOT PACIS — that comes later in Phase 5.5) |
+   | Confidence | How well the 5 sentences match the label |
+   | Sentences | Number of sentences clustered here |
+   | Papers | Number of unique papers contributing sentences |
+   | Approve | Edit: yes/no — keep or reject this topic |
+   | Rename To | Edit: type new name if label is wrong |
+   | Your Reasoning | Edit: why you renamed/rejected |"
+4. ⛔ STOP HERE. Do NOT auto-proceed.
+   Say: "Codes generated. Review the table below.
+   Edit Approve/Rename columns, then click Submit Review to Agent."
+5. If researcher types "show topic X papers":
+   → Load summaries.json from checkpoint
+   → Find topic X
+   → List ALL paper titles in that topic (from paper_titles field)
+   → Format as numbered list:
+     "📄 **Topic 4: AI in Tourism** — 64 papers:
+      1. Neural networks predict tourist behavior...
+      2. Deep learning for hotel revenue management...
+      3. AI-powered recommendation systems...
+      ...
+      Want to see the 5 key evidence sentences? Type `show topic 4`"
+6. If researcher types "show topic X":
+   → Show the 5 nearest centroid sentences with full paper titles
+7. If researcher questions a code:
+   → Show the 5 sentences that generated the label
+   → Explain reasoning: "AgglomerativeClustering groups sentences
+     where cosine distance < threshold. These sentences share
+     semantic proximity in 384d space even if keywords differ."
+   → Offer re-run with adjusted parameters
+═══════════════════════════════════════════════════════════════
+ B&C PHASE 3: SEARCHING FOR THEMES
+ "Collating codes into potential themes"
+ Tool: consolidate_into_themes
+═══════════════════════════════════════════════════════════════
+After researcher approves Phase 2 codes:
+1. ANALYZE the labeled codes yourself. Look for:
+   → Codes with the SAME research area → likely one theme
+   → Codes with overlapping keywords in evidence → related
+   → Codes with shared papers across clusters → connected
+   → Codes that are sub-aspects of a broader concept → merge
+   → Codes that are niche/distinct → keep standalone
+2. Present MAPPING TABLE with reasoning:
+   "🔍 **Phase 3: Searching for Themes** (Braun & Clarke, 2006)
+   I analyzed [N] codes and propose [M] themes:
+   | Code (Phase 2)                  | → | Proposed Theme        | Reasoning                    |
+   |---------------------------------|---|-----------------------|------------------------------|
+   | Code 0: Neural Network Tourism  | → | AI & ML in Tourism    | Same research area,          |
+   | Code 1: Deep Learning Predict.  | → | AI & ML in Tourism    | shared methodology,          |
+   | Code 5: ML Revenue Management   | → | AI & ML in Tourism    | Papers #42,#78 in all 3      |
+   | Code 2: VR Destination Mktg     | → | VR & Metaverse        | Both HCI category,           |
+   | Code 3: Metaverse Experiences   | → | VR & Metaverse        | 'virtual reality' overlap    |
+   | Code 4: Instagram Tourism       | → | Social Media (alone)  | Distinct platform focus      |
+   | Code 8: Green Tourism           | → | Sustainability (alone)| Niche, no overlap            |
+   **Do you agree?**
+   • `agree` — consolidate as shown
+   • `group 4 6 call it Digital Marketing` — custom grouping
+   • `move code 5 to standalone` — adjust
+   • `split AI theme into two` — more granular"
+3. ⛔ STOP HERE. Do NOT proceed to Phase 4.
+   Say: "Review the consolidated themes in the table below.
+   Edit Approve/Rename columns, then click Submit Review to Agent."
+   WAIT for the researcher's Submit Review.
+4. ONLY after explicit approval, call:
+   consolidate_into_themes(run_key, {"AI & ML": [0,1,5], "VR": [2,3], ...})
+5. Present consolidated themes with NEW centroid evidence:
+   "🎯 **Themes consolidated** (new centroids computed)
+   **Theme: AI & ML in Tourism** (294 sent, 83 papers)
+    Merged from: Codes 0, 1, 5
+    New evidence (recalculated after merge):
+     → "Neural networks predict tourist behavior..." — _Paper #42_
+     → "Deep learning optimizes hotel pricing..." — _Paper #78_
+     → ...
+   ✅ Themes look correct? Or adjust?"
+═══════════════════════════════════════════════════════════════
+ B&C PHASE 4: REVIEWING THEMES
+ "Checking if themes work in relation to coded extracts
+  and the entire data set"
+ Tool: (conversation — no tool call, agent reasons)
+═══════════════════════════════════════════════════════════════
+After consolidation, perform SATURATION CHECK:
+1. Analyze ALL theme pairs for remaining merge potential:
+   "🔍 **Phase 4: Reviewing Themes** — Saturation Analysis
+   | Theme A      | Theme B      | Overlap | Merge? | Why                |
+   |-------------|-------------|---------|--------|--------------------|
+   | AI & ML     | VR Tourism  | None    | ❌     | Different domains   |
+   | AI & ML     | ChatGPT     | Low     | ❌     | GenAI ≠ predictive |
+   | Social Media| VR Tourism  | None    | ❌     | Different channels  |
+2. If NO themes can merge:
+   "⛔ **Saturation reached** (per Braun & Clarke, 2006:
+    'when refinements are not adding anything substantial, stop')
+    Reasoning:
+    1. No remaining themes share a research area
+    2. No keyword overlap between any theme pair
+    3. Evidence sentences are semantically distinct
+    4. Further merging would lose research distinctions
+    **Do you agree iteration is complete?**
+    • `agree` — finalize, move to Phase 5
+    • `try merging X and Y` — override my recommendation"
+3. If themes CAN still merge:
+   "🔄 **Further consolidation possible:**
+    Themes 'Social Media' and 'Digital Marketing' share 3 keywords.
+    Suggest merging. Want me to consolidate?"
+4. ⛔ STOP HERE. Do NOT proceed to Phase 5.
+   Say: "Saturation analysis complete. Review themes in the table.
+   Edit Approve/Rename columns, then click Submit Review to Agent."
+═══════════════════════════════════════════════════════════════
+ B&C PHASE 5: DEFINING AND NAMING THEMES
+ "Generating clear definitions and names"
+ Tool: (conversation — agent + researcher co-create)
+═══════════════════════════════════════════════════════════════
+After saturation confirmed:
+1. Present final theme definitions:
+   "📝 **Phase 5: Theme Definitions**
+   **Theme 1: AI & Machine Learning in Tourism**
+    Definition: Research applying predictive ML/DL methods
+    (neural networks, random forests, deep learning) to tourism
+    problems including demand forecasting, pricing optimization,
+    and visitor behavior classification.
+    Scope: 294 sentences across 83 papers.
+    Research area: technology adoption. Confidence: High.
+   **Theme 2: Virtual Reality & Metaverse Tourism**
+    Definition: ...
+   **Want to rename any theme? Adjust any definition?**"
+2. ⛔ STOP HERE. Do NOT proceed to Phase 5.5 or second run.
+   Say: "Final theme names ready. Review in the table below.
+   Edit Rename To column if any names need changing, then click Submit Review."
+3. ONLY after approval: repeat ALL of Phase 2-5 for the SECOND run config.
+   (If first run was "abstract", now run "title" — or vice versa)
+═══════════════════════════════════════════════════════════════
+ PHASE 5.5: TAXONOMY COMPARISON
+ "Grounding themes against established IS research categories"
+ Tool: compare_with_taxonomy
+═══════════════════════════════════════════════════════════════
+After BOTH runs have finalized themes (Phase 5 complete for each):
+1. Call compare_with_taxonomy(run_key) for each completed run.
+   → Mistral maps each theme to PAJAIS taxonomy (Jiang et al., 2019)
+   → Flags themes as MAPPED (known category) or NOVEL (emerging)
+2. Present the mapping with researcher review:
+   "📚 **Phase 5.5: Taxonomy Comparison** (Jiang et al., 2019)
+   **Mapped to established PAJAIS categories:**
+   | Your Theme | → | PAJAIS Category | Confidence | Reasoning |
+   |---|---|---|---|---|
+   | AI & ML in Tourism | → | Business Intelligence & Analytics | high | ML/DL methods for prediction |
+   | VR & Metaverse | → | Human Behavior & HCI | high | Immersive technology interaction |
+   | Social Media Tourism | → | Social Media & Business Impact | high | Direct category match |
+   **🆕 NOVEL themes (not in existing PAJAIS taxonomy):**
+   | Your Theme | Status | Reasoning |
+   |---|---|---|
+   | ChatGPT in Tourism | 🆕 NOVEL | Generative AI is post-2019, not in taxonomy |
+   | Sustainable AI Tourism | 🆕 NOVEL | Cross-cuts Green IT + Analytics |
+   These NOVEL themes represent **emerging research areas** that
+   extend beyond the established PAJAIS classification.
+   **Researcher: Review this mapping.**
+   • `approve` — mapping is correct
+   • `theme X should map to Y instead` — adjust
+   • `merge novel themes into one` — consolidate emerging themes
+   • `this novel theme is actually part of [category]` — reclassify"
+3. ⛔ STOP HERE. Do NOT proceed to Phase 6.
+   Say: "PAJAIS taxonomy mapping complete. Review in the table below.
+   Edit Approve column for any mappings you disagree with, then click Submit Review."
+4. ONLY after approval, ask:
+   "Want me to consolidate any novel themes with existing ones?
+    Or keep them separate as evidence of emerging research areas?"
+5. ⛔ STOP AGAIN. WAIT for this answer before generating report.
+═══════════════════════════════════════════════════════════════
+ B&C PHASE 6: PRODUCING THE REPORT
+ "Selection of vivid, compelling extract examples"
+ Tools: generate_comparison_csv → export_narrative
+═══════════════════════════════════════════════════════════════
+After BOTH run configs have finalized themes:
+1. Call generate_comparison_csv()
+   → Compares themes across abstract vs title configs
+2. Say briefly in chat:
+   "Cross-run comparison complete. Check the Download tab for:
+    • comparison.csv — abstract vs title themes side by side
+    Review the themes in the table below.
+    Click Submit Review to confirm, then I'll generate the narrative."
+3. ⛔ STOP. Wait for Submit Review.
+4. After approval, call export_narrative(run_key)
+   → Mistral writes 500-word paper section referencing:
+     methodology, B&C phases, key themes, limitations
+═══════════════════════════════════════════════════════════════
+ CRITICAL RULES
+═══════════════════════════════════════════════════════════════
+ - ALWAYS follow B&C phases in order. Name each phase explicitly.
+ - ALWAYS wait for researcher confirmation between phases.
+ - ALWAYS show evidence sentences with paper metadata.
+ - ALWAYS cite B&C (2006) when discussing iteration or saturation.
+ - ALWAYS cite Grootendorst (2022) when explaining cluster behavior.
+ - ALWAYS call label_topics_with_llm before presenting topic labels.
+ - ALWAYS call compare_with_taxonomy before claiming PAJAIS mappings.
+ - Use threshold=0.7 as default (lower = more topics, higher = fewer).
+ - If too many topics (>200), suggest increasing threshold to 0.8.
+ - If too few topics (<20), suggest decreasing threshold to 0.6.
+ - NEVER skip Phase 4 saturation check or Phase 5.5 taxonomy comparison.
+ - NEVER proceed to Phase 6 without both runs completing Phase 5.5.
+ - NEVER invent topic labels — only present labels returned by Tool 3.
+ - NEVER cite paper IDs, titles, or sentences from memory — only from tool output.
+ - NEVER claim a theme is NOVEL or MAPPED without calling Tool 5 first.
+ - NEVER fabricate sentence counts or paper counts — only use tool-reported numbers.
+ - If a tool returns an error, explain clearly and continue.
+ - Keep responses concise. Tables + evidence, not paragraphs.
+Current date: """ + datetime.now().strftime("%Y-%m-%d")
+# Tool loader
+def get_local_tools():
+    from tools import get_all_tools
+    return get_all_tools()
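
Reviewer sketch (not part of the commit): the prompt states that `run_bertopic_discovery` splits papers into sentences with a regex, keeps only sentences of at least 30 characters, and filters publisher boilerplate. The actual patterns are not shown in this diff, so the following is only a plausible shape for that step. The split regex, the `BOILERPLATE` pattern, and the function name `split_sentences` are all assumptions made for illustration.

```python
import re

MIN_CHARS = 30  # minimum sentence length stated in the system prompt

# Hypothetical boilerplate markers; the commit does not show the real list.
BOILERPLATE = re.compile(
    r"©|\(c\)|copyright|all rights reserved|creative commons|licen[cs]e",
    re.IGNORECASE,
)


def split_sentences(abstract: str) -> list[str]:
    """Split an abstract on sentence-ending punctuation, then drop
    fragments that are too short or look like publisher boilerplate."""
    parts = re.split(r"(?<=[.!?])\s+", abstract.strip())
    kept = []
    for part in parts:
        sentence = part.strip()
        if len(sentence) < MIN_CHARS:
            continue
        if BOILERPLATE.search(sentence):
            continue
        kept.append(sentence)
    return kept


example = (
    "Neural networks predict tourist behavior in coastal regions. "
    "Short one. "
    "© 2024 Elsevier Ltd. All rights reserved."
)
print(split_sentences(example))
```

A keyword filter this broad will also drop legitimate sentences that mention licensing, so a production filter would likely need tighter patterns.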

app.py ADDED

	@@ -0,0 +1,173 @@

+import os
+import glob
+import json
+import plotly.io as pio
+import gradio as gr
+from dotenv import load_dotenv
+from langchain_mistralai import ChatMistralAI
+from langgraph.prebuilt import create_react_agent
+from langgraph.checkpoint.memory import MemorySaver
+from agent import SYSTEM_PROMPT, get_local_tools
+os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
+load_dotenv()
+OUTPUT_DIR = "outputs"
+CHECKPOINT_DIR = os.path.join(OUTPUT_DIR, "checkpoints")
+os.makedirs(CHECKPOINT_DIR, exist_ok=True)
+llm = ChatMistralAI(model="mistral-small-latest", temperature=0, timeout=300)
+agent = create_react_agent(model=llm, tools=get_local_tools(), prompt=SYSTEM_PROMPT, checkpointer=MemorySaver())
+_msg_count = 0
+_uploaded = {"path": ""}
+def _latest_output():
+    # Rank output files by pipeline stage so later-stage artifacts sort last.
+    # (Renamed from `ord`, which shadowed the built-in function.)
+    order = {"summaries": 1, "labels": 2, "themes": 3, "taxonomy": 4, "comparison": 9, "narrative": 10}
+    fs = glob.glob(f"{OUTPUT_DIR}/rq4_*.csv") + glob.glob(f"{CHECKPOINT_DIR}/rq4_*.json")
+    scored = sorted([(sum(v * (k in f) for k, v in order.items()), f) for f in fs], key=lambda x: x[0])
+    return [x[1] for x in scored] or None
+def _build_progress():
+    ps = [
+        ("Load", bool(glob.glob(f"{CHECKPOINT_DIR}/rq4_*_summaries.json"))),
+        ("Codes", bool(glob.glob(f"{CHECKPOINT_DIR}/rq4_*_labels.json"))),
+        ("Themes", bool(glob.glob(f"{CHECKPOINT_DIR}/rq4_*_themes.json"))),
+        ("PAJAIS", bool(glob.glob(f"{CHECKPOINT_DIR}/rq4_*_taxonomy_map.json"))),
+        ("Report", bool(glob.glob(f"{OUTPUT_DIR}/rq4_comparison.csv"))),
+    ]
+    return " → ".join(f"{'✅' if d else '⬜'} {n}" for n, d in ps)
+def respond(message, chat_history, uploaded_file):
+    global _msg_count
+    _msg_count += 1
+    _uploaded["path"] = uploaded_file or _uploaded.get("path", "")
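+    # The marker must match the "[No CSV uploaded yet]" string that SYSTEM_PROMPT checks for.
+    text = (message or "Analyze") + (f"\n[CSV: {_uploaded['path']}]" if _uploaded["path"] else "\n[No CSV uploaded yet]")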
+    chat_history.append({"role": "user", "content": message or "Analyze"})
+    chat_history.append({"role": "assistant", "content": "🔬 **Working...**"})
+    yield chat_history, "", _latest_output()
+    res = agent.invoke({"messages": [("human", text)]}, config={"configurable": {"thread_id": "session"}})
+    chat_history[-1] = {"role": "assistant", "content": res["messages"][-1].content}
+    yield chat_history, "", _latest_output()
+def _load_chart(name):
+    path = os.path.join(OUTPUT_DIR, name) if name else ""
+    if not path or not os.path.exists(path):
+        return None
+    with open(path) as fh:  # close the file handle instead of leaking it
+        return pio.from_json(fh.read())
+def _get_chart_choices():
+    return [os.path.basename(f) for f in sorted(glob.glob(f"{OUTPUT_DIR}/rq4_*.json"))]
+def _load_review_table():
+    ps = sorted(glob.glob(f"{CHECKPOINT_DIR}/rq4_*.json"))
+    if not ps:
+        return [[0, "No data", "", 0, 0, False, "", ""]]
+    with open(ps[-1]) as fh:
+        data = json.load(fh)
+    # `or [{}]` guards against a topic whose "nearest" list is present but empty.
+    return [[i, d.get("label", d.get("top_words", ""))[:60], (d.get("nearest") or [{}])[0].get("sentence", "")[:120], d.get("sentence_count", 0), d.get("paper_count", 0), True, "", ""] for i, d in enumerate(data)]
+def _show_papers_by_select(table_data, evt: gr.SelectData):
+    idx = int(table_data.iloc[evt.index[0], 0]) if hasattr(table_data, 'iloc') else int(table_data[evt.index[0]][0])
+    fs = sorted(glob.glob(f"{CHECKPOINT_DIR}/rq4_*_labels.json")) or sorted(glob.glob(f"{CHECKPOINT_DIR}/rq4_*_summaries.json"))
+    for f in fs:
+        for t in json.load(open(f)):
+            if t.get("topic_id") == idx:
+                return f"Topic {idx}: {t.get('label', '')}\n\n" + "\n".join(f"- {p}" for p in t.get("paper_titles", []))
+    return "Not found"
+def _submit_review(table_data, chat_history):
+    ls = [f"Topic {int(r[0])}: {'RENAME to '+r[6] if r[6] else ('APPROVE' if r[5] else 'REJECT')}" for r in table_data.values.tolist()]
+    msg = "Review decisions:\n" + "\n".join(ls)
+    chat_history.append({"role": "user", "content": "Submitted review"})
+    chat_history.append({"role": "assistant", "content": "🔬 **Processing...**"})
+    yield chat_history, _latest_output(), gr.update(), gr.update(), _build_progress()
+    res = agent.invoke({"messages": [("human", msg)]}, config={"configurable": {"thread_id": "session"}})
+    chat_history[-1] = {"role": "assistant", "content": res["messages"][-1].content}
+    yield chat_history, _latest_output(), gr.update(choices=_get_chart_choices()), _load_review_table(), _build_progress()
+CSS = """
+.gradio-container { background: #fcfcfc !important; }
+.sidebar { background: #ffffff !important; border-right: 1px solid #e2e8f0 !important; }
+.header-text { font-family: 'Outfit', sans-serif; color: #1e293b; letter-spacing: -0.02em; }
+.tab-nav { border-bottom: 2px solid #f1f5f9 !important; }
+.chatbot-container { border-radius: 12px !important; border: 1px solid #e2e8f0 !important; overflow: hidden; }
+.primary-btn { background: #4f46e5 !important; color: white !important; border-radius: 8px !important; font-weight: 600 !important; }
+.secondary-btn { background: #f8fafc !important; color: #475569 !important; border: 1px solid #e2e8f0 !important; border-radius: 8px !important; }
+"""
+theme = gr.themes.Soft(
+    primary_hue="indigo",
+    secondary_hue="violet",
+    neutral_hue="slate",
+    font=gr.themes.GoogleFont("Outfit"),
+    font_mono=gr.themes.GoogleFont("JetBrains Mono"),
+).set(
+    body_background_fill="*neutral_50",
+    block_title_text_weight="700",
+    button_primary_background_fill="*primary_600",
+    button_primary_text_color="white",
+)
+with gr.Blocks(title="Thematic Analysis AI", theme=theme, css=CSS) as demo:
+    with gr.Sidebar(label="Data Hub", open=True):
+        gr.HTML("<h2 class='header-text'>📁 Resource Center</h2>")
+        upload = gr.File(label="Dataset (Scopus CSV)", file_types=[".csv"], elem_id="file-upload")
+        progress = gr.Markdown(value=_build_progress(), elem_id="progress-display")
+        gr.Markdown("---")  # horizontal rule as a visual separator
+        gr.Markdown("### 🛠️ Configuration\nModel: `mistral-small-latest`\nPipeline: `BERTopic + Agglomerative`")
+    gr.HTML("<h1 class='header-text' style='margin-bottom: 20px;'>🔬 Topic Modelling Agentic AI</h1>")
+    with gr.Tabs():
+        with gr.Tab("💬 Agent Chat"):
+            chatbot = gr.Chatbot(height=450, show_label=False, elem_classes="chatbot-container")
+            with gr.Row():
+                msg = gr.Textbox(placeholder="Ask the agent to analyze, group, or export...", show_label=False, scale=9)
+                send = gr.Button("Send", variant="primary", scale=1, elem_classes="primary-btn")
+        with gr.Tab("📋 Review & Refine"):
+            gr.Markdown("### 🔍 Topic Validation Table\nReview the identified themes and rename or reject as needed.")
+            table = gr.Dataframe(headers=["#", "Label", "Key Evidence", "Sents", "Papers", "Approve", "Rename", "Reasoning"], datatype=["number", "str", "str", "number", "number", "bool", "str", "str"], interactive=True)
+            with gr.Row():
+                submit = gr.Button("Submit Review Decisions", variant="primary", scale=2, elem_classes="primary-btn")
+                clear = gr.Button("Refresh Table", variant="secondary", scale=1, elem_classes="secondary-btn")
+            papers = gr.Textbox(label="Full Context: Papers in Selected Topic", lines=6, interactive=False)
+        with gr.Tab("📊 Visual Analytics"):
+            gr.Markdown("### 📈 Interactive Topic Visualizations")
+            with gr.Row():
+                selector = gr.Dropdown(choices=[], label="Select Visualization Type", scale=7)
+                refresh_viz = gr.Button("Refresh Charts", variant="secondary", scale=1)
+            display = gr.Plot()
+        with gr.Tab("📥 Export Control"):
+            gr.Markdown("### 💾 Final Outputs\nDownload generated papers, narratives, and comparison matrices.")
+            download = gr.File(label="Available Exports", file_count="multiple")
+    def respond_with_viz(m, h, u):
+        g = respond(m, h, u)
+        for hist, _, dl in g:
+            cs = _get_chart_choices()
+            yield hist, "", dl, gr.update(choices=cs, value=cs[-1] if cs else None), _load_chart(cs[-1]) if cs else None, _load_review_table(), _build_progress()
+    msg.submit(respond_with_viz, [msg, chatbot, upload], [chatbot, msg, download, selector, display, table, progress])
+    send.click(respond_with_viz, [msg, chatbot, upload], [chatbot, msg, download, selector, display, table, progress])
+    selector.change(_load_chart, [selector], [display])
+    table.select(_show_papers_by_select, [table], [papers])
+    submit.click(_submit_review, [table, chatbot], [chatbot, download, selector, table, progress])
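+    def _analyze_upload(f, h): yield from respond_with_viz("Analyze CSV", h, f)  # a named generator function, so each yielded update is streamed
+    upload.change(_analyze_upload, [upload, chatbot], [chatbot, msg, download, selector, display, table, progress])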
+if __name__ == "__main__":
+    demo.launch(server_name="0.0.0.0", server_port=7860, ssr_mode=False)
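
The event wiring above passes a streaming handler to several listeners. As far as I know, Gradio decides whether to iterate over a handler's output by checking whether the handler is a generator function, so a lambda that merely returns a generator would not be streamed. A minimal standard-library sketch of that distinction follows; `stream_updates`, `via_lambda`, and `via_def` are hypothetical names used only for illustration and are not part of app.py.

```python
import inspect

def stream_updates(prefix, n):
    """Stand-in for a streaming handler such as respond_with_viz (hypothetical body)."""
    for i in range(n):
        yield f"{prefix}-{i}"

# A lambda that returns a generator is an ordinary function, not a generator function.
via_lambda = lambda n: stream_updates("step", n)

def via_def(n):
    # `yield from` makes this wrapper a generator function in its own right.
    yield from stream_updates("step", n)

print(inspect.isgeneratorfunction(via_lambda))  # False
print(inspect.isgeneratorfunction(via_def))     # True
print(list(via_def(3)))                         # ['step-0', 'step-1', 'step-2']
```

Wrapping with a small `def` that uses `yield from` keeps the wrapper streamable while still letting it fix some of the arguments.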

requirements.txt ADDED Viewed

	@@ -0,0 +1,13 @@

+# requirements.txt v2.0 | 4 April 2026
+# BERTopic + Mistral LLM (French, Apache 2.0, GDPR-safe)
+langchain
+langchain-mistralai
+langgraph
+langchain-core
+bertopic
+sentence-transformers
+numpy
+pandas
+plotly
+kaleido
+gradio

tools.py ADDED Viewed

	@@ -0,0 +1,182 @@

+from langchain_core.tools import tool
+import os
+import json
+import re
+import numpy as np
+import pandas as pd
+CHECKPOINT_DIR = "/tmp/checkpoints"
+os.makedirs(CHECKPOINT_DIR, exist_ok=True)
+NEAREST_K = 5
+SENT_SPLIT_RE = r'(?<=[.!?])\s+(?=[A-Z])'
+MIN_SENT_LEN = 30
+RUN_CONFIGS = {"abstract": ["Abstract"], "title": ["Title"]}
+_data = {}
+def _split_sentences(text):
+    raw = re.split(SENT_SPLIT_RE, str(text))
+    return [s for s in raw if len(s.strip()) >= MIN_SENT_LEN]
+@tool
+def load_scopus_csv(filepath: str) -> str:
+    df = pd.read_csv(filepath, encoding="utf-8-sig")
+    _data["df"] = df
+    cols = [c for c in ["Title", "Abstract", "Author Keywords"] if c in df.columns]
+    sample = df[cols].head(3).to_string(max_colwidth=80)
+    nulls = ", ".join([f"{c}: {df[c].notna().sum()}/{len(df)}" for c in cols])
+    avg_sents = df["Abstract"].head(5).apply(_split_sentences).apply(len).mean() if "Abstract" in df.columns and len(df) else 0.0  # rough estimate from the first five abstracts
+    est = int(avg_sents * len(df))
+    return (f"📊 **Dataset Statistics:**\n"
+            f"- **Papers:** {len(df)}\n"
+            f"- **Abstract sentences:** ~{est}\n"
+            f"- **Title sentences:** {int(df['Title'].notna().sum()) if 'Title' in df.columns else 0}\n"
+            f"- **Non-null:** {nulls}\n\n"
+            f"Columns: {', '.join(list(df.columns)[:15])}\n\n"
+            f"Sample:\n{sample}")
+@tool
+def run_bertopic_discovery(run_key: str, threshold: float = 0.7) -> str:
+    from bertopic import BERTopic
+    from sentence_transformers import SentenceTransformer
+    from sklearn.preprocessing import FunctionTransformer
+    from sklearn.cluster import AgglomerativeClustering
+    df = _data["df"].copy()
+    available = [c for c in RUN_CONFIGS[run_key] if c in df.columns]
+    df["_text"] = df[available].fillna("").agg(" ".join, axis=1)
+    df["_paper_id"] = df.index
+    df["_sentences"] = df["_text"].apply(_split_sentences)
+    meta = [c for c in ["_paper_id", "Title", "Author Keywords", "_sentences"] if c in df.columns]
+    sent_df = df[meta].explode("_sentences").rename(columns={"_sentences": "text"}).dropna(subset=["text"]).reset_index(drop=True)
+    sent_df["sent_id"] = sent_df.groupby("_paper_id").cumcount()
+    patterns = r"Licensee MDPI|Published by Informa|Published by Elsevier|Taylor & Francis|Copyright ©|Creative Commons|open access article|Inderscience Enterprises|All rights reserved|Springer Nature|Emerald Publishing|limitations and (?:future|implications|discussed)|implications (?:are|were) (?:discussed|presented)|concludes with .* implications"  # non-capturing groups avoid the pandas match-group warning in str.contains
+    sent_df = sent_df[~sent_df["text"].str.contains(patterns, case=False, regex=True, na=False)].reset_index(drop=True)
+    embedder = SentenceTransformer("all-MiniLM-L6-v2")
+    embs = embedder.encode(sent_df["text"].tolist(), show_progress_bar=False, normalize_embeddings=True)
+    np.save(f"{CHECKPOINT_DIR}/rq4_{run_key}_emb.npy", embs)
+    cluster = AgglomerativeClustering(n_clusters=None, metric="cosine", linkage="average", distance_threshold=threshold)
+    model = BERTopic(hdbscan_model=cluster, umap_model=FunctionTransformer())
+    topics, _ = model.fit_transform(sent_df["text"].tolist(), embs)
+    _data[f"{run_key}_model"] = model
+    _data[f"{run_key}_topics"] = np.array(topics)
+    _data[f"{run_key}_embeddings"] = embs
+    _data[f"{run_key}_sent_df"] = sent_df
+    n = len(set(topics)) - int(-1 in topics)
+    if n >= 3: model.visualize_topics().write_html(f"/tmp/rq4_{run_key}_intertopic.html")
+    if n >= 1: model.visualize_barchart(top_n_topics=min(10, n)).write_html(f"/tmp/rq4_{run_key}_bars.html")
+    if n >= 2: model.visualize_hierarchy().write_html(f"/tmp/rq4_{run_key}_hierarchy.html")
+    if n >= 2: model.visualize_heatmap().write_html(f"/tmp/rq4_{run_key}_heatmap.html")
+    t_arr = np.array(topics)
+    valid = [r for r in model.get_topic_info().to_dict("records") if r["Topic"] != -1]
+    def _centroid(row):
+        mask = t_arr == row["Topic"]
+        m_idx = np.where(mask)[0]
+        m_embs = embs[mask]
+        cent = m_embs.mean(axis=0)
+        dists = 1 - (m_embs @ cent) / (np.linalg.norm(m_embs, axis=1) * np.linalg.norm(cent) + 1e-10)
+        near = np.argsort(dists)[:NEAREST_K]
+        evidence = [{"sentence": str(sent_df.iloc[m_idx[i]]["text"])[:250], "paper_id": int(sent_df.iloc[m_idx[i]]["_paper_id"]), "title": str(sent_df.iloc[m_idx[i]].get("Title", ""))[:150], "keywords": str(sent_df.iloc[m_idx[i]].get("Author Keywords", ""))[:150]} for i in near]
+        p_df = sent_df.iloc[m_idx].drop_duplicates(subset=["_paper_id"])
+        titles = [str(p_df.iloc[i].get("Title", ""))[:200] for i in range(min(50, len(p_df)))]
+        return {"topic_id": int(row["Topic"]), "sentence_count": int(row["Count"]), "paper_count": len(p_df), "top_words": str(row.get("Name", ""))[:100], "nearest": evidence, "paper_titles": titles}
+    sums = list(map(_centroid, valid))
+    with open(f"{CHECKPOINT_DIR}/rq4_{run_key}_summaries.json", "w") as f: json.dump(sums, f, indent=2, default=str)
+    lines = [f"  Topic {s['topic_id']} ({s['sentence_count']} sents, {s['paper_count']} papers): {s['top_words']}" for s in sums]
+    return f"[{run_key}] {n} topics from {len(sent_df)} sentences.\n\n" + "\n".join(lines)
+@tool
+def label_topics_with_llm(run_key: str) -> str:
+    from langchain_mistralai import ChatMistralAI
+    from langchain_core.prompts import PromptTemplate
+    from langchain_core.output_parsers import JsonOutputParser
+    with open(f"{CHECKPOINT_DIR}/rq4_{run_key}_summaries.json") as f: sums = json.load(f)
+    to_label = sorted(sums, key=lambda s: s.get("sentence_count", 0), reverse=True)[:100]
+    block = "\n\n".join([f"Topic {s['topic_id']} ({s['sentence_count']} sents):\n{NEAREST_K} entries:\n" + "\n".join([f"- {e['sentence']}\n  Paper: {e['title']}" for e in s["nearest"]]) for s in to_label])
+    prompt = PromptTemplate.from_template("Return JSON ARRAY of objects with topic_id, label, category, confidence, reasoning, niche for:\n{topics}")
+    llm = ChatMistralAI(model="mistral-small-latest", temperature=0)
+    labels = (prompt | llm | JsonOutputParser()).invoke({"topics": block})
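+    by_id = {str(l.get("topic_id")): l for l in (labels if isinstance(labels, list) else [labels]) if isinstance(l, dict)}  # match labels to topics by id, not by list position
+    labeled = [{**s, **by_id.get(str(s["topic_id"]), {}), "topic_id": s["topic_id"]} for s in sums]  # topics without a label keep their summary unchanged
+    with open(f"{CHECKPOINT_DIR}/rq4_{run_key}_labels.json", "w") as f: json.dump(labeled, f, indent=2, default=str)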
+    lines = [f"  **Topic {l.get('topic_id')}: {l.get('label')}** [{l.get('category')}] ({l.get('sentence_count')} sents)" for l in labeled]
+    return f"[{run_key}] {len(labeled)} topics labeled.\n\n" + "\n\n".join(lines)
+@tool
+def generate_comparison_csv() -> str:
+    done = [k for k in RUN_CONFIGS.keys() if os.path.exists(f"{CHECKPOINT_DIR}/rq4_{k}_labels.json")]
+    rows = []
+    for k in done:
+        with open(f"{CHECKPOINT_DIR}/rq4_{k}_labels.json") as f: ls = json.load(f)
+        rows.extend([{"run": k, "topic_id": l.get("topic_id"), "label": l.get("label"), "category": l.get("category"), "sentences": l.get("sentence_count"), "papers": l.get("paper_count")} for l in ls])
+    df = pd.DataFrame(rows)
+    df.to_csv("/tmp/rq4_comparison.csv", index=False)
+    return f"Saved to /tmp/rq4_comparison.csv\n\n{df.to_string(index=False)}"
+@tool
+def export_narrative(run_key: str) -> str:
+    from langchain_mistralai import ChatMistralAI
+    with open(f"{CHECKPOINT_DIR}/rq4_{run_key}_labels.json") as f: ls = json.load(f)
+    txt = "\n".join([f"- {l.get('label')} ({l.get('sentence_count')} sents)" for l in ls])
+    llm = ChatMistralAI(model="mistral-small-latest", temperature=0.3)
+    res = llm.invoke(f"Write a 500-word Section 7 'Topic Modeling Results' for {run_key} run:\n{txt}")
+    with open("/tmp/rq4_narrative.txt", "w", encoding="utf-8") as f: f.write(res.content)
+    return f"Saved to /tmp/rq4_narrative.txt\n\n{res.content}"
+@tool
+def consolidate_into_themes(run_key: str, theme_map: dict) -> str:
+    t_arr, embs, s_df = _data[f"{run_key}_topics"], _data[f"{run_key}_embeddings"], _data[f"{run_key}_sent_df"]
+    def _build(name, ids):
+        mask = np.isin(t_arr, ids)
+        m_idx, m_embs = np.where(mask)[0], embs[mask]
+        cent = m_embs.mean(axis=0)
+        dists = 1 - (m_embs @ cent) / (np.linalg.norm(m_embs, axis=1) * np.linalg.norm(cent) + 1e-10)
+        near = np.argsort(dists)[:NEAREST_K]
+        evidence = [{"sentence": str(s_df.iloc[m_idx[i]]["text"])[:250], "title": str(s_df.iloc[m_idx[i]].get("Title", ""))[:150]} for i in near]
+        return {"label": name, "merged_topics": list(ids), "sentence_count": int(mask.sum()), "paper_count": int(s_df.iloc[m_idx]["_paper_id"].nunique()), "nearest": evidence}
+    themes = [{"topic_id": i, **_build(n, ids)} for i, (n, ids) in enumerate(theme_map.items())]
+    with open(f"{CHECKPOINT_DIR}/rq4_{run_key}_themes.json", "w") as f: json.dump(themes, f, indent=2, default=str)
+    lines = [f"  **{t['label']}** ({t['sentence_count']} sents)" for t in themes]
+    return f"[{run_key}] {len(themes)} themes.\n\n" + "\n".join(lines)
+PAJAIS = ["Electronic Business", "HCI", "IS Strategy", "Business Intelligence", "Design Science", "Enterprise Systems", "Adoption", "Social Media", "Cultural Issues", "Security", "Smart/IoT", "Knowledge Management", "Digital Platform", "Healthcare", "Project Management", "Service Science", "Social/Org Aspects", "Research Methods", "E-Finance", "E-Government", "Education", "Sustainability"]
+@tool
+def compare_with_taxonomy(run_key: str) -> str:
+    from langchain_mistralai import ChatMistralAI
+    from langchain_core.prompts import PromptTemplate
+    from langchain_core.output_parsers import JsonOutputParser
+    src = f"{CHECKPOINT_DIR}/rq4_{run_key}_themes.json" if os.path.exists(f"{CHECKPOINT_DIR}/rq4_{run_key}_themes.json") else f"{CHECKPOINT_DIR}/rq4_{run_key}_labels.json"  # prefer consolidated themes when present
+    with open(src) as f: ts = json.load(f)
+    prompt = PromptTemplate.from_template("Map themes to PAJAIS taxonomy or mark 'NOVEL'. Return JSON array for:\nThemes:\n{ts}\nTaxonomy:\n{tax}")
+    llm = ChatMistralAI(model="mistral-small-latest", temperature=0)
+    ms = (prompt | llm | JsonOutputParser()).invoke({"ts": "\n".join([str(t.get("label") or f"Topic {t.get('topic_id')}") for t in ts]), "tax": "\n".join(PAJAIS)})  # unlabeled topics fall back to their id
+    with open(f"{CHECKPOINT_DIR}/rq4_{run_key}_taxonomy_map.json", "w") as f: json.dump(ms, f, indent=2, default=str)
+    return f"[{run_key}] Mapping complete."
+def get_all_tools():
+    ts = [load_scopus_csv, run_bertopic_discovery, label_topics_with_llm, consolidate_into_themes, compare_with_taxonomy, generate_comparison_csv, export_narrative]
+    for t in ts: t.handle_tool_error = True
+    return ts
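
The labeling step in `label_topics_with_llm` has to attach LLM output to the topic summaries, and an LLM may return fewer entries than requested, reorder them, or quote the ids as strings. One way to make that join robust is to key on `topic_id` rather than on list position. The sketch below is a standalone illustration under those assumptions; `merge_labels` is a hypothetical helper, not a function from tools.py, and the sample data is made up.

```python
def merge_labels(summaries, labels):
    """Attach each label to the summary that shares its topic_id."""
    # Index labels by topic_id as a string, since the model may return "1" or 1.
    by_id = {
        str(label["topic_id"]): label
        for label in labels
        if isinstance(label, dict) and "topic_id" in label
    }
    merged = []
    for summary in summaries:
        extra = by_id.get(str(summary["topic_id"]), {})
        # Re-assert the summary's own topic_id so its original type is preserved.
        merged.append({**summary, **extra, "topic_id": summary["topic_id"]})
    return merged

# Illustrative data only: two topics, but the model labeled just one of them.
summaries = [
    {"topic_id": 0, "sentence_count": 5},
    {"topic_id": 1, "sentence_count": 3},
]
labels = [{"topic_id": "1", "label": "Adoption"}]

merged = merge_labels(summaries, labels)
print(merged[0])  # {'topic_id': 0, 'sentence_count': 5}
print(merged[1])  # {'topic_id': 1, 'sentence_count': 3, 'label': 'Adoption'}
```

Topics the model skipped pass through unchanged, so downstream code can detect a missing `label` key instead of reading another topic's data.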