---
title: Topic Modelling Agentic AI
emoji: 🔬
colorFrom: indigo
colorTo: purple
sdk: gradio
app_file: app.py
pinned: false
---

# 🔬 Topic Modelling Agentic AI

An agent-driven platform for automated **Reflexive Thematic Analysis** (Braun & Clarke, 2006). Built with LangGraph, BERTopic, and Mistral AI, it automates the discovery, labeling, and synthesis of research topics from large academic datasets (e.g., Scopus CSV exports).

---

## 🚀 Overview

This project implements a "Golden Thread" pipeline for qualitative research. Rather than relying on keyword extraction alone, it clusters sentence-level embeddings and uses an LLM to label the resulting clusters, which helps surface nuanced themes.

### Key Features
- **Agentic Workflow**: Powered by **LangGraph**, the agent autonomously decides when to load data, run clustering, or call the LLM for labeling.
- **Precision Clustering**: Uses **BERTopic** with agglomerative clustering (cosine metric) on 384-dimensional sentence embeddings (`all-MiniLM-L6-v2`).
- **Human-in-the-Loop**: An interactive Gradio UI allows researchers to review, rename, or reject agent-generated topics before final synthesis.
- **Automated Synthesis**: Generates a 500-word research narrative and maps themes to established taxonomies (e.g., PAJAIS).
- **Rich Visualizations**: Interactive Plotly charts including Intertopic Distance Maps, Hierarchical Clustering, and Heatmaps.
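
To make the clustering feature above more concrete, here is a dependency-free toy sketch of agglomerative clustering under a cosine metric. It is not the project's code: the project delegates this step to BERTopic, and the average linkage, the `distance_threshold` parameter, and the function names below are assumptions chosen for illustration.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative_cosine(vectors, distance_threshold):
    """Average-linkage clustering (assumed linkage, for illustration).

    Starts with one cluster per vector and repeatedly merges the two
    closest clusters until the closest pair exceeds the threshold.
    Returns clusters as lists of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Mean pairwise cosine distance between members of two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (i, j), dist = min(
            (
                ((i, j), linkage(clusters[i], clusters[j]))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if dist > distance_threshold:
            break
        # i < j always holds, so deleting j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy 2-d "embeddings": two vectors near the x-axis, two near the y-axis.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(agglomerative_cosine(toy_vectors, distance_threshold=0.2))
```

With real 384-dimensional sentence embeddings the same idea applies, but a library implementation is far more efficient than this quadratic toy loop.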

---

## 🛠️ Technology Stack

- **Framework**: [LangGraph](https://github.com/langchain-ai/langgraph) (Agentic logic & state management)
- **Engine**: [BERTopic](https://github.com/MaartenGr/BERTopic) (Topic Modeling pipeline)
- **LLM**: [Mistral AI](https://mistral.ai/) (`mistral-small-latest`)
- **Embeddings**: `sentence-transformers/all-MiniLM-L6-v2`
- **UI**: [Gradio 5.x](https://gradio.app/)
- **Data**: Pandas, NumPy, Scikit-Learn

---

## 📋 Methodology

The agent follows the **Braun & Clarke (2006)** six-phase thematic analysis framework:

1. **Familiarization**: Loading and preprocessing Scopus CSV metadata.
2. **Initial Coding**: Sentence-level clustering to identify "semantic atoms."
3. **Searching for Themes**: Aggregating clusters into broader research themes.
4. **Reviewing Themes**: Researcher validation via the Review Table.
5. **Defining and Naming**: Refined LLM labeling based on centroid-nearest evidence.
6. **Producing the Report**: Exporting narrative sections and comparison matrices.
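
Phase 5 refers to "centroid-nearest evidence". One plausible reading, sketched below with the standard library only, is to pick the sentences whose embeddings lie closest to the cluster centroid and pass them to the LLM as evidence for the label. The function names and the `k` parameter are hypothetical; the project's actual selection logic may differ.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_to_centroid(sentences, embeddings, k=3):
    """Return the k sentences most similar to the cluster centroid."""
    center = centroid(embeddings)
    ranked = sorted(
        zip(sentences, embeddings),
        key=lambda pair: cosine_similarity(pair[1], center),
        reverse=True,
    )
    return [sentence for sentence, _ in ranked[:k]]


# Toy cluster with made-up 2-d embeddings.
cluster_sentences = ["sentence a", "sentence b", "sentence c"]
cluster_embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
print(nearest_to_centroid(cluster_sentences, cluster_embeddings, k=2))
```

Feeding only a handful of representative sentences to the LLM keeps prompts short and ties each label to text that a researcher can inspect during review.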

---

## 💻 Setup & Installation

### Prerequisites
- Python 3.10+
- Mistral AI API Key

### Installation

1.  **Clone the repository**:
    ```bash
    git clone https://github.com/your-repo/topic-modelling-agent.git
    cd topic-modelling-agent
    ```

2.  **Install dependencies**:
    ```bash
    pip install -r requirements.txt
    ```

3.  **Configure environment**:
    Create a `.env` file in the root directory:
    ```env
    MISTRAL_API_KEY=your_api_key_here
    ```

4.  **Run the application**:
    ```bash
    python app.py
    ```
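
Note that a `.env` file is not read into the process environment automatically; something has to load it. How `app.py` does this is not stated here, so the following is only a minimal standard-library sketch of parsing the file and failing early when the key is missing. The helper names `parse_env_text` and `get_mistral_api_key` are hypothetical.

```python
import os


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_mistral_api_key(env=None):
    """Return the API key, or raise with an actionable message."""
    env = os.environ if env is None else env
    key = env.get("MISTRAL_API_KEY", "").strip()
    if not key or key == "your_api_key_here":
        raise RuntimeError(
            "MISTRAL_API_KEY is not set. Add it to .env or export it in your shell."
        )
    return key


# Example: parse the file contents shown in step 3, then validate the key.
settings = parse_env_text("MISTRAL_API_KEY=abc123\n")
print(get_mistral_api_key(settings))
```

Checking for the placeholder value catches the common mistake of copying the example `.env` without editing it.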

---

## 📖 Usage

1.  **Upload Data**: Drag and drop a Scopus CSV export.
2.  **Initialize**: Type `Analyze my CSV` or `run abstract only` in the chat.
3.  **Iterate**: Use the chat to refine topics (e.g., `group topics 5 and 10 into "Sustainability"`).
4.  **Review**: Use the **Review Table** tab to approve or rename topics.
5.  **Export**: Download the generated Narrative and Comparison CSV from the **Download** tab.
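
For orientation, the sketch below shows roughly what an "abstract only" run has to do with the uploaded file before clustering: read the `Abstract` column and split each abstract into sentences. It uses only the standard library and is not the project's loader. The column name and the `[No abstract available]` placeholder follow the usual Scopus export layout but are assumptions here, and the regex sentence splitter is deliberately naive.

```python
import csv
import io
import re

# Placeholder Scopus typically writes when a record has no abstract (assumed).
MISSING_ABSTRACT = "[No abstract available]"


def load_abstract_sentences(csv_text, column="Abstract"):
    """Return a flat list of sentences from the given CSV column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    sentences = []
    for row in reader:
        abstract = (row.get(column) or "").strip()
        if not abstract or abstract == MISSING_ABSTRACT:
            continue
        # Naive split: break after ., ! or ? followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", abstract)
        sentences.extend(part.strip() for part in parts if part.strip())
    return sentences


sample = (
    "Title,Abstract\n"
    '"Paper A","Topic models help. They scale well."\n'
    '"Paper B","[No abstract available]"\n'
)
print(load_abstract_sentences(sample))
```

When reading a real export from disk, opening the file with `encoding="utf-8-sig"` avoids a stray byte-order mark in the first column name.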

---

## 📄 License

This project is licensed under the MIT License; see the LICENSE file for details.