---
title: Topic Modelling Agentic AI
emoji: 🔬
colorFrom: indigo
colorTo: purple
sdk: gradio
app_file: app.py
pinned: false
---

# 🔬 Topic Modelling Agentic AI

An agent-driven platform for automated **Reflexive Thematic Analysis** (Braun & Clarke, 2006) of large academic datasets such as Scopus CSV exports. Built with LangGraph, BERTopic, and Mistral AI, the agent automates the discovery, labeling, and synthesis of research topics.

---

## 🚀 Overview

This project implements a "Golden Thread" pipeline for qualitative research. Instead of relying on keyword extraction alone, it uses sentence-level embeddings and context-aware LLM labeling to identify nuanced themes.

### Key Features
- **Agentic Workflow**: Powered by **LangGraph**, the agent autonomously decides when to load data, run clustering, or call the LLM for labeling.
- **Precision Clustering**: Uses **BERTopic** with agglomerative clustering (cosine similarity) on 384-dimensional sentence embeddings (`all-MiniLM-L6-v2`).
- **Human-in-the-Loop**: An interactive Gradio UI allows researchers to review, rename, or reject agent-generated topics before final synthesis.
- **Automated Synthesis**: Generates a 500-word research narrative and maps themes to established taxonomies (e.g., PAJAIS).
- **Rich Visualizations**: Interactive Plotly charts including Intertopic Distance Maps, Hierarchical Clustering, and Heatmaps.
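
To make the clustering step concrete, here is a minimal pure-Python sketch of agglomerative clustering with a cosine-based distance. It is an illustration only, not the project's code: the real pipeline runs inside BERTopic, and the average linkage, the `threshold` value, and the toy 3-d vectors (standing in for the 384-dimensional embeddings) are all assumptions.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative_clusters(vectors, threshold=0.5):
    """Average-linkage agglomerative clustering with a distance cutoff.

    Starts with one cluster per vector and repeatedly merges the closest
    pair until no pair is closer than `threshold` (an assumed value).
    Returns a list of clusters, each a list of vector indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (a, b), best = min(
            (((a, b), linkage(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best >= threshold:
            break
        # a < b is guaranteed by combinations, so deleting b keeps a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy 3-d vectors standing in for 384-d sentence embeddings (hypothetical data).
toy_embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
]
clusters = agglomerative_clusters(toy_embeddings, threshold=0.5)
```

With these toy vectors the first two and the last two sentences end up in separate clusters. Using a distance cutoff rather than a fixed cluster count lets the number of topics follow the data, which is one plausible reason to prefer agglomerative clustering here.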

---

## 🛠️ Technology Stack

- **Framework**: [LangGraph](https://github.com/langchain-ai/langgraph) (Agentic logic & state management)
- **Engine**: [BERTopic](https://github.com/MaartenGr/BERTopic) (Topic Modeling pipeline)
- **LLM**: [Mistral AI](https://mistral.ai/) (`mistral-small-latest`)
- **Embeddings**: `sentence-transformers/all-MiniLM-L6-v2`
- **UI**: [Gradio 5.x](https://gradio.app/)
- **Data**: Pandas, NumPy, Scikit-Learn

---

## 📋 Methodology

The agent follows the **Braun & Clarke (2006)** six-phase thematic analysis framework:

1. **Familiarization**: Loading and preprocessing Scopus CSV metadata.
2. **Initial Coding**: Sentence-level clustering to identify "semantic atoms."
3. **Searching for Themes**: Aggregating clusters into broader research themes.
4. **Reviewing Themes**: Researcher validation via the Review Table.
5. **Defining and Naming**: Refined LLM labeling based on centroid-nearest evidence.
6. **Producing the Report**: Exporting narrative sections and comparison matrices.
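
Phase 5 mentions "centroid-nearest evidence". One way to read this is that the sentences closest to a cluster's mean embedding are passed to the LLM as representative examples. The sketch below illustrates that idea in plain Python; the function names, the `top_k` parameter, and the toy 2-d vectors are hypothetical and not taken from the project.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_to_centroid(sentences, vectors, top_k=2):
    """Return the `top_k` sentences whose vectors are most similar to the centroid."""
    center = centroid(vectors)
    ranked = sorted(
        zip(sentences, vectors),
        key=lambda pair: cosine_similarity(pair[1], center),
        reverse=True,
    )
    return [sentence for sentence, _ in ranked[:top_k]]


# Hypothetical cluster: two on-topic sentences and one outlier.
cluster_sentences = [
    "Green supply chains reduce emissions.",
    "Sustainable logistics lowers carbon output.",
    "The survey had 200 respondents.",
]
cluster_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
evidence = nearest_to_centroid(cluster_sentences, cluster_vectors, top_k=2)
```

Sending only the most central sentences keeps the labeling prompt short and keeps outliers from skewing the topic name.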

---

## 💻 Setup & Installation

### Prerequisites
- Python 3.10+
- Mistral AI API Key

### Installation

1.  **Clone the repository**:
    ```bash
    git clone https://github.com/your-repo/topic-modelling-agent.git
    cd topic-modelling-agent
    ```

2.  **Install dependencies**:
    ```bash
    pip install -r requirements.txt
    ```

3.  **Configure environment**:
    Create a `.env` file in the root directory:
    ```env
    MISTRAL_API_KEY=your_api_key_here
    ```

4.  **Run the application**:
    ```bash
    python app.py
    ```
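
For reference, the `.env` step can be reproduced with the standard library alone. This is a sketch under assumptions: the project may well use a package such as `python-dotenv` instead, and the helper names `load_env_file` and `get_mistral_api_key` are hypothetical.

```python
import os


def load_env_file(path=".env"):
    """Copy KEY=VALUE pairs from a .env-style file into os.environ.

    Blank lines, comment lines, and lines without '=' are skipped.
    Variables that are already set in the environment are not overwritten.
    """
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop surrounding whitespace and optional quotes around the value.
            os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def get_mistral_api_key():
    """Return MISTRAL_API_KEY or fail early with a readable error."""
    key = os.environ.get("MISTRAL_API_KEY", "")
    if not key or key == "your_api_key_here":
        raise RuntimeError(
            "MISTRAL_API_KEY is not set; add it to .env or the environment."
        )
    return key
```

Calling `load_env_file()` followed by `get_mistral_api_key()` at startup surfaces a missing or placeholder key before the first LLM request is made.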

---

## 📖 Usage

1.  **Upload Data**: Drag and drop a Scopus CSV export.
2.  **Initialize**: Type `Analyze my CSV` or `run abstract only` in the chat.
3.  **Iterate**: Use the chat to refine topics (e.g., `group topics 5 and 10 into "Sustainability"`).
4.  **Review**: Use the **Review Table** tab to approve or rename topics.
5.  **Export**: Download the generated Narrative and Comparison CSV from the **Download** tab.
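
As a rough picture of what happens after upload, the sketch below reads a Scopus-style CSV and splits the text into sentences using only the standard library. It assumes the export has `Title` and `Abstract` columns and that `run abstract only` means titles are left out; the function names and the naive sentence splitter are illustrative, not the project's implementation.

```python
import csv
import io
import re


def load_documents(handle, abstract_only=False):
    """Return one text string per CSV row.

    Assumes 'Title' and 'Abstract' column headers (an assumption about the
    export format). Rows that yield no text are skipped.
    """
    documents = []
    for row in csv.DictReader(handle):
        title = (row.get("Title") or "").strip()
        abstract = (row.get("Abstract") or "").strip()
        if abstract_only:
            text = abstract
        else:
            text = " ".join(part for part in (title, abstract) if part)
        if text:
            documents.append(text)
    return documents


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


# Hypothetical two-row export; the second record has no abstract.
sample = io.StringIO(
    "Title,Abstract\n"
    "Green logistics,Firms cut emissions. Costs fall over time.\n"
    "Empty record,\n"
)
docs = load_documents(sample, abstract_only=True)
sentences = [s for doc in docs for s in split_sentences(doc)]
```

In abstract-only mode the record without an abstract is dropped, so only sentences that carry analyzable text reach the embedding step.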

---

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.