---
title: Topic Modelling Agentic AI
emoji: 🔬
colorFrom: indigo
colorTo: purple
sdk: gradio
app_file: app.py
pinned: false
---
# 🔬 Topic Modelling Agentic AI
An agent-driven platform for automated **Reflexive Thematic Analysis** (Braun & Clarke, 2006) of large academic datasets such as Scopus CSV exports. Built with LangGraph, BERTopic, and Mistral AI, the agent automates the discovery, labeling, and synthesis of research topics.
---
## Overview
This project implements a "Golden Thread" pipeline for qualitative research. Rather than stopping at keyword extraction, it combines sentence-level embeddings with LLM-based labeling to identify nuanced themes.
### Key Features
- **Agentic Workflow**: Powered by **LangGraph**, the agent autonomously decides when to load data, run clustering, or call the LLM for labeling.
- **Precision Clustering**: Uses **BERTopic** with agglomerative clustering (cosine similarity) on 384-dimensional sentence embeddings (`all-MiniLM-L6-v2`).
- **Human-in-the-Loop**: An interactive Gradio UI allows researchers to review, rename, or reject agent-generated topics before final synthesis.
- **Automated Synthesis**: Generates a 500-word research narrative and maps themes to established taxonomies (e.g., PAJAIS).
- **Rich Visualizations**: Interactive Plotly charts including Intertopic Distance Maps, Hierarchical Clustering, and Heatmaps.
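
To make the clustering feature above concrete, here is a dependency-free toy sketch of agglomerative clustering on cosine distance. It is not the project's code: in the app this step is delegated to BERTopic. The average-linkage rule, the `threshold` value, the function names, and the 3-dimensional vectors (standing in for 384-dimensional `all-MiniLM-L6-v2` embeddings) are all illustrative assumptions.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative_cosine(vectors, threshold):
    """Merge the two closest clusters until none are within `threshold`.

    Cluster distance is the mean pairwise cosine distance (average linkage).
    Returns a sorted list of clusters, each a sorted list of vector indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        best_pair, best_dist = None, float("inf")
        for a, b in combinations(range(len(clusters)), 2):
            dist = linkage(clusters[a], clusters[b])
            if dist < best_dist:
                best_pair, best_dist = (a, b), dist
        if best_dist > threshold:
            break  # remaining clusters are too dissimilar to merge
        a, b = best_pair
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return sorted(clusters)


# Toy "embeddings": sentences 0 and 1 point one way, 2 and 3 another.
embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
]
clusters = agglomerative_cosine(embeddings, threshold=0.3)
print(clusters)  # [[0, 1], [2, 3]]
```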
---
## 🛠️ Technology Stack
- **Framework**: [LangGraph](https://github.com/langchain-ai/langgraph) (Agentic logic & state management)
- **Engine**: [BERTopic](https://github.com/MaartenGr/BERTopic) (Topic Modeling pipeline)
- **LLM**: [Mistral AI](https://mistral.ai/) (`mistral-small-latest`)
- **Embeddings**: `sentence-transformers/all-MiniLM-L6-v2`
- **UI**: [Gradio 5.x](https://gradio.app/)
- **Data**: Pandas, NumPy, Scikit-Learn
---
## Methodology
The agent follows the **Braun & Clarke (2006)** six-phase thematic analysis framework:
1. **Familiarization**: Loading and preprocessing Scopus CSV metadata.
2. **Initial Coding**: Sentence-level clustering to identify "semantic atoms."
3. **Searching for Themes**: Aggregating clusters into broader research themes.
4. **Reviewing Themes**: Researcher validation via the Review Table.
5. **Defining and Naming**: Refined LLM labeling based on centroid-nearest evidence.
6. **Producing the Report**: Exporting narrative sections and comparison matrices.
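
Phase 5 refers to "centroid-nearest evidence". One plausible reading, sketched here under assumptions rather than taken from the project's code, is to average a cluster's embeddings into a centroid, rank the member sentences by cosine similarity to it, and hand the top few to the LLM as representative evidence. The helper `nearest_to_centroid`, the `top_k` parameter, and the toy sentences and 2-dimensional vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_to_centroid(sentences, embeddings, top_k=2):
    """Return the `top_k` sentences whose embeddings are closest to the cluster mean."""
    dim = len(embeddings[0])
    centroid = [sum(vec[d] for vec in embeddings) / len(embeddings) for d in range(dim)]
    ranked = sorted(
        zip(sentences, embeddings),
        key=lambda pair: cosine_similarity(pair[1], centroid),
        reverse=True,  # most similar first
    )
    return [sentence for sentence, _ in ranked[:top_k]]


# Toy cluster: two on-topic sentences and one outlier.
cluster_sentences = [
    "Firms adopt AI to cut reporting costs.",
    "AI adoption lowers the cost of compliance.",
    "The survey was conducted in 2021.",
]
cluster_embeddings = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
]
evidence = nearest_to_centroid(cluster_sentences, cluster_embeddings, top_k=2)
print(evidence)
```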
---
## 💻 Setup & Installation
### Prerequisites
- Python 3.10+
- Mistral AI API Key
### Installation
1. **Clone the repository**:
```bash
git clone https://github.com/your-repo/topic-modelling-agent.git
cd topic-modelling-agent
```
2. **Install dependencies**:
```bash
pip install -r requirements.txt
```
3. **Configure environment**:
Create a `.env` file in the root directory:
```env
MISTRAL_API_KEY=your_api_key_here
```
4. **Run the application**:
```bash
python app.py
```
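
Before launching, it can help to confirm that the key from step 3 is actually visible to the Python process. The sketch below is an assumption about how such a check might look, not code from `app.py`: `load_env_file` and `require_api_key` are hypothetical helpers, and a real project may well use the `python-dotenv` package instead of parsing the file by hand.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=VALUE lines from a .env file into os.environ.

    Variables already set in the environment are left untouched.
    """
    env_path = Path(path)
    if not env_path.exists():
        return
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def require_api_key(name="MISTRAL_API_KEY"):
    """Return the key, or fail early with a readable message."""
    value = os.environ.get(name, "")
    if not value or value == "your_api_key_here":
        raise RuntimeError(f"{name} is not set; add it to .env or export it.")
    return value


# Typical use near the top of a launcher script:
#   load_env_file()
#   api_key = require_api_key()
```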
---
## Usage
1. **Upload Data**: Drag and drop a Scopus CSV export.
2. **Initialize**: Type `Analyze my CSV` or `run abstract only` in the chat.
3. **Iterate**: Use the chat to refine topics (e.g., `group topics 5 and 10 into "Sustainability"`).
4. **Review**: Use the **Review Table** tab to approve or rename topics.
5. **Export**: Download the generated Narrative and Comparison CSV from the **Download** tab.
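
A quick pre-flight check on the export can save a failed run. The sketch below assumes the analysis needs the `Title` and `Abstract` columns, which Scopus writes to its CSV export when those fields are selected. The required-column set, the `missing_columns` helper, and the `scopus.csv` filename are assumptions, not taken from `app.py`.

```python
import csv
import io

REQUIRED_COLUMNS = {"Title", "Abstract"}  # assumed; adjust to what the app reads


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return sorted(set(required) - present)


# In-memory stand-in for an export that lacks the Abstract field.
sample = io.StringIO('"Authors","Title","Year"\n"Doe J.","A study","2020"\n')
print(missing_columns(sample))  # ['Abstract']

# With a real file (utf-8-sig tolerates a byte-order mark if one is present):
#   with open("scopus.csv", newline="", encoding="utf-8-sig") as fh:
#       print(missing_columns(fh))
```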
---
## License
This project is licensed under the MIT License - see the LICENSE file for details.