Daksh C Jain
---
title: Topic Modelling Agentic AI
emoji: 🔬
colorFrom: indigo
colorTo: purple
sdk: gradio
app_file: app.py
pinned: false
---
# 🔬 Topic Modelling Agentic AI
An agent-driven platform for automated **Reflexive Thematic Analysis** (Braun & Clarke, 2006). Built with LangGraph, BERTopic, and Mistral AI, the agent automates the discovery, labeling, and synthesis of research topics from large academic datasets (e.g., Scopus CSV exports).
---
## 🚀 Overview
This project implements a "Golden Thread" pipeline for qualitative research. Rather than relying on keyword extraction alone, it combines sentence-level embeddings with LLM-based context awareness to identify nuanced themes.
### Key Features
- **Agentic Workflow**: Powered by **LangGraph**, the agent autonomously decides when to load data, run clustering, or call the LLM for labeling.
- **Precision Clustering**: Uses **BERTopic** with agglomerative clustering (cosine similarity) on 384-dimensional sentence embeddings (`all-MiniLM-L6-v2`).
- **Human-in-the-Loop**: An interactive Gradio UI allows researchers to review, rename, or reject agent-generated topics before final synthesis.
- **Automated Synthesis**: Generates a 500-word research narrative and maps themes to established taxonomies (e.g., PAJAIS).
- **Rich Visualizations**: Interactive Plotly charts including Intertopic Distance Maps, Hierarchical Clustering, and Heatmaps.
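
As a rough illustration of the clustering feature, the sketch below shows one way BERTopic can be configured to use agglomerative clustering with a cosine metric on `all-MiniLM-L6-v2` embeddings. It is not taken from this project's code: the function name `build_topic_model`, the `distance_threshold` value, the average linkage, and the choice to skip dimensionality reduction are all assumptions made for the example.

```python
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"  # produces 384-dimensional sentence embeddings


def build_topic_model(distance_threshold=0.5):
    """Assemble a BERTopic model that clusters embeddings by cosine distance.

    Hypothetical helper: the threshold and linkage are illustrative values,
    not settings confirmed by this project.
    """
    # Imported lazily so this module can be loaded before the heavy
    # dependencies (bertopic, sentence-transformers, scikit-learn) are installed.
    from bertopic import BERTopic
    from bertopic.dimensionality import BaseDimensionalityReduction
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import AgglomerativeClustering

    # n_clusters=None lets the distance threshold decide how many clusters form.
    # The `metric` argument requires scikit-learn 1.2 or newer.
    clusterer = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average",  # ward linkage does not support the cosine metric
    )

    return BERTopic(
        embedding_model=SentenceTransformer(EMBEDDING_MODEL_NAME),
        # Assumption: skip UMAP and cluster the raw embeddings directly.
        umap_model=BaseDimensionalityReduction(),
        # BERTopic accepts a scikit-learn clusterer in place of HDBSCAN.
        hdbscan_model=clusterer,
    )
```

With the dependencies installed, `topics, _ = build_topic_model().fit_transform(sentences)` would assign a topic id to each input sentence.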
---
## 🛠️ Technology Stack
- **Framework**: [LangGraph](https://github.com/langchain-ai/langgraph) (Agentic logic & state management)
- **Engine**: [BERTopic](https://github.com/MaartenGr/BERTopic) (Topic Modeling pipeline)
- **LLM**: [Mistral AI](https://mistral.ai/) (`mistral-small-latest`)
- **Embeddings**: `sentence-transformers/all-MiniLM-L6-v2`
- **UI**: [Gradio 5.x](https://gradio.app/)
- **Data**: Pandas, NumPy, Scikit-Learn
---
## 📋 Methodology
The agent follows the **Braun & Clarke (2006)** six-phase thematic analysis framework:
1. **Familiarization**: Loading and preprocessing Scopus CSV metadata.
2. **Initial Coding**: Sentence-level clustering to identify "semantic atoms."
3. **Searching for Themes**: Aggregating clusters into broader research themes.
4. **Reviewing Themes**: Researcher validation via the Review Table.
5. **Defining and Naming**: Refined LLM labeling based on centroid-nearest evidence.
6. **Producing the Report**: Exporting narrative sections and comparison matrices.
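
Phase 5 refers to "centroid-nearest evidence". A minimal sketch of that idea, assuming it means selecting the sentences whose embeddings lie closest (by cosine similarity) to a cluster's mean vector, could look like the following. The function names and the toy 2-D embeddings are hypothetical; real embeddings would be 384-dimensional.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def nearest_to_centroid(sentences, embeddings, k=3):
    """Return the k sentences most similar to the cluster centroid."""
    center = centroid(embeddings)
    ranked = sorted(
        zip(sentences, embeddings),
        key=lambda pair: cosine_similarity(pair[1], center),
        reverse=True,
    )
    return [sentence for sentence, _ in ranked[:k]]


# Toy cluster: two closely related sentences and one outlier.
cluster_sentences = [
    "Carbon pricing reduces emissions.",
    "Emission taxes lower carbon output.",
    "Survey response rates were low.",
]
cluster_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

evidence = nearest_to_centroid(cluster_sentences, cluster_embeddings, k=2)
```

The selected sentences could then be passed to the LLM as the evidence for naming the topic, which keeps the prompt short while staying representative of the cluster.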
---
## 💻 Setup & Installation
### Prerequisites
- Python 3.10+
- Mistral AI API Key
### Installation
1. **Clone the repository**:
```bash
git clone https://github.com/your-repo/topic-modelling-agent.git
cd topic-modelling-agent
```
2. **Install dependencies**:
```bash
pip install -r requirements.txt
```
3. **Configure environment**:
Create a `.env` file in the root directory:
```env
MISTRAL_API_KEY=your_api_key_here
```
4. **Run the application**:
```bash
python app.py
```
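
To check the configuration from step 3 before launching the app, a small standard-library script along these lines could be used. It is a sketch only: `load_env_file` and `get_mistral_api_key` are hypothetical helpers, and the application itself may load the key differently (for example through a dotenv library).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file, skipping blanks and comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_mistral_api_key(path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("MISTRAL_API_KEY") or load_env_file(path).get("MISTRAL_API_KEY")
    if not key:
        raise RuntimeError("MISTRAL_API_KEY is not set; add it to .env or export it.")
    return key
```

Calling `get_mistral_api_key()` from the repository root raises a clear error when the key is missing, instead of failing later on the first LLM call.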
---
## 📖 Usage
1. **Upload Data**: Drag and drop a Scopus CSV export.
2. **Initialize**: Type `Analyze my CSV` or `run abstract only` in the chat.
3. **Iterate**: Use the chat to refine topics (e.g., `group topics 5 and 10 into "Sustainability"`).
4. **Review**: Use the **Review Table** tab to approve or rename topics.
5. **Export**: Download the generated Narrative and Comparison CSV from the **Download** tab.
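
For orientation, the sketch below shows how a Scopus CSV might be turned into sentence-level input, including an "abstract only" mode like the one named in step 2. It assumes the export contains `Title` and `Abstract` columns and uses a deliberately naive sentence splitter; the function names are hypothetical and the app's actual preprocessing may differ.

```python
import csv
import io
import re


def load_documents(csv_text, abstract_only=False):
    """Build one text document per CSV row from the Title and Abstract columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row in reader:
        title = (row.get("Title") or "").strip()
        abstract = (row.get("Abstract") or "").strip()
        if abstract_only:
            text = abstract
        else:
            # Join whichever parts are present, title first.
            text = ". ".join(part for part in (title, abstract) if part)
        if text:
            documents.append(text)
    return documents


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


# Tiny in-memory stand-in for a Scopus export (assumed column names).
sample_csv = (
    "Title,Abstract\n"
    "Green supply chains,We study emissions. Results vary by sector.\n"
)

docs = load_documents(sample_csv, abstract_only=True)
sentences = [s for doc in docs for s in split_sentences(doc)]
```

For a real export, the file contents can be read with `open(path, encoding="utf-8-sig").read()` and passed in as `csv_text`; the `utf-8-sig` codec strips a byte-order mark if the file starts with one.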
---
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.