Daksh C Jain

Fix invalid colorTo metadata in README.md

fe59a4d about 1 month ago

metadata

title: Topic Modelling Agentic AI
emoji: 🔬
colorFrom: indigo
colorTo: purple
sdk: gradio
app_file: app.py
pinned: false

🔬 Topic Modelling Agentic AI

An agent-driven platform for automated Reflexive Thematic Analysis (Braun & Clarke, 2006). Built with LangGraph, BERTopic, and Mistral AI, the agent automates the discovery, labeling, and synthesis of research topics from large academic datasets (e.g., Scopus CSV exports).

🚀 Overview

This project implements a "Golden Thread" pipeline for qualitative research. Rather than extracting keywords, it uses sentence-level embeddings and LLM-based labeling to identify themes in context.

Key Features

Agentic Workflow: Powered by LangGraph, the agent autonomously decides when to load data, run clustering, or call the LLM for labeling.
Precision Clustering: Uses BERTopic with agglomerative clustering (cosine similarity) on 384-dimensional sentence embeddings (all-MiniLM-L6-v2).
Human-in-the-Loop: An interactive Gradio UI allows researchers to review, rename, or reject agent-generated topics before final synthesis.
Automated Synthesis: Generates a 500-word research narrative and maps themes to established taxonomies (e.g., PAJAIS).
Rich Visualizations: Interactive Plotly charts including Intertopic Distance Maps, Hierarchical Clustering, and Heatmaps.
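
The setup described under Precision Clustering might look roughly like the sketch below. It is not taken from this repository: the helper name `build_topic_model`, the `distance_threshold` value, and the average linkage are assumptions for illustration. Skipping BERTopic's default UMAP step, so that clustering runs directly on the 384-dimensional embeddings, is also an assumption about how the project is configured.

```python
def build_topic_model(distance_threshold: float = 0.7):
    """Return a BERTopic model that clusters with cosine-distance agglomeration.

    Hypothetical helper; the threshold and linkage are illustrative assumptions.
    """
    # Imported inside the function so the heavy dependencies load only when needed.
    from bertopic import BERTopic
    from bertopic.dimensionality import BaseDimensionalityReduction
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import AgglomerativeClustering

    # all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    # n_clusters=None lets distance_threshold decide how many clusters form.
    # `metric` requires scikit-learn >= 1.2 (older releases call it `affinity`).
    clusterer = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average",
    )

    # BaseDimensionalityReduction is a no-op stand-in for UMAP, so the
    # clusterer sees the raw embeddings (assumed, not confirmed by the README).
    return BERTopic(
        embedding_model=embedder,
        umap_model=BaseDimensionalityReduction(),
        hdbscan_model=clusterer,  # accepts any scikit-learn style clusterer
    )
```

With the dependencies installed, `build_topic_model().fit_transform(sentences)` returns the topic assignment for each sentence along with the probabilities BERTopic reports for the chosen clusterer.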

🛠️ Technology Stack

Framework: LangGraph (Agentic logic & state management)
Engine: BERTopic (Topic Modeling pipeline)
LLM: Mistral AI (mistral-small-latest)
Embeddings: sentence-transformers/all-MiniLM-L6-v2
UI: Gradio 5.x
Data: Pandas, NumPy, Scikit-Learn

📋 Methodology

The agent follows the Braun & Clarke (2006) six-phase thematic analysis framework:

Familiarization: Loading and preprocessing Scopus CSV metadata.
Initial Coding: Sentence-level clustering to identify "semantic atoms."
Searching for Themes: Aggregating clusters into broader research themes.
Reviewing Themes: Researcher validation via the Review Table.
Defining and Naming: Refined LLM labeling based on centroid-nearest evidence.
Producing the Report: Exporting narrative sections and comparison matrices.
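
The "centroid-nearest evidence" used in the Defining and Naming phase can be illustrated with a small standard-library sketch. The function names and the toy 2-d vectors are hypothetical; the project's actual selection logic is not shown in this README and may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_to_centroid(sentences, embeddings, k=3):
    """Return the k sentences whose embeddings lie closest to the cluster centroid."""
    if not embeddings:
        return []
    dim = len(embeddings[0])
    # Centroid = component-wise mean of the cluster's embeddings.
    centroid = [sum(vec[i] for vec in embeddings) / len(embeddings) for i in range(dim)]
    ranked = sorted(
        zip(sentences, embeddings),
        key=lambda pair: cosine_similarity(pair[1], centroid),
        reverse=True,
    )
    return [sentence for sentence, _ in ranked[:k]]


# Toy 2-d vectors stand in for real 384-d sentence embeddings.
cluster_sentences = ["solar adoption", "wind policy", "unrelated outlier"]
cluster_embeddings = [[1.0, 0.1], [0.9, 0.2], [-0.5, 1.0]]
evidence = nearest_to_centroid(cluster_sentences, cluster_embeddings, k=2)
```

Passing only the most central sentences to the LLM keeps the labeling prompt short and reduces the influence of outliers on the topic name.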

💻 Setup & Installation

Prerequisites

Python 3.10+
Mistral AI API Key

Installation

Clone the repository:

```
git clone https://github.com/your-repo/topic-modelling-agent.git
cd topic-modelling-agent
```

Install dependencies:
```
pip install -r requirements.txt
```
Configure environment: Create a .env file in the root directory:
```
MISTRAL_API_KEY=your_api_key_here
```
Run the application:
```
python app.py
```
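
A missing or placeholder key is easier to diagnose if it is checked at startup. The sketch below is one way to do that with the standard library; `get_mistral_api_key` is a hypothetical helper, not a function from this repository. It reads the process environment only, so it assumes the `.env` file has already been loaded (for example by a dotenv loader) or that the variable was exported in the shell.

```python
import os


def get_mistral_api_key(env=None) -> str:
    """Return MISTRAL_API_KEY, or raise if it is missing or still the placeholder."""
    # `env` is injectable for testing; by default the process environment is used.
    env = os.environ if env is None else env
    key = env.get("MISTRAL_API_KEY", "").strip()
    if not key or key == "your_api_key_here":
        raise RuntimeError(
            "MISTRAL_API_KEY is not set; add it to .env or the environment."
        )
    return key
```

Calling the helper once before building the agent turns a confusing authentication failure later in the run into an immediate, readable error.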

📖 Usage

Upload Data: Drag and drop a Scopus CSV export.
Initialize: Type "Analyze my CSV" or "run abstract only" in the chat.
Iterate: Use the chat to refine topics (e.g., group topics 5 and 10 into "Sustainability").
Review: Use the Review Table tab to approve or rename topics.
Export: Download the generated Narrative and Comparison CSV from the Download tab.
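
Before uploading, it can help to confirm that the export contains the text columns the analysis needs. The sketch below assumes those columns are named `Title` and `Abstract`, as in a typical Scopus export; this README does not state which columns the agent reads, so treat both the names and the `missing_columns` helper as hypothetical.

```python
import csv
import io

# Assumed column names; adjust to match the columns the agent actually reads.
REQUIRED_COLUMNS = ("Title", "Abstract")


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []  # fieldnames is None for an empty file
    return [name for name in required if name not in header]


# In-memory sample standing in for a real export file.
sample = io.StringIO('Title,Year,Abstract\n"Topic models in IS",2021,"We review topic models."\n')
problems = missing_columns(sample)
```

For a file on disk, opening it with `open(path, newline="", encoding="utf-8-sig")` strips a leading byte-order mark, if the export has one, so the first column name still matches.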

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.