Daksh C Jain

	---
	title: Topic Modelling Agentic AI
	emoji: 🔬
	colorFrom: indigo
	colorTo: purple
	sdk: gradio
	app_file: app.py
	pinned: false
	---

	# 🔬 Topic Modelling Agentic AI

	An agent-driven platform for automating Reflexive Thematic Analysis (Braun & Clarke, 2006) with Natural Language Processing. Built with LangGraph, BERTopic, and Mistral AI, the agent automates the discovery, labeling, and synthesis of research topics from large academic datasets (e.g., Scopus CSV exports).

	---

	## 🚀 Overview

	This project implements a "Golden Thread" pipeline for qualitative research. Rather than relying on keyword extraction alone, it combines sentence-level embeddings with context-aware LLM labeling to identify nuanced themes.

	### Key Features
	- Agentic Workflow: Powered by LangGraph, the agent autonomously decides when to load data, run clustering, or call the LLM for labeling.
	- Precision Clustering: Uses BERTopic with Agglomerative Clustering (cosine distance) on 384-dimensional sentence embeddings (`all-MiniLM-L6-v2`).
	- Human-in-the-Loop: An interactive Gradio UI allows researchers to review, rename, or reject agent-generated topics before final synthesis.
	- Automated Synthesis: Generates a 500-word research narrative and maps themes to established taxonomies (e.g., PAJAIS).
	- Rich Visualizations: Interactive Plotly charts including Intertopic Distance Maps, Hierarchical Clustering, and Heatmaps.
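
To make the clustering feature concrete, here is a minimal, dependency-free sketch of agglomerative clustering under cosine distance. It is an illustration only: the project itself relies on BERTopic for this step, and the average-linkage rule, the function names, and the toy 3-dimensional vectors are assumptions made for the example.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def agglomerative_cluster(vectors, n_clusters):
    """Average-linkage agglomerative clustering under cosine distance.

    Returns a sorted list of clusters, each a sorted list of input indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Toy 3-d "embeddings" (assumption); real sentence vectors have 384 dimensions.
embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 1.0],
]
clusters = agglomerative_cluster(embeddings, n_clusters=3)
print(clusters)
```

Cosine distance ignores vector length and compares direction only, which is why it is a common choice for sentence embeddings. The naive loop above recomputes every linkage on each merge, so it is suitable for explanation, not for large corpora.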

	---

	## 🛠️ Technology Stack

	- Framework: [LangGraph](https://github.com/langchain-ai/langgraph) (Agentic logic & state management)
	- Engine: [BERTopic](https://github.com/MaartenGr/BERTopic) (Topic Modeling pipeline)
	- LLM: [Mistral AI](https://mistral.ai/) (`mistral-small-latest`)
	- Embeddings: `sentence-transformers/all-MiniLM-L6-v2`
	- UI: [Gradio 5.x](https://gradio.app/)
	- Data: Pandas, NumPy, Scikit-Learn

	---

	## 📋 Methodology

	The agent follows the Braun & Clarke (2006) six-phase thematic analysis framework:

	1. Familiarization: Loading and preprocessing Scopus CSV metadata.
	2. Initial Coding: Sentence-level clustering to identify "semantic atoms."
	3. Searching for Themes: Aggregating clusters into broader research themes.
	4. Reviewing Themes: Researcher validation via the Review Table.
	5. Defining and Naming: Refined LLM labeling based on centroid-nearest evidence.
	6. Producing the Report: Exporting narrative sections and comparison matrices.
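
Phase 5 refers to "centroid-nearest evidence". One plausible reading, sketched below, is that the sentences closest to a cluster's mean embedding are passed to the LLM as representative evidence. The function names, the value of `k`, and the toy 2-dimensional vectors are hypothetical; the README does not specify how the project selects evidence.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_to_centroid(sentences, vectors, k=2):
    """Return the k sentences whose vectors are most similar to the centroid."""
    center = centroid(vectors)
    ranked = sorted(
        zip(sentences, vectors),
        key=lambda pair: cosine_similarity(pair[1], center),
        reverse=True,
    )
    return [sentence for sentence, _ in ranked[:k]]


# Hypothetical cluster: three sentences with toy 2-d embeddings.
cluster_sentences = [
    "Topic models summarise large corpora.",
    "We survey topic modelling for literature reviews.",
    "Funding was provided by the university.",
]
cluster_vectors = [[1.0, 0.2], [0.9, 0.3], [0.2, 1.0]]

evidence = nearest_to_centroid(cluster_sentences, cluster_vectors, k=2)
print(evidence)
```

Ranking by similarity to the centroid tends to push off-topic sentences, such as the funding statement here, to the bottom of the list, so the labeling prompt sees the most typical members of the cluster first.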

	---

	## 💻 Setup & Installation

	### Prerequisites
	- Python 3.10+
	- Mistral AI API Key

	### Installation

	1. Clone the repository:
	```bash
	git clone https://github.com/your-repo/topic-modelling-agent.git
	cd topic-modelling-agent
	```

	2. Install dependencies:
	```bash
	pip install -r requirements.txt
	```

	3. Configure environment:
	Create a `.env` file in the root directory:
	```env
	MISTRAL_API_KEY=your_api_key_here
	```

	4. Run the application:
	```bash
	python app.py
	```
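
The README does not say how `app.py` reads the `.env` file from step 3. As a hedged, standard-library-only sketch, the key could be resolved as follows; `parse_env` and `get_mistral_api_key` are hypothetical helper names, and a dedicated package such as `python-dotenv` would be a common alternative.

```python
import os
from pathlib import Path


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_mistral_api_key(path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("MISTRAL_API_KEY")
    env_path = Path(path)
    if not key and env_path.is_file():
        key = parse_env(env_path.read_text(encoding="utf-8")).get("MISTRAL_API_KEY")
    if not key:
        raise RuntimeError("MISTRAL_API_KEY is not set; add it to .env or export it.")
    return key
```

Checking the process environment first lets a hosted deployment supply the key as a secret without a `.env` file in the repository.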

	---

	## 📖 Usage

	1. Upload Data: Drag and drop a Scopus CSV export.
	2. Initialize: Type `Analyze my CSV` or `run abstract only` in the chat.
	3. Iterate: Use the chat to refine topics (e.g., `group topics 5 and 10 into "Sustainability"`).
	4. Review: Use the Review Table tab to approve or rename topics.
	5. Export: Download the generated Narrative and Comparison CSV from the Download tab.
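
As a rough illustration of what happens between the upload and the clustering step, the sketch below reads a Scopus-style CSV and splits text into sentences. It assumes the export has `Title` and `Abstract` columns and that `run abstract only` means the title is skipped; both points are assumptions, and the regex splitter is deliberately naive.

```python
import csv
import io
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def load_sentences(csv_file, abstract_only=False):
    """Yield (row_index, sentence) pairs from a Scopus-style CSV file object."""
    # Column names are assumed; adjust them to match the actual export.
    fields = ["Abstract"] if abstract_only else ["Title", "Abstract"]
    for row_index, row in enumerate(csv.DictReader(csv_file)):
        for field in fields:
            for sentence in split_sentences(row.get(field) or ""):
                yield row_index, sentence


# Hypothetical one-row export used only for demonstration.
sample_text = (
    "Title,Year,Abstract\n"
    '"Topic models in IS research",2021,"We review topic models. We map themes."\n'
)
sentences = list(load_sentences(io.StringIO(sample_text), abstract_only=True))
print(sentences)
```

Keeping the row index next to each sentence makes it possible to trace a clustered sentence back to the paper it came from.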

	---

	## 📄 License

	This project is licensed under the MIT License; see the LICENSE file for details.