| title: Topic Modelling Agentic AI | |
| emoji: 🔬 | |
| colorFrom: indigo | |
| colorTo: purple | |
| sdk: gradio | |
| app_file: app.py | |
| pinned: false | |
| # 🔬 Topic Modelling Agentic AI | |
| An agent-driven platform for automated **Reflexive Thematic Analysis** (Braun & Clarke, 2006) of research literature. Built with LangGraph, BERTopic, and Mistral AI, it automates the discovery, labeling, and synthesis of research topics from large academic datasets (e.g., Scopus CSV exports). | |
| --- | |
| ## Overview | |
| This project implements a "Golden Thread" pipeline for qualitative research. It goes beyond traditional keyword extraction by combining sentence-level embeddings with LLM-based, context-aware labeling to identify nuanced themes. | |
| ### Key Features | |
| - **Agentic Workflow**: Powered by **LangGraph**, the agent autonomously decides when to load data, run clustering, or call the LLM for labeling. | |
| - **Precision Clustering**: Uses **BERTopic** with Agglomerative Clustering (cosine distance) on 384-dimensional sentence embeddings (`all-MiniLM-L6-v2`). | |
| - **Human-in-the-Loop**: An interactive Gradio UI allows researchers to review, rename, or reject agent-generated topics before final synthesis. | |
| - **Automated Synthesis**: Generates a 500-word research narrative and maps themes to established taxonomies (e.g., PAJAIS). | |
| - **Rich Visualizations**: Interactive Plotly charts including Intertopic Distance Maps, Hierarchical Clustering, and Heatmaps. | |
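To make the clustering step concrete, here is a minimal, dependency-free sketch of agglomerative clustering over cosine distance. It is not the project's code: the real pipeline presumably hands a scikit-learn clusterer to BERTopic (which accepts one through its `hdbscan_model` argument). The average linkage, the `distance_threshold` value, and the 3-dimensional toy vectors standing in for 384-dimensional embeddings are all assumptions made for illustration.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def average_linkage(c1, c2, vectors):
    """Mean pairwise cosine distance between two clusters of indices."""
    total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerative_cluster(vectors, distance_threshold):
    """Merge the closest pair of clusters until no pair is within the threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > distance_threshold:
            break  # remaining clusters are too far apart to merge
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Toy stand-ins for sentence embeddings (hypothetical values).
toy_embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
]
clusters = agglomerative_cluster(toy_embeddings, distance_threshold=0.3)
print(clusters)
```

A distance threshold, rather than a fixed number of clusters, lets the number of topics follow from the data. Whether the project uses a threshold or a fixed cluster count is not stated here.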
| --- | |
| ## 🛠️ Technology Stack | |
| - **Framework**: [LangGraph](https://github.com/langchain-ai/langgraph) (Agentic logic & state management) | |
| - **Engine**: [BERTopic](https://github.com/MaartenGr/BERTopic) (Topic Modeling pipeline) | |
| - **LLM**: [Mistral AI](https://mistral.ai/) (`mistral-small-latest`) | |
| - **Embeddings**: `sentence-transformers/all-MiniLM-L6-v2` | |
| - **UI**: [Gradio 5.x](https://gradio.app/) | |
| - **Data**: Pandas, NumPy, Scikit-Learn | |
| --- | |
| ## Methodology | |
| The agent follows the **Braun & Clarke (2006)** six-phase thematic analysis framework: | |
| 1. **Familiarization**: Loading and preprocessing Scopus CSV metadata. | |
| 2. **Initial Coding**: Sentence-level clustering to identify "semantic atoms." | |
| 3. **Searching for Themes**: Aggregating clusters into broader research themes. | |
| 4. **Reviewing Themes**: Researcher validation via the Review Table. | |
| 5. **Defining and Naming**: Refined LLM labeling based on centroid-nearest evidence. | |
| 6. **Producing the Report**: Exporting narrative sections and comparison matrices. | |
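Phase 5 mentions "centroid-nearest evidence". One plausible reading is that the sentences closest to a cluster's centroid are passed to the LLM as representative examples when it names the topic. The sketch below illustrates that idea under that assumption. The function names, the `top_k` parameter, and the toy sentences and 2-dimensional vectors are hypothetical and not taken from the project.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_to_centroid(sentences, vectors, top_k=2):
    """Return the top_k sentences whose vectors are most similar to the centroid."""
    center = centroid(vectors)
    ranked = sorted(
        zip(sentences, vectors),
        key=lambda pair: cosine_similarity(pair[1], center),
        reverse=True,
    )
    return [sentence for sentence, _ in ranked[:top_k]]

# Hypothetical members of one cluster with made-up 2-d embeddings.
sentences = [
    "Example sentence A from the cluster.",
    "Example sentence B from the cluster.",
    "Example sentence C from the cluster.",
]
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]

evidence = nearest_to_centroid(sentences, vectors, top_k=1)
print(evidence)
```

Limiting the prompt to a few central sentences keeps it short and reduces the influence of outliers at the edge of a cluster.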
| --- | |
| ## 💻 Setup & Installation | |
| ### Prerequisites | |
| - Python 3.10+ | |
| - Mistral AI API Key | |
| ### Installation | |
| 1. **Clone the repository**: | |
| ```bash | |
| git clone https://github.com/your-repo/topic-modelling-agent.git | |
| cd topic-modelling-agent | |
| ``` | |
| 2. **Install dependencies**: | |
| ```bash | |
| pip install -r requirements.txt | |
| ``` | |
| 3. **Configure environment**: | |
| Create a `.env` file in the root directory: | |
| ```env | |
| MISTRAL_API_KEY=your_api_key_here | |
| ``` | |
| 4. **Run the application**: | |
| ```bash | |
| python app.py | |
| ``` | |
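How `app.py` reads the key is not shown in this README. Many projects use a helper library such as `python-dotenv` for this. As a standard-library-only sketch of the idea, the snippet below parses `KEY=VALUE` lines and fails early if `MISTRAL_API_KEY` is missing. The function names are hypothetical.

```python
import os

def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def get_mistral_api_key(env_text=""):
    """Prefer the process environment, then fall back to .env contents."""
    key = os.environ.get("MISTRAL_API_KEY") or parse_env_text(env_text).get("MISTRAL_API_KEY")
    if not key:
        raise RuntimeError("MISTRAL_API_KEY is not set; add it to .env or the environment")
    return key

# Same shape as the .env file described in step 3.
sample = "# local secrets\nMISTRAL_API_KEY=your_api_key_here\n"
parsed = parse_env_text(sample)
print(sorted(parsed))
```

Failing at startup with a clear message is usually easier to debug than an authentication error raised on the first LLM call.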
| --- | |
| ## Usage | |
| 1. **Upload Data**: Drag and drop a Scopus CSV export. | |
| 2. **Initialize**: Type `Analyze my CSV` or `run abstract only` in the chat. | |
| 3. **Iterate**: Use the chat to refine topics (e.g., `group topics 5 and 10 into "Sustainability"`). | |
| 4. **Review**: Use the **Review Table** tab to approve or rename topics. | |
| 5. **Export**: Download the generated Narrative and Comparison CSV from the **Download** tab. | |
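In the app, chat messages are presumably interpreted by the LangGraph agent and the LLM rather than by fixed patterns. Purely to illustrate the structured intent behind a command such as the grouping example in step 3, here is a small regex-based sketch. The pattern, the function name, and the returned dictionary shape are assumptions.

```python
import re

# Matches e.g.: group topics 5 and 10 into "Sustainability"
GROUP_PATTERN = re.compile(
    r'group topics\s+(?P<ids>\d+(?:\s*(?:,|and)\s*\d+)*)\s+into\s+"(?P<label>[^"]+)"',
    re.IGNORECASE,
)

def parse_group_command(message):
    """Extract topic ids and the new label, or return None if the message does not match."""
    match = GROUP_PATTERN.search(message)
    if match is None:
        return None
    topic_ids = [int(number) for number in re.findall(r"\d+", match.group("ids"))]
    return {"topic_ids": topic_ids, "label": match.group("label")}

command = parse_group_command('group topics 5 and 10 into "Sustainability"')
print(command)
```

A message that does not follow this shape, such as `Analyze my CSV`, returns `None` and would be left to other handling.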
| --- | |
| ## License | |
| This project is licensed under the MIT License - see the LICENSE file for details. | |