---
title: Topic Modelling Agentic AI
emoji: 🔬
colorFrom: indigo
colorTo: purple
sdk: gradio
app_file: app.py
pinned: false
---

# 🔬 Topic Modelling Agentic AI

A professional, agent-driven platform for automated **Reflexive Thematic Analysis** (Braun & Clarke, 2006) using state-of-the-art Natural Language Processing.

Built with LangGraph, BERTopic, and Mistral AI, this agent automates the discovery, labeling, and synthesis of research topics from large-scale academic datasets (e.g., Scopus CSV exports).

---

## 🚀 Overview

This project implements a sophisticated "Golden Thread" pipeline for qualitative research. It moves beyond traditional keyword extraction by using sentence-level embeddings and LLM-powered context awareness to identify nuanced themes.

### Key Features

- **Agentic Workflow**: Powered by **LangGraph**, the agent autonomously decides when to load data, run clustering, or call the LLM for labeling.
- **Precision Clustering**: Uses **BERTopic** with Agglomerative Clustering (Cosine similarity) on 384d sentence embeddings (`all-MiniLM-L6-v2`).
- **Human-in-the-Loop**: An interactive Gradio UI allows researchers to review, rename, or reject agent-generated topics before final synthesis.
- **Automated Synthesis**: Generates a 500-word research narrative and maps themes to established taxonomies (e.g., PAJAIS).
- **Rich Visualizations**: Interactive Plotly charts including Intertopic Distance Maps, Hierarchical Clustering, and Heatmaps.

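
### Clustering Sketch

To make the **Precision Clustering** feature concrete, here is a minimal sketch of agglomerative clustering over cosine distance, written in plain Python so it runs without dependencies. It is not the project's code: the app delegates this step to BERTopic, and the function names, the average-linkage choice, the toy 3-d vectors, and the `distance_threshold` of 0.3 are all assumptions made for illustration. Real inputs would be 384-dimensional `all-MiniLM-L6-v2` embeddings.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative_cosine(vectors, distance_threshold):
    """Average-linkage agglomerative clustering (illustrative, hypothetical helper).

    Starts with one cluster per vector and repeatedly merges the closest
    pair until the closest pair is farther apart than distance_threshold.
    Returns one integer label per input vector.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Mean pairwise cosine distance between members of two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        (i, j), dist = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > distance_threshold:
            break
        # i < j is guaranteed by combinations, so deleting j keeps i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Toy stand-ins for sentence embeddings: two near-parallel pairs.
toy_embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
]
labels = agglomerative_cosine(toy_embeddings, distance_threshold=0.3)
print(labels)
```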
---

## 🛠️ Technology Stack

- **Framework**: [LangGraph](https://github.com/langchain-ai/langgraph) (Agentic logic & state management)
- **Engine**: [BERTopic](https://github.com/MaartenGr/BERTopic) (Topic Modeling pipeline)
- **LLM**: [Mistral AI](https://mistral.ai/) (`mistral-small-latest`)
- **Embeddings**: `sentence-transformers/all-MiniLM-L6-v2`
- **UI**: [Gradio 5.x](https://gradio.app/)
- **Data**: Pandas, NumPy, Scikit-Learn

---

## 📋 Methodology

The agent follows the **Braun & Clarke (2006)** six-phase thematic analysis framework:

1. **Familiarization**: Loading and preprocessing Scopus CSV metadata.
2. **Initial Coding**: Sentence-level clustering to identify "semantic atoms."
3. **Searching for Themes**: Aggregating clusters into broader research themes.
4. **Reviewing Themes**: Researcher validation via the Review Table.
5. **Defining and Naming**: Refined LLM labeling based on centroid-nearest evidence.
6. **Producing the Report**: Exporting narrative sections and comparison matrices.

---

## 💻 Setup & Installation

### Prerequisites

- Python 3.10+
- Mistral AI API Key

### Installation

1. **Clone the repository**:

   ```bash
   git clone https://github.com/your-repo/topic-modelling-agent.git
   cd topic-modelling-agent
   ```

2. **Install dependencies**:

   ```bash
   pip install -r requirements.txt
   ```

3. **Configure environment**: Create a `.env` file in the root directory:

   ```env
   MISTRAL_API_KEY=your_api_key_here
   ```

4. **Run the application**:

   ```bash
   python app.py
   ```

---

## 📖 Usage

1. **Upload Data**: Drag and drop a Scopus CSV export.
2. **Initialize**: Type `Analyze my CSV` or `run abstract only` in the chat.
3. **Iterate**: Use the chat to refine topics (e.g., `group topics 5 and 10 into "Sustainability"`).
4. **Review**: Use the **Review Table** tab to approve or rename topics.
5. **Export**: Download the generated Narrative and Comparison CSV from the **Download** tab.

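
### Pre-upload Check

Before uploading, it can help to confirm that an export parses and to see roughly what the abstract-only mode and sentence-level coding operate on. The sketch below is an illustration under stated assumptions, not the app's loader: the `Title` and `Abstract` column names, the `load_documents` and `split_sentences` helpers, and the naive regex sentence splitter are all hypothetical. When reading a real file, opening it with `encoding="utf-8-sig"` handles a byte-order mark if the export carries one.

```python
import csv
import io
import re

# Assumed column names; check them against the header of your own export.
TITLE_COL = "Title"
ABSTRACT_COL = "Abstract"


def load_documents(csv_text, abstract_only=False):
    """Return one text document per CSV row, skipping rows with no usable text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {TITLE_COL, ABSTRACT_COL} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")

    docs = []
    for row in reader:
        title = (row[TITLE_COL] or "").strip()
        abstract = (row[ABSTRACT_COL] or "").strip()
        parts = [abstract] if abstract_only else [title, abstract]
        text = ". ".join(p for p in parts if p)
        if text:
            docs.append(text)
    return docs


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


# Tiny made-up sample standing in for an uploaded export.
sample = (
    "Authors,Title,Year,Abstract\n"
    '"Doe J.",Topic models in IS research,2021,"We review topic models. We find gaps."\n'
    '"Roe A.",Untitled record,2020,\n'
)

abstract_docs = load_documents(sample, abstract_only=True)
all_docs = load_documents(sample)
sentences = split_sentences(abstract_docs[0])
print(len(all_docs), "documents,", len(sentences), "sentences in the first abstract")
```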
---

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.