---
title: BERTopic Agentic Topic Modelling
emoji: 🧠
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
---

# BERTopic Agentic Topic Modelling
### *Computational Thematic Analysis powered by Braun & Clarke (2006)*

---

## Overview

**BERTopic Agentic Topic Modelling** is a research tool that automates and supports **Thematic Analysis** of academic literature. It combines **BERTopic**'s transformer-based clustering with a **LangGraph-driven agentic workflow** to guide researchers through the six-phase framework of Braun & Clarke (2006).

It doesn't just cluster text; it *reasons* about it. An **"AI Council"** of two Large Language Models (Mistral Large and Llama 3 served via Groq) debates each topic label and reaches a consensus, so every label comes with recorded reasoning that the researcher can review before publication.

---

## Theoretical Foundation: Braun & Clarke (2006)

This tool is mapped to the six phases of thematic analysis as defined in the seminal work:

1. **Familiarisation with data**: Automatic cleaning, boilerplate removal, and dataset profiling.
2. **Generating initial codes**: BERTopic discovery and AI-assisted initial labeling.
3. **Searching for themes**: LLM-driven consolidation of topics into overarching themes.
4. **Reviewing potential themes**: Saturation checks and coverage analysis.
5. **Defining and naming themes**: Generation of academic definitions and core narratives.
6. **Producing the report**: Narrative writing (Section 7 draft) and PAJAIS taxonomy mapping.

---

## Key Features

- **Agentic Workflow**: A LangGraph agent manages the entire pipeline, maintaining memory and enforcing a step-by-step scientific process.
- **AI Council**: Real-time debates between **Mistral-Large** and **Llama-3 (Groq)** to determine the most accurate thematic labels.
- **Dynamic Visualizations**: 8+ interactive Plotly charts (Intertopic maps, Frequency bars, Heatmaps, Treemaps, and DBSCAN scatter plots).
- **Multi-Model Analysis**: Run separate analyses on **Abstracts** vs. **Titles** and generate a side-by-side convergence CSV.
- **Density Refinement**: Optional **DBSCAN** clustering to complement traditional hierarchical methods and handle noise points.
- **PAJAIS Taxonomy Mapping**: Automated gap analysis by mapping themes to the standard 25 PAJAIS Information Systems categories.
- **One-Click Export**: Download structured JSON, side-by-side CSVs, PNG charts, and a 500-word academic narrative report.

---

## Architecture

```mermaid
graph TD
    A[Scopus CSV Upload] --> B{Agentic Workflow}
    B -->|Phase 1| C[Data Loading & Cleaning]
    C -->|Phase 2| D[BERTopic / DBSCAN Discovery]
    D --> E[AI Council Labeling]
    E -->|Phase 3| F[Theme Consolidation]
    F -->|Phase 4| G[Saturation Check]
    G -->|Phase 5| H[Definition & Naming]
    H -->|Phase 5.5| I[PAJAIS Taxonomy Mapping]
    I -->|Phase 6| J[Report Generation]

    subgraph "AI Council"
        E1[Mistral-Large] <--> E2[Groq Llama-3]
    end

    subgraph "Outputs"
        J --> K[narrative.txt]
        J --> L[comparison.csv]
        J --> M[Interactive Charts]
    end
```

---

## App Navigation & Expected UI

The interface is divided into three logical zones:

### 1. Control Center (Top & Left)

- **Phase Progress Bar**: A visual indicator of your progress through Braun & Clarke's 6 phases.
- **Data Input (Left)**: The upload zone for your Scopus CSV. Once uploaded, Phase 1 triggers automatically.

### 2. The Agent Laboratory (Center)

- **Chatbot Interface**: Your main point of interaction. The agent will ask questions, provide stats, and guide you. You can type commands like "run abstract" or "Continue".
- **AI Council Feedback**: Every time a label is generated, look for the reasoning block. It shows the consensus score between models.

### 3. Results Dashboard (Bottom Tabs)

- **Review Table**: The heart of the app. This is where you approve, rename, and refine the AI's findings. You MUST click **"Submit Review"** to move past STOP GATES.
- **Charts Tab**: Switch between **Intertopic Map**, **Frequency Bars**, **Hierarchy (Treemap)**, and **Similarity Heatmap**.
- **AI Council Tab**: A dedicated view showing the full transcript of debates between Mistral and Groq.
- **Download Tab**: Your final repository. All files are generated in real time and appear here for one-click downloading.

### Expected Output Preview

- **In Chat**: Summary tables, saturation percentages (e.g., "92.4% Coverage"), and phase completion checkmarks.
- **In Files**:
  - `narrative.txt`: Academic prose with structured headings.
  - `comparison.csv`: Columns for `Abstract Theme`, `Title Theme`, and `Convergence` (marked with a check mark).
  - `taxonomy_map.json`: A mapping showing each theme's link to the PAJAIS framework and its **Novelty score**.

---

### 1. Prerequisites

- Python 3.9+
- API keys for **Mistral AI** and **Groq** (optional but recommended for the Council feature).

### 2. Installation

Clone the repository and install the dependencies:

```bash
# Clone the repo
git clone https://github.com/ShivamKadam63s/BERT_Topic_Modelling.git
cd BERT_Topic_Modelling

# Install dependencies
pip install -r requirements.txt
```

### 3. Environment Setup

Create a `.env` file or set your API keys in your terminal. The example uses PowerShell syntax; on macOS or Linux, use `export MISTRAL_API_KEY="your_mistral_key"` and `export GROQ_API_KEY="your_groq_key"` instead:

```powershell
$env:MISTRAL_API_KEY="your_mistral_key"
$env:GROQ_API_KEY="your_groq_key"
```

### 4. Running the App

Start the Gradio interface:

```bash
python app.py
```

Open your browser at `http://localhost:7860`.

---

## User Guide: Phase-by-Phase Walkthrough

### Step 1: Data Input

Upload your **Scopus CSV** file. The agent will immediately scan the file, remove boilerplate text (copyright notices, DOIs, etc.), and provide a dataset profile including paper counts and year ranges.

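
As a rough illustration of the cleaning and profiling that Step 1 describes, the sketch below strips DOIs and trailing copyright notices from abstracts and reports the paper count and year range. It is not the app's actual implementation: the regex patterns, the function names `clean_abstract` and `profile`, and the column names (`Title`, `Year`, `Abstract`) are assumptions made for the example.

```python
import csv
import io
import re

# Hypothetical patterns; the app's real boilerplate rules are not documented here.
# Matches a trailing notice such as "© 2019 Elsevier Ltd. All rights reserved."
COPYRIGHT_RE = re.compile(r"(?:copyright\s*)?(?:©|\(c\))\s*\d{4}.*$", re.IGNORECASE)
# Matches bare DOIs as well as "doi:" and doi.org URL forms.
DOI_RE = re.compile(
    r"\b(?:https?://(?:dx\.)?doi\.org/|doi:\s*)?10\.\d{4,9}/\S+", re.IGNORECASE
)


def clean_abstract(text: str) -> str:
    """Remove DOIs and a trailing copyright notice, then collapse whitespace."""
    text = DOI_RE.sub(" ", text)
    text = COPYRIGHT_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def profile(csv_text: str) -> dict:
    """Return paper count, year range, and cleaned abstracts for a Scopus-style CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    years = [int(r["Year"]) for r in rows if (r.get("Year") or "").strip().isdigit()]
    abstracts = [clean_abstract(r.get("Abstract") or "") for r in rows]
    return {
        "papers": len(rows),
        "year_range": (min(years), max(years)) if years else None,
        "abstracts": abstracts,
    }


sample = (
    "Title,Year,Abstract\n"
    '"Paper A",2019,"Topic models for IS research. © 2019 Elsevier Ltd. All rights reserved."\n'
    '"Paper B",2022,"We study agents. doi:10.1000/xyz123"\n'
)
result = profile(sample)
print(result["papers"], result["year_range"])
print(result["abstracts"])
```

Because the copyright pattern removes everything from the notice to the end of the abstract, it only suits exports where the notice comes last; a production cleaner would likely need publisher-specific rules.
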
### Step 2: Discovery & Coding

- Click **"run abstract"** or **"run title"**.
- The system will generate clusters and invoke the **AI Council**.
- **Navigation**: Check the **"AI Council"** tab to see the reasoning behind each label.
- **Action**: In the **"Review Table"**, tick **Approve** for clusters you accept or provide a custom name in **Rename To**. Click **"Submit Review"**.

### Step 3: Themes & Saturation

The agent combines approved codes into 4-8 themes. It will report **Thematic Saturation** (e.g., "Themes cover 92% of the corpus").

### Step 4: Taxonomy Mapping

The tool automatically maps your themes to the **PAJAIS Taxonomy**.

- Themes marked **NOVEL** are identified as potential new research contributions not found in standard taxonomies.

### Step 5: Final Report

The agent generates a **500-word Section 7 draft**. Check the **"Download"** tab for your full suite of results.

---

## Expected Outputs

| Output File | Description |
| :--- | :--- |
| `narrative.txt` | A complete Section 7 draft following academic standards. |
| `comparison.csv` | Side-by-side comparison of Abstract and Title themes. |
| `taxonomy_map.json` | JSON mapping of themes to PAJAIS categories. |
| `chart_*.html` | Interactive Plotly visualizations for intertopic distance and hierarchy. |
| `*.png` | High-resolution static exports of all charts. |

---

## Built With

- **Gradio**: Modern UI Framework
- **LangGraph**: Agentic Multi-Model Workflows
- **BERTopic**: Advanced Topic Modeling
- **Sentence-Transformers**: `all-MiniLM-L6-v2` embeddings
- **Mistral Large**: Primary Reasoning LLM
- **Groq (Llama-3)**: Secondary Council LLM
- **Plotly**: Dynamic Data Science Charts

---

## License & Citation

If you use this tool in your research, please cite:

*Shivam Kadam, "BERTopic Agentic Topic Modelling for Systematic Literature Reviews," 2026.*

Based on: *Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2), 77-101.*

---
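
For readers wondering how the saturation figure from Step 3 of the walkthrough (e.g., "Themes cover 92% of the corpus") could be derived, here is a minimal sketch. This README does not state the app's exact formula, so the definition used here (the share of documents whose topic was assigned to a theme) is an assumption, as are the function name `theme_coverage` and the sample data. The only library fact relied on is that BERTopic labels outlier documents with topic `-1`.

```python
from collections import Counter


def theme_coverage(doc_topics, theme_of_topic):
    """Return (coverage_percent, documents_per_theme).

    doc_topics: one topic id per document; -1 marks BERTopic outliers.
    theme_of_topic: hypothetical mapping from approved topic ids to theme names.
    """
    if not doc_topics:
        return 0.0, Counter()
    # Count only documents whose topic has been assigned to a theme.
    per_theme = Counter(
        theme_of_topic[t] for t in doc_topics if t in theme_of_topic
    )
    covered = sum(per_theme.values())
    return round(100.0 * covered / len(doc_topics), 1), per_theme


# Illustrative data: topic 3 was not approved, and -1 marks outliers.
doc_topics = [0, 0, 1, 2, -1, 1, 3, 0, 2, -1]
theme_of_topic = {0: "AI governance", 1: "AI governance", 2: "Technology adoption"}

coverage, per_theme = theme_coverage(doc_topics, theme_of_topic)
print(f"Themes cover {coverage}% of the corpus")
print(dict(per_theme))
```

Under this definition, both outliers and unapproved topics lower the coverage figure, which makes the number sensitive to how strictly clusters are reviewed in the Review Table.
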
Made with ❤️ for the Research Community