metadata
title: BERTopic Agentic Topic Modelling
emoji: 🧠
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false

🔬 BERTopic Agentic Topic Modelling

Computational Thematic Analysis following Braun & Clarke (2006)



🌟 Overview

BERTopic Agentic Topic Modelling is a research tool that automates and supports Thematic Analysis of academic literature. It combines BERTopic's transformer-based clustering with a LangGraph-driven agentic workflow and guides researchers through the six-phase framework of Braun & Clarke (2006).

It doesn't just cluster text; it reasons about it. An "AI Council" of two Large Language Models (Mistral Large and Llama-3 served via Groq) debates each topic label and reports a consensus, with the aim of producing labels that are suitable for publication once the researcher has reviewed them.


🧠 Theoretical Foundation: Braun & Clarke (2006)

This tool is strictly mapped to the six phases of thematic analysis as defined in the seminal work:

  1. Familiarisation with data: Automatic cleaning, boilerplate removal, and dataset profiling.
  2. Generating initial codes: BERTopic discovery and AI-assisted initial labeling.
  3. Searching for themes: LLM-driven consolidation of topics into overarching themes.
  4. Reviewing potential themes: Saturation checks and coverage analysis.
  5. Defining and naming themes: Generation of academic definitions and core narratives.
  6. Producing the report: Narrative writing (Section 7 draft) and PAJAIS taxonomy mapping.

✨ Key Features

  • 🤖 Agentic Workflow: A LangGraph agent manages the entire pipeline, maintaining memory and ensuring a step-by-step scientific process.
  • ⚖️ AI Council: Real-time debates between Mistral-Large and Llama-3 (Groq) to determine the most accurate thematic labels.
  • 📊 Dynamic Visualizations: 8+ interactive Plotly charts (Intertopic maps, Frequency bars, Heatmaps, Treemaps, and DBSCAN scatter plots).
  • 🛡️ Multi-Model Analysis: Run separate analyses on Abstracts vs. Titles and generate a side-by-side convergence CSV.
  • 🔍 Density Refinement: Optional DBSCAN clustering that complements hierarchical methods and labels noise points explicitly instead of forcing them into a cluster.
  • 🏷️ PAJAIS Taxonomy Mapping: Automated gap analysis by mapping themes to the standard 25 PAJAIS Information Systems categories.
  • 📥 One-Click Export: Download structured JSON, side-by-side CSVs, PNG charts, and a 500-word academic narrative report.

🛠️ Architecture

graph TD
    A[Scopus CSV Upload] --> B{Agentic Workflow}
    B -->|Phase 1| C[Data Loading & Cleaning]
    C -->|Phase 2| D[BERTopic / DBSCAN Discovery]
    D --> E[AI Council Labeling]
    E -->|Phase 3| F[Theme Consolidation]
    F -->|Phase 4| G[Saturation Check]
    G -->|Phase 5| H[Definition & Naming]
    H -->|Phase 5.5| I[PAJAIS Taxonomy Mapping]
    I -->|Phase 6| J[Report Generation]
    
    subgraph "AI Council"
    E1[Mistral-Large] <--> E2[Groq Llama-3]
    end
    
    subgraph "Outputs"
    J --> K[narrative.txt]
    J --> L[comparison.csv]
    J --> M[Interactive Charts]
    end
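
The phase order in the diagram can be sketched as a sequential pipeline over a shared state dictionary. This is a stand-in that uses only the Python standard library; it is not LangGraph and not the app's actual code. The function names, the state keys, and the toy clustering rule are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]


def load_and_clean(state: State) -> State:
    # Phase 1 (hypothetical): trim whitespace and drop empty records.
    docs = [d.strip() for d in state["raw_docs"] if d.strip()]
    return {**state, "docs": docs}


def discover_topics(state: State) -> State:
    # Phase 2 placeholder: a real run would call BERTopic here.
    # As a toy rule, documents are grouped by their lowercased first word.
    clusters: Dict[str, List[str]] = {}
    for doc in state["docs"]:
        clusters.setdefault(doc.split()[0].lower(), []).append(doc)
    return {**state, "clusters": clusters}


def write_report(state: State) -> State:
    # Phase 6 placeholder: one summary line per cluster.
    lines = [
        f"{name}: {len(members)} document(s)"
        for name, members in sorted(state["clusters"].items())
    ]
    return {**state, "report": "\n".join(lines)}


# Ordered list of (phase name, step function); intermediate phases omitted.
PIPELINE: List[Tuple[str, Callable[[State], State]]] = [
    ("Phase 1", load_and_clean),
    ("Phase 2", discover_topics),
    ("Phase 6", write_report),
]


def run_pipeline(raw_docs: List[str]) -> State:
    state: State = {"raw_docs": raw_docs, "completed": []}
    for name, step in PIPELINE:
        state = step(state)
        # Record each finished phase so progress can be displayed.
        state["completed"] = state["completed"] + [name]
    return state
```

Each step returns a new state rather than mutating the old one, which makes it easy to pause between phases for a human review.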

🖥️ App Navigation & Expected UI

The interface is divided into three logical zones for a streamlined user experience:

1. Control Center (Top & Left)

  • Phase Progress Bar: A visual indicator of your progress through Braun & Clarke’s 6 phases.
  • Data Input (Left): The upload zone for your Scopus CSV. Once uploaded, Phase 1 triggers automatically.

2. The Agent Laboratory (Center)

  • Chatbot Interface: Your main point of interaction. The agent will ask questions, provide stats, and guide you. You can type commands like "run abstract" or "Continue".
  • AI Council Feedback: Every time a label is generated, look for the reasoning block. It shows the consensus score between models.
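
This README does not say how the consensus score is computed. One simple possibility, shown here purely as an assumption, is the Jaccard overlap between the word sets of the two proposed labels. The function names are hypothetical.

```python
import re
from typing import Set


def tokenize(label: str) -> Set[str]:
    # Lowercase the label and keep alphanumeric word tokens only.
    return set(re.findall(r"[a-z0-9]+", label.lower()))


def consensus_score(label_a: str, label_b: str) -> float:
    """Jaccard similarity of the two labels' token sets, in [0, 1]."""
    a, b = tokenize(label_a), tokenize(label_b)
    if not a and not b:
        # Two empty labels are treated as full agreement.
        return 1.0
    return len(a & b) / len(a | b)


# Two of four distinct words are shared, so the score is 0.5.
print(consensus_score("Digital Health Privacy", "Health Data Privacy"))
```

A word-overlap score is cheap but ignores synonyms; an embedding-based similarity would be a natural alternative.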

3. Results Dashboard (Bottom Tabs)

  • 📋 Review Table: The "Heart" of the app. This is where you approve, rename, and refine the AI's findings. You MUST click "Submit Review" to move past a STOP GATE, a checkpoint where the agent pauses for your review.
  • 📈 Charts Tab: Switch between Intertopic Map, Frequency Bars, Hierarchy (Treemap), and Similarity Heatmap.
  • ⚖️ AI Council Tab: A dedicated view showing the full transcript of debates between Mistral and Groq.
  • 💾 Download Tab: Your final repository. All files are generated in real-time and appear here for one-click downloading.

📤 Expected Output Preview

  • In Chat: Summary tables, saturation percentages (e.g., "92.4% Coverage"), and phase completion checkmarks.
  • In Files:
    • narrative.txt: Academic prose with structured headings.
    • comparison.csv: Columns for Abstract Theme, Title Theme, and Convergence (marked with ✓).
    • taxonomy_map.json: A mapping showing each theme's link to the PAJAIS framework and its Novelty score.
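
As a rough illustration of the comparison.csv layout, the sketch below pairs the two theme lists row by row and marks a row with ✓ when both cells match, ignoring case. The column names come from the list above; the pairing rule, the matching rule, and the function name are assumptions, not the app's documented behaviour.

```python
import csv
import io
from itertools import zip_longest
from typing import List


def build_comparison_csv(abstract_themes: List[str], title_themes: List[str]) -> str:
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Abstract Theme", "Title Theme", "Convergence"])
    # Pad the shorter list with empty cells so every theme gets a row.
    for a, t in zip_longest(abstract_themes, title_themes, fillvalue=""):
        same = bool(a) and a.strip().lower() == t.strip().lower()
        writer.writerow([a, t, "✓" if same else ""])
    return buffer.getvalue()


print(build_comparison_csv(["AI Ethics", "Cloud Adoption"], ["ai ethics"]))
```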

1. Prerequisites

  • Python 3.9+
  • API Keys for Mistral AI and Groq (optional but recommended for the Council feature).

2. Installation

Clone the repository and install the dependencies:

# Clone the repo
git clone https://github.com/ShivamKadam63s/BERT_Topic_Modelling.git
cd BERT_Topic_Modelling

# Install dependencies
pip install -r requirements.txt

3. Environment Setup

Create a .env file or set your API keys in your terminal. The example below uses PowerShell syntax:

$env:MISTRAL_API_KEY="your_mistral_key"
$env:GROQ_API_KEY="your_groq_key"
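
The two lines above use PowerShell syntax. Whatever shell you use, a quick way to confirm that the keys are visible to Python is a check like the one below. The variable names are the ones shown above; the helper function is hypothetical and not part of the app.

```python
import os
from typing import List, Mapping, Optional

COUNCIL_KEYS = ("MISTRAL_API_KEY", "GROQ_API_KEY")


def missing_council_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of Council API keys that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in COUNCIL_KEYS if not env.get(key)]


missing = missing_council_keys()
if missing:
    # The keys are optional, so this is a notice rather than an error.
    print("Missing keys:", ", ".join(missing))
else:
    print("Both Council API keys are set.")
```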

4. Running the App

Start the Gradio interface:

python app.py

Open your browser at http://localhost:7860.


📖 User Guide: Phase-by-Phase Walkthrough

Step 1: Data Input

Upload your Scopus CSV file. The agent immediately scans the file, removes boilerplate text (copyright notices, DOIs, etc.), and provides a dataset profile including paper counts and year ranges.
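
The kind of boilerplate removal described here can be approximated with two regular expressions. This is a minimal sketch, not the app's cleaning code: the patterns and the function name are assumptions, and it assumes the copyright notice sits at the end of the abstract, so everything after it is dropped.

```python
import re

# A copyright marker followed by a year, plus everything after it.
COPYRIGHT_RE = re.compile(
    r"(?:©|\(c\)|copyright)\s*\d{4}.*$", re.IGNORECASE | re.DOTALL
)
# A DOI, optionally prefixed by "doi:" or a doi.org URL.
DOI_RE = re.compile(
    r"\b(?:https?://(?:dx\.)?doi\.org/|doi:\s*)?10\.\d{4,9}/\S+", re.IGNORECASE
)


def strip_boilerplate(text: str) -> str:
    text = DOI_RE.sub("", text)
    text = COPYRIGHT_RE.sub("", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


sample = "We study topic models. doi:10.1000/xyz123 © 2023 Elsevier Ltd. All rights reserved."
print(strip_boilerplate(sample))
```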

Step 2: Discovery & Coding

  • Click "run abstract" or "run title".
  • The system will generate clusters and invoke the AI Council.
  • Navigation: Check the "⚖️ AI Council" tab to see the reasoning behind each label.
  • Action: In the "📋 Review Table", tick Approve for clusters you accept or provide a custom name in Rename To. Click "Submit Review".
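
The effect of a submitted review can be pictured as follows. The row fields (topic_id, label, approve, rename_to) and the rule that a rename counts as acceptance are assumptions made for this sketch; the app's internal table format is not documented here.

```python
from typing import Any, Dict, List


def apply_review(rows: List[Dict[str, Any]]) -> Dict[int, str]:
    """Return {topic_id: final_label} for clusters that were approved or renamed."""
    final: Dict[int, str] = {}
    for row in rows:
        rename = (row.get("rename_to") or "").strip()
        if rename:
            # A custom name overrides the AI-generated label.
            final[row["topic_id"]] = rename
        elif row.get("approve"):
            final[row["topic_id"]] = row["label"]
        # Rows that are neither approved nor renamed are dropped.
    return final


rows = [
    {"topic_id": 0, "label": "AI Ethics", "approve": True, "rename_to": ""},
    {"topic_id": 1, "label": "Misc", "approve": False, "rename_to": "Cloud Adoption"},
    {"topic_id": 2, "label": "Noise", "approve": False, "rename_to": None},
]
print(apply_review(rows))
```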

Step 3: Themes & Saturation

The agent combines approved codes into 4-8 themes. It will report Thematic Saturation (e.g., "Themes cover 92% of the corpus").
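
One plausible reading of the saturation figure is the share of documents whose topic ended up in some theme. The sketch below computes that percentage. BERTopic assigns topic -1 to outlier documents, so those count as uncovered unless a theme explicitly includes them. The function name and inputs are hypothetical, and the app may define saturation differently.

```python
from typing import Dict, List


def coverage_percent(doc_topics: List[int], topic_to_theme: Dict[int, str]) -> float:
    """Percentage of documents whose topic ID is mapped to a theme."""
    if not doc_topics:
        return 0.0
    covered = sum(1 for topic in doc_topics if topic in topic_to_theme)
    return round(100.0 * covered / len(doc_topics), 1)


# Ten documents: seven fall in topics 0 or 1, which belong to themes.
doc_topics = [0, 0, 1, -1, 2, 1, 0, -1, 1, 0]
themes = {0: "Theme A", 1: "Theme B"}
print(f"Themes cover {coverage_percent(doc_topics, themes)}% of the corpus")
```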

Step 4: Taxonomy Mapping

The tool automatically maps your themes to the PAJAIS Taxonomy.

  • Themes marked with 🌟 NOVEL are identified as potential new research contributions not found in standard taxonomies.
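
How the novelty decision is made is not described in this README. As a simple illustration only, a theme could be compared with every category name by string similarity and flagged as novel when even the best match is weak. The category names below are placeholders, not the real PAJAIS list, and the threshold and function name are assumptions.

```python
import json
from difflib import SequenceMatcher
from typing import Any, Dict, List


def map_theme(theme: str, categories: List[str], novelty_threshold: float = 0.5) -> Dict[str, Any]:
    best_category, best_ratio = None, 0.0
    for category in categories:
        # Character-level similarity in [0, 1] from the standard library.
        ratio = SequenceMatcher(None, theme.lower(), category.lower()).ratio()
        if ratio > best_ratio:
            best_category, best_ratio = category, ratio
    novelty = round(1.0 - best_ratio, 2)
    return {
        "theme": theme,
        "category": best_category,
        "novelty": novelty,
        # Flag the theme when no category resembles it closely enough.
        "novel": novelty > novelty_threshold,
    }


# Placeholder categories for illustration only.
categories = ["Information Security", "Knowledge Management"]
print(json.dumps(map_theme("Information Security", categories), indent=2))
```

String similarity is a crude proxy; a production mapping would more likely compare embeddings or ask an LLM to justify each assignment.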

Step 5: Final Report

The agent generates a 500-word Section 7 draft. Check the "💾 Download" tab for your full suite of results.


📈 Expected Outputs

| Output File | Description |
|---|---|
| narrative.txt | A complete Section 7 draft following academic standards. |
| comparison.csv | Side-by-side comparison of Abstract and Title themes. |
| taxonomy_map.json | JSON mapping of themes to PAJAIS categories. |
| chart_*.html | Interactive Plotly visualizations for intertopic distance and hierarchy. |
| *.png | High-resolution static exports of all charts. |

🛠️ Built With

  • Gradio: Modern UI Framework
  • LangGraph: Agentic Multi-Model Workflows
  • BERTopic: Advanced Topic Modeling
  • Sentence-Transformers: all-MiniLM-L6-v2 embeddings
  • Mistral Large: Primary Reasoning LLM
  • Groq (Llama-3): Secondary Council LLM
  • Plotly: Dynamic Data Science Charts

⚖️ License & Citation

If you use this tool in your research, please cite: Shivam Kadam, "BERTopic Agentic Topic Modelling for Systematic Literature Reviews," 2026.

Based on: Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2), 77-101.


Made with ❤️ for the Research Community