---
title: BERTopic Agentic Topic Modelling
emoji: 🧠
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
---

# 🔬 BERTopic Agentic Topic Modelling

### *Computational Thematic Analysis powered by Braun & Clarke (2006)*

![BERTopic Agent Logo](logo.png)

---

## 🌟 Overview

**BERTopic Agentic Topic Modelling** is a research tool that automates and supports **Thematic Analysis** of academic literature. It combines **BERTopic**'s transformer-based clustering with a **LangGraph-driven agentic workflow** to guide researchers through the six-phase framework of Braun & Clarke (2006).

It doesn't just cluster text; it *reasons* about it. An **"AI Council"** of two Large Language Models (Mistral Large and Llama-3 served via Groq) debates each topic label until the models reach consensus, with the goal of labels that hold up to academic review.

---

## 🧠 Theoretical Foundation: Braun & Clarke (2006)

This tool is strictly mapped to the six phases of thematic analysis as defined in the seminal work:

1.  **Familiarisation with data**: Automatic cleaning, boilerplate removal, and dataset profiling.
2.  **Generating initial codes**: BERTopic discovery and AI-assisted initial labeling.
3.  **Searching for themes**: LLM-driven consolidation of topics into overarching themes.
4.  **Reviewing potential themes**: Saturation checks and coverage analysis.
5.  **Defining and naming themes**: Generation of academic definitions and core narratives.
6.  **Producing the report**: Narrative writing (Section 7 draft) and PAJAIS taxonomy mapping.
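
The six phases can be modelled as an ordered state that only advances after a review step. The following is a minimal sketch, not the app's actual implementation: the `Phase` enum, the `next_phase` helper, and the `review_submitted` flag are hypothetical names chosen for illustration.

```python
from enum import IntEnum


class Phase(IntEnum):
    """Braun & Clarke (2006) phases, numbered as in the list above."""
    FAMILIARISATION = 1
    INITIAL_CODES = 2
    SEARCHING_THEMES = 3
    REVIEWING_THEMES = 4
    DEFINING_NAMING = 5
    REPORT = 6


def next_phase(current: Phase, review_submitted: bool) -> Phase:
    """Advance one phase, but only once the researcher has signed off.

    The review gate is an assumption made for this sketch.
    """
    if not review_submitted or current is Phase.REPORT:
        return current
    return Phase(current + 1)


print(next_phase(Phase.INITIAL_CODES, review_submitted=True).name)
# prints "SEARCHING_THEMES"
```

Keeping the phase as an explicit value makes it easy to drive a progress indicator and to refuse to skip ahead.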

---

## ✨ Key Features

- **🤖 Agentic Workflow**: A LangGraph agent manages the entire pipeline, maintaining memory and ensuring a step-by-step scientific process.
- **⚖️ AI Council**: Real-time debates between **Mistral-Large** and **Llama-3 (Groq)** to determine the most accurate thematic labels.
- **📊 Dynamic Visualizations**: 8+ interactive Plotly charts (Intertopic maps, Frequency bars, Heatmaps, Treemaps, and DBSCAN scatter plots).
- **🛡️ Multi-Model Analysis**: Run separate analyses on **Abstracts** vs. **Titles** and generate a side-by-side convergence CSV.
- **🔍 Density Refinement**: Optional **DBSCAN** clustering to complement traditional hierarchical methods and handle noise points elegantly.
- **🏷️ PAJAIS Taxonomy Mapping**: Automated gap analysis by mapping themes to the standard 25 PAJAIS Information Systems categories.
- **📥 One-Click Export**: Download structured JSON, side-by-side CSVs, PNG charts, and a 500-word academic narrative report.

---

## πŸ› οΈ Architecture

```mermaid
graph TD
    A[Scopus CSV Upload] --> B{Agentic Workflow}
    B -->|Phase 1| C[Data Loading & Cleaning]
    C -->|Phase 2| D[BERTopic / DBSCAN Discovery]
    D --> E[AI Council Labeling]
    E -->|Phase 3| F[Theme Consolidation]
    F -->|Phase 4| G[Saturation Check]
    G -->|Phase 5| H[Definition & Naming]
    H -->|Phase 5.5| I[PAJAIS Taxonomy Mapping]
    I -->|Phase 6| J[Report Generation]
    
    subgraph "AI Council"
    E1[Mistral-Large] <--> E2[Groq Llama-3]
    end
    
    subgraph "Outputs"
    J --> K[narrative.txt]
    J --> L[comparison.csv]
    J --> M[Interactive Charts]
    end
```

---

## 🖥️ App Navigation & Expected UI

The interface is divided into three logical zones for a streamlined user experience:

### 1. Control Center (Top & Left)
- **Phase Progress Bar**: A visual indicator of your progress through Braun & Clarke’s 6 phases.
- **Data Input (Left)**: The upload zone for your Scopus CSV. Once uploaded, Phase 1 triggers automatically.

### 2. The Agent Laboratory (Center)
- **Chatbot Interface**: Your main point of interaction. The agent will ask questions, provide stats, and guide you. You can type commands like "run abstract" or "Continue".
- **AI Council Feedback**: Every time a label is generated, look for the reasoning block. It shows the consensus score between models.
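
This README does not say how the consensus score is computed. As a rough illustration only, the sketch below scores agreement between two candidate labels by word overlap (Jaccard similarity). The function names and the `0.5` threshold are assumptions, and the real Council may well use a different measure.

```python
def consensus_score(label_a: str, label_b: str) -> float:
    """Jaccard overlap of the lowercase word sets, from 0.0 to 1.0."""
    words_a = set(label_a.lower().split())
    words_b = set(label_b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def council_decision(label_a: str, label_b: str, threshold: float = 0.5) -> dict:
    """Accept the first label when overlap reaches the (assumed) threshold."""
    score = consensus_score(label_a, label_b)
    agreed = score >= threshold
    return {
        "score": round(score, 2),
        "agreed": agreed,
        "label": label_a if agreed else None,  # None means: needs human review
    }


print(council_decision(
    "Deep Learning for Medical Imaging",
    "Medical Imaging with Deep Learning",
))
# prints {'score': 0.67, 'agreed': True, 'label': 'Deep Learning for Medical Imaging'}
```

A lexical measure like this is cheap but blind to synonyms; an embedding-based similarity would be the natural upgrade.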

### 3. Results Dashboard (Bottom Tabs)
- **📋 Review Table**: The "Heart" of the app. This is where you approve, rename, and refine the AI's findings. You MUST click **"Submit Review"** to move past STOP GATES.
- **📈 Charts Tab**: Switch between **Intertopic Map**, **Frequency Bars**, **Hierarchy (Treemap)**, and **Similarity Heatmap**.
- **⚖️ AI Council Tab**: A dedicated view showing the full transcript of debates between Mistral and Groq.
- **💾 Download Tab**: Your final repository. All files are generated in real-time and appear here for one-click downloading.

### 📤 Expected Output Preview
- **In Chat**: Summary tables, saturation percentages (e.g., "92.4% Coverage"), and phase completion checkmarks.
- **In Files**:
  - `narrative.txt`: Academic prose with structured headings.
  - `comparison.csv`: Columns for `Abstract Theme`, `Title Theme`, and `Convergence` (marked with ✓).
  - `taxonomy_map.json`: A mapping showing each theme's link to the PAJAIS framework and its **Novelty score**.
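
To make the `comparison.csv` layout concrete, here is a small sketch that builds the three columns named above. The matching rule (case-insensitive exact match between theme names) and the helper names are assumptions for illustration; the app may decide convergence differently.

```python
import csv
import io


def comparison_rows(abstract_themes, title_themes):
    """Pair themes found in both runs and mark them as convergent."""
    title_by_key = {theme.lower(): theme for theme in title_themes}
    rows, seen = [], set()
    for theme in abstract_themes:
        key = theme.lower()
        match = title_by_key.get(key, "")
        rows.append([theme, match, "✓" if match else ""])
        seen.add(key)
    # Themes that only the title run produced get their own rows.
    for theme in title_themes:
        if theme.lower() not in seen:
            rows.append(["", theme, ""])
    return rows


def to_csv(rows) -> str:
    """Serialise rows under the column names listed in this README."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Abstract Theme", "Title Theme", "Convergence"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = comparison_rows(
    ["AI Governance", "Digital Health"],
    ["digital health", "Blockchain"],
)
print(to_csv(rows))
```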

---


### 1. Prerequisites
- Python 3.9+
- API Keys for **Mistral AI** and **Groq** (optional but recommended for the Council feature).

### 2. Installation

Clone the repository and install the dependencies:

```bash
# Clone the repo
git clone https://github.com/ShivamKadam63s/BERT_Topic_Modelling.git
cd BERT_Topic_Modelling

# Install dependencies
pip install -r requirements.txt
```

### 3. Environment Setup

Create a `.env` file or export your API keys in your terminal:

```powershell
$env:MISTRAL_API_KEY="your_mistral_key"
$env:GROQ_API_KEY="your_groq_key"
```
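
Because the keys are optional, it can help to check at startup which Council members are usable. The sketch below reads the two environment variables named above; the `available_council_members` helper is hypothetical and not part of the app.

```python
import os


def available_council_members(env=os.environ):
    """Return the providers whose API key is set to a non-empty value."""
    key_names = {"Mistral": "MISTRAL_API_KEY", "Groq": "GROQ_API_KEY"}
    return [name for name, var in key_names.items() if env.get(var)]


members = available_council_members()
if len(members) < 2:
    # A two-model debate needs both keys.
    print(f"Council needs both keys; found: {members or 'none'}")
```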

### 4. Running the App

Start the Gradio interface:

```bash
python app.py
```

Open your browser at `http://localhost:7860`.

---

## 📖 User Guide: Phase-by-Phase Walkthrough

### Step 1: Data Input
Upload your **Scopus CSV** file. The agent will immediately scan the file, remove boilerplate text (Copyright notices, DOIs, etc.), and provide a dataset profile including paper counts and year ranges.
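
As an illustration of what boilerplate removal can look like, the sketch below strips DOIs and trailing copyright notices from an abstract with regular expressions. The patterns are assumptions written for this example; the app's actual cleaning rules are not documented here.

```python
import re

# Illustrative patterns only; the app's real rules are not shown in this README.
DOI = re.compile(
    r"\b(?:doi:\s*|https?://(?:dx\.)?doi\.org/)?10\.\d{4,9}/\S+",
    re.IGNORECASE,
)
# Assumes the copyright notice closes the abstract, so everything after it goes.
COPYRIGHT = re.compile(r"(?:©|\(c\)|copyright)\s*\d{4}.*$", re.IGNORECASE)


def clean_abstract(text: str) -> str:
    """Remove DOIs and a trailing copyright notice, then tidy whitespace."""
    text = DOI.sub("", text)
    text = COPYRIGHT.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


raw = (
    "We study topic models. doi:10.1000/xyz123 "
    "© 2024 Elsevier Ltd. All rights reserved."
)
print(clean_abstract(raw))
# prints "We study topic models."
```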

### Step 2: Discovery & Coding
- Click **"run abstract"** or **"run title"**.
- The system will generate clusters and invoke the **AI Council**.
- **Navigation**: Check the **"⚖️ AI Council"** tab to see the reasoning behind each label.
- **Action**: In the **"📋 Review Table"**, tick **Approve** for clusters you accept or provide a custom name in **Rename To**. Click **"Submit Review"**.

### Step 3: Themes & Saturation
The agent combines approved codes into 4-8 themes. It will report **Thematic Saturation** (e.g., "Themes cover 92% of the corpus").
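
One plausible reading of the saturation figure is the share of documents whose topic ended up inside an accepted theme. The sketch below computes that percentage. It assumes per-document topic IDs in the BERTopic style, where `-1` marks outliers, and a hypothetical `theme_topics` mapping from theme name to topic IDs.

```python
def coverage_percent(topic_ids, theme_topics):
    """Percentage of documents whose topic belongs to any accepted theme."""
    if not topic_ids:
        return 0.0
    covered_topics = set().union(*theme_topics.values())
    covered = sum(1 for topic in topic_ids if topic in covered_topics)
    return round(100 * covered / len(topic_ids), 1)


# Eight documents: topic 3 was not assigned to a theme, -1 is an outlier.
doc_topics = [0, 0, 1, 2, -1, 1, 3, 0]
themes = {"Theme A": {0, 1}, "Theme B": {2}}
print(coverage_percent(doc_topics, themes))
# prints 75.0
```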

### Step 4: Taxonomy Mapping
The tool automatically maps your themes to the **PAJAIS Taxonomy**. 
- Themes marked with 🌟 **NOVEL** are identified as potential new research contributions not found in standard taxonomies.

### Step 5: Final Report
The agent generates a **500-word Section 7 draft**. Check the **"💾 Download"** tab for your full suite of results.

---

## 📈 Expected Outputs

| Output File | Description |
| :--- | :--- |
| `narrative.txt` | A complete Section 7 draft following academic standards. |
| `comparison.csv` | Side-by-side comparison of Abstract and Title themes. |
| `taxonomy_map.json` | JSON mapping of themes to PAJAIS categories. |
| `chart_*.html` | Interactive Plotly visualizations for intertopic distance and hierarchy. |
| `*.png` | High-resolution static exports of all charts. |

---

## πŸ› οΈ Built With

- **Gradio**: Modern UI Framework
- **LangGraph**: Agentic Multi-Model Workflows
- **BERTopic**: Advanced Topic Modeling
- **Sentence-Transformers**: `all-MiniLM-L6-v2` embeddings
- **Mistral Large**: Primary Reasoning LLM
- **Groq (Llama-3)**: Secondary Council LLM
- **Plotly**: Dynamic Data Science Charts

---

## βš–οΈ License & Citation

If you use this tool in your research, please cite:
*Shivam Kadam, "BERTopic Agentic Topic Modelling for Systematic Literature Reviews," 2026.*

Based on:
*Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2), 77-101.*

---
<p align="center">Made with ❤️ for the Research Community</p>