CoolDataScientist committed on
Commit
a31c32e
·
verified ·
1 Parent(s): f35e567

Update README.md

Files changed (1)
  1. README.md +201 -191
README.md CHANGED
@@ -1,191 +1,201 @@
1
- # 🔬 BERTopic Agentic Topic Modelling
2
-
3
- ### *Computational Thematic Analysis powered by Braun & Clarke (2006)*
4
-
5
- ![BERTopic Agent Logo](logo.png)
6
-
7
- ---
8
-
9
- ## 🌟 Overview
10
-
11
- **BERTopic Agentic Topic Modelling** is a state-of-the-art research tool designed to automate and enhance the process of **Thematic Analysis** for academic literature. By integrating **BERTopic**'s transformer-based clustering with a **LangGraph-driven agentic workflow**, this application guides researchers through the rigorous 6-phase framework of Braun & Clarke (2006).
12
-
13
- It doesn't just cluster text; it *reasons* about it. Featuring a unique **"AI Council"** where multiple Large Language Models (Mistral & Groq) debate and reach consensus on topic labels, the tool ensures high-fidelity, publishable results.
14
-
15
- ---
16
-
17
- ## 🧠 Theoretical Foundation: Braun & Clarke (2006)
18
-
19
- This tool is strictly mapped to the six phases of thematic analysis as defined in the seminal work:
20
-
21
- 1. **Familiarisation with data**: Automatic cleaning, boilerplate removal, and dataset profiling.
22
- 2. **Generating initial codes**: BERTopic discovery and AI-assisted initial labeling.
23
- 3. **Searching for themes**: LLM-driven consolidation of topics into overarching themes.
24
- 4. **Reviewing potential themes**: Saturation checks and coverage analysis.
25
- 5. **Defining and naming themes**: Generation of academic definitions and core narratives.
26
- 6. **Producing the report**: Narrative writing (Section 7 draft) and PAJAIS taxonomy mapping.
27
-
28
- ---
29
-
30
- ## ✨ Key Features
31
-
32
- - **🤖 Agentic Workflow**: A LangGraph agent manages the entire pipeline, maintaining memory and ensuring a step-by-step scientific process.
33
- - **⚖️ AI Council**: Real-time debates between **Mistral-Large** and **Llama-3 (Groq)** to determine the most accurate thematic labels.
34
- - **📊 Dynamic Visualizations**: 8+ interactive Plotly charts (Intertopic maps, Frequency bars, Heatmaps, Treemaps, and DBSCAN scatter plots).
35
- - **🛡️ Multi-Model Analysis**: Run separate analyses on **Abstracts** vs. **Titles** and generate a side-by-side convergence CSV.
36
- - **🔍 Density Refinement**: Optional **DBSCAN** clustering to complement traditional hierarchical methods and handle noise points elegantly.
37
- - **🏷️ PAJAIS Taxonomy Mapping**: Automated gap analysis by mapping themes to the standard 25 PAJAIS Information Systems categories.
38
- - **📥 One-Click Export**: Download structured JSON, side-by-side CSVs, PNG charts, and a 500-word academic narrative report.
39
-
40
- ---
41
-
42
- ## 🛠️ Architecture
43
-
44
- ```mermaid
45
- graph TD
46
- A[Scopus CSV Upload] --> B{Agentic Workflow}
47
- B -->|Phase 1| C[Data Loading & Cleaning]
48
- C -->|Phase 2| D[BERTopic / DBSCAN Discovery]
49
- D --> E[AI Council Labeling]
50
- E -->|Phase 3| F[Theme Consolidation]
51
- F -->|Phase 4| G[Saturation Check]
52
- G -->|Phase 5| H[Definition & Naming]
53
- H -->|Phase 5.5| I[PAJAIS Taxonomy Mapping]
54
- I -->|Phase 6| J[Report Generation]
55
-
56
- subgraph "AI Council"
57
- E1[Mistral-Large] <--> E2[Groq Llama-3]
58
- end
59
-
60
- subgraph "Outputs"
61
- J --> K[narrative.txt]
62
- J --> L[comparison.csv]
63
- J --> M[Interactive Charts]
64
- end
65
- ```
66
-
67
- ---
68
-
69
- ## 🖥️ App Navigation & Expected UI
70
-
71
- The interface is divided into three logical zones for a streamlined user experience:
72
-
73
- ### 1. Control Center (Top & Left)
74
- - **Phase Progress Bar**: A visual indicator of your progress through Braun & Clarke’s 6 phases.
75
- - **Data Input (Left)**: The upload zone for your Scopus CSV. Once uploaded, Phase 1 triggers automatically.
76
-
77
- ### 2. The Agent Laboratory (Center)
78
- - **Chatbot Interface**: Your main point of interaction. The agent will ask questions, provide stats, and guide you. You can type commands like "run abstract" or "Continue".
79
- - **AI Council Feedback**: Every time a label is generated, look for the reasoning block. It shows the consensus score between models.
80
-
81
- ### 3. Results Dashboard (Bottom Tabs)
82
- - **📋 Review Table**: The "Heart" of the app. This is where you approve, rename, and refine the AI's findings. You MUST click **"Submit Review"** to move past STOP GATES.
83
- - **📈 Charts Tab**: Switch between **Intertopic Map**, **Frequency Bars**, **Hierarchy (Treemap)**, and **Similarity Heatmap**.
84
- - **⚖️ AI Council Tab**: A dedicated view showing the full transcript of debates between Mistral and Groq.
85
- - **💾 Download Tab**: Your final repository. All files are generated in real-time and appear here for one-click downloading.
86
-
87
- ### 📤 Expected Output Preview
88
- - **In Chat**: Summary tables, saturation percentages (e.g., "92.4% Coverage"), and phase completion checkmarks.
89
- - **In Files**:
90
- - `narrative.txt`: Academic prose with structured headings.
91
- - `comparison.csv`: Columns for `Abstract Theme`, `Title Theme`, and `Convergence` (marked with ✓).
92
- - `taxonomy_map.json`: A mapping showing each theme's link to the PAJAIS framework and its **Novelty score**.
93
-
94
- ---
95
-
96
-
97
- ### 1. Prerequisites
98
- - Python 3.9+
99
- - API Keys for **Mistral AI** and **Groq** (optional but recommended for the Council feature).
100
-
101
- ### 2. Installation
102
-
103
- Clone the repository and install the dependencies:
104
-
105
- ```bash
106
- # Clone the repo
107
- git clone https://github.com/ShivamKadam63s/BERT_Topic_Modelling.git
108
- cd BERT_Topic_Modelling
109
-
110
- # Install dependencies
111
- pip install -r requirements.txt
112
- ```
113
-
114
- ### 3. Environment Setup
115
-
116
- Create a `.env` file or export your API keys in your terminal:
117
-
118
- ```powershell
119
- $env:MISTRAL_API_KEY="your_mistral_key"
120
- $env:GROQ_API_KEY="your_groq_key"
121
- ```
122
-
123
- ### 4. Running the App
124
-
125
- Start the Gradio interface:
126
-
127
- ```bash
128
- python app.py
129
- ```
130
-
131
- Open your browser at `http://localhost:7860`.
132
-
133
- ---
134
-
135
- ## 📖 User Guide: Phase-by-Phase Walkthrough
136
-
137
- ### Step 1: Data Input
138
- Upload your **Scopus CSV** file. The agent will immediately scan the file, remove boilerplate text (Copyright notices, DOIs, etc.), and provide a dataset profile including paper counts and year ranges.
139
-
140
- ### Step 2: Discovery & Coding
141
- - Click **"run abstract"** or **"run title"**.
142
- - The system will generate clusters and invoke the **AI Council**.
143
- - **Navigation**: Check the **"⚖️ AI Council"** tab to see the reasoning behind each label.
144
- - **Action**: In the **"📋 Review Table"**, tick **Approve** for clusters you accept or provide a custom name in **Rename To**. Click **"Submit Review"**.
145
-
146
- ### Step 3: Themes & Saturation
147
- The agent combines approved codes into 4-8 themes. It will report **Thematic Saturation** (e.g., "Themes cover 92% of the corpus").
148
-
149
- ### Step 4: Taxonomy Mapping
150
- The tool automatically maps your themes to the **PAJAIS Taxonomy**.
151
- - Themes marked with 🌟 **NOVEL** are identified as potential new research contributions not found in standard taxonomies.
152
-
153
- ### Step 5: Final Report
154
- The agent generates a **500-word Section 7 draft**. Check the **"💾 Download"** tab for your full suite of results.
155
-
156
- ---
157
-
158
- ## 📈 Expected Outputs
159
-
160
- | Output File | Description |
161
- | :--- | :--- |
162
- | `narrative.txt` | A complete Section 7 draft following academic standards. |
163
- | `comparison.csv` | Side-by-side comparison of Abstract and Title themes. |
164
- | `taxonomy_map.json` | JSON mapping of themes to PAJAIS categories. |
165
- | `chart_*.html` | Interactive Plotly visualizations for intertopic distance and hierarchy. |
166
- | `*.png` | High-resolution static exports of all charts. |
167
-
168
- ---
169
-
170
- ## 🛠️ Built With
171
-
172
- - **Gradio**: Modern UI Framework
173
- - **LangGraph**: Agentic Multi-Model Workflows
174
- - **BERTopic**: Advanced Topic Modeling
175
- - **Sentence-Transformers**: `all-MiniLM-L6-v2` embeddings
176
- - **Mistral Large**: Primary Reasoning LLM
177
- - **Groq (Llama-3)**: Secondary Council LLM
178
- - **Plotly**: Dynamic Data Science Charts
179
-
180
- ---
181
-
182
- ## ⚖️ License & Citation
183
-
184
- If you use this tool in your research, please cite:
185
- *Shivam Kadam, "BERTopic Agentic Topic Modelling for Systematic Literature Reviews," 2026.*
186
-
187
- Based on:
188
- *Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2), 77-101.*
189
-
190
- ---
191
- <p align="center">Made with ❤️ for the Research Community</p>
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ title: BERTopic Agentic Topic Modelling
3
+ emoji: 🧠
4
+ colorFrom: blue
5
+ colorTo: indigo
6
+ sdk: gradio
7
+ app_file: app.py
8
+ pinned: false
9
+ ---
10
+
11
+ # 🔬 BERTopic Agentic Topic Modelling
12
+
13
+ ### *Computational Thematic Analysis powered by Braun & Clarke (2006)*
14
+
15
+ ![BERTopic Agent Logo](logo.png)
16
+
17
+ ---
18
+
19
+ ## 🌟 Overview
20
+
21
+ **BERTopic Agentic Topic Modelling** is a state-of-the-art research tool designed to automate and enhance the process of **Thematic Analysis** for academic literature. By integrating **BERTopic**'s transformer-based clustering with a **LangGraph-driven agentic workflow**, this application guides researchers through the rigorous 6-phase framework of Braun & Clarke (2006).
22
+
23
+ It doesn't just cluster text; it *reasons* about it. Featuring a unique **"AI Council"** where multiple Large Language Models (Mistral & Groq) debate and reach consensus on topic labels, the tool ensures high-fidelity, publishable results.
24
+
25
+ ---
26
+
27
+ ## 🧠 Theoretical Foundation: Braun & Clarke (2006)
28
+
29
+ This tool is strictly mapped to the six phases of thematic analysis as defined in the seminal work:
30
+
31
+ 1. **Familiarisation with data**: Automatic cleaning, boilerplate removal, and dataset profiling.
32
+ 2. **Generating initial codes**: BERTopic discovery and AI-assisted initial labeling.
33
+ 3. **Searching for themes**: LLM-driven consolidation of topics into overarching themes.
34
+ 4. **Reviewing potential themes**: Saturation checks and coverage analysis.
35
+ 5. **Defining and naming themes**: Generation of academic definitions and core narratives.
36
+ 6. **Producing the report**: Narrative writing (Section 7 draft) and PAJAIS taxonomy mapping.
37
+
38
+ ---
39
+
40
+ ## ✨ Key Features
41
+
42
+ - **🤖 Agentic Workflow**: A LangGraph agent manages the entire pipeline, maintaining memory and ensuring a step-by-step scientific process.
43
+ - **⚖️ AI Council**: Real-time debates between **Mistral-Large** and **Llama-3 (Groq)** to determine the most accurate thematic labels.
44
+ - **📊 Dynamic Visualizations**: 8+ interactive Plotly charts (Intertopic maps, Frequency bars, Heatmaps, Treemaps, and DBSCAN scatter plots).
45
+ - **🛡️ Multi-Model Analysis**: Run separate analyses on **Abstracts** vs. **Titles** and generate a side-by-side convergence CSV.
46
+ - **🔍 Density Refinement**: Optional **DBSCAN** clustering to complement traditional hierarchical methods and handle noise points elegantly.
47
+ - **🏷️ PAJAIS Taxonomy Mapping**: Automated gap analysis by mapping themes to the standard 25 PAJAIS Information Systems categories.
48
+ - **📥 One-Click Export**: Download structured JSON, side-by-side CSVs, PNG charts, and a 500-word academic narrative report.
49
+
50
+ ---
51
+
52
+ ## 🛠️ Architecture
53
+
54
+ ```mermaid
55
+ graph TD
56
+ A[Scopus CSV Upload] --> B{Agentic Workflow}
57
+ B -->|Phase 1| C[Data Loading & Cleaning]
58
+ C -->|Phase 2| D[BERTopic / DBSCAN Discovery]
59
+ D --> E[AI Council Labeling]
60
+ E -->|Phase 3| F[Theme Consolidation]
61
+ F -->|Phase 4| G[Saturation Check]
62
+ G -->|Phase 5| H[Definition & Naming]
63
+ H -->|Phase 5.5| I[PAJAIS Taxonomy Mapping]
64
+ I -->|Phase 6| J[Report Generation]
65
+
66
+ subgraph "AI Council"
67
+ E1[Mistral-Large] <--> E2[Groq Llama-3]
68
+ end
69
+
70
+ subgraph "Outputs"
71
+ J --> K[narrative.txt]
72
+ J --> L[comparison.csv]
73
+ J --> M[Interactive Charts]
74
+ end
75
+ ```
76
+
77
+ ---
78
+
79
+ ## 🖥️ App Navigation & Expected UI
80
+
81
+ The interface is divided into three logical zones for a streamlined user experience:
82
+
83
+ ### 1. Control Center (Top & Left)
84
+ - **Phase Progress Bar**: A visual indicator of your progress through Braun & Clarke’s 6 phases.
85
+ - **Data Input (Left)**: The upload zone for your Scopus CSV. Once uploaded, Phase 1 triggers automatically.
86
+
87
+ ### 2. The Agent Laboratory (Center)
88
+ - **Chatbot Interface**: Your main point of interaction. The agent will ask questions, provide stats, and guide you. You can type commands like "run abstract" or "Continue".
89
+ - **AI Council Feedback**: Every time a label is generated, look for the reasoning block. It shows the consensus score between models.
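
The README does not say how the consensus score between the two models is computed. As a rough illustration only, the sketch below scores agreement between two proposed labels by word overlap (Jaccard similarity). The function name `consensus_score` and the choice of metric are assumptions, not the app's actual method.

```python
def consensus_score(label_a, label_b):
    """Jaccard overlap of lower-cased word sets, from 0.0 (disjoint) to 1.0 (identical).

    Hypothetical stand-in for the council's consensus score.
    """
    words_a = set(label_a.lower().split())
    words_b = set(label_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty labels agree trivially
    return len(words_a & words_b) / len(words_a | words_b)


# Example: labels proposed by two different models for the same cluster.
score = consensus_score("Digital Health Adoption", "health adoption barriers")
print(f"Consensus: {score:.2f}")
```

A word-overlap score is cheap and transparent, but it misses synonyms; an embedding-based similarity would be a natural alternative if that matters for your labels.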
90
+
91
+ ### 3. Results Dashboard (Bottom Tabs)
92
+ - **📋 Review Table**: The "Heart" of the app. This is where you approve, rename, and refine the AI's findings. You MUST click **"Submit Review"** to move past STOP GATES.
93
+ - **📈 Charts Tab**: Switch between **Intertopic Map**, **Frequency Bars**, **Hierarchy (Treemap)**, and **Similarity Heatmap**.
94
+ - **⚖️ AI Council Tab**: A dedicated view showing the full transcript of debates between Mistral and Groq.
95
+ - **💾 Download Tab**: Your final repository. All files are generated in real-time and appear here for one-click downloading.
96
+
97
+ ### 📤 Expected Output Preview
98
+ - **In Chat**: Summary tables, saturation percentages (e.g., "92.4% Coverage"), and phase completion checkmarks.
99
+ - **In Files**:
100
+ - `narrative.txt`: Academic prose with structured headings.
101
+ - `comparison.csv`: Columns for `Abstract Theme`, `Title Theme`, and `Convergence` (marked with ✓).
102
+ - `taxonomy_map.json`: A mapping showing each theme's link to the PAJAIS framework and its **Novelty score**.
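
To make the `comparison.csv` layout concrete, here is a minimal sketch that writes the three columns named above. How the app actually pairs abstract and title themes is not documented here, so the case-insensitive name matching and the helper names `build_comparison_rows` and `to_csv` are assumptions.

```python
import csv
import io


def build_comparison_rows(abstract_themes, title_themes):
    """Pair themes by case-insensitive name; unmatched themes get an empty cell.

    Assumption: convergence means both runs produced the same theme name.
    """
    title_by_key = {t.strip().lower(): t for t in title_themes}
    rows, matched = [], set()
    for theme in abstract_themes:
        key = theme.strip().lower()
        if key in title_by_key:
            rows.append((theme, title_by_key[key], "✓"))
            matched.add(key)
        else:
            rows.append((theme, "", ""))
    # Title-only themes go at the end with no convergence mark.
    for key, theme in title_by_key.items():
        if key not in matched:
            rows.append(("", theme, ""))
    return rows


def to_csv(rows):
    """Serialise rows with the column headers described in the README."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Abstract Theme", "Title Theme", "Convergence"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = build_comparison_rows(["AI Adoption", "Privacy"], ["ai adoption", "Blockchain"])
print(to_csv(rows))
```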
103
+
104
+ ---
105
+
106
+
107
+ ### 1. Prerequisites
108
+ - Python 3.9+
109
+ - API Keys for **Mistral AI** and **Groq** (optional but recommended for the Council feature).
110
+
111
+ ### 2. Installation
112
+
113
+ Clone the repository and install the dependencies:
114
+
115
+ ```bash
116
+ # Clone the repo
117
+ git clone https://github.com/ShivamKadam63s/BERT_Topic_Modelling.git
118
+ cd BERT_Topic_Modelling
119
+
120
+ # Install dependencies
121
+ pip install -r requirements.txt
122
+ ```
123
+
124
+ ### 3. Environment Setup
125
+
126
+ Create a `.env` file or export your API keys in your terminal:
127
+
128
+ ```powershell
129
+ $env:MISTRAL_API_KEY="your_mistral_key"
130
+ $env:GROQ_API_KEY="your_groq_key"
131
+ ```
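
Since the keys are optional, it can help to check which ones are set before launching. The sketch below reads the two variable names from the step above. The helper `available_council_members` and the "enabled/disabled" messages are hypothetical and do not describe how `app.py` behaves.

```python
import os

# Variable names come from the Environment Setup step; the mapping itself is illustrative.
COUNCIL_KEYS = {"Mistral": "MISTRAL_API_KEY", "Groq": "GROQ_API_KEY"}


def available_council_members(env=None):
    """Return the providers whose API key is set and non-empty."""
    env = os.environ if env is None else env
    return [name for name, var in COUNCIL_KEYS.items() if env.get(var, "").strip()]


if __name__ == "__main__":
    members = available_council_members()
    if len(members) < 2:
        found = ", ".join(members) or "no providers"
        print(f"Council needs both keys; found: {found}")
    else:
        print(f"Council can run with: {', '.join(members)}")
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.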
132
+
133
+ ### 4. Running the App
134
+
135
+ Start the Gradio interface:
136
+
137
+ ```bash
138
+ python app.py
139
+ ```
140
+
141
+ Open your browser at `http://localhost:7860`.
142
+
143
+ ---
144
+
145
+ ## 📖 User Guide: Phase-by-Phase Walkthrough
146
+
147
+ ### Step 1: Data Input
148
+ Upload your **Scopus CSV** file. The agent will immediately scan the file, remove boilerplate text (Copyright notices, DOIs, etc.), and provide a dataset profile including paper counts and year ranges.
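
As an illustration of the boilerplate removal described above, the following sketch strips DOI strings and a trailing copyright notice from an abstract. The regular expressions and the function name `clean_abstract` are assumptions; the app's real cleaning rules are not shown in this README.

```python
import re

# Hypothetical patterns, written for typical publisher boilerplate in abstracts.
COPYRIGHT_RE = re.compile(r"(?:©|\(c\)|copyright)\s.*$", re.IGNORECASE)
DOI_RE = re.compile(
    r"\b(?:https?://(?:dx\.)?doi\.org/|doi:\s*)?10\.\d{4,9}/\S+", re.IGNORECASE
)


def clean_abstract(text):
    """Drop DOI strings and a trailing copyright notice, then collapse whitespace."""
    text = DOI_RE.sub("", text)
    text = COPYRIGHT_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


raw = "We study topic models. doi:10.1000/xyz123 © 2024 Elsevier Ltd. All rights reserved."
print(clean_abstract(raw))
```

Note that the copyright pattern removes everything from the first match to the end of the text, so it would also truncate an abstract that discusses copyright as a subject. A stricter pattern is worth considering for such corpora.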
149
+
150
+ ### Step 2: Discovery & Coding
151
+ - Click **"run abstract"** or **"run title"**.
152
+ - The system will generate clusters and invoke the **AI Council**.
153
+ - **Navigation**: Check the **"⚖️ AI Council"** tab to see the reasoning behind each label.
154
+ - **Action**: In the **"📋 Review Table"**, tick **Approve** for clusters you accept or provide a custom name in **Rename To**. Click **"Submit Review"**.
155
+
156
+ ### Step 3: Themes & Saturation
157
+ The agent combines approved codes into 4-8 themes. It will report **Thematic Saturation** (e.g., "Themes cover 92% of the corpus").
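
One plausible reading of the saturation figure is the share of documents whose topic ended up inside some theme. The sketch below computes that percentage. The function name `theme_coverage` and this definition are assumptions; the only library fact relied on is that BERTopic assigns topic `-1` to outlier documents.

```python
def theme_coverage(doc_topics, topic_to_theme):
    """Percentage of documents whose topic was merged into some theme.

    doc_topics: one topic id per document (-1 marks BERTopic outliers).
    topic_to_theme: mapping from approved topic ids to theme names.
    """
    if not doc_topics:
        return 0.0
    covered = sum(1 for topic in doc_topics if topic in topic_to_theme)
    return round(100.0 * covered / len(doc_topics), 1)


# Toy corpus: ten documents, topic 3 was not approved, two documents are outliers.
doc_topics = [0, 0, 1, 2, -1, 1, 3, 0, 2, -1]
topic_to_theme = {0: "Theme A", 1: "Theme A", 2: "Theme B"}
print(f"Themes cover {theme_coverage(doc_topics, topic_to_theme)}% of the corpus")
```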
158
+
159
+ ### Step 4: Taxonomy Mapping
160
+ The tool automatically maps your themes to the **PAJAIS Taxonomy**.
161
+ - Themes marked with 🌟 **NOVEL** are identified as potential new research contributions not found in standard taxonomies.
162
+
163
+ ### Step 5: Final Report
164
+ The agent generates a **500-word Section 7 draft**. Check the **"💾 Download"** tab for your full suite of results.
165
+
166
+ ---
167
+
168
+ ## 📈 Expected Outputs
169
+
170
+ | Output File | Description |
171
+ | :--- | :--- |
172
+ | `narrative.txt` | A complete Section 7 draft following academic standards. |
173
+ | `comparison.csv` | Side-by-side comparison of Abstract and Title themes. |
174
+ | `taxonomy_map.json` | JSON mapping of themes to PAJAIS categories. |
175
+ | `chart_*.html` | Interactive Plotly visualizations for intertopic distance and hierarchy. |
176
+ | `*.png` | High-resolution static exports of all charts. |
177
+
178
+ ---
179
+
180
+ ## 🛠️ Built With
181
+
182
+ - **Gradio**: Modern UI Framework
183
+ - **LangGraph**: Agentic Multi-Model Workflows
184
+ - **BERTopic**: Advanced Topic Modeling
185
+ - **Sentence-Transformers**: `all-MiniLM-L6-v2` embeddings
186
+ - **Mistral Large**: Primary Reasoning LLM
187
+ - **Groq (Llama-3)**: Secondary Council LLM
188
+ - **Plotly**: Dynamic Data Science Charts
189
+
190
+ ---
191
+
192
+ ## ⚖️ License & Citation
193
+
194
+ If you use this tool in your research, please cite:
195
+ *Shivam Kadam, "BERTopic Agentic Topic Modelling for Systematic Literature Reviews," 2026.*
196
+
197
+ Based on:
198
+ *Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2), 77-101.*
199
+
200
+ ---
201
+ <p align="center">Made with ❤️ for the Research Community</p>