CoolDataScientist/BERTopic-Modelling-Final


CoolDataScientist committed on Apr 26

Commit a31c32e (verified) · 1 parent: f35e567

Update README.md


README.md +201 -191


@@ -1,191 +1,201 @@
-# 🔬 BERTopic Agentic Topic Modelling
-### *Computational Thematic Analysis powered by Braun & Clarke (2006)*
-![BERTopic Agent Logo](logo.png)
----
-## 🌟 Overview
-**BERTopic Agentic Topic Modelling** is a state-of-the-art research tool designed to automate and enhance the process of **Thematic Analysis** for academic literature. By integrating **BERTopic**'s transformer-based clustering with a **LangGraph-driven agentic workflow**, this application guides researchers through the rigorous 6-phase framework of Braun & Clarke (2006).
-It doesn't just cluster text; it *reasons* about it. Featuring a unique **"AI Council"** where multiple Large Language Models (Mistral & Groq) debate and reach consensus on topic labels, the tool ensures high-fidelity, publishable results.
----
-## 🧠 Theoretical Foundation: Braun & Clarke (2006)
-This tool is strictly mapped to the six phases of thematic analysis as defined in the seminal work:
-1.  **Familiarisation with data**: Automatic cleaning, boilerplate removal, and dataset profiling.
-2.  **Generating initial codes**: BERTopic discovery and AI-assisted initial labeling.
-3.  **Searching for themes**: LLM-driven consolidation of topics into overarching themes.
-4.  **Reviewing potential themes**: Saturation checks and coverage analysis.
-5.  **Defining and naming themes**: Generation of academic definitions and core narratives.
-6.  **Producing the report**: Narrative writing (Section 7 draft) and PAJAIS taxonomy mapping.
----
-## ✨ Key Features
-- **🤖 Agentic Workflow**: A LangGraph agent manages the entire pipeline, maintaining memory and ensuring a step-by-step scientific process.
-- **⚖️ AI Council**: Real-time debates between **Mistral-Large** and **Llama-3 (Groq)** to determine the most accurate thematic labels.
-- **📊 Dynamic Visualizations**: 8+ interactive Plotly charts (Intertopic maps, Frequency bars, Heatmaps, Treemaps, and DBSCAN scatter plots).
-- **🛡️ Multi-Model Analysis**: Run separate analyses on **Abstracts** vs. **Titles** and generate a side-by-side convergence CSV.
-- **🔍 Density Refinement**: Optional **DBSCAN** clustering to complement traditional hierarchical methods and handle noise points elegantly.
-- **🏷️ PAJAIS Taxonomy Mapping**: Automated gap analysis by mapping themes to the standard 25 PAJAIS Information Systems categories.
-- **📥 One-Click Export**: Download structured JSON, side-by-side CSVs, PNG charts, and a 500-word academic narrative report.
----
-## 🛠️ Architecture
-```mermaid
-graph TD
-    A[Scopus CSV Upload] --> B{Agentic Workflow}
-    B -->|Phase 1| C[Data Loading & Cleaning]
-    C -->|Phase 2| D[BERTopic / DBSCAN Discovery]
-    D --> E[AI Council Labeling]
-    E -->|Phase 3| F[Theme Consolidation]
-    F -->|Phase 4| G[Saturation Check]
-    G -->|Phase 5| H[Definition & Naming]
-    H -->|Phase 5.5| I[PAJAIS Taxonomy Mapping]
-    I -->|Phase 6| J[Report Generation]
-    subgraph "AI Council"
-    E1[Mistral-Large] <--> E2[Groq Llama-3]
-    end
-    subgraph "Outputs"
-    J --> K[narrative.txt]
-    J --> L[comparison.csv]
-    J --> M[Interactive Charts]
-    end
-```
----
-## 🖥️ App Navigation & Expected UI
-The interface is divided into three logical zones for a streamlined user experience:
-### 1. Control Center (Top & Left)
-- **Phase Progress Bar**: A visual indicator of your progress through Braun & Clarke’s 6 phases.
-- **Data Input (Left)**: The upload zone for your Scopus CSV. Once uploaded, Phase 1 triggers automatically.
-### 2. The Agent Laboratory (Center)
-- **Chatbot Interface**: Your main point of interaction. The agent will ask questions, provide stats, and guide you. You can type commands like "run abstract" or "Continue".
-- **AI Council Feedback**: Every time a label is generated, look for the reasoning block. It shows the consensus score between models.
-### 3. Results Dashboard (Bottom Tabs)
-- **📋 Review Table**: The "Heart" of the app. This is where you approve, rename, and refine the AI's findings. You MUST click **"Submit Review"** to move past STOP GATES.
-- **📈 Charts Tab**: Switch between **Intertopic Map**, **Frequency Bars**, **Hierarchy (Treemap)**, and **Similarity Heatmap**.
-- **⚖️ AI Council Tab**: A dedicated view showing the full transcript of debates between Mistral and Groq.
-- **💾 Download Tab**: Your final repository. All files are generated in real-time and appear here for one-click downloading.
-### 📤 Expected Output Preview
-- **In Chat**: Summary tables, saturation percentages (e.g., "92.4% Coverage"), and phase completion checkmarks.
-- **In Files**:
-  - `narrative.txt`: Academic prose with structured headings.
-  - `comparison.csv`: Columns for `Abstract Theme`, `Title Theme`, and `Convergence` (marked with ✓).
-  - `taxonomy_map.json`: A mapping showing each theme's link to the PAJAIS framework and its **Novelty score**.
----
-### 1. Prerequisites
-- Python 3.9+
-- API Keys for **Mistral AI** and **Groq** (optional but recommended for the Council feature).
-### 2. Installation
-Clone the repository and install the dependencies:
-```bash
-# Clone the repo
-git clone https://github.com/ShivamKadam63s/BERT_Topic_Modelling.git
-cd BERT_Topic_Modelling
-# Install dependencies
-pip install -r requirements.txt
-```
-### 3. Environment Setup
-Create a `.env` file or export your API keys in your terminal:
-```powershell
-$env:MISTRAL_API_KEY="your_mistral_key"
-$env:GROQ_API_KEY="your_groq_key"
-```
-### 4. Running the App
-Start the Gradio interface:
-```bash
-python app.py
-```
-Open your browser at `http://localhost:7860`.
----
-## 📖 User Guide: Phase-by-Phase Walkthrough
-### Step 1: Data Input
-Upload your **Scopus CSV** file. The agent will immediately scan the file, remove boilerplate text (Copyright notices, DOIs, etc.), and provide a dataset profile including paper counts and year ranges.
-### Step 2: Discovery & Coding
-- Click **"run abstract"** or **"run title"**.
-- The system will generate clusters and invoke the **AI Council**.
-- **Navigation**: Check the **"⚖️ AI Council"** tab to see the reasoning behind each label.
-- **Action**: In the **"📋 Review Table"**, tick **Approve** for clusters you accept or provide a custom name in **Rename To**. Click **"Submit Review"**.
-### Step 3: Themes & Saturation
-The agent combines approved codes into 4-8 themes. It will report **Thematic Saturation** (e.g., "Themes cover 92% of the corpus").
-### Step 4: Taxonomy Mapping
-The tool automatically maps your themes to the **PAJAIS Taxonomy**.
-- Themes marked with 🌟 **NOVEL** are identified as potential new research contributions not found in standard taxonomies.
-### Step 5: Final Report
-The agent generates a **500-word Section 7 draft**. Check the **"💾 Download"** tab for your full suite of results.
----
-## 📈 Expected Outputs
-| Output File | Description |
-| :--- | :--- |
-| `narrative.txt` | A complete Section 7 draft following academic standards. |
-| `comparison.csv` | Side-by-side comparison of Abstract and Title themes. |
-| `taxonomy_map.json` | JSON mapping of themes to PAJAIS categories. |
-| `chart_*.html` | Interactive Plotly visualizations for intertopic distance and hierarchy. |
-| `*.png` | High-resolution static exports of all charts. |
----
-## 🛠️ Built With
-- **Gradio**: Modern UI Framework
-- **LangGraph**: Agentic Multi-Model Workflows
-- **BERTopic**: Advanced Topic Modeling
-- **Sentence-Transformers**: `all-MiniLM-L6-v2` embeddings
-- **Mistral Large**: Primary Reasoning LLM
-- **Groq (Llama-3)**: Secondary Council LLM
-- **Plotly**: Dynamic Data Science Charts
----
-## ⚖️ License & Citation
-If you use this tool in your research, please cite:
-*Shivam Kadam, "BERTopic Agentic Topic Modelling for Systematic Literature Reviews," 2026.*
-Based on:
-*Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2), 77-101.*
----
-<p align="center">Made with ❤️ for the Research Community</p>

+---
+title: BERTopic Agentic Topic Modelling
+emoji: 🧠
+colorFrom: blue
+colorTo: indigo
+sdk: gradio
+app_file: app.py
+pinned: false
+---
+# 🔬 BERTopic Agentic Topic Modelling
+### *Computational Thematic Analysis powered by Braun & Clarke (2006)*
+![BERTopic Agent Logo](logo.png)
+---
+## 🌟 Overview
+**BERTopic Agentic Topic Modelling** is a state-of-the-art research tool designed to automate and enhance the process of **Thematic Analysis** for academic literature. By integrating **BERTopic**'s transformer-based clustering with a **LangGraph-driven agentic workflow**, this application guides researchers through the rigorous 6-phase framework of Braun & Clarke (2006).
+It doesn't just cluster text; it *reasons* about it. Featuring a unique **"AI Council"** where multiple Large Language Models (Mistral & Groq) debate and reach consensus on topic labels, the tool ensures high-fidelity, publishable results.
+---
+## 🧠 Theoretical Foundation: Braun & Clarke (2006)
+This tool is strictly mapped to the six phases of thematic analysis as defined in the seminal work:
+1.  **Familiarisation with data**: Automatic cleaning, boilerplate removal, and dataset profiling.
+2.  **Generating initial codes**: BERTopic discovery and AI-assisted initial labeling.
+3.  **Searching for themes**: LLM-driven consolidation of topics into overarching themes.
+4.  **Reviewing potential themes**: Saturation checks and coverage analysis.
+5.  **Defining and naming themes**: Generation of academic definitions and core narratives.
+6.  **Producing the report**: Narrative writing (Section 7 draft) and PAJAIS taxonomy mapping.
+---
+## ✨ Key Features
+- **🤖 Agentic Workflow**: A LangGraph agent manages the entire pipeline, maintaining memory and ensuring a step-by-step scientific process.
+- **⚖️ AI Council**: Real-time debates between **Mistral-Large** and **Llama-3 (Groq)** to determine the most accurate thematic labels.
+- **📊 Dynamic Visualizations**: 8+ interactive Plotly charts (Intertopic maps, Frequency bars, Heatmaps, Treemaps, and DBSCAN scatter plots).
+- **🛡️ Multi-Model Analysis**: Run separate analyses on **Abstracts** vs. **Titles** and generate a side-by-side convergence CSV.
+- **🔍 Density Refinement**: Optional **DBSCAN** clustering to complement traditional hierarchical methods and handle noise points elegantly.
+- **🏷️ PAJAIS Taxonomy Mapping**: Automated gap analysis by mapping themes to the standard 25 PAJAIS Information Systems categories.
+- **📥 One-Click Export**: Download structured JSON, side-by-side CSVs, PNG charts, and a 500-word academic narrative report.
+---
+## 🛠️ Architecture
+```mermaid
+graph TD
+    A[Scopus CSV Upload] --> B{Agentic Workflow}
+    B -->|Phase 1| C[Data Loading & Cleaning]
+    C -->|Phase 2| D[BERTopic / DBSCAN Discovery]
+    D --> E[AI Council Labeling]
+    E -->|Phase 3| F[Theme Consolidation]
+    F -->|Phase 4| G[Saturation Check]
+    G -->|Phase 5| H[Definition & Naming]
+    H -->|Phase 5.5| I[PAJAIS Taxonomy Mapping]
+    I -->|Phase 6| J[Report Generation]
+    subgraph "AI Council"
+    E1[Mistral-Large] <--> E2[Groq Llama-3]
+    end
+    subgraph "Outputs"
+    J --> K[narrative.txt]
+    J --> L[comparison.csv]
+    J --> M[Interactive Charts]
+    end
+```
+---
+## 🖥️ App Navigation & Expected UI
+The interface is divided into three logical zones for a streamlined user experience:
+### 1. Control Center (Top & Left)
+- **Phase Progress Bar**: A visual indicator of your progress through Braun & Clarke’s 6 phases.
+- **Data Input (Left)**: The upload zone for your Scopus CSV. Once uploaded, Phase 1 triggers automatically.
+### 2. The Agent Laboratory (Center)
+- **Chatbot Interface**: Your main point of interaction. The agent will ask questions, provide stats, and guide you. You can type commands like "run abstract" or "Continue".
+- **AI Council Feedback**: Every time a label is generated, look for the reasoning block. It shows the consensus score between models.
+### 3. Results Dashboard (Bottom Tabs)
+- **📋 Review Table**: The "Heart" of the app. This is where you approve, rename, and refine the AI's findings. You MUST click **"Submit Review"** to move past STOP GATES.
+- **📈 Charts Tab**: Switch between **Intertopic Map**, **Frequency Bars**, **Hierarchy (Treemap)**, and **Similarity Heatmap**.
+- **⚖️ AI Council Tab**: A dedicated view showing the full transcript of debates between Mistral and Groq.
+- **💾 Download Tab**: Your final repository. All files are generated in real-time and appear here for one-click downloading.
+### 📤 Expected Output Preview
+- **In Chat**: Summary tables, saturation percentages (e.g., "92.4% Coverage"), and phase completion checkmarks.
+- **In Files**:
+  - `narrative.txt`: Academic prose with structured headings.
+  - `comparison.csv`: Columns for `Abstract Theme`, `Title Theme`, and `Convergence` (marked with ✓).
+  - `taxonomy_map.json`: A mapping showing each theme's link to the PAJAIS framework and its **Novelty score**.
+---
+### 1. Prerequisites
+- Python 3.9+
+- API Keys for **Mistral AI** and **Groq** (optional but recommended for the Council feature).
+### 2. Installation
+Clone the repository and install the dependencies:
+```bash
+# Clone the repo
+git clone https://github.com/ShivamKadam63s/BERT_Topic_Modelling.git
+cd BERT_Topic_Modelling
+# Install dependencies
+pip install -r requirements.txt
+```
+### 3. Environment Setup
+Create a `.env` file or export your API keys in your terminal:
+```powershell
+$env:MISTRAL_API_KEY="your_mistral_key"
+$env:GROQ_API_KEY="your_groq_key"
+```
+### 4. Running the App
+Start the Gradio interface:
+```bash
+python app.py
+```
+Open your browser at `http://localhost:7860`.
+---
+## 📖 User Guide: Phase-by-Phase Walkthrough
+### Step 1: Data Input
+Upload your **Scopus CSV** file. The agent will immediately scan the file, remove boilerplate text (Copyright notices, DOIs, etc.), and provide a dataset profile including paper counts and year ranges.
+### Step 2: Discovery & Coding
+- Click **"run abstract"** or **"run title"**.
+- The system will generate clusters and invoke the **AI Council**.
+- **Navigation**: Check the **"⚖️ AI Council"** tab to see the reasoning behind each label.
+- **Action**: In the **"📋 Review Table"**, tick **Approve** for clusters you accept or provide a custom name in **Rename To**. Click **"Submit Review"**.
+### Step 3: Themes & Saturation
+The agent combines approved codes into 4-8 themes. It will report **Thematic Saturation** (e.g., "Themes cover 92% of the corpus").
+### Step 4: Taxonomy Mapping
+The tool automatically maps your themes to the **PAJAIS Taxonomy**.
+- Themes marked with 🌟 **NOVEL** are identified as potential new research contributions not found in standard taxonomies.
+### Step 5: Final Report
+The agent generates a **500-word Section 7 draft**. Check the **"💾 Download"** tab for your full suite of results.
+---
+## 📈 Expected Outputs
+| Output File | Description |
+| :--- | :--- |
+| `narrative.txt` | A complete Section 7 draft following academic standards. |
+| `comparison.csv` | Side-by-side comparison of Abstract and Title themes. |
+| `taxonomy_map.json` | JSON mapping of themes to PAJAIS categories. |
+| `chart_*.html` | Interactive Plotly visualizations for intertopic distance and hierarchy. |
+| `*.png` | High-resolution static exports of all charts. |
+---
+## 🛠️ Built With
+- **Gradio**: Modern UI Framework
+- **LangGraph**: Agentic Multi-Model Workflows
+- **BERTopic**: Advanced Topic Modeling
+- **Sentence-Transformers**: `all-MiniLM-L6-v2` embeddings
+- **Mistral Large**: Primary Reasoning LLM
+- **Groq (Llama-3)**: Secondary Council LLM
+- **Plotly**: Dynamic Data Science Charts
+---
+## ⚖️ License & Citation
+If you use this tool in your research, please cite:
+*Shivam Kadam, "BERTopic Agentic Topic Modelling for Systematic Literature Reviews," 2026.*
+Based on:
+*Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2), 77-101.*
+---
+<p align="center">Made with ❤️ for the Research Community</p>
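
The README's saturation figure (e.g., "Themes cover 92% of the corpus") is not defined anywhere in this diff. Below is a minimal sketch of one plausible reading, assuming coverage means the share of documents whose topic was approved in the Review Table. The function name `coverage` and the sample data are hypothetical and not taken from `app.py`; `-1` is the label BERTopic gives to outlier documents.

```python
def coverage(topic_ids, approved_topics):
    """Return the fraction of documents assigned to an approved topic.

    Hypothetical definition: the README does not state how saturation
    is actually computed in the app.
    """
    if not topic_ids:
        return 0.0
    approved = set(approved_topics)
    covered = sum(1 for topic in topic_ids if topic in approved)
    return covered / len(topic_ids)


# Illustrative per-document topic assignments; -1 marks BERTopic outliers.
topic_ids = [0, 0, 1, 2, -1, 1, 0, 3, 2, -1]
# Topics the reviewer approved (topic 3 was left unapproved in this example).
approved = [0, 1, 2]

print(f"{coverage(topic_ids, approved):.1%} Coverage")  # prints "70.0% Coverage"
```

Under this reading, outliers and unapproved topics both lower the figure, so a low percentage could point to either a noisy corpus or an incomplete review.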