# Social Media Topic Modeling System
A comprehensive topic modeling system for social media analysis built with Streamlit and BERTopic. This application supports flexible CSV column mapping, multilingual topic modeling, Gini coefficient calculation, and topic evolution analysis.
## Features
- **📊 Topic Modeling**: Uses BERTopic for state-of-the-art topic modeling.
- **⚙️ Flexible Configuration**:
  - **Custom Column Mapping**: Use any CSV file by mapping your columns to `user_id`, `post_content`, and `timestamp`.
  - **Topic Number Control**: Let the model find topics automatically or specify the exact number you need.
- **🌍 Multilingual Support**: Handles English and 50+ other languages.
- **📈 Gini Coefficient Analysis**: Calculates topic distribution inequality per user and per topic.
- **⏰ Topic Evolution**: Tracks how topics change over time.
- **🎯 Interactive Visualizations**: Built-in charts and data tables using Plotly.
- **📱 Responsive Interface**: Clean, modern Streamlit interface with a control sidebar.
## Requirements
### CSV File Format
Your CSV file must contain columns that can be mapped to the following roles:
- **User ID**: A column with unique identifiers for each user (string).
- **Post Content**: A column with the text content of the social media post (string).
- **Timestamp**: A column with the date and time of the post (e.g., "2023-01-15 14:30:00").
The application will prompt you to select the correct column for each role after you upload your file.
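The column mapping described above can be sketched with the standard library alone. This is a minimal illustration, not the application's actual loader: the function name `load_posts`, the example column names (`author`, `text`, `created_at`), and the fixed timestamp format are all assumptions made for the example.

```python
import csv
import io
from datetime import datetime

# Hypothetical mapping picked by the user: role -> column name in their CSV.
COLUMN_MAP = {"user_id": "author", "post_content": "text", "timestamp": "created_at"}


def load_posts(csv_file, column_map, ts_format="%Y-%m-%d %H:%M:%S"):
    """Read a CSV and rename the mapped columns to the three standard roles."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [col for col in column_map.values() if col not in header]
    if missing:
        raise ValueError(f"CSV is missing mapped columns: {missing}")
    posts = []
    for row in reader:
        posts.append({
            "user_id": row[column_map["user_id"]],
            "post_content": row[column_map["post_content"]],
            # Assumes every timestamp follows ts_format, as in the example above.
            "timestamp": datetime.strptime(row[column_map["timestamp"]], ts_format),
        })
    return posts


sample = io.StringIO(
    "author,text,created_at\n"
    "u1,Hello world,2023-01-15 14:30:00\n"
    "u2,Another post,2023-01-16 09:00:00\n"
)
posts = load_posts(sample, COLUMN_MAP)
```

Checking for missing columns before reading any rows gives a clear error when the mapping does not match the file.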
### Dependencies
See `requirements.txt` for a full list of dependencies.
## Installation
### Option 1: Local Installation
1. **Clone or download the project files.**
2. **Install dependencies:**
```bash
pip install -r requirements.txt
```
### Option 2: Docker Installation (Recommended)
1. **Using Docker Compose (easiest):**
```bash
docker-compose up --build
```
2. **Access the application:**
```
http://localhost:8501
```
## Usage
1. **Start the Streamlit application** (local installation only; with Docker Compose the app is already running):
```bash
streamlit run app.py
```
2. **Open your browser** and navigate to `http://localhost:8501`.
3. **Follow the steps in the sidebar:**
   - **1. Upload CSV File**: Click "Browse files" to upload your dataset.
   - **2. Map Data Columns**: Once uploaded, select which of your columns correspond to `User ID`, `Post Content`, and `Timestamp`.
   - **3. Configure Analysis**:
     - **Language Model**: Choose `english` for English-only data or `multilingual` for other languages.
     - **Number of Topics**: Enter a specific number of topics to find, or use `-1` to let the model decide automatically.
     - **Custom Stopwords**: (Optional) Enter comma-separated words to exclude from analysis.
   - **4. Run Analysis**: Click the "🚀 Analyze Topics" button.
4. **Explore the results** in the five interactive tabs in the main panel.
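The configuration options in the sidebar could be translated into model settings roughly as follows. This is a sketch under stated assumptions, not the code in `app.py`: the helper name `build_topic_model_settings` is hypothetical, and the mapping of `-1` to `nr_topics=None` is one plausible reading of "let the model decide". `language` and `nr_topics` are real `BERTopic` constructor parameters.

```python
def build_topic_model_settings(language, num_topics, custom_stopwords=""):
    """Turn sidebar inputs into (bertopic_kwargs, stopword_list).

    Hypothetical helper: the returned kwargs are meant to be passed as
    BERTopic(**bertopic_kwargs). The stopword list could be given to
    scikit-learn's CountVectorizer(stop_words=...) and supplied to BERTopic
    through its vectorizer_model parameter.
    """
    if language not in ("english", "multilingual"):
        raise ValueError("language must be 'english' or 'multilingual'")
    if num_topics != -1 and num_topics < 1:
        raise ValueError("num_topics must be -1 or a positive integer")

    # Split the comma-separated input, dropping blanks and extra whitespace.
    stopwords = [w.strip().lower() for w in custom_stopwords.split(",") if w.strip()]

    bertopic_kwargs = {
        "language": language,
        # Assumption: -1 means "no topic reduction", i.e. nr_topics=None.
        "nr_topics": None if num_topics == -1 else num_topics,
    }
    return bertopic_kwargs, stopwords


kwargs, stopwords = build_topic_model_settings("multilingual", -1, "rt, via , ,amp")
```

Keeping this translation in a plain function makes the sidebar logic testable without loading an embedding model.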
### Using the Interface
The application provides five main tabs:
#### 📋 Overview
- Key metrics, dataset preview, and average Gini coefficient.
#### 🎯 Topics
- Topic information table and topic distribution bar chart.
#### 📊 Gini Analysis
- Analysis of topic diversity for each user and user concentration for each topic.
#### 📈 Topic Evolution
- Timelines showing how topic popularity changes over time, for all users and for individual users.
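A topic timeline of this kind boils down to counting posts per topic in each time bin. The sketch below illustrates the idea with monthly bins; the function name `topic_counts_by_month` and the sample data are assumptions for illustration. BERTopic also provides a `topics_over_time` method, though this README does not say whether the app uses it.

```python
from collections import Counter
from datetime import datetime


def topic_counts_by_month(timestamps, topics):
    """Count posts per (month, topic) pair, with the month as 'YYYY-MM'."""
    counts = Counter()
    for ts, topic in zip(timestamps, topics):
        counts[(ts.strftime("%Y-%m"), topic)] += 1
    return counts


# Illustrative data: three posts with their assigned topic ids.
timestamps = [
    datetime(2023, 1, 15, 14, 30),
    datetime(2023, 1, 20, 9, 0),
    datetime(2023, 2, 3, 18, 45),
]
topics = [0, 0, 1]
timeline = topic_counts_by_month(timestamps, topics)
```

Filtering the inputs to a single `user_id` before calling the function would give the per-user timeline.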
#### 📄 Documents
- A detailed view of your original data with assigned topics and probabilities.
## Understanding the Results
### Gini Coefficient
- **Range**: 0 to 1
- **User Gini**: Measures how unevenly a user's posts are distributed across topics. **0** = perfectly diverse (posts spread evenly across topics), **1** = perfectly specialized (all posts on one topic).
- **Topic Gini**: Measures how concentrated a topic is among users. **0** = discussed evenly by many users, **1** = dominated by one or very few users.
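For reference, here is a minimal sketch of the standard Gini coefficient applied to post counts. It is not necessarily the formula the app uses: the function name `gini` and the choice to include zero-count topics are assumptions. With this unnormalized formula the maximum for `n` categories is `(n - 1) / n`, so reaching exactly 1 would need an extra normalization step.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 = even, higher = more unequal)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0  # Convention chosen here for empty or all-zero input.
    # Rank-weighted sum over the ascending values, with ranks starting at 1.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# User Gini: one user's post counts per topic (illustrative numbers).
even_user = gini([5, 5, 5, 5])          # 0.0, posts spread evenly over 4 topics
specialized_user = gini([0, 0, 0, 20])  # 0.75, all posts on one of 4 topics
```

The same function gives a Topic Gini when the input is the number of posts each user contributed to a single topic.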
---
**Built with ❤️ using Streamlit and BERTopic**