Social Media Topic Modeling System
A comprehensive topic modeling system for social media analysis built with Streamlit and BERTopic. This application supports flexible CSV column mapping, multilingual topic modeling, Gini coefficient calculation, and topic evolution analysis.
Features
- Topic Modeling: Uses BERTopic for state-of-the-art topic modeling.
- Flexible Configuration:
  - Custom Column Mapping: Use any CSV file by mapping your columns to user_id, post_content, and timestamp.
  - Topic Number Control: Let the model find topics automatically or specify the exact number you need.
- Multilingual Support: Handles English and 50+ other languages.
- Gini Coefficient Analysis: Calculates topic distribution inequality per user and per topic.
- Topic Evolution: Tracks how topics change over time.
- Interactive Visualizations: Built-in charts and data tables using Plotly.
- Responsive Interface: Clean, modern Streamlit interface with a control sidebar.
Requirements
CSV File Format
Your CSV file must contain columns that can be mapped to the following roles:
- User ID: A column with unique identifiers for each user (string).
- Post Content: A column with the text content of the social media post (string).
- Timestamp: A column with the date and time of the post (e.g., "2023-01-15 14:30:00").
The application will prompt you to select the correct column for each role after you upload your file.
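As a rough illustration of what column mapping amounts to, the sketch below reads CSV text and re-keys each row by role. It uses only the standard library; the function name `map_columns`, the `ts_format` default (taken from the example timestamp above), and the sample column names are assumptions for illustration, not the application's actual code.

```python
import csv
import io
from datetime import datetime

# Role names as listed in this README.
ROLES = ("user_id", "post_content", "timestamp")


def map_columns(csv_text, mapping, ts_format="%Y-%m-%d %H:%M:%S"):
    """Return rows keyed by role, given a mapping of role -> CSV column name."""
    missing = [role for role in ROLES if role not in mapping]
    if missing:
        raise ValueError(f"No column selected for: {', '.join(missing)}")

    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    unknown = [col for col in mapping.values() if col not in header]
    if unknown:
        raise ValueError(f"Columns not found in CSV: {', '.join(unknown)}")

    rows = []
    for record in reader:
        rows.append({
            "user_id": str(record[mapping["user_id"]]),
            "post_content": str(record[mapping["post_content"]]),
            # Assumed timestamp layout; other layouts need a different ts_format.
            "timestamp": datetime.strptime(record[mapping["timestamp"]], ts_format),
        })
    return rows


# Hypothetical file whose columns use different names than the roles.
sample = (
    "author,text,created_at\n"
    "u1,Hello world,2023-01-15 14:30:00\n"
    "u2,Another post,2023-01-16 09:05:00\n"
)
rows = map_columns(
    sample,
    {"user_id": "author", "post_content": "text", "timestamp": "created_at"},
)
print(rows[0]["user_id"], rows[0]["timestamp"].isoformat())
```

Validating the mapping before reading any rows means a wrong column selection fails with a clear message instead of a `KeyError` partway through the file.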
Dependencies
See requirements.txt for a full list of dependencies.
Installation
Option 1: Local Installation
- Clone or download the project files.
- Install dependencies:
pip install -r requirements.txt
Option 2: Docker Installation (Recommended)
- Using Docker Compose (easiest):
docker-compose up --build
- Access the application:
http://localhost:8501
Usage
Start the Streamlit application:
streamlit run app.py
Open your browser and navigate to http://localhost:8501.
Follow the steps in the sidebar:
- 1. Upload CSV File: Click "Browse files" to upload your dataset.
- 2. Map Data Columns: Once uploaded, select which of your columns correspond to User ID, Post Content, and Timestamp.
- 3. Configure Analysis:
  - Language Model: Choose english for English-only data or multilingual for other languages.
  - Number of Topics: Enter a specific number of topics to find, or use -1 to let the model decide automatically.
  - Custom Stopwords: (Optional) Enter comma-separated words to exclude from analysis.
- 4. Run Analysis: Click the "Analyze Topics" button.
Explore the results in the five interactive tabs in the main panel.
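To make the configuration step concrete, here is a minimal sketch of how the three sidebar values could be normalized before a model is built. The function `build_model_settings` and its return shape are hypothetical. BERTopic's constructor does accept `language` and `nr_topics` arguments, but translating -1 to `None` (no topic reduction) is an assumption about how this app behaves.

```python
def build_model_settings(language_choice, n_topics, stopwords_text=""):
    """Normalize sidebar inputs into settings for a topic model (hypothetical helper)."""
    if language_choice not in ("english", "multilingual"):
        raise ValueError("language_choice must be 'english' or 'multilingual'")
    if n_topics != -1 and n_topics < 1:
        raise ValueError("n_topics must be a positive integer or -1")

    # Split the comma-separated field and drop empty entries.
    stopwords = [word.strip() for word in stopwords_text.split(",") if word.strip()]

    return {
        "language": language_choice,
        # Assumption: -1 in the UI means "do not force a topic count".
        "nr_topics": None if n_topics == -1 else n_topics,
        "stopwords": stopwords,
    }


settings = build_model_settings("multilingual", -1, "rt, amp , ,https")
print(settings)
```

With BERTopic installed, the `language` and `nr_topics` values could be passed to the `BERTopic(...)` constructor, and the stopword list to a scikit-learn `CountVectorizer(stop_words=...)` supplied as `vectorizer_model`. Whether the app wires them exactly this way is not stated here.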
Using the Interface
The application provides five main tabs:
Overview
- Key metrics, dataset preview, and average Gini coefficient.
Topics
- Topic information table and topic distribution bar chart.
Gini Analysis
- Analysis of topic diversity for each user and user concentration for each topic.
Topic Evolution
- Timelines showing how topic popularity changes over time, for all users and for individual users.
Documents
- A detailed view of your original data with assigned topics and probabilities.
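The timelines in the Topic Evolution tab boil down to counting posts per topic per time period. The sketch below shows one simple way to do that with monthly bins. The function name, the monthly granularity, and the sample data are assumptions for illustration; the app may bin differently or rely on BERTopic's own `topics_over_time` method.

```python
from collections import Counter
from datetime import datetime


def topic_counts_by_month(posts):
    """Count posts per (month, topic) from (timestamp, topic_id) pairs."""
    counts = Counter()
    for timestamp, topic in posts:
        month = timestamp.strftime("%Y-%m")  # e.g. "2023-01"
        counts[(month, topic)] += 1
    # Sort by month, then topic, so the result reads as a timeline.
    return dict(sorted(counts.items()))


# Hypothetical assigned topics for three posts.
posts = [
    (datetime(2023, 1, 15, 14, 30), 0),
    (datetime(2023, 1, 20, 8, 0), 0),
    (datetime(2023, 2, 1, 12, 0), 1),
]
timeline = topic_counts_by_month(posts)
print(timeline)
```

Filtering `posts` to a single user before calling the function gives the per-user view.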
Understanding the Results
Gini Coefficient
- Range: 0 to 1
- User Gini: Measures how unevenly a user's posts are spread across topics. 0 = perfectly diverse (posts spread evenly across topics), 1 = perfectly specialized (all posts on one topic).
- Topic Gini: Measures how concentrated a topic is among users. 0 = discussed equally by all users, 1 = dominated by a few users.
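For reference, a minimal sketch of the standard Gini formula applied to a list of counts (for example, one user's post count per topic). This is the textbook sorted-rank formulation, not necessarily the exact variant the application uses. Note that for a finite list of n values this formula tops out at (n - 1) / n rather than exactly 1, and the result depends on whether topics with zero posts are included in the list.

```python
def gini(values):
    """Gini coefficient of non-negative counts (0 = even, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0  # Convention chosen here for empty or all-zero input.
    # Rank-weighted sum over the sorted values, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# A user posting evenly across four topics vs. only on one of four.
even_user = gini([5, 5, 5, 5])
focused_user = gini([0, 0, 0, 20])
print(even_user, focused_user)
```

The same function applies to Topic Gini by passing, for one topic, the number of posts each user contributed to it.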
Built with ❤️ using Streamlit and BERTopic