bertopic / Social Media Topic Modeling System.md
Mars203020's picture
Upload 17 files
b7b041e verified

Social Media Topic Modeling System

A comprehensive topic modeling system for social media analysis built with Streamlit and BERTopic. This application supports flexible CSV column mapping, multilingual topic modeling, Gini coefficient calculation, and topic evolution analysis.

Features

  • πŸ“Š Topic Modeling: Uses BERTopic for state-of-the-art topic modeling.
  • βš™οΈ Flexible Configuration:
    • Custom Column Mapping: Use any CSV file by mapping your columns to user_id, post_content, and timestamp.
    • Topic Number Control: Let the model find topics automatically or specify the exact number you need.
  • 🌍 Multilingual Support: Handles English and 50+ other languages.
  • πŸ“ˆ Gini Coefficient Analysis: Calculates topic distribution inequality per user and per topic.
  • ⏰ Topic Evolution: Tracks how topics change over time.
  • 🎯 Interactive Visualizations: Built-in charts and data tables using Plotly.
  • πŸ“± Responsive Interface: Clean, modern Streamlit interface with a control sidebar.

Requirements

CSV File Format

Your CSV file must contain columns that can be mapped to the following roles:

  • User ID: A column with unique identifiers for each user (string).
  • Post Content: A column with the text content of the social media post (string).
  • Timestamp: A column with the date and time of the post (e.g., "2023-01-15 14:30:00").

The application will prompt you to select the correct column for each role after you upload your file.

Dependencies

See requirements.txt for a full list of dependencies.

Installation

Option 1: Local Installation

  1. Clone or download the project files.
  2. Install dependencies:
    pip install -r requirements.txt
    

Option 2: Docker Installation (Recommended)

  1. Using Docker Compose (easiest):
    docker-compose up --build
    
  2. Access the application:
    http://localhost:8501
    

Usage

  1. Start the Streamlit application:

    streamlit run app.py
    
  2. Open your browser and navigate to http://localhost:8501.

  3. Follow the steps in the sidebar:

    • 1. Upload CSV File: Click "Browse files" to upload your dataset.
    • 2. Map Data Columns: Once uploaded, select which of your columns correspond to User ID, Post Content, and Timestamp.
    • 3. Configure Analysis:
      • Language Model: Choose english for English-only data or multilingual for other languages.
      • Number of Topics: Enter a specific number of topics to find, or use -1 to let the model decide automatically.
      • Custom Stopwords: (Optional) Enter comma-separated words to exclude from analysis.
    • 4. Run Analysis: Click the "πŸš€ Analyze Topics" button.
  4. Explore the results in the five interactive tabs in the main panel.

Using the Interface

The application provides five main tabs:

πŸ“‹ Overview

  • Key metrics, dataset preview, and average Gini coefficient.

🎯 Topics

  • Topic information table and topic distribution bar chart.

πŸ“Š Gini Analysis

  • Analysis of topic diversity for each user and user concentration for each topic.

πŸ“ˆ Topic Evolution

  • Timelines showing how topic popularity changes over time, for all users and for individual users.

πŸ“„ Documents

  • A detailed view of your original data with assigned topics and probabilities.

Understanding the Results

Gini Coefficient

  • Range: 0 to 1
  • User Gini: Measures how diverse a user's topics are. 0 = perfectly diverse (posts on many topics), 1 = perfectly specialized (posts on one topic).
  • Topic Gini: Measures how concentrated a topic is among users. 0 = widely discussed by many users, 1 = dominated by a few users.

Built with ❀️ using Streamlit and BERTopic