# Social Media Topic Modeling System

A comprehensive topic modeling system for social media analysis built with Streamlit and BERTopic. This application supports flexible CSV column mapping, multilingual topic modeling, Gini coefficient calculation, and topic evolution analysis.

## Features

- **📊 Topic Modeling**: Uses BERTopic for state-of-the-art topic modeling.
- **⚙️ Flexible Configuration**:
    - **Custom Column Mapping**: Use any CSV file by mapping your columns to `user_id`, `post_content`, and `timestamp`.
    - **Topic Number Control**: Let the model find topics automatically or specify the exact number you need.
- **🌍 Multilingual Support**: Handles English and 50+ other languages.
- **📈 Gini Coefficient Analysis**: Calculates topic distribution inequality per user and per topic.
- **⏰ Topic Evolution**: Tracks how topics change over time.
- **🎯 Interactive Visualizations**: Built-in charts and data tables using Plotly.
- **📱 Responsive Interface**: Clean, modern Streamlit interface with a control sidebar.

## Requirements

### CSV File Format

Your CSV file must contain columns that can be mapped to the following roles:
- **User ID**: A column with unique identifiers for each user (string).
- **Post Content**: A column with the text content of the social media post (string).
- **Timestamp**: A column with the date and time of the post (e.g., "2023-01-15 14:30:00").

The application will prompt you to select the correct column for each role after you upload your file.
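
As a rough illustration of what the mapping step amounts to, the sketch below renames user-chosen columns to the three roles and parses the timestamp. The source column names (`author`, `text`, `created_at`), the `map_columns` helper, and the timestamp format are assumptions made for this example; they are not taken from the application's code.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample data; the header names are assumptions for illustration.
SAMPLE_CSV = """author,text,created_at
u1,Great game last night,2023-01-15 14:30:00
u2,New phone review is up,2023-01-16 09:05:00
"""

def map_columns(rows, mapping, ts_format="%Y-%m-%d %H:%M:%S"):
    """Rename source columns to user_id, post_content, and timestamp."""
    mapped = []
    for row in rows:
        mapped.append({
            "user_id": row[mapping["user_id"]],
            "post_content": row[mapping["post_content"]],
            # Parse the timestamp so later steps can sort and bin by time.
            "timestamp": datetime.strptime(row[mapping["timestamp"]], ts_format),
        })
    return mapped

# Role -> column name in the uploaded file (what the user picks in the sidebar).
mapping = {"user_id": "author", "post_content": "text", "timestamp": "created_at"}
posts = map_columns(csv.DictReader(io.StringIO(SAMPLE_CSV)), mapping)
print(posts[0]["user_id"], posts[0]["timestamp"].year)
```

A timestamp that does not match `ts_format` raises `ValueError` here, so a real pipeline would likely need a more tolerant parser.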

### Dependencies

See `requirements.txt` for a full list of dependencies.

## Installation

### Option 1: Local Installation

1.  **Clone or download the project files.**
2.  **Install dependencies:**
    ```bash
    pip install -r requirements.txt
    ```

### Option 2: Docker Installation (Recommended)

1.  **Using Docker Compose (easiest):**
    ```bash
    docker-compose up --build
    ```
2.  **Access the application:**
    ```
    http://localhost:8501
    ```

## Usage

1.  **Start the Streamlit application:**
    ```bash
    streamlit run app.py
    ```
2.  **Open your browser** and navigate to `http://localhost:8501`.
3.  **Follow the steps in the sidebar:**
    - **1. Upload CSV File**: Click "Browse files" to upload your dataset.
    - **2. Map Data Columns**: Once uploaded, select which of your columns correspond to `User ID`, `Post Content`, and `Timestamp`.
    - **3. Configure Analysis**:
        - **Language Model**: Choose `english` for English-only data or `multilingual` for other languages.
        - **Number of Topics**: Enter a specific number of topics to find, or use `-1` to let the model decide automatically.
        - **Custom Stopwords**: (Optional) Enter comma-separated words to exclude from analysis.
    - **4. Run Analysis**: Click the "🚀 Analyze Topics" button.

4.  **Explore the results** in the five interactive tabs in the main panel.
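
The settings from step 3 could be translated into model arguments roughly as shown in this sketch. It is not the application's code: `parse_stopwords` and `build_settings` are hypothetical helpers, and treating `-1` as `nr_topics=None` (no fixed topic count) is an assumption about how the app interprets that value. `language` and `nr_topics` are parameters of the `BERTopic` constructor.

```python
def parse_stopwords(raw):
    """Split a comma-separated string into a list of lowercase stopwords."""
    return [w.strip().lower() for w in raw.split(",") if w.strip()]

def build_settings(language, num_topics, raw_stopwords=""):
    """Translate sidebar inputs into keyword arguments (hypothetical helper)."""
    if language not in ("english", "multilingual"):
        raise ValueError("language must be 'english' or 'multilingual'")
    return {
        "language": language,
        # Assumption: -1 means "no fixed number", passed to BERTopic as None.
        "nr_topics": None if num_topics == -1 else num_topics,
        "stopwords": parse_stopwords(raw_stopwords),
    }

settings = build_settings("multilingual", -1, "rt, via , amp")
print(settings)
# The language and nr_topics values could then be passed on, for example:
#   BERTopic(language=settings["language"], nr_topics=settings["nr_topics"])
```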

### Using the Interface

The application provides five main tabs:

#### 📋 Overview
- Key metrics, dataset preview, and average Gini coefficient.

#### 🎯 Topics
- Topic information table and topic distribution bar chart.

#### 📊 Gini Analysis
- Analysis of topic diversity for each user and user concentration for each topic.

#### 📈 Topic Evolution
- Timelines showing how topic popularity changes over time, for all users and for individual users.
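
The data behind such a timeline can be thought of as post counts per topic per time bucket. The sketch below shows one way to compute them; the monthly bucket size, the sample data, and the `topic_counts_by_month` helper are assumptions for illustration and may differ from what the app does.

```python
from collections import Counter
from datetime import datetime

# Hypothetical (timestamp, topic_id) pairs, available once topics are assigned.
assignments = [
    (datetime(2023, 1, 15, 14, 30), 0),
    (datetime(2023, 1, 20, 9, 0), 0),
    (datetime(2023, 2, 3, 18, 45), 1),
    (datetime(2023, 2, 10, 8, 15), 0),
]

def topic_counts_by_month(pairs):
    """Count posts per (year-month, topic) bucket."""
    return Counter((ts.strftime("%Y-%m"), topic) for ts, topic in pairs)

counts = topic_counts_by_month(assignments)
for (month, topic), n in sorted(counts.items()):
    print(month, topic, n)
```

Filtering `assignments` to a single user before counting would give the per-user version of the same timeline.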

#### 📄 Documents
- A detailed view of your original data with assigned topics and probabilities.

## Understanding the Results

### Gini Coefficient
- **Range**: 0 to 1
- **User Gini**: Measures how evenly a user's posts are spread across topics. **0** = perfectly diverse (posts spread evenly across topics), **1** = perfectly specialized (all posts on one topic).
- **Topic Gini**: Measures how concentrated a topic is among users. **0** = widely discussed by many users, **1** = dominated by a few users.
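
For reference, here is a minimal sketch of the standard Gini formula applied to post counts. The `gini` function and the sample counts are illustrative assumptions; the application may compute or normalize the coefficient differently.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 = even, near 1 = concentrated)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula over ascending values x_1..x_n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

# Posts per topic for two hypothetical users across four topics.
diverse_user = [5, 5, 5, 5]
focused_user = [20, 0, 0, 0]
print(gini(diverse_user))
print(gini(focused_user))
```

With this unnormalized form, a user who posts on only one of `n` topics scores `(n - 1) / n`, so the value approaches 1 only as the number of topics grows. The same function gives a Topic Gini when `counts` holds posts per user for a single topic.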

---

**Built with ❤️ using Streamlit and BERTopic**