Spaces:
Running
Running
| # Social Media Topic Modeling System | |
| A comprehensive topic modeling system for social media analysis built with Streamlit and BERTopic. This application supports flexible CSV column mapping, multilingual topic modeling, Gini coefficient calculation, and topic evolution analysis. | |
| ## Features | |
| - **π Topic Modeling**: Uses BERTopic for state-of-the-art topic modeling. | |
| - **βοΈ Flexible Configuration**: | |
| - **Custom Column Mapping**: Use any CSV file by mapping your columns to `user_id`, `post_content`, and `timestamp`. | |
| - **Topic Number Control**: Let the model find topics automatically or specify the exact number you need. | |
| - **π Multilingual Support**: Handles English and 50+ other languages. | |
| - **π Gini Coefficient Analysis**: Calculates topic distribution inequality per user and per topic. | |
| - **β° Topic Evolution**: Tracks how topics change over time. | |
| - **π― Interactive Visualizations**: Built-in charts and data tables using Plotly. | |
| - **π± Responsive Interface**: Clean, modern Streamlit interface with a control sidebar. | |
| ## Requirements | |
| ### CSV File Format | |
| Your CSV file must contain columns that can be mapped to the following roles: | |
| - **User ID**: A column with unique identifiers for each user (string). | |
| - **Post Content**: A column with the text content of the social media post (string). | |
| - **Timestamp**: A column with the date and time of the post (e.g., "2023-01-15 14:30:00"). | |
| The application will prompt you to select the correct column for each role after you upload your file. | |
| ### Dependencies | |
| See `requirements.txt` for a full list of dependencies. | |
| ## Installation | |
| ### Option 1: Local Installation | |
| 1. **Clone or download the project files.** | |
| 2. **Install dependencies:** | |
| ```bash | |
| pip install -r requirements.txt | |
| ``` | |
| ### Option 2: Docker Installation (Recommended) | |
| 1. **Using Docker Compose (easiest):** | |
| ```bash | |
| docker-compose up --build | |
| ``` | |
| 2. **Access the application:** | |
| ``` | |
| http://localhost:8501 | |
| ``` | |
| ## Usage | |
| 1. **Start the Streamlit application:** | |
| ```bash | |
| streamlit run app.py | |
| ``` | |
| 2. **Open your browser** and navigate to `http://localhost:8501`. | |
| 3. **Follow the steps in the sidebar:** | |
| - **1. Upload CSV File**: Click "Browse files" to upload your dataset. | |
| - **2. Map Data Columns**: Once uploaded, select which of your columns correspond to `User ID`, `Post Content`, and `Timestamp`. | |
| - **3. Configure Analysis**: | |
| - **Language Model**: Choose `english` for English-only data or `multilingual` for other languages. | |
| - **Number of Topics**: Enter a specific number of topics to find, or use `-1` to let the model decide automatically. | |
| - **Custom Stopwords**: (Optional) Enter comma-separated words to exclude from analysis. | |
| - **4. Run Analysis**: Click the "π Analyze Topics" button. | |
| 4. **Explore the results** in the five interactive tabs in the main panel. | |
| ### Using the Interface | |
| The application provides five main tabs: | |
| #### π Overview | |
| - Key metrics, dataset preview, and average Gini coefficient. | |
| #### π― Topics | |
| - Topic information table and topic distribution bar chart. | |
| #### π Gini Analysis | |
| - Analysis of topic diversity for each user and user concentration for each topic. | |
| #### π Topic Evolution | |
| - Timelines showing how topic popularity changes over time, for all users and for individual users. | |
| #### π Documents | |
| - A detailed view of your original data with assigned topics and probabilities. | |
| ## Understanding the Results | |
| ### Gini Coefficient | |
| - **Range**: 0 to 1 | |
| - **User Gini**: Measures how diverse a user's topics are. **0** = perfectly diverse (posts on many topics), **1** = perfectly specialized (posts on one topic). | |
| - **Topic Gini**: Measures how concentrated a topic is among users. **0** = widely discussed by many users, **1** = dominated by a few users. | |
| --- | |
| **Built with β€οΈ using Streamlit and BERTopic** |