Spaces:
Sleeping
Sleeping
Upload 17 files
Browse files- Deployment Guide.md +267 -0
- Dockerfile +53 -9
- Social Media Topic Modeling System.md +99 -0
- TopicModelingApp.spec +142 -0
- app.py +661 -0
- docker-compose.yml +25 -0
- gini_calculator.py +107 -0
- narrative_similarity.py +102 -0
- readme.md +138 -0
- requirements.txt +19 -3
- resource_path.py +12 -0
- run.py +42 -0
- sample_data.csv +27 -0
- start-streamlit.sh +12 -0
- text_preprocessor.py +131 -0
- topic_evolution.py +100 -0
- topic_modeling.py +88 -0
Deployment Guide.md
ADDED
|
@@ -0,0 +1,267 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Deployment Guide
|
| 2 |
+
|
| 3 |
+
This guide covers various deployment options for the Social Media Topic Modeling System.
|
| 4 |
+
|
| 5 |
+
## Local Development
|
| 6 |
+
|
| 7 |
+
### Quick Start
|
| 8 |
+
```bash
|
| 9 |
+
# Install dependencies
|
| 10 |
+
pip install -r requirements.txt
|
| 11 |
+
|
| 12 |
+
# Run the application
|
| 13 |
+
streamlit run streamlit_app.py
|
| 14 |
+
```
|
| 15 |
+
|
| 16 |
+
### Development with Docker
|
| 17 |
+
```bash
|
| 18 |
+
# Build and run with Docker Compose
|
| 19 |
+
docker-compose up --build
|
| 20 |
+
|
| 21 |
+
# Or build and run manually
|
| 22 |
+
docker build -t topic-modeling-app .
|
| 23 |
+
docker run -p 8501:8501 topic-modeling-app
|
| 24 |
+
```
|
| 25 |
+
|
| 26 |
+
## Production Deployment
|
| 27 |
+
|
| 28 |
+
### Docker Production Setup
|
| 29 |
+
|
| 30 |
+
1. **Build the production image:**
|
| 31 |
+
```bash
|
| 32 |
+
docker build -t topic-modeling-app:latest .
|
| 33 |
+
```
|
| 34 |
+
|
| 35 |
+
2. **Run with production settings:**
|
| 36 |
+
```bash
|
| 37 |
+
docker run -d \
|
| 38 |
+
--name topic-modeling-prod \
|
| 39 |
+
-p 8501:8501 \
|
| 40 |
+
--memory=4g \
|
| 41 |
+
--cpus=2 \
|
| 42 |
+
--restart=unless-stopped \
|
| 43 |
+
topic-modeling-app:latest
|
| 44 |
+
```
|
| 45 |
+
|
| 46 |
+
3. **Using Docker Compose for production:**
|
| 47 |
+
```yaml
|
| 48 |
+
version: '3.8'
|
| 49 |
+
services:
|
| 50 |
+
topic-modeling-app:
|
| 51 |
+
build: .
|
| 52 |
+
ports:
|
| 53 |
+
- "8501:8501"
|
| 54 |
+
environment:
|
| 55 |
+
- STREAMLIT_SERVER_PORT=8501
|
| 56 |
+
- STREAMLIT_SERVER_ADDRESS=0.0.0.0
|
| 57 |
+
volumes:
|
| 58 |
+
- ./data:/app/data
|
| 59 |
+
restart: unless-stopped
|
| 60 |
+
deploy:
|
| 61 |
+
resources:
|
| 62 |
+
limits:
|
| 63 |
+
memory: 4G
|
| 64 |
+
cpus: '2'
|
| 65 |
+
healthcheck:
|
| 66 |
+
test: ["CMD", "curl", "-f", "http://localhost:8501/_stcore/health"]
|
| 67 |
+
interval: 30s
|
| 68 |
+
timeout: 10s
|
| 69 |
+
retries: 3
|
| 70 |
+
```
|
| 71 |
+
|
| 72 |
+
### Cloud Deployment Options
|
| 73 |
+
|
| 74 |
+
#### 1. AWS ECS/Fargate
|
| 75 |
+
```bash
|
| 76 |
+
# Tag for ECR
|
| 77 |
+
docker tag topic-modeling-app:latest your-account.dkr.ecr.region.amazonaws.com/topic-modeling-app:latest
|
| 78 |
+
|
| 79 |
+
# Push to ECR
|
| 80 |
+
docker push your-account.dkr.ecr.region.amazonaws.com/topic-modeling-app:latest
|
| 81 |
+
```
|
| 82 |
+
|
| 83 |
+
#### 2. Google Cloud Run
|
| 84 |
+
```bash
|
| 85 |
+
# Build and deploy to Cloud Run
|
| 86 |
+
gcloud run deploy topic-modeling-app \
|
| 87 |
+
--image gcr.io/your-project/topic-modeling-app \
|
| 88 |
+
--platform managed \
|
| 89 |
+
--region us-central1 \
|
| 90 |
+
--memory 4Gi \
|
| 91 |
+
--cpu 2
|
| 92 |
+
```
|
| 93 |
+
|
| 94 |
+
#### 3. Azure Container Instances
|
| 95 |
+
```bash
|
| 96 |
+
# Deploy to Azure
|
| 97 |
+
az container create \
|
| 98 |
+
--resource-group myResourceGroup \
|
| 99 |
+
--name topic-modeling-app \
|
| 100 |
+
--image your-registry.azurecr.io/topic-modeling-app:latest \
|
| 101 |
+
--cpu 2 \
|
| 102 |
+
--memory 4 \
|
| 103 |
+
--ports 8501
|
| 104 |
+
```
|
| 105 |
+
|
| 106 |
+
#### 4. Heroku
|
| 107 |
+
```bash
|
| 108 |
+
# Login to Heroku Container Registry
|
| 109 |
+
heroku container:login
|
| 110 |
+
|
| 111 |
+
# Build and push
|
| 112 |
+
heroku container:push web --app your-app-name
|
| 113 |
+
|
| 114 |
+
# Release
|
| 115 |
+
heroku container:release web --app your-app-name
|
| 116 |
+
```
|
| 117 |
+
|
| 118 |
+
### Kubernetes Deployment
|
| 119 |
+
|
| 120 |
+
#### Deployment YAML
|
| 121 |
+
```yaml
|
| 122 |
+
apiVersion: apps/v1
|
| 123 |
+
kind: Deployment
|
| 124 |
+
metadata:
|
| 125 |
+
name: topic-modeling-app
|
| 126 |
+
spec:
|
| 127 |
+
replicas: 3
|
| 128 |
+
selector:
|
| 129 |
+
matchLabels:
|
| 130 |
+
app: topic-modeling-app
|
| 131 |
+
template:
|
| 132 |
+
metadata:
|
| 133 |
+
labels:
|
| 134 |
+
app: topic-modeling-app
|
| 135 |
+
spec:
|
| 136 |
+
containers:
|
| 137 |
+
- name: topic-modeling-app
|
| 138 |
+
image: topic-modeling-app:latest
|
| 139 |
+
ports:
|
| 140 |
+
- containerPort: 8501
|
| 141 |
+
resources:
|
| 142 |
+
requests:
|
| 143 |
+
memory: "2Gi"
|
| 144 |
+
cpu: "1"
|
| 145 |
+
limits:
|
| 146 |
+
memory: "4Gi"
|
| 147 |
+
cpu: "2"
|
| 148 |
+
env:
|
| 149 |
+
- name: STREAMLIT_SERVER_PORT
|
| 150 |
+
value: "8501"
|
| 151 |
+
- name: STREAMLIT_SERVER_ADDRESS
|
| 152 |
+
value: "0.0.0.0"
|
| 153 |
+
---
|
| 154 |
+
apiVersion: v1
|
| 155 |
+
kind: Service
|
| 156 |
+
metadata:
|
| 157 |
+
name: topic-modeling-service
|
| 158 |
+
spec:
|
| 159 |
+
selector:
|
| 160 |
+
app: topic-modeling-app
|
| 161 |
+
ports:
|
| 162 |
+
- port: 80
|
| 163 |
+
targetPort: 8501
|
| 164 |
+
type: LoadBalancer
|
| 165 |
+
```
|
| 166 |
+
|
| 167 |
+
## Performance Optimization
|
| 168 |
+
|
| 169 |
+
### Memory Management
|
| 170 |
+
- **Minimum RAM**: 4GB for small datasets (< 1000 documents)
|
| 171 |
+
- **Recommended RAM**: 8GB+ for larger datasets
|
| 172 |
+
- **Large datasets**: Consider processing in batches
|
| 173 |
+
|
| 174 |
+
### CPU Optimization
|
| 175 |
+
- **Minimum**: 2 CPU cores
|
| 176 |
+
- **Recommended**: 4+ CPU cores for faster processing
|
| 177 |
+
- **GPU**: Optional, can speed up transformer models
|
| 178 |
+
|
| 179 |
+
### Storage Considerations
|
| 180 |
+
- **Docker image**: ~2GB
|
| 181 |
+
- **Temporary files**: Varies with dataset size
|
| 182 |
+
- **Persistent storage**: Optional for saving results
|
| 183 |
+
|
| 184 |
+
## Monitoring and Logging
|
| 185 |
+
|
| 186 |
+
### Health Checks
|
| 187 |
+
The application includes built-in health checks:
|
| 188 |
+
```bash
|
| 189 |
+
# Check application health
|
| 190 |
+
curl http://localhost:8501/_stcore/health
|
| 191 |
+
```
|
| 192 |
+
|
| 193 |
+
### Logging
|
| 194 |
+
Streamlit logs are available through Docker:
|
| 195 |
+
```bash
|
| 196 |
+
# View logs
|
| 197 |
+
docker logs topic-modeling-app
|
| 198 |
+
|
| 199 |
+
# Follow logs
|
| 200 |
+
docker logs -f topic-modeling-app
|
| 201 |
+
```
|
| 202 |
+
|
| 203 |
+
### Monitoring with Prometheus
|
| 204 |
+
Add monitoring endpoints for production:
|
| 205 |
+
```python
|
| 206 |
+
# Add to streamlit_app.py for monitoring
|
| 207 |
+
import time
|
| 208 |
+
import psutil
|
| 209 |
+
|
| 210 |
+
# Add metrics endpoint
|
| 211 |
+
@st.cache_data
|
| 212 |
+
def get_system_metrics():
|
| 213 |
+
return {
|
| 214 |
+
'cpu_percent': psutil.cpu_percent(),
|
| 215 |
+
'memory_percent': psutil.virtual_memory().percent,
|
| 216 |
+
'timestamp': time.time()
|
| 217 |
+
}
|
| 218 |
+
```
|
| 219 |
+
|
| 220 |
+
## Security Considerations
|
| 221 |
+
|
| 222 |
+
### Container Security
|
| 223 |
+
- Run as non-root user (included in Dockerfile)
|
| 224 |
+
- Use minimal base images
|
| 225 |
+
- Regularly update dependencies
|
| 226 |
+
|
| 227 |
+
### Network Security
|
| 228 |
+
- Use HTTPS in production
|
| 229 |
+
- Implement proper firewall rules
|
| 230 |
+
- Consider VPN for internal access
|
| 231 |
+
|
| 232 |
+
### Data Security
|
| 233 |
+
- Encrypt data at rest and in transit
|
| 234 |
+
- Implement proper access controls
|
| 235 |
+
- Regular security audits
|
| 236 |
+
|
| 237 |
+
## Troubleshooting
|
| 238 |
+
|
| 239 |
+
### Common Issues
|
| 240 |
+
|
| 241 |
+
1. **Out of Memory Errors**
|
| 242 |
+
- Increase container memory limits
|
| 243 |
+
- Process smaller datasets
|
| 244 |
+
- Use batch processing
|
| 245 |
+
|
| 246 |
+
2. **Slow Performance**
|
| 247 |
+
- Increase CPU allocation
|
| 248 |
+
- Use SSD storage
|
| 249 |
+
- Optimize dataset size
|
| 250 |
+
|
| 251 |
+
3. **Container Won't Start**
|
| 252 |
+
- Check logs: `docker logs container-name`
|
| 253 |
+
- Verify port availability
|
| 254 |
+
- Check resource limits
|
| 255 |
+
|
| 256 |
+
4. **Model Loading Issues**
|
| 257 |
+
- Ensure internet connectivity for model downloads
|
| 258 |
+
- Pre-download models in Docker build
|
| 259 |
+
- Check disk space
|
| 260 |
+
|
| 261 |
+
### Support
|
| 262 |
+
For deployment issues:
|
| 263 |
+
1. Check the logs first
|
| 264 |
+
2. Verify system requirements
|
| 265 |
+
3. Test with sample data
|
| 266 |
+
4. Check network connectivity
|
| 267 |
+
|
Dockerfile
CHANGED
|
@@ -1,20 +1,64 @@
|
|
| 1 |
-
|
|
|
|
| 2 |
|
|
|
|
| 3 |
WORKDIR /app
|
| 4 |
|
| 5 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
build-essential \
|
| 7 |
curl \
|
| 8 |
git \
|
| 9 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 10 |
|
| 11 |
-
|
| 12 |
-
COPY
|
| 13 |
|
| 14 |
-
|
|
|
|
|
|
|
| 15 |
|
| 16 |
-
|
|
|
|
| 17 |
|
| 18 |
-
|
|
|
|
| 19 |
|
| 20 |
-
|
|
|
|
|
|
| 1 |
+
# Use Python base image (avoid slim on HF)
|
| 2 |
+
FROM python:3.11
|
| 3 |
|
| 4 |
+
# Set working directory
|
| 5 |
WORKDIR /app
|
| 6 |
|
| 7 |
+
# Environment variables (use port 7860 for HF Spaces)
|
| 8 |
+
ENV PYTHONDONTWRITEBYTECODE=1 \
|
| 9 |
+
PYTHONUNBUFFERED=1 \
|
| 10 |
+
STREAMLIT_SERVER_PORT=7860 \
|
| 11 |
+
STREAMLIT_SERVER_ADDRESS=0.0.0.0 \
|
| 12 |
+
STREAMLIT_BROWSER_GATHER_USAGE_STATS=false \
|
| 13 |
+
STREAMLIT_SERVER_HEADLESS=true
|
| 14 |
+
|
| 15 |
+
# Install system dependencies (HF-safe)
|
| 16 |
+
RUN apt-get update --fix-missing && \
|
| 17 |
+
apt-get install -y --no-install-recommends \
|
| 18 |
build-essential \
|
| 19 |
curl \
|
| 20 |
git \
|
| 21 |
+
fontconfig \
|
| 22 |
+
fonts-dejavu-core && \
|
| 23 |
+
fc-cache -f && \
|
| 24 |
+
rm -rf /var/lib/apt/lists/*
|
| 25 |
+
|
| 26 |
+
# Copy requirements first (better cache)
|
| 27 |
+
COPY requirements.txt .
|
| 28 |
+
|
| 29 |
+
# Install Python dependencies
|
| 30 |
+
RUN pip install --no-cache-dir --upgrade pip && \
|
| 31 |
+
pip install --no-cache-dir -r requirements.txt
|
| 32 |
+
|
| 33 |
+
# Download spaCy models (required for text preprocessing)
|
| 34 |
+
RUN python -m spacy download en_core_web_sm && \
|
| 35 |
+
python -m spacy download xx_ent_wiki_sm
|
| 36 |
+
|
| 37 |
+
# Download NLTK data (required for coherence calculation)
|
| 38 |
+
RUN python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"
|
| 39 |
+
|
| 40 |
+
# Copy application files
|
| 41 |
+
COPY app.py .
|
| 42 |
+
COPY topic_modeling.py .
|
| 43 |
+
COPY text_preprocessor.py .
|
| 44 |
+
COPY gini_calculator.py .
|
| 45 |
+
COPY topic_evolution.py .
|
| 46 |
+
COPY narrative_similarity.py .
|
| 47 |
+
COPY resource_path.py .
|
| 48 |
+
COPY sample_data.csv .
|
| 49 |
|
| 50 |
+
# Copy Streamlit config (fixes 403 upload error)
|
| 51 |
+
COPY .streamlit/config.toml .streamlit/config.toml
|
| 52 |
|
| 53 |
+
# Create non-root user (HF compatible)
|
| 54 |
+
RUN useradd -m appuser
|
| 55 |
+
USER appuser
|
| 56 |
|
| 57 |
+
# Expose Streamlit port (7860 for HF Spaces)
|
| 58 |
+
EXPOSE 7860
|
| 59 |
|
| 60 |
+
# Health check
|
| 61 |
+
HEALTHCHECK CMD curl --fail http://localhost:7860/_stcore/health || exit 1
|
| 62 |
|
| 63 |
+
# Run Streamlit
|
| 64 |
+
CMD ["streamlit", "run", "app.py", "--server.port=7860", "--server.address=0.0.0.0"]
|
Social Media Topic Modeling System.md
ADDED
|
@@ -0,0 +1,99 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Social Media Topic Modeling System
|
| 2 |
+
|
| 3 |
+
A comprehensive topic modeling system for social media analysis built with Streamlit and BERTopic. This application supports flexible CSV column mapping, multilingual topic modeling, Gini coefficient calculation, and topic evolution analysis.
|
| 4 |
+
|
| 5 |
+
## Features
|
| 6 |
+
|
| 7 |
+
- **📊 Topic Modeling**: Uses BERTopic for state-of-the-art topic modeling.
|
| 8 |
+
- **⚙️ Flexible Configuration**:
|
| 9 |
+
- **Custom Column Mapping**: Use any CSV file by mapping your columns to `user_id`, `post_content`, and `timestamp`.
|
| 10 |
+
- **Topic Number Control**: Let the model find topics automatically or specify the exact number you need.
|
| 11 |
+
- **🌍 Multilingual Support**: Handles English and 50+ other languages.
|
| 12 |
+
- **📈 Gini Coefficient Analysis**: Calculates topic distribution inequality per user and per topic.
|
| 13 |
+
- **⏰ Topic Evolution**: Tracks how topics change over time.
|
| 14 |
+
- **🎯 Interactive Visualizations**: Built-in charts and data tables using Plotly.
|
| 15 |
+
- **📱 Responsive Interface**: Clean, modern Streamlit interface with a control sidebar.
|
| 16 |
+
|
| 17 |
+
## Requirements
|
| 18 |
+
|
| 19 |
+
### CSV File Format
|
| 20 |
+
|
| 21 |
+
Your CSV file must contain columns that can be mapped to the following roles:
|
| 22 |
+
- **User ID**: A column with unique identifiers for each user (string).
|
| 23 |
+
- **Post Content**: A column with the text content of the social media post (string).
|
| 24 |
+
- **Timestamp**: A column with the date and time of the post (e.g., "2023-01-15 14:30:00").
|
| 25 |
+
|
| 26 |
+
The application will prompt you to select the correct column for each role after you upload your file.
|
| 27 |
+
|
| 28 |
+
### Dependencies
|
| 29 |
+
|
| 30 |
+
See `requirements.txt` for a full list of dependencies.
|
| 31 |
+
|
| 32 |
+
## Installation
|
| 33 |
+
|
| 34 |
+
### Option 1: Local Installation
|
| 35 |
+
|
| 36 |
+
1. **Clone or download the project files.**
|
| 37 |
+
2. **Install dependencies:**
|
| 38 |
+
```bash
|
| 39 |
+
pip install -r requirements.txt
|
| 40 |
+
```
|
| 41 |
+
|
| 42 |
+
### Option 2: Docker Installation (Recommended)
|
| 43 |
+
|
| 44 |
+
1. **Using Docker Compose (easiest):**
|
| 45 |
+
```bash
|
| 46 |
+
docker-compose up --build
|
| 47 |
+
```
|
| 48 |
+
2. **Access the application:**
|
| 49 |
+
```
|
| 50 |
+
http://localhost:8501
|
| 51 |
+
```
|
| 52 |
+
|
| 53 |
+
## Usage
|
| 54 |
+
|
| 55 |
+
1. **Start the Streamlit application:**
|
| 56 |
+
```bash
|
| 57 |
+
streamlit run app.py
|
| 58 |
+
```
|
| 59 |
+
2. **Open your browser** and navigate to `http://localhost:8501`.
|
| 60 |
+
3. **Follow the steps in the sidebar:**
|
| 61 |
+
- **1. Upload CSV File**: Click "Browse files" to upload your dataset.
|
| 62 |
+
- **2. Map Data Columns**: Once uploaded, select which of your columns correspond to `User ID`, `Post Content`, and `Timestamp`.
|
| 63 |
+
- **3. Configure Analysis**:
|
| 64 |
+
- **Language Model**: Choose `english` for English-only data or `multilingual` for other languages.
|
| 65 |
+
- **Number of Topics**: Enter a specific number of topics to find, or use `-1` to let the model decide automatically.
|
| 66 |
+
- **Custom Stopwords**: (Optional) Enter comma-separated words to exclude from analysis.
|
| 67 |
+
- **4. Run Analysis**: Click the "🚀 Analyze Topics" button.
|
| 68 |
+
|
| 69 |
+
4. **Explore the results** in the five interactive tabs in the main panel.
|
| 70 |
+
|
| 71 |
+
### Using the Interface
|
| 72 |
+
|
| 73 |
+
The application provides five main tabs:
|
| 74 |
+
|
| 75 |
+
#### 📋 Overview
|
| 76 |
+
- Key metrics, dataset preview, and average Gini coefficient.
|
| 77 |
+
|
| 78 |
+
#### 🎯 Topics
|
| 79 |
+
- Topic information table and topic distribution bar chart.
|
| 80 |
+
|
| 81 |
+
#### 📊 Gini Analysis
|
| 82 |
+
- Analysis of topic diversity for each user and user concentration for each topic.
|
| 83 |
+
|
| 84 |
+
#### 📈 Topic Evolution
|
| 85 |
+
- Timelines showing how topic popularity changes over time, for all users and for individual users.
|
| 86 |
+
|
| 87 |
+
#### 📄 Documents
|
| 88 |
+
- A detailed view of your original data with assigned topics and probabilities.
|
| 89 |
+
|
| 90 |
+
## Understanding the Results
|
| 91 |
+
|
| 92 |
+
### Gini Coefficient
|
| 93 |
+
- **Range**: 0 to 1
|
| 94 |
+
- **User Gini**: Measures how diverse a user's topics are. **0** = perfectly diverse (posts on many topics), **1** = perfectly specialized (posts on one topic).
|
| 95 |
+
- **Topic Gini**: Measures how concentrated a topic is among users. **0** = widely discussed by many users, **1** = dominated by a few users.
|
| 96 |
+
|
| 97 |
+
---
|
| 98 |
+
|
| 99 |
+
**Built with ❤️ using Streamlit and BERTopic**
|
TopicModelingApp.spec
ADDED
|
@@ -0,0 +1,142 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# topicmodelingapp.spec
|
| 2 |
+
|
| 3 |
+
import sys
|
| 4 |
+
import os
|
| 5 |
+
import site
|
| 6 |
+
from pathlib import Path
|
| 7 |
+
|
| 8 |
+
from PyInstaller.utils.hooks import collect_all
|
| 9 |
+
from PyInstaller.building.datastruct import Tree
|
| 10 |
+
|
| 11 |
+
# Add the script's directory to the path for local imports
|
| 12 |
+
sys.path.append(os.path.abspath(os.path.dirname(sys.argv[0])))
|
| 13 |
+
|
| 14 |
+
# --- Dynamic Path Logic (Makes the SPEC file generic) ---
|
| 15 |
+
def get_site_packages_path():
|
| 16 |
+
"""Tries to find the site-packages directory of the current environment."""
|
| 17 |
+
try:
|
| 18 |
+
# Tries the standard site.getsitepackages method
|
| 19 |
+
return Path(site.getsitepackages()[0])
|
| 20 |
+
except Exception:
|
| 21 |
+
# Fallback for complex environments like Conda
|
| 22 |
+
return Path(sys.prefix) / 'lib' / f'python{sys.version_info.major}.{sys.version_info.minor}' / 'site-packages'
|
| 23 |
+
|
| 24 |
+
SP_PATH_STR = str(get_site_packages_path()) + os.sep
|
| 25 |
+
|
| 26 |
+
def get_model_path(model_name):
|
| 27 |
+
"""Gets the absolute path to an installed spaCy model."""
|
| 28 |
+
spacy_path = get_site_packages_path()
|
| 29 |
+
model_dir = spacy_path / model_name
|
| 30 |
+
|
| 31 |
+
if not model_dir.exists():
|
| 32 |
+
raise FileNotFoundError(
|
| 33 |
+
f"spaCy model '{model_name}' not found at expected location: {model_dir}"
|
| 34 |
+
)
|
| 35 |
+
return str(model_dir)
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
# --- Core Dependency Collection (C-Extension Fix) ---
|
| 39 |
+
|
| 40 |
+
# Use collect_all. The output is a tuple: (datas [0], binaries [1], hiddenimports [2], excludes [3], pathex [4])
|
| 41 |
+
spacy_data = collect_all('spacy')
|
| 42 |
+
numpy_data = collect_all('numpy')
|
| 43 |
+
sklearn_data = collect_all('sklearn')
|
| 44 |
+
hdbscan_data = collect_all('hdbscan')
|
| 45 |
+
scipy_data = collect_all('scipy')
|
| 46 |
+
|
| 47 |
+
# 1. Consolidate ALL hidden imports (index 2 - module names/strings)
|
| 48 |
+
all_collected_imports = []
|
| 49 |
+
all_collected_imports.extend(spacy_data[2])
|
| 50 |
+
all_collected_imports.extend(numpy_data[2])
|
| 51 |
+
all_collected_imports.extend(sklearn_data[2])
|
| 52 |
+
all_collected_imports.extend(hdbscan_data[2])
|
| 53 |
+
all_collected_imports.extend(scipy_data[2])
|
| 54 |
+
|
| 55 |
+
# 2. Consolidate all collected data (index 0 - tuples)
|
| 56 |
+
all_collected_datas = []
|
| 57 |
+
all_collected_datas.extend(spacy_data[0])
|
| 58 |
+
all_collected_datas.extend(numpy_data[0])
|
| 59 |
+
all_collected_datas.extend(sklearn_data[0])
|
| 60 |
+
all_collected_datas.extend(hdbscan_data[0])
|
| 61 |
+
all_collected_datas.extend(scipy_data[0])
|
| 62 |
+
|
| 63 |
+
# 3. Consolidate all collected binaries (index 1 - tuples of C-extensions/dylibs)
|
| 64 |
+
all_collected_binaries = []
|
| 65 |
+
all_collected_binaries.extend(spacy_data[1])
|
| 66 |
+
all_collected_binaries.extend(numpy_data[1])
|
| 67 |
+
all_collected_binaries.extend(sklearn_data[1])
|
| 68 |
+
all_collected_binaries.extend(hdbscan_data[1])
|
| 69 |
+
all_collected_binaries.extend(scipy_data[1])
|
| 70 |
+
|
| 71 |
+
|
| 72 |
+
# --- Analysis Setup ---
|
| 73 |
+
|
| 74 |
+
a = Analysis(
|
| 75 |
+
# 1. Explicitly list all your source files
|
| 76 |
+
['run.py', 'app.py', 'text_preprocessor.py', 'topic_modeling.py', 'gini_calculator.py', 'narrative_similarity.py', 'resource_path.py', 'topic_evolution.py'],
|
| 77 |
+
pathex=['.'],
|
| 78 |
+
|
| 79 |
+
# *** CRITICAL FIX: Use the collected binaries list for C extensions/dylibs ***
|
| 80 |
+
binaries=all_collected_binaries,
|
| 81 |
+
|
| 82 |
+
# 2. The final datas list: collected tuples + manual tuples
|
| 83 |
+
datas=all_collected_datas + [
|
| 84 |
+
# Streamlit metadata (Dynamic path and wildcard)
|
| 85 |
+
(SP_PATH_STR + 'streamlit*.dist-info', 'streamlit_metadata'),
|
| 86 |
+
(SP_PATH_STR + 'streamlit/static', 'streamlit/static'),
|
| 87 |
+
|
| 88 |
+
# Application resources
|
| 89 |
+
(os.path.abspath('app.py'), '.'),
|
| 90 |
+
('readme.md', '.'),
|
| 91 |
+
('requirements.txt', '.'),
|
| 92 |
+
|
| 93 |
+
],
|
| 94 |
+
|
| 95 |
+
# 3. The final hiddenimports list: collected strings + manual strings
|
| 96 |
+
hiddenimports=all_collected_imports + [
|
| 97 |
+
'charset_normalizer',
|
| 98 |
+
'streamlit.runtime.scriptrunner.magic_funcs',
|
| 99 |
+
'spacy.parts_of_speech',
|
| 100 |
+
'scipy.spatial.ckdtree',
|
| 101 |
+
'thinc.extra.wrappers',
|
| 102 |
+
'streamlit.web.cli',
|
| 103 |
+
],
|
| 104 |
+
hookspath=[],
|
| 105 |
+
hooksconfig={},
|
| 106 |
+
runtime_hooks=[],
|
| 107 |
+
# Add all collected excludes to the main excludes list
|
| 108 |
+
excludes=['tkinter', 'matplotlib.pyplot'] + spacy_data[3] + numpy_data[3] + sklearn_data[3] + hdbscan_data[3] + scipy_data[3],
|
| 109 |
+
noarchive=False,
|
| 110 |
+
optimize=0,
|
| 111 |
+
)
|
| 112 |
+
|
| 113 |
+
# 4. Explicitly include the actual spaCy model directories using Tree
|
| 114 |
+
a.datas.extend(
|
| 115 |
+
Tree(get_model_path('en_core_web_sm'), prefix='en_core_web_sm')
|
| 116 |
+
)
|
| 117 |
+
a.datas.extend(
|
| 118 |
+
Tree(get_model_path('xx_ent_wiki_sm'), prefix='xx_ent_wiki_sm')
|
| 119 |
+
)
|
| 120 |
+
|
| 121 |
+
pyz = PYZ(a.pure)
|
| 122 |
+
|
| 123 |
+
exe = EXE(
|
| 124 |
+
pyz,
|
| 125 |
+
a.scripts,
|
| 126 |
+
a.binaries,
|
| 127 |
+
a.datas,
|
| 128 |
+
[],
|
| 129 |
+
name='TopicModelingApp',
|
| 130 |
+
debug=False,
|
| 131 |
+
bootloader_ignore_signals=False,
|
| 132 |
+
strip=False,
|
| 133 |
+
upx=True,
|
| 134 |
+
upx_exclude=[],
|
| 135 |
+
runtime_tmpdir=None,
|
| 136 |
+
console=True,
|
| 137 |
+
disable_windowed_traceback=False,
|
| 138 |
+
argv_emulation=False,
|
| 139 |
+
target_arch=None,
|
| 140 |
+
codesign_identity=None,
|
| 141 |
+
entitlements_file=None,
|
| 142 |
+
)
|
app.py
ADDED
|
@@ -0,0 +1,661 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import streamlit as st
|
| 2 |
+
import pandas as pd
|
| 3 |
+
import numpy as np
|
| 4 |
+
import time
|
| 5 |
+
|
| 6 |
+
import plotly.express as px
|
| 7 |
+
from wordcloud import WordCloud
|
| 8 |
+
import matplotlib.pyplot as plt
|
| 9 |
+
|
| 10 |
+
# Import custom modules
|
| 11 |
+
from text_preprocessor import MultilingualPreprocessor
|
| 12 |
+
from topic_modeling import perform_topic_modeling
|
| 13 |
+
from gini_calculator import calculate_gini_per_user, calculate_gini_per_topic
|
| 14 |
+
from topic_evolution import analyze_general_topic_evolution
|
| 15 |
+
from narrative_similarity import calculate_narrative_similarity, calculate_text_similarity_tfidf
|
| 16 |
+
|
| 17 |
+
# --- Page Configuration ---
|
| 18 |
+
st.set_page_config(
|
| 19 |
+
page_title="Social Media Topic Modeling System",
|
| 20 |
+
page_icon="📊",
|
| 21 |
+
layout="wide",
|
| 22 |
+
)
|
| 23 |
+
|
| 24 |
+
# --- Custom CSS ---
|
| 25 |
+
st.markdown("""
|
| 26 |
+
<style>
|
| 27 |
+
.main-header { font-size: 2.5rem; color: #1f77b4; text-align: center; margin-bottom: 1rem; }
|
| 28 |
+
.sub-header { font-size: 1.75rem; color: #2c3e50; border-bottom: 2px solid #f0f2f6; padding-bottom: 0.3rem; margin-top: 2rem; margin-bottom: 1rem;}
|
| 29 |
+
</style>
|
| 30 |
+
""", unsafe_allow_html=True)
|
| 31 |
+
|
| 32 |
+
# --- Session State Initialization ---
|
| 33 |
+
if 'results' not in st.session_state:
|
| 34 |
+
st.session_state.results = None
|
| 35 |
+
if 'df_raw' not in st.session_state:
|
| 36 |
+
st.session_state.df_raw = None
|
| 37 |
+
if 'custom_stopwords_text' not in st.session_state:
|
| 38 |
+
st.session_state.custom_stopwords_text = ""
|
| 39 |
+
if "topics_info_for_sync" not in st.session_state:
|
| 40 |
+
st.session_state.topics_info_for_sync = []
|
| 41 |
+
|
| 42 |
+
|
| 43 |
+
# --- Helper Functions ---
|
| 44 |
+
@st.cache_data
|
| 45 |
+
def create_word_cloud(_topic_model, topic_id):
|
| 46 |
+
word_freq = _topic_model.get_topic(topic_id)
|
| 47 |
+
if not word_freq: return None
|
| 48 |
+
wc = WordCloud(width=800, height=400, background_color="white", colormap="viridis", max_words=50).generate_from_frequencies(dict(word_freq))
|
| 49 |
+
fig, ax = plt.subplots(figsize=(10, 5))
|
| 50 |
+
ax.imshow(wc, interpolation='bilinear')
|
| 51 |
+
ax.axis("off")
|
| 52 |
+
plt.close(fig)
|
| 53 |
+
return fig
|
| 54 |
+
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
def interpret_gini(gini_score):
    """Map a Gini impurity score to a human-readable engagement label.

    Thresholds are for Gini IMPURITY (higher = topics more evenly spread):
      >= 0.6  -> "Diverse Interests"
      >= 0.3  -> "Moderately Focused"
      < 0.3   -> "Highly Specialized"
    None or NaN -> "N/A".
    """
    # NaN/None guard. `x != x` is true only for NaN and works for ANY
    # float-like type (Python float, np.float64, np.float32, ...).
    # The previous `isinstance(gini_score, float) and np.isnan(...)` check
    # missed NaNs that are not float subclasses (e.g. np.float32), which
    # then compared False against every threshold and were mislabeled
    # "Highly Specialized".
    if gini_score is None or gini_score != gini_score:
        return "N/A"
    # Logic is FLIPPED relative to classic Gini coefficient because this
    # is Gini impurity.
    if gini_score >= 0.6:
        return "Diverse Interests"
    elif gini_score >= 0.3:
        return "Moderately Focused"
    else:
        return "Highly Specialized"
|
| 65 |
+
|
| 66 |
+
# --- START OF DEFINITIVE FIX: Centralized Callback Function ---
|
| 67 |
+
def sync_stopwords():
    """Single source of truth for the custom stopwords list.

    Called as the on_change callback of the stopwords text area and of
    every per-topic word multiselect; merges both sources back into
    st.session_state.custom_stopwords_text as a sorted, comma-separated
    string.
    """
    # 1. Collect the bare words chosen in every topic multiselect.
    #    Options are formatted as "word (0.123)", so the word itself is
    #    the first space-separated token.
    selected_from_lists = set()
    for topic_id in st.session_state.topics_info_for_sync:
        widget_key = f"multiselect_topic_{topic_id}"
        if widget_key in st.session_state:
            selected_from_lists.update(
                option.split(' ')[0] for option in st.session_state[widget_key]
            )

    # 2. Words typed directly into the text area (comma-separated).
    #    Strip BEFORE filtering: the old `if s` filter kept whitespace-only
    #    entries (e.g. "a, , b"), so "" entered the set and the rejoined
    #    string gained a spurious leading ", ".
    typed_stopwords = {
        token.strip()
        for token in st.session_state.custom_stopwords_text.split(',')
        if token.strip()
    }

    # 3. Merge both sources and write the result back to the master
    #    session-state variable (which is also the text area's key).
    combined_stopwords = typed_stopwords | selected_from_lists
    st.session_state.custom_stopwords_text = ", ".join(sorted(combined_stopwords))
|
| 86 |
+
|
| 87 |
+
|
| 88 |
+
# --- Main Page Layout ---
|
| 89 |
+
st.title("🌍 Multilingual Topic Modeling Dashboard")
|
| 90 |
+
st.markdown("Analyze textual data in multiple languages to discover topics and user trends.")
|
| 91 |
+
|
| 92 |
+
# Use a key to ensure the file uploader keeps its state, and update session_state directly
|
| 93 |
+
uploaded_file = st.file_uploader("Upload your CSV data", type="csv", key="csv_uploader")
|
| 94 |
+
|
| 95 |
+
# Check if a new file has been uploaded (or if it's the first time and a file exists)
|
| 96 |
+
if uploaded_file is not None and uploaded_file != st.session_state.get('last_uploaded_file', None):
|
| 97 |
+
try:
|
| 98 |
+
st.session_state.df_raw = pd.read_csv(uploaded_file)
|
| 99 |
+
st.session_state.results = None # Reset results if a new file is uploaded
|
| 100 |
+
st.session_state.custom_stopwords_text = ""
|
| 101 |
+
st.session_state.last_uploaded_file = uploaded_file # Store the uploaded file itself
|
| 102 |
+
st.success("CSV file loaded successfully!")
|
| 103 |
+
except Exception as e:
|
| 104 |
+
st.error(f"Could not read CSV file. Error: {e}")
|
| 105 |
+
st.session_state.df_raw = None
|
| 106 |
+
st.session_state.last_uploaded_file = None
|
| 107 |
+
|
| 108 |
+
if st.session_state.df_raw is not None:
|
| 109 |
+
df_raw = st.session_state.df_raw
|
| 110 |
+
col1, col2, col3 = st.columns(3)
|
| 111 |
+
|
| 112 |
+
with col1: user_id_col = st.selectbox("User ID Column", df_raw.columns, index=0, key="user_id_col")
|
| 113 |
+
with col2: post_content_col = st.selectbox("Post Content Column", df_raw.columns, index=min(1, len(df_raw.columns)-1), key="post_content_col")
|
| 114 |
+
with col3: timestamp_col = st.selectbox("Timestamp Column", df_raw.columns, index=min(2, len(df_raw.columns)-1), key="timestamp_col")
|
| 115 |
+
|
| 116 |
+
st.subheader("Topic Modeling Settings")
|
| 117 |
+
lang_col, topics_col = st.columns(2)
|
| 118 |
+
with lang_col: language = st.selectbox("Language Model", ["english", "multilingual"], key="language_model")
|
| 119 |
+
with topics_col: num_topics = st.number_input("Number of Topics", -1, help="Use -1 for automatic detection", key="num_topics")
|
| 120 |
+
|
| 121 |
+
with st.expander("Advanced: Text Cleaning & Preprocessing Options", expanded=False):
|
| 122 |
+
c1, c2 = st.columns(2)
|
| 123 |
+
with c1:
|
| 124 |
+
opts = {
|
| 125 |
+
'lowercase': st.checkbox("Convert to Lowercase", True, key="opt_lowercase"),
|
| 126 |
+
'lemmatize': st.checkbox("Lemmatize words", False, key="opt_lemmatize"),
|
| 127 |
+
'remove_urls': st.checkbox("Remove URLs", False, key="opt_remove_urls"),
|
| 128 |
+
'remove_html': st.checkbox("Remove HTML Tags", False, key="opt_remove_html")
|
| 129 |
+
}
|
| 130 |
+
with c2:
|
| 131 |
+
opts.update({
|
| 132 |
+
'remove_special_chars': st.checkbox("Remove Special Characters", False, key="opt_remove_special_chars"),
|
| 133 |
+
'remove_punctuation': st.checkbox("Remove Punctuation", False, key="opt_remove_punctuation"),
|
| 134 |
+
'remove_numbers': st.checkbox("Remove Numbers", False, key="opt_remove_numbers")
|
| 135 |
+
})
|
| 136 |
+
st.markdown("---")
|
| 137 |
+
c1_emoji, c2_hashtag, c3_mention = st.columns(3)
|
| 138 |
+
with c1_emoji: opts['handle_emojis'] = st.radio("Emoji Handling", ["Keep Emojis", "Remove Emojis", "Convert Emojis to Text"], index=0, key="opt_handle_emojis")
|
| 139 |
+
with c2_hashtag: opts['handle_hashtags'] = st.radio("Hashtag (#) Handling", ["Keep Hashtags", "Remove Hashtags", "Extract Hashtags"], index=0, key="opt_handle_hashtags")
|
| 140 |
+
with c3_mention: opts['handle_mentions'] = st.radio("Mention (@) Handling", ["Keep Mentions", "Remove Mentions", "Extract Mentions"], index=0, key="opt_handle_mentions")
|
| 141 |
+
st.markdown("---")
|
| 142 |
+
opts['remove_stopwords'] = st.checkbox("Remove Stopwords", True, key="opt_remove_stopwords")
|
| 143 |
+
|
| 144 |
+
st.text_area(
|
| 145 |
+
"Custom Stopwords (comma-separated)",
|
| 146 |
+
key="custom_stopwords_text", # This one already had a key
|
| 147 |
+
on_change=sync_stopwords
|
| 148 |
+
)
|
| 149 |
+
opts['custom_stopwords'] = [s.strip().lower() for s in st.session_state.custom_stopwords_text.split(',') if s]
|
| 150 |
+
|
| 151 |
+
st.subheader("User Similarity Analysis")
|
| 152 |
+
enable_similarity = st.checkbox(
|
| 153 |
+
"Enable User Similarity Analysis",
|
| 154 |
+
value=True,
|
| 155 |
+
help="Find users with similar interests based on topics or text content",
|
| 156 |
+
key="enable_similarity"
|
| 157 |
+
)
|
| 158 |
+
|
| 159 |
+
if enable_similarity:
|
| 160 |
+
similarity_method = st.radio(
|
| 161 |
+
"Similarity Method",
|
| 162 |
+
options=["Topic-Based", "Text Similarity (TF-IDF)"],
|
| 163 |
+
index=0,
|
| 164 |
+
help="Topic-Based: Compare topic distributions. TF-IDF: Compare actual text content.",
|
| 165 |
+
key="similarity_method",
|
| 166 |
+
horizontal=True
|
| 167 |
+
)
|
| 168 |
+
else:
|
| 169 |
+
similarity_method = None
|
| 170 |
+
|
| 171 |
+
st.divider()
|
| 172 |
+
process_button = st.button("🚀 Run Full Analysis", type="primary", use_container_width=True)
|
| 173 |
+
else:
|
| 174 |
+
process_button = False
|
| 175 |
+
|
| 176 |
+
st.divider()
|
| 177 |
+
|
| 178 |
+
# --- Main Processing Logic ---
|
| 179 |
+
if process_button:
|
| 180 |
+
st.session_state.results = None
|
| 181 |
+
start_time = time.time()
|
| 182 |
+
with st.spinner("Processing your data... This may take a few minutes."):
|
| 183 |
+
try:
|
| 184 |
+
df = df_raw[[user_id_col, post_content_col, timestamp_col]].copy()
|
| 185 |
+
df.columns = ['user_id', 'post_content', 'timestamp']
|
| 186 |
+
df.dropna(subset=['user_id', 'post_content', 'timestamp'], inplace=True)
|
| 187 |
+
try:
|
| 188 |
+
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
|
| 189 |
+
invalid_timestamps = df['timestamp'].isna().sum()
|
| 190 |
+
if invalid_timestamps > 0:
|
| 191 |
+
st.warning(f"Warning: {invalid_timestamps} rows have invalid timestamps and will be excluded.")
|
| 192 |
+
df = df.dropna(subset=['timestamp'])
|
| 193 |
+
except Exception as e:
|
| 194 |
+
st.error(f"Could not parse timestamp column: {e}")
|
| 195 |
+
st.stop()
|
| 196 |
+
if opts['handle_hashtags'] == 'Extract Hashtags': df['hashtags'] = df['post_content'].str.findall(r'#\w+')
|
| 197 |
+
if opts['handle_mentions'] == 'Extract Mentions': df['mentions'] = df['post_content'].str.findall(r'@\w+')
|
| 198 |
+
|
| 199 |
+
# 1. Capture the user's actual choice about stopwords
|
| 200 |
+
user_wants_stopwords_removed = opts.get("remove_stopwords", False)
|
| 201 |
+
custom_stopwords_list = opts.get("custom_stopwords", [])
|
| 202 |
+
|
| 203 |
+
# 2. Tell the preprocessor to KEEP stopwords in the text.
|
| 204 |
+
opts_for_preprocessor = opts.copy()
|
| 205 |
+
opts_for_preprocessor['remove_stopwords'] = False
|
| 206 |
+
|
| 207 |
+
st.info("⚙️ Initializing preprocessor and cleaning text (keeping stopwords for now)...")
|
| 208 |
+
preprocessor = MultilingualPreprocessor(language=language)
|
| 209 |
+
df['processed_content'] = preprocessor.preprocess_series(
|
| 210 |
+
df['post_content'],
|
| 211 |
+
opts_for_preprocessor,
|
| 212 |
+
n_process_spacy=-1 # Use all CPU cores for faster processing
|
| 213 |
+
)
|
| 214 |
+
|
| 215 |
+
st.info("🔍 Performing topic modeling...")
|
| 216 |
+
# Add +1 because BERTopic creates an outlier topic (-1), so to get N meaningful topics, request N+1
|
| 217 |
+
if num_topics > 0:
|
| 218 |
+
bertopic_nr_topics = num_topics + 1
|
| 219 |
+
else:
|
| 220 |
+
bertopic_nr_topics = "auto"
|
| 221 |
+
|
| 222 |
+
docs_series = df['processed_content'].fillna('').astype(str)
|
| 223 |
+
docs_to_model = docs_series[docs_series.str.len() > 0].tolist()
|
| 224 |
+
df_with_content = df[docs_series.str.len() > 0].copy()
|
| 225 |
+
|
| 226 |
+
if not docs_to_model:
|
| 227 |
+
st.error("❌ After preprocessing, no documents were left to analyze. Please adjust your cleaning options.")
|
| 228 |
+
st.stop()
|
| 229 |
+
|
| 230 |
+
# 3. Pass the user's choice and stopwords list to BERTopic
|
| 231 |
+
topic_model, topics, probs, coherence_score = perform_topic_modeling(
|
| 232 |
+
docs=docs_to_model,
|
| 233 |
+
language=language,
|
| 234 |
+
nr_topics=bertopic_nr_topics,
|
| 235 |
+
remove_stopwords_bertopic=user_wants_stopwords_removed,
|
| 236 |
+
custom_stopwords=custom_stopwords_list
|
| 237 |
+
)
|
| 238 |
+
|
| 239 |
+
df_with_content['topic_id'] = topics
|
| 240 |
+
df_with_content['probability'] = probs
|
| 241 |
+
df = pd.merge(df, df_with_content[['topic_id', 'probability']], left_index=True, right_index=True, how='left')
|
| 242 |
+
df['topic_id'] = df['topic_id'].fillna(-1).astype(int)
|
| 243 |
+
|
| 244 |
+
st.info("📊 Calculating user engagement metrics...")
|
| 245 |
+
all_unique_topics = sorted(df[df['topic_id'] != -1]['topic_id'].unique().tolist())
|
| 246 |
+
all_unique_users = sorted(df['user_id'].unique().tolist())
|
| 247 |
+
|
| 248 |
+
gini_per_user = calculate_gini_per_user(df[['user_id', 'topic_id']], all_topics=all_unique_topics)
|
| 249 |
+
gini_per_topic = calculate_gini_per_topic(df[['user_id', 'topic_id']], all_users=all_unique_users)
|
| 250 |
+
|
| 251 |
+
st.info("📈 Analyzing topic evolution...")
|
| 252 |
+
general_evolution = analyze_general_topic_evolution(topic_model, docs_to_model, df_with_content['timestamp'].tolist())
|
| 253 |
+
|
| 254 |
+
end_time = time.time()
|
| 255 |
+
elapsed_time = end_time - start_time
|
| 256 |
+
|
| 257 |
+
# Format elapsed time nicely
|
| 258 |
+
if elapsed_time >= 60:
|
| 259 |
+
minutes = int(elapsed_time // 60)
|
| 260 |
+
seconds = elapsed_time % 60
|
| 261 |
+
time_str = f"{minutes} min {seconds:.1f} sec"
|
| 262 |
+
else:
|
| 263 |
+
time_str = f"{elapsed_time:.1f} sec"
|
| 264 |
+
|
| 265 |
+
# Cache df_meaningful for reuse (avoids repeated filtering)
|
| 266 |
+
df_meaningful = df[df['topic_id'] != -1].copy()
|
| 267 |
+
|
| 268 |
+
st.session_state.results = {
|
| 269 |
+
'topic_model': topic_model,
|
| 270 |
+
'topic_info': topic_model.get_topic_info(),
|
| 271 |
+
'df': df,
|
| 272 |
+
'df_meaningful': df_meaningful, # Cached for performance
|
| 273 |
+
'gini_per_user': gini_per_user,
|
| 274 |
+
'gini_per_topic': gini_per_topic,
|
| 275 |
+
'general_evolution': general_evolution,
|
| 276 |
+
'coherence_score': coherence_score,
|
| 277 |
+
'processing_time': elapsed_time
|
| 278 |
+
}
|
| 279 |
+
|
| 280 |
+
st.success(f"✅ Analysis complete! Processing time: {time_str}")
|
| 281 |
+
except OSError as e:
|
| 282 |
+
st.error(f"spaCy Model Error: Could not load model. Please run `python -m spacy download en_core_web_sm` and `python -m spacy download xx_ent_wiki_sm` from your terminal.")
|
| 283 |
+
except Exception as e:
|
| 284 |
+
st.error(f"❌ An error occurred during processing: {e}")
|
| 285 |
+
st.exception(e)
|
| 286 |
+
# --- Display Results ---
|
| 287 |
+
if st.session_state.results:
|
| 288 |
+
results = st.session_state.results
|
| 289 |
+
df = results['df']
|
| 290 |
+
topic_model = results['topic_model']
|
| 291 |
+
topic_info = results['topic_info']
|
| 292 |
+
|
| 293 |
+
st.markdown('<h2 class="sub-header">📋 Overview & Preprocessing</h2>', unsafe_allow_html=True)
|
| 294 |
+
score_text = f"{results['coherence_score']:.3f}" if results['coherence_score'] is not None else "N/A"
|
| 295 |
+
num_users = df['user_id'].nunique()
|
| 296 |
+
avg_posts = len(df) / num_users if num_users > 0 else 0
|
| 297 |
+
start_date, end_date = df['timestamp'].min(), df['timestamp'].max()
|
| 298 |
+
# Option 1: More Compact Date Format
|
| 299 |
+
if start_date.year == end_date.year:
|
| 300 |
+
# If both dates are in the same year, only show year on the end date
|
| 301 |
+
time_range_str = f"{start_date.strftime('%b %d')} - {end_date.strftime('%b %d, %Y')}"
|
| 302 |
+
else:
|
| 303 |
+
# If dates span multiple years, show year on both
|
| 304 |
+
time_range_str = f"{start_date.strftime('%b %d, %Y')} - {end_date.strftime('%b %d, %Y')}"
|
| 305 |
+
|
| 306 |
+
# Format processing time for display
|
| 307 |
+
proc_time = results.get('processing_time', 0)
|
| 308 |
+
if proc_time >= 60:
|
| 309 |
+
proc_time_str = f"{int(proc_time // 60)}m {proc_time % 60:.1f}s"
|
| 310 |
+
else:
|
| 311 |
+
proc_time_str = f"{proc_time:.1f}s"
|
| 312 |
+
|
| 313 |
+
col1, col2, col3, col4, col5, col6 = st.columns(6)
|
| 314 |
+
col1.metric("Total Posts", len(df))
|
| 315 |
+
col2.metric("Unique Users", num_users)
|
| 316 |
+
col3.metric("Avg Posts / User", f"{avg_posts:.1f}")
|
| 317 |
+
col4.metric("Time Range", time_range_str)
|
| 318 |
+
col5.metric("Topic Coherence", score_text)
|
| 319 |
+
col6.metric("Processing Time", proc_time_str)
|
| 320 |
+
st.markdown("#### Preprocessing Results (Sample)")
|
| 321 |
+
st.dataframe(df[['post_content', 'processed_content']].head())
|
| 322 |
+
|
| 323 |
+
with st.expander("📊 Topic Model Evaluation Metrics"):
|
| 324 |
+
st.write("""
|
| 325 |
+
### 🔹Coherence Score
|
| 326 |
+
- measures how well the discovered topics make sense:
|
| 327 |
+
- **> 0.6**: Excellent - Topics are very distinct and meaningful
|
| 328 |
+
- **0.5 - 0.6**: Good - Topics are generally clear and interpretable
|
| 329 |
+
- **0.4 - 0.5**: Fair - Topics are somewhat meaningful but may overlap
|
| 330 |
+
- **< 0.4**: Poor - Topics may be unclear or too similar
|
| 331 |
+
|
| 332 |
+
💡 **Tip**: If coherence is low, try adjusting the number of topics or cleaning options.
|
| 333 |
+
""")
|
| 334 |
+
|
| 335 |
+
st.markdown('<h2 class="sub-header">🎯 Topic Visualization & Refinement</h2>', unsafe_allow_html=True)
|
| 336 |
+
topic_options = topic_info[topic_info.Topic != -1].sort_values('Count', ascending=False)
|
| 337 |
+
|
| 338 |
+
|
| 339 |
+
|
| 340 |
+
|
| 341 |
+
view1, view2 = st.tabs(["Word Clouds", "Interactive Word Lists & Refinement"])
|
| 342 |
+
|
| 343 |
+
with view1:
|
| 344 |
+
st.info("Visual representation of the most important words for each topic.")
|
| 345 |
+
topics_to_show = topic_options.head(9)
|
| 346 |
+
num_cols = 3
|
| 347 |
+
cols = st.columns(num_cols)
|
| 348 |
+
for i, row in enumerate(topics_to_show.itertuples()):
|
| 349 |
+
with cols[i % num_cols]:
|
| 350 |
+
st.markdown(f"##### Topic {row.Topic}: {row.Name}")
|
| 351 |
+
fig = create_word_cloud(topic_model, row.Topic)
|
| 352 |
+
if fig: st.pyplot(fig, use_container_width=True)
|
| 353 |
+
|
| 354 |
+
with view2:
|
| 355 |
+
st.info("Select or deselect words from the lists below to instantly update the custom stopwords list in the configuration section above.")
|
| 356 |
+
topics_to_show = topic_options.head(9)
|
| 357 |
+
# Store the topic IDs we are showing so the callback can find the right widgets
|
| 358 |
+
st.session_state.topics_info_for_sync = [row.Topic for row in topics_to_show.itertuples()]
|
| 359 |
+
|
| 360 |
+
num_cols = 3
|
| 361 |
+
cols = st.columns(num_cols)
|
| 362 |
+
|
| 363 |
+
# Calculate which words should be pre-selected in the multiselects
|
| 364 |
+
current_stopwords_set = set([s.strip() for s in st.session_state.custom_stopwords_text.split(',') if s])
|
| 365 |
+
|
| 366 |
+
for i, row in enumerate(topics_to_show.itertuples()):
|
| 367 |
+
with cols[i % num_cols]:
|
| 368 |
+
st.markdown(f"##### Topic {row.Topic}")
|
| 369 |
+
topic_words = topic_model.get_topic(row.Topic)
|
| 370 |
+
|
| 371 |
+
# The options for the multiselect, e.g., ["word1 (0.123)", "word2 (0.122)"]
|
| 372 |
+
formatted_options = [f"{word} ({score:.3f})" for word, score in topic_words[:15]]
|
| 373 |
+
|
| 374 |
+
# Determine the default selected values for this specific multiselect
|
| 375 |
+
default_selection = []
|
| 376 |
+
for formatted_word in formatted_options:
|
| 377 |
+
word_part = formatted_word.split(' ')[0]
|
| 378 |
+
if word_part in current_stopwords_set:
|
| 379 |
+
default_selection.append(formatted_word)
|
| 380 |
+
|
| 381 |
+
st.multiselect(
|
| 382 |
+
f"Select words from Topic {row.Topic}",
|
| 383 |
+
options=formatted_options,
|
| 384 |
+
default=default_selection, # Pre-select words that are already in the list
|
| 385 |
+
key=f"multiselect_topic_{row.Topic}",
|
| 386 |
+
on_change=sync_stopwords, # The callback synchronizes everything
|
| 387 |
+
label_visibility="collapsed"
|
| 388 |
+
)
|
| 389 |
+
|
| 390 |
+
|
| 391 |
+
|
| 392 |
+
|
| 393 |
+
st.markdown('<h2 class="sub-header">📈 Topic Evolution</h2>', unsafe_allow_html=True)
|
| 394 |
+
if not results['general_evolution'].empty:
|
| 395 |
+
evo = results['general_evolution']
|
| 396 |
+
|
| 397 |
+
|
| 398 |
+
# 1. Filter out the outlier topic (-1) and ensure Timestamp is a datetime object
|
| 399 |
+
evo_filtered = evo[evo.Topic != -1].copy()
|
| 400 |
+
evo_filtered['Timestamp'] = pd.to_datetime(evo_filtered['Timestamp'])
|
| 401 |
+
|
| 402 |
+
if not evo_filtered.empty:
|
| 403 |
+
# 2. Pivot the data to get topics as columns and aggregate frequencies
|
| 404 |
+
evo_pivot = evo_filtered.pivot_table(
|
| 405 |
+
index='Timestamp',
|
| 406 |
+
columns='Topic',
|
| 407 |
+
values='Frequency',
|
| 408 |
+
aggfunc='sum'
|
| 409 |
+
).fillna(0)
|
| 410 |
+
|
| 411 |
+
# 3. Dynamically choose a good resampling frequency (Hourly, Daily, or Weekly)
|
| 412 |
+
time_delta = evo_pivot.index.max() - evo_pivot.index.min()
|
| 413 |
+
if time_delta.days > 60:
|
| 414 |
+
resample_freq, freq_label = 'W', 'Weekly'
|
| 415 |
+
elif time_delta.days > 5:
|
| 416 |
+
resample_freq, freq_label = 'D', 'Daily'
|
| 417 |
+
else:
|
| 418 |
+
resample_freq, freq_label = 'H', 'Hourly'
|
| 419 |
+
|
| 420 |
+
# Resample the data into the chosen time bins by summing up the frequencies
|
| 421 |
+
evo_resampled = evo_pivot.resample(resample_freq).sum()
|
| 422 |
+
|
| 423 |
+
# 4. Create the line chart using plotly.express.line
|
| 424 |
+
# --- The main change is here: from px.area to px.line ---
|
| 425 |
+
fig_evo = px.line(
|
| 426 |
+
evo_resampled,
|
| 427 |
+
x=evo_resampled.index,
|
| 428 |
+
y=evo_resampled.columns,
|
| 429 |
+
title=f"Topic Frequency Over Time ({freq_label} Line Chart)",
|
| 430 |
+
labels={'value': 'Total Frequency', 'variable': 'Topic ID', 'index': 'Time'},
|
| 431 |
+
height=500
|
| 432 |
+
)
|
| 433 |
+
# Make the topic IDs in the legend categorical for better color mapping
|
| 434 |
+
fig_evo.for_each_trace(lambda t: t.update(name=str(t.name)))
|
| 435 |
+
fig_evo.update_layout(legend_title_text='Topic')
|
| 436 |
+
|
| 437 |
+
st.plotly_chart(fig_evo, use_container_width=True)
|
| 438 |
+
else:
|
| 439 |
+
st.info("No topic evolution data available to display (all posts may have been outliers).")
|
| 440 |
+
else:
|
| 441 |
+
st.warning("Could not compute topic evolution (requires more data points over time).")
|
| 442 |
+
|
| 443 |
+
|
| 444 |
+
|
| 445 |
+
|
| 446 |
+
|
| 447 |
+
st.markdown('<h2 class="sub-header">🧑🤝🧑 User Engagement Profile</h2>', unsafe_allow_html=True)
|
| 448 |
+
|
| 449 |
+
# --- START OF THE CRITICAL FIX ---
|
| 450 |
+
|
| 451 |
+
# 1. Use cached df_meaningful from session_state for performance
|
| 452 |
+
df_meaningful = results.get('df_meaningful', df[df['topic_id'] != -1])
|
| 453 |
+
|
| 454 |
+
# 2. Get post counts based on this meaningful data.
|
| 455 |
+
meaningful_post_counts = df_meaningful.groupby('user_id').size().reset_index(name='post_count')
|
| 456 |
+
|
| 457 |
+
# 3. Merge with the Gini results (which were already correctly calculated on meaningful topics).
|
| 458 |
+
# Using an 'inner' merge ensures we only consider users who have at least one meaningful post.
|
| 459 |
+
user_metrics_df = pd.merge(
|
| 460 |
+
meaningful_post_counts,
|
| 461 |
+
results['gini_per_user'],
|
| 462 |
+
on='user_id',
|
| 463 |
+
how='inner'
|
| 464 |
+
)
|
| 465 |
+
|
| 466 |
+
# 4. Filter to include only users with more than one MEANINGFUL post.
|
| 467 |
+
metrics_to_plot = user_metrics_df[user_metrics_df['post_count'] > 1].copy()
|
| 468 |
+
|
| 469 |
+
total_meaningful_users = len(user_metrics_df)
|
| 470 |
+
st.info(f"Displaying engagement profile for {len(metrics_to_plot)} users out of {total_meaningful_users} who contributed to meaningful topics.")
|
| 471 |
+
|
| 472 |
+
# 5. Add jitter for better visualization (deterministic seed for consistency)
|
| 473 |
+
np.random.seed(42)
|
| 474 |
+
jitter_strength = 0.02
|
| 475 |
+
metrics_to_plot['gini_jittered'] = metrics_to_plot['gini_coefficient'] + \
|
| 476 |
+
np.random.uniform(-jitter_strength, jitter_strength, size=len(metrics_to_plot))
|
| 477 |
+
|
| 478 |
+
# 6. Create the plot using the correctly filtered and prepared data.
|
| 479 |
+
fig = px.scatter(
|
| 480 |
+
metrics_to_plot,
|
| 481 |
+
x='post_count',
|
| 482 |
+
y='gini_jittered',
|
| 483 |
+
title='User Engagement Profile (based on posts in meaningful topics)',
|
| 484 |
+
labels={
|
| 485 |
+
'post_count': 'Number of Posts in Meaningful Topics', # Updated label
|
| 486 |
+
'gini_jittered': 'Gini Index (Topic Diversity)'
|
| 487 |
+
},
|
| 488 |
+
custom_data=['user_id', 'gini_coefficient']
|
| 489 |
+
)
|
| 490 |
+
fig.update_traces(
|
| 491 |
+
marker=dict(opacity=0.5),
|
| 492 |
+
hovertemplate="<b>User</b>: %{customdata[0]}<br><b>Meaningful Posts</b>: %{x}<br><b>Gini (Original)</b>: %{customdata[1]:.3f}<extra></extra>"
|
| 493 |
+
)
|
| 494 |
+
fig.update_yaxes(range=[-0.05, 1.05])
|
| 495 |
+
st.plotly_chart(fig, use_container_width=True)
|
| 496 |
+
|
| 497 |
+
# --- END OF THE CRITICAL FIX ---
|
| 498 |
+
|
| 499 |
+
st.markdown('<h2 class="sub-header">👤 User Deep Dive</h2>', unsafe_allow_html=True)
|
| 500 |
+
selected_user = st.selectbox("Select a User to Analyze", options=sorted(df['user_id'].unique()), key="selected_user_dropdown")
|
| 501 |
+
|
| 502 |
+
if selected_user:
|
| 503 |
+
user_df = df[df['user_id'] == selected_user]
|
| 504 |
+
matching_users = user_metrics_df[user_metrics_df['user_id'] == selected_user]
|
| 505 |
+
|
| 506 |
+
if matching_users.empty:
|
| 507 |
+
st.warning("This user has no posts in meaningful topics (all posts were classified as outliers).")
|
| 508 |
+
st.metric("Total Posts by User", len(user_df))
|
| 509 |
+
else:
|
| 510 |
+
user_gini_info = matching_users.iloc[0]
|
| 511 |
+
|
| 512 |
+
# Display the top-level metrics for the user first
|
| 513 |
+
c1, c2 = st.columns(2)
|
| 514 |
+
with c1: st.metric("Total Posts by User", len(user_df))
|
| 515 |
+
with c2: st.metric("Topic Diversity (Gini)", f"{user_gini_info['gini_coefficient']:.3f}", help=interpret_gini(user_gini_info['gini_coefficient']))
|
| 516 |
+
|
| 517 |
+
st.markdown("---") # Add a visual separator
|
| 518 |
+
|
| 519 |
+
# --- START: New Two-Column Layout for Charts ---
|
| 520 |
+
col1, col2 = st.columns(2)
|
| 521 |
+
|
| 522 |
+
with col1:
|
| 523 |
+
# --- Chart 1: Topic Distribution Pie Chart ---
|
| 524 |
+
user_topic_counts = user_df['topic_id'].value_counts().reset_index()
|
| 525 |
+
user_topic_counts.columns = ['topic_id', 'count']
|
| 526 |
+
|
| 527 |
+
fig_pie = px.pie(
|
| 528 |
+
user_topic_counts[user_topic_counts.topic_id != -1],
|
| 529 |
+
names='topic_id',
|
| 530 |
+
values='count',
|
| 531 |
+
title=f"Overall Topic Distribution for {selected_user}",
|
| 532 |
+
hole=0.4
|
| 533 |
+
)
|
| 534 |
+
fig_pie.update_layout(margin=dict(l=0, r=0, t=40, b=0))
|
| 535 |
+
st.plotly_chart(fig_pie, use_container_width=True)
|
| 536 |
+
|
| 537 |
+
with col2:
|
| 538 |
+
# --- Chart 2: Topic Evolution for User ---
|
| 539 |
+
if len(user_df) > 1:
|
| 540 |
+
user_evo_df = user_df[user_df['topic_id'] != -1].copy()
|
| 541 |
+
user_evo_df['timestamp'] = pd.to_datetime(user_evo_df['timestamp'])
|
| 542 |
+
|
| 543 |
+
if not user_evo_df.empty and user_evo_df['timestamp'].nunique() > 1:
|
| 544 |
+
user_pivot = user_evo_df.pivot_table(index='timestamp', columns='topic_id', aggfunc='size', fill_value=0)
|
| 545 |
+
|
| 546 |
+
time_delta = user_pivot.index.max() - user_pivot.index.min()
|
| 547 |
+
if time_delta.days > 30: resample_freq = 'D'
|
| 548 |
+
elif time_delta.days > 2: resample_freq = 'H'
|
| 549 |
+
else: resample_freq = 'T'
|
| 550 |
+
|
| 551 |
+
user_resampled = user_pivot.resample(resample_freq).sum()
|
| 552 |
+
row_sums = user_resampled.sum(axis=1)
|
| 553 |
+
user_proportions = user_resampled.div(row_sums, axis=0).fillna(0)
|
| 554 |
+
|
| 555 |
+
topic_name_map = topic_info.set_index('Topic')['Name'].to_dict()
|
| 556 |
+
user_proportions.rename(columns=topic_name_map, inplace=True)
|
| 557 |
+
|
| 558 |
+
fig_user_evo = px.area(
|
| 559 |
+
user_proportions,
|
| 560 |
+
x=user_proportions.index,
|
| 561 |
+
y=user_proportions.columns,
|
| 562 |
+
title=f"Topic Proportion Over Time for {selected_user}",
|
| 563 |
+
labels={'value': 'Topic Proportion', 'variable': 'Topic', 'index': 'Time'},
|
| 564 |
+
)
|
| 565 |
+
fig_user_evo.update_layout(margin=dict(l=0, r=0, t=40, b=0))
|
| 566 |
+
st.plotly_chart(fig_user_evo, use_container_width=True)
|
| 567 |
+
else:
|
| 568 |
+
st.info("This user has no posts in meaningful topics or all posts occurred at the same time.")
|
| 569 |
+
else:
|
| 570 |
+
st.info("Topic evolution requires more than one post to display.")
|
| 571 |
+
|
| 572 |
+
|
| 573 |
+
st.markdown("#### User's Most Recent Posts")
|
| 574 |
+
user_posts_table = user_df[['post_content', 'timestamp', 'topic_id']] \
|
| 575 |
+
.sort_values(by='timestamp', ascending=False) \
|
| 576 |
+
.head(100)
|
| 577 |
+
user_posts_table.columns = ['Post Content', 'Timestamp', 'Assigned Topic']
|
| 578 |
+
st.dataframe(user_posts_table, use_container_width=True)
|
| 579 |
+
|
| 580 |
+
with st.expander("Show User Distribution by Post Count"):
|
| 581 |
+
# We use 'user_metrics_df' because it's based on meaningful posts
|
| 582 |
+
post_distribution = user_metrics_df['post_count'].value_counts().reset_index()
|
| 583 |
+
post_distribution.columns = ['Number of Posts', 'Number of Users']
|
| 584 |
+
post_distribution = post_distribution.sort_values(by='Number of Posts')
|
| 585 |
+
|
| 586 |
+
# Create a bar chart for the distribution
|
| 587 |
+
fig_dist = px.bar(
|
| 588 |
+
post_distribution,
|
| 589 |
+
x='Number of Posts',
|
| 590 |
+
y='Number of Users',
|
| 591 |
+
title='User Distribution by Number of Meaningful Posts'
|
| 592 |
+
)
|
| 593 |
+
st.plotly_chart(fig_dist, use_container_width=True)
|
| 594 |
+
|
| 595 |
+
# Display the raw data in a table
|
| 596 |
+
st.write("Data Table: User Distribution")
|
| 597 |
+
st.dataframe(post_distribution, use_container_width=True)
|
| 598 |
+
|
| 599 |
+
# --- User Similarity Analysis Section ---
|
| 600 |
+
# Check if similarity analysis is enabled
|
| 601 |
+
if st.session_state.get('enable_similarity', True):
|
| 602 |
+
st.markdown('<h2 class="sub-header">🤝 User Similarity Analysis</h2>', unsafe_allow_html=True)
|
| 603 |
+
|
| 604 |
+
# Get the selected method
|
| 605 |
+
selected_method = st.session_state.get('similarity_method', 'Topic-Based')
|
| 606 |
+
|
| 607 |
+
if selected_method == "Topic-Based":
|
| 608 |
+
st.info("Finding users with similar **topic interests** based on their topic distributions.")
|
| 609 |
+
df_for_similarity = results.get('df_meaningful', df[df['topic_id'] != -1])
|
| 610 |
+
similarity_df = calculate_narrative_similarity(df_for_similarity)
|
| 611 |
+
else: # TF-IDF
|
| 612 |
+
st.info("Finding users with similar **text content** using TF-IDF word analysis.")
|
| 613 |
+
with st.spinner("Calculating text similarity (this may take a moment)..."):
|
| 614 |
+
similarity_df = calculate_text_similarity_tfidf(df)
|
| 615 |
+
|
| 616 |
+
if similarity_df.empty:
|
| 617 |
+
st.warning("Not enough data to calculate similarity. Need at least 2 users with content.")
|
| 618 |
+
else:
|
| 619 |
+
# User selection for similarity analysis
|
| 620 |
+
similarity_user = st.selectbox(
|
| 621 |
+
"Select a User to Find Similar Users",
|
| 622 |
+
options=sorted(similarity_df.index.tolist()),
|
| 623 |
+
key="similarity_user_dropdown"
|
| 624 |
+
)
|
| 625 |
+
|
| 626 |
+
# Similarity threshold slider
|
| 627 |
+
similarity_threshold = st.slider(
|
| 628 |
+
"Similarity Threshold",
|
| 629 |
+
min_value=0.0,
|
| 630 |
+
max_value=1.0,
|
| 631 |
+
value=0.5,
|
| 632 |
+
step=0.05,
|
| 633 |
+
help="Only show users with similarity score above this threshold"
|
| 634 |
+
)
|
| 635 |
+
|
| 636 |
+
if similarity_user:
|
| 637 |
+
# Get similarity scores for the selected user
|
| 638 |
+
user_similarities = similarity_df[similarity_user].drop(similarity_user) # Exclude self
|
| 639 |
+
|
| 640 |
+
# Filter by threshold
|
| 641 |
+
similar_users = user_similarities[user_similarities >= similarity_threshold].sort_values(ascending=False)
|
| 642 |
+
|
| 643 |
+
if similar_users.empty:
|
| 644 |
+
st.info(f"No users found with similarity >= {similarity_threshold}. Try lowering the threshold.")
|
| 645 |
+
else:
|
| 646 |
+
# Create a results DataFrame with post counts
|
| 647 |
+
similar_users_df = pd.DataFrame({
|
| 648 |
+
'User ID': similar_users.index,
|
| 649 |
+
'Similarity Score': similar_users.values
|
| 650 |
+
})
|
| 651 |
+
|
| 652 |
+
# Add post count for context
|
| 653 |
+
post_counts = df.groupby('user_id').size()
|
| 654 |
+
similar_users_df['Post Count'] = similar_users_df['User ID'].map(post_counts).fillna(0).astype(int)
|
| 655 |
+
|
| 656 |
+
# Format the similarity score
|
| 657 |
+
similar_users_df['Similarity Score'] = similar_users_df['Similarity Score'].apply(lambda x: f"{x:.3f}")
|
| 658 |
+
|
| 659 |
+
method_label = "topic interests" if selected_method == "Topic-Based" else "text content"
|
| 660 |
+
st.write(f"**Found {len(similar_users_df)} users** with similar {method_label} to **{similarity_user}**:")
|
| 661 |
+
st.dataframe(similar_users_df, use_container_width=True, hide_index=True)
|
docker-compose.yml
ADDED
|
@@ -0,0 +1,25 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version: '3.8'
|
| 2 |
+
|
| 3 |
+
services:
|
| 4 |
+
topic-modeling-app:
|
| 5 |
+
build: .
|
| 6 |
+
ports:
|
| 7 |
+
- "8501:8501"
|
| 8 |
+
environment:
|
| 9 |
+
- STREAMLIT_SERVER_PORT=8501
|
| 10 |
+
- STREAMLIT_SERVER_ADDRESS=0.0.0.0
|
| 11 |
+
- STREAMLIT_BROWSER_GATHER_USAGE_STATS=false
|
| 12 |
+
- STREAMLIT_SERVER_HEADLESS=true
|
| 13 |
+
- TOKENIZERS_PARALLELISM=false
|
| 14 |
+
|
| 15 |
+
volumes:
|
| 16 |
+
# Optional: Mount a directory for persistent data storage
|
| 17 |
+
- ./data:/app/data
|
| 18 |
+
restart: unless-stopped
|
| 19 |
+
healthcheck:
|
| 20 |
+
test: ["CMD", "curl", "-f", "http://localhost:8501/_stcore/health"]
|
| 21 |
+
interval: 30s
|
| 22 |
+
timeout: 10s
|
| 23 |
+
retries: 3
|
| 24 |
+
start_period: 40s
|
| 25 |
+
|
gini_calculator.py
ADDED
|
@@ -0,0 +1,107 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import pandas as pd
|
| 2 |
+
from math import isnan
|
| 3 |
+
import math
|
| 4 |
+
from typing import List
|
| 5 |
+
|
| 6 |
+
def calculate_gini(counts, *, min_posts=None, normalize=False):
    """Return the Gini impurity ``1 - sum(p_i**2)`` of a set of category counts.

    Accepts a list/tuple of counts, a dict ``{category: count}``, a numpy
    array, or a pandas Series. All counts must be nonnegative.

    Edge cases:
    - ``counts`` is None or not iterable -> NaN
    - total count of 0 -> NaN
    - total count below ``min_posts`` (when given) -> NaN
    - total count of exactly 1 -> 0.0
    - ``normalize=True`` rescales by the maximum impurity ``1 - 1/k`` for
      the ``k`` categories that actually have posts, clamped to [0, 1].

    Parameters
    ----------
    counts : Iterable[int] | dict | pandas.Series | numpy.ndarray
        Nonnegative counts per category.
    min_posts : int | None
        If provided and the total is below this, return NaN.
    normalize : bool
        If True, rescale into [0, 1] against the observed category count.

    Returns
    -------
    float

    Raises
    ------
    ValueError
        If any count is negative.
    """
    if counts is None:
        return float('nan')

    # Flatten whatever container we were given into a plain list.
    if isinstance(counts, dict):
        raw = list(counts.values())
    else:
        try:
            raw = list(counts)  # list/tuple/np.ndarray/pd.Series all work
        except TypeError:
            return float('nan')

    # Drop missing entries, coerce the rest to float, reject negatives.
    vals = [float(v) for v in raw if v is not None and not math.isnan(v)]
    if any(v < 0 for v in vals):
        raise ValueError("Counts must be nonnegative.")

    total = sum(vals)
    if total == 0:
        return float('nan')
    if min_posts is not None and total < min_posts:
        return float('nan')

    # A single observation cannot show any diversity.
    base = 0.0 if total == 1 else 1.0 - sum((v / total) ** 2 for v in vals)

    if not normalize:
        return base

    # Rescale against the maximum impurity achievable with the observed
    # number of nonzero categories.
    k_nonzero = sum(1 for v in vals if v > 0)
    if k_nonzero <= 1:
        # Only one populated category: diversity is 0 and the
        # normalization denominator would be 0, so return 0 directly.
        return 0.0
    max_impurity = 1.0 - 1.0 / k_nonzero
    # Clamp to guard against tiny floating-point excursions outside [0, 1].
    return max(0.0, min(1.0, base / max_impurity))
|
| 72 |
+
|
| 73 |
+
|
| 74 |
+
def calculate_gini_per_user(df: pd.DataFrame, all_topics: List[int]):
    """Compute per-user topic diversity as a normalized Gini impurity.

    A value near 1 means the user's posts are spread evenly across topics;
    0 means the user posts about a single topic.

    Parameters
    ----------
    df : pd.DataFrame
        Must contain 'user_id' and 'topic_id' columns.
    all_topics : List[int]
        The full topic vocabulary, so every user is scored against the
        same category space even when they never touch some topics.

    Returns
    -------
    pd.DataFrame
        Columns ['user_id', 'gini_coefficient']; NaN scores become 0.
    """
    def _diversity(posts):
        # Start from a zero count for every known topic, then overlay the
        # counts actually observed for this user.
        counts = pd.Series(0, index=all_topics)
        counts.update(posts["topic_id"].value_counts())
        return calculate_gini(counts.values, normalize=True)

    # A single groupby pass keeps this O(n) instead of looping per user.
    user_gini = df.groupby("user_id").apply(_diversity).reset_index()
    user_gini.columns = ["user_id", "gini_coefficient"]
    return user_gini.fillna(0)
|
| 90 |
+
|
| 91 |
+
|
| 92 |
+
def calculate_gini_per_topic(df: pd.DataFrame, all_users: List[str]):
    """Compute per-topic user diversity as a normalized Gini impurity.

    A value near 1 means the topic is discussed evenly by many users;
    0 means a single user dominates the topic.

    Parameters
    ----------
    df : pd.DataFrame
        Must contain 'user_id' and 'topic_id' columns.
    all_users : List[str]
        The full set of user ids, so every topic is scored against the
        same category space even when some users never post in it.

    Returns
    -------
    pd.DataFrame
        Columns ['topic_id', 'gini_coefficient']; NaN scores become 0.
    """
    def _diversity(posts):
        # Zero-fill the full user space, then overlay observed counts.
        counts = pd.Series(0, index=all_users)
        counts.update(posts["user_id"].value_counts())
        return calculate_gini(counts.values, normalize=True)

    # A single groupby pass keeps this O(n) instead of looping per topic.
    topic_gini = df.groupby("topic_id").apply(_diversity).reset_index()
    topic_gini.columns = ["topic_id", "gini_coefficient"]
    return topic_gini.fillna(0)
|
narrative_similarity.py
ADDED
|
@@ -0,0 +1,102 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# narrative_similarity.py
|
| 2 |
+
|
| 3 |
+
import pandas as pd
|
| 4 |
+
from sklearn.metrics.pairwise import cosine_similarity
|
| 5 |
+
from sklearn.feature_extraction.text import TfidfVectorizer
|
| 6 |
+
|
| 7 |
+
def calculate_narrative_similarity(df: pd.DataFrame):
    """Measure narrative overlap between users via their topic mixes.

    Each user is represented by the proportion of their posts falling in
    each topic; pairwise cosine similarity of those proportion vectors is
    the "narrative overlap" score.

    Args:
        df (pd.DataFrame): DataFrame with 'user_id' and 'topic_id' columns.
            Outlier posts (topic_id == -1) are ignored.

    Returns:
        pd.DataFrame: Square matrix indexed and columned by user_id, or an
        empty DataFrame when fewer than two users have meaningful posts.
    """
    # Drop BERTopic outlier posts if the topic column is present.
    if 'topic_id' in df.columns:
        meaningful = df[df['topic_id'] != -1]
    else:
        meaningful = df

    if meaningful.empty:
        return pd.DataFrame()

    # user x topic post-count matrix: one "narrative vector" per user.
    counts = pd.crosstab(meaningful['user_id'], meaningful['topic_id'])

    # A similarity matrix only makes sense with at least two users.
    if len(counts) < 2:
        return pd.DataFrame()

    # Convert raw counts to per-user proportions so prolific and quiet
    # users are compared on equal footing.
    proportions = counts.div(counts.sum(axis=1), axis=0)

    sims = cosine_similarity(proportions)

    # Re-attach the user_id labels to the raw numpy result.
    return pd.DataFrame(sims, index=counts.index, columns=counts.index)
|
| 49 |
+
|
| 50 |
+
|
| 51 |
+
def calculate_text_similarity_tfidf(df: pd.DataFrame):
    """Measure user-to-user similarity of raw text via TF-IDF vectors.

    All posts by a user are concatenated into one document; documents are
    TF-IDF vectorized and compared pairwise with cosine similarity.

    Args:
        df (pd.DataFrame): DataFrame with 'user_id' and 'post_content'
            columns.

    Returns:
        pd.DataFrame: Square matrix indexed and columned by user_id, or an
        empty DataFrame when the input is empty, lacks the text column,
        has fewer than two users, or yields an empty vocabulary.
    """
    if df.empty or 'post_content' not in df.columns:
        return pd.DataFrame()

    # Build one combined document per user.
    docs = df.groupby('user_id')['post_content'].apply(
        lambda posts: ' '.join(posts.astype(str))
    ).reset_index()
    docs.columns = ['user_id', 'combined_text']

    # A similarity matrix only makes sense with at least two users.
    if len(docs) < 2:
        return pd.DataFrame()

    vectorizer = TfidfVectorizer(
        max_features=5000,  # cap vocabulary size for performance
        stop_words='english',
        min_df=1,
        max_df=0.95
    )

    try:
        matrix = vectorizer.fit_transform(docs['combined_text'])
    except ValueError:
        # fit_transform raises when the vocabulary ends up empty
        # (e.g. every post is stop words or blank strings).
        return pd.DataFrame()

    sims = cosine_similarity(matrix)

    # Re-attach the user_id labels to the raw numpy result.
    return pd.DataFrame(sims, index=docs['user_id'], columns=docs['user_id'])
|
readme.md
ADDED
|
@@ -0,0 +1,138 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Social Media Topic Modeling System
|
| 2 |
+
|
| 3 |
+
A comprehensive topic modeling system for social media analysis built with Streamlit and BERTopic. This application supports flexible CSV column mapping, multilingual topic modeling, Gini coefficient calculation for diversity analysis, topic evolution tracking, and semantic narrative overlap detection.
|
| 4 |
+
|
| 5 |
+
## Features
|
| 6 |
+
|
| 7 |
+
- **📊 Topic Modeling**: Uses BERTopic for state-of-the-art, transformer-based topic modeling.
|
| 8 |
+
- **⚙️ Flexible Configuration**:
|
| 9 |
+
- **Custom Column Mapping**: Use any CSV file by mapping your columns to `user_id`, `post_content`, and `timestamp`.
|
| 10 |
+
- **Topic Number Control**: Let the model find topics automatically or specify the exact number you need.
|
| 11 |
+
- **🌍 Multilingual Support**: Handles English and 50+ other languages using appropriate language models.
|
| 12 |
+
- **📈 Gini Index Analysis**: Calculates topic and user diversity.
|
| 13 |
+
- **⏰ Topic Evolution**: Tracks how topic popularity and user interests change over time with interactive charts.
|
| 14 |
+
- **🤝 Narrative Overlap Analysis**: Identifies users with semantically similar posting patterns (shared narratives), even when their wording differs.
|
| 15 |
+
- **✍️ Interactive Topic Refinement**: Fine-tune topic quality by adding words to a custom stopword list directly from the dashboard.
|
| 16 |
+
- **🎯 Interactive Visualizations**: A rich dashboard with built-in charts and data tables using Plotly.
|
| 17 |
+
- **📱 Responsive Interface**: Clean, modern Streamlit interface with a control panel for all settings.
|
| 18 |
+
|
| 19 |
+
## Requirements
|
| 20 |
+
|
| 21 |
+
### CSV File Format
|
| 22 |
+
|
| 23 |
+
Your CSV file must contain columns that can be mapped to the following roles:
|
| 24 |
+
- **User ID**: A column with unique identifiers for each user (string).
|
| 25 |
+
- **Post Content**: A column with the text content of the social media post (string).
|
| 26 |
+
- **Timestamp**: A column with the date and time of the post.
|
| 27 |
+
|
| 28 |
+
The application will prompt you to select the correct column for each role after you upload your file.
|
| 29 |
+
|
| 30 |
+
#### A Note on Timestamp Formatting
|
| 31 |
+
|
| 32 |
+
The application is highly flexible and can automatically parse many common date and time formats thanks to the powerful Pandas library. However, to ensure 100% accuracy and avoid errors, please follow these guidelines for your timestamp column:
|
| 33 |
+
|
| 34 |
+
* **Best Practice (Recommended):** Use a standard, unambiguous format like ISO 8601.
|
| 35 |
+
- `YYYY-MM-DD HH:MM:SS` (e.g., `2023-10-27 15:30:00`)
|
| 36 |
+
- `YYYY-MM-DDTHH:MM:SS` (e.g., `2023-10-27T15:30:00`)
|
| 37 |
+
|
| 38 |
+
* **Supported Formats:** Most common formats will work, including:
|
| 39 |
+
- `MM/DD/YYYY HH:MM` (e.g., `10/27/2023 15:30`)
|
| 40 |
+
- `DD/MM/YYYY HH:MM` (e.g., `27/10/2023 15:30`)
|
| 41 |
+
- `Month D, YYYY` (e.g., `October 27, 2023`)
|
| 42 |
+
|
| 43 |
+
* **Potential Issues to Avoid:**
|
| 44 |
+
- **Ambiguous formats:** A date like `01/02/2023` can be interpreted as either Jan 2nd or Feb 1st. Using a `YYYY-MM-DD` format avoids this.
|
| 45 |
+
- **Mixed formats in one column:** Ensure all timestamps in your column follow the same format for best performance and reliability.
|
| 46 |
+
- **Timezone information:** Formats with timezone offsets (e.g., `2023-10-27 15:30:00+05:30`) are fully supported.
|
| 47 |
+
|
| 48 |
+
### Dependencies
|
| 49 |
+
|
| 50 |
+
See `requirements.txt` for a full list of dependencies.
|
| 51 |
+
|
| 52 |
+
## Installation
|
| 53 |
+
|
| 54 |
+
### Option 1: Local Installation
|
| 55 |
+
|
| 56 |
+
1. **Clone or download the project files.**
|
| 57 |
+
2. **Install dependencies:**
|
| 58 |
+
```bash
|
| 59 |
+
pip install -r requirements.txt
|
| 60 |
+
```
|
| 61 |
+
3. **Download spaCy models:**
|
| 62 |
+
```bash
|
| 63 |
+
python -m spacy download en_core_web_sm
|
| 64 |
+
python -m spacy download xx_ent_wiki_sm
|
| 65 |
+
```
|
| 66 |
+
|
| 67 |
+
### Option 2: Docker Installation (Recommended)
|
| 68 |
+
|
| 69 |
+
1. **Using Docker Compose (easiest):**
|
| 70 |
+
```bash
|
| 71 |
+
docker-compose up --build
|
| 72 |
+
```
|
| 73 |
+
2. **Access the application:**
|
| 74 |
+
Open your browser and go to `http://localhost:8501`.
|
| 75 |
+
|
| 76 |
+
## Usage
|
| 77 |
+
|
| 78 |
+
1. **Start the Streamlit application:**
|
| 79 |
+
```bash
|
| 80 |
+
streamlit run app.py
|
| 81 |
+
```
|
| 82 |
+
2. **Open your browser** and navigate to the local URL provided by Streamlit (usually `http://localhost:8501`).
|
| 83 |
+
3. **Follow the steps in the application:**
|
| 84 |
+
- **1. Upload CSV File**: Click "Browse files" to upload your dataset.
|
| 85 |
+
- **2. Map Data Columns**: Once uploaded, select which of your columns correspond to `User ID`, `Post Content`, and `Timestamp`.
|
| 86 |
+
- **3. Configure Analysis**:
|
| 87 |
+
- **Language Model**: Choose `english` for English-only data or `multilingual` for other languages.
|
| 88 |
+
- **Number of Topics**: Enter a specific number of meaningful topics to find, or use `-1` to let the model decide automatically.
|
| 89 |
+
- **Text Preprocessing**: Expand the advanced options to select cleaning steps like lowercasing, punctuation removal, and more.
|
| 90 |
+
- **Custom Stopwords**: (Optional) Enter comma-separated words to exclude from analysis.
|
| 91 |
+
- **4. Run Analysis**: Click the "🚀 Run Full Analysis" button.
|
| 92 |
+
|
| 93 |
+
4. **Explore the results** in the interactive sections of the main panel.
|
| 94 |
+
|
| 95 |
+
### Exploring the Interface
|
| 96 |
+
|
| 97 |
+
The application provides a series of detailed sections:
|
| 98 |
+
|
| 99 |
+
#### 📋 Overview & Preprocessing
|
| 100 |
+
- Key metrics (total posts, unique users), dataset time range, and a topic coherence score.
|
| 101 |
+
- A sample of your data showing the original and processed text.
|
| 102 |
+
|
| 103 |
+
#### 🎯 Topic Visualization & Refinement
|
| 104 |
+
- **Word Clouds**: Visual representation of the most important words for top topics.
|
| 105 |
+
- **Interactive Word Lists**: Interactively select words from topic lists to add them to your custom stopwords for re-analysis.
|
| 106 |
+
|
| 107 |
+
#### 📈 Topic Evolution
|
| 108 |
+
- An interactive line chart showing how topic frequencies change over the entire dataset's timespan.
|
| 109 |
+
|
| 110 |
+
#### 🧑🤝🧑 User Engagement Profile
|
| 111 |
+
- A scatter plot visualizing the relationship between the number of posts a user makes and the diversity of their topics.
|
| 112 |
+
- An expandable section showing the distribution of users by their post count.
|
| 113 |
+
|
| 114 |
+
#### 👤 User Deep Dive
|
| 115 |
+
- Select a specific user to analyze.
|
| 116 |
+
- View their key metrics, overall topic distribution pie chart, and their personal topic evolution over time.
|
| 117 |
+
- See detailed tables of their topic breakdown and their most recent posts.
|
| 118 |
+
|
| 119 |
+
#### 🤝 Narrative Overlap Analysis
|
| 120 |
+
- Select a user to find other users who discuss a similar mix of topics.
|
| 121 |
+
- Use the slider to adjust the similarity threshold.
|
| 122 |
+
- The results table shows the overlap score and post count of similar users, providing context on both narrative alignment and engagement level.
|
| 123 |
+
|
| 124 |
+
## Understanding the Results
|
| 125 |
+
|
| 126 |
+
### Gini Impurity Index
|
| 127 |
+
This application uses the **Gini Impurity Index**, a measure of diversity.
|
| 128 |
+
- **Range**: 0 to 1
|
| 129 |
+
- **User Gini (Topic Diversity)**: Measures how diverse a user's topics are. **0** = perfectly specialized (posts on only one topic), **1** = perfectly diverse (posts spread evenly across all topics).
|
| 130 |
+
- **Topic Gini (User Diversity)**: Measures how concentrated a topic is among users. **0** = dominated by a single user, **1** = widely and evenly discussed by many users.
|
| 131 |
+
|
| 132 |
+
### Narrative Overlap Score
|
| 133 |
+
- **Range**: 0 to 1
|
| 134 |
+
- This score measures the **cosine similarity** between the topic distributions of two users.
|
| 135 |
+
- A score of **1.0** means the two users have an identical proportional interest in topics (e.g., both are 100% focused on Topic 3).
|
| 136 |
+
- A score of **0.0** means their topic interests are completely different.
|
| 137 |
+
- This helps identify users with similar narrative focus, regardless of their total post count.
|
| 138 |
+
|
requirements.txt
CHANGED
|
@@ -1,3 +1,19 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
streamlit>=1.17.0
|
| 2 |
+
bertopic[all]>=0.16.0
|
| 3 |
+
pandas>=2.0.0
|
| 4 |
+
numpy>=1.20.0
|
| 5 |
+
plotly>=5.0.0
|
| 6 |
+
transformers>=4.21.0
|
| 7 |
+
sentence-transformers>=2.2.0
|
| 8 |
+
scikit-learn>=1.0.0
|
| 9 |
+
hdbscan>=0.8.29
|
| 10 |
+
umap-learn>=0.5.0
|
| 11 |
+
torch>=1.11.0
|
| 12 |
+
matplotlib>=3.5.0
|
| 13 |
+
seaborn>=0.11.0
|
| 14 |
+
gensim>=4.3.0
|
| 15 |
+
nltk>=3.8.0
|
| 16 |
+
wordcloud>=1.9.0
|
| 17 |
+
emoji>=2.2.0
|
| 18 |
+
spacy>=3.4.0
|
| 19 |
+
pyinstaller
|
resource_path.py
ADDED
|
@@ -0,0 +1,12 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import sys
|
| 2 |
+
import os
|
| 3 |
+
|
| 4 |
+
def resource_path(relative_path):
|
| 5 |
+
""" Get absolute path to resource, works for dev and for PyInstaller """
|
| 6 |
+
try:
|
| 7 |
+
# PyInstaller creates a temp folder and stores path in _MEIPASS
|
| 8 |
+
base_path = sys._MEIPASS
|
| 9 |
+
except Exception:
|
| 10 |
+
base_path = os.path.abspath(".")
|
| 11 |
+
|
| 12 |
+
return os.path.join(base_path, relative_path)
|
run.py
ADDED
|
@@ -0,0 +1,42 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# run.py
|
| 2 |
+
|
| 3 |
+
import streamlit.web.cli as stcli
|
| 4 |
+
import os
|
| 5 |
+
import sys
|
| 6 |
+
from resource_path import resource_path # Import resource_path
|
| 7 |
+
|
| 8 |
+
def run_streamlit():
    """Entry point: resolve app.py's location and hand off to the Streamlit CLI.

    Works both in development and inside a PyInstaller bundle, where the
    application files are unpacked into the directory named by
    ``sys._MEIPASS``.
    """
    # Resolve the directory that actually contains app.py at runtime.
    frozen = hasattr(sys, '_MEIPASS')
    base_path = sys._MEIPASS if frozen else os.path.abspath(os.path.dirname(__file__))
    app_path = os.path.join(base_path, 'app.py')

    # Debug aid for diagnosing broken PyInstaller builds.
    print(f"DEBUG: Calculated Streamlit app_path: {app_path}")

    # Fail fast with a visible message if the bundle is missing app.py.
    if not os.path.exists(app_path):
        print(f"FATAL: The file does NOT exist at the expected path: {app_path}")
        sys.exit(1)

    # Hand-craft the argv the Streamlit CLI expects, then invoke it.
    sys.argv = [
        "streamlit",
        "run",
        app_path,
        "--server.port=8501",
        "--server.headless=true",
        "--global.developmentMode=false",
    ]
    sys.exit(stcli.main())


if __name__ == "__main__":
    run_streamlit()
|
sample_data.csv
ADDED
|
@@ -0,0 +1,27 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
user_id,post_content,timestamp
|
| 2 |
+
user1,I love watching movies especially action and thriller films. The cinematography is amazing.,2023-01-01 10:00:00
|
| 3 |
+
user2,My new smartphone has incredible camera quality and battery life. Technology is advancing so fast.,2023-01-01 11:00:00
|
| 4 |
+
user1,Just finished watching a sci-fi movie. The special effects were mind-blowing and the story was captivating.,2023-01-02 10:30:00
|
| 5 |
+
user3,Learning about artificial intelligence and machine learning algorithms. The future of technology is fascinating.,2023-01-02 14:00:00
|
| 6 |
+
user2,Need to upgrade my old laptop. It's getting slow and can't handle modern software efficiently.,2023-01-03 09:00:00
|
| 7 |
+
user1,The soundtrack of that movie was incredible. Music really enhances the emotional impact of films.,2023-01-03 16:00:00
|
| 8 |
+
user4,Exploring the mysteries of space and astronomy. The universe is full of wonders waiting to be discovered.,2023-01-04 08:00:00
|
| 9 |
+
user3,Data science and predictive analytics are revolutionizing business intelligence and decision making processes.,2023-01-04 12:00:00
|
| 10 |
+
user2,Shopping for a new computer with better performance. Need something powerful for work and gaming.,2023-01-05 10:00:00
|
| 11 |
+
user1,Reading about quantum physics and theoretical concepts. Science fiction is becoming science fact.,2023-01-05 15:00:00
|
| 12 |
+
user5,Cooking is my passion. Today I experimented with spicy Thai cuisine and aromatic herbs.,2023-01-06 09:30:00
|
| 13 |
+
user4,The cosmos holds infinite mysteries. Black holes and dark matter continue to puzzle scientists worldwide.,2023-01-06 13:00:00
|
| 14 |
+
user3,Deep learning neural networks are achieving remarkable results in image recognition and natural language processing.,2023-01-07 11:00:00
|
| 15 |
+
user2,My laptop keeps crashing during important presentations. Definitely time for a hardware upgrade.,2023-01-07 14:30:00
|
| 16 |
+
user1,Science fiction films always make me contemplate the future of humanity and technological advancement.,2023-01-08 10:00:00
|
| 17 |
+
user6,Traveling to different countries and experiencing diverse cultures. Food and traditions vary so much globally.,2023-01-08 12:00:00
|
| 18 |
+
user5,Experimenting with fusion cuisine combining Asian and European cooking techniques. Flavors are incredible.,2023-01-09 09:00:00
|
| 19 |
+
user4,Studying astrophysics and cosmology. The scale of the universe is beyond human comprehension.,2023-01-09 14:00:00
|
| 20 |
+
user3,Machine learning models are becoming more sophisticated. Artificial neural networks mimic human brain functions.,2023-01-10 11:30:00
|
| 21 |
+
user6,Visited an art museum today. The paintings and sculptures were breathtaking and emotionally moving.,2023-01-10 16:00:00
|
| 22 |
+
user1,Watching classic films from the golden age of cinema. The storytelling techniques were masterful.,2023-01-11 10:15:00
|
| 23 |
+
user2,Finally bought a new gaming laptop with advanced graphics card and high-speed processor.,2023-01-11 13:45:00
|
| 24 |
+
user5,Learning traditional cooking methods from different cultures. Each region has unique culinary secrets.,2023-01-12 08:30:00
|
| 25 |
+
user4,Observing celestial objects through my telescope. Saturn's rings are absolutely magnificent tonight.,2023-01-12 20:00:00
|
| 26 |
+
user3,Working on a computer vision project using convolutional neural networks for object detection.,2023-01-13 09:15:00
|
| 27 |
+
|
start-streamlit.sh
ADDED
|
@@ -0,0 +1,12 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/bin/bash
|
| 2 |
+
|
| 3 |
+
# IMPORTANT: Replace the path below with the actual path to your miniconda/anaconda installation
|
| 4 |
+
# This ensures the 'conda' command is available to the script.
|
| 5 |
+
source /Users/mariamalmutairi/miniconda3/etc/profile.d/conda.sh
|
| 6 |
+
|
| 7 |
+
# Activate your specific Python environment
|
| 8 |
+
conda activate nlp
|
| 9 |
+
|
| 10 |
+
# Run the streamlit app. The '--server.headless true' flag is a good practice
|
| 11 |
+
# as it prevents Streamlit from opening a new browser tab on its own.
|
| 12 |
+
streamlit run app.py --server.headless true
|
text_preprocessor.py
ADDED
|
@@ -0,0 +1,131 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import re
|
| 2 |
+
import string
|
| 3 |
+
import pandas as pd
|
| 4 |
+
import spacy
|
| 5 |
+
import emoji
|
| 6 |
+
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
| 7 |
+
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
|
| 8 |
+
from spacy.util import compile_infix_regex
|
| 9 |
+
from pathlib import Path
|
| 10 |
+
|
| 11 |
+
from resource_path import resource_path
|
| 12 |
+
|
| 13 |
+
|
| 14 |
+
class MultilingualPreprocessor:
    """
    A robust text preprocessor using spaCy for multilingual support.

    Cleaning runs in two stages: fast vectorized regex/emoji passes over
    the whole Series, then token-level filtering (stopwords, punctuation,
    numbers, lemmatization) via a spaCy pipeline.
    """

    def __init__(self, language: str):
        """
        Initializes the preprocessor and loads the appropriate spaCy model.

        Args:
            language (str): 'english' or 'multilingual'. Unknown values
                fall back to the multilingual model.

        Raises:
            OSError: If the spaCy model is not installed/bundled.
        """
        import sys

        model_map = {
            'english': 'en_core_web_sm',
            'multilingual': 'xx_ent_wiki_sm'
        }
        self.model_name = model_map.get(language, 'xx_ent_wiki_sm')

        try:
            if hasattr(sys, '_MEIPASS'):
                # PyInstaller mode: the model is bundled as data files, so
                # load it from the extracted path rather than by name.
                model_path_obj = Path(resource_path(self.model_name))
                self.nlp = spacy.util.load_model_from_path(model_path_obj)
            else:
                # Normal development mode: load by installed package name.
                self.nlp = spacy.load(self.model_name)

        except OSError:
            print(f"spaCy Model Error: Could not load model '{self.model_name}'")
            print(f"Please run: python -m spacy download {self.model_name}")
            raise

        # Customize tokenizer to not split on hyphens in words.
        # NOTE: CONCAT_QUOTES must be wrapped in a list [] to concatenate.
        infixes = LIST_ELLIPSES + LIST_ICONS + [CONCAT_QUOTES]
        infix_regex = compile_infix_regex(infixes)
        self.nlp.tokenizer.infix_finditer = infix_regex.finditer

    def preprocess_series(self, text_series: pd.Series, options: dict, n_process_spacy: int = -1) -> pd.Series:
        """
        Applies a series of cleaning steps to a pandas Series of text.

        Args:
            text_series (pd.Series): The text to be cleaned.
            options (dict): Preprocessing flags. Recognized keys include
                'remove_html', 'remove_urls', 'handle_hashtags',
                'handle_mentions', 'handle_emojis', 'remove_punctuation',
                'remove_numbers', 'remove_stopwords', 'lemmatize',
                'lowercase', 'remove_special_chars', 'custom_stopwords'.
            n_process_spacy (int): Worker count for spaCy's nlp.pipe
                (-1 uses all cores).

        Returns:
            pd.Series: The cleaned text, index-aligned with the input.
        """
        # --- Stage 1: Fast, regex-based cleaning (combined for performance) ---
        processed_text = text_series.copy().astype(str)

        # Collect every removal pattern so they can run in one pass.
        regex_patterns = []
        if options.get("remove_html"):
            regex_patterns.append(r"<.*?>")
        if options.get("remove_urls"):
            regex_patterns.append(r"http\S+|www\.\S+")
        if options.get("handle_hashtags") == "Remove Hashtags":
            regex_patterns.append(r"#\w+")
        if options.get("handle_mentions") == "Remove Mentions":
            regex_patterns.append(r"@\w+")

        if regex_patterns:
            combined_pattern = "|".join(regex_patterns)
            processed_text = processed_text.str.replace(combined_pattern, "", regex=True)

        # Emoji handling (separate, as it needs the dedicated library).
        emoji_option = options.get("handle_emojis", "Keep Emojis")
        if emoji_option == "Remove Emojis":
            processed_text = processed_text.apply(lambda s: emoji.replace_emoji(s, replace=''))
        elif emoji_option == "Convert Emojis to Text":
            processed_text = processed_text.apply(emoji.demojize)

        # --- Stage 2: spaCy-based advanced processing ---
        # Using nlp.pipe for efficiency on a Series.
        cleaned_docs = []
        docs = self.nlp.pipe(processed_text, n_process=n_process_spacy, batch_size=500)

        # BUGFIX: lowercase the custom stopwords so they match the
        # lowercased token text used in the comparison below (previously
        # uppercase entries could never match).
        custom_stopwords = {w.lower() for w in options.get("custom_stopwords", [])}

        for doc in docs:
            tokens = []
            for token in doc:
                # Punctuation and number handling.
                if options.get("remove_punctuation") and token.is_punct:
                    continue
                if options.get("remove_numbers") and (token.is_digit or token.like_num):
                    continue

                # Stopword handling (built-in plus custom stopwords).
                is_stopword = token.is_stop or token.text.lower() in custom_stopwords
                if options.get("remove_stopwords") and is_stopword:
                    continue

                # Use the lemma when lemmatization is on, else the raw text.
                token_text = token.lemma_ if options.get("lemmatize") else token.text

                if options.get("lowercase"):
                    token_text = token_text.lower()

                # Strip any leftover special characters (keeps word chars,
                # whitespace, and hyphens).
                if options.get("remove_special_chars"):
                    token_text = re.sub(r'[^\w\s-]', '', token_text)

                if token_text.strip():
                    tokens.append(token_text.strip())

            cleaned_docs.append(" ".join(tokens))

        return pd.Series(cleaned_docs, index=text_series.index)
|
topic_evolution.py
ADDED
|
@@ -0,0 +1,100 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import pandas as pd
|
| 2 |
+
from bertopic import BERTopic
|
| 3 |
+
from bertopic.representation import KeyBERTInspired
|
| 4 |
+
|
| 5 |
+
|
| 6 |
+
def analyze_general_topic_evolution(topic_model, docs, timestamps):
    """
    Compute how topics evolve over time for the whole corpus.

    Args:
        topic_model: A fitted BERTopic model.
        docs (list): The documents the model was fitted on.
        timestamps (list): One timestamp per document.

    Returns:
        pd.DataFrame: Rows of Topic/Words/Frequency/Timestamp, or an empty
        frame with those columns when the evolution cannot be computed
        (e.g. the dataset is too small for a meaningful time series).
    """
    try:
        return topic_model.topics_over_time(docs, timestamps, global_tuning=True)
    except Exception:
        # Graceful fallback: keep the downstream code working on tiny inputs.
        return pd.DataFrame(columns=['Topic', 'Words', 'Frequency', 'Timestamp'])
|
| 24 |
+
|
| 25 |
+
|
| 26 |
+
def analyze_user_topic_evolution(df: pd.DataFrame, topic_model):
    """
    Compute topic evolution separately for every user in the frame.

    Args:
        df (pd.DataFrame): Must contain "user_id", "post_content",
            "timestamp" and "topic_id" columns.
        topic_model: A fitted BERTopic model.

    Returns:
        dict: Maps each user_id to a topics-over-time DataFrame. Users with
        fewer than two posts — or for whom the computation fails — get an
        empty frame with the standard columns instead.
    """
    evolution_by_user = {}
    # sort=False keeps users in first-appearance order, matching .unique()
    for user_id, posts in df.groupby("user_id", sort=False):
        if len(posts) <= 1:
            # Not enough posts to form a time series for this user.
            evolution_by_user[user_id] = pd.DataFrame(
                columns=['Topic', 'Words', 'Frequency', 'Timestamp'])
            continue
        try:
            # topics_over_time expects chronologically ordered inputs.
            ordered = posts.sort_values(by="timestamp")
            evolution_by_user[user_id] = topic_model.topics_over_time(
                ordered["post_content"].tolist(),
                ordered["timestamp"].tolist(),
                topics=ordered["topic_id"].tolist(),
                global_tuning=True,
            )
        except Exception:
            evolution_by_user[user_id] = pd.DataFrame(
                columns=['Topic', 'Words', 'Frequency', 'Timestamp'])
    return evolution_by_user
|
| 55 |
+
|
| 56 |
+
if __name__ == "__main__":
    # Example usage / smoke test for the evolution helpers.
    #
    # Bug fixes vs. the original example:
    #  * perform_topic_modeling was never imported in this module;
    #  * it expects a List[str] of documents, not the whole DataFrame;
    #  * it returns FOUR values (model, topics, probs, coherence score),
    #    not three.
    from topic_modeling import perform_topic_modeling

    data = {
        "user_id": ["user1", "user2", "user1", "user3", "user2", "user1", "user4", "user3", "user2", "user1", "user5", "user4", "user3", "user2", "user1"],
        "post_content": [
            "This is a great movie, I loved the acting and the plot. It was truly captivating.",
            "The new phone has an amazing camera and long battery life. Highly recommend it.",
            "I enjoyed the film, especially the special effects and the soundtrack. A must-watch.",
            "Learning about AI and machine learning is fascinating. The future is here.",
            "My old phone is so slow, I need an upgrade soon. Thinking about the latest model.",
            "The best part of the movie was the soundtrack and the stunning visuals. Very immersive.",
            "Exploring the vastness of space is a lifelong dream. Astronomy is amazing.",
            "Data science is revolutionizing industries. Predictive analytics is key.",
            "I need a new laptop for work. Something powerful and portable.",
            "Just finished reading a fantastic book on quantum physics. Mind-blowing concepts.",
            "Cooking new recipes is my passion. Today, I tried a spicy Thai curry.",
            "The universe is full of mysteries. Black holes and dark matter are intriguing.",
            "Deep learning models are becoming incredibly sophisticated. Image recognition is impressive.",
            "My current laptop is crashing frequently. Time for an upgrade.",
            "Science fiction movies always make me think about the future of humanity."
        ],
        "timestamp": [
            "2023-01-01 10:00:00", "2023-01-01 11:00:00", "2023-01-02 10:30:00",
            "2023-01-02 14:00:00", "2023-01-03 09:00:00", "2023-01-03 16:00:00",
            "2023-01-04 08:00:00", "2023-01-04 12:00:00", "2023-01-05 10:00:00",
            "2023-01-05 15:00:00", "2023-01-06 09:30:00", "2023-01-06 13:00:00",
            "2023-01-07 11:00:00", "2023-01-07 14:30:00", "2023-01-08 10:00:00"
        ]
    }
    df = pd.DataFrame(data)
    df["timestamp"] = pd.to_datetime(df["timestamp"])

    print("Performing topic modeling (English)...")
    model_en, topics_en, probs_en, coherence_en = perform_topic_modeling(
        df["post_content"].tolist(), language="english"
    )
    df["topic_id"] = topics_en

    print("\nAnalyzing general topic evolution...")
    general_evolution_df = analyze_general_topic_evolution(
        model_en, df["post_content"].tolist(), df["timestamp"].tolist()
    )
    print(general_evolution_df.head())

    print("\nAnalyzing per user topic evolution...")
    user_evolution_dict = analyze_user_topic_evolution(df, model_en)
    for user_id, evolution_df in user_evolution_dict.items():
        print(f"\nTopic evolution for {user_id}:")
        print(evolution_df.head())
|
topic_modeling.py
ADDED
|
@@ -0,0 +1,88 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# topic_modeling.py
|
| 2 |
+
|
| 3 |
+
import random
|
| 4 |
+
import pandas as pd
|
| 5 |
+
from bertopic import BERTopic
|
| 6 |
+
from gensim.corpora import Dictionary
|
| 7 |
+
from gensim.models import CoherenceModel
|
| 8 |
+
from nltk.tokenize import word_tokenize
|
| 9 |
+
from typing import List
|
| 10 |
+
from sklearn.feature_extraction.text import CountVectorizer
|
| 11 |
+
|
| 12 |
+
def perform_topic_modeling(
    docs: List[str],
    language: str = "english",
    nr_topics=None,
    remove_stopwords_bertopic: bool = False,  # remove stopwords inside BERTopic's vectorizer
    custom_stopwords: List[str] = None
):
    """
    Performs topic modeling on a list of documents.

    Args:
        docs (List[str]): A list of documents. Stopwords should be INCLUDED
            for best results — the embedding model benefits from full text.
        language (str): Language for the BERTopic model ('english', 'multilingual').
        nr_topics: The number of topics to find ("auto" or an int).
        remove_stopwords_bertopic (bool): If True, stopwords are removed
            internally by BERTopic's CountVectorizer (topic representations
            only; embeddings still see the full text).
        custom_stopwords (List[str]): Extra stopwords to merge in.

    Returns:
        tuple: (BERTopic model, topics, probabilities, coherence score).
        The coherence score is None when it cannot be computed.
    """
    # --- Optional in-model stopword removal via a custom vectorizer ---
    vectorizer_model = None
    if remove_stopwords_bertopic:
        # A set dedupes the built-in list against any user-supplied words.
        stop_words = set()
        if language == "english":
            from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
            stop_words.update(ENGLISH_STOP_WORDS)
        if custom_stopwords:
            stop_words.update(custom_stopwords)
        if stop_words:
            vectorizer_model = CountVectorizer(stop_words=list(stop_words))

    # The original had identical 'multilingual' and 'else' branches —
    # BERTopic interprets the language string itself, so one call suffices.
    topic_model = BERTopic(
        language=language, nr_topics=nr_topics, vectorizer_model=vectorizer_model
    )

    # 'docs' should contain stopwords for the embedding model to work best.
    topics, probs = topic_model.fit_transform(docs)

    # --- Coherence score (c_v) on a sample for speed ---
    # 2000 docs gives an accurate-enough estimate on large corpora.
    max_coherence_docs = 2000
    if len(docs) > max_coherence_docs:
        sample_docs = random.sample(docs, max_coherence_docs)
    else:
        sample_docs = docs

    tokenized_docs = [word_tokenize(doc) for doc in sample_docs]
    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

    # Reuse the get_topics() mapping instead of calling get_topic() per id
    # (also guards against a topic with an empty/falsy word list).
    topic_words = topic_model.get_topics()
    topics_for_coherence = []
    for topic_id in sorted(topic_words):
        if topic_id == -1:
            continue  # -1 is BERTopic's outlier topic
        word_scores = topic_words[topic_id]
        if word_scores:
            topics_for_coherence.append([word for word, _ in word_scores])

    coherence_score = None
    if topics_for_coherence and corpus:
        try:
            coherence_model = CoherenceModel(
                topics=topics_for_coherence,
                texts=tokenized_docs,
                dictionary=dictionary,
                coherence='c_v'
            )
            coherence_score = coherence_model.get_coherence()
        except Exception as e:
            # Coherence is best-effort: topic words missing from the sampled
            # dictionary (and similar issues) should not abort the pipeline.
            print(f"Could not calculate coherence score: {e}")

    return topic_model, topics, probs, coherence_score
|