Mars203020 committed on
Commit b7b041e · verified · 1 Parent(s): c5c609e

Upload 17 files

Deployment Guide.md ADDED
@@ -0,0 +1,267 @@
+ # Deployment Guide
+
+ This guide covers various deployment options for the Social Media Topic Modeling System.
+
+ ## Local Development
+
+ ### Quick Start
+ ```bash
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Run the application
+ streamlit run app.py
+ ```
+
+ ### Development with Docker
+ ```bash
+ # Build and run with Docker Compose
+ docker-compose up --build
+
+ # Or build and run manually
+ docker build -t topic-modeling-app .
+ docker run -p 8501:8501 topic-modeling-app
+ ```
+
+ ## Production Deployment
+
+ ### Docker Production Setup
+
+ 1. **Build the production image:**
+ ```bash
+ docker build -t topic-modeling-app:latest .
+ ```
+
+ 2. **Run with production settings:**
+ ```bash
+ docker run -d \
+   --name topic-modeling-prod \
+   -p 8501:8501 \
+   --memory=4g \
+   --cpus=2 \
+   --restart=unless-stopped \
+   topic-modeling-app:latest
+ ```
+
+ 3. **Using Docker Compose for production:**
+ ```yaml
+ version: '3.8'
+ services:
+   topic-modeling-app:
+     build: .
+     ports:
+       - "8501:8501"
+     environment:
+       - STREAMLIT_SERVER_PORT=8501
+       - STREAMLIT_SERVER_ADDRESS=0.0.0.0
+     volumes:
+       - ./data:/app/data
+     restart: unless-stopped
+     deploy:
+       resources:
+         limits:
+           memory: 4G
+           cpus: '2'
+     healthcheck:
+       test: ["CMD", "curl", "-f", "http://localhost:8501/_stcore/health"]
+       interval: 30s
+       timeout: 10s
+       retries: 3
+ ```
+
+ ### Cloud Deployment Options
+
+ #### 1. AWS ECS/Fargate
+ ```bash
+ # Tag for ECR
+ docker tag topic-modeling-app:latest your-account.dkr.ecr.region.amazonaws.com/topic-modeling-app:latest
+
+ # Push to ECR
+ docker push your-account.dkr.ecr.region.amazonaws.com/topic-modeling-app:latest
+ ```
+
+ #### 2. Google Cloud Run
+ ```bash
+ # Build and deploy to Cloud Run
+ gcloud run deploy topic-modeling-app \
+   --image gcr.io/your-project/topic-modeling-app \
+   --platform managed \
+   --region us-central1 \
+   --memory 4Gi \
+   --cpu 2
+ ```
+
+ #### 3. Azure Container Instances
+ ```bash
+ # Deploy to Azure
+ az container create \
+   --resource-group myResourceGroup \
+   --name topic-modeling-app \
+   --image your-registry.azurecr.io/topic-modeling-app:latest \
+   --cpu 2 \
+   --memory 4 \
+   --ports 8501
+ ```
+
+ #### 4. Heroku
+ ```bash
+ # Login to Heroku Container Registry
+ heroku container:login
+
+ # Build and push
+ heroku container:push web --app your-app-name
+
+ # Release
+ heroku container:release web --app your-app-name
+ ```
+
+ ### Kubernetes Deployment
+
+ #### Deployment YAML
+ ```yaml
+ apiVersion: apps/v1
+ kind: Deployment
+ metadata:
+   name: topic-modeling-app
+ spec:
+   replicas: 3
+   selector:
+     matchLabels:
+       app: topic-modeling-app
+   template:
+     metadata:
+       labels:
+         app: topic-modeling-app
+     spec:
+       containers:
+       - name: topic-modeling-app
+         image: topic-modeling-app:latest
+         ports:
+         - containerPort: 8501
+         resources:
+           requests:
+             memory: "2Gi"
+             cpu: "1"
+           limits:
+             memory: "4Gi"
+             cpu: "2"
+         env:
+         - name: STREAMLIT_SERVER_PORT
+           value: "8501"
+         - name: STREAMLIT_SERVER_ADDRESS
+           value: "0.0.0.0"
+ ---
+ apiVersion: v1
+ kind: Service
+ metadata:
+   name: topic-modeling-service
+ spec:
+   selector:
+     app: topic-modeling-app
+   ports:
+   - port: 80
+     targetPort: 8501
+   type: LoadBalancer
+ ```
+
+ ## Performance Optimization
+
+ ### Memory Management
+ - **Minimum RAM**: 4GB for small datasets (< 1000 documents)
+ - **Recommended RAM**: 8GB+ for larger datasets
+ - **Large datasets**: Consider processing in batches
+
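The batch-processing suggestion above can be sketched as follows. This is a minimal, standalone illustration; `iter_batches` is a hypothetical helper and is not part of the application's code.

```python
# Hypothetical helper (not part of this app): split a large document list
# into fixed-size batches so peak memory use stays bounded.
def iter_batches(items, batch_size=1000):
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

docs = [f"post {i}" for i in range(2500)]
batch_sizes = [len(batch) for batch in iter_batches(docs)]
print(batch_sizes)  # [1000, 1000, 500]
```

Each batch can then be preprocessed and discarded before the next one is loaded, which keeps the container inside the memory limits above.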
+ ### CPU Optimization
+ - **Minimum**: 2 CPU cores
+ - **Recommended**: 4+ CPU cores for faster processing
+ - **GPU**: Optional, can speed up transformer models
+
+ ### Storage Considerations
+ - **Docker image**: ~2GB
+ - **Temporary files**: Varies with dataset size
+ - **Persistent storage**: Optional for saving results
+
+ ## Monitoring and Logging
+
+ ### Health Checks
+ The application includes built-in health checks:
+ ```bash
+ # Check application health
+ curl http://localhost:8501/_stcore/health
+ ```
+
+ ### Logging
+ Streamlit logs are available through Docker:
+ ```bash
+ # View logs
+ docker logs topic-modeling-app
+
+ # Follow logs
+ docker logs -f topic-modeling-app
+ ```
+
+ ### Monitoring with Prometheus
+ For Prometheus-style monitoring, start by collecting system metrics in-app (a separate exporter endpoint is still needed for scraping):
+ ```python
+ # Add to app.py for monitoring
+ import time
+ import psutil
+
+ # Snapshot of system metrics; the ttl keeps the cached value from going stale
+ @st.cache_data(ttl=10)
+ def get_system_metrics():
+     return {
+         'cpu_percent': psutil.cpu_percent(),
+         'memory_percent': psutil.virtual_memory().percent,
+         'timestamp': time.time()
+     }
+ ```
+
+ ## Security Considerations
+
+ ### Container Security
+ - Run as a non-root user (included in the Dockerfile)
+ - Use minimal base images
+ - Regularly update dependencies
+
+ ### Network Security
+ - Use HTTPS in production
+ - Implement proper firewall rules
+ - Consider a VPN for internal access
+
+ ### Data Security
+ - Encrypt data at rest and in transit
+ - Implement proper access controls
+ - Perform regular security audits
+
+ ## Troubleshooting
+
+ ### Common Issues
+
+ 1. **Out of Memory Errors**
+    - Increase container memory limits
+    - Process smaller datasets
+    - Use batch processing
+
+ 2. **Slow Performance**
+    - Increase CPU allocation
+    - Use SSD storage
+    - Optimize dataset size
+
+ 3. **Container Won't Start**
+    - Check logs: `docker logs container-name`
+    - Verify port availability
+    - Check resource limits
+
+ 4. **Model Loading Issues**
+    - Ensure internet connectivity for model downloads
+    - Pre-download models in the Docker build
+    - Check disk space
+
+ ### Support
+ For deployment issues:
+ 1. Check the logs first
+ 2. Verify system requirements
+ 3. Test with sample data
+ 4. Check network connectivity
+
Dockerfile CHANGED
@@ -1,20 +1,64 @@
- FROM python:3.13.5-slim
+ # Use Python base image (avoid slim on HF)
+ FROM python:3.11

+ # Set working directory
  WORKDIR /app

+ # Environment variables (use port 7860 for HF Spaces)
+ ENV PYTHONDONTWRITEBYTECODE=1 \
+     PYTHONUNBUFFERED=1 \
+     STREAMLIT_SERVER_PORT=7860 \
+     STREAMLIT_SERVER_ADDRESS=0.0.0.0 \
+     STREAMLIT_BROWSER_GATHER_USAGE_STATS=false \
+     STREAMLIT_SERVER_HEADLESS=true

- RUN apt-get update && apt-get install -y \
+ # Install system dependencies (HF-safe)
+ RUN apt-get update --fix-missing && \
+     apt-get install -y --no-install-recommends \
      build-essential \
      curl \
      git \
- && rm -rf /var/lib/apt/lists/*
+     fontconfig \
+     fonts-dejavu-core && \
+     fc-cache -f && \
+     rm -rf /var/lib/apt/lists/*

- COPY requirements.txt ./
- COPY src/ ./src/
+ # Copy requirements first (better cache)
+ COPY requirements.txt .

- RUN pip3 install -r requirements.txt
+ # Install Python dependencies
+ RUN pip install --no-cache-dir --upgrade pip && \
+     pip install --no-cache-dir -r requirements.txt

+ # Download spaCy models (required for text preprocessing)
+ RUN python -m spacy download en_core_web_sm && \
+     python -m spacy download xx_ent_wiki_sm

+ # Download NLTK data (required for coherence calculation)
+ RUN python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"

+ # Copy application files
+ COPY app.py .
+ COPY topic_modeling.py .
+ COPY text_preprocessor.py .
+ COPY gini_calculator.py .
+ COPY topic_evolution.py .
+ COPY narrative_similarity.py .
+ COPY resource_path.py .
+ COPY sample_data.csv .

+ # Copy Streamlit config (fixes 403 upload error)
+ COPY .streamlit/config.toml .streamlit/config.toml

+ # Create non-root user (HF compatible)
+ RUN useradd -m appuser
+ USER appuser

- EXPOSE 8501
+ # Expose Streamlit port (7860 for HF Spaces)
+ EXPOSE 7860

- HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health
+ # Health check
+ HEALTHCHECK CMD curl --fail http://localhost:7860/_stcore/health || exit 1

- ENTRYPOINT ["streamlit", "run", "src/streamlit_app.py", "--server.port=8501", "--server.address=0.0.0.0"]
+ # Run Streamlit
+ CMD ["streamlit", "run", "app.py", "--server.port=7860", "--server.address=0.0.0.0"]
Social Media Topic Modeling System.md ADDED
@@ -0,0 +1,99 @@
+ # Social Media Topic Modeling System
+
+ A comprehensive topic modeling system for social media analysis built with Streamlit and BERTopic. This application supports flexible CSV column mapping, multilingual topic modeling, Gini coefficient calculation, and topic evolution analysis.
+
+ ## Features
+
+ - **📊 Topic Modeling**: Uses BERTopic for state-of-the-art topic modeling.
+ - **⚙️ Flexible Configuration**:
+   - **Custom Column Mapping**: Use any CSV file by mapping your columns to `user_id`, `post_content`, and `timestamp`.
+   - **Topic Number Control**: Let the model find topics automatically or specify the exact number you need.
+ - **🌍 Multilingual Support**: Handles English and 50+ other languages.
+ - **📈 Gini Coefficient Analysis**: Calculates topic distribution inequality per user and per topic.
+ - **⏰ Topic Evolution**: Tracks how topics change over time.
+ - **🎯 Interactive Visualizations**: Built-in charts and data tables using Plotly.
+ - **📱 Responsive Interface**: Clean, modern Streamlit interface with a control sidebar.
+
+ ## Requirements
+
+ ### CSV File Format
+
+ Your CSV file must contain columns that can be mapped to the following roles:
+ - **User ID**: A column with unique identifiers for each user (string).
+ - **Post Content**: A column with the text content of the social media post (string).
+ - **Timestamp**: A column with the date and time of the post (e.g., "2023-01-15 14:30:00").
+
+ The application will prompt you to select the correct column for each role after you upload your file.
+
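As a standalone illustration of this mapping step (not the app's own code; the column names `author`, `text`, and `when` are invented for the example), renaming arbitrary CSV headers onto the three roles with pandas looks like:

```python
import pandas as pd
from io import StringIO

# Toy CSV with arbitrary column names; the second row has a bad timestamp
raw = StringIO("author,text,when\nu1,hello world,2023-01-15 14:30:00\nu2,bad row,not-a-date\n")
df = pd.read_csv(raw)

# Map the user's columns onto the roles the app expects
df = df.rename(columns={"author": "user_id", "text": "post_content", "when": "timestamp"})

# Lenient timestamp parsing: invalid values become NaT and can be dropped
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
df = df.dropna(subset=["timestamp"])
print(len(df))  # 1
```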
+ ### Dependencies
+
+ See `requirements.txt` for a full list of dependencies.
+
+ ## Installation
+
+ ### Option 1: Local Installation
+
+ 1. **Clone or download the project files.**
+ 2. **Install dependencies:**
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ ### Option 2: Docker Installation (Recommended)
+
+ 1. **Using Docker Compose (easiest):**
+    ```bash
+    docker-compose up --build
+    ```
+ 2. **Access the application:**
+    ```
+    http://localhost:8501
+    ```
+
+ ## Usage
+
+ 1. **Start the Streamlit application:**
+    ```bash
+    streamlit run app.py
+    ```
+ 2. **Open your browser** and navigate to `http://localhost:8501`.
+ 3. **Follow the steps in the sidebar:**
+    - **1. Upload CSV File**: Click "Browse files" to upload your dataset.
+    - **2. Map Data Columns**: Once uploaded, select which of your columns correspond to `User ID`, `Post Content`, and `Timestamp`.
+    - **3. Configure Analysis**:
+      - **Language Model**: Choose `english` for English-only data or `multilingual` for other languages.
+      - **Number of Topics**: Enter a specific number of topics to find, or use `-1` to let the model decide automatically.
+      - **Custom Stopwords**: (Optional) Enter comma-separated words to exclude from analysis.
+    - **4. Run Analysis**: Click the "🚀 Run Full Analysis" button.
+ 4. **Explore the results** in the five interactive tabs in the main panel.
+
+ ### Using the Interface
+
+ The application provides five main tabs:
+
+ #### 📋 Overview
+ - Key metrics, dataset preview, and average Gini coefficient.
+
+ #### 🎯 Topics
+ - Topic information table and topic distribution bar chart.
+
+ #### 📊 Gini Analysis
+ - Analysis of topic diversity for each user and user concentration for each topic.
+
+ #### 📈 Topic Evolution
+ - Timelines showing how topic popularity changes over time, for all users and for individual users.
+
+ #### 📄 Documents
+ - A detailed view of your original data with assigned topics and probabilities.
+
+ ## Understanding the Results
+
+ ### Gini Coefficient
+ - **Range**: 0 to 1
+ - **User Gini**: Measures how diverse a user's topics are. **0** = perfectly diverse (posts on many topics), **1** = perfectly specialized (posts on one topic).
+ - **Topic Gini**: Measures how concentrated a topic is among users. **0** = widely discussed by many users, **1** = dominated by a few users.
+
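A minimal, self-contained sketch of a Gini coefficient over a user's topic counts (illustrative only; the app's actual implementation lives in `gini_calculator.py` and may differ, e.g. by using Gini impurity):

```python
import numpy as np

def gini(counts):
    """Gini coefficient of a count distribution:
    0 = perfectly even, values near 1 = fully concentrated."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    if n == 0 or x.sum() == 0:
        return float("nan")
    # Standard formula via the Lorenz curve of the sorted counts
    lorenz = np.cumsum(x) / x.sum()
    return float((n + 1 - 2 * lorenz.sum()) / n)

print(gini([5, 5, 5, 5]))   # 0.0  (posts spread evenly across four topics)
print(gini([10, 0, 0, 0]))  # 0.75 (all posts on one of four topics)
```

Note that with a finite number of topics the maximum is (n − 1)/n, so the value only approaches 1 as the number of topics grows.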
+ ---
+
+ **Built with ❤️ using Streamlit and BERTopic**
TopicModelingApp.spec ADDED
@@ -0,0 +1,142 @@
+ # topicmodelingapp.spec
+
+ import sys
+ import os
+ import site
+ from pathlib import Path
+
+ from PyInstaller.utils.hooks import collect_all
+ from PyInstaller.building.datastruct import Tree
+
+ # Add the script's directory to the path for local imports
+ sys.path.append(os.path.abspath(os.path.dirname(sys.argv[0])))
+
+ # --- Dynamic Path Logic (Makes the SPEC file generic) ---
+ def get_site_packages_path():
+     """Tries to find the site-packages directory of the current environment."""
+     try:
+         # Tries the standard site.getsitepackages method
+         return Path(site.getsitepackages()[0])
+     except Exception:
+         # Fallback for complex environments like Conda
+         return Path(sys.prefix) / 'lib' / f'python{sys.version_info.major}.{sys.version_info.minor}' / 'site-packages'
+
+ SP_PATH_STR = str(get_site_packages_path()) + os.sep
+
+ def get_model_path(model_name):
+     """Gets the absolute path to an installed spaCy model."""
+     spacy_path = get_site_packages_path()
+     model_dir = spacy_path / model_name
+
+     if not model_dir.exists():
+         raise FileNotFoundError(
+             f"spaCy model '{model_name}' not found at expected location: {model_dir}"
+         )
+     return str(model_dir)
+
+
+ # --- Core Dependency Collection (C-Extension Fix) ---
+
+ # Use collect_all. The output is a 3-tuple: (datas [0], binaries [1], hiddenimports [2])
+ spacy_data = collect_all('spacy')
+ numpy_data = collect_all('numpy')
+ sklearn_data = collect_all('sklearn')
+ hdbscan_data = collect_all('hdbscan')
+ scipy_data = collect_all('scipy')
+
+ # 1. Consolidate ALL hidden imports (index 2 - module names/strings)
+ all_collected_imports = []
+ all_collected_imports.extend(spacy_data[2])
+ all_collected_imports.extend(numpy_data[2])
+ all_collected_imports.extend(sklearn_data[2])
+ all_collected_imports.extend(hdbscan_data[2])
+ all_collected_imports.extend(scipy_data[2])
+
+ # 2. Consolidate all collected data (index 0 - tuples)
+ all_collected_datas = []
+ all_collected_datas.extend(spacy_data[0])
+ all_collected_datas.extend(numpy_data[0])
+ all_collected_datas.extend(sklearn_data[0])
+ all_collected_datas.extend(hdbscan_data[0])
+ all_collected_datas.extend(scipy_data[0])
+
+ # 3. Consolidate all collected binaries (index 1 - tuples of C-extensions/dylibs)
+ all_collected_binaries = []
+ all_collected_binaries.extend(spacy_data[1])
+ all_collected_binaries.extend(numpy_data[1])
+ all_collected_binaries.extend(sklearn_data[1])
+ all_collected_binaries.extend(hdbscan_data[1])
+ all_collected_binaries.extend(scipy_data[1])
+
+
+ # --- Analysis Setup ---
+
+ a = Analysis(
+     # 1. Explicitly list all your source files
+     ['run.py', 'app.py', 'text_preprocessor.py', 'topic_modeling.py', 'gini_calculator.py', 'narrative_similarity.py', 'resource_path.py', 'topic_evolution.py'],
+     pathex=['.'],
+
+     # *** CRITICAL FIX: Use the collected binaries list for C extensions/dylibs ***
+     binaries=all_collected_binaries,
+
+     # 2. The final datas list: collected tuples + manual tuples
+     datas=all_collected_datas + [
+         # Streamlit metadata (Dynamic path and wildcard)
+         (SP_PATH_STR + 'streamlit*.dist-info', 'streamlit_metadata'),
+         (SP_PATH_STR + 'streamlit/static', 'streamlit/static'),
+
+         # Application resources
+         (os.path.abspath('app.py'), '.'),
+         ('readme.md', '.'),
+         ('requirements.txt', '.'),
+     ],
+
+     # 3. The final hiddenimports list: collected strings + manual strings
+     hiddenimports=all_collected_imports + [
+         'charset_normalizer',
+         'streamlit.runtime.scriptrunner.magic_funcs',
+         'spacy.parts_of_speech',
+         'scipy.spatial.ckdtree',
+         'thinc.extra.wrappers',
+         'streamlit.web.cli',
+     ],
+     hookspath=[],
+     hooksconfig={},
+     runtime_hooks=[],
+     # collect_all returns a 3-tuple, so there are no per-package excludes to merge
+     excludes=['tkinter', 'matplotlib.pyplot'],
+     noarchive=False,
+     optimize=0,
+ )
+
+ # 4. Explicitly include the actual spaCy model directories using Tree
+ a.datas.extend(
+     Tree(get_model_path('en_core_web_sm'), prefix='en_core_web_sm')
+ )
+ a.datas.extend(
+     Tree(get_model_path('xx_ent_wiki_sm'), prefix='xx_ent_wiki_sm')
+ )
+
+ pyz = PYZ(a.pure)
+
+ exe = EXE(
+     pyz,
+     a.scripts,
+     a.binaries,
+     a.datas,
+     [],
+     name='TopicModelingApp',
+     debug=False,
+     bootloader_ignore_signals=False,
+     strip=False,
+     upx=True,
+     upx_exclude=[],
+     runtime_tmpdir=None,
+     console=True,
+     disable_windowed_traceback=False,
+     argv_emulation=False,
+     target_arch=None,
+     codesign_identity=None,
+     entitlements_file=None,
+ )
app.py ADDED
@@ -0,0 +1,661 @@
1
+ import streamlit as st
2
+ import pandas as pd
3
+ import numpy as np
4
+ import time
5
+
6
+ import plotly.express as px
7
+ from wordcloud import WordCloud
8
+ import matplotlib.pyplot as plt
9
+
10
+ # Import custom modules
11
+ from text_preprocessor import MultilingualPreprocessor
12
+ from topic_modeling import perform_topic_modeling
13
+ from gini_calculator import calculate_gini_per_user, calculate_gini_per_topic
14
+ from topic_evolution import analyze_general_topic_evolution
15
+ from narrative_similarity import calculate_narrative_similarity, calculate_text_similarity_tfidf
16
+
17
+ # --- Page Configuration ---
18
+ st.set_page_config(
19
+ page_title="Social Media Topic Modeling System",
20
+ page_icon="📊",
21
+ layout="wide",
22
+ )
23
+
24
+ # --- Custom CSS ---
25
+ st.markdown("""
26
+ <style>
27
+ .main-header { font-size: 2.5rem; color: #1f77b4; text-align: center; margin-bottom: 1rem; }
28
+ .sub-header { font-size: 1.75rem; color: #2c3e50; border-bottom: 2px solid #f0f2f6; padding-bottom: 0.3rem; margin-top: 2rem; margin-bottom: 1rem;}
29
+ </style>
30
+ """, unsafe_allow_html=True)
31
+
32
+ # --- Session State Initialization ---
33
+ if 'results' not in st.session_state:
34
+ st.session_state.results = None
35
+ if 'df_raw' not in st.session_state:
36
+ st.session_state.df_raw = None
37
+ if 'custom_stopwords_text' not in st.session_state:
38
+ st.session_state.custom_stopwords_text = ""
39
+ if "topics_info_for_sync" not in st.session_state:
40
+ st.session_state.topics_info_for_sync = []
41
+
42
+
43
+ # --- Helper Functions ---
44
+ @st.cache_data
45
+ def create_word_cloud(_topic_model, topic_id):
46
+ word_freq = _topic_model.get_topic(topic_id)
47
+ if not word_freq: return None
48
+ wc = WordCloud(width=800, height=400, background_color="white", colormap="viridis", max_words=50).generate_from_frequencies(dict(word_freq))
49
+ fig, ax = plt.subplots(figsize=(10, 5))
50
+ ax.imshow(wc, interpolation='bilinear')
51
+ ax.axis("off")
52
+ plt.close(fig)
53
+ return fig
54
+
55
+
56
+
57
+ def interpret_gini(gini_score):
58
+ # Handle NaN or None values
59
+ if gini_score is None or (isinstance(gini_score, float) and np.isnan(gini_score)):
60
+ return "N/A"
61
+ # Logic is now FLIPPED for Gini Impurity
62
+ if gini_score >= 0.6: return "Diverse Interests"
63
+ elif gini_score >= 0.3: return "Moderately Focused"
64
+ else: return "Highly Specialized"
65
+
66
+ # --- START OF DEFINITIVE FIX: Centralized Callback Function ---
67
+ def sync_stopwords():
68
+ """
69
+ This function is the single source of truth for updating stopwords.
70
+ It's called whenever any related widget changes.
71
+ """
72
+ # 1. Get words from all multiselect lists
73
+ selected_from_lists = set()
74
+ for topic_id in st.session_state.topics_info_for_sync:
75
+ key = f"multiselect_topic_{topic_id}"
76
+ if key in st.session_state:
77
+ selected_from_lists.update([s.split(' ')[0] for s in st.session_state[key]])
78
+
79
+ # 2. Get words from the text area
80
+ # The key for the text area is now the master state variable itself.
81
+ typed_stopwords = set([s.strip() for s in st.session_state.custom_stopwords_text.split(',') if s])
82
+
83
+ # 3. Combine them and update the master state variable
84
+ combined_stopwords = typed_stopwords.union(selected_from_lists)
85
+ st.session_state.custom_stopwords_text = ", ".join(sorted(list(combined_stopwords)))
86
+
87
+
88
+ # --- Main Page Layout ---
89
+ st.title("🌍 Multilingual Topic Modeling Dashboard")
90
+ st.markdown("Analyze textual data in multiple languages to discover topics and user trends.")
91
+
92
+ # Use a key to ensure the file uploader keeps its state, and update session_state directly
93
+ uploaded_file = st.file_uploader("Upload your CSV data", type="csv", key="csv_uploader")
94
+
95
+ # Check if a new file has been uploaded (or if it's the first time and a file exists)
96
+ if uploaded_file is not None and uploaded_file != st.session_state.get('last_uploaded_file', None):
97
+ try:
98
+ st.session_state.df_raw = pd.read_csv(uploaded_file)
99
+ st.session_state.results = None # Reset results if a new file is uploaded
100
+ st.session_state.custom_stopwords_text = ""
101
+ st.session_state.last_uploaded_file = uploaded_file # Store the uploaded file itself
102
+ st.success("CSV file loaded successfully!")
103
+ except Exception as e:
104
+ st.error(f"Could not read CSV file. Error: {e}")
105
+ st.session_state.df_raw = None
106
+ st.session_state.last_uploaded_file = None
107
+
108
+ if st.session_state.df_raw is not None:
109
+ df_raw = st.session_state.df_raw
110
+ col1, col2, col3 = st.columns(3)
111
+
112
+ with col1: user_id_col = st.selectbox("User ID Column", df_raw.columns, index=0, key="user_id_col")
113
+ with col2: post_content_col = st.selectbox("Post Content Column", df_raw.columns, index=min(1, len(df_raw.columns)-1), key="post_content_col")
114
+ with col3: timestamp_col = st.selectbox("Timestamp Column", df_raw.columns, index=min(2, len(df_raw.columns)-1), key="timestamp_col")
115
+
116
+ st.subheader("Topic Modeling Settings")
117
+ lang_col, topics_col = st.columns(2)
118
+ with lang_col: language = st.selectbox("Language Model", ["english", "multilingual"], key="language_model")
119
+ with topics_col: num_topics = st.number_input("Number of Topics", -1, help="Use -1 for automatic detection", key="num_topics")
120
+
121
+ with st.expander("Advanced: Text Cleaning & Preprocessing Options", expanded=False):
122
+ c1, c2 = st.columns(2)
123
+ with c1:
124
+ opts = {
125
+ 'lowercase': st.checkbox("Convert to Lowercase", True, key="opt_lowercase"),
126
+ 'lemmatize': st.checkbox("Lemmatize words", False, key="opt_lemmatize"),
127
+ 'remove_urls': st.checkbox("Remove URLs", False, key="opt_remove_urls"),
128
+ 'remove_html': st.checkbox("Remove HTML Tags", False, key="opt_remove_html")
129
+ }
130
+ with c2:
131
+ opts.update({
132
+ 'remove_special_chars': st.checkbox("Remove Special Characters", False, key="opt_remove_special_chars"),
133
+ 'remove_punctuation': st.checkbox("Remove Punctuation", False, key="opt_remove_punctuation"),
134
+ 'remove_numbers': st.checkbox("Remove Numbers", False, key="opt_remove_numbers")
135
+ })
136
+ st.markdown("---")
137
+ c1_emoji, c2_hashtag, c3_mention = st.columns(3)
138
+ with c1_emoji: opts['handle_emojis'] = st.radio("Emoji Handling", ["Keep Emojis", "Remove Emojis", "Convert Emojis to Text"], index=0, key="opt_handle_emojis")
139
+ with c2_hashtag: opts['handle_hashtags'] = st.radio("Hashtag (#) Handling", ["Keep Hashtags", "Remove Hashtags", "Extract Hashtags"], index=0, key="opt_handle_hashtags")
140
+ with c3_mention: opts['handle_mentions'] = st.radio("Mention (@) Handling", ["Keep Mentions", "Remove Mentions", "Extract Mentions"], index=0, key="opt_handle_mentions")
141
+ st.markdown("---")
142
+ opts['remove_stopwords'] = st.checkbox("Remove Stopwords", True, key="opt_remove_stopwords")
143
+
144
+ st.text_area(
145
+ "Custom Stopwords (comma-separated)",
146
+ key="custom_stopwords_text", # This one already had a key
147
+ on_change=sync_stopwords
148
+ )
149
+ opts['custom_stopwords'] = [s.strip().lower() for s in st.session_state.custom_stopwords_text.split(',') if s]
150
+
151
+ st.subheader("User Similarity Analysis")
152
+ enable_similarity = st.checkbox(
153
+ "Enable User Similarity Analysis",
154
+ value=True,
155
+ help="Find users with similar interests based on topics or text content",
156
+ key="enable_similarity"
157
+ )
158
+
159
+ if enable_similarity:
160
+ similarity_method = st.radio(
161
+ "Similarity Method",
162
+ options=["Topic-Based", "Text Similarity (TF-IDF)"],
163
+ index=0,
164
+ help="Topic-Based: Compare topic distributions. TF-IDF: Compare actual text content.",
165
+ key="similarity_method",
166
+ horizontal=True
167
+ )
168
+ else:
169
+ similarity_method = None
170
+
171
+ st.divider()
172
+ process_button = st.button("🚀 Run Full Analysis", type="primary", use_container_width=True)
173
+ else:
174
+ process_button = False
175
+
176
+ st.divider()
177
+
178
+ # --- Main Processing Logic ---
179
+ if process_button:
180
+ st.session_state.results = None
181
+ start_time = time.time()
182
+ with st.spinner("Processing your data... This may take a few minutes."):
183
+ try:
184
+ df = df_raw[[user_id_col, post_content_col, timestamp_col]].copy()
185
+ df.columns = ['user_id', 'post_content', 'timestamp']
186
+ df.dropna(subset=['user_id', 'post_content', 'timestamp'], inplace=True)
187
+ try:
188
+ df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
189
+ invalid_timestamps = df['timestamp'].isna().sum()
190
+ if invalid_timestamps > 0:
191
+ st.warning(f"Warning: {invalid_timestamps} rows have invalid timestamps and will be excluded.")
192
+ df = df.dropna(subset=['timestamp'])
193
+ except Exception as e:
194
+ st.error(f"Could not parse timestamp column: {e}")
195
+ st.stop()
196
+ if opts['handle_hashtags'] == 'Extract Hashtags': df['hashtags'] = df['post_content'].str.findall(r'#\w+')
197
+ if opts['handle_mentions'] == 'Extract Mentions': df['mentions'] = df['post_content'].str.findall(r'@\w+')
198
+
199
+ # 1. Capture the user's actual choice about stopwords
200
+ user_wants_stopwords_removed = opts.get("remove_stopwords", False)
201
+ custom_stopwords_list = opts.get("custom_stopwords", [])
202
+
203
+ # 2. Tell the preprocessor to KEEP stopwords in the text.
204
+ opts_for_preprocessor = opts.copy()
205
+ opts_for_preprocessor['remove_stopwords'] = False
206
+
207
+ st.info("⚙️ Initializing preprocessor and cleaning text (keeping stopwords for now)...")
208
+ preprocessor = MultilingualPreprocessor(language=language)
209
+ df['processed_content'] = preprocessor.preprocess_series(
210
+ df['post_content'],
211
+ opts_for_preprocessor,
212
+ n_process_spacy=-1 # Use all CPU cores for faster processing
213
+ )
214
+
215
+ st.info("🔍 Performing topic modeling...")
216
+ # Add +1 because BERTopic creates an outlier topic (-1), so to get N meaningful topics, request N+1
217
+ if num_topics > 0:
218
+ bertopic_nr_topics = num_topics + 1
219
+ else:
220
+ bertopic_nr_topics = "auto"
221
+
222
+ docs_series = df['processed_content'].fillna('').astype(str)
223
+ docs_to_model = docs_series[docs_series.str.len() > 0].tolist()
224
+ df_with_content = df[docs_series.str.len() > 0].copy()
225
+
226
+ if not docs_to_model:
227
+ st.error("❌ After preprocessing, no documents were left to analyze. Please adjust your cleaning options.")
228
+ st.stop()
229
+
230
+ # 3. Pass the user's choice and stopwords list to BERTopic
231
+ topic_model, topics, probs, coherence_score = perform_topic_modeling(
232
+ docs=docs_to_model,
233
+ language=language,
234
+ nr_topics=bertopic_nr_topics,
235
+ remove_stopwords_bertopic=user_wants_stopwords_removed,
236
+ custom_stopwords=custom_stopwords_list
237
+ )
238
+
239
+ df_with_content['topic_id'] = topics
240
+ df_with_content['probability'] = probs
241
+ df = pd.merge(df, df_with_content[['topic_id', 'probability']], left_index=True, right_index=True, how='left')
242
+ df['topic_id'] = df['topic_id'].fillna(-1).astype(int)
243
+
244
+ st.info("📊 Calculating user engagement metrics...")
245
+ all_unique_topics = sorted(df[df['topic_id'] != -1]['topic_id'].unique().tolist())
246
+ all_unique_users = sorted(df['user_id'].unique().tolist())
247
+
248
+ gini_per_user = calculate_gini_per_user(df[['user_id', 'topic_id']], all_topics=all_unique_topics)
249
+ gini_per_topic = calculate_gini_per_topic(df[['user_id', 'topic_id']], all_users=all_unique_users)
250
+
251
+ st.info("📈 Analyzing topic evolution...")
252
+ general_evolution = analyze_general_topic_evolution(topic_model, docs_to_model, df_with_content['timestamp'].tolist())
253
+
254
+ end_time = time.time()
255
+ elapsed_time = end_time - start_time
256
+
257
+ # Format elapsed time nicely
258
+ if elapsed_time >= 60:
259
+ minutes = int(elapsed_time // 60)
260
+ seconds = elapsed_time % 60
261
+ time_str = f"{minutes} min {seconds:.1f} sec"
262
+ else:
263
+ time_str = f"{elapsed_time:.1f} sec"
264
+
265
+ # Cache df_meaningful for reuse (avoids repeated filtering)
266
+ df_meaningful = df[df['topic_id'] != -1].copy()
267
+
268
+ st.session_state.results = {
269
+ 'topic_model': topic_model,
270
+ 'topic_info': topic_model.get_topic_info(),
271
+ 'df': df,
272
+ 'df_meaningful': df_meaningful, # Cached for performance
273
+ 'gini_per_user': gini_per_user,
274
+ 'gini_per_topic': gini_per_topic,
275
+ 'general_evolution': general_evolution,
276
+ 'coherence_score': coherence_score,
277
+ 'processing_time': elapsed_time
278
+ }
279
+
280
+ st.success(f"✅ Analysis complete! Processing time: {time_str}")
281
+ except OSError as e:
282
+ st.error(f"spaCy Model Error: could not load a required model ({e}). Please run `python -m spacy download en_core_web_sm` and `python -m spacy download xx_ent_wiki_sm` from your terminal.")
283
+ except Exception as e:
284
+ st.error(f"❌ An error occurred during processing: {e}")
285
+ st.exception(e)
286
+ # --- Display Results ---
287
+ if st.session_state.results:
288
+ results = st.session_state.results
289
+ df = results['df']
290
+ topic_model = results['topic_model']
291
+ topic_info = results['topic_info']
292
+
293
+ st.markdown('<h2 class="sub-header">📋 Overview & Preprocessing</h2>', unsafe_allow_html=True)
294
+ score_text = f"{results['coherence_score']:.3f}" if results['coherence_score'] is not None else "N/A"
295
+ num_users = df['user_id'].nunique()
296
+ avg_posts = len(df) / num_users if num_users > 0 else 0
297
+ start_date, end_date = df['timestamp'].min(), df['timestamp'].max()
298
+ # Option 1: More Compact Date Format
299
+ if start_date.year == end_date.year:
300
+ # If both dates are in the same year, only show year on the end date
301
+ time_range_str = f"{start_date.strftime('%b %d')} - {end_date.strftime('%b %d, %Y')}"
302
+ else:
303
+ # If dates span multiple years, show year on both
304
+ time_range_str = f"{start_date.strftime('%b %d, %Y')} - {end_date.strftime('%b %d, %Y')}"
305
+
306
+ # Format processing time for display
307
+ proc_time = results.get('processing_time', 0)
308
+ if proc_time >= 60:
309
+ proc_time_str = f"{int(proc_time // 60)}m {proc_time % 60:.1f}s"
310
+ else:
311
+ proc_time_str = f"{proc_time:.1f}s"
312
+
313
+ col1, col2, col3, col4, col5, col6 = st.columns(6)
314
+ col1.metric("Total Posts", len(df))
315
+ col2.metric("Unique Users", num_users)
316
+ col3.metric("Avg Posts / User", f"{avg_posts:.1f}")
317
+ col4.metric("Time Range", time_range_str)
318
+ col5.metric("Topic Coherence", score_text)
319
+ col6.metric("Processing Time", proc_time_str)
320
+ st.markdown("#### Preprocessing Results (Sample)")
321
+ st.dataframe(df[['post_content', 'processed_content']].head())
322
+
323
+ with st.expander("📊 Topic Model Evaluation Metrics"):
324
+ st.write("""
325
+ ### 🔹 Coherence Score
326
+ Measures how well the discovered topics make sense:
327
+ - **> 0.6**: Excellent - Topics are very distinct and meaningful
328
+ - **0.5 - 0.6**: Good - Topics are generally clear and interpretable
329
+ - **0.4 - 0.5**: Fair - Topics are somewhat meaningful but may overlap
330
+ - **< 0.4**: Poor - Topics may be unclear or too similar
331
+
332
+ 💡 **Tip**: If coherence is low, try adjusting the number of topics or cleaning options.
333
+ """)
334
+
335
+ st.markdown('<h2 class="sub-header">🎯 Topic Visualization & Refinement</h2>', unsafe_allow_html=True)
336
+ topic_options = topic_info[topic_info.Topic != -1].sort_values('Count', ascending=False)
337
+
338
+
339
+
340
+
341
+ view1, view2 = st.tabs(["Word Clouds", "Interactive Word Lists & Refinement"])
342
+
343
+ with view1:
344
+ st.info("Visual representation of the most important words for each topic.")
345
+ topics_to_show = topic_options.head(9)
346
+ num_cols = 3
347
+ cols = st.columns(num_cols)
348
+ for i, row in enumerate(topics_to_show.itertuples()):
349
+ with cols[i % num_cols]:
350
+ st.markdown(f"##### Topic {row.Topic}: {row.Name}")
351
+ fig = create_word_cloud(topic_model, row.Topic)
352
+ if fig: st.pyplot(fig, use_container_width=True)
353
+
354
+ with view2:
355
+ st.info("Select or deselect words from the lists below to instantly update the custom stopwords list in the configuration section above.")
356
+ topics_to_show = topic_options.head(9)
357
+ # Store the topic IDs we are showing so the callback can find the right widgets
358
+ st.session_state.topics_info_for_sync = [row.Topic for row in topics_to_show.itertuples()]
359
+
360
+ num_cols = 3
361
+ cols = st.columns(num_cols)
362
+
363
+ # Calculate which words should be pre-selected in the multiselects
364
+ current_stopwords_set = {s.strip() for s in st.session_state.custom_stopwords_text.split(',') if s.strip()}
365
+
366
+ for i, row in enumerate(topics_to_show.itertuples()):
367
+ with cols[i % num_cols]:
368
+ st.markdown(f"##### Topic {row.Topic}")
369
+ topic_words = topic_model.get_topic(row.Topic)
370
+
371
+ # The options for the multiselect, e.g., ["word1 (0.123)", "word2 (0.122)"]
372
+ formatted_options = [f"{word} ({score:.3f})" for word, score in topic_words[:15]]
373
+
374
+ # Determine the default selected values for this specific multiselect
375
+ default_selection = []
376
+ for formatted_word in formatted_options:
377
+ word_part = formatted_word.split(' ')[0]
378
+ if word_part in current_stopwords_set:
379
+ default_selection.append(formatted_word)
380
+
381
+ st.multiselect(
382
+ f"Select words from Topic {row.Topic}",
383
+ options=formatted_options,
384
+ default=default_selection, # Pre-select words that are already in the list
385
+ key=f"multiselect_topic_{row.Topic}",
386
+ on_change=sync_stopwords, # The callback synchronizes everything
387
+ label_visibility="collapsed"
388
+ )
389
+
390
+
391
+
392
+
393
+ st.markdown('<h2 class="sub-header">📈 Topic Evolution</h2>', unsafe_allow_html=True)
394
+ if not results['general_evolution'].empty:
395
+ evo = results['general_evolution']
396
+
397
+
398
+ # 1. Filter out the outlier topic (-1) and ensure Timestamp is a datetime object
399
+ evo_filtered = evo[evo.Topic != -1].copy()
400
+ evo_filtered['Timestamp'] = pd.to_datetime(evo_filtered['Timestamp'])
401
+
402
+ if not evo_filtered.empty:
403
+ # 2. Pivot the data to get topics as columns and aggregate frequencies
404
+ evo_pivot = evo_filtered.pivot_table(
405
+ index='Timestamp',
406
+ columns='Topic',
407
+ values='Frequency',
408
+ aggfunc='sum'
409
+ ).fillna(0)
410
+
411
+ # 3. Dynamically choose a good resampling frequency (Hourly, Daily, or Weekly)
412
+ time_delta = evo_pivot.index.max() - evo_pivot.index.min()
413
+ if time_delta.days > 60:
414
+ resample_freq, freq_label = 'W', 'Weekly'
415
+ elif time_delta.days > 5:
416
+ resample_freq, freq_label = 'D', 'Daily'
417
+ else:
418
+ resample_freq, freq_label = 'H', 'Hourly'
419
+
420
+ # Resample the data into the chosen time bins by summing up the frequencies
421
+ evo_resampled = evo_pivot.resample(resample_freq).sum()
422
+
423
+ # 4. Create the line chart using plotly.express.line
424
+
425
+ fig_evo = px.line(
426
+ evo_resampled,
427
+ x=evo_resampled.index,
428
+ y=evo_resampled.columns,
429
+ title=f"Topic Frequency Over Time ({freq_label} Line Chart)",
430
+ labels={'value': 'Total Frequency', 'variable': 'Topic ID', 'index': 'Time'},
431
+ height=500
432
+ )
433
+ # Make the topic IDs in the legend categorical for better color mapping
434
+ fig_evo.for_each_trace(lambda t: t.update(name=str(t.name)))
435
+ fig_evo.update_layout(legend_title_text='Topic')
436
+
437
+ st.plotly_chart(fig_evo, use_container_width=True)
438
+ else:
439
+ st.info("No topic evolution data available to display (all posts may have been outliers).")
440
+ else:
441
+ st.warning("Could not compute topic evolution (requires more data points over time).")
442
+
443
+
444
+
445
+
446
+
447
+ st.markdown('<h2 class="sub-header">🧑‍🤝‍🧑 User Engagement Profile</h2>', unsafe_allow_html=True)
448
+
449
+ # --- Engagement metrics use only meaningful (non-outlier) topics ---
450
+
451
+ # 1. Use cached df_meaningful from session_state for performance
452
+ df_meaningful = results.get('df_meaningful', df[df['topic_id'] != -1])
453
+
454
+ # 2. Get post counts based on this meaningful data.
455
+ meaningful_post_counts = df_meaningful.groupby('user_id').size().reset_index(name='post_count')
456
+
457
+ # 3. Merge with the Gini results (which were already correctly calculated on meaningful topics).
458
+ # Using an 'inner' merge ensures we only consider users who have at least one meaningful post.
459
+ user_metrics_df = pd.merge(
460
+ meaningful_post_counts,
461
+ results['gini_per_user'],
462
+ on='user_id',
463
+ how='inner'
464
+ )
465
+
466
+ # 4. Filter to include only users with more than one MEANINGFUL post.
467
+ metrics_to_plot = user_metrics_df[user_metrics_df['post_count'] > 1].copy()
468
+
469
+ total_meaningful_users = len(user_metrics_df)
470
+ st.info(f"Displaying engagement profile for {len(metrics_to_plot)} users out of {total_meaningful_users} who contributed to meaningful topics.")
471
+
472
+ # 5. Add jitter for better visualization (deterministic seed for consistency)
473
+ np.random.seed(42)
474
+ jitter_strength = 0.02
475
+ metrics_to_plot['gini_jittered'] = metrics_to_plot['gini_coefficient'] + \
476
+ np.random.uniform(-jitter_strength, jitter_strength, size=len(metrics_to_plot))
477
+
478
+ # 6. Create the plot using the correctly filtered and prepared data.
479
+ fig = px.scatter(
480
+ metrics_to_plot,
481
+ x='post_count',
482
+ y='gini_jittered',
483
+ title='User Engagement Profile (based on posts in meaningful topics)',
484
+ labels={
485
+ 'post_count': 'Number of Posts in Meaningful Topics', # Updated label
486
+ 'gini_jittered': 'Gini Index (Topic Diversity)'
487
+ },
488
+ custom_data=['user_id', 'gini_coefficient']
489
+ )
490
+ fig.update_traces(
491
+ marker=dict(opacity=0.5),
492
+ hovertemplate="<b>User</b>: %{customdata[0]}<br><b>Meaningful Posts</b>: %{x}<br><b>Gini (Original)</b>: %{customdata[1]:.3f}<extra></extra>"
493
+ )
494
+ fig.update_yaxes(range=[-0.05, 1.05])
495
+ st.plotly_chart(fig, use_container_width=True)
496
+
497
+
498
+
499
+ st.markdown('<h2 class="sub-header">👤 User Deep Dive</h2>', unsafe_allow_html=True)
500
+ selected_user = st.selectbox("Select a User to Analyze", options=sorted(df['user_id'].unique()), key="selected_user_dropdown")
501
+
502
+ if selected_user:
503
+ user_df = df[df['user_id'] == selected_user]
504
+ matching_users = user_metrics_df[user_metrics_df['user_id'] == selected_user]
505
+
506
+ if matching_users.empty:
507
+ st.warning("This user has no posts in meaningful topics (all posts were classified as outliers).")
508
+ st.metric("Total Posts by User", len(user_df))
509
+ else:
510
+ user_gini_info = matching_users.iloc[0]
511
+
512
+ # Display the top-level metrics for the user first
513
+ c1, c2 = st.columns(2)
514
+ with c1: st.metric("Total Posts by User", len(user_df))
515
+ with c2: st.metric("Topic Diversity (Gini)", f"{user_gini_info['gini_coefficient']:.3f}", help=interpret_gini(user_gini_info['gini_coefficient']))
516
+
517
+ st.markdown("---") # Add a visual separator
518
+
519
+ # Two-column layout for the user's charts
520
+ col1, col2 = st.columns(2)
521
+
522
+ with col1:
523
+ # --- Chart 1: Topic Distribution Pie Chart ---
524
+ user_topic_counts = user_df['topic_id'].value_counts().reset_index()
525
+ user_topic_counts.columns = ['topic_id', 'count']
526
+
527
+ fig_pie = px.pie(
528
+ user_topic_counts[user_topic_counts.topic_id != -1],
529
+ names='topic_id',
530
+ values='count',
531
+ title=f"Overall Topic Distribution for {selected_user}",
532
+ hole=0.4
533
+ )
534
+ fig_pie.update_layout(margin=dict(l=0, r=0, t=40, b=0))
535
+ st.plotly_chart(fig_pie, use_container_width=True)
536
+
537
+ with col2:
538
+ # --- Chart 2: Topic Evolution for User ---
539
+ if len(user_df) > 1:
540
+ user_evo_df = user_df[user_df['topic_id'] != -1].copy()
541
+ user_evo_df['timestamp'] = pd.to_datetime(user_evo_df['timestamp'])
542
+
543
+ if not user_evo_df.empty and user_evo_df['timestamp'].nunique() > 1:
544
+ user_pivot = user_evo_df.pivot_table(index='timestamp', columns='topic_id', aggfunc='size', fill_value=0)
545
+
546
+ time_delta = user_pivot.index.max() - user_pivot.index.min()
547
+ if time_delta.days > 30: resample_freq = 'D'
548
+ elif time_delta.days > 2: resample_freq = 'H'
549
+ else: resample_freq = 'T'
550
+
551
+ user_resampled = user_pivot.resample(resample_freq).sum()
552
+ row_sums = user_resampled.sum(axis=1)
553
+ user_proportions = user_resampled.div(row_sums, axis=0).fillna(0)
554
+
555
+ topic_name_map = topic_info.set_index('Topic')['Name'].to_dict()
556
+ user_proportions.rename(columns=topic_name_map, inplace=True)
557
+
558
+ fig_user_evo = px.area(
559
+ user_proportions,
560
+ x=user_proportions.index,
561
+ y=user_proportions.columns,
562
+ title=f"Topic Proportion Over Time for {selected_user}",
563
+ labels={'value': 'Topic Proportion', 'variable': 'Topic', 'index': 'Time'},
564
+ )
565
+ fig_user_evo.update_layout(margin=dict(l=0, r=0, t=40, b=0))
566
+ st.plotly_chart(fig_user_evo, use_container_width=True)
567
+ else:
568
+ st.info("This user has no posts in meaningful topics or all posts occurred at the same time.")
569
+ else:
570
+ st.info("Topic evolution requires more than one post to display.")
571
+
572
+
573
+ st.markdown("#### User's Most Recent Posts")
574
+ user_posts_table = user_df[['post_content', 'timestamp', 'topic_id']] \
575
+ .sort_values(by='timestamp', ascending=False) \
576
+ .head(100)
577
+ user_posts_table.columns = ['Post Content', 'Timestamp', 'Assigned Topic']
578
+ st.dataframe(user_posts_table, use_container_width=True)
579
+
580
+ with st.expander("Show User Distribution by Post Count"):
581
+ # We use 'user_metrics_df' because it's based on meaningful posts
582
+ post_distribution = user_metrics_df['post_count'].value_counts().reset_index()
583
+ post_distribution.columns = ['Number of Posts', 'Number of Users']
584
+ post_distribution = post_distribution.sort_values(by='Number of Posts')
585
+
586
+ # Create a bar chart for the distribution
587
+ fig_dist = px.bar(
588
+ post_distribution,
589
+ x='Number of Posts',
590
+ y='Number of Users',
591
+ title='User Distribution by Number of Meaningful Posts'
592
+ )
593
+ st.plotly_chart(fig_dist, use_container_width=True)
594
+
595
+ # Display the raw data in a table
596
+ st.write("Data Table: User Distribution")
597
+ st.dataframe(post_distribution, use_container_width=True)
598
+
599
+ # --- User Similarity Analysis Section ---
600
+ # Check if similarity analysis is enabled
601
+ if st.session_state.get('enable_similarity', True):
602
+ st.markdown('<h2 class="sub-header">🤝 User Similarity Analysis</h2>', unsafe_allow_html=True)
603
+
604
+ # Get the selected method
605
+ selected_method = st.session_state.get('similarity_method', 'Topic-Based')
606
+
607
+ if selected_method == "Topic-Based":
608
+ st.info("Finding users with similar **topic interests** based on their topic distributions.")
609
+ df_for_similarity = results.get('df_meaningful', df[df['topic_id'] != -1])
610
+ similarity_df = calculate_narrative_similarity(df_for_similarity)
611
+ else: # TF-IDF
612
+ st.info("Finding users with similar **text content** using TF-IDF word analysis.")
613
+ with st.spinner("Calculating text similarity (this may take a moment)..."):
614
+ similarity_df = calculate_text_similarity_tfidf(df)
615
+
616
+ if similarity_df.empty:
617
+ st.warning("Not enough data to calculate similarity. Need at least 2 users with content.")
618
+ else:
619
+ # User selection for similarity analysis
620
+ similarity_user = st.selectbox(
621
+ "Select a User to Find Similar Users",
622
+ options=sorted(similarity_df.index.tolist()),
623
+ key="similarity_user_dropdown"
624
+ )
625
+
626
+ # Similarity threshold slider
627
+ similarity_threshold = st.slider(
628
+ "Similarity Threshold",
629
+ min_value=0.0,
630
+ max_value=1.0,
631
+ value=0.5,
632
+ step=0.05,
633
+ help="Only show users with similarity score above this threshold"
634
+ )
635
+
636
+ if similarity_user:
637
+ # Get similarity scores for the selected user
638
+ user_similarities = similarity_df[similarity_user].drop(similarity_user) # Exclude self
639
+
640
+ # Filter by threshold
641
+ similar_users = user_similarities[user_similarities >= similarity_threshold].sort_values(ascending=False)
642
+
643
+ if similar_users.empty:
644
+ st.info(f"No users found with similarity >= {similarity_threshold}. Try lowering the threshold.")
645
+ else:
646
+ # Create a results DataFrame with post counts
647
+ similar_users_df = pd.DataFrame({
648
+ 'User ID': similar_users.index,
649
+ 'Similarity Score': similar_users.values
650
+ })
651
+
652
+ # Add post count for context
653
+ post_counts = df.groupby('user_id').size()
654
+ similar_users_df['Post Count'] = similar_users_df['User ID'].map(post_counts).fillna(0).astype(int)
655
+
656
+ # Format the similarity score
657
+ similar_users_df['Similarity Score'] = similar_users_df['Similarity Score'].apply(lambda x: f"{x:.3f}")
658
+
659
+ method_label = "topic interests" if selected_method == "Topic-Based" else "text content"
660
+ st.write(f"**Found {len(similar_users_df)} users** with similar {method_label} to **{similarity_user}**:")
661
+ st.dataframe(similar_users_df, use_container_width=True, hide_index=True)
docker-compose.yml ADDED
@@ -0,0 +1,25 @@
1
+ version: '3.8'
2
+
3
+ services:
4
+ topic-modeling-app:
5
+ build: .
6
+ ports:
7
+ - "8501:8501"
8
+ environment:
9
+ - STREAMLIT_SERVER_PORT=8501
10
+ - STREAMLIT_SERVER_ADDRESS=0.0.0.0
11
+ - STREAMLIT_BROWSER_GATHER_USAGE_STATS=false
12
+ - STREAMLIT_SERVER_HEADLESS=true
13
+ - TOKENIZERS_PARALLELISM=false
14
+
15
+ volumes:
16
+ # Optional: Mount a directory for persistent data storage
17
+ - ./data:/app/data
18
+ restart: unless-stopped
19
+ healthcheck:
20
+ test: ["CMD", "curl", "-f", "http://localhost:8501/_stcore/health"]
21
+ interval: 30s
22
+ timeout: 10s
23
+ retries: 3
24
+ start_period: 40s
25
+
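The production resource limits shown in this commit's Deployment Guide (`--memory=4g`, `--cpus=2` on `docker run`) could also live in a Compose override instead of the command line. A minimal sketch; the file name `docker-compose.prod.yml` is an assumption, and `deploy.resources.limits` requires a Compose implementation that honors it outside Swarm (recent `docker compose` does):

```yaml
# docker-compose.prod.yml (hypothetical override file)
services:
  topic-modeling-app:
    deploy:
      resources:
        limits:
          cpus: '2'    # matches --cpus=2 from the docker run example
          memory: 4g   # matches --memory=4g
```

Apply it together with the base file: `docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d`.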
gini_calculator.py ADDED
@@ -0,0 +1,107 @@
1
+ import pandas as pd
2
+ import math
3
+ from typing import List
5
+
6
+ def calculate_gini(counts, *, min_posts=None, normalize=False):
7
+ """
8
+ Compute 1 - sum(p_i^2) where p_i are category probabilities (Gini Impurity).
9
+ Handles: list/tuple of counts, dict {cat: count}, numpy array, pandas Series.
10
+
11
+ Edge cases:
12
+ - total == 0 -> return float('nan')
13
+ - total == 1 -> return 0.0
14
+ - min_posts set and total < min_posts -> return float('nan')
15
+ - normalize=True -> divide by (1 - 1/k_nonzero) when k_nonzero > 1
16
+
17
+ Parameters
18
+ ----------
19
+ counts : Iterable[int] | dict | pandas.Series | numpy.ndarray
20
+ Nonnegative counts per category.
21
+ min_posts : int | None
22
+ If provided and total posts < min_posts, returns NaN.
23
+ normalize : bool
24
+ If True, returns Gini / (1 - 1/k_nonzero) for k_nonzero > 1.
25
+
26
+ Returns
27
+ -------
28
+ float
29
+ """
30
+ # Convert to a flat list of counts
31
+ if counts is None:
32
+ return float('nan')
33
+
34
+ if isinstance(counts, dict):
35
+ vals = list(counts.values())
36
+ else:
37
+ # Works for list/tuple/np.array/Series
38
+ try:
39
+ vals = list(counts)
40
+ except TypeError:
41
+ return float('nan')
42
+
43
+ # Validate & clean
44
+ vals = [float(v) for v in vals if v is not None and not math.isnan(v)]
45
+ if any(v < 0 for v in vals):
46
+ raise ValueError("Counts must be nonnegative.")
47
+ total = sum(vals)
48
+
49
+ # Edge cases
50
+ if total == 0:
51
+ return float('nan')
52
+ if min_posts is not None and total < min_posts:
53
+ return float('nan')
54
+ if total == 1:
55
+ base = 0.0
56
+ else:
57
+ # Compute 1 - sum p_i^2
58
+ s2 = sum((v / total) ** 2 for v in vals)
59
+ base = 1.0 - s2
60
+
61
+ if not normalize:
62
+ return base
63
+
64
+ # Normalization by maximum possible diversity for observed nonzero categories
65
+ k_nonzero = sum(1 for v in vals if v > 0)
66
+ if k_nonzero <= 1:
67
+ # If only one category has posts, diversity is 0 and normalization isn't defined—return 0
68
+ return 0.0
69
+ denom = 1.0 - 1.0 / k_nonzero
70
+ # Guard against floating tiny negatives due to FP
71
+ return max(0.0, min(1.0, base / denom))
72
+
73
+
74
+ def calculate_gini_per_user(df: pd.DataFrame, all_topics: List[int]):
75
+ """
76
+ Calculates the Gini Impurity for topic distribution per user.
77
+ A high value indicates high topic diversity.
78
+ Optimized with groupby for better performance.
79
+ """
80
+ def compute_user_gini(group):
81
+ existing_topic_counts = group["topic_id"].value_counts()
82
+ full_topic_counts = pd.Series(0, index=all_topics)
83
+ full_topic_counts.update(existing_topic_counts)
84
+ return calculate_gini(full_topic_counts.values, normalize=True)
85
+
86
+ # Use groupby instead of loop for O(n) instead of O(n*m) complexity
87
+ user_gini = df.groupby("user_id").apply(compute_user_gini).reset_index()
88
+ user_gini.columns = ["user_id", "gini_coefficient"]
89
+ return user_gini.fillna(0)
90
+
91
+
92
+ def calculate_gini_per_topic(df: pd.DataFrame, all_users: List[str]):
93
+ """
94
+ Calculates the Gini Impurity for user distribution per topic.
95
+ A high value indicates the topic is discussed by a diverse set of users.
96
+ Optimized with groupby for better performance.
97
+ """
98
+ def compute_topic_gini(group):
99
+ existing_user_counts = group["user_id"].value_counts()
100
+ full_user_counts = pd.Series(0, index=all_users)
101
+ full_user_counts.update(existing_user_counts)
102
+ return calculate_gini(full_user_counts.values, normalize=True)
103
+
104
+ # Use groupby instead of loop for O(n) instead of O(n*m) complexity
105
+ topic_gini = df.groupby("topic_id").apply(compute_topic_gini).reset_index()
106
+ topic_gini.columns = ["topic_id", "gini_coefficient"]
107
+ return topic_gini.fillna(0)
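Despite the `gini_coefficient` column name, the functions above compute (normalized) Gini impurity, `1 - sum(p_i**2)`. A dependency-free sketch of that formula can serve as a sanity check; `gini_impurity` here is a simplified stand-in for `calculate_gini`, not the full implementation:

```python
# Simplified re-statement of the diversity formula used above:
# impurity = 1 - sum(p_i^2), optionally normalized by (1 - 1/k_nonzero).
def gini_impurity(counts, normalize=False):
    total = sum(counts)
    if total == 0:
        return float('nan')
    base = 1.0 - sum((c / total) ** 2 for c in counts)
    if not normalize:
        return base
    k_nonzero = sum(1 for c in counts if c > 0)
    if k_nonzero <= 1:
        return 0.0
    return base / (1.0 - 1.0 / k_nonzero)

# A user posting evenly across 4 topics reaches the maximum diversity:
print(gini_impurity([5, 5, 5, 5]))                  # 0.75
print(gini_impurity([5, 5, 5, 5], normalize=True))  # 1.0
# A single-topic user has zero diversity:
print(gini_impurity([10, 0, 0, 0]))                 # 0.0
```

Normalization rescales the score so that an even split over the observed categories maps to 1.0, which keeps users with different topic counts comparable.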
narrative_similarity.py ADDED
@@ -0,0 +1,102 @@
1
+ # narrative_similarity.py
2
+
3
+ import pandas as pd
4
+ from sklearn.metrics.pairwise import cosine_similarity
5
+ from sklearn.feature_extraction.text import TfidfVectorizer
6
+
7
+ def calculate_narrative_similarity(df: pd.DataFrame):
8
+ """
9
+ Calculates the narrative overlap between users based on their topic distributions.
10
+
11
+ Args:
12
+ df (pd.DataFrame): DataFrame containing 'user_id' and 'topic_id' columns.
13
+ Should already be filtered to exclude outliers (topic_id == -1).
14
+
15
+ Returns:
16
+ pd.DataFrame: A square DataFrame where rows and columns are user_ids
17
+ and values are the cosine similarity of their topic distributions.
18
+ """
19
+ # Filter out outlier posts if any remain
20
+ df_meaningful = df[df['topic_id'] != -1] if 'topic_id' in df.columns else df
21
+
22
+ if df_meaningful.empty:
23
+ return pd.DataFrame()
24
+
25
+ # Create the "narrative vector" for each user
26
+ # Rows: user_id, Columns: topic_id, Values: count of posts
27
+ user_topic_matrix = pd.crosstab(df_meaningful['user_id'], df_meaningful['topic_id'])
28
+
29
+ # Need at least 2 users for meaningful comparison
30
+ if len(user_topic_matrix) < 2:
31
+ return pd.DataFrame()
32
+
33
+ # Normalize rows to get proportions (important for meaningful cosine similarity)
34
+ # This ensures users with different post counts can still be compared fairly
35
+ row_sums = user_topic_matrix.sum(axis=1)
36
+ user_topic_proportions = user_topic_matrix.div(row_sums, axis=0)
37
+
38
+ # Calculate pairwise cosine similarity between all users
39
+ similarity_matrix = cosine_similarity(user_topic_proportions)
40
+
41
+ # Convert the result back to a DataFrame with user_ids as labels
42
+ similarity_df = pd.DataFrame(
43
+ similarity_matrix,
44
+ index=user_topic_matrix.index,
45
+ columns=user_topic_matrix.index
46
+ )
47
+
48
+ return similarity_df
49
+
50
+
51
+ def calculate_text_similarity_tfidf(df: pd.DataFrame):
52
+ """
53
+ Calculates text similarity between users using TF-IDF vectorization.
54
+
55
+ Combines all posts from each user into a single document, then compares
56
+ the word frequencies using TF-IDF and cosine similarity.
57
+
58
+ Args:
59
+ df (pd.DataFrame): DataFrame containing 'user_id' and 'post_content' columns.
60
+
61
+ Returns:
62
+ pd.DataFrame: A square DataFrame where rows and columns are user_ids
63
+ and values are the cosine similarity of their text content.
64
+ """
65
+ if df.empty or 'post_content' not in df.columns:
66
+ return pd.DataFrame()
67
+
68
+ # Combine all posts from each user into a single document
69
+ user_docs = df.groupby('user_id')['post_content'].apply(
70
+ lambda posts: ' '.join(posts.astype(str))
71
+ ).reset_index()
72
+ user_docs.columns = ['user_id', 'combined_text']
73
+
74
+ # Need at least 2 users for meaningful comparison
75
+ if len(user_docs) < 2:
76
+ return pd.DataFrame()
77
+
78
+ # Create TF-IDF vectors for each user's combined text
79
+ tfidf = TfidfVectorizer(
80
+ max_features=5000, # Limit vocabulary size for performance
81
+ stop_words='english',
82
+ min_df=1,
83
+ max_df=0.95
84
+ )
85
+
86
+ try:
87
+ tfidf_matrix = tfidf.fit_transform(user_docs['combined_text'])
88
+ except ValueError:
89
+ # Empty vocabulary (all stop words or empty texts)
90
+ return pd.DataFrame()
91
+
92
+ # Calculate pairwise cosine similarity
93
+ similarity_matrix = cosine_similarity(tfidf_matrix)
94
+
95
+ # Convert to DataFrame with user_ids as labels
96
+ similarity_df = pd.DataFrame(
97
+ similarity_matrix,
98
+ index=user_docs['user_id'],
99
+ columns=user_docs['user_id']
100
+ )
101
+
102
+ return similarity_df
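The pipeline in `calculate_narrative_similarity` (counts per topic, row-normalized to proportions, pairwise cosine similarity) can be illustrated without pandas or scikit-learn. A minimal sketch with hypothetical users:

```python
# Dependency-free sketch of the topic-distribution comparison above:
# each user becomes a vector of topic proportions, compared by cosine similarity.
import math

def topic_proportions(topic_counts):
    total = sum(topic_counts.values())
    return {t: c / total for t, c in topic_counts.items()}

def cosine(u, v):
    topics = set(u) | set(v)
    dot = sum(u.get(t, 0.0) * v.get(t, 0.0) for t in topics)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

alice = topic_proportions({0: 8, 1: 2})   # mostly topic 0
bob   = topic_proportions({0: 4, 1: 1})   # same mix, fewer posts
carol = topic_proportions({2: 5})         # disjoint interests

print(round(cosine(alice, bob), 3))  # 1.0 (identical proportions)
print(cosine(alice, carol))          # 0.0 (no shared topics)
```

This is why the row normalization matters: alice and bob have different post volumes but identical topic mixes, so they score as maximally similar.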
readme.md ADDED
@@ -0,0 +1,138 @@
1
+ # Social Media Topic Modeling System
2
+
3
+ A comprehensive topic modeling system for social media analysis built with Streamlit and BERTopic. This application supports flexible CSV column mapping, multilingual topic modeling, Gini coefficient calculation for diversity analysis, topic evolution tracking, and semantic narrative overlap detection.
4
+
5
+ ## Features
6
+
7
+ - **📊 Topic Modeling**: Uses BERTopic for state-of-the-art, transformer-based topic modeling.
8
+ - **⚙️ Flexible Configuration**:
9
+ - **Custom Column Mapping**: Use any CSV file by mapping your columns to `user_id`, `post_content`, and `timestamp`.
10
+ - **Topic Number Control**: Let the model find topics automatically or specify the exact number you need.
11
+ - **🌍 Multilingual Support**: Handles English and 50+ other languages using appropriate language models.
12
+ - **📈 Gini Index Analysis**: Calculates topic and user diversity.
13
+ - **⏰ Topic Evolution**: Tracks how topic popularity and user interests change over time with interactive charts.
14
+ - **🤝 Narrative Overlap Analysis**: Identifies users with semantically similar posting patterns (shared narratives), even when their wording differs.
15
+ - **✍️ Interactive Topic Refinement**: Fine-tune topic quality by adding words to a custom stopword list directly from the dashboard.
16
+ - **🎯 Interactive Visualizations**: A rich dashboard with built-in charts and data tables using Plotly.
17
+ - **📱 Responsive Interface**: Clean, modern Streamlit interface with a control panel for all settings.
18
+
19
+ ## Requirements
20
+
21
+ ### CSV File Format
22
+
23
+ Your CSV file must contain columns that can be mapped to the following roles:
24
+ - **User ID**: A column with unique identifiers for each user (string).
25
+ - **Post Content**: A column with the text content of the social media post (string).
26
+ - **Timestamp**: A column with the date and time of the post.
27
+
28
+ The application will prompt you to select the correct column for each role after you upload your file.
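For reference, a minimal CSV providing all three roles might look like this (the column names are illustrative; any names work, since you map them after upload):

```csv
user_id,post_content,timestamp
u001,"Excited about the new release! #launch",2023-10-27 15:30:00
u002,"Anyone else seeing downtime? @support",2023-10-27 16:05:00
u001,"Back online, great response time",2023-10-28 09:12:00
```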
29
+
30
+ #### A Note on Timestamp Formatting
31
+
32
+ The application can automatically parse many common date and time formats via pandas. To ensure accurate parsing and avoid errors, follow these guidelines for your timestamp column:
33
+
34
+ * **Best Practice (Recommended):** Use a standard, unambiguous format like ISO 8601.
35
+ - `YYYY-MM-DD HH:MM:SS` (e.g., `2023-10-27 15:30:00`)
36
+ - `YYYY-MM-DDTHH:MM:SS` (e.g., `2023-10-27T15:30:00`)
37
+
38
+ * **Supported Formats:** Most common formats will work, including:
39
+ - `MM/DD/YYYY HH:MM` (e.g., `10/27/2023 15:30`)
40
+ - `DD/MM/YYYY HH:MM` (e.g., `27/10/2023 15:30`)
41
+ - `Month D, YYYY` (e.g., `October 27, 2023`)
42
+
43
+ * **Potential Issues to Avoid:**
44
+ - **Ambiguous formats:** A date like `01/02/2023` can be interpreted as either Jan 2nd or Feb 1st. Using a `YYYY-MM-DD` format avoids this.
45
+ - **Mixed formats in one column:** Ensure all timestamps in your column follow the same format for best performance and reliability.
46
+ - **Timezone information:** Formats with timezone offsets (e.g., `2023-10-27 15:30:00+05:30`) are fully supported.
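The ambiguity described above is easy to demonstrate with the standard library alone; pandas faces the same choice when it infers formats:

```python
from datetime import datetime

# ISO 8601 is unambiguous: year, month, and day have fixed positions.
ts = datetime.fromisoformat("2023-10-27 15:30:00")
print(ts.month, ts.day)  # 10 27

# An ambiguous string like "01/02/2023" depends entirely on the assumed format:
us = datetime.strptime("01/02/2023", "%m/%d/%Y")  # January 2nd
eu = datetime.strptime("01/02/2023", "%d/%m/%Y")  # February 1st
print(us.month, eu.month)  # 1 2
```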
47
+
48
+ ### Dependencies
49
+
50
+ See `requirements.txt` for a full list of dependencies.
51
+
52
+ ## Installation
53
+
54
+ ### Option 1: Local Installation
55
+
56
+ 1. **Clone or download the project files.**
57
+ 2. **Install dependencies:**
58
+ ```bash
59
+ pip install -r requirements.txt
60
+ ```
61
+ 3. **Download spaCy models:**
62
+ ```bash
63
+ python -m spacy download en_core_web_sm
64
+ python -m spacy download xx_ent_wiki_sm
65
+ ```
66
+
67
+ ### Option 2: Docker Installation (Recommended)
68
+
69
+ 1. **Using Docker Compose (easiest):**
70
+ ```bash
71
+ docker-compose up --build
72
+ ```
73
+ 2. **Access the application:**
74
+ Open your browser and go to `http://localhost:8501`.
75
+
76
+ ## Usage
77
+
78
+ 1. **Start the Streamlit application:**
79
+ ```bash
80
+ streamlit run streamlit_app.py
81
+ ```
82
+ 2. **Open your browser** and navigate to the local URL provided by Streamlit (usually `http://localhost:8501`).
83
+ 3. **Follow the steps in the application:**
84
+ - **1. Upload CSV File**: Click "Browse files" to upload your dataset.
85
+ - **2. Map Data Columns**: Once uploaded, select which of your columns correspond to `User ID`, `Post Content`, and `Timestamp`.
86
+ - **3. Configure Analysis**:
87
+ - **Language Model**: Choose `english` for English-only data or `multilingual` for other languages.
88
+ - **Number of Topics**: Enter a specific number of meaningful topics to find, or use `-1` to let the model decide automatically.
89
+ - **Text Preprocessing**: Expand the advanced options to select cleaning steps like lowercasing, punctuation removal, and more.
90
+ - **Custom Stopwords**: (Optional) Enter comma-separated words to exclude from analysis.
91
+ - **4. Run Analysis**: Click the "🚀 Run Full Analysis" button.
92
+
93
+ 4. **Explore the results** in the interactive sections of the main panel.
94
+
95
+ ### Exploring the Interface
96
+
97
+ The application provides a series of detailed sections:
98
+
99
+ #### 📋 Overview & Preprocessing
100
+ - Key metrics (total posts, unique users), dataset time range, and a topic coherence score.
101
+ - A sample of your data showing the original and processed text.
102
+
103
+ #### 🎯 Topic Visualization & Refinement
104
+ - **Word Clouds**: Visual representation of the most important words for top topics.
105
+ - **Interactive Word Lists**: Select words from topic lists to add them to your custom stopwords for re-analysis.
106
+
107
+ #### 📈 Topic Evolution
108
+ - An interactive line chart showing how topic frequencies change over the entire dataset's timespan.
109
+
110
+ #### 🧑‍🤝‍🧑 User Engagement Profile
111
+ - A scatter plot visualizing the relationship between the number of posts a user makes and the diversity of their topics.
112
+ - An expandable section showing the distribution of users by their post count.
113
+
114
+ #### 👤 User Deep Dive
115
+ - Select a specific user to analyze.
116
+ - View their key metrics, overall topic distribution pie chart, and their personal topic evolution over time.
117
+ - See detailed tables of their topic breakdown and their most recent posts.
118
+
119
+ #### 🤝 Narrative Overlap Analysis
120
+ - Select a user to find other users who discuss a similar mix of topics.
121
+ - Use the slider to adjust the similarity threshold.
122
+ - The results table shows the overlap score and post count of similar users, providing context on both narrative alignment and engagement level.
123
+
124
+ ## Understanding the Results
125
+
126
+ ### Gini Impurity Index
127
+ This application uses the **Gini Impurity Index**, a measure of diversity.
128
+ - **Range**: 0 to just under 1 (for k categories the maximum is 1 − 1/k, so scores approach 1 as the number of topics grows)
129
+ - **User Gini (Topic Diversity)**: Measures how diverse a user's topics are. **0** = perfectly specialized (posts on only one topic); values near **1** = highly diverse (posts spread evenly across many topics).
130
+ - **Topic Gini (User Diversity)**: Measures how concentrated a topic is among users. **0** = dominated by a single user; values near **1** = widely and evenly discussed by many users.
131
+
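For reference, here is a minimal sketch of the computation, assuming the standard Gini impurity formula `1 - sum(p_i^2)` applied to a vector of per-topic post counts (the app's internal implementation may differ in detail):

```python
def gini_impurity(counts):
    """Gini impurity: 1 - sum(p_i^2).

    0 means all mass sits in one category; the value approaches 1
    as mass spreads evenly over many categories.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

# A user who posts on only one topic is perfectly specialized
specialized = gini_impurity([12, 0, 0, 0])   # -> 0.0

# An even spread across 4 topics gives 1 - 4*(1/4)^2 = 0.75
diverse = gini_impurity([3, 3, 3, 3])        # -> 0.75
```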
132
+ ### Narrative Overlap Score
133
+ - **Range**: 0 to 1
134
+ - This score measures the **cosine similarity** between the topic distributions of two users.
135
+ - A score of **1.0** means the two users have an identical proportional interest in topics (e.g., both are 100% focused on Topic 3).
136
+ - A score of **0.0** means their topic interests are completely different.
137
+ - This helps identify users with similar narrative focus, regardless of their total post count.
138
+
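A minimal sketch of how such a score can be computed, assuming each user is represented by a per-topic post-count vector (the app's internals may use probabilities instead of raw counts, but cosine similarity is scale-invariant either way):

```python
import math

def narrative_overlap(a, b):
    """Cosine similarity between two users' topic-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Identical proportions (one user simply posts 10x more) -> ~1.0
same_mix = narrative_overlap([1, 2, 0], [10, 20, 0])

# Completely disjoint topic interests -> 0.0
disjoint = narrative_overlap([5, 0], [0, 7])
```

Because the vectors are normalized, a prolific user and an occasional poster with the same topic mix still score 1.0 — which is exactly why the results table also reports post counts for context.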
requirements.txt CHANGED
@@ -1,3 +1,19 @@
1
- altair
2
- pandas
3
- streamlit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ streamlit>=1.17.0
2
+ bertopic[all]>=0.16.0
3
+ pandas>=2.0.0
4
+ numpy>=1.20.0
5
+ plotly>=5.0.0
6
+ transformers>=4.21.0
7
+ sentence-transformers>=2.2.0
8
+ scikit-learn>=1.0.0
9
+ hdbscan>=0.8.29
10
+ umap-learn>=0.5.0
11
+ torch>=1.11.0
12
+ matplotlib>=3.5.0
13
+ seaborn>=0.11.0
14
+ gensim>=4.3.0
15
+ nltk>=3.8.0
16
+ wordcloud>=1.9.0
17
+ emoji>=2.2.0
18
+ spacy>=3.4.0
19
+ pyinstaller
resource_path.py ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import sys
2
+ import os
3
+
4
+ def resource_path(relative_path):
5
+ """ Get absolute path to resource, works for dev and for PyInstaller """
6
+ try:
7
+ # PyInstaller creates a temp folder and stores path in _MEIPASS
8
+ base_path = sys._MEIPASS
9
+ except Exception:
10
+ base_path = os.path.abspath(".")
11
+
12
+ return os.path.join(base_path, relative_path)
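Outside a PyInstaller bundle (no `sys._MEIPASS`), the helper resolves against the current working directory. A self-contained illustration, restating the same logic inline with `getattr` instead of `try/except` (the path argument below is purely hypothetical):

```python
import os
import sys

def resource_path(relative_path):
    """Resolve a resource against the PyInstaller temp dir if bundled,
    otherwise against the current working directory."""
    base_path = getattr(sys, "_MEIPASS", os.path.abspath("."))
    return os.path.join(base_path, relative_path)

# In development mode the result is an absolute path under the CWD
p = resource_path("models/en_core_web_sm")  # hypothetical resource name
assert os.path.isabs(p)
```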
run.py ADDED
@@ -0,0 +1,42 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # run.py
2
+
3
+ import streamlit.web.cli as stcli
4
+ import os
5
+ import sys
6
+ from resource_path import resource_path # Import resource_path
7
+
8
+ def run_streamlit():
9
+ # Determine the correct base path at runtime
10
+ if hasattr(sys, '_MEIPASS'):
11
+ # In a PyInstaller bundle, the resource is in the temp folder
12
+ base_path = sys._MEIPASS
13
+ else:
14
+ # In development, the resource is in the current directory
15
+ base_path = os.path.abspath(os.path.dirname(__file__))
16
+
17
+ app_path = os.path.join(base_path, 'app.py')
18
+
19
+ # Debug aid: print the resolved app path so bundling issues are easy to diagnose
20
+ print(f"DEBUG: Calculated Streamlit app_path: {app_path}")
21
+
22
+ # Check if the file actually exists at the calculated path (for debugging the build)
23
+ if not os.path.exists(app_path):
24
+ print(f"FATAL: The file does NOT exist at the expected path: {app_path}")
25
+ # We can stop here and force the user to see the error
26
+ sys.exit(1)
27
+
28
+ # Set the command-line arguments for Streamlit
29
+ sys.argv = [
30
+ "streamlit",
31
+ "run",
32
+ app_path, # Use the correctly calculated path
33
+ "--server.port=8501",
34
+ "--server.headless=true",
35
+ "--global.developmentMode=false",
36
+ ]
37
+
38
+ # Run the Streamlit CLI
39
+ sys.exit(stcli.main())
40
+
41
+ if __name__ == "__main__":
42
+ run_streamlit()
sample_data.csv ADDED
@@ -0,0 +1,27 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ user_id,post_content,timestamp
2
+ user1,I love watching movies especially action and thriller films. The cinematography is amazing.,2023-01-01 10:00:00
3
+ user2,My new smartphone has incredible camera quality and battery life. Technology is advancing so fast.,2023-01-01 11:00:00
4
+ user1,Just finished watching a sci-fi movie. The special effects were mind-blowing and the story was captivating.,2023-01-02 10:30:00
5
+ user3,Learning about artificial intelligence and machine learning algorithms. The future of technology is fascinating.,2023-01-02 14:00:00
6
+ user2,Need to upgrade my old laptop. It's getting slow and can't handle modern software efficiently.,2023-01-03 09:00:00
7
+ user1,The soundtrack of that movie was incredible. Music really enhances the emotional impact of films.,2023-01-03 16:00:00
8
+ user4,Exploring the mysteries of space and astronomy. The universe is full of wonders waiting to be discovered.,2023-01-04 08:00:00
9
+ user3,Data science and predictive analytics are revolutionizing business intelligence and decision making processes.,2023-01-04 12:00:00
10
+ user2,Shopping for a new computer with better performance. Need something powerful for work and gaming.,2023-01-05 10:00:00
11
+ user1,Reading about quantum physics and theoretical concepts. Science fiction is becoming science fact.,2023-01-05 15:00:00
12
+ user5,Cooking is my passion. Today I experimented with spicy Thai cuisine and aromatic herbs.,2023-01-06 09:30:00
13
+ user4,The cosmos holds infinite mysteries. Black holes and dark matter continue to puzzle scientists worldwide.,2023-01-06 13:00:00
14
+ user3,Deep learning neural networks are achieving remarkable results in image recognition and natural language processing.,2023-01-07 11:00:00
15
+ user2,My laptop keeps crashing during important presentations. Definitely time for a hardware upgrade.,2023-01-07 14:30:00
16
+ user1,Science fiction films always make me contemplate the future of humanity and technological advancement.,2023-01-08 10:00:00
17
+ user6,Traveling to different countries and experiencing diverse cultures. Food and traditions vary so much globally.,2023-01-08 12:00:00
18
+ user5,Experimenting with fusion cuisine combining Asian and European cooking techniques. Flavors are incredible.,2023-01-09 09:00:00
19
+ user4,Studying astrophysics and cosmology. The scale of the universe is beyond human comprehension.,2023-01-09 14:00:00
20
+ user3,Machine learning models are becoming more sophisticated. Artificial neural networks mimic human brain functions.,2023-01-10 11:30:00
21
+ user6,Visited an art museum today. The paintings and sculptures were breathtaking and emotionally moving.,2023-01-10 16:00:00
22
+ user1,Watching classic films from the golden age of cinema. The storytelling techniques were masterful.,2023-01-11 10:15:00
23
+ user2,Finally bought a new gaming laptop with advanced graphics card and high-speed processor.,2023-01-11 13:45:00
24
+ user5,Learning traditional cooking methods from different cultures. Each region has unique culinary secrets.,2023-01-12 08:30:00
25
+ user4,Observing celestial objects through my telescope. Saturn's rings are absolutely magnificent tonight.,2023-01-12 20:00:00
26
+ user3,Working on a computer vision project using convolutional neural networks for object detection.,2023-01-13 09:15:00
27
+
start-streamlit.sh ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/bin/bash
2
+
3
+ # IMPORTANT: Replace the path below with the actual path to your miniconda/anaconda installation
4
+ # This ensures the 'conda' command is available to the script.
5
+ source /Users/mariamalmutairi/miniconda3/etc/profile.d/conda.sh
6
+
7
+ # Activate your specific Python environment
8
+ conda activate nlp
9
+
10
+ # Run the streamlit app. The '--server.headless true' flag is a good practice
11
+ # as it prevents Streamlit from opening a new browser tab on its own.
12
+ streamlit run app.py --server.headless true
text_preprocessor.py ADDED
@@ -0,0 +1,131 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import re
2
+ import string
3
+ import pandas as pd
4
+ import spacy
5
+ import emoji
6
+ from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
7
+ from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
8
+ from spacy.util import compile_infix_regex
9
+ from pathlib import Path
10
+
11
+ from resource_path import resource_path
12
+
13
+
14
+ class MultilingualPreprocessor:
15
+ """
16
+ A robust text preprocessor using spaCy for multilingual support.
17
+ """
18
+ def __init__(self, language: str):
19
+ """
20
+ Initializes the preprocessor and loads the appropriate spaCy model.
21
+
22
+ Args:
23
+ language (str): 'english' or 'multilingual'.
24
+ """
25
+ import sys
26
+
27
+ model_map = {
28
+ 'english': 'en_core_web_sm',
29
+ 'multilingual': 'xx_ent_wiki_sm'
30
+ }
31
+ self.model_name = model_map.get(language, 'xx_ent_wiki_sm')
32
+
33
+ try:
34
+ # Check if running from PyInstaller bundle
35
+ if hasattr(sys, '_MEIPASS'):
36
+ # PyInstaller mode: load from bundled path
37
+ model_path_obj = Path(resource_path(self.model_name))
38
+ self.nlp = spacy.util.load_model_from_path(model_path_obj)
39
+ else:
40
+ # Normal development mode: load by model name
41
+ self.nlp = spacy.load(self.model_name)
42
+
43
+ except OSError as e:
44
+ print(f"spaCy Model Error: Could not load model '{self.model_name}': {e}")
45
+ print(f"Please run: python -m spacy download {self.model_name}")
46
+ raise
47
+
48
+ # Customize tokenizer to not split on hyphens in words
49
+ # CONCAT_QUOTES must be wrapped in a list so it is treated as a single infix pattern
50
+ infixes = LIST_ELLIPSES + LIST_ICONS + [CONCAT_QUOTES]
51
+ infix_regex = compile_infix_regex(infixes)
52
+ self.nlp.tokenizer.infix_finditer = infix_regex.finditer
53
+
54
+ def preprocess_series(self, text_series: pd.Series, options: dict, n_process_spacy: int = -1) -> pd.Series:
55
+ """
56
+ Applies a series of cleaning steps to a pandas Series of text.
57
+
58
+ Args:
59
+ text_series (pd.Series): The text to be cleaned.
60
+ options (dict): A dictionary of preprocessing options.
61
+
62
+ Returns:
63
+ pd.Series: The cleaned text Series.
64
+ """
65
+ # --- Stage 1: Fast, Regex-based cleaning (combined for performance) ---
66
+ processed_text = text_series.copy().astype(str)
67
+
68
+ # Combine all regex patterns into a single pass for better performance
69
+ regex_patterns = []
70
+ if options.get("remove_html"):
71
+ regex_patterns.append(r"<.*?>")
72
+ if options.get("remove_urls"):
73
+ regex_patterns.append(r"http\S+|www\.\S+")
74
+ if options.get("handle_hashtags") == "Remove Hashtags":
75
+ regex_patterns.append(r"#\w+")
76
+ if options.get("handle_mentions") == "Remove Mentions":
77
+ regex_patterns.append(r"@\w+")
78
+
79
+ # Apply all regex replacements in a single pass
80
+ if regex_patterns:
81
+ combined_pattern = "|".join(regex_patterns)
82
+ processed_text = processed_text.str.replace(combined_pattern, "", regex=True)
83
+
84
+ # Emoji handling (separate as it needs special library)
85
+ emoji_option = options.get("handle_emojis", "Keep Emojis")
86
+ if emoji_option == "Remove Emojis":
87
+ processed_text = processed_text.apply(lambda s: emoji.replace_emoji(s, replace=''))
88
+ elif emoji_option == "Convert Emojis to Text":
89
+ processed_text = processed_text.apply(emoji.demojize)
90
+
91
+ # --- Stage 2: spaCy-based advanced processing ---
92
+ # Using nlp.pipe for efficiency on a Series
93
+ cleaned_docs = []
94
+ docs = self.nlp.pipe(processed_text, n_process=n_process_spacy, batch_size=500)
95
+
98
+ # Get custom stopwords and convert to lowercase set for fast lookups
99
+ custom_stopwords = {w.lower() for w in options.get("custom_stopwords", [])}
100
+
101
+ for doc in docs:
102
+ tokens = []
103
+ for token in doc:
104
+ # Punctuation and Number handling
105
+ if options.get("remove_punctuation") and token.is_punct:
106
+ continue
107
+ if options.get("remove_numbers") and (token.is_digit or token.like_num):
108
+ continue
109
+
110
+ # Stopword handling (including custom stopwords)
111
+ is_stopword = token.is_stop or token.text.lower() in custom_stopwords
112
+ if options.get("remove_stopwords") and is_stopword:
113
+ continue
114
+
115
+ # Use lemma if lemmatization is on, otherwise use the original text
116
+ token_text = token.lemma_ if options.get("lemmatize") else token.text
117
+
118
+ # Lowercasing (language-aware)
119
+ if options.get("lowercase"):
120
+ token_text = token_text.lower()
121
+
122
+ # Remove any leftover special characters or whitespace
123
+ if options.get("remove_special_chars"):
124
+ token_text = re.sub(r'[^\w\s-]', '', token_text)
125
+
126
+ if token_text.strip():
127
+ tokens.append(token_text.strip())
128
+
129
+ cleaned_docs.append(" ".join(tokens))
130
+
131
+ return pd.Series(cleaned_docs, index=text_series.index)
topic_evolution.py ADDED
@@ -0,0 +1,100 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import pandas as pd
2
+ from bertopic import BERTopic
3
+ from bertopic.representation import KeyBERTInspired
4
+
5
+
6
+ def analyze_general_topic_evolution(topic_model, docs, timestamps):
7
+ """
8
+ Analyzes general topic evolution over time.
9
+
10
+ Args:
11
+ topic_model: Trained BERTopic model.
12
+ docs (list): List of documents.
13
+ timestamps (list): List of timestamps corresponding to the documents.
14
+
15
+ Returns:
16
+ pd.DataFrame: DataFrame with topic evolution information.
17
+ """
18
+ try:
19
+ topics_over_time = topic_model.topics_over_time(docs, timestamps, global_tuning=True)
20
+ return topics_over_time
21
+ except Exception:
22
+ # Fallback for small datasets or cases where evolution can't be computed
23
+ return pd.DataFrame(columns=['Topic', 'Words', 'Frequency', 'Timestamp'])
24
+
25
+
26
+ def analyze_user_topic_evolution(df: pd.DataFrame, topic_model):
27
+ """
28
+ Analyzes topic evolution per user.
29
+
30
+ Args:
31
+ df (pd.DataFrame): DataFrame with "user_id", "post_content",
32
+ "timestamp", and "topic_id" columns.
33
+ topic_model: Trained BERTopic model.
34
+
35
+ Returns:
36
+ dict: A dictionary where keys are user_ids and values are DataFrames of topic evolution for that user.
37
+ """
38
+ user_topic_evolution = {}
39
+ for user_id in df["user_id"].unique():
40
+ user_df = df[df["user_id"] == user_id].copy()
41
+ if not user_df.empty and len(user_df) > 1:
42
+ try:
43
+ # Ensure timestamps are sorted for topics_over_time
44
+ user_df = user_df.sort_values(by="timestamp")
45
+ docs = user_df["post_content"].tolist()
46
+ timestamps = user_df["timestamp"].tolist()
47
+ selected_topics = user_df["topic_id"].tolist() # Get topic_ids for the user's posts
48
+ topics_over_time = topic_model.topics_over_time(docs, timestamps, topics=selected_topics, global_tuning=True)
49
+ user_topic_evolution[user_id] = topics_over_time
50
+ except Exception:
51
+ user_topic_evolution[user_id] = pd.DataFrame(columns=['Topic', 'Words', 'Frequency', 'Timestamp'])
52
+ else:
53
+ user_topic_evolution[user_id] = pd.DataFrame(columns=['Topic', 'Words', 'Frequency', 'Timestamp'])
54
+ return user_topic_evolution
55
+
56
+ if __name__ == "__main__":
57
+ # Example Usage:
58
+ data = {
59
+ "user_id": ["user1", "user2", "user1", "user3", "user2", "user1", "user4", "user3", "user2", "user1", "user5", "user4", "user3", "user2", "user1"],
60
+ "post_content": [
61
+ "This is a great movie, I loved the acting and the plot. It was truly captivating.",
62
+ "The new phone has an amazing camera and long battery life. Highly recommend it.",
63
+ "I enjoyed the film, especially the special effects and the soundtrack. A must-watch.",
64
+ "Learning about AI and machine learning is fascinating. The future is here.",
65
+ "My old phone is so slow, I need an upgrade soon. Thinking about the latest model.",
66
+ "The best part of the movie was the soundtrack and the stunning visuals. Very immersive.",
67
+ "Exploring the vastness of space is a lifelong dream. Astronomy is amazing.",
68
+ "Data science is revolutionizing industries. Predictive analytics is key.",
69
+ "I need a new laptop for work. Something powerful and portable.",
70
+ "Just finished reading a fantastic book on quantum physics. Mind-blowing concepts.",
71
+ "Cooking new recipes is my passion. Today, I tried a spicy Thai curry.",
72
+ "The universe is full of mysteries. Black holes and dark matter are intriguing.",
73
+ "Deep learning models are becoming incredibly sophisticated. Image recognition is impressive.",
74
+ "My current laptop is crashing frequently. Time for an upgrade.",
75
+ "Science fiction movies always make me think about the future of humanity."
76
+ ],
77
+ "timestamp": [
78
+ "2023-01-01 10:00:00", "2023-01-01 11:00:00", "2023-01-02 10:30:00",
79
+ "2023-01-02 14:00:00", "2023-01-03 09:00:00", "2023-01-03 16:00:00",
80
+ "2023-01-04 08:00:00", "2023-01-04 12:00:00", "2023-01-05 10:00:00",
81
+ "2023-01-05 15:00:00", "2023-01-06 09:30:00", "2023-01-06 13:00:00",
82
+ "2023-01-07 11:00:00", "2023-01-07 14:30:00", "2023-01-08 10:00:00"
83
+ ]
84
+ }
85
+ df = pd.DataFrame(data)
86
+ df["timestamp"] = pd.to_datetime(df["timestamp"])
87
+
88
+ print("Performing topic modeling (English)...")
89
+ from topic_modeling import perform_topic_modeling
+ model_en, topics_en, probs_en, coherence_en = perform_topic_modeling(df["post_content"].tolist(), language="english")
90
+ df["topic_id"] = topics_en
91
+
92
+ print("\nAnalyzing general topic evolution...")
93
+ general_evolution_df = analyze_general_topic_evolution(model_en, df["post_content"].tolist(), df["timestamp"].tolist())
94
+ print(general_evolution_df.head())
95
+
96
+ print("\nAnalyzing per user topic evolution...")
97
+ user_evolution_dict = analyze_user_topic_evolution(df, model_en)
98
+ for user_id, evolution_df in user_evolution_dict.items():
99
+ print(f"\nTopic evolution for {user_id}:")
100
+ print(evolution_df.head())
topic_modeling.py ADDED
@@ -0,0 +1,88 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # topic_modeling.py
2
+
3
+ import random
4
+ import pandas as pd
5
+ from bertopic import BERTopic
6
+ from gensim.corpora import Dictionary
7
+ from gensim.models import CoherenceModel
8
+ from nltk.tokenize import word_tokenize
9
+ from typing import List
10
+ from sklearn.feature_extraction.text import CountVectorizer
11
+
12
+ def perform_topic_modeling(
13
+ docs: List[str],
14
+ language: str = "english",
15
+ nr_topics=None,
16
+ remove_stopwords_bertopic: bool = False, # New parameter to control behavior
17
+ custom_stopwords: List[str] = None
18
+ ):
19
+ """
20
+ Performs topic modeling on a list of documents.
21
+
22
+ Args:
23
+ docs (List[str]): A list of documents. Stopwords should be INCLUDED for best results.
24
+ language (str): Language for the BERTopic model ('english', 'multilingual').
25
+ nr_topics: The number of topics to find ("auto" or an int).
26
+ remove_stopwords_bertopic (bool): If True, stopwords will be removed internally by BERTopic.
27
+ custom_stopwords (List[str]): A list of custom stopwords to use.
28
+
29
+ Returns:
30
+ tuple: BERTopic model, topics, probabilities, and coherence score.
31
+ """
32
+ vectorizer_model = None # Default to no custom vectorizer
33
+
34
+ if remove_stopwords_bertopic:
35
+ stop_words_list = []
36
+ if language == "english":
37
+ # Start with the built-in English stopword list from scikit-learn
38
+ from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
39
+ stop_words_list = list(ENGLISH_STOP_WORDS)
40
+
41
+ # Add any custom stopwords provided by the user
42
+ if custom_stopwords:
43
+ stop_words_list.extend(custom_stopwords)
44
+
45
+ # Only create a vectorizer if there's a list of stopwords to use
46
+ if stop_words_list:
47
+ vectorizer_model = CountVectorizer(stop_words=stop_words_list)
48
+
49
+ # Instantiate BERTopic, passing the vectorizer_model if it was created
50
+ if language == "multilingual":
51
+ topic_model = BERTopic(language="multilingual", nr_topics=nr_topics, vectorizer_model=vectorizer_model)
52
+ else:
53
+ topic_model = BERTopic(language=language, nr_topics=nr_topics, vectorizer_model=vectorizer_model)
54
+
55
+ # The 'docs' passed here should contain stopwords for the embedding model to work best
56
+ topics, probs = topic_model.fit_transform(docs)
57
+
58
+ # --- Calculate Coherence Score ---
59
+ # Sample documents for faster coherence calculation (a sample of up to 2000 docs typically gives a stable estimate)
60
+ max_coherence_docs = 2000
61
+ if len(docs) > max_coherence_docs:
62
+ sample_docs = random.sample(docs, max_coherence_docs)
63
+ else:
64
+ sample_docs = docs
65
+
66
+ tokenized_docs = [word_tokenize(doc) for doc in sample_docs]
67
+ dictionary = Dictionary(tokenized_docs)
68
+ corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
69
+ topic_words = topic_model.get_topics()
70
+ topics_for_coherence = []
71
+ for topic_id in sorted(topic_words.keys()):
72
+ if topic_id != -1:
73
+ words = [word for word, _ in topic_model.get_topic(topic_id)]
74
+ topics_for_coherence.append(words)
75
+ coherence_score = None
76
+ if topics_for_coherence and corpus:
77
+ try:
78
+ coherence_model = CoherenceModel(
79
+ topics=topics_for_coherence,
80
+ texts=tokenized_docs,
81
+ dictionary=dictionary,
82
+ coherence='c_v'
83
+ )
84
+ coherence_score = coherence_model.get_coherence()
85
+ except Exception as e:
86
+ print(f"Could not calculate coherence score: {e}")
87
+
88
+ return topic_model, topics, probs, coherence_score