DivYonko committed on
Commit
b8d79b9
·
1 Parent(s): 906e964

docs: add session changelog

Files changed (1)
  1. CHANGELOG.md +75 -139
CHANGELOG.md CHANGED
@@ -1,177 +1,113 @@
1
- # LivePulse — Development Changelog
2
- **Date:** April 14, 2026
3
- **Session summary:** Dashboard UX upgrades, multi-stream comparison, analytics features, performance optimizations, and bug fixes.
4
 
5
  ---
6
 
7
- ## Files Modified
8
 
9
- | File | Original Lines | Final Lines | Change |
10
- |------|---------------|-------------|--------|
11
- | `frontend/streamlit_app.py` | ~540 | 1354 | +814 |
12
- | `backend/scraper.py` | 115 | 135 | +20 |
13
- | `requirements.txt` | 22 | 35 | +13 |
14
 
15
  ---
16
 
17
- ## 1. Dashboard UX Upgrades (`frontend/streamlit_app.py`)
18
-
19
- ### 1.1 Sentiment Heatmap Over Time
20
- - Added `build_heatmap_data()` — buckets all messages into 1-minute intervals and counts Positive / Neutral / Negative per bucket
21
- - Rendered as a stacked bar chart (Plotly) showing mood volume over the full stream lifetime
22
- - Includes "View data" toggle and CSV export
23
-
24
- ### 1.2 Sentiment Velocity
25
- - Added `compute_velocity()` — compares positive ratio of last 20 messages vs previous 20
26
- - Displayed as a 5th stat card alongside cumulative counts
27
- - Three states: ↑ Rising (green), → Stable (yellow), ↓ Falling (red)
28
- - Shows delta percentage shift
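The window comparison described for sentiment velocity could look roughly like the sketch below. It is an illustration only, not the project's `compute_velocity()`: the name `velocity_sketch`, the list-of-labels input, and the 5-point `threshold` for the Stable band are all assumptions.

```python
def velocity_sketch(labels, window=20, threshold=5.0):
    """Compare the positive share of the last `window` labels with the previous `window`.

    `labels` is a list of sentiment strings, oldest first (assumed format).
    `threshold` is a hypothetical cut-off, in percentage points, for "Stable".
    """
    if len(labels) < 2 * window:
        return "Stable", 0.0  # not enough history to compare two full windows

    recent = labels[-window:]
    previous = labels[-2 * window:-window]

    def positive_pct(chunk):
        return 100.0 * sum(1 for s in chunk if s == "Positive") / len(chunk)

    delta = positive_pct(recent) - positive_pct(previous)
    if delta > threshold:
        return "Rising", delta
    if delta < -threshold:
        return "Falling", delta
    return "Stable", delta
```

The returned delta is the percentage-point shift that the stat card would display next to the arrow.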
29
-
30
- ### 1.3 Notification / Alert System
31
- - **Negative spike alert** — pulsing red banner when negative % in rolling window exceeds configurable threshold (default 40%)
32
- - **Spam surge alert** — separate orange banner when spam topic % exceeds configurable threshold (default 30%)
33
- - Both alerts are dismissable with a ✕ button and re-arm automatically when new messages arrive
34
- - Alert window size and thresholds configurable from sidebar sliders
35
-
36
- ### 1.4 Pinned Messages
37
- - Every message in the live feed has a 📍 pin button
38
- - Pinned messages appear in a dedicated "Pinned Messages" section above the feed with gold highlight styling
39
- - Individual unpin buttons per message
40
- - Sidebar shows pin count and a "Clear pins" button
41
- - Pin state persists across auto-refreshes via `st.session_state`
42
-
43
- ### 1.5 Multi-Stream Comparison (fully rebuilt)
44
- - Sidebar now manages up to **5 independent stream slots** (A–E), each with its own color, video ID field, Redis key field, and Start/Stop buttons
45
- - **+ Add stream / - Remove last** buttons to dynamically add/remove slots
46
- - Comparison section appears automatically when 2+ streams have data — no toggle needed
47
- - Renders sentiment bar charts in rows of 3
48
- - Overlay line chart shows rolling positive % for all active streams on the same axis
49
- - Fixed Streamlit widget re-render bug: widget keys used as single source of truth instead of `value=` overrides
50
 
51
  ---
52
 
53
- ## 2. Analytics & Insights Features (`frontend/streamlit_app.py`)
54
-
55
- ### 2.1 Engagement Score
56
- - `compute_engagement()` composite 0–100 score from:
57
- - Message rate (msgs/min) — 40% weight
58
- - Positive ratio — 40% weight
59
- - Question density — 20% weight
60
- - Displayed as a large score card with a fill bar and grade (🔥 High / ⚡ Medium / 💤 Low)
61
- - Three supporting metric tiles: Msgs/min, Positive ratio, Question density
62
-
63
- ### 2.2 Top Contributors Leaderboard
64
- - `compute_top_contributors()` — ranks authors by message count, tracks per-author sentiment breakdown
65
- - Left panel: ranked list with 🥇🥈🥉 medals, progress bar, colored sentiment dots per author
66
- - Right panel: stacked horizontal bar chart showing sentiment % for top 5 authors
67
- - CSV export of full leaderboard
68
-
69
- ### 2.3 Word Cloud
70
- - `compute_word_freq()` — extracts top 60 words after removing stopwords (English + common Hinglish filler words)
71
- - Filterable by sentiment (All / Positive / Neutral / Negative) and topic
72
- - Renders word cloud image via `wordcloud` library using `wc.to_array()` directly (no matplotlib pipeline)
73
- - Top-20 frequency bar chart shown below the cloud
74
- - Falls back to bar chart only if `wordcloud` not installed
75
-
76
- ### 2.4 Spam Rate Alert
77
- - `check_spam_alert()` — monitors spam topic ratio in rolling window
78
- - Separate dismissable banner distinct from the negative sentiment alert
79
- - Configurable threshold and window from sidebar
80
 
81
  ---
82
 
83
- ## 3. Backend: Multi-Stream Scraper (`backend/scraper.py`)
 
 
 
84
 
85
- ### Changes
86
- - Added `argparse` CLI interface with two arguments:
87
- - `--video_id` — YouTube video ID to scrape (defaults to `config.py` value)
88
- - `--redis_key` — Redis list key to write messages to (defaults to `chat_messages`)
89
- - `run()` function now accepts `video_id` and `redis_key` as parameters instead of reading globals
90
- - Redis connection moved inside `run()` so each scraper instance is fully independent
91
- - Each stream writes to its own Redis key, enabling true parallel multi-stream operation
92
 
93
- **Usage:**
94
- ```bash
95
- # Stream A (default)
96
- python -m backend.scraper --video_id ABC123 --redis_key chat_messages
97
 
98
- # Stream B
99
- python -m backend.scraper --video_id XYZ789 --redis_key chat_messages_b
100
 
101
- # Stream C
102
- python -m backend.scraper --video_id DEF456 --redis_key chat_messages_c
103
- ```
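A minimal sketch of how such a CLI might be wired with `argparse` follows. The flag names and the `chat_messages` default come from the notes above. `DEFAULT_VIDEO_ID`, `parse_args`, and the body of `run` are placeholders, and the real default video ID is said to come from `config.py`.

```python
import argparse

# Placeholder: the changelog says the real default is read from config.py.
DEFAULT_VIDEO_ID = "ABC123"


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Scrape one YouTube live chat into one Redis list"
    )
    parser.add_argument("--video_id", default=DEFAULT_VIDEO_ID,
                        help="YouTube video ID to scrape")
    parser.add_argument("--redis_key", default="chat_messages",
                        help="Redis list key to write messages to")
    return parser.parse_args(argv)


def run(video_id, redis_key):
    # A real run() would open its own Redis connection here, so that
    # each scraper process stays independent of the others.
    return f"scraping {video_id} -> {redis_key}"


# Explicit argv for illustration; pass nothing to read sys.argv instead.
args = parse_args(["--video_id", "XYZ789", "--redis_key", "chat_messages_b"])
print(run(args.video_id, args.redis_key))
```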
 
104
 
105
  ---
106
 
107
- ## 4. Performance Optimizations (`frontend/streamlit_app.py`)
 
 
 
108
 
109
- ### 4.1 Redis Read Deduplication
110
- - `load_stream_data("chat_messages")` called **once** per refresh cycle
111
- - Windowed slice (`data = all_data[-msg_limit:]`) derived in-memory instead of a second Redis read
112
- - Multi-stream comparison reuses cached data instead of calling `load_stream_data` twice per stream
113
 
114
- ### 4.2 `st.cache_data` on Heavy Functions
115
- | Function | TTL | Benefit |
116
- |----------|-----|---------|
117
- | `load_stream_data()` | 5s | Prevents redundant Redis reads within same refresh |
118
- | `compute_velocity()` | 10s | Skips recompute if data unchanged |
119
- | `build_heatmap_data()` | 10s | Skips full groupby on every refresh |
120
- | `compute_engagement()` | 10s | Skips recompute if data unchanged |
121
- | `compute_top_contributors()` | 10s | Skips recompute if data unchanged |
122
- | `compute_word_freq()` | 10s | Skips word counting on every refresh |
123
 
124
- ### 4.3 Cache-Compatible Function Signatures
125
- - `compute_velocity()` and `build_heatmap_data()` refactored to accept JSON strings instead of DataFrames — `st.cache_data` requires hashable arguments and DataFrames are not hashable
126
 
127
- ### 4.4 DataFrame Construction
128
- - `all_df` built once from `all_data`, `df` sliced from it — no duplicate parsing
 
 
129
 
130
  ---
131
 
132
- ## 5. Bug Fixes
 
 
 
133
 
134
- ### 5.1 Multi-Stream Widget Re-render Bug
135
- - **Problem:** `st.text_input(value=stream["video_id"])` was resetting the field to the old value on every Streamlit rerun, so video IDs typed for Stream B/C were wiped before the Start button handler could read them
136
- - **Fix:** Widget keys (`vid_0`, `rkey_0`, etc.) initialized once via `st.session_state[key] = ...` and used as the sole source of truth. `value=` parameter removed entirely.
 
 
 
137
 
138
- ### 5.2 Active Stream Detection
139
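- - **Problem:** `r.exists(key)` returns an integer count (0 or 1 for a single key), not a bool, and returns 1 for any existing key regardless of its type or contents, so it says nothing about whether the key holds message data (Redis itself deletes a list once it becomes empty)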
140
- - **Fix:** Changed to `r.llen(key) > 0` which correctly checks for actual message data
141
 
142
- ### 5.3 WordCloud Crash
143
- - **Problem:** `background_color="transparent"` is not a valid PIL color specifier, causing `ValueError: unknown color specifier: 'transparent'`
144
- - **Fix:** Changed to `background_color="white"` and render via `wc.to_array()` directly — removes the matplotlib pipeline entirely
145
 
146
- ### 5.4 Streamlit Deprecation Warning
147
- - **Problem:** `use_container_width=True/False` deprecated, removed after 2025-12-31
148
- - **Fix:** All 21 occurrences replaced with `width='stretch'` / `width='content'`
 
149
 
150
  ---
151
 
152
- ## 6. Dependencies Added (`requirements.txt`)
153
 
154
- ```
155
- matplotlib
156
- wordcloud
157
- ```
 
 
158
 
159
  ---
160
 
161
- ## Architecture Overview (Post-Session)
162
-
163
- ```
164
- Redis
165
- ├── chat_messages ← Stream A scraper writes here
166
- ├── chat_messages_b ← Stream B scraper writes here
167
- ├── chat_messages_c ← Stream C scraper writes here
168
- ├── chat_messages_d ← Stream D scraper writes here
169
- ├── chat_messages_e ← Stream E scraper writes here
170
- └── video_title ← Stream A title for page header
171
-
172
- backend/scraper.py ← One process per stream, --video_id + --redis_key args
173
- backend/main.py ← FastAPI REST API (reads from chat_messages)
174
- frontend/streamlit_app.py ← Dashboard (reads from all active Redis keys)
175
- ml/sentiment_model.py ← 3-model ensemble (MuRIL + XLM-R + Multilingual)
176
- ml/topic_model.py ← Keyword fast-path + BART zero-shot fallback
177
- ```
 
1
+ # LivePulse — Session Changelog
2
+ **Date:** April 16, 2026
3
+ **Session:** HF Spaces Deployment Debugging & Fixes
4
 
5
  ---
6
 
7
+ ## Summary
8
 
9
+ This session was entirely focused on getting the deployed LivePulse app on Hugging Face Spaces (`huggingface.co/spaces/Divyonko/LivePulse`) to actually work end-to-end — from scraping YouTube live chat to displaying analytics in the dashboard.
10
 
11
  ---
12
 
13
+ ## Issues Found & Fixed (in order)
14
+
15
+ ### 1. Missing `return None` in `_get_live_chat_id`
16
+ **File:** `app.py`
17
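+ **Problem:** The `except` block in `_get_live_chat_id` was missing `return None`, so after an exception execution fell through past the handler instead of exiting with an explicit failure value. Python then either runs whatever follows the `try`/`except` or returns `None` implicitly, which left the failure path unclear.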
18
+ **Fix:** Added explicit `return None` in the `except` block.
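As an illustration of the pattern (not the actual `_get_live_chat_id`), the sketch below returns `None` explicitly from the handler. The injectable `fetch` callable and the function name are assumptions made so the example runs offline. The response path `items[0].liveStreamingDetails.activeLiveChatId` is the field YouTube Data API v3 exposes for a video's live chat, though how the app parses it is not shown in this changelog.

```python
import logging

logger = logging.getLogger("livepulse.sketch")


def get_live_chat_id_sketch(fetch):
    """Return the live chat ID, or None if anything goes wrong.

    `fetch` is a hypothetical callable that returns the parsed API
    response as a dict.
    """
    try:
        payload = fetch()
        return payload["items"][0]["liveStreamingDetails"]["activeLiveChatId"]
    except Exception as exc:
        logger.error("Could not resolve live chat ID: %s", exc)
        return None  # explicit, so callers can test `is None` for failure
```

With the explicit return, a caller can write `if chat_id is None:` and know that every failure path ends there.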
19
 
20
  ---
21
 
22
+ ### 2. No logging output visible in HF Spaces logs
23
+ **File:** `app.py`
24
+ **Problem:** Python's root logger defaults to the WARNING level, so every `logger.info()` call was silently dropped and nothing useful appeared in the logs.
25
+ **Fix:** Added `logging.basicConfig(level=logging.INFO, force=True)` so all INFO and above messages appear in HF Spaces logs.
26
 
27
  ---
28
 
29
+ ### 3. Torchvision warnings flooding the logs
30
+ **File:** `Dockerfile`
31
+ **Problem:** Streamlit's file watcher scans all imported modules including `transformers`, which tries to import `torchvision` (not installed). This produced hundreds of `ModuleNotFoundError: No module named 'torchvision'` lines, making real errors impossible to find.
32
+ **Fix:** Added `ENV STREAMLIT_SERVER_FILE_WATCHER_TYPE=none` to the Dockerfile to disable the file watcher entirely.
33
 
34
+ ---
35
 
36
+ ### 4. Improved HTTP error logging in `_get_live_chat_id`
37
+ **File:** `app.py`
38
+ **Problem:** The generic `except Exception` handler swallowed the actual YouTube API error body (e.g. "API key invalid", "quota exceeded").
39
+ **Fix:** Added a separate `urllib.error.HTTPError` handler that reads and logs the full error response body, making API failures immediately diagnosable.
40
 
41
+ ---
 
42
 
43
+ ### 5. API key presence logging
44
+ **File:** `app.py`
45
+ **Problem:** There was no way to confirm whether the `YOUTUBE_API_KEY` secret was actually being read from the HF Spaces environment.
46
+ **Fix:** Added `logger.info("YOUTUBE_API_KEY present: %s (length=%d)", ...)` at scraper thread start.
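One way to log presence without leaking the secret is sketched below. The arguments passed to `logger.info` are assumptions, since the changelog elides them, and the function name and `env` parameter are hypothetical.

```python
import logging
import os

logger = logging.getLogger("livepulse.sketch")


def log_api_key_presence(env=os.environ):
    """Log whether the key is set and how long it is, never its value."""
    key = env.get("YOUTUBE_API_KEY", "")
    logger.info("YOUTUBE_API_KEY present: %s (length=%d)", bool(key), len(key))
    return bool(key), len(key)
```

Logging only a boolean and a length is enough to tell a missing secret apart from a truncated one.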
47
 
48
  ---
49
 
50
+ ### 6. Chat message fetch logging
51
+ **File:** `app.py`
52
+ **Problem:** No confirmation that `liveChat/messages` API calls were succeeding.
53
+ **Fix:** Added `logger.info("Fetched %d chat messages ...")` after each successful API poll.
54
 
55
+ ---
 
 
 
56
 
57
+ ### 7. `@st.cache_data` on `load_stream_data` returning stale empty results
58
+ **File:** `app.py`
59
+ **Problem:** `load_stream_data` was decorated with `@st.cache_data(ttl=5)`. The cache key was just `redis_key` (a constant string), so it cached the first result (an empty list) and kept returning it even after the scraper had written messages. An attempted fix using a `_store_len` cache-busting parameter failed because Streamlit excludes parameters whose names start with `_` from the cache key.
60
+ **Fix:** Removed `@st.cache_data` entirely from `load_stream_data`. The store was in-memory at this point, so reading it directly on every rerun cost no I/O (after the move to SQLite in issue 10, each read is a query against a local file).
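The failed cache-busting attempt can be reproduced with a toy memoizer. The decorator below is not Streamlit code: it only mimics the documented rule that underscore-prefixed parameters are left out of the cache key, and it has no TTL. `toy_cache`, `STORE`, and `load_stream_data_toy` are hypothetical names.

```python
import functools
import inspect


def toy_cache(func):
    """Memoize on every argument except those named with a leading underscore."""
    sig = inspect.signature(func)
    cache = {}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        bound.apply_defaults()
        key = tuple(
            (name, value)
            for name, value in bound.arguments.items()
            if not name.startswith("_")  # excluded from the key
        )
        if key not in cache:
            cache[key] = func(*args, **kwargs)
        return cache[key]

    return wrapper


STORE = []  # stand-in for the message store


@toy_cache
def load_stream_data_toy(redis_key, _store_len=0):
    return list(STORE)


first = load_stream_data_toy("chat_messages", _store_len=len(STORE))
STORE.append("hello")
# _store_len changed from 0 to 1, but the key is still just redis_key,
# so the stale empty list comes back.
second = load_stream_data_toy("chat_messages", _store_len=len(STORE))
```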
 
 
 
 
 
61
 
62
+ ---
 
63
 
64
+ ### 8. Scraper thread blocking on ML inference for 60+ backlog messages
65
+ **File:** `app.py`
66
+ **Problem:** On startup, the YouTube API returns a backlog of 50-70 messages from the last few minutes. The scraper was running full ML inference (MuRIL + XLM-R + BART = 3 models × 60 messages = 180 forward passes on CPU) before writing a single message to the store. This took several minutes, during which the UI showed "No messages yet" and users kept clicking Start again, killing and restarting the thread.
67
+ **Fix:** Added `is_first_page` flag. On the first API page (backlog), messages are stored immediately with `Neutral/General` placeholder sentiment so the UI shows data within seconds. Full ML inference only runs on subsequent pages (new live messages, typically 5-15 at a time).
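A sketch of the first-page shortcut follows. The `Neutral`/`General` placeholders come from the fix described above. The function name, the shape of `pages`, the `classify` callable, and the list-based `store` are assumptions.

```python
def process_pages(pages, classify, store):
    """Store backlog messages immediately and classify only later pages.

    `pages` is an iterable of lists of message texts, in API order.
    `classify` is a hypothetical callable returning (sentiment, topic).
    """
    is_first_page = True
    for page in pages:
        for text in page:
            if is_first_page:
                # Backlog: skip inference so the UI has data within seconds.
                sentiment, topic = "Neutral", "General"
            else:
                sentiment, topic = classify(text)
            store.append({"text": text, "sentiment": sentiment, "topic": topic})
        is_first_page = False  # every later page gets full inference
    return store
```

In this sketch the flag flips after the first page even if that page is empty, so a quiet stream does not delay inference on the first real messages.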
68
 
69
  ---
70
 
71
+ ### 9. Per-message ML inference error logging
72
+ **File:** `app.py`
73
+ **Problem:** If `predict_sentiment` or `predict_topic` threw an exception for a specific message, it was silently caught by `_safe_sentiment`/`_safe_topic` with no indication of which message failed or why.
74
+ **Fix:** Added explicit `try/except` with `logger.error("ML inference failed for text=%r: %s", ...)` around each message's inference call in the scraper loop.
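The per-message wrapper might look like the sketch below. The log format string is the one quoted in the fix. The function name, the injected `predict_sentiment` and `predict_topic` callables, and the `Neutral`/`General` fallback labels are assumptions.

```python
import logging

logger = logging.getLogger("livepulse.sketch")


def infer_with_logging(text, predict_sentiment, predict_topic):
    """Run both models on one message, logging which text failed and why."""
    try:
        return predict_sentiment(text), predict_topic(text)
    except Exception as exc:
        logger.error("ML inference failed for text=%r: %s", text, exc)
        return "Neutral", "General"  # assumed fallback labels
```

Using `%r` for the text keeps whitespace and unusual characters visible in the log line, which helps when the failure is caused by the input itself.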
75
 
76
+ ---
77
+
78
+ ### 10. Root cause: In-memory store not shared across Streamlit worker processes
79
+ **File:** `app.py`
80
+ **Problem:** This was the fundamental bug causing "No messages yet" despite the scraper working correctly. HF Spaces runs Streamlit with multiple worker processes. The scraper thread ran in worker process A and wrote to `_STORE` (a Python `dict` in that process's RAM). Browser requests were served by worker process B, which had its own separate empty `_STORE`. The two processes never shared memory — the UI always saw zero messages regardless of how many the scraper had collected.
81
+ **Fix:** Replaced the entire in-memory `deque`-based store with **SQLite** at `/tmp/livepulse.db`. SQLite is a file on disk that all worker processes in the container share. The scraper writes to it; any worker serving the UI reads from the same file. All store functions (`store_rpush`, `store_lrange`, `store_llen`, `store_delete`) were rewritten to use SQLite queries with a threading lock.
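A minimal sketch of a SQLite-backed store with the four function names listed above follows. The table schema, the JSON payload encoding, the function signatures, and the index handling in `store_lrange` are assumptions. The sketch writes to a file in the system temp directory rather than `/tmp/livepulse.db`.

```python
import json
import os
import sqlite3
import tempfile
import threading
from contextlib import closing

# The changelog uses /tmp/livepulse.db; this sketch uses its own file.
DB_PATH = os.path.join(tempfile.gettempdir(), "livepulse_sketch.db")
_LOCK = threading.Lock()


def _connect():
    conn = sqlite3.connect(DB_PATH, timeout=10)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "key TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    return conn


def store_rpush(key, item):
    with _LOCK, closing(_connect()) as conn:
        conn.execute(
            "INSERT INTO messages (key, payload) VALUES (?, ?)",
            (key, json.dumps(item)),
        )
        conn.commit()


def store_lrange(key, start=0, end=-1):
    """Return items start..end inclusive; negative indexes count from the tail."""
    with _LOCK, closing(_connect()) as conn:
        rows = conn.execute(
            "SELECT payload FROM messages WHERE key = ? ORDER BY id", (key,)
        ).fetchall()
    items = [json.loads(row[0]) for row in rows]
    n = len(items)
    if start < 0:
        start = max(n + start, 0)
    if end < 0:
        end = n + end
    return items[start:end + 1]


def store_llen(key):
    with _LOCK, closing(_connect()) as conn:
        return conn.execute(
            "SELECT COUNT(*) FROM messages WHERE key = ?", (key,)
        ).fetchone()[0]


def store_delete(key):
    with _LOCK, closing(_connect()) as conn:
        conn.execute("DELETE FROM messages WHERE key = ?", (key,))
        conn.commit()
```

A `threading.Lock` only serializes threads inside one process. Coordination between separate processes relies on SQLite's own file locking, which is why the sketch passes a `timeout` to `sqlite3.connect`.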
82
 
83
+ ---
 
 
84
 
85
+ ## Files Changed
 
 
86
 
87
+ | File | Changes |
88
+ |------|---------|
89
+ | `app.py` | SQLite store, logging setup, backlog fix, cache removal, HTTP error handling, `return None` fix |
90
+ | `Dockerfile` | Added `STREAMLIT_SERVER_FILE_WATCHER_TYPE=none` |
91
 
92
  ---
93
 
94
+ ## What Was NOT Changed
95
 
96
+ - All dashboard features preserved: charts, alerts, word cloud, engagement score, leaderboard, multi-stream comparison, pinned messages, sentiment heatmap, topic distribution, confidence trend, CSV export
97
+ - ML models unchanged: MuRIL + XLM-R + BART ensemble still runs on new messages
98
+ - YouTube Data API v3 scraper logic unchanged
99
+ - `requirements.txt` unchanged
100
+ - `.gitattributes` (Git LFS for model weights) unchanged
101
+ - `README.md` unchanged
102
 
103
  ---
104
 
105
+ ## Current State
106
+
107
+ The app is fully functional on HF Spaces:
108
+ - Scraper fetches YouTube live chat via YouTube Data API v3
109
+ - API key read from HF Spaces secret `YOUTUBE_API_KEY`
110
+ - Backlog messages stored immediately on start (with placeholder sentiment)
111
+ - New messages processed with full ML inference
112
+ - SQLite ensures scraper and UI share data across all worker processes
113
+ - Dashboard displays all analytics once messages are in the store