DivYonko committed on Apr 16

Commit

b8d79b9

1 Parent(s): 906e964

docs: add session changelog

Files changed (1)

CHANGELOG.md +75 -139

@@ -1,177 +1,113 @@
-# LivePulse — Development Changelog
-**Date:** April 14, 2026
-**Session summary:** Dashboard UX upgrades, multi-stream comparison, analytics features, performance optimizations, and bug fixes.
 ---
-## Files Modified
-| File | Original Lines | Final Lines | Change |
-|------|---------------|-------------|--------|
-| `frontend/streamlit_app.py` | ~540 | 1354 | +814 |
-| `backend/scraper.py` | 115 | 135 | +20 |
-| `requirements.txt` | 22 | 35 | +13 |
 ---
-## 1. Dashboard UX Upgrades (`frontend/streamlit_app.py`)
-### 1.1 Sentiment Heatmap Over Time
-- Added `build_heatmap_data()` — buckets all messages into 1-minute intervals and counts Positive / Neutral / Negative per bucket
-- Rendered as a stacked bar chart (Plotly) showing mood volume over the full stream lifetime
-- Includes "View data" toggle and CSV export
-### 1.2 Sentiment Velocity
-- Added `compute_velocity()` — compares positive ratio of last 20 messages vs previous 20
-- Displayed as a 5th stat card alongside cumulative counts
-- Three states: ↑ Rising (green), → Stable (yellow), ↓ Falling (red)
-- Shows delta percentage shift
-### 1.3 Notification / Alert System
-- **Negative spike alert** — pulsing red banner when negative % in rolling window exceeds configurable threshold (default 40%)
-- **Spam surge alert** — separate orange banner when spam topic % exceeds configurable threshold (default 30%)
-- Both alerts are dismissable with a ✕ button and re-arm automatically when new messages arrive
-- Alert window size and thresholds configurable from sidebar sliders
-### 1.4 Pinned Messages
-- Every message in the live feed has a 📍 pin button
-- Pinned messages appear in a dedicated "Pinned Messages" section above the feed with gold highlight styling
-- Individual unpin buttons per message
-- Sidebar shows pin count and a "Clear pins" button
-- Pin state persists across auto-refreshes via `st.session_state`
-### 1.5 Multi-Stream Comparison (fully rebuilt)
-- Sidebar now manages up to **5 independent stream slots** (A–E), each with its own color, video ID field, Redis key field, and Start/Stop buttons
-- **＋ Add stream / － Remove last** buttons to dynamically add/remove slots
-- Comparison section appears automatically when 2+ streams have data — no toggle needed
-- Renders sentiment bar charts in rows of 3
-- Overlay line chart shows rolling positive % for all active streams on the same axis
-- Fixed Streamlit widget re-render bug: widget keys used as single source of truth instead of `value=` overrides
 ---
-## 2. Analytics & Insights Features (`frontend/streamlit_app.py`)
-### 2.1 Engagement Score
-- `compute_engagement()` — composite 0–100 score from:
-  - Message rate (msgs/min) — 40% weight
-  - Positive ratio — 40% weight
-  - Question density — 20% weight
-- Displayed as a large score card with a fill bar and grade (🔥 High / ⚡ Medium / 💤 Low)
-- Three supporting metric tiles: Msgs/min, Positive ratio, Question density
-### 2.2 Top Contributors Leaderboard
-- `compute_top_contributors()` — ranks authors by message count, tracks per-author sentiment breakdown
-- Left panel: ranked list with 🥇🥈🥉 medals, progress bar, colored sentiment dots per author
-- Right panel: stacked horizontal bar chart showing sentiment % for top 5 authors
-- CSV export of full leaderboard
-### 2.3 Word Cloud
-- `compute_word_freq()` — extracts top 60 words after removing stopwords (English + common Hinglish filler words)
-- Filterable by sentiment (All / Positive / Neutral / Negative) and topic
-- Renders word cloud image via `wordcloud` library using `wc.to_array()` directly (no matplotlib pipeline)
-- Top-20 frequency bar chart shown below the cloud
-- Falls back to bar chart only if `wordcloud` not installed
-### 2.4 Spam Rate Alert
-- `check_spam_alert()` — monitors spam topic ratio in rolling window
-- Separate dismissable banner distinct from the negative sentiment alert
-- Configurable threshold and window from sidebar
 ---
-## 3. Backend: Multi-Stream Scraper (`backend/scraper.py`)
-### Changes
-- Added `argparse` CLI interface with two arguments:
-  - `--video_id` — YouTube video ID to scrape (defaults to `config.py` value)
-  - `--redis_key` — Redis list key to write messages to (defaults to `chat_messages`)
-- `run()` function now accepts `video_id` and `redis_key` as parameters instead of reading globals
-- Redis connection moved inside `run()` so each scraper instance is fully independent
-- Each stream writes to its own Redis key, enabling true parallel multi-stream operation
-**Usage:**
-```bash
-# Stream A (default)
-python -m backend.scraper --video_id ABC123 --redis_key chat_messages
-# Stream B
-python -m backend.scraper --video_id XYZ789 --redis_key chat_messages_b
-# Stream C
-python -m backend.scraper --video_id DEF456 --redis_key chat_messages_c
-```
 ---
-## 4. Performance Optimizations (`frontend/streamlit_app.py`)
-### 4.1 Redis Read Deduplication
-- `load_stream_data("chat_messages")` called **once** per refresh cycle
-- Windowed slice (`data = all_data[-msg_limit:]`) derived in-memory instead of a second Redis read
-- Multi-stream comparison reuses cached data instead of calling `load_stream_data` twice per stream
-### 4.2 `st.cache_data` on Heavy Functions
-| Function | TTL | Benefit |
-|----------|-----|---------|
-| `load_stream_data()` | 5s | Prevents redundant Redis reads within same refresh |
-| `compute_velocity()` | 10s | Skips recompute if data unchanged |
-| `build_heatmap_data()` | 10s | Skips full groupby on every refresh |
-| `compute_engagement()` | 10s | Skips recompute if data unchanged |
-| `compute_top_contributors()` | 10s | Skips recompute if data unchanged |
-| `compute_word_freq()` | 10s | Skips word counting on every refresh |
-### 4.3 Cache-Compatible Function Signatures
-- `compute_velocity()` and `build_heatmap_data()` refactored to accept JSON strings instead of DataFrames — `st.cache_data` requires hashable arguments and DataFrames are not hashable
-### 4.4 DataFrame Construction
-- `all_df` built once from `all_data`, `df` sliced from it — no duplicate parsing
 ---
-## 5. Bug Fixes
-### 5.1 Multi-Stream Widget Re-render Bug
-- **Problem:** `st.text_input(value=stream["video_id"])` was resetting the field to the old value on every Streamlit rerun, so video IDs typed for Stream B/C were wiped before the Start button handler could read them
-- **Fix:** Widget keys (`vid_0`, `rkey_0`, etc.) initialized once via `st.session_state[key] = ...` and used as the sole source of truth. `value=` parameter removed entirely.
-### 5.2 Active Stream Detection
-- **Problem:** `r.exists(key)` returns an integer (0 or 1), not a bool, and returns 1 for any existing key including empty lists
-- **Fix:** Changed to `r.llen(key) > 0` which correctly checks for actual message data
-### 5.3 WordCloud Crash
-- **Problem:** `background_color="transparent"` is not a valid PIL color specifier, causing `ValueError: unknown color specifier: 'transparent'`
-- **Fix:** Changed to `background_color="white"` and render via `wc.to_array()` directly — removes the matplotlib pipeline entirely
-### 5.4 Streamlit Deprecation Warning
-- **Problem:** `use_container_width=True/False` deprecated, removed after 2025-12-31
-- **Fix:** All 21 occurrences replaced with `width='stretch'` / `width='content'`
 ---
-## 6. Dependencies Added (`requirements.txt`)
-```
-matplotlib
-wordcloud
-```
 ---
-## Architecture Overview (Post-Session)
-```
-Redis
- ├── chat_messages        ← Stream A scraper writes here
- ├── chat_messages_b      ← Stream B scraper writes here
- ├── chat_messages_c      ← Stream C scraper writes here
- ├── chat_messages_d      ← Stream D scraper writes here
- ├── chat_messages_e      ← Stream E scraper writes here
- └── video_title          ← Stream A title for page header
-backend/scraper.py        ← One process per stream, --video_id + --redis_key args
-backend/main.py           ← FastAPI REST API (reads from chat_messages)
-frontend/streamlit_app.py ← Dashboard (reads from all active Redis keys)
-ml/sentiment_model.py     ← 3-model ensemble (MuRIL + XLM-R + Multilingual)
-ml/topic_model.py         ← Keyword fast-path + BART zero-shot fallback
-```

+# LivePulse — Session Changelog
+**Date:** April 16, 2026
+**Session:** HF Spaces Deployment Debugging & Fixes
 ---
+## Summary
+This session was entirely focused on getting the deployed LivePulse app on Hugging Face Spaces (`huggingface.co/spaces/Divyonko/LivePulse`) to actually work end-to-end — from scraping YouTube live chat to displaying analytics in the dashboard.
 ---
+## Issues Found & Fixed (in order)
+### 1. Missing `return None` in `_get_live_chat_id`
+**File:** `app.py`
+**Problem:** The `except` block in `_get_live_chat_id` was missing `return None`, meaning on an exception the function could fall through with undefined behavior.
+**Fix:** Added explicit `return None` in the `except` block.
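
A note on the behavior involved: a Python function that finishes without reaching a `return` statement already returns `None`, so the explicit `return None` mostly documents intent rather than changing the result. A minimal sketch, using hypothetical stand-ins (`lookup_chat_id_implicit`, `lookup_chat_id_explicit`, `failing_fetch`) rather than the real `_get_live_chat_id` body:

```python
def lookup_chat_id_implicit(fetch):
    """Hypothetical stand-in: the except block has no return statement."""
    try:
        return fetch()
    except Exception:
        pass  # falls off the end; Python returns None implicitly


def lookup_chat_id_explicit(fetch):
    """Same result, but the failure path is spelled out for the reader."""
    try:
        return fetch()
    except Exception:
        return None


def failing_fetch():
    # Simulates an API call that raises.
    raise RuntimeError("simulated API failure")
```

Both variants hand `None` back to the caller on failure, so callers should check for `None` either way.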
 ---
+### 2. No logging output visible in HF Spaces logs
+**File:** `app.py`
+**Problem:** Python's root logger defaults to WARNING level. All our `logger.info()` calls were silently dropped — nothing useful appeared in the logs.
+**Fix:** Added `logging.basicConfig(level=logging.INFO, force=True)` so all INFO and above messages appear in HF Spaces logs.
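
The fix above can be sketched as follows. The `force=True` argument (available since Python 3.8) matters because `basicConfig` otherwise does nothing when the root logger already has handlers. The format string and the logger name `livepulse` are assumptions for illustration, not taken from `app.py`:

```python
import logging

# force=True removes handlers already attached to the root logger, so this
# call takes effect even if another library configured logging first.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",  # assumed format
    force=True,
)

logger = logging.getLogger("livepulse")  # hypothetical logger name
logger.info("scraper thread starting")
```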
 ---
+### 3. Torchvision warnings flooding the logs
+**File:** `Dockerfile`
+**Problem:** Streamlit's file watcher scans all imported modules including `transformers`, which tries to import `torchvision` (not installed). This produced hundreds of `ModuleNotFoundError: No module named 'torchvision'` lines, making real errors impossible to find.
+**Fix:** Added `ENV STREAMLIT_SERVER_FILE_WATCHER_TYPE=none` to the Dockerfile to disable the file watcher entirely.
+---
+### 4. Improved HTTP error logging in `_get_live_chat_id`
+**File:** `app.py`
+**Problem:** Generic `except Exception` swallowed the actual YouTube API error body (e.g. "API key invalid", "quota exceeded").
+**Fix:** Added a separate `urllib.error.HTTPError` handler that reads and logs the full error response body, making API failures immediately diagnosable.
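
A minimal sketch of this pattern, assuming the request is made with `urllib.request`. The helper names `read_error_body` and `fetch_text` and the `example.invalid` URL are hypothetical. `HTTPError` has to be caught before the generic handler because it is itself an `Exception` subclass:

```python
import io
import logging
import urllib.error
import urllib.request

logger = logging.getLogger(__name__)


def read_error_body(err: urllib.error.HTTPError) -> str:
    """Return the response body carried by an HTTPError as text."""
    return err.read().decode("utf-8", errors="replace")


def fetch_text(url: str):
    """Hypothetical helper: return the response text, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # The body usually holds the API's own explanation of the failure.
        logger.error("HTTP %s from API: %s", err.code, read_error_body(err))
        return None
    except Exception as exc:
        logger.error("Request failed: %s", exc)
        return None


# Simulated 403 so the body extraction can be exercised without a network call.
simulated = urllib.error.HTTPError(
    "https://example.invalid", 403, "Forbidden", {},
    io.BytesIO(b'{"error": "quotaExceeded"}'),
)
body = read_error_body(simulated)
```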
+---
+### 5. API key presence logging
+**File:** `app.py`
+**Problem:** No way to confirm whether the `YOUTUBE_API_KEY` secret was actually being read from HF Spaces environment.
+**Fix:** Added `logger.info("YOUTUBE_API_KEY present: %s (length=%d)", ...)` at scraper thread start.
 ---
+### 6. Chat message fetch logging
+**File:** `app.py`
+**Problem:** No confirmation that `liveChat/messages` API calls were succeeding.
+**Fix:** Added `logger.info("Fetched %d chat messages ...")` after each successful API poll.
+---
+### 7. `@st.cache_data` on `load_stream_data` returning stale empty results
+**File:** `app.py`
+**Problem:** `load_stream_data` was decorated with `@st.cache_data(ttl=5)`. The cache key was just `redis_key` (a constant string), so it cached the first result (empty list) and kept returning it even after the scraper had written messages. Attempted fix with `_store_len` cache-busting parameter failed because Streamlit ignores parameters prefixed with `_` for hashing purposes.
+**Fix:** Removed `@st.cache_data` entirely from `load_stream_data`. Since the store is in-memory (later SQLite), there is zero I/O cost to reading it directly on every rerun.
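
To illustrate why the `_store_len` attempt could not work, here is a toy memoizer that mimics one documented behavior of `st.cache_data`: arguments whose names start with `_` are left out of the cache key. This is a simplified model, not Streamlit's implementation. It ignores TTL and only handles keyword arguments, and the module-level `store` list is a stand-in for the real message store:

```python
import functools


def toy_cache(func):
    """Toy memoizer: `_`-prefixed keyword arguments are excluded from the key."""
    results = {}

    @functools.wraps(func)
    def wrapper(**kwargs):
        key = tuple(sorted((k, v) for k, v in kwargs.items() if not k.startswith("_")))
        if key not in results:
            results[key] = func(**kwargs)
        return results[key]

    return wrapper


store = []  # stand-in for the message store


@toy_cache
def load_stream_data(redis_key, _store_len=0):
    return list(store)


first = load_stream_data(redis_key="chat_messages", _store_len=len(store))
store.append({"text": "hello"})
# _store_len changed from 0 to 1, but it is not part of the key,
# so the cached empty list is returned again.
second = load_stream_data(redis_key="chat_messages", _store_len=len(store))
```

Renaming the parameter without the leading underscore would have made it part of the key. Removing the cache, as this fix does, avoids the question altogether.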
+---
+### 8. Scraper thread blocking on ML inference for 60+ backlog messages
+**File:** `app.py`
+**Problem:** On startup, the YouTube API returns a backlog of 50-70 messages from the last few minutes. The scraper was running full ML inference (MuRIL + XLM-R + BART = 3 models × 60 messages = 180 forward passes on CPU) before writing a single message to the store. This took several minutes, during which the UI showed "No messages yet" and users kept clicking Start again, killing and restarting the thread.
+**Fix:** Added `is_first_page` flag. On the first API page (backlog), messages are stored immediately with `Neutral/General` placeholder sentiment so the UI shows data within seconds. Full ML inference only runs on subsequent pages (new live messages, typically 5-15 at a time).
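
The control flow described above can be sketched as follows. `process_pages`, `fake_classify`, and the list-based store are hypothetical stand-ins for the scraper loop, the ML ensemble, and the message store. Only the `is_first_page` flag and the `Neutral/General` placeholder come from the changelog:

```python
def process_pages(pages, classify, store):
    """Store the first (backlog) page with placeholders; classify later pages."""
    is_first_page = True
    for page in pages:
        for text in page:
            if is_first_page:
                # Backlog: skip inference so the UI has data within seconds.
                sentiment, topic = "Neutral", "General"
            else:
                sentiment, topic = classify(text)
            store.append({"text": text, "sentiment": sentiment, "topic": topic})
        is_first_page = False
    return store


calls = []


def fake_classify(text):
    # Records which messages reached the (expensive) inference path.
    calls.append(text)
    return "Positive", "Gameplay"


result = process_pages([["old 1", "old 2"], ["new 1"]], fake_classify, [])
```

One consequence of this design is that backlog messages keep their placeholder labels unless something re-scores them later.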
 ---
+### 9. Per-message ML inference error logging
+**File:** `app.py`
+**Problem:** If `predict_sentiment` or `predict_topic` threw an exception for a specific message, it was silently caught by `_safe_sentiment`/`_safe_topic` with no indication of which message failed or why.
+**Fix:** Added explicit `try/except` with `logger.error("ML inference failed for text=%r: %s", ...)` around each message's inference call in the scraper loop.
+---
+### 10. Root cause: In-memory store not shared across Streamlit worker processes
+**File:** `app.py`
+**Problem:** This was the fundamental bug causing "No messages yet" despite the scraper working correctly. HF Spaces runs Streamlit with multiple worker processes. The scraper thread ran in worker process A and wrote to `_STORE` (a Python `dict` in that process's RAM). Browser requests were served by worker process B, which had its own separate empty `_STORE`. The two processes never shared memory — the UI always saw zero messages regardless of how many the scraper had collected.
+**Fix:** Replaced the entire in-memory `deque`-based store with **SQLite** at `/tmp/livepulse.db`. SQLite is a file on disk that all worker processes in the container share. The scraper writes to it; any worker serving the UI reads from the same file. All store functions (`store_rpush`, `store_lrange`, `store_llen`, `store_delete`) were rewritten to use SQLite queries with a threading lock.
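
A minimal sketch of a SQLite-backed list store with the four function names mentioned above. The table schema, the `_run` helper, the column names, and the database path are assumptions for illustration. The real `app.py` uses `/tmp/livepulse.db` and may differ in every other detail:

```python
import json
import os
import sqlite3
import tempfile
import threading
from contextlib import closing

DB_PATH = os.path.join(tempfile.gettempdir(), "livepulse_sketch.db")  # hypothetical path
_LOCK = threading.Lock()
_SCHEMA = (
    "CREATE TABLE IF NOT EXISTS messages ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "list_key TEXT NOT NULL, payload TEXT NOT NULL)"
)


def _run(sql, params=()):
    """Run one statement under the lock, commit, and return any rows."""
    with _LOCK, closing(sqlite3.connect(DB_PATH, timeout=10)) as conn:
        conn.execute(_SCHEMA)
        rows = conn.execute(sql, params).fetchall()
        conn.commit()
        return rows


def store_rpush(key, item):
    _run("INSERT INTO messages (list_key, payload) VALUES (?, ?)", (key, json.dumps(item)))


def store_lrange(key, start=0, end=-1):
    """Return items start..end inclusive, in insertion order (end=-1 means last)."""
    rows = _run("SELECT payload FROM messages WHERE list_key = ? ORDER BY id", (key,))
    items = [json.loads(row[0]) for row in rows]
    stop = None if end == -1 else end + 1
    return items[start:stop]


def store_llen(key):
    return _run("SELECT COUNT(*) FROM messages WHERE list_key = ?", (key,))[0][0]


def store_delete(key):
    _run("DELETE FROM messages WHERE list_key = ?", (key,))


# Usage: start from a clean key, then append two messages.
store_delete("demo")
store_rpush("demo", {"text": "hi"})
store_rpush("demo", {"text": "yo"})
```

The `threading.Lock` only serializes threads inside one process. Coordination between separate processes relies on SQLite's own file locking, which is why the `timeout` argument is set on the connection.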
+---
+## Files Changed
+| File | Changes |
+|------|---------|
+| `app.py` | SQLite store, logging setup, backlog fix, cache removal, HTTP error handling, `return None` fix |
+| `Dockerfile` | Added `STREAMLIT_SERVER_FILE_WATCHER_TYPE=none` |
 ---
+## What Was NOT Changed
+- All dashboard features preserved: charts, alerts, word cloud, engagement score, leaderboard, multi-stream comparison, pinned messages, sentiment heatmap, topic distribution, confidence trend, CSV export
+- ML models unchanged: MuRIL + XLM-R + BART ensemble still runs on new messages
+- YouTube Data API v3 scraper logic unchanged
+- `requirements.txt` unchanged
+- `.gitattributes` (Git LFS for model weights) unchanged
+- `README.md` unchanged
 ---
+## Current State
+The app is fully functional on HF Spaces:
+- Scraper fetches YouTube live chat via YouTube Data API v3
+- API key read from HF Spaces secret `YOUTUBE_API_KEY`
+- Backlog messages stored immediately on start (with placeholder sentiment)
+- New messages processed with full ML inference
+- SQLite ensures scraper and UI share data across all worker processes
+- Dashboard displays all analytics once messages are in the store