# Musora Sentiment Analysis Dashboard
A Streamlit dashboard for visualising sentiment analysis results from **social media comments** (Facebook, Instagram, YouTube, Twitter) and the **Musora internal app** across brands (Drumeo, Pianote, Guitareo, Singeo, Musora).
---
## Table of Contents
1. [Project Structure](#project-structure)
2. [How Data Flows](#how-data-flows)
3. [Data Loading Strategy](#data-loading-strategy)
4. [Pages](#pages)
5. [Global Filters & Session State](#global-filters--session-state)
6. [Snowflake Queries](#snowflake-queries)
7. [Adding or Changing Things](#adding-or-changing-things)
8. [Running the App](#running-the-app)
9. [Configuration Reference](#configuration-reference)
---
## Project Structure
```
visualization/
├── app.py                        # Entry point: routing, sidebar, session state
├── config/
│   └── viz_config.json           # Colors, query strings, dashboard settings
├── data/
│   └── data_loader.py            # All Snowflake queries and caching logic
├── utils/
│   ├── data_processor.py         # Pandas aggregations (intent dist, content summary, etc.)
│   └── metrics.py                # KPI calculations (sentiment score, urgency, etc.)
├── components/
│   ├── dashboard.py              # Dashboard page renderer
│   ├── sentiment_analysis.py     # Sentiment Analysis page renderer
│   └── reply_required.py         # Reply Required page renderer
├── visualizations/
│   ├── sentiment_charts.py       # Plotly sentiment chart functions
│   ├── distribution_charts.py    # Plotly distribution / heatmap / scatter functions
│   ├── demographic_charts.py     # Plotly demographic chart functions
│   └── content_cards.py          # Streamlit card components (comment cards, content cards)
├── agents/
│   └── content_summary_agent.py  # AI analysis agent (OpenAI) for comment summarisation
├── img/
│   └── musora.png                # Sidebar logo
└── SnowFlakeConnection.py        # Snowflake connection wrapper (Snowpark session)
```
---
## How Data Flows
```
Snowflake
    │
    ▼
data_loader.py  ← Three separate loading modes (see below)
    │
    ├── load_dashboard_data() ──► st.session_state['dashboard_df']
    │                                 ├─► app.py sidebar (filter options, counts)
    │                                 └─► dashboard.py (all charts)
    │
    ├── load_sa_data() ──────────► st.session_state['sa_contents']
    │     (on-demand, button)         st.session_state['sa_comments']
    │                                 └─► sentiment_analysis.py
    │
    └── load_reply_required_data() ──► st.session_state['rr_df']
          (on-demand, button)             └─► reply_required.py
```
**Key principle:** Data is loaded as little as possible, as late as possible.
- The **Dashboard** uses a lightweight query (no text columns, no content join) cached for 24 hours.
- The **Sentiment Analysis** and **Reply Required** pages never load data automatically; they wait for the user to click **Fetch Data**.
- All data is stored in `st.session_state` so page navigation and widget interactions do not re-trigger Snowflake queries.
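The load-once pattern described above can be sketched with a plain dict standing in for `st.session_state` (the key name matches the real one; the `loader` callable is a hypothetical stand-in for the Snowflake query):

```python
# Sketch of the load-once pattern; a plain dict stands in for
# st.session_state, and `loader` represents an expensive Snowflake call.
def get_dashboard_df(session_state, loader):
    """Return the cached dashboard dataframe, querying only on first call."""
    if 'dashboard_df' not in session_state:
        session_state['dashboard_df'] = loader()
    return session_state['dashboard_df']
```

Because Streamlit re-runs the whole script on every widget interaction, keeping the dataframe in session state is what prevents repeated queries between reruns.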
---
## Data Loading Strategy
All loading logic lives in **`data/data_loader.py`** (`SentimentDataLoader` class).
### `load_dashboard_data()`
- Uses `dashboard_query` from `viz_config.json`.
- Fetches only: `comment_sk, content_sk, platform, brand, sentiment_polarity, intent, requires_reply, detected_language, comment_timestamp, processed_at, author_id`.
- No text columns and no `DIM_CONTENT` join, so it is significantly faster than the full query.
- Also merges demographics data if `demographics_query` is configured.
- Cached for **24 hours** (`@st.cache_data(ttl=86400)`).
- Called once by `app.py` at startup; result stored in `st.session_state['dashboard_df']`.
### `load_sa_data(platform, brand, top_n, min_comments, sort_by, sentiments, intents, date_range)`
- Runs **two** sequential Snowflake queries:
  1. **Content aggregation**: groups by `content_sk`, counts per sentiment, computes a severity score, and returns the top N.
  2. **Sampled comments**: for the top N `content_sk`s only, fetches up to 50 comments per sentiment group per content (negative, positive, other), using Snowflake `QUALIFY ROW_NUMBER()`. `display_text` is computed in SQL (`CASE WHEN IS_ENGLISH = FALSE AND TRANSLATED_TEXT IS NOT NULL THEN TRANSLATED_TEXT ELSE ORIGINAL_TEXT END`).
- Returns a tuple `(contents_df, comments_df)`.
- Cached for **24 hours**.
- Called only when the user clicks **Fetch Data** on the Sentiment Analysis page.
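The per-sentiment sampling in step 2 can be illustrated with a hypothetical query builder. The `QUALIFY ROW_NUMBER()` clause and the `display_text` CASE expression come from the description above; `SENTIMENT_GROUP` and the `ORDER BY` column are assumed names (the real SQL lives in `_build_sa_comments_query()`):

```python
# Illustrative only: the real query is built in _build_sa_comments_query().
# SENTIMENT_GROUP and the ORDER BY column are assumed names.
def build_sampled_comments_sql(content_sks, per_group=50):
    """Sample up to `per_group` comments per sentiment group per content."""
    sk_list = ", ".join(str(sk) for sk in content_sks)
    return f"""
SELECT *,
       CASE WHEN IS_ENGLISH = FALSE AND TRANSLATED_TEXT IS NOT NULL
            THEN TRANSLATED_TEXT ELSE ORIGINAL_TEXT END AS DISPLAY_TEXT
FROM COMMENT_SENTIMENT_FEATURES
WHERE CONTENT_SK IN ({sk_list})
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY CONTENT_SK, SENTIMENT_GROUP
    ORDER BY COMMENT_TIMESTAMP DESC
) <= {per_group}
"""
```

`QUALIFY` filters on the window function result without needing a subquery, which is why the sampling can happen in a single pass.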
### `load_reply_required_data(platforms, brands, date_range)`
- Runs a single query filtering `REQUIRES_REPLY = TRUE`.
- Dynamically includes/excludes the social media table and musora table based on selected platforms.
- `display_text` computed in SQL.
- Cached for **24 hours**.
- Called only when the user clicks **Fetch Data** on the Reply Required page.
### Important: SQL Column Qualification
Both the social media table (`COMMENT_SENTIMENT_FEATURES`) and the content dimension table (`DIM_CONTENT`) share column names. Any `WHERE` clause inside a query that joins these two tables **must** use the table alias prefix (e.g. `s.PLATFORM`, `s.COMMENT_TIMESTAMP`, `s.CHANNEL_NAME`) to avoid Snowflake `ambiguous column name` errors. The musora table (`MUSORA_COMMENT_SENTIMENT_FEATURES`) has no joins so unqualified column names are fine there.
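As a sketch, a correctly qualified filter on the joined query might look like this (table names come from the Data source tables section; the join key `CONTENT_SK` is an assumption):

```python
# Sketch of the qualification rule: every filter on the joined social
# media query carries the `s.` alias. The join condition is an assumption.
def build_social_from_where(platforms):
    plats = ", ".join(f"'{p}'" for p in platforms)
    return (
        "FROM SOCIAL_MEDIA_DB.ML_FEATURES.COMMENT_SENTIMENT_FEATURES s\n"
        "LEFT JOIN SOCIAL_MEDIA_DB.CORE.DIM_CONTENT c\n"
        "    ON s.CONTENT_SK = c.CONTENT_SK\n"
        f"WHERE s.PLATFORM IN ({plats})"  # s. prefix avoids ambiguity
    )
```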
---
## Pages
### Dashboard (`components/dashboard.py`)
**Receives:** `filtered_df` β€” the lightweight dashboard dataframe (after optional global filter applied by `app.py`).
**Does not need:** text, translations, content URLs. All charts work purely on aggregated columns (sentiment_polarity, brand, platform, intent, requires_reply, comment_timestamp).
**Key sections:**
- Summary stats + health indicator
- Sentiment distribution (pie + gauge)
- Sentiment by brand and platform (stacked + percentage bar charts)
- Intent analysis
- Brand-Platform heatmap
- Reply requirements + urgency breakdown
- Demographics (age, timezone, experience level), rendered only if `author_id` is present and demographics were merged
**To add a new chart:** create the chart function in `visualizations/` and call it from `render_dashboard()`. The function receives `filtered_df`.
---
### Sentiment Analysis (`components/sentiment_analysis.py`)
**Receives:** `data_loader` instance only (no dataframe).
**Flow:**
1. Reads `st.session_state['dashboard_df']` for filter option lists (platforms, brands, sentiments, intents).
2. Pre-populates platform/brand dropdowns from `st.session_state['global_filters']`.
3. Shows filter controls (platform, brand, sentiment, intent, top_n, min_comments, sort_by).
4. On **Fetch Data** click: calls `data_loader.load_sa_data(...)` and stores results in `st.session_state['sa_contents']` and `['sa_comments']`.
5. Renders content cards, per-content sentiment + intent charts, AI analysis buttons, and sampled comment expanders.
**Pagination:** `st.session_state['sentiment_page']` (5 contents per page). Reset on new fetch.
**Comments:** Sampled (up to 50 negative + 50 positive + 50 neutral per content). These are already in memory after the fetch, so no extra query is needed when the user expands a comment section.
**AI Analysis:** Uses `ContentSummaryAgent` (see `agents/`). Results cached in `st.session_state['content_summaries']`.
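The summary cache follows the same load-once shape; a sketch with a plain dict standing in for `st.session_state` and `agent` standing in for the expensive `ContentSummaryAgent`/OpenAI call:

```python
# Sketch of the per-content summary cache; `agent` is a hypothetical
# callable standing in for the OpenAI-backed ContentSummaryAgent.
def get_summary(session_state, content_sk, agent):
    cache = session_state.setdefault('content_summaries', {})
    if content_sk not in cache:
        cache[content_sk] = agent(content_sk)
    return cache[content_sk]
```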
---
### Reply Required (`components/reply_required.py`)
**Receives:** `data_loader` instance only.
**Flow:**
1. Reads `st.session_state['dashboard_df']` for filter option lists.
2. Pre-populates platform, brand, and date from `st.session_state['global_filters']`.
3. On **Fetch Data** click: calls `data_loader.load_reply_required_data(...)` and stores result in `st.session_state['rr_df']`.
4. Shows urgency breakdown, in-page view filters (priority, platform, brand, intent; applied in Python, no new query), paginated comment cards, and a "Reply by Content" summary.
**Pagination:** `st.session_state['reply_page']` (10 comments per page). Reset on new fetch.
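Both pages paginate in the same way, just with different page sizes (5 contents vs 10 comments) and different session-state keys. A minimal sketch of the slicing, assuming a 0-based page index:

```python
import math

# Sketch of the pagination both pages apply to already-fetched rows;
# in the real app the page index lives in st.session_state.
def paginate(items, page, per_page):
    """Return (rows for `page`, total page count); `page` is 0-based."""
    total_pages = max(1, math.ceil(len(items) / per_page))
    start = page * per_page
    return items[start:start + per_page], total_pages
```

Because the data is already in memory, changing pages is a pure slice and never touches Snowflake.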
---
## Global Filters & Session State
Global filters live in the sidebar (`app.py`) and are stored in `st.session_state['global_filters']` as a dict:
```python
{
    'platforms': ['facebook', 'instagram'],  # list or []
    'brands': ['drumeo'],
    'sentiments': [],
    'date_range': (date(2025, 1, 1), date(2025, 12, 31)),  # or None
}
```
- **Dashboard:** `app.py` applies global filters to `dashboard_df` using `data_loader.apply_filters()` and passes the result to `render_dashboard()`.
- **Sentiment Analysis / Reply Required:** global filters are used to pre-populate their own filter widgets. The actual Snowflake query uses those values when the user clicks Fetch. The pages do **not** receive a pre-filtered dataframe.
### Full session state key reference
| Key | Set by | Used by |
|-----|--------|---------|
| `dashboard_df` | `app.py` on startup | sidebar (filter options), dashboard, SA + RR (filter option lists) |
| `global_filters` | sidebar "Apply Filters" button | app.py (dashboard filter), SA + RR (pre-populate widgets) |
| `filters_applied` | sidebar buttons | app.py (whether to apply filters) |
| `sa_contents` | SA fetch button | SA page rendering |
| `sa_comments` | SA fetch button | SA page rendering |
| `sa_fetch_key` | SA fetch button | SA page (detect stale data) |
| `rr_df` | RR fetch button | RR page rendering |
| `rr_fetch_key` | RR fetch button | RR page (detect stale data) |
| `sentiment_page` | SA page / fetch | SA pagination |
| `reply_page` | RR page / fetch | RR pagination |
| `content_summaries` | AI analysis buttons | SA AI analysis display |
---
## Snowflake Queries
All query strings are either stored in `config/viz_config.json` (static queries) or built dynamically in `data/data_loader.py` (page-specific queries).
### Static queries (in `viz_config.json`)
| Key | Purpose |
|-----|---------|
| `query` | Full query with all columns (legacy, kept for compatibility) |
| `dashboard_query` | Lightweight query: no text, no `DIM_CONTENT` join |
| `demographics_query` | Joins `usora_users` with `preprocessed.users` to get age/timezone/experience |
### Dynamic queries (built in `data_loader.py`)
| Method | Description |
|--------|-------------|
| `_build_sa_content_query()` | Content aggregation for SA page; filters by platform + brand + date |
| `_build_sa_comments_query()` | Sampled comments for SA page; uses `QUALIFY ROW_NUMBER() <= 50` |
| `_build_rr_query()` | Reply-required comments; filters by platform/brand/date; conditionally includes social media and/or musora table |
### Data source tables
| Table | Platform | Notes |
|-------|----------|-------|
| `SOCIAL_MEDIA_DB.ML_FEATURES.COMMENT_SENTIMENT_FEATURES` | facebook, instagram, youtube, twitter | Needs `LEFT JOIN SOCIAL_MEDIA_DB.CORE.DIM_CONTENT` for `PERMALINK_URL` |
| `SOCIAL_MEDIA_DB.ML_FEATURES.MUSORA_COMMENT_SENTIMENT_FEATURES` | musora_app | Has `PERMALINK_URL` and `THUMBNAIL_URL` natively; platform stored as `'musora'`, mapped to `'musora_app'` in queries |
---
## Adding or Changing Things
### Add a new chart to the Dashboard
1. Write the chart function in the appropriate `visualizations/` file.
2. Call it from `render_dashboard()` in `components/dashboard.py`, passing `filtered_df`.
3. The chart function receives a lightweight df: it has no text columns but has all the columns listed in `dashboard_query`.
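As an example of step 1, a new chart function typically does a pure-pandas aggregation over `filtered_df` before handing the result to Plotly; the function name here is hypothetical:

```python
import pandas as pd

# Hypothetical chart helper: counts comments per brand/sentiment pair.
# Both columns are present in the lightweight dashboard_query result.
def sentiment_counts_by_brand(filtered_df):
    return (filtered_df
            .groupby(['brand', 'sentiment_polarity'])
            .size()
            .reset_index(name='count'))
```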
### Add a new filter to the Dashboard sidebar
1. Add the widget in `app.py` under the "Global Filters" section.
2. Store the selected value in the `global_filters` dict under `st.session_state`.
3. Pass it to `data_loader.apply_filters()`.
### Change what the Sentiment Analysis page queries
- Edit `_build_sa_content_query()` and/or `_build_sa_comments_query()` in `data_loader.py`.
- If you add new columns to the content aggregation result, also update `_process_sa_content_stats()` so they are available in `contents_df`.
- If you add new columns to the comments result, update `_process_sa_comments()`.
### Change what the Reply Required page queries
- Edit `_build_rr_query()` in `data_loader.py`.
- Remember: all column references inside the social media block (which has a `JOIN`) must be prefixed with `s.` to avoid Snowflake ambiguity errors.
### Change the cache duration
- `@st.cache_data(ttl=86400)` is set on `load_dashboard_data`, `_fetch_sa_data`, `_fetch_rr_data`, and `load_demographics_data`.
- Change `86400` (seconds) to the desired TTL, or set `ttl=None` for no expiry.
- Users can always force a refresh with the "Reload Data" button in the sidebar (which calls `st.cache_data.clear()` and deletes `st.session_state['dashboard_df']`).
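Conceptually, `@st.cache_data(ttl=...)` behaves like a TTL-keyed memoiser. A pure-Python sketch of the idea (not Streamlit's actual implementation, which also hashes unhashable arguments, dataframes, and more):

```python
import time

# Pure-Python sketch of a TTL cache analogous to @st.cache_data(ttl=86400).
def ttl_cache(ttl):
    def decorator(fn):
        store = {}
        def wrapper(*args):
            now = time.time()
            hit = store.get(args)
            if hit is None or now - hit[0] > ttl:
                store[args] = (now, fn(*args))  # recompute on miss or expiry
            return store[args][1]
        wrapper.clear = store.clear  # analogue of st.cache_data.clear()
        return wrapper
    return decorator
```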
### Add a new page
1. Create `components/new_page.py` with a `render_new_page(data_loader)` function.
2. Import and add a radio option in `app.py`.
3. If the page needs its own Snowflake data, add a `load_new_page_data()` method to `SentimentDataLoader` following the same pattern as `load_sa_data`.
### Add a new column to the Dashboard query
- Edit `dashboard_query` in `config/viz_config.json`.
- Both UNION branches must select the same columns in the same order.
- `_process_dashboard_dataframe()` in `data_loader.py` handles basic type casting; add processing there if needed.
---
## Running the App
```bash
# From the project root
streamlit run visualization/app.py
```
**Required environment variables** (in `.env` at project root):
```
SNOWFLAKE_USER
SNOWFLAKE_PASSWORD
SNOWFLAKE_ACCOUNT
SNOWFLAKE_ROLE
SNOWFLAKE_DATABASE
SNOWFLAKE_WAREHOUSE
SNOWFLAKE_SCHEMA
```
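A sketch of how the connection wrapper might validate and assemble these variables (the actual code lives in `SnowFlakeConnection.py`; the key-renaming scheme below is an assumption):

```python
import os

# Hypothetical helper mirroring what SnowFlakeConnection.py needs from
# the environment; the dict-key naming is an assumption.
REQUIRED_VARS = [
    'SNOWFLAKE_USER', 'SNOWFLAKE_PASSWORD', 'SNOWFLAKE_ACCOUNT',
    'SNOWFLAKE_ROLE', 'SNOWFLAKE_DATABASE', 'SNOWFLAKE_WAREHOUSE',
    'SNOWFLAKE_SCHEMA',
]

def snowflake_config():
    """Fail fast if any variable is unset, else build a config dict."""
    missing = [v for v in REQUIRED_VARS if not os.getenv(v)]
    if missing:
        raise RuntimeError(f"Missing env vars: {', '.join(missing)}")
    # e.g. SNOWFLAKE_USER -> 'user', the style Snowpark config dicts use
    return {v.split('_', 1)[1].lower(): os.environ[v] for v in REQUIRED_VARS}
```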
---
## Configuration Reference
`config/viz_config.json` controls:
| Section | What it configures |
|---------|-------------------|
| `color_schemes.sentiment_polarity` | Hex colors for each sentiment level |
| `color_schemes.intent` | Hex colors for each intent label |
| `color_schemes.platform` | Hex colors for each platform |
| `color_schemes.brand` | Hex colors for each brand |
| `sentiment_order` | Display order for sentiment categories in charts |
| `intent_order` | Display order for intent categories |
| `negative_sentiments` | Which sentiment values count as "negative" |
| `dashboard.default_date_range_days` | Default date filter window (days) |
| `dashboard.max_comments_display` | Max comments shown per pagination page |
| `dashboard.chart_height` | Default Plotly chart height |
| `dashboard.top_n_contents` | Default top-N for content ranking |
| `snowflake.query` | Full query (legacy, all columns) |
| `snowflake.dashboard_query` | Lightweight dashboard query (no text columns) |
| `snowflake.demographics_query` | Demographics join query |
| `demographics.age_groups` | Age bucket definitions (label → [min, max]) |
| `demographics.experience_groups` | Experience bucket definitions |
| `demographics.top_timezones_count` | How many timezones to show in the geographic chart |