# Musora Sentiment Analysis Dashboard

A Streamlit dashboard for visualising sentiment analysis results from **social media comments** (Facebook, Instagram, YouTube, Twitter) and the **Musora internal app** across brands (Drumeo, Pianote, Guitareo, Singeo, Musora).

---

## Table of Contents

1. [Project Structure](#project-structure)
2. [How Data Flows](#how-data-flows)
3. [Data Loading Strategy](#data-loading-strategy)
4. [Pages](#pages)
5. [Global Filters & Session State](#global-filters--session-state)
6. [Snowflake Queries](#snowflake-queries)
7. [Adding or Changing Things](#adding-or-changing-things)
8. [Running the App](#running-the-app)
9. [Configuration Reference](#configuration-reference)

---

## Project Structure

```
visualization/
├── app.py                          # Entry point — routing, sidebar, session state
├── config/
│   └── viz_config.json             # Colors, query strings, dashboard settings
├── data/
│   └── data_loader.py              # All Snowflake queries and caching logic
├── utils/
│   ├── data_processor.py           # Pandas aggregations (intent dist, content summary, etc.)
│   └── metrics.py                  # KPI calculations (sentiment score, urgency, etc.)
├── components/
│   ├── dashboard.py                # Dashboard page renderer
│   ├── sentiment_analysis.py       # Sentiment Analysis page renderer
│   └── reply_required.py           # Reply Required page renderer
├── visualizations/
│   ├── sentiment_charts.py         # Plotly sentiment chart functions
│   ├── distribution_charts.py      # Plotly distribution / heatmap / scatter functions
│   ├── demographic_charts.py       # Plotly demographic chart functions
│   └── content_cards.py            # Streamlit card components (comment cards, content cards)
├── agents/
│   └── content_summary_agent.py    # AI analysis agent (OpenAI) for comment summarisation
├── img/
│   └── musora.png                  # Sidebar logo
└── SnowFlakeConnection.py          # Snowflake connection wrapper (Snowpark session)
```

---

## How Data Flows

```
Snowflake
    │
    ▼
data_loader.py          ← Three separate loading modes (see below)
    │
    ├── load_dashboard_data()   ──► st.session_state['dashboard_df']
    │                                   └─► app.py sidebar (filter options, counts)
    │                                   └─► dashboard.py (all charts)
    │
    ├── load_sa_data()          ──► st.session_state['sa_contents']
    │   (on-demand, button)          st.session_state['sa_comments']
    │                                   └─► sentiment_analysis.py
    │
    └── load_reply_required_data() ──► st.session_state['rr_df']
        (on-demand, button)             └─► reply_required.py
```

**Key principle:** Data is loaded as little as possible, as late as possible.

- The **Dashboard** uses a lightweight query (no text columns, no content join) cached for 24 hours.
- The **Sentiment Analysis** and **Reply Required** pages never load data automatically — they wait for the user to click **Fetch Data**.
- All data is stored in `st.session_state` so page navigation and widget interactions do not re-trigger Snowflake queries.
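
The session-state pattern in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: `get_or_load` and `fake_query` are hypothetical names, and a plain dict stands in for `st.session_state`, which supports the same membership test and item assignment.

```python
from typing import Any, Callable, MutableMapping


def get_or_load(state: MutableMapping[str, Any], key: str,
                loader: Callable[[], Any]) -> Any:
    """Return state[key], calling loader() only when the key is missing."""
    if key not in state:
        state[key] = loader()
    return state[key]


# A plain dict stands in for st.session_state in this sketch.
state: dict = {}
calls = []


def fake_query():
    # Hypothetical stand-in for a Snowflake query.
    calls.append(1)
    return ["row-1", "row-2"]


first = get_or_load(state, "dashboard_df", fake_query)
second = get_or_load(state, "dashboard_df", fake_query)  # served from state
```

The second call returns the stored object without running the loader again, which is the behaviour that keeps page navigation and widget reruns from hitting Snowflake.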

---

## Data Loading Strategy

All loading logic lives in **`data/data_loader.py`** (`SentimentDataLoader` class).

### `load_dashboard_data()`
- Uses `dashboard_query` from `viz_config.json`.
- Fetches only: `comment_sk, content_sk, platform, brand, sentiment_polarity, intent, requires_reply, detected_language, comment_timestamp, processed_at, author_id`.
- No text columns, no `DIM_CONTENT` join — significantly faster than the full query.
- Also merges demographics data if `demographics_query` is configured.
- Cached for **24 hours** (`@st.cache_data(ttl=86400)`).
- Called once by `app.py` at startup; result stored in `st.session_state['dashboard_df']`.

### `load_sa_data(platform, brand, top_n, min_comments, sort_by, sentiments, intents, date_range)`
- Runs **two** sequential Snowflake queries:
  1. **Content aggregation** — groups by `content_sk`, counts per sentiment, computes severity score, returns top N.
  2. **Sampled comments** — for the top N `content_sk`s only, fetches up to 50 comments per sentiment group per content (negative, positive, other), using Snowflake `QUALIFY ROW_NUMBER()`. `display_text` is computed in SQL (`CASE WHEN IS_ENGLISH = FALSE AND TRANSLATED_TEXT IS NOT NULL THEN TRANSLATED_TEXT ELSE ORIGINAL_TEXT END`).
- Returns a tuple `(contents_df, comments_df)`.
- Cached for **24 hours**.
- Called only when the user clicks **Fetch Data** on the Sentiment Analysis page.
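
To make the sampling step concrete, here is a hedged sketch of how a query of this shape could be assembled. It is not the real `_build_sa_comments_query()`: the function name, the `negative` / `positive` label lists, the integer type of `CONTENT_SK`, and the `ORDER BY` column are all assumptions, and only the social media table is shown.

```python
def _sql_list(values):
    """Render strings as a quoted, comma-separated SQL list."""
    return ", ".join("'" + v.replace("'", "''") + "'" for v in values)


def build_sampled_comments_query(content_sks, negative, positive, per_group=50):
    """Sample up to `per_group` comments per (content, sentiment group)."""
    if not (content_sks and negative and positive):
        raise ValueError("content_sks, negative and positive must be non-empty")
    # Assumption: CONTENT_SK values are integers.
    sk_list = ", ".join(str(int(sk)) for sk in content_sks)
    # Bucket each comment into negative / positive / other.
    group_expr = (
        f"CASE WHEN s.SENTIMENT_POLARITY IN ({_sql_list(negative)}) THEN 'negative' "
        f"WHEN s.SENTIMENT_POLARITY IN ({_sql_list(positive)}) THEN 'positive' "
        "ELSE 'other' END"
    )
    return f"""
SELECT
    s.COMMENT_SK,
    s.CONTENT_SK,
    s.SENTIMENT_POLARITY,
    CASE WHEN s.IS_ENGLISH = FALSE AND s.TRANSLATED_TEXT IS NOT NULL
         THEN s.TRANSLATED_TEXT ELSE s.ORIGINAL_TEXT END AS DISPLAY_TEXT
FROM SOCIAL_MEDIA_DB.ML_FEATURES.COMMENT_SENTIMENT_FEATURES s
WHERE s.CONTENT_SK IN ({sk_list})
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY s.CONTENT_SK, {group_expr}
    ORDER BY s.COMMENT_TIMESTAMP DESC  -- ordering is an assumption
) <= {int(per_group)}
""".strip()


# Label values here are placeholders, not the project's real sentiment labels.
query = build_sampled_comments_query([101, 102], ["negative"], ["positive"])
```

Partitioning by both the content key and the sentiment bucket is what caps each group separately, so one noisy sentiment cannot crowd out the others for a given piece of content.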

### `load_reply_required_data(platforms, brands, date_range)`
- Runs a single query filtering `REQUIRES_REPLY = TRUE`.
- Dynamically includes/excludes the social media table and musora table based on selected platforms.
- `display_text` computed in SQL.
- Cached for **24 hours**.
- Called only when the user clicks **Fetch Data** on the Reply Required page.
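
The conditional table selection could look something like the sketch below. It is illustrative only: `build_rr_query` is a hypothetical name, the selected columns are reduced to three, the join key is assumed to be `CONTENT_SK`, and treating an empty platform list as "all platforms" is also an assumption.

```python
SOCIAL_TABLE = "SOCIAL_MEDIA_DB.ML_FEATURES.COMMENT_SENTIMENT_FEATURES"
MUSORA_TABLE = "SOCIAL_MEDIA_DB.ML_FEATURES.MUSORA_COMMENT_SENTIMENT_FEATURES"
MUSORA_PLATFORM = "musora_app"


def _quote(value):
    """Quote a string literal for SQL, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_rr_query(platforms):
    """Build a reply-required query that only touches the tables it needs."""
    social = [p for p in platforms if p != MUSORA_PLATFORM]
    include_social = not platforms or bool(social)
    include_musora = not platforms or MUSORA_PLATFORM in platforms

    blocks = []
    if include_social:
        # Joined block: every column reference carries the s. or c. alias.
        where = ["s.REQUIRES_REPLY = TRUE"]
        if social:
            where.append("s.PLATFORM IN (" + ", ".join(map(_quote, social)) + ")")
        blocks.append(
            "SELECT s.COMMENT_SK, s.PLATFORM, c.PERMALINK_URL\n"
            f"FROM {SOCIAL_TABLE} s\n"
            "LEFT JOIN SOCIAL_MEDIA_DB.CORE.DIM_CONTENT c\n"
            "    ON s.CONTENT_SK = c.CONTENT_SK\n"  # join key is an assumption
            "WHERE " + " AND ".join(where)
        )
    if include_musora:
        # No join here, so unqualified column names are unambiguous.
        blocks.append(
            f"SELECT COMMENT_SK, {_quote(MUSORA_PLATFORM)} AS PLATFORM, PERMALINK_URL\n"
            f"FROM {MUSORA_TABLE}\n"
            "WHERE REQUIRES_REPLY = TRUE"
        )
    return "\nUNION ALL\n".join(blocks)


only_social = build_rr_query(["facebook"])
only_musora = build_rr_query(["musora_app"])
both = build_rr_query(["facebook", "musora_app"])
```

Skipping a table entirely when none of its platforms is selected avoids scanning data that the page would filter out anyway.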

### Important: SQL Column Qualification
Both the social media table (`COMMENT_SENTIMENT_FEATURES`) and the content dimension table (`DIM_CONTENT`) share column names. Any `WHERE` clause inside a query that joins these two tables **must** use the table alias prefix (e.g. `s.PLATFORM`, `s.COMMENT_TIMESTAMP`, `s.CHANNEL_NAME`) to avoid Snowflake `ambiguous column name` errors. The musora table (`MUSORA_COMMENT_SENTIMENT_FEATURES`) has no joins so unqualified column names are fine there.
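
One way to keep qualification consistent is to build conditions through a small helper that adds the alias in one place. The helper below is a hypothetical sketch, not part of `data_loader.py`, and the date literal is a placeholder.

```python
def qualified_conditions(conditions, alias=None):
    """Join (column, predicate) pairs with AND, prefixing columns with `alias.`."""
    prefix = f"{alias}." if alias else ""
    return " AND ".join(f"{prefix}{column} {predicate}"
                        for column, predicate in conditions)


conditions = [
    ("REQUIRES_REPLY", "= TRUE"),
    ("COMMENT_TIMESTAMP", ">= '2025-01-01'"),
]

# Joined social media block: alias required.
social_where = qualified_conditions(conditions, alias="s")
# Musora block: no join, so no alias needed.
musora_where = qualified_conditions(conditions)
```

Here `social_where` becomes `s.REQUIRES_REPLY = TRUE AND s.COMMENT_TIMESTAMP >= '2025-01-01'`, while `musora_where` keeps the bare column names.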

---

## Pages

### Dashboard (`components/dashboard.py`)

**Receives:** `filtered_df` — the lightweight dashboard dataframe (after optional global filter applied by `app.py`).

**Does not need:** text, translations, content URLs. All charts work purely on aggregated columns (sentiment_polarity, brand, platform, intent, requires_reply, comment_timestamp).

**Key sections:**
- Summary stats + health indicator
- Sentiment distribution (pie + gauge)
- Sentiment by brand and platform (stacked + percentage bar charts)
- Intent analysis
- Brand-Platform heatmap
- Reply requirements + urgency breakdown
- Demographics (age, timezone, experience level) — only rendered if `author_id` is present and demographics were merged

**To add a new chart:** create the chart function in `visualizations/` and call it from `render_dashboard()`. The function receives `filtered_df`.

---

### Sentiment Analysis (`components/sentiment_analysis.py`)

**Receives:** `data_loader` instance only (no dataframe).

**Flow:**
1. Reads `st.session_state['dashboard_df']` for filter option lists (platforms, brands, sentiments, intents).
2. Pre-populates platform/brand dropdowns from `st.session_state['global_filters']`.
3. Shows filter controls (platform, brand, sentiment, intent, top_n, min_comments, sort_by).
4. On **Fetch Data** click: calls `data_loader.load_sa_data(...)` and stores results in `st.session_state['sa_contents']` and `['sa_comments']`.
5. Renders content cards, per-content sentiment + intent charts, AI analysis buttons, and sampled comment expanders.

**Pagination:** `st.session_state['sentiment_page']` (5 contents per page). Reset on new fetch.

**Comments:** Sampled (up to 50 negative + 50 positive + 50 other per content, matching the sentiment groups used by `load_sa_data`). These are already in memory after the fetch, so no extra query is needed when the user expands a comment section.

**AI Analysis:** Uses `ContentSummaryAgent` (see `agents/`). Results cached in `st.session_state['content_summaries']`.

---

### Reply Required (`components/reply_required.py`)

**Receives:** `data_loader` instance only.

**Flow:**
1. Reads `st.session_state['dashboard_df']` for filter option lists.
2. Pre-populates platform, brand, and date from `st.session_state['global_filters']`.
3. On **Fetch Data** click: calls `data_loader.load_reply_required_data(...)` and stores result in `st.session_state['rr_df']`.
4. Shows urgency breakdown, in-page view filters (priority, platform, brand, intent — applied in Python, no new query), paginated comment cards, and a "Reply by Content" summary.

**Pagination:** `st.session_state['reply_page']` (10 comments per page). Reset on new fetch.
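
The pagination used on both pages can be sketched as a pure slicing helper. This is an assumption-laden illustration: `paginate` is a hypothetical name, pages are assumed to be 1-based, and a list stands in for the dataframe (for a DataFrame, `df.iloc[start:start + page_size]` would play the same role).

```python
import math


def paginate(items, page, page_size):
    """Return (items on the 1-based `page`, total page count), clamping `page`."""
    total_pages = max(1, math.ceil(len(items) / page_size))
    page = min(max(page, 1), total_pages)
    start = (page - 1) * page_size
    return items[start:start + page_size], total_pages


comments = list(range(23))  # stand-in for 23 fetched comments
page_items, total = paginate(comments, page=3, page_size=10)
```

Clamping the page number means a stale value left in session state after a smaller fetch still lands on a valid page, although resetting it on each fetch, as the pages do, is simpler.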

---

## Global Filters & Session State

Global filters live in the sidebar (`app.py`) and are stored in `st.session_state['global_filters']` as a dict:

```python
from datetime import date

# Shape of st.session_state['global_filters']
global_filters = {
    'platforms':  ['facebook', 'instagram'],   # list or []
    'brands':     ['drumeo'],
    'sentiments': [],
    'date_range': (date(2025, 1, 1), date(2025, 12, 31)),  # or None
}
```

- **Dashboard:** `app.py` applies global filters to `dashboard_df` using `data_loader.apply_filters()` and passes the result to `render_dashboard()`.
- **Sentiment Analysis / Reply Required:** global filters are used to pre-populate their own filter widgets. The actual Snowflake query uses those values when the user clicks Fetch. The pages do **not** receive a pre-filtered dataframe.
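
As a rough sketch of how a filter dict of this shape might be interpreted, the predicate below checks a single row. It assumes that an empty list or `None` means "no filter" and that the date range is inclusive. The real `apply_filters()` operates on the dataframe and may differ, and `row_matches` is a hypothetical name.

```python
from datetime import date, datetime

# Maps each filter key to the row column it constrains.
FILTER_COLUMNS = (
    ("platforms", "platform"),
    ("brands", "brand"),
    ("sentiments", "sentiment_polarity"),
)


def row_matches(row, filters):
    """Return True if `row` passes every active filter in `filters`."""
    for key, column in FILTER_COLUMNS:
        selected = filters.get(key) or []
        if selected and row[column] not in selected:
            return False
    date_range = filters.get("date_range")
    if date_range:
        start, end = date_range
        if not (start <= row["comment_timestamp"].date() <= end):
            return False
    return True


filters = {
    "platforms": ["facebook", "instagram"],
    "brands": ["drumeo"],
    "sentiments": [],  # empty list: sentiment is not filtered
    "date_range": (date(2025, 1, 1), date(2025, 12, 31)),
}
row = {
    "platform": "facebook",
    "brand": "drumeo",
    "sentiment_polarity": "negative",  # placeholder label
    "comment_timestamp": datetime(2025, 6, 1, 12, 0),
}
matches = row_matches(row, filters)
```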

### Full session state key reference

| Key | Set by | Used by |
|-----|--------|---------|
| `dashboard_df` | `app.py` on startup | sidebar (filter options), dashboard, SA + RR (filter option lists) |
| `global_filters` | sidebar "Apply Filters" button | app.py (dashboard filter), SA + RR (pre-populate widgets) |
| `filters_applied` | sidebar buttons | app.py (whether to apply filters) |
| `sa_contents` | SA fetch button | SA page rendering |
| `sa_comments` | SA fetch button | SA page rendering |
| `sa_fetch_key` | SA fetch button | SA page (detect stale data) |
| `rr_df` | RR fetch button | RR page rendering |
| `rr_fetch_key` | RR fetch button | RR page (detect stale data) |
| `sentiment_page` | SA page / fetch | SA pagination |
| `reply_page` | RR page / fetch | RR pagination |
| `content_summaries` | AI analysis buttons | SA AI analysis display |

---

## Snowflake Queries

All query strings are either stored in `config/viz_config.json` (static queries) or built dynamically in `data/data_loader.py` (page-specific queries).

### Static queries (in `viz_config.json`)

| Key | Purpose |
|-----|---------|
| `query` | Full query with all columns (legacy, kept for compatibility) |
| `dashboard_query` | Lightweight query — no text, no DIM_CONTENT join |
| `demographics_query` | Joins `usora_users` with `preprocessed.users` to get age/timezone/experience |

### Dynamic queries (built in `data_loader.py`)

| Method | Description |
|--------|-------------|
| `_build_sa_content_query()` | Content aggregation for SA page; filters by platform + brand + date |
| `_build_sa_comments_query()` | Sampled comments for SA page; uses `QUALIFY ROW_NUMBER() <= 50` |
| `_build_rr_query()` | Reply-required comments; filters by platform/brand/date; conditionally includes social media and/or musora table |

### Data source tables

| Table | Platform | Notes |
|-------|----------|-------|
| `SOCIAL_MEDIA_DB.ML_FEATURES.COMMENT_SENTIMENT_FEATURES` | facebook, instagram, youtube, twitter | Needs `LEFT JOIN SOCIAL_MEDIA_DB.CORE.DIM_CONTENT` for `PERMALINK_URL` |
| `SOCIAL_MEDIA_DB.ML_FEATURES.MUSORA_COMMENT_SENTIMENT_FEATURES` | musora_app | Has `PERMALINK_URL` and `THUMBNAIL_URL` natively; platform stored as `'musora'`, mapped to `'musora_app'` in queries |

---

## Adding or Changing Things

### Add a new chart to the Dashboard
1. Write the chart function in the appropriate `visualizations/` file.
2. Call it from `render_dashboard()` in `components/dashboard.py`, passing `filtered_df`.
3. The chart function receives a lightweight df — it has no text columns but has all the columns listed in `dashboard_query`.

### Add a new filter to the Dashboard sidebar
1. Add the widget in `app.py` under the "Global Filters" section.
2. Store the selected value in the `global_filters` dict under `st.session_state`.
3. Pass it to `data_loader.apply_filters()`.

### Change what the Sentiment Analysis page queries
- Edit `_build_sa_content_query()` and/or `_build_sa_comments_query()` in `data_loader.py`.
- If you add new columns to the content aggregation result, also update `_process_sa_content_stats()` so they are available in `contents_df`.
- If you add new columns to the comments result, update `_process_sa_comments()`.

### Change what the Reply Required page queries
- Edit `_build_rr_query()` in `data_loader.py`.
- Remember: all column references inside the social media block (which has a `JOIN`) must be prefixed with `s.` to avoid Snowflake ambiguity errors.

### Change the cache duration
- `@st.cache_data(ttl=86400)` is set on `load_dashboard_data`, `_fetch_sa_data`, `_fetch_rr_data`, and `load_demographics_data`.
- Change `86400` (seconds) to the desired TTL, or set `ttl=None` for no expiry.
- Users can always force a refresh with the "Reload Data" button in the sidebar (which calls `st.cache_data.clear()` and deletes `st.session_state['dashboard_df']`).

### Add a new page
1. Create `components/new_page.py` with a `render_new_page(data_loader)` function.
2. Import `render_new_page` in `app.py` and add a radio option for it.
3. If the page needs its own Snowflake data, add a `load_new_page_data()` method to `SentimentDataLoader` following the same pattern as `load_sa_data`.

### Add a new column to the Dashboard query
- Edit `dashboard_query` in `config/viz_config.json`.
- Both UNION branches must select the same columns in the same order.
- `_process_dashboard_dataframe()` in `data_loader.py` handles basic type casting — add processing there if needed.

---

## Running the App

```bash
# From the project root
streamlit run visualization/app.py
```

**Required environment variables** (in `.env` at project root):

```
SNOWFLAKE_USER
SNOWFLAKE_PASSWORD
SNOWFLAKE_ACCOUNT
SNOWFLAKE_ROLE
SNOWFLAKE_DATABASE
SNOWFLAKE_WAREHOUSE
SNOWFLAKE_SCHEMA
```
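
A quick pre-flight check for these variables can save a confusing connection error. The snippet below is a minimal sketch with a hypothetical helper name. It only inspects the environment and does not load the `.env` file itself, since how the app reads that file is not described here.

```python
import os

REQUIRED_VARS = (
    "SNOWFLAKE_USER",
    "SNOWFLAKE_PASSWORD",
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_ROLE",
    "SNOWFLAKE_DATABASE",
    "SNOWFLAKE_WAREHOUSE",
    "SNOWFLAKE_SCHEMA",
)


def missing_snowflake_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_snowflake_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```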

---

## Configuration Reference

`config/viz_config.json` controls:

| Section | What it configures |
|---------|-------------------|
| `color_schemes.sentiment_polarity` | Hex colors for each sentiment level |
| `color_schemes.intent` | Hex colors for each intent label |
| `color_schemes.platform` | Hex colors for each platform |
| `color_schemes.brand` | Hex colors for each brand |
| `sentiment_order` | Display order for sentiment categories in charts |
| `intent_order` | Display order for intent categories |
| `negative_sentiments` | Which sentiment values count as "negative" |
| `dashboard.default_date_range_days` | Default date filter window (days) |
| `dashboard.max_comments_display` | Max comments shown per pagination page |
| `dashboard.chart_height` | Default Plotly chart height |
| `dashboard.top_n_contents` | Default top-N for content ranking |
| `snowflake.query` | Full query (legacy, all columns) |
| `snowflake.dashboard_query` | Lightweight dashboard query (no text columns) |
| `snowflake.demographics_query` | Demographics join query |
| `demographics.age_groups` | Age bucket definitions (label → [min, max]) |
| `demographics.experience_groups` | Experience bucket definitions |
| `demographics.top_timezones_count` | How many timezones to show in the geographic chart |
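
As an illustration of how these settings might be consumed, the sketch below parses a small config fragment and maps an age to its bucket. The values in `SAMPLE_CONFIG` are invented for the example, the bounds are assumed to be inclusive, and `age_group_for` is a hypothetical helper rather than a function from the codebase.

```python
import json

# Invented sample values; the real viz_config.json will differ.
SAMPLE_CONFIG = json.loads("""
{
  "dashboard": {"chart_height": 400, "top_n_contents": 10},
  "demographics": {
    "age_groups": {"18-24": [18, 24], "25-34": [25, 34], "35+": [35, 120]}
  }
}
""")


def age_group_for(age, age_groups):
    """Return the label whose [min, max] range contains `age`, else None."""
    for label, (low, high) in age_groups.items():
        if low <= age <= high:
            return label
    return None


age_groups = SAMPLE_CONFIG["demographics"]["age_groups"]
bucket = age_group_for(30, age_groups)
chart_height = SAMPLE_CONFIG["dashboard"].get("chart_height", 400)
```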