File size: 15,148 Bytes
9858829
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
# Musora Sentiment Analysis Dashboard

A Streamlit dashboard for visualising sentiment analysis results from **social media comments** (Facebook, Instagram, YouTube, Twitter) and the **Musora internal app** across brands (Drumeo, Pianote, Guitareo, Singeo, Musora).

---

## Table of Contents

1. [Project Structure](#project-structure)
2. [How Data Flows](#how-data-flows)
3. [Data Loading Strategy](#data-loading-strategy)
4. [Pages](#pages)
5. [Global Filters & Session State](#global-filters--session-state)
6. [Snowflake Queries](#snowflake-queries)
7. [Adding or Changing Things](#adding-or-changing-things)
8. [Running the App](#running-the-app)
9. [Configuration Reference](#configuration-reference)

---

## Project Structure

```
visualization/
β”œβ”€β”€ app.py                          # Entry point β€” routing, sidebar, session state
β”œβ”€β”€ config/
β”‚   └── viz_config.json             # Colors, query strings, dashboard settings
β”œβ”€β”€ data/
β”‚   └── data_loader.py              # All Snowflake queries and caching logic
β”œβ”€β”€ utils/
β”‚   β”œβ”€β”€ data_processor.py           # Pandas aggregations (intent dist, content summary, etc.)
β”‚   └── metrics.py                  # KPI calculations (sentiment score, urgency, etc.)
β”œβ”€β”€ components/
β”‚   β”œβ”€β”€ dashboard.py                # Dashboard page renderer
β”‚   β”œβ”€β”€ sentiment_analysis.py       # Sentiment Analysis page renderer
β”‚   └── reply_required.py           # Reply Required page renderer
β”œβ”€β”€ visualizations/
β”‚   β”œβ”€β”€ sentiment_charts.py         # Plotly sentiment chart functions
β”‚   β”œβ”€β”€ distribution_charts.py      # Plotly distribution / heatmap / scatter functions
β”‚   β”œβ”€β”€ demographic_charts.py       # Plotly demographic chart functions
β”‚   └── content_cards.py            # Streamlit card components (comment cards, content cards)
β”œβ”€β”€ agents/
β”‚   └── content_summary_agent.py    # AI analysis agent (OpenAI) for comment summarisation
β”œβ”€β”€ img/
β”‚   └── musora.png                  # Sidebar logo
└── SnowFlakeConnection.py          # Snowflake connection wrapper (Snowpark session)
```

---

## How Data Flows

```
Snowflake
    β”‚
    β–Ό
data_loader.py          ← Three separate loading modes (see below)
    β”‚
    β”œβ”€β”€ load_dashboard_data()   ──► st.session_state['dashboard_df']
    β”‚                                   └─► app.py sidebar (filter options, counts)
    β”‚                                   └─► dashboard.py (all charts)
    β”‚
    β”œβ”€β”€ load_sa_data()          ──► st.session_state['sa_contents']
    β”‚   (on-demand, button)          st.session_state['sa_comments']
    β”‚                                   └─► sentiment_analysis.py
    β”‚
    └── load_reply_required_data() β–Ί st.session_state['rr_df']
        (on-demand, button)             └─► reply_required.py
```

**Key principle:** Data is loaded as little as possible, as late as possible.

- The **Dashboard** uses a lightweight query (no text columns, no content join) cached for 24 hours.
- The **Sentiment Analysis** and **Reply Required** pages never load data automatically β€” they wait for the user to click **Fetch Data**.
- All data is stored in `st.session_state` so page navigation and widget interactions do not re-trigger Snowflake queries.

---

## Data Loading Strategy

All loading logic lives in **`data/data_loader.py`** (`SentimentDataLoader` class).

### `load_dashboard_data()`
- Uses `dashboard_query` from `viz_config.json`.
- Fetches only: `comment_sk, content_sk, platform, brand, sentiment_polarity, intent, requires_reply, detected_language, comment_timestamp, processed_at, author_id`.
- No text columns, no `DIM_CONTENT` join β€” significantly faster than the full query.
- Also merges demographics data if `demographics_query` is configured.
- Cached for **24 hours** (`@st.cache_data(ttl=86400)`).
- Called once by `app.py` at startup; result stored in `st.session_state['dashboard_df']`.

### `load_sa_data(platform, brand, top_n, min_comments, sort_by, sentiments, intents, date_range)`
- Runs **two** sequential Snowflake queries:
  1. **Content aggregation** β€” groups by `content_sk`, counts per sentiment, computes severity score, returns top N.
  2. **Sampled comments** β€” for the top N `content_sk`s only, fetches up to 50 comments per sentiment group per content (negative, positive, other), using Snowflake `QUALIFY ROW_NUMBER()`. `display_text` is computed in SQL (`CASE WHEN IS_ENGLISH = FALSE AND TRANSLATED_TEXT IS NOT NULL THEN TRANSLATED_TEXT ELSE ORIGINAL_TEXT END`).
- Returns a tuple `(contents_df, comments_df)`.
- Cached for **24 hours**.
- Called only when the user clicks **Fetch Data** on the Sentiment Analysis page.

### `load_reply_required_data(platforms, brands, date_range)`
- Runs a single query filtering `REQUIRES_REPLY = TRUE`.
- Dynamically includes/excludes the social media table and musora table based on selected platforms.
- `display_text` computed in SQL.
- Cached for **24 hours**.
- Called only when the user clicks **Fetch Data** on the Reply Required page.

### Important: SQL Column Qualification
Both the social media table (`COMMENT_SENTIMENT_FEATURES`) and the content dimension table (`DIM_CONTENT`) share column names. Any `WHERE` clause inside a query that joins these two tables **must** use the table alias prefix (e.g. `s.PLATFORM`, `s.COMMENT_TIMESTAMP`, `s.CHANNEL_NAME`) to avoid Snowflake `ambiguous column name` errors. The musora table (`MUSORA_COMMENT_SENTIMENT_FEATURES`) has no joins so unqualified column names are fine there.

---

## Pages

### Dashboard (`components/dashboard.py`)

**Receives:** `filtered_df` β€” the lightweight dashboard dataframe (after optional global filter applied by `app.py`).

**Does not need:** text, translations, content URLs. All charts work purely on aggregated columns (sentiment_polarity, brand, platform, intent, requires_reply, comment_timestamp).

**Key sections:**
- Summary stats + health indicator
- Sentiment distribution (pie + gauge)
- Sentiment by brand and platform (stacked + percentage bar charts)
- Intent analysis
- Brand-Platform heatmap
- Reply requirements + urgency breakdown
- Demographics (age, timezone, experience level) β€” only rendered if `author_id` is present and demographics were merged

**To add a new chart:** create the chart function in `visualizations/` and call it from `render_dashboard()`. The function receives `filtered_df`.

---

### Sentiment Analysis (`components/sentiment_analysis.py`)

**Receives:** `data_loader` instance only (no dataframe).

**Flow:**
1. Reads `st.session_state['dashboard_df']` for filter option lists (platforms, brands, sentiments, intents).
2. Pre-populates platform/brand dropdowns from `st.session_state['global_filters']`.
3. Shows filter controls (platform, brand, sentiment, intent, top_n, min_comments, sort_by).
4. On **Fetch Data** click: calls `data_loader.load_sa_data(...)` and stores results in `st.session_state['sa_contents']` and `['sa_comments']`.
5. Renders content cards, per-content sentiment + intent charts, AI analysis buttons, and sampled comment expanders.

**Pagination:** `st.session_state['sentiment_page']` (5 contents per page). Reset on new fetch.

**Comments:** Sampled (up to 50 negative + 50 positive + 50 neutral per content). These are already in memory after the fetch β€” no extra query is needed when the user expands a comment section.

**AI Analysis:** Uses `ContentSummaryAgent` (see `agents/`). Results cached in `st.session_state['content_summaries']`.

---

### Reply Required (`components/reply_required.py`)

**Receives:** `data_loader` instance only.

**Flow:**
1. Reads `st.session_state['dashboard_df']` for filter option lists.
2. Pre-populates platform, brand, and date from `st.session_state['global_filters']`.
3. On **Fetch Data** click: calls `data_loader.load_reply_required_data(...)` and stores result in `st.session_state['rr_df']`.
4. Shows urgency breakdown, in-page view filters (priority, platform, brand, intent β€” applied in Python, no new query), paginated comment cards, and a "Reply by Content" summary.

**Pagination:** `st.session_state['reply_page']` (10 comments per page). Reset on new fetch.

---

## Global Filters & Session State

Global filters live in the sidebar (`app.py`) and are stored in `st.session_state['global_filters']` as a dict:

```python
{
    'platforms':  ['facebook', 'instagram'],   # list or []
    'brands':     ['drumeo'],
    'sentiments': [],
    'date_range': (date(2025, 1, 1), date(2025, 12, 31)),  # or None
}
```

- **Dashboard:** `app.py` applies global filters to `dashboard_df` using `data_loader.apply_filters()` and passes the result to `render_dashboard()`.
- **Sentiment Analysis / Reply Required:** global filters are used to pre-populate their own filter widgets. The actual Snowflake query uses those values when the user clicks Fetch. The pages do **not** receive a pre-filtered dataframe.

### Full session state key reference

| Key | Set by | Used by |
|-----|--------|---------|
| `dashboard_df` | `app.py` on startup | sidebar (filter options), dashboard, SA + RR (filter option lists) |
| `global_filters` | sidebar "Apply Filters" button | app.py (dashboard filter), SA + RR (pre-populate widgets) |
| `filters_applied` | sidebar buttons | app.py (whether to apply filters) |
| `sa_contents` | SA fetch button | SA page rendering |
| `sa_comments` | SA fetch button | SA page rendering |
| `sa_fetch_key` | SA fetch button | SA page (detect stale data) |
| `rr_df` | RR fetch button | RR page rendering |
| `rr_fetch_key` | RR fetch button | RR page (detect stale data) |
| `sentiment_page` | SA page / fetch | SA pagination |
| `reply_page` | RR page / fetch | RR pagination |
| `content_summaries` | AI analysis buttons | SA AI analysis display |

---

## Snowflake Queries

All query strings are either stored in `config/viz_config.json` (static queries) or built dynamically in `data/data_loader.py` (page-specific queries).

### Static queries (in `viz_config.json`)

| Key | Purpose |
|-----|---------|
| `query` | Full query with all columns (legacy, kept for compatibility) |
| `dashboard_query` | Lightweight query β€” no text, no DIM_CONTENT join |
| `demographics_query` | Joins `usora_users` with `preprocessed.users` to get age/timezone/experience |

### Dynamic queries (built in `data_loader.py`)

| Method | Description |
|--------|-------------|
| `_build_sa_content_query()` | Content aggregation for SA page; filters by platform + brand + date |
| `_build_sa_comments_query()` | Sampled comments for SA page; uses `QUALIFY ROW_NUMBER() <= 50` |
| `_build_rr_query()` | Reply-required comments; filters by platform/brand/date; conditionally includes social media and/or musora table |

### Data source tables

| Table | Platform | Notes |
|-------|----------|-------|
| `SOCIAL_MEDIA_DB.ML_FEATURES.COMMENT_SENTIMENT_FEATURES` | facebook, instagram, youtube, twitter | Needs `LEFT JOIN SOCIAL_MEDIA_DB.CORE.DIM_CONTENT` for `PERMALINK_URL` |
| `SOCIAL_MEDIA_DB.ML_FEATURES.MUSORA_COMMENT_SENTIMENT_FEATURES` | musora_app | Has `PERMALINK_URL` and `THUMBNAIL_URL` natively; platform stored as `'musora'`, mapped to `'musora_app'` in queries |

---

## Adding or Changing Things

### Add a new chart to the Dashboard
1. Write the chart function in the appropriate `visualizations/` file.
2. Call it from `render_dashboard()` in `components/dashboard.py`, passing `filtered_df`.
3. The chart function receives a lightweight df β€” it has no text columns but has all the columns listed in `dashboard_query`.

### Add a new filter to the Dashboard sidebar
1. Add the widget in `app.py` under the "Global Filters" section.
2. Store the selected value in the `global_filters` dict under `st.session_state`.
3. Pass it to `data_loader.apply_filters()`.

### Change what the Sentiment Analysis page queries
- Edit `_build_sa_content_query()` and/or `_build_sa_comments_query()` in `data_loader.py`.
- If you add new columns to the content aggregation result, also update `_process_sa_content_stats()` so they are available in `contents_df`.
- If you add new columns to the comments result, update `_process_sa_comments()`.

### Change what the Reply Required page queries
- Edit `_build_rr_query()` in `data_loader.py`.
- Remember: all column references inside the social media block (which has a `JOIN`) must be prefixed with `s.` to avoid Snowflake ambiguity errors.

### Change the cache duration
- `@st.cache_data(ttl=86400)` is set on `load_dashboard_data`, `_fetch_sa_data`, `_fetch_rr_data`, and `load_demographics_data`.
- Change `86400` (seconds) to the desired TTL, or set `ttl=None` for no expiry.
- Users can always force a refresh with the "Reload Data" button in the sidebar (which calls `st.cache_data.clear()` and deletes `st.session_state['dashboard_df']`).

### Add a new page
1. Create `components/new_page.py` with a `render_new_page(data_loader)` function.
2. Import and add a radio option in `app.py`.
3. If the page needs its own Snowflake data, add a `load_new_page_data()` method to `SentimentDataLoader` following the same pattern as `load_sa_data`.

### Add a new column to the Dashboard query
- Edit `dashboard_query` in `config/viz_config.json`.
- Both UNION branches must select the same columns in the same order.
- `_process_dashboard_dataframe()` in `data_loader.py` handles basic type casting β€” add processing there if needed.

---

## Running the App

```bash
# From the project root
streamlit run visualization/app.py
```

**Required environment variables** (in `.env` at project root):

```
SNOWFLAKE_USER
SNOWFLAKE_PASSWORD
SNOWFLAKE_ACCOUNT
SNOWFLAKE_ROLE
SNOWFLAKE_DATABASE
SNOWFLAKE_WAREHOUSE
SNOWFLAKE_SCHEMA
```

---

## Configuration Reference

`config/viz_config.json` controls:

| Section | What it configures |
|---------|-------------------|
| `color_schemes.sentiment_polarity` | Hex colors for each sentiment level |
| `color_schemes.intent` | Hex colors for each intent label |
| `color_schemes.platform` | Hex colors for each platform |
| `color_schemes.brand` | Hex colors for each brand |
| `sentiment_order` | Display order for sentiment categories in charts |
| `intent_order` | Display order for intent categories |
| `negative_sentiments` | Which sentiment values count as "negative" |
| `dashboard.default_date_range_days` | Default date filter window (days) |
| `dashboard.max_comments_display` | Max comments shown per pagination page |
| `dashboard.chart_height` | Default Plotly chart height |
| `dashboard.top_n_contents` | Default top-N for content ranking |
| `snowflake.query` | Full query (legacy, all columns) |
| `snowflake.dashboard_query` | Lightweight dashboard query (no text columns) |
| `snowflake.demographics_query` | Demographics join query |
| `demographics.age_groups` | Age bucket definitions (label β†’ [min, max]) |
| `demographics.experience_groups` | Experience bucket definitions |
| `demographics.top_timezones_count` | How many timezones to show in the geographic chart |