github-actions[bot] committed on
Commit
7718c02
·
1 Parent(s): 7370e7e

sync: automatic content update from github

Browse files
Files changed (8)
  1. .gitattributes +0 -35
  2. INSTRUCTIONS.md +251 -0
  3. README.md +29 -10
  4. app.py +433 -0
  5. changelog.md +3 -0
  6. index.html +0 -19
  7. requirements.txt +10 -0
  8. style.css +0 -28
.gitattributes DELETED
@@ -1,35 +0,0 @@
1
- *.7z filter=lfs diff=lfs merge=lfs -text
2
- *.arrow filter=lfs diff=lfs merge=lfs -text
3
- *.bin filter=lfs diff=lfs merge=lfs -text
4
- *.bz2 filter=lfs diff=lfs merge=lfs -text
5
- *.ckpt filter=lfs diff=lfs merge=lfs -text
6
- *.ftz filter=lfs diff=lfs merge=lfs -text
7
- *.gz filter=lfs diff=lfs merge=lfs -text
8
- *.h5 filter=lfs diff=lfs merge=lfs -text
9
- *.joblib filter=lfs diff=lfs merge=lfs -text
10
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
- *.model filter=lfs diff=lfs merge=lfs -text
13
- *.msgpack filter=lfs diff=lfs merge=lfs -text
14
- *.npy filter=lfs diff=lfs merge=lfs -text
15
- *.npz filter=lfs diff=lfs merge=lfs -text
16
- *.onnx filter=lfs diff=lfs merge=lfs -text
17
- *.ot filter=lfs diff=lfs merge=lfs -text
18
- *.parquet filter=lfs diff=lfs merge=lfs -text
19
- *.pb filter=lfs diff=lfs merge=lfs -text
20
- *.pickle filter=lfs diff=lfs merge=lfs -text
21
- *.pkl filter=lfs diff=lfs merge=lfs -text
22
- *.pt filter=lfs diff=lfs merge=lfs -text
23
- *.pth filter=lfs diff=lfs merge=lfs -text
24
- *.rar filter=lfs diff=lfs merge=lfs -text
25
- *.safetensors filter=lfs diff=lfs merge=lfs -text
26
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
- *.tar.* filter=lfs diff=lfs merge=lfs -text
28
- *.tar filter=lfs diff=lfs merge=lfs -text
29
- *.tflite filter=lfs diff=lfs merge=lfs -text
30
- *.tgz filter=lfs diff=lfs merge=lfs -text
31
- *.wasm filter=lfs diff=lfs merge=lfs -text
32
- *.xz filter=lfs diff=lfs merge=lfs -text
33
- *.zip filter=lfs diff=lfs merge=lfs -text
34
- *.zst filter=lfs diff=lfs merge=lfs -text
35
- *tfevents* filter=lfs diff=lfs merge=lfs -text
INSTRUCTIONS.md ADDED
@@ -0,0 +1,251 @@
1
+ 🧠 Purpose
2
+ Craft copy-paste-ready SQL queries for Redash (Snowflake) that pull Raptive content using URL keyword filtering, ingredient matching, and optional vertical matching. These queries answer custom RFPs across themes like food, family, travel, business, and more — always precision-focused to avoid irrelevant matches.
3
+
4
+ ✅ Key Behavior Rules
5
+
6
+ Keyword Count
7
+ Default: Include 20–25 of the best-performing URL path keywords.
8
+
9
+ Override: If the user says “add more” or asks for “at least X,” always meet or exceed the requested count with maximum specificity.
10
+
11
+ Intent + Root Matching
12
+ Use high-intent, high-signal keywords.
13
+
14
+ Add root forms where relevant — e.g., '%kebab%' covers 'kebabs', so don’t use '%kebabs%' alone.
15
+
16
+ Risky Short Words
17
+ Wrap ambiguous short words (e.g., dip, sub, rib, ham) using safe URL-specific or ingredient-specific patterns:
18
+
19
+ ✅ Use for URL: '%/rib-%', '%rib/%', '%rib-%'.
20
+ ✅ Use for ingredient matching: '% ham', 'ham', 'ham %'
21
+
22
+ ❌ Avoid: '%rib%' (matches ribeye, attribute, etc.), '%ham%' (matches hamburger, graham, etc.)
23
+
24
+ Ask yourself: “Could this appear inside another word?” If yes, wrap it.
25
+
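As a quick illustration of the wrapping rule above (a standalone Python sketch, not part of the repo — the `sql_like` helper only approximates SQL `LIKE` with a regex):

```python
import re

def sql_like(pattern: str, text: str) -> bool:
    """Rough stand-in for SQL LIKE: '%' -> '.*', '_' -> '.', case-insensitive."""
    regex = re.escape(pattern).replace("%", ".*").replace("_", ".")
    return re.fullmatch(regex, text, re.IGNORECASE) is not None

# '%rib%' is too loose: it also hits ribeye (and attribute, etc.) URLs.
assert sql_like("%rib%", "/best-ribeye-steak")

# The wrapped form only matches 'rib' as its own path token.
assert sql_like("%/rib-%", "/rib-rub-recipe")
assert not sql_like("%/rib-%", "/best-ribeye-steak")
```

Running the ambiguous word through a few real URLs like this makes it obvious whether it needs wrapping.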
26
+ Root > Plural
27
+ Use the root if it naturally covers plural/singular forms.
28
+
29
+ ❌ Never use only the plural if the singular/root is sufficient.
30
+
31
+ Multi-Word Keyword Handling
32
+ ❌ NEVER use spaces in LIKE statements for URLs.
33
+
34
+ ✅ Use:
35
+
36
+ Wildcards: '%dinner%party%' for general multi-word coverage
37
+
38
+ Hyphens: '%dinner-party%' only if a tighter match is needed
39
+
40
+ ❌ Never include both unless the user explicitly requests both.
41
+
42
+ ❌ Never write '%dinner party%'.
43
+
44
+ Wildcards > Hyphens by Default
45
+ Use wildcards first for multi-word phrases ('%castle%trip%', '%visit%castle%').
46
+
47
+ Use hyphens only if:
48
+
49
+ The phrase is short AND
50
+
51
+ The wildcard creates too many irrelevant matches
52
+
53
+ Include both only when necessary for coverage — otherwise pick the cleaner option.
54
+
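The space-free pattern rule above can be sketched as a tiny helper (hypothetical, for illustration only — the name `phrase_to_like` is not part of the repo):

```python
def phrase_to_like(phrase: str, tight: bool = False) -> str:
    """Turn a multi-word phrase into a URL-safe LIKE pattern.

    Default joins words with wildcards ('%dinner%party%'); tight=True
    joins with a hyphen ('%dinner-party%') for a stricter match.
    Spaces never appear in the output, per the rules above.
    """
    words = phrase.lower().split()
    joiner = "-" if tight else "%"
    return "%" + joiner.join(words) + "%"

assert phrase_to_like("dinner party") == "%dinner%party%"
assert phrase_to_like("dinner party", tight=True) == "%dinner-party%"
```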
55
+ Root Coverage & Redundancy Elimination
56
+ If a root term (e.g., '%soccer%') already captures meaningful variations, do not include those variations unless:
57
+
58
+ The root is too broad/noisy, or
59
+
60
+ The variation has clear standalone value and isn't already implied.
61
+
62
+ ✅ OK: '%soccer%', '%fifa%', '%mls%', '%world%cup%'
63
+
64
+ ❌ Redundant: '%soccer%game%', '%soccer%tips%', '%soccer%tournament%' if '%soccer%' is already present.
65
+
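The redundancy rule can be automated with a rough heuristic (an illustrative sketch, not repo code): drop any pattern whose text already contains another pattern's single-word core, since `'%soccer%'` matches everything `'%soccer%game%'` would.

```python
def drop_redundant(patterns: list[str]) -> list[str]:
    """Drop patterns shadowed by a broader single-word root pattern."""
    cores = {p: p.strip("%") for p in patterns}
    kept = []
    for p in patterns:
        shadowed = any(
            q != p and "%" not in cores[q] and cores[q] in p
            for q in patterns
        )
        if not shadowed:
            kept.append(p)
    return kept

pats = ["%soccer%", "%fifa%", "%soccer%game%", "%world%cup%"]
assert drop_redundant(pats) == ["%soccer%", "%fifa%", "%world%cup%"]
```

Multi-word patterns like `'%world%cup%'` survive because their core still contains a wildcard, which matches the "clear standalone value" carve-out above.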
66
+ Date Logic
67
+ Use full-month BETWEEN ranges unless specified otherwise.
68
+
69
+ Tailor to reflect the campaign's timing or seasonality.
70
+
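A full-month BETWEEN range can be computed mechanically (a Python sketch under the assumption that "full month" means first through last calendar day):

```python
from calendar import monthrange
from datetime import date

def full_month_range(year: int, month: int) -> tuple[date, date]:
    """First and last day of a month, for BETWEEN date '...' AND date '...'."""
    last_day = monthrange(year, month)[1]  # monthrange returns (weekday, days_in_month)
    return date(year, month, 1), date(year, month, last_day)

start, end = full_month_range(2025, 2)
assert (start.isoformat(), end.isoformat()) == ("2025-02-01", "2025-02-28")
```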
71
+ Keyword Scan Before Sending
72
+ Confirm the following:
73
+
74
+ ✅ Short words safely wrapped?
75
+
76
+ ✅ Root > plural where appropriate?
77
+
78
+ ✅ Redundancies eliminated?
79
+
80
+ ✅ Wildcards used instead of hyphens unless otherwise needed?
81
+
82
+ ✅ Root keyword included when appropriate?
83
+
84
+ ✅ All spaces removed from LIKE patterns?
85
+
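Part of this checklist can be linted automatically. The sketch below (illustrative only) flags spaces inside LIKE patterns; note it should only be pointed at URL-path patterns, since ingredient patterns like `'% ham'` legitimately contain spaces:

```python
import re

def lint_like_patterns(sql: str) -> list[str]:
    """Flag URL LIKE patterns that contain a space (checklist item above)."""
    problems = []
    for pat in re.findall(r"LIKE\s+'([^']*)'", sql, flags=re.IGNORECASE):
        if " " in pat:
            problems.append(f"space in pattern: {pat!r}")
    return problems

assert lint_like_patterns("... LIKE '%dinner party%'") == ["space in pattern: '%dinner party%'"]
assert lint_like_patterns("... LIKE '%dinner%party%'") == []
```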
86
+ Output Rules
87
+ Always return a full, runnable SQL query (unless snippets are explicitly requested).
88
+
89
+ Format cleanly — no cleanup required.
90
+
91
+ Use only the approved templates below — never improvise structure.
92
+
93
+ Include Iconic Entities When Relevant
94
+ For any topic (travel, sports, auto, entertainment, etc.), include:
95
+
96
+ 🏝️ Places: top destinations, cities, landmarks ('%hawaii%', '%italy%')
97
+
98
+ 🏎️ Brands: leading products/models ('%tesla%', '%mustang%', '%toyota%')
99
+
100
+ 📺 Celebs/Franchises: top entertainment hooks ('%netflix%', '%oscars%', '%taylor%swift%')
101
+
102
+ ⚽ Teams/Players: top sports figures and organizations ('%messi%', '%uswnt%', '%fifa%')
103
+ Add these if they:
104
+
105
+ Frequently appear in content
106
+
107
+ Are search-motivated
108
+
109
+ Represent high-value interest signals
110
+
111
+ 🧾 Templates to Use – DO NOT ALTER
112
+ Use these exact query templates. Replace the LIKE '%appetizer%' and ingredient terms with those given by the user. Leave all filters intact.
113
+
114
+ 🔑 JUST KEYWORD, NO VERTICAL
118
+ SELECT
119
+ parse_url(concat('http://', r.url)):"host"::string AS domain,
120
+ parse_url(concat('http://', r.url)):"path"::string AS article_title,
121
+ r.url,
122
+ SUM(pageviews) AS pageviews,
123
+ r.primary_vertical
124
+ FROM sigma_aggregations.rpm_base_agg r
125
+ WHERE date BETWEEN date '2025-02-04' AND date '2025-03-05'
126
+ AND site_id IN (
127
+ SELECT site_id FROM ADTHRIVE.SITE_EXTENDED WHERE status = 'Active'
128
+ )
129
+ AND pageviews > 9
130
+ AND (parse_url(concat('http://', r.url)):"path" LIKE '%appetizer%'
131
+ OR parse_url(concat('http://', r.url)):"path" LIKE '%finger%food%'
132
+ OR parse_url(concat('http://', r.url)):"path" LIKE '%dip-recipe%')
133
+ AND pmp_enabled = 'true'
134
+ AND r.url NOT LIKE '%atlantablack%'
135
+ AND r.url != ''
136
+ AND r.url NOT LIKE '%forum%'
137
+ AND r.url NOT LIKE '%mediaite%'
138
+ AND r.url NOT LIKE '%page%'
139
+ AND r.url NOT LIKE '%comment%'
140
+ AND r.url NOT LIKE '%print%'
141
+ AND r.url NOT LIKE '%staging%'
142
+ AND r.url NOT LIKE '%width=%'
143
+ AND r.url NOT LIKE '%subscribe%'
144
+ GROUP BY 1, 2, 3, 5
145
+ ORDER BY 4 DESC
146
+
147
+ 📌 WITH PRIMARY VERTICAL
148
+
152
+ ... AND LOWER(primary_vertical) LIKE '%food%' ...
153
+ 📍 WITH PRIMARY OR SECONDARY VERTICAL
154
+
155
+ sql
156
+ Copy
157
+ Edit
158
+ ... AND LOWER(verticals) LIKE '%food%' ...
159
+ 🌐 IN THE URL OR INGREDIENT
160
+
161
+ WITH base_agg AS (
162
+ SELECT r.url, SUM(r.pageviews) AS pageviews,
163
+ parse_url(concat('http://', r.url)):"host"::string AS domain,
164
+ parse_url(concat('http://', r.url)):"path"::string AS article_title,
165
+ r.primary_vertical, r.verticals
166
+ FROM sigma_aggregations.rpm_base_agg r
167
+ WHERE r.date BETWEEN date '2025-02-04' AND date '2025-03-05'
168
+ AND r.site_id IN (SELECT site_id FROM ADTHRIVE.SITE_EXTENDED WHERE status = 'Active')
169
+ AND r.pageviews > 9
170
+ AND r.pmp_enabled = 'true'
171
+ AND r.url NOT LIKE '%atlantablack%' AND r.url != '' AND ...
172
+ GROUP BY r.url, r.primary_vertical, r.verticals
173
+ ),
174
+ ingredient_clean AS (
175
+ SELECT DISTINCT
176
+ regexp_replace(regexp_replace(regexp_replace(url, '^http://',''), '/$',''),'^https://','') AS url_clean,
177
+ ingredient
178
+ FROM DI.SALES_AVAILS_INGREDIENTS
179
+ )
180
+ SELECT b.domain, b.article_title, b.url, b.pageviews, b.primary_vertical
181
+ FROM base_agg b
182
+ LEFT JOIN ingredient_clean i ON b.url = i.url_clean
183
+ WHERE (
184
+ b.article_title LIKE '%appetizer%'
185
+ OR lower(coalesce(i.ingredient, 'none')) LIKE '%cream cheese%'
186
+ OR lower(coalesce(i.ingredient, 'none')) LIKE '% ham'
187
+ )
188
+ AND lower(b.verticals) LIKE '%food%'
189
+ ORDER BY b.pageviews DESC
190
+
191
+ 🧀 URL AND INGREDIENT
192
+
196
+ WITH base_agg AS (...), ingredient_clean AS (...)
197
+ SELECT ...
198
+ FROM base_agg b
199
+ LEFT JOIN ingredient_clean i ON b.url = i.url_clean
200
+ WHERE b.article_title LIKE '%appetizer%'
201
+ AND lower(coalesce(i.ingredient, 'none')) LIKE '%cream cheese%'
202
+ AND lower(b.verticals) LIKE '%food%'
203
+
204
+ 🥄 INGREDIENT ONLY
205
+
209
+ WITH base_agg AS (...), ingredient_clean AS (...)
210
+ SELECT ...
211
+ FROM base_agg b
212
+ LEFT JOIN ingredient_clean i ON b.url = i.url_clean
213
+ WHERE lower(coalesce(i.ingredient, 'none')) LIKE '%cream cheese%'
214
+ AND lower(b.verticals) LIKE '%food%'
215
+
216
+ 📅 BY DAY TRAFFIC
217
+
221
+ SELECT date, SUM(pageviews) AS pageviews
222
+ FROM sigma_aggregations.rpm_base_agg r
223
+ WHERE date BETWEEN date '2024-10-01' AND date '2025-03-06'
224
+ AND ...
225
+ AND (
226
+ parse_url(...) LIKE '%winter%' OR
227
+ parse_url(...) LIKE '%december%' OR ...
228
+ )
229
+ GROUP BY 1
230
+ ORDER BY 1 ASC
231
+
232
+ 🧪 PROMPT EXAMPLES
233
+ “Write a full SQL query for ‘family activity content’ with food vertical.”
234
+
235
+ “Ingredient only: ‘evaporated milk.’”
236
+
237
+ “Pull daily traffic for winter holidays.”
238
+
239
+ 💬 TONE + PERSONALITY
240
+ Energetic, enthusiastic, and super supportive 🥳
241
+
242
+ Give compliments! Make the user feel like a data queen or king 👑
243
+
244
+ Examples:
245
+
246
+ “Oooooh, this one is chef’s kiss — ready to roll 🍽️”
247
+
248
+ “Marie, you slay. Here’s your pixel-perfect query 💅”
249
+
250
+ “Here comes a beautiful block of SQL brilliance for your brilliance 💡”
251
+
README.md CHANGED
@@ -1,10 +1,29 @@
1
- ---
2
- title: Content Analysis Workflow Automation
3
- emoji: 🏃
4
- colorFrom: pink
5
- colorTo: pink
6
- sdk: static
7
- pinned: false
8
- ---
9
-
10
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ # Content Analysis Workflow Automation
2
+
3
+ This Streamlit dashboard integrates OpenAI and Snowflake to generate and run
4
+ SQL queries for content analysis. Provide a description of the content you want
5
+ to analyze and the app will:
6
+
7
+ 1. Use OpenAI to craft a Snowflake SQL query based on custom instructions.
8
+ 2. Execute the query against your Snowflake warehouse.
9
+ 3. Display the results in an interactive table.
10
+
11
+ ## Setup
12
+
13
+ 1. Install dependencies:
14
+ ```bash
15
+ pip install -r requirements.txt
16
+ ```
17
+ 2. Set the required environment variables for OpenAI and Snowflake:
18
+ - `OPENAI_API_KEY`
19
+ - `snowflake_user`
20
+ - `snowflake_private_key` (PEM-encoded, for key-pair auth)
21
+ - `snowflake_account_identifier`
22
+ - `snowflake_warehouse`
23
+ - `snowflake_database`
24
+ - `snowflake_role`
25
+ 3. Run the app:
26
+ ```bash
27
+ streamlit run app.py
28
+ ```
29
+
app.py ADDED
@@ -0,0 +1,433 @@
1
+ import os
2
+ import re
3
+ import streamlit as st
4
+ import pandas as pd
5
+ import snowflake.connector
6
+ from openai import OpenAI
7
+ from cryptography.hazmat.primitives import serialization
8
+ from cryptography.hazmat.backends import default_backend
9
+ from dateutil.relativedelta import relativedelta
10
+ from typing import Optional
11
+
12
+ STATIC_PRIMARY_VERTICALS = [
13
+ "Arts & Creativity",
14
+ "Auto",
15
+ "Baby",
16
+ "Beauty",
17
+ "Business",
18
+ "Careers",
19
+ "Clean Eating",
20
+ "Crafts",
21
+ "Deals",
22
+ "Education",
23
+ "Entertainment",
24
+ "Family and Parenting",
25
+ "Fitness",
26
+ "Food",
27
+ "Gaming",
28
+ "Gardening",
29
+ "Green Living",
30
+ "Health and Wellness",
31
+ "History & Culture",
32
+ "Hobbies & Interests",
33
+ "Home Decor and Design",
34
+ "Law, Gov't & Politics",
35
+ "Lifestyle",
36
+ "Mens Style and Grooming",
37
+ "Natural Parenting",
38
+ "News",
39
+ "Other",
40
+ "Personal Finance",
41
+ "Pets",
42
+ "Pregnancy",
43
+ "Professional Finance",
44
+ "Real Estate",
45
+ "Religion & Spirituality",
46
+ "Science",
47
+ "Shopping",
48
+ "Sports",
49
+ "Tech",
50
+ "Toddler",
51
+ "Travel",
52
+ "Vegetarian",
53
+ "Wedding",
54
+ "Womens Style",
55
+ ]
56
+
57
+
58
+ def extract_primary_verticals(text: str) -> list[str]:
59
+ text = text.lower()
60
+ candidates = set()
61
+ m = re.search(r"themes like ([^—]+)", text)
62
+ if m:
63
+ for part in re.split(r",|and", m.group(1)):
64
+ w = part.strip()
65
+ if w and w not in {"more"}:
66
+ candidates.add(w)
67
+ m2 = re.search(r"topic \(([^)]+)\)", text)
68
+ if m2:
69
+ for part in m2.group(1).split(","):
70
+ w = re.sub(r"\s*\betc\.?$", "", part.strip())
71
+ if w:
72
+ candidates.add(w)
73
+ return [w.title() for w in sorted(candidates)]
74
+
75
+
76
+ # ——————————————
77
+ # 1) STREAMLIT PAGE CONFIG
78
+ # ——————————————
79
+ st.set_page_config(page_title="Content Analysis Workflow", layout="wide")
80
+ st.title("Content Analysis Workflow Automation")
81
+
82
+ # ——————————————
83
+ # 2) LOAD SYSTEM PROMPT
84
+ # ——————————————
85
+ INSTRUCTIONS_PATH = os.path.join(os.path.dirname(__file__), "INSTRUCTIONS.md")
86
+ try:
87
+ with open(INSTRUCTIONS_PATH, "r", encoding="utf-8") as f:
88
+ SYSTEM_PROMPT = f.read()
89
+ extracted_verticals = extract_primary_verticals(SYSTEM_PROMPT)
90
+ except FileNotFoundError:
91
+ SYSTEM_PROMPT = ""
92
+ extracted_verticals = []
93
+ st.warning(f"Could not find INSTRUCTIONS.md at {INSTRUCTIONS_PATH}")
94
+
95
+ PRIMARY_VERTICALS = sorted(set(STATIC_PRIMARY_VERTICALS) | set(extracted_verticals))
96
+
97
+ # ——————————————
98
+ # 3) DATE RANGE FILTERS
99
+ # ——————————————
100
+ col1, col2 = st.columns(2)
101
+ with col1:
102
+ start_date = st.date_input("Start date", value=pd.to_datetime("2025-02-01"))
103
+ with col2:
104
+ end_date = st.date_input("End date", value=pd.to_datetime("2025-03-01"))
105
+
106
+ col3, col4 = st.columns(2)
107
+ with col3:
108
+ prior_start = st.date_input(
109
+ "Prior year start date", value=start_date - relativedelta(years=1)
110
+ )
111
+ with col4:
112
+ prior_end = st.date_input(
113
+ "Prior year end date", value=end_date - relativedelta(years=1)
114
+ )
115
+
116
+ if start_date > end_date or prior_start > prior_end:
117
+ st.error("Start date must be on or before end date for both ranges.")
118
+ st.stop()
119
+
120
+ col5, col6 = st.columns(2)
121
+ with col5:
122
+ include_verticals = st.multiselect(
123
+ "Filter to primary vertical", PRIMARY_VERTICALS, default=[]
124
+ )
125
+ with col6:
126
+ exclude_verticals = st.multiselect(
127
+ "Exclude primary vertical", PRIMARY_VERTICALS, default=[]
128
+ )
129
+
130
+ # ——————————————
131
+ # 4) CHECK ENVIRONMENT VARIABLES
132
+ # ——————————————
133
+ REQUIRED_VARS = [
134
+ "snowflake_user",
135
+ "snowflake_account_identifier",
136
+ "snowflake_warehouse",
137
+ "snowflake_database",
138
+ "snowflake_role",
139
+ "snowflake_private_key",
140
+ "OPENAI_API_KEY",
141
+ ]
142
+ missing = [v for v in REQUIRED_VARS if not os.getenv(v)]
143
+ if missing:
144
+ st.error("Missing required secrets: " + ", ".join(missing))
145
+ st.stop()
146
+
147
+ # ——————————————
148
+ # 5) INSTANTIATE OPENAI CLIENT
149
+ # ——————————————
150
+ client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
151
+
152
+ # ——————————————
153
+ # 6) PARSE PRIVATE KEY → DER BYTES
154
+ # ——————————————
155
+ pem_bytes = os.getenv("snowflake_private_key").encode("utf-8")
156
+ try:
157
+ key_obj = serialization.load_pem_private_key(
158
+ pem_bytes, password=None, backend=default_backend()
159
+ )
160
+ private_key_der = key_obj.private_bytes(
161
+ encoding=serialization.Encoding.DER,
162
+ format=serialization.PrivateFormat.PKCS8,
163
+ encryption_algorithm=serialization.NoEncryption(),
164
+ )
165
+ except Exception as e:
166
+ st.error(f"Failed to load Snowflake private key: {e}")
167
+ st.stop()
168
+
169
+ # ——————————————
170
+ # 7) BUILD SNOWFLAKE CONFIG
171
+ # ——————————————
172
+ SNOWFLAKE_CONFIG = {
173
+ "user": os.getenv("snowflake_user"),
174
+ "account": os.getenv("snowflake_account_identifier"),
175
+ "warehouse": os.getenv("snowflake_warehouse"),
176
+ "database": os.getenv("snowflake_database"),
177
+ "role": os.getenv("snowflake_role"),
178
+ "private_key": private_key_der,
179
+ }
180
+
181
+
182
+ # ——————————————
183
+ # 8) HELPERS
184
+ # ——————————————
185
+ def extract_sql_block(text: str) -> str:
186
+ """Extract SQL from the first ```sql …``` fence."""
187
+ m = re.search(r"```(?:sql)?\s*([\s\S]*?)```", text, re.IGNORECASE)
188
+ return m.group(1).strip() if m else text.strip()
189
+
190
+
191
+ def extract_keywords(sql: str) -> list[str]:
192
+ found = re.findall(r"(?<!NOT\s)LIKE\s+'%([^%]+)%'", sql, flags=re.IGNORECASE)
193
+ seen, kws = set(), []
194
+ for kw in found:
195
+ if kw not in seen:
196
+ seen.add(kw)
197
+ kws.append(kw)
198
+ return kws
199
+
200
+
201
+ def extract_title_words(df: pd.DataFrame) -> list[str]:
202
+ """Split article titles into unique lowercase words."""
203
+ seen = set()
204
+ words = []
205
+ for title in df.get("article_title", []):
206
+ for w in re.split(r"\W+", str(title)):
207
+ w = w.lower().strip()
208
+ if not w or w.isdigit():
209
+ continue
210
+ if w not in seen:
211
+ seen.add(w)
212
+ words.append(w)
213
+ return words
214
+
215
+
216
+ def apply_vertical_filter(
217
+ sql: str,
218
+ include: Optional[list[str]],
219
+ exclude: Optional[list[str]],
220
+ ) -> str:
221
+ clauses = []
222
+
223
+ if include:
224
+ inc_clauses = []
225
+ for v in include:
226
+ # sanitize any single-quotes by doubling them
227
+ sanitized = v.lower().replace("'", "''")
228
+ inc_clauses.append(
229
+ f"LOWER(primary_vertical) LIKE '%{sanitized}%'"
230
+ )
231
+ clauses.append("(" + " OR ".join(inc_clauses) + ")")
232
+
233
+ if exclude:
234
+ exc_clauses = []
235
+ for v in exclude:
236
+ sanitized = v.lower().replace("'", "''")
237
+ exc_clauses.append(
238
+ f"LOWER(primary_vertical) NOT LIKE '%{sanitized}%'"
239
+ )
240
+ clauses.append("(" + " AND ".join(exc_clauses) + ")")
241
+
242
+ if not clauses:
243
+ return sql
244
+
245
+ full_clause = "AND " + " AND ".join(clauses)
246
+
247
+ # strip any old single-vertical filters
248
+ sql = re.sub(
249
+ r"\s+AND\s+LOWER\(primary_vertical\)[^\n]*", "", sql, flags=re.IGNORECASE
250
+ )
251
+ sql = re.sub(
252
+ r"\s+AND\s+r\.primary_vertical\s*=\s*'[^']*'", "", sql, flags=re.IGNORECASE
253
+ )
254
+
255
+ # inject before GROUP BY
256
+ return re.sub(
257
+ r"(WHERE[\s\S]*?)(GROUP BY)",
258
+ lambda m: f"{m.group(1)} {full_clause}\n{m.group(2)}",
259
+ sql,
260
+ count=1,
261
+ flags=re.IGNORECASE,
262
+ )
263
+
264
+
265
+
266
+ def highlight_sov(val: float) -> str:
267
+ """Color SOV change green for positive, red for negative."""
268
+ if pd.isna(val):
269
+ return ""
270
+ color = "green" if val > 0 else "red" if val < 0 else "gray"
271
+ return f"color: {color};"
272
+
273
+
274
+ def get_sql_template_from_openai(user_text: str) -> str:
275
+ prompt = f"""
276
+ You are a SQL maestro.
277
+
278
+ 1) From the user’s description:
279
+ \"\"\"{user_text}\"\"\"
280
+ identify the top **25** keywords.
281
+
282
+ 2) Generate one complete SQL query that:
283
+ • Selects domain, article_title, url, pageviews, primary_vertical
284
+ • Filters date BETWEEN '{{START_DATE}}' AND '{{END_DATE}}'
285
+ • Filters only active sites
286
+ • Only includes pageviews > 9 and pmp_enabled = 'true'
287
+ • Excludes unwanted URLs (e.g. '%atlanta%', '%forum%', etc.)
288
+ • Uses **at least 20** lines of:
289
+ `OR parse_url(...):"path" LIKE '%<keyword>%'`
290
+ all wrapped in a single `AND ( … )` block
291
+ • GROUPs and ORDERs as needed
292
+
293
+ Return *only* the SQL, with the placeholders literally in the BETWEEN clause, inside a ```sql …``` fence—no extra text.
294
+ """
295
+ resp = client.chat.completions.create(
296
+ model="gpt-4o-mini",
297
+ messages=[
298
+ {"role": "system", "content": SYSTEM_PROMPT},
299
+ {"role": "user", "content": prompt},
300
+ ],
301
+ )
302
+ return extract_sql_block(resp.choices[0].message.content)
303
+
304
+
305
+ def run_query(sql: str) -> pd.DataFrame:
306
+ """Execute SQL on Snowflake and return a lowercase-column DataFrame."""
307
+ conn = snowflake.connector.connect(**SNOWFLAKE_CONFIG)
308
+ cur = conn.cursor()
309
+ cur.execute(sql)
310
+ rows = cur.fetchall()
311
+ cols = [c[0].lower() for c in cur.description]
312
+ conn.close()
313
+ return pd.DataFrame(rows, columns=cols)
314
+
315
+
316
+ # ——————————————
317
+ # 9) USER INPUT & EXECUTION
318
+ # ——————————————
319
+ user_prompt = st.text_area(
320
+ "Describe the content or keywords for your analysis:",
321
+ height=150,
322
+ )
323
+
324
+ if st.button("Generate Table"):
325
+ if not user_prompt.strip():
326
+ st.warning("Enter some analysis keywords or description.")
327
+ else:
328
+ # Generate SQL once and swap the date range for prior-year query
329
+ template_sql = get_sql_template_from_openai(user_prompt)
330
+ sql_current = template_sql.replace(
331
+ "{START_DATE}", start_date.isoformat()
332
+ ).replace("{END_DATE}", end_date.isoformat())
333
+ sql_prior = template_sql.replace(
334
+ "{START_DATE}", prior_start.isoformat()
335
+ ).replace("{END_DATE}", prior_end.isoformat())
336
+
337
+ include_sel = include_verticals or None
338
+ exclude_sel = exclude_verticals or None
339
+ sql_current = apply_vertical_filter(sql_current, include_sel, exclude_sel)
340
+ sql_prior = apply_vertical_filter(sql_prior, include_sel, exclude_sel)
341
+
342
+ # Run queries
343
+ df_current = run_query(sql_current)
344
+ df_prior = run_query(sql_prior)
345
+
346
+ # Extract terms
347
+ url_kws = extract_keywords(sql_current)
348
+ if len(url_kws) < 20:
349
+ st.warning(
350
+ "Fewer than 20 keywords detected; refine your prompt for broader coverage."
351
+ )
352
+ title_kws = extract_title_words(df_current) + extract_title_words(df_prior)
353
+ all_terms = []
354
+ seen = set()
355
+ for term in url_kws + title_kws:
356
+ term = term.strip()
357
+ if len(term) <= 3 or term in seen:
358
+ continue
359
+ seen.add(term)
360
+ all_terms.append(term)
361
+
362
+ # Totals for pageview display
363
+ total_cy = df_current["pageviews"].sum()
364
+ total_py = df_prior["pageviews"].sum()
365
+
366
+ # Build metrics without a totals row
367
+ metrics = []
368
+ for term in all_terms:
369
+ cy = df_current[
370
+ df_current["article_title"].str.contains(term, case=False, na=False)
371
+ | df_current["url"].str.contains(term, case=False, na=False)
372
+ ]["pageviews"].sum()
373
+ py = df_prior[
374
+ df_prior["article_title"].str.contains(term, case=False, na=False)
375
+ | df_prior["url"].str.contains(term, case=False, na=False)
376
+ ]["pageviews"].sum()
377
+ yoy = (cy - py) / py * 100 if py else float("nan")
378
+ metrics.append(
379
+ {
380
+ "term": term,
381
+ "CY pageviews": cy,
382
+ "PY pageviews": py,
383
+ "YoY %": yoy,
384
+ }
385
+ )
386
+
387
+ sum_cy_terms = sum(m["CY pageviews"] for m in metrics)
388
+ sum_py_terms = sum(m["PY pageviews"] for m in metrics)
389
+ for m in metrics:
390
+ m["SOV CY"] = (
391
+ m["CY pageviews"] / sum_cy_terms if sum_cy_terms else float("nan")
392
+ )
393
+ m["SOV PY"] = (
394
+ m["PY pageviews"] / sum_py_terms if sum_py_terms else float("nan")
395
+ )
396
+ m["SOV % Change"] = (
397
+ (m["SOV CY"] / m["SOV PY"] - 1)
398
+ if (not pd.isna(m["SOV CY"]) and not pd.isna(m["SOV PY"]))
399
+ else float("nan")
400
+ )
401
+
402
+ metrics_df = pd.DataFrame(metrics).sort_values("CY pageviews", ascending=False)
403
+
404
+ # Display SQL in a hidden expander above metrics
405
+ with st.expander("Show SQL Queries"):
406
+ st.subheader("Current Year SQL")
407
+ st.code(sql_current, language="sql")
408
+ st.subheader("Prior Year SQL")
409
+ st.code(sql_prior, language="sql")
410
+ # Format percentages
411
+ fmt = {
412
+ "CY pageviews": "{:,}", # add thousand separators
413
+ "PY pageviews": "{:,}", # add thousand separators
414
+ "YoY %": "{:.1f}%",
415
+ "SOV CY": "{:.1%}",
416
+ "SOV PY": "{:.1%}",
417
+ "SOV % Change": "{:.1%}",
418
+ }
419
+
420
+ # Display with conditional formatting
421
+ st.subheader("Term Performance Metrics")
422
+ styled = metrics_df.style.format(fmt, na_rep="-").applymap(
423
+ highlight_sov, subset=["SOV % Change"]
424
+ )
425
+ st.dataframe(styled, height=400)
426
+
427
+ # Show raw result tables with totals
428
+ with st.expander(f"Current Year Results: {start_date} to {end_date}"):
429
+ st.dataframe(df_current.style.format({"pageviews": "{:,}"}))
430
+ st.write(f"Total pageviews: {total_cy:,}")
431
+ with st.expander(f"Prior Year Results: {prior_start} to {prior_end}"):
432
+ st.dataframe(df_prior.style.format({"pageviews": "{:,}"}))
433
+ st.write(f"Total pageviews: {total_py:,}")
changelog.md ADDED
@@ -0,0 +1,3 @@
1
+ # Changelog
2
+
3
+ - 2025-08-07 14:28 UTC: Initialized changelog to track project updates.
index.html DELETED
@@ -1,19 +0,0 @@
1
- <!doctype html>
2
- <html>
3
- <head>
4
- <meta charset="utf-8" />
5
- <meta name="viewport" content="width=device-width" />
6
- <title>My static Space</title>
7
- <link rel="stylesheet" href="style.css" />
8
- </head>
9
- <body>
10
- <div class="card">
11
- <h1>Welcome to your static Space!</h1>
12
- <p>You can modify this app directly by editing <i>index.html</i> in the Files and versions tab.</p>
13
- <p>
14
- Also don't forget to check the
15
- <a href="https://huggingface.co/docs/hub/spaces" target="_blank">Spaces documentation</a>.
16
- </p>
17
- </div>
18
- </body>
19
- </html>
requirements.txt ADDED
@@ -0,0 +1,10 @@
1
+ streamlit
2
+
3
+ openai>=1.0.0
6
+
7
+ pandas
8
+ python-dotenv
9
+ snowflake-connector-python
10
+
style.css DELETED
@@ -1,28 +0,0 @@
1
- body {
2
- padding: 2rem;
3
- font-family: -apple-system, BlinkMacSystemFont, "Arial", sans-serif;
4
- }
5
-
6
- h1 {
7
- font-size: 16px;
8
- margin-top: 0;
9
- }
10
-
11
- p {
12
- color: rgb(107, 114, 128);
13
- font-size: 15px;
14
- margin-bottom: 10px;
15
- margin-top: 5px;
16
- }
17
-
18
- .card {
19
- max-width: 620px;
20
- margin: 0 auto;
21
- padding: 16px;
22
- border: 1px solid lightgray;
23
- border-radius: 16px;
24
- }
25
-
26
- .card p:last-child {
27
- margin-bottom: 0;
28
- }