Omraghu07 committed on
Commit
b71ce4b
·
verified ·
1 Parent(s): dd50577

Upload 11 files


🚀 SEO Keyword Research AI Agent

An AI-powered SEO keyword research agent that discovers, analyzes, and ranks keyword opportunities using SerpAPI, featuring an interactive Streamlit dashboard for visualization and n8n automation integration.
This project showcases end-to-end skills in Python, AI agents, API integration, data visualization, and deployment (Render + n8n).

✨ Features

🔍 Keyword Discovery – Finds semantically related keywords for any seed keyword.
📊 Keyword Analysis – Scores each keyword based on volume, competition, and SERP metrics.
📂 Data Export – Export analyzed results as CSV/Excel files.
📈 Interactive Dashboard – Visualize keyword trends, heatmaps, and search intent using Streamlit + Plotly.
🤖 AI Agent Workflow – Automates the research → processing → reporting pipeline.
🔗 n8n Integration – Trigger workflows via webhooks (e.g., run research + auto-send reports to Slack/Email).
🌐 Deployment – Hosted on Render, accessible via API and dashboard.

Files changed (11)
  1. README.md +156 -0
  2. __init__.py +0 -0
  3. app.py +584 -0
  4. dashboard.py +830 -0
  5. git +0 -0
  6. keyword_agent.py +19 -0
  7. postprocess.py +366 -0
  8. ranking.py +569 -0
  9. requirements.txt +60 -0
  10. server.py +625 -0
  11. tempCodeRunnerFile.py +1 -0
README.md ADDED
@@ -0,0 +1,156 @@
+ # SEO Keyword Research AI Agent
+
+ An **AI-powered SEO keyword research agent** that discovers, analyzes, and ranks keyword opportunities using **SerpAPI**, with an interactive **Streamlit dashboard** for visualization and an **n8n integration** for automation.
+
+ This project was built to demonstrate skills in **Python, AI agents, API integration, data visualization, and deployment** (Render + n8n).
+
+ ---
+
+ ## 🚀 Features
+
+ - 🔍 **Keyword Discovery** – Finds related keywords for any seed keyword.
+ - 📊 **Keyword Analysis** – Scores keywords based on search volume, competition, and SERP signals.
+ - 📂 **Data Export** – Saves results to CSV/Excel with metadata.
+ - 📈 **Interactive Dashboard** – Streamlit + Plotly for keyword trends, competition heatmaps, and intent analysis.
+ - 🤖 **AI Agent Workflow** – Automates keyword research → processing → reporting.
+ - 🔗 **n8n Integration** – Trigger workflows via webhooks (e.g., run keyword research and auto-send results to Slack/Email).
+ - 🌐 **Deployment** – Hosted on **Render** for API and dashboard access.
+
+ ---
+
+ ## 🏗️ Project Structure
+
+ ```
+ seo-keyword-ai-agent/
+ ├── app.py              # Master pipeline orchestrator
+ ├── dashboard.py        # Streamlit visualization
+ ├── src/
+ │   ├── postprocess.py  # Cleans & enriches results
+ │   ├── ranking.py      # Keyword discovery & scoring
+ │   └── server.py       # FastAPI/Render server
+ ├── output/             # Generated keyword results
+ ├── .env                # API keys (not committed)
+ ├── requirements.txt    # Python dependencies
+ └── README.md           # Project documentation
+ ```
+
+ ---
+
+ ## ⚙️ Installation
+
+ 1. **Clone the repo**
+    ```bash
+    git clone https://github.com/omraghu07/seo-keyword-ai-agent.git
+    cd seo-keyword-ai-agent
+    ```
+
+ 2. **Create a virtual environment**
+    ```bash
+    python -m venv agent_venv
+
+    # Mac/Linux
+    source agent_venv/bin/activate
+
+    # Windows
+    agent_venv\Scripts\activate
+    ```
+
+ 3. **Install dependencies**
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ 4. **Set up the `.env` file**
+    Create a `.env` file in the root directory and add your API key:
+    ```
+    SERPAPI_KEY=your_serpapi_key_here
+    ```
+
+ ---
+
+ ## ▶️ Usage
+
+ ### Run the full pipeline
+ ```bash
+ python app.py "global internship" --max-candidates 100 --top-results 50
+ ```
+
+ ### Launch the dashboard
+ ```bash
+ streamlit run dashboard.py
+ ```
+
+ ### Run as an API (Render/FastAPI)
+ ```bash
+ gunicorn -k uvicorn.workers.UvicornWorker src.server:app --bind 0.0.0.0:8000 --workers 2
+ ```
+
+ ## 🔗 n8n Integration
+
+ - Create an n8n workflow with a Webhook node.
+ - Connect it to the Render API:
+
+   ```
+   POST https://seo-keyword-ai-agent.onrender.com/analyze
+   {
+     "seed": "global internship",
+     "top": 10
+   }
+   ```
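The same webhook call can be scripted without n8n. A minimal sketch using only the Python standard library; the endpoint URL and JSON body are taken from the README above, and whether the deployed service is currently live is not guaranteed:

```python
import json

API_URL = "https://seo-keyword-ai-agent.onrender.com/analyze"  # endpoint from the README

def build_payload(seed: str, top: int = 10) -> str:
    """Serialize the request body the /analyze endpoint expects."""
    return json.dumps({"seed": seed, "top": top})

if __name__ == "__main__":
    import urllib.request
    req = urllib.request.Request(
        API_URL,
        data=build_payload("global internship").encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Network call only runs when executed as a script
    with urllib.request.urlopen(req, timeout=60) as resp:
        print(resp.read().decode("utf-8"))
```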
+ - Add email/Slack nodes to auto-send reports.
+
+ ## 📊 Example Output
+
+ ### Top 5 Keyword Opportunities
+
+ | Keyword | Volume | Competition | Score | Results |
+ | --------------------------------- | ------ | ----------- | ------ | ------- |
+ | UCLA Global Internship Program | 2000 | 0.0 | 330.12 | 0 |
+ | Summer Internship Programs - CIEE | 1666 | 0.33 | 9.26 | 54,000 |
+ | Global Internship Program HENNGE | 2000 | 0.35 | 9.01 | 10,200 |
+ | Berkeley Global Internships Paid | 1666 | 0.45 | 6.98 | 219,000 |
+ | Global Internship Remote | 2500 | 0.50 | 6.66 | 174M |
+
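The Score column comes from the pipeline's opportunity formula, `log10(volume + 1) / (competition + 0.01)`, as recorded in the metadata that app.py writes. A minimal sketch that reproduces the first table row:

```python
import math

def opportunity_score(monthly_searches: int, competition: float) -> float:
    """Opportunity formula used by the pipeline: rewards volume, penalizes competition."""
    return round(math.log10(monthly_searches + 1) / (competition + 0.01), 2)

# UCLA row: volume 2000, competition 0.0
print(opportunity_score(2000, 0.0))  # → 330.12
```

The `+ 0.01` floor keeps zero-competition keywords from dividing by zero, which is why they dominate the table.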
+ ## 🛠️ Tech Stack
+
+ - Python (core language)
+ - SerpAPI (Google search results API)
+ - Pandas, Requests, Tabulate (data processing)
+ - Streamlit + Plotly (dashboard & charts)
+ - FastAPI + Gunicorn (API server)
+ - Render (deployment)
+ - n8n (workflow automation)
+
+ ## 👨‍💻 Author
+
+ Om Raghuwanshi – Engineering student passionate about AI
+
+ ## 🔗 Links
+
+ [![linkedin](https://img.shields.io/badge/linkedin-0A66C2?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/om-raghuwanshi-b5136a298)
+
+ ⚡ If you like this project, don't forget to ⭐ star the repo and fork it!
__init__.py ADDED
File without changes
app.py ADDED
@@ -0,0 +1,584 @@
+ # app.py
+ """
+ Complete Keyword Research Pipeline
+ Integrates keyword discovery, analysis, and post-processing into one workflow
+ """
+
+ import os
+ import sys
+ import argparse
+ from pathlib import Path
+ from dotenv import load_dotenv
+
+ # Load environment variables first
+ load_dotenv()
+
+ # Add current directory to path for imports
+ current_dir = Path(__file__).parent
+ sys.path.insert(0, str(current_dir))
+
+ def check_setup():
+     """Check if all requirements are met"""
+     print("🔍 Checking setup...")
+
+     # Check API key
+     api_key = os.getenv("SERPAPI_KEY")
+     if not api_key:
+         print("❌ SERPAPI_KEY not found in environment variables")
+         print("Make sure your .env file contains: SERPAPI_KEY=your_key_here")
+         return False
+
+     print(f"✅ API key found: {api_key[:10]}...")
+
+     # Check required packages
+     required_packages = [
+         ('serpapi', 'google-search-results'),
+         ('pandas', 'pandas'),
+         ('tabulate', 'tabulate'),
+         ('openpyxl', 'openpyxl')
+     ]
+
+     missing = []
+     for import_name, pip_name in required_packages:
+         try:
+             __import__(import_name)
+         except ImportError:
+             missing.append(pip_name)
+
+     if missing:
+         print("❌ Missing packages:")
+         for pkg in missing:
+             print(f"    pip install {pkg}")
+         return False
+
+     print("✅ All packages available")
+     return True
+
+ def run_keyword_analysis(seed_keyword, use_volume_api=False):
+     """Run the keyword analysis using the professional tool"""
+     print("\n🔍 Step 1: Running keyword analysis...")
+
+     try:
+         # Import and run the KeywordResearchTool
+         import os
+         import math
+         import csv
+         import re
+         import logging
+         from datetime import date
+         from typing import List, Dict, Optional, Tuple, Any
+         from dataclasses import dataclass
+         from serpapi import GoogleSearch
+
+         # Configure logging to be less verbose
+         logging.basicConfig(level=logging.WARNING)
+
+         @dataclass
+         class KeywordMetrics:
+             keyword: str
+             monthly_searches: int
+             competition_score: float
+             opportunity_score: float
+             total_results: int
+             ads_count: int
+             has_featured_snippet: bool
+             has_people_also_ask: bool
+             has_knowledge_graph: bool
+
+         class CompetitionCalculator:
+             WEIGHTS = {
+                 'total_results': 0.50,
+                 'ads': 0.25,
+                 'featured_snippet': 0.15,
+                 'people_also_ask': 0.07,
+                 'knowledge_graph': 0.03
+             }
+
+             @staticmethod
+             def extract_total_results(search_info):
+                 if not search_info:
+                     return 0
+
+                 total = (search_info.get("total_results") or
+                          search_info.get("total_results_raw") or
+                          search_info.get("total"))
+
+                 if isinstance(total, int):
+                     return total
+
+                 if isinstance(total, str):
+                     numbers_only = re.sub(r"[^\d]", "", total)
+                     try:
+                         return int(numbers_only) if numbers_only else 0
+                     except ValueError:
+                         return 0
+
+                 return 0
+
+             def calculate_score(self, search_results):
+                 search_info = search_results.get("search_information", {})
+
+                 total_results = self.extract_total_results(search_info)
+                 normalized_results = min(math.log10(total_results + 1) / 7, 1.0)
+
+                 ads = search_results.get("ads_results", [])
+                 ads_count = len(ads) if ads else 0
+                 ads_score = min(ads_count / 3, 1.0)
+
+                 has_featured_snippet = bool(
+                     search_results.get("featured_snippet") or
+                     search_results.get("answer_box")
+                 )
+
+                 has_people_also_ask = bool(
+                     search_results.get("related_questions") or
+                     search_results.get("people_also_ask")
+                 )
+
+                 has_knowledge_graph = bool(search_results.get("knowledge_graph"))
+
+                 competition_score = (
+                     self.WEIGHTS['total_results'] * normalized_results +
+                     self.WEIGHTS['ads'] * ads_score +
+                     self.WEIGHTS['featured_snippet'] * has_featured_snippet +
+                     self.WEIGHTS['people_also_ask'] * has_people_also_ask +
+                     self.WEIGHTS['knowledge_graph'] * has_knowledge_graph
+                 )
+
+                 competition_score = max(0.0, min(1.0, competition_score))
+
+                 breakdown = {
+                     "total_results": total_results,
+                     "ads_count": ads_count,
+                     "has_featured_snippet": has_featured_snippet,
+                     "has_people_also_ask": has_people_also_ask,
+                     "has_knowledge_graph": has_knowledge_graph
+                 }
+
+                 return competition_score, breakdown
+
+         # Main analysis functions
+         def find_related_keywords(seed_keyword, max_results=120):
+             print(f"Finding related keywords for: '{seed_keyword}'...")
+
+             params = {
+                 "engine": "google",
+                 "q": seed_keyword,
+                 "api_key": os.getenv("SERPAPI_KEY"),
+                 "hl": "en",
+                 "gl": "us"
+             }
+
+             try:
+                 search = GoogleSearch(params)
+                 results = search.get_dict()
+             except Exception as e:
+                 print(f"Error getting related keywords: {e}")
+                 return []
+
+             keyword_candidates = set()
+
+             # Get related searches
+             related_searches = results.get("related_searches", [])
+             for item in related_searches:
+                 query = item.get("query") or item.get("suggestion")
+                 if query and len(query.strip()) > 0:
+                     keyword_candidates.add(query.strip())
+
+             # Get people also ask
+             related_questions = results.get("related_questions", [])
+             for item in related_questions:
+                 question = item.get("question") or item.get("query")
+                 if question and len(question.strip()) > 0:
+                     keyword_candidates.add(question.strip())
+
+             # Get organic titles
+             organic_results = results.get("organic_results", [])
+             for result in organic_results[:10]:
+                 title = result.get("title", "")
+                 if title and len(title.strip()) > 0:
+                     keyword_candidates.add(title.strip())
+
+             final_keywords = list(keyword_candidates)[:max_results]
+             print(f"Found {len(final_keywords)} keyword candidates")
+
+             return final_keywords
+
+         def analyze_keywords(keywords, use_volume_api=False):
+             print(f"Analyzing {len(keywords)} keywords...")
+
+             calculator = CompetitionCalculator()
+             analyzed_keywords = []
+
+             for i, keyword in enumerate(keywords, 1):
+                 if i % 10 == 0:
+                     print(f"Progress: {i}/{len(keywords)} keywords processed")
+
+                 # Search for keyword
+                 params = {
+                     "engine": "google",
+                     "q": keyword,
+                     "api_key": os.getenv("SERPAPI_KEY"),
+                     "hl": "en",
+                     "gl": "us",
+                     "num": 10
+                 }
+
+                 try:
+                     search = GoogleSearch(params)
+                     search_results = search.get_dict()
+                 except Exception as e:
+                     print(f"Error analyzing '{keyword}': {e}")
+                     continue
+
+                 # Calculate competition
+                 competition_score, breakdown = calculator.calculate_score(search_results)
+
+                 # Estimate volume
+                 word_count = len(keyword.split())
+                 search_volume = max(10, 10000 // (word_count + 1))
+
+                 # Calculate opportunity score
+                 volume_score = math.log10(search_volume + 1)
+                 opportunity_score = volume_score / (competition_score + 0.01)
+
+                 metrics = KeywordMetrics(
+                     keyword=keyword,
+                     monthly_searches=search_volume,
+                     competition_score=round(competition_score, 4),
+                     opportunity_score=round(opportunity_score, 2),
+                     total_results=breakdown["total_results"],
+                     ads_count=breakdown["ads_count"],
+                     has_featured_snippet=breakdown["has_featured_snippet"],
+                     has_people_also_ask=breakdown["has_people_also_ask"],
+                     has_knowledge_graph=breakdown["has_knowledge_graph"]
+                 )
+
+                 analyzed_keywords.append(metrics)
+
+             # Sort by opportunity score
+             analyzed_keywords.sort(key=lambda x: x.opportunity_score, reverse=True)
+
+             print(f"Analysis complete! {len(analyzed_keywords)} keywords analyzed")
+             return analyzed_keywords
+
+         def save_to_csv(keyword_metrics, seed_keyword, top_count=50):
+             if not keyword_metrics:
+                 print("No data to save!")
+                 return None
+
+             # Create filename
+             today = date.today()
+             safe_seed = re.sub(r"[^\w\s-]", "", seed_keyword).strip().replace(" ", "_")[:30]
+             filename = f"keywords_{safe_seed}_{today}.csv"
+
+             try:
+                 with open(filename, "w", newline='', encoding='utf-8') as file:
+                     writer = csv.writer(file)
+
+                     # Write header
+                     headers = [
+                         "Keyword", "Monthly Searches", "Competition Score",
+                         "Opportunity Score", "Total Results", "Ads Count",
+                         "Featured Snippet", "People Also Ask", "Knowledge Graph"
+                     ]
+                     writer.writerow(headers)
+
+                     # Write data
+                     for metrics in keyword_metrics[:top_count]:
+                         row = [
+                             metrics.keyword,
+                             metrics.monthly_searches,
+                             metrics.competition_score,
+                             metrics.opportunity_score,
+                             metrics.total_results,
+                             metrics.ads_count,
+                             "Yes" if metrics.has_featured_snippet else "No",
+                             "Yes" if metrics.has_people_also_ask else "No",
+                             "Yes" if metrics.has_knowledge_graph else "No"
+                         ]
+                         writer.writerow(row)
+
+                 saved_count = min(top_count, len(keyword_metrics))
+                 print(f"✅ Saved {saved_count} keywords to {filename}")
+                 return filename
+
+             except Exception as e:
+                 print(f"Error saving CSV: {e}")
+                 return None
+
+         def display_top_results(keyword_metrics, top_count=5):
+             if not keyword_metrics:
+                 print("No results to display!")
+                 return
+
+             print(f"\n🏆 Top {min(top_count, len(keyword_metrics))} Keywords:")
+             print("-" * 80)
+
+             for i, metrics in enumerate(keyword_metrics[:top_count], 1):
+                 print(f"{i}. {metrics.keyword}")
+                 print(f"   Score: {metrics.opportunity_score} | Volume: {metrics.monthly_searches:,} | Competition: {metrics.competition_score}")
+                 print()
+
+         # Run the analysis
+         related_keywords = find_related_keywords(seed_keyword)
+         if not related_keywords:
+             print("❌ No keyword candidates found")
+             return None
+
+         analyzed_keywords = analyze_keywords(related_keywords, use_volume_api)
+         if not analyzed_keywords:
+             print("❌ No keywords analyzed successfully")
+             return None
+
+         filename = save_to_csv(analyzed_keywords, seed_keyword)
+         display_top_results(analyzed_keywords)
+
+         return filename
+
+     except Exception as e:
+         print(f"❌ Error in keyword analysis: {e}")
+         return None
+
+ def run_postprocessing(csv_filename, seed_keyword):
+     """Run post-processing on the CSV file"""
+     print("\n🧹 Step 2: Running post-processing...")
+
+     try:
+         import pandas as pd
+         import re
+         import json
+         from datetime import date, datetime
+
+         # Try to import optional packages
+         try:
+             from tabulate import tabulate
+             HAS_TABULATE = True
+         except ImportError:
+             HAS_TABULATE = False
+
+         try:
+             import openpyxl
+             HAS_EXCEL = True
+         except ImportError:
+             HAS_EXCEL = False
+
+         # Configuration
+         BRAND_KEYWORDS = {
+             "linkedin", "indeed", "glassdoor", "ucla", "asu", "berkeley",
+             "hennge", "ciee", "google", "facebook", "microsoft", "amazon"
+         }
+
+         def is_brand_query(keyword):
+             if not keyword:
+                 return False
+             keyword_lower = keyword.lower()
+             for brand in BRAND_KEYWORDS:
+                 if brand in keyword_lower:
+                     return True
+             if re.search(r"\.(com|edu|org|net|gov|io)\b", keyword_lower):
+                 return True
+             return False
+
+         def classify_intent(keyword):
+             if not keyword:
+                 return "informational"
+
+             k = keyword.lower()
+             if any(signal in k for signal in ["how to", "what is", "why", "guide", "tutorial"]):
+                 return "informational"
+             if any(signal in k for signal in ["buy", "price", "cost", "apply", "register"]):
+                 return "transactional"
+             if any(signal in k for signal in ["best", "top", "compare", "vs", "reviews"]):
+                 return "commercial"
+             if is_brand_query(keyword):
+                 return "navigational"
+             return "informational"
+
+         def classify_tail(keyword):
+             if not keyword:
+                 return "short-tail"
+             word_count = len(str(keyword).split())
+             if word_count >= 4:
+                 return "long-tail"
+             elif word_count == 3:
+                 return "mid-tail"
+             else:
+                 return "short-tail"
+
+         # Load and process the CSV
+         print(f"Loading {csv_filename}...")
+         df = pd.read_csv(csv_filename)
+         print(f"Loaded {len(df)} keywords")
+
+         # Clean and enhance the data
+         print("Processing data...")
+
+         # Standardize column names
+         column_mapping = {
+             'Keyword': 'Keyword',
+             'Monthly Searches': 'Monthly Searches',
+             'Competition Score': 'Competition',
+             'Opportunity Score': 'Opportunity Score',
+             'Total Results': 'Google Results',
+             'Ads Count': 'Ads Shown',
+             'Featured Snippet': 'Featured Snippet?',
+             'People Also Ask': 'PAA Available?',
+             'Knowledge Graph': 'Knowledge Graph?'
+         }
+
+         # Rename columns that exist
+         for old_name, new_name in column_mapping.items():
+             if old_name in df.columns:
+                 df = df.rename(columns={old_name: new_name})
+
+         # Remove duplicates and sort
+         df = df.drop_duplicates(subset=['Keyword'], keep='first')
+         df = df.sort_values('Opportunity Score', ascending=False)
+
+         # Add enhancement columns
+         df['Intent'] = df['Keyword'].apply(classify_intent)
+         df['Tail'] = df['Keyword'].apply(classify_tail)
+         df['Is Brand/Navigational'] = df['Keyword'].apply(lambda x: "Yes" if is_brand_query(x) else "No")
+
+         # Reorder columns
+         column_order = [
+             'Keyword', 'Intent', 'Tail', 'Is Brand/Navigational',
+             'Monthly Searches', 'Competition', 'Opportunity Score',
+             'Google Results', 'Ads Shown', 'Featured Snippet?',
+             'PAA Available?', 'Knowledge Graph?'
+         ]
+
+         available_columns = [col for col in column_order if col in df.columns]
+         df = df[available_columns]
+
+         # Create output directory
+         os.makedirs("results", exist_ok=True)
+
+         # Generate filenames
+         today = date.today().isoformat()
+         safe_seed = re.sub(r"[^\w\s-]", "", seed_keyword).strip().replace(" ", "_")[:30]
+         base_name = f"keywords_{safe_seed}_{today}"
+
+         csv_path = f"results/{base_name}.csv"
+         excel_path = f"results/{base_name}.xlsx"
+         meta_path = f"results/{base_name}.meta.json"
+
+         # Save enhanced CSV
+         df.to_csv(csv_path, index=False)
+         print(f"💾 Saved enhanced CSV: {csv_path}")
+
+         # Save Excel if available
+         if HAS_EXCEL:
+             with pd.ExcelWriter(excel_path, engine="openpyxl") as writer:
+                 df.head(50).to_excel(writer, sheet_name="Top_50", index=False)
+                 df.to_excel(writer, sheet_name="All_Keywords", index=False)
+             print(f"📊 Saved Excel: {excel_path}")
+
+         # Save metadata
+         metadata = {
+             "seed_keyword": seed_keyword,
+             "generated_at": datetime.utcnow().isoformat() + "Z",
+             "total_keywords": len(df),
+             "data_source": "SerpApi with heuristic search volumes",
+             "methodology": "Opportunity Score = log10(volume+1) / (competition + 0.01)"
+         }
+
+         with open(meta_path, "w", encoding="utf-8") as f:
+             json.dump(metadata, f, indent=2)
+
+         print(f"📋 Saved metadata: {meta_path}")
+
+         # Display results
+         print("\n🏆 Top 10 Enhanced Results:")
+
+         preview_df = df.head(10)
+         if HAS_TABULATE:
+             display_columns = ['Keyword', 'Intent', 'Tail', 'Monthly Searches', 'Competition', 'Opportunity Score']
+             display_data = preview_df[display_columns]
+             print(tabulate(display_data, headers="keys", tablefmt="github", showindex=False))
+         else:
+             for i, row in preview_df.iterrows():
+                 print(f"{i+1}. {row['Keyword']} | Score: {row['Opportunity Score']} | Intent: {row['Intent']} | Tail: {row['Tail']}")
+
+         # Summary stats
+         print("\n📈 Summary:")
+         print(f"• Total keywords: {len(df)}")
+         print(f"• Long-tail keywords: {len(df[df['Tail'] == 'long-tail'])}")
+         print(f"• Non-brand keywords: {len(df[df['Is Brand/Navigational'] == 'No'])}")
+         print(f"• High opportunity (score > 50): {len(df[df['Opportunity Score'] > 50])}")
+
+         return csv_path, excel_path, meta_path
+
+     except Exception as e:
+         print(f"❌ Error in post-processing: {e}")
+         return None, None, None
+
+ def run_complete_pipeline(seed_keyword, use_volume_api=False):
+     """Run the complete pipeline"""
+     print("🚀 Starting Complete Keyword Research Pipeline")
+     print("=" * 60)
+     print(f"Seed Keyword: '{seed_keyword}'")
+     print("=" * 60)
+
+     # Step 1: Run keyword analysis
+     csv_filename = run_keyword_analysis(seed_keyword, use_volume_api)
+
+     if not csv_filename:
+         print("❌ Pipeline failed at Step 1")
+         return False
+
+     # Step 2: Run post-processing
+     csv_path, excel_path, meta_path = run_postprocessing(csv_filename, seed_keyword)
+
+     if not csv_path:
+         print("❌ Pipeline failed at Step 2")
+         return False
+
+     # Final summary
+     print("\n🎯 PIPELINE COMPLETE! 🎯")
+     print("=" * 60)
+     print(f"📁 Original CSV: {csv_filename}")
+     print(f"📁 Enhanced CSV: {csv_path}")
+     if excel_path:
+         print(f"📁 Excel file: {excel_path}")
+     if meta_path:
+         print(f"📁 Metadata: {meta_path}")
+     print("=" * 60)
+
+     return True
+
+ def main():
+     """Main function with command line support"""
+     parser = argparse.ArgumentParser(description="Complete Keyword Research Pipeline")
+     parser.add_argument("seed_keyword", nargs="?", default="global internship",
+                         help="Seed keyword (default: 'global internship')")
+     parser.add_argument("--use-volume-api", action="store_true",
+                         help="Use real volume API (requires implementation)")
+     parser.add_argument("--check-only", action="store_true",
+                         help="Only check setup, don't run pipeline")
+
+     args = parser.parse_args()
+
+     # Check setup
+     if not check_setup():
+         return 1
+
+     if args.check_only:
+         print("✅ Setup check complete!")
+         return 0
+
+     # Run pipeline
+     success = run_complete_pipeline(args.seed_keyword, args.use_volume_api)
+     return 0 if success else 1
+
+ if __name__ == "__main__":
+     try:
+         exit_code = main()
+         sys.exit(exit_code)
+     except KeyboardInterrupt:
+         print("\n⚠️ Pipeline interrupted by user")
+         sys.exit(1)
+     except Exception as e:
+         print(f"\n❌ Unexpected error: {e}")
+         sys.exit(1)
dashboard.py ADDED
@@ -0,0 +1,830 @@
1
+ # dashboard.py
2
+ """
3
+ SEO Keyword Research Dashboard
4
+
5
+ A Streamlit web interface for the keyword research pipeline.
6
+ Provides interactive analysis, visualization, and download capabilities.
7
+
8
+ Requirements:
9
+ pip install streamlit plotly pandas
10
+
11
+ Usage:
12
+ streamlit run dashboard.py
13
+ """
14
+
15
+ import streamlit as st
16
+ import pandas as pd
17
+ import plotly.express as px
18
+ import plotly.graph_objects as go
19
+ from plotly.subplots import make_subplots
20
+ import os
21
+ import sys
22
+ from pathlib import Path
23
+ from datetime import date, datetime
24
+ import re
25
+ import json
26
+ import io
27
+ from typing import Optional, Tuple, Dict, Any
28
+
29
+ # Add project directories to path
30
+ project_root = Path(__file__).parent
31
+ src_path = project_root / "src"
32
+ if src_path.exists():
33
+ sys.path.insert(0, str(src_path))
34
+ sys.path.insert(0, str(project_root))
35
+
36
+ # Import backend functions
37
+ try:
38
+ from dotenv import load_dotenv
39
+ load_dotenv()
40
+ except ImportError:
41
+ st.error("Missing required package: python-dotenv. Install with: pip install python-dotenv")
42
+ st.stop()
43
+
44
+ # Page configuration
45
+ st.set_page_config(
46
+ page_title="SEO Keyword Research Dashboard",
47
+ page_icon="πŸ”",
48
+ layout="wide",
49
+ initial_sidebar_state="expanded"
50
+ )
51
+
52
+ # Custom CSS for better styling
53
+ st.markdown("""
54
+ <style>
55
+ .main-header {
56
+ font-size: 3rem;
57
+ color: #1f77b4;
58
+ text-align: center;
59
+ margin-bottom: 2rem;
60
+ background: linear-gradient(90deg, #1f77b4, #ff7f0e);
61
+ -webkit-background-clip: text;
62
+ -webkit-text-fill-color: transparent;
63
+ background-clip: text;
64
+ }
65
+
66
+ .metric-card {
67
+ background-color: #f0f2f6;
68
+ padding: 1rem;
69
+ border-radius: 0.5rem;
70
+ border-left: 4px solid #1f77b4;
71
+ margin: 0.5rem 0;
72
+ }
73
+
74
+ .success-message {
75
+ background-color: #d4edda;
76
+ color: #155724;
77
+ padding: 1rem;
78
+ border-radius: 0.5rem;
79
+ border: 1px solid #c3e6cb;
80
+ margin: 1rem 0;
81
+ }
82
+
83
+ .error-message {
84
+ background-color: #f8d7da;
85
+ color: #721c24;
86
+ padding: 1rem;
87
+ border-radius: 0.5rem;
88
+ border: 1px solid #f5c6cb;
89
+ margin: 1rem 0;
90
+ }
91
+
92
+ .stDataFrame {
93
+ border-radius: 0.5rem;
94
+ overflow: hidden;
95
+ }
96
+ </style>
97
+ """, unsafe_allow_html=True)
98
+
+ class KeywordDashboard:
+     """Main dashboard class for SEO keyword research interface."""
+
+     def __init__(self):
+         """Initialize the dashboard with necessary configurations."""
+         self.setup_directories()
+         self.check_environment()
+
+     def setup_directories(self):
+         """Create necessary output directories."""
+         self.output_dir = Path("output")
+         self.processed_dir = self.output_dir / "processed"
+         self.reports_dir = self.output_dir / "reports"
+
+         self.output_dir.mkdir(exist_ok=True)
+         self.processed_dir.mkdir(exist_ok=True)
+         self.reports_dir.mkdir(exist_ok=True)
+
+     def check_environment(self):
+         """Check if the environment is properly configured."""
+         self.api_key = os.getenv("SERPAPI_KEY")
+         self.environment_ready = bool(self.api_key)
+
+     def render_header(self):
+         """Render the main dashboard header."""
+         st.markdown('<h1 class="main-header">πŸ” SEO Keyword Research Dashboard</h1>',
+                     unsafe_allow_html=True)
+
+         if not self.environment_ready:
+             st.markdown("""
+             <div class="error-message">
+                 ⚠️ <strong>Environment Setup Required</strong><br>
+                 Please ensure your .env file contains: SERPAPI_KEY=your_key_here
+             </div>
+             """, unsafe_allow_html=True)
+             return False
+
+         st.markdown("""
+         <div class="success-message">
+             βœ… <strong>Environment Ready</strong><br>
+             API key detected and ready for keyword research.
+         </div>
+         """, unsafe_allow_html=True)
+         return True
+
+     def render_sidebar(self) -> Dict[str, Any]:
+         """Render the sidebar with input controls."""
+         st.sidebar.markdown("## 🎯 Analysis Parameters")
+
+         # Input parameters
+         seed_keyword = st.sidebar.text_input(
+             "πŸ” Seed Keyword",
+             value="global internship",
+             help="Enter the main keyword to research"
+         )
+
+         max_candidates = st.sidebar.slider(
+             "πŸ“Š Max Candidates",
+             min_value=20,
+             max_value=300,
+             value=120,
+             step=10,
+             help="Maximum number of keyword candidates to analyze"
+         )
+
+         top_results = st.sidebar.slider(
+             "πŸ† Top Results",
+             min_value=10,
+             max_value=100,
+             value=50,
+             step=5,
+             help="Number of top results to display and save"
+         )
+
+         # Advanced options
+         st.sidebar.markdown("## βš™οΈ Advanced Options")
+
+         use_volume_api = st.sidebar.checkbox(
+             "πŸ“ˆ Use Real Volume API",
+             value=False,
+             help="Enable when volume API is implemented",
+             disabled=True  # Disabled until implemented
+         )
+
+         # Filtering options
+         st.sidebar.markdown("## πŸ”§ Filters")
+
+         min_search_volume = st.sidebar.number_input(
+             "πŸ“ˆ Min Search Volume",
+             min_value=0,
+             max_value=10000,
+             value=10,
+             step=10,
+             help="Minimum monthly search volume"
+         )
+
+         max_competition = st.sidebar.slider(
+             "βš”οΈ Max Competition Score",
+             min_value=0.0,
+             max_value=1.0,
+             value=1.0,
+             step=0.1,
+             help="Maximum competition score (0=easy, 1=hard)"
+         )
+
+         # Run button
+         run_analysis = st.sidebar.button(
+             "πŸš€ Run Analysis",
+             type="primary",
+             help="Start the keyword research analysis"
+         )
+
+         return {
+             "seed_keyword": seed_keyword,
+             "max_candidates": max_candidates,
+             "top_results": top_results,
+             "use_volume_api": use_volume_api,
+             "min_search_volume": min_search_volume,
+             "max_competition": max_competition,
+             "run_analysis": run_analysis
+         }
+
+     def run_keyword_analysis(self, params: Dict[str, Any]) -> Optional[pd.DataFrame]:
+         """Run the keyword analysis using the backend pipeline."""
+         try:
+             # Import the analysis function from app.py
+             sys.path.insert(0, str(project_root))
+
+             # Since we need to reuse the logic from app.py, let's import what we need
+             import math
+             import csv
+             import re
+             from serpapi import GoogleSearch
+             from dataclasses import dataclass
+
+             @dataclass
+             class KeywordMetrics:
+                 keyword: str
+                 monthly_searches: int
+                 competition_score: float
+                 opportunity_score: float
+                 total_results: int
+                 ads_count: int
+                 has_featured_snippet: bool
+                 has_people_also_ask: bool
+                 has_knowledge_graph: bool
+
+             # Competition calculator (from your app.py)
+             class CompetitionCalculator:
+                 WEIGHTS = {
+                     'total_results': 0.50,
+                     'ads': 0.25,
+                     'featured_snippet': 0.15,
+                     'people_also_ask': 0.07,
+                     'knowledge_graph': 0.03
+                 }
+
+                 @staticmethod
+                 def extract_total_results(search_info):
+                     if not search_info:
+                         return 0
+
+                     total = (search_info.get("total_results") or
+                              search_info.get("total_results_raw") or
+                              search_info.get("total"))
+
+                     if isinstance(total, int):
+                         return total
+
+                     if isinstance(total, str):
+                         numbers_only = re.sub(r"[^\d]", "", total)
+                         try:
+                             return int(numbers_only) if numbers_only else 0
+                         except ValueError:
+                             return 0
+
+                     return 0
+
+                 def calculate_score(self, search_results):
+                     search_info = search_results.get("search_information", {})
+
+                     total_results = self.extract_total_results(search_info)
+                     normalized_results = min(math.log10(total_results + 1) / 7, 1.0)
+
+                     ads = search_results.get("ads_results", [])
+                     ads_count = len(ads) if ads else 0
+                     ads_score = min(ads_count / 3, 1.0)
+
+                     has_featured_snippet = bool(
+                         search_results.get("featured_snippet") or
+                         search_results.get("answer_box")
+                     )
+
+                     has_people_also_ask = bool(
+                         search_results.get("related_questions") or
+                         search_results.get("people_also_ask")
+                     )
+
+                     has_knowledge_graph = bool(search_results.get("knowledge_graph"))
+
+                     competition_score = (
+                         self.WEIGHTS['total_results'] * normalized_results +
+                         self.WEIGHTS['ads'] * ads_score +
+                         self.WEIGHTS['featured_snippet'] * has_featured_snippet +
+                         self.WEIGHTS['people_also_ask'] * has_people_also_ask +
+                         self.WEIGHTS['knowledge_graph'] * has_knowledge_graph
+                     )
+
+                     competition_score = max(0.0, min(1.0, competition_score))
+
+                     breakdown = {
+                         "total_results": total_results,
+                         "ads_count": ads_count,
+                         "has_featured_snippet": has_featured_snippet,
+                         "has_people_also_ask": has_people_also_ask,
+                         "has_knowledge_graph": has_knowledge_graph
+                     }
+
+                     return competition_score, breakdown
+
+             def find_related_keywords(seed_keyword, max_results=120):
+                 progress_placeholder = st.empty()
+                 progress_placeholder.info(f"πŸ” Finding related keywords for: '{seed_keyword}'...")
+
+                 search_params = {
+                     "engine": "google",
+                     "q": seed_keyword,
+                     "api_key": self.api_key,
+                     "hl": "en",
+                     "gl": "us"
+                 }
+
+                 try:
+                     search = GoogleSearch(search_params)
+                     results = search.get_dict()
+                 except Exception as e:
+                     progress_placeholder.error(f"❌ Error getting related keywords: {e}")
+                     return []
+
+                 keyword_candidates = set()
+
+                 # Extract keywords from different sources
+                 related_searches = results.get("related_searches", [])
+                 for item in related_searches:
+                     query = item.get("query") or item.get("suggestion")
+                     if query and len(query.strip()) > 0:
+                         keyword_candidates.add(query.strip())
+
+                 related_questions = results.get("related_questions", [])
+                 for item in related_questions:
+                     question = item.get("question") or item.get("query")
+                     if question and len(question.strip()) > 0:
+                         keyword_candidates.add(question.strip())
+
+                 organic_results = results.get("organic_results", [])
+                 for result in organic_results[:10]:
+                     title = result.get("title", "")
+                     if title and len(title.strip()) > 0:
+                         keyword_candidates.add(title.strip())
+
+                 final_keywords = list(keyword_candidates)[:max_results]
+                 progress_placeholder.success(f"βœ… Found {len(final_keywords)} keyword candidates")
+                 return final_keywords
+
+             def analyze_keywords_batch(keywords):
+                 calculator = CompetitionCalculator()
+                 analyzed_keywords = []
+
+                 progress_bar = st.progress(0)
+                 status_text = st.empty()
+
+                 for i, keyword in enumerate(keywords):
+                     progress = (i + 1) / len(keywords)
+                     progress_bar.progress(progress)
+                     status_text.text(f"Analyzing keyword {i+1}/{len(keywords)}: {keyword}")
+
+                     # Search for keyword
+                     search_params = {
+                         "engine": "google",
+                         "q": keyword,
+                         "api_key": self.api_key,
+                         "hl": "en",
+                         "gl": "us",
+                         "num": 10
+                     }
+
+                     try:
+                         search = GoogleSearch(search_params)
+                         search_results = search.get_dict()
+                     except Exception:
+                         continue
+
+                     # Calculate competition
+                     competition_score, breakdown = calculator.calculate_score(search_results)
+
+                     # Estimate volume
+                     word_count = len(keyword.split())
+                     search_volume = max(10, 10000 // (word_count + 1))
+
+                     # Calculate opportunity score
+                     volume_score = math.log10(search_volume + 1)
+                     opportunity_score = volume_score / (competition_score + 0.01)
+
+                     metrics = KeywordMetrics(
+                         keyword=keyword,
+                         monthly_searches=search_volume,
+                         competition_score=round(competition_score, 4),
+                         opportunity_score=round(opportunity_score, 2),
+                         total_results=breakdown["total_results"],
+                         ads_count=breakdown["ads_count"],
+                         has_featured_snippet=breakdown["has_featured_snippet"],
+                         has_people_also_ask=breakdown["has_people_also_ask"],
+                         has_knowledge_graph=breakdown["has_knowledge_graph"]
+                     )
+
+                     analyzed_keywords.append(metrics)
+
+                 progress_bar.empty()
+                 status_text.empty()
+
+                 # Sort by opportunity score
+                 analyzed_keywords.sort(key=lambda x: x.opportunity_score, reverse=True)
+                 return analyzed_keywords
+
+             # Run the analysis
+             with st.spinner("πŸ” Discovering related keywords..."):
+                 related_keywords = find_related_keywords(
+                     params["seed_keyword"],
+                     params["max_candidates"]
+                 )
+
+             if not related_keywords:
+                 st.error("❌ No keyword candidates found. Please check your API key and try again.")
+                 return None
+
+             with st.spinner("πŸ“Š Analyzing keywords and calculating scores..."):
+                 analyzed_keywords = analyze_keywords_batch(related_keywords)
+
+             if not analyzed_keywords:
+                 st.error("❌ No keywords were successfully analyzed.")
+                 return None
+
+             # Convert to DataFrame
+             data = []
+             for metrics in analyzed_keywords:
+                 data.append({
+                     'Keyword': metrics.keyword,
+                     'Monthly Searches': metrics.monthly_searches,
+                     'Competition': metrics.competition_score,
+                     'Opportunity Score': metrics.opportunity_score,
+                     'Total Results': metrics.total_results,
+                     'Ads Count': metrics.ads_count,
+                     'Featured Snippet': 'Yes' if metrics.has_featured_snippet else 'No',
+                     'People Also Ask': 'Yes' if metrics.has_people_also_ask else 'No',
+                     'Knowledge Graph': 'Yes' if metrics.has_knowledge_graph else 'No'
+                 })
+
+             df = pd.DataFrame(data)
+
+             # Apply filters
+             df = df[
+                 (df['Monthly Searches'] >= params['min_search_volume']) &
+                 (df['Competition'] <= params['max_competition'])
+             ]
+
+             return df
+
+         except Exception as e:
+             st.error(f"❌ Analysis failed: {str(e)}")
+             return None
+
+     def add_enhancement_columns(self, df: pd.DataFrame) -> pd.DataFrame:
+         """Add intent and tail classification columns."""
+         def classify_intent(keyword):
+             if not keyword:
+                 return "informational"
+
+             k = keyword.lower()
+             if any(signal in k for signal in ["how to", "what is", "why", "guide", "tutorial"]):
+                 return "informational"
+             if any(signal in k for signal in ["buy", "price", "cost", "apply", "register"]):
+                 return "transactional"
+             if any(signal in k for signal in ["best", "top", "compare", "vs", "reviews"]):
+                 return "commercial"
+             return "informational"
+
+         def classify_tail(keyword):
+             if not keyword:
+                 return "short-tail"
+             word_count = len(str(keyword).split())
+             if word_count >= 4:
+                 return "long-tail"
+             elif word_count == 3:
+                 return "mid-tail"
+             else:
+                 return "short-tail"
+
+         df['Intent'] = df['Keyword'].apply(classify_intent)
+         df['Tail'] = df['Keyword'].apply(classify_tail)
+
+         return df
+
+     def render_summary_metrics(self, df: pd.DataFrame):
+         """Render summary metrics cards."""
+         col1, col2, col3, col4 = st.columns(4)
+
+         with col1:
+             st.markdown("""
+             <div class="metric-card">
+                 <h3>πŸ“Š Total Keywords</h3>
+                 <h2 style="color: #1f77b4;">{}</h2>
+             </div>
+             """.format(len(df)), unsafe_allow_html=True)
+
+         with col2:
+             avg_score = df['Opportunity Score'].mean()
+             st.markdown("""
+             <div class="metric-card">
+                 <h3>⭐ Avg Opportunity Score</h3>
+                 <h2 style="color: #ff7f0e;">{:.2f}</h2>
+             </div>
+             """.format(avg_score), unsafe_allow_html=True)
+
+         with col3:
+             high_opportunity = len(df[df['Opportunity Score'] > 50])
+             st.markdown("""
+             <div class="metric-card">
+                 <h3>πŸš€ High Opportunity</h3>
+                 <h2 style="color: #2ca02c;">{}</h2>
+             </div>
+             """.format(high_opportunity), unsafe_allow_html=True)
+
+         with col4:
+             long_tail = len(df[df['Tail'] == 'long-tail'])
+             st.markdown("""
+             <div class="metric-card">
+                 <h3>🎯 Long-tail Keywords</h3>
+                 <h2 style="color: #d62728;">{}</h2>
+             </div>
+             """.format(long_tail), unsafe_allow_html=True)
+
+     def render_top_keywords_table(self, df: pd.DataFrame, top_n: int = 10):
+         """Render the top keywords table with styling."""
+         st.markdown("## πŸ† Top Keyword Opportunities")
+
+         if df.empty:
+             st.warning("No keywords to display.")
+             return
+
+         # Prepare display DataFrame
+         display_df = df.head(top_n).copy()
+
+         # Format columns for better display
+         display_df['Monthly Searches'] = display_df['Monthly Searches'].apply(lambda x: f"{x:,}")
+         display_df['Total Results'] = display_df['Total Results'].apply(lambda x: f"{x:,}")
+
+         # Style the dataframe
+         def highlight_max_score(s):
+             is_max = s == s.max()
+             return ['background-color: lightgreen' if v else '' for v in is_max]
+
+         styled_df = display_df.style.apply(
+             highlight_max_score,
+             subset=['Opportunity Score']
+         ).format({
+             'Competition': '{:.3f}',
+             'Opportunity Score': '{:.2f}'
+         })
+
+         st.dataframe(styled_df, use_container_width=True)
+
+     def render_visualizations(self, df: pd.DataFrame):
+         """Render interactive charts and visualizations."""
+         if df.empty:
+             st.warning("No data available for visualization.")
+             return
+
+         # Chart selection tabs
+         chart_tab1, chart_tab2, chart_tab3 = st.tabs(["πŸ“Š Opportunity Scores", "🎯 Intent Analysis", "πŸ’Ή Volume vs Competition"])
+
+         with chart_tab1:
+             st.markdown("### Top 10 Keywords by Opportunity Score")
+             top_10 = df.head(10)
+
+             fig = px.bar(
+                 top_10,
+                 x='Opportunity Score',
+                 y='Keyword',
+                 orientation='h',
+                 title="Top 10 Keyword Opportunities",
+                 color='Opportunity Score',
+                 color_continuous_scale='viridis'
+             )
+             fig.update_layout(height=500, yaxis={'categoryorder': 'total ascending'})
+             st.plotly_chart(fig, use_container_width=True)
+
+         with chart_tab2:
+             st.markdown("### Intent Distribution")
+             col1, col2 = st.columns(2)
+
+             with col1:
+                 intent_counts = df['Intent'].value_counts()
+                 fig_pie = px.pie(
+                     values=intent_counts.values,
+                     names=intent_counts.index,
+                     title="Search Intent Distribution",
+                     color_discrete_sequence=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']
+                 )
+                 st.plotly_chart(fig_pie, use_container_width=True)
+
+             with col2:
+                 tail_counts = df['Tail'].value_counts()
+                 fig_tail = px.pie(
+                     values=tail_counts.values,
+                     names=tail_counts.index,
+                     title="Keyword Tail Distribution",
+                     color_discrete_sequence=['#9467bd', '#8c564b', '#e377c2']
+                 )
+                 st.plotly_chart(fig_tail, use_container_width=True)
+
+         with chart_tab3:
+             st.markdown("### Search Volume vs Competition Analysis")
+
+             fig_scatter = px.scatter(
+                 df.head(50),  # Limit to top 50 for readability
+                 x='Competition',
+                 y='Monthly Searches',
+                 size='Opportunity Score',
+                 color='Intent',
+                 hover_name='Keyword',
+                 title="Search Volume vs Competition (Size = Opportunity Score)",
+                 labels={'Competition': 'Competition Score', 'Monthly Searches': 'Est. Monthly Searches'}
+             )
+             fig_scatter.update_layout(height=500)
+             st.plotly_chart(fig_scatter, use_container_width=True)
+
+     def save_results(self, df: pd.DataFrame, params: Dict[str, Any]) -> Tuple[str, str, str]:
+         """Save results to files and return file paths."""
+         if df.empty:
+             return None, None, None
+
+         # Generate file names
+         today = date.today().isoformat()
+         safe_seed = re.sub(r"[^\w\s-]", "", params['seed_keyword']).strip().replace(" ", "_")[:30]
+         base_name = f"keywords_{safe_seed}_{today}"
+
+         # File paths
+         csv_path = self.processed_dir / f"{base_name}.csv"
+         excel_path = self.processed_dir / f"{base_name}.xlsx"
+         report_path = self.reports_dir / f"{base_name}_report.json"
+
+         try:
+             # Save CSV
+             df.to_csv(csv_path, index=False)
+
+             # Save Excel with multiple sheets
+             with pd.ExcelWriter(excel_path, engine='openpyxl') as writer:
+                 df.head(params['top_results']).to_excel(writer, sheet_name='Top_Results', index=False)
+                 df.to_excel(writer, sheet_name='All_Keywords', index=False)
+
+                 # Summary sheet
+                 summary_data = {
+                     'Metric': [
+                         'Total Keywords',
+                         'Average Opportunity Score',
+                         'High Opportunity Keywords (>50)',
+                         'Long-tail Keywords',
+                         'Informational Intent',
+                         'Commercial Intent',
+                         'Transactional Intent'
+                     ],
+                     'Value': [
+                         len(df),
+                         round(df['Opportunity Score'].mean(), 2),
+                         len(df[df['Opportunity Score'] > 50]),
+                         len(df[df['Tail'] == 'long-tail']),
+                         len(df[df['Intent'] == 'informational']),
+                         len(df[df['Intent'] == 'commercial']),
+                         len(df[df['Intent'] == 'transactional'])
+                     ]
+                 }
+                 pd.DataFrame(summary_data).to_excel(writer, sheet_name='Summary', index=False)
+
+             # Save JSON report
+             report_data = {
+                 'analysis_date': datetime.now().isoformat(),
+                 'seed_keyword': params['seed_keyword'],
+                 'parameters': {
+                     'max_candidates': params['max_candidates'],
+                     'top_results': params['top_results'],
+                     'min_search_volume': params['min_search_volume'],
+                     'max_competition': params['max_competition']
+                 },
+                 'summary': {
+                     'total_keywords': len(df),
+                     'average_opportunity_score': float(df['Opportunity Score'].mean()),
+                     'top_keyword': df.iloc[0]['Keyword'] if not df.empty else None,
+                     'intent_distribution': df['Intent'].value_counts().to_dict(),
+                     'tail_distribution': df['Tail'].value_counts().to_dict()
+                 }
+             }
+
+             with open(report_path, 'w', encoding='utf-8') as f:
+                 json.dump(report_data, f, indent=2, ensure_ascii=False)
+
+             return str(csv_path), str(excel_path), str(report_path)
+
+         except Exception as e:
+             st.error(f"❌ Error saving files: {e}")
+             return None, None, None
+
+     def render_download_section(self, csv_path: str, excel_path: str, report_path: str):
+         """Render download buttons for generated files."""
+         st.markdown("## πŸ“₯ Download Results")
+
+         col1, col2, col3 = st.columns(3)
+
+         if csv_path and os.path.exists(csv_path):
+             with col1:
+                 with open(csv_path, 'rb') as file:
+                     st.download_button(
+                         label="πŸ“Š Download CSV",
+                         data=file.read(),
+                         file_name=os.path.basename(csv_path),
+                         mime="text/csv"
+                     )
+
+         if excel_path and os.path.exists(excel_path):
+             with col2:
+                 with open(excel_path, 'rb') as file:
+                     st.download_button(
+                         label="πŸ“ˆ Download Excel",
+                         data=file.read(),
+                         file_name=os.path.basename(excel_path),
+                         mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
+                     )
+
+         if report_path and os.path.exists(report_path):
+             with col3:
+                 with open(report_path, 'rb') as file:
+                     st.download_button(
+                         label="πŸ“‹ Download Report",
+                         data=file.read(),
+                         file_name=os.path.basename(report_path),
+                         mime="application/json"
+                     )
+
+     def run(self):
+         """Main dashboard execution method."""
+         # Render header
+         if not self.render_header():
+             st.stop()
+
+         # Render sidebar
+         params = self.render_sidebar()
+
+         # Main content area
+         if params["run_analysis"]:
+             # Store analysis state
+             if 'analysis_complete' not in st.session_state:
+                 st.session_state.analysis_complete = False
+
+             # Run analysis
+             df = self.run_keyword_analysis(params)
+
+             if df is not None and not df.empty:
+                 # Add enhancement columns
+                 df = self.add_enhancement_columns(df)
+
+                 # Store results in session state
+                 st.session_state.results_df = df
+                 st.session_state.analysis_params = params
+                 st.session_state.analysis_complete = True
+
+                 # Success message
+                 st.success(f"βœ… Analysis complete! Found {len(df)} keywords matching your criteria.")
+
+         # Display results if analysis is complete
+         if st.session_state.get('analysis_complete', False) and 'results_df' in st.session_state:
+             df = st.session_state.results_df
+             params = st.session_state.analysis_params
+
+             # Render summary metrics
+             self.render_summary_metrics(df)
+
+             # Create view toggle
+             view_option = st.radio("πŸ“‹ Choose View", ["Table View", "Chart View"], horizontal=True)
+
+             if view_option == "Table View":
+                 self.render_top_keywords_table(df, params['top_results'])
+             else:
+                 self.render_visualizations(df)
+
+             # Save results and provide downloads
+             with st.spinner("πŸ’Ύ Preparing download files..."):
+                 csv_path, excel_path, report_path = self.save_results(df, params)
+
+             if csv_path:
+                 self.render_download_section(csv_path, excel_path, report_path)
+
+         elif not st.session_state.get('analysis_complete', False):
+             # Show welcome message
+             st.markdown("""
+             ## πŸ‘‹ Welcome to the SEO Keyword Research Dashboard
+
+             This dashboard helps you discover and analyze keyword opportunities using advanced SEO metrics.
+
+             ### πŸš€ Getting Started:
+             1. **Enter your seed keyword** in the sidebar (e.g., "digital marketing")
+             2. **Adjust analysis parameters** (candidates, results, filters)
+             3. **Click "Run Analysis"** to start the keyword research
+             4. **Explore results** through tables and interactive charts
+             5. **Download reports** in CSV, Excel, or JSON format
+
+             ### πŸ“Š Features:
+             - **Real-time keyword discovery** using SerpAPI
+             - **Competition analysis** based on SERP features
+             - **Intent classification** (informational, commercial, transactional)
+             - **Interactive visualizations** with Plotly charts
+             - **Advanced filtering** by volume and competition
+             - **Multi-format exports** (CSV, Excel, JSON reports)
+             """)
+
+
+ def main():
+     """Main function to run the Streamlit dashboard."""
+     dashboard = KeywordDashboard()
+     dashboard.run()
+
+
+ if __name__ == "__main__":
+     main()
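The scoring heuristic used throughout dashboard.py can be exercised standalone. This is a minimal sketch, not part of the uploaded files: the weights, the `log10(total_results + 1) / 7` normalization, and the `log10(volume + 1) / (competition + 0.01)` opportunity formula are taken from the diff above, while the sample SERP values are made up for illustration.

```python
import math

# Weights copied from CompetitionCalculator.WEIGHTS in dashboard.py
WEIGHTS = {'total_results': 0.50, 'ads': 0.25, 'featured_snippet': 0.15,
           'people_also_ask': 0.07, 'knowledge_graph': 0.03}

def competition_score(total_results, ads_count, snippet, paa, kg):
    # Log-normalized result count plus weighted SERP features, clamped to [0, 1]
    normalized = min(math.log10(total_results + 1) / 7, 1.0)
    score = (WEIGHTS['total_results'] * normalized +
             WEIGHTS['ads'] * min(ads_count / 3, 1.0) +
             WEIGHTS['featured_snippet'] * snippet +
             WEIGHTS['people_also_ask'] * paa +
             WEIGHTS['knowledge_graph'] * kg)
    return max(0.0, min(1.0, score))

def opportunity_score(volume, competition):
    # Opportunity Score = log10(volume + 1) / (competition + 0.01)
    return math.log10(volume + 1) / (competition + 0.01)

if __name__ == "__main__":
    # Hypothetical SERP: 1.5M results, 2 ads, snippet and PAA present
    comp = competition_score(1_500_000, ads_count=2, snippet=True, paa=True, kg=False)
    print(round(comp, 4), round(opportunity_score(3333, comp), 2))
```

A useful property of this formula: because volume enters logarithmically but competition divides linearly, a low-competition long-tail keyword can outscore a high-volume head term.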
git ADDED
File without changes
keyword_agent.py ADDED
@@ -0,0 +1,19 @@
+ import os
+ from dotenv import load_dotenv
+
+ # Load environment variables from .env file
+ load_dotenv()
+
+ def main():
+     # Get the API key from environment variables
+     api_key = os.getenv("SERPAPI_KEY")
+
+     if api_key:
+         print("βœ… Project setup complete!")
+         print(f"API key loaded: {api_key[:5]}...")
+     else:
+         print("❌ Warning: SERPAPI_KEY not found in environment variables")
+         print("Make sure you have a .env file with your SERPAPI_KEY")
+
+ if __name__ == "__main__":
+     main()
postprocess.py ADDED
@@ -0,0 +1,366 @@
+ # src/postprocess.py
+ """
+ Post-processing tool for keyword research results
+ Cleans, annotates, and formats CSV output for professional presentation
+ """
+
+ import pandas as pd
+ from datetime import date, datetime
+ import os
+ import re
+ import json
+
+ # Install these if you haven't: pip install pandas openpyxl tabulate
+ try:
+     from tabulate import tabulate
+     TABULATE_AVAILABLE = True
+ except ImportError:
+     TABULATE_AVAILABLE = False
+     print("Note: Install 'tabulate' for prettier table output: pip install tabulate")
+
+ try:
+     import openpyxl
+     EXCEL_AVAILABLE = True
+ except ImportError:
+     EXCEL_AVAILABLE = False
+     print("Note: Install 'openpyxl' for Excel export: pip install openpyxl")
+
+ # Configuration
+ BRAND_KEYWORDS = {
+     "linkedin", "indeed", "glassdoor", "ucla", "asu", "berkeley",
+     "hennge", "ciee", "google", "facebook", "microsoft", "amazon",
+     "apple", "netflix", "spotify", "youtube", "instagram", "twitter"
+ }
+ OUTPUT_DIR = "results"  # Directory to save processed files
+
+ def normalize_keyword(keyword):
+     """Clean and normalize keyword text"""
+     if not keyword or pd.isna(keyword):
+         return ""
+     return str(keyword).strip()
+
+ def is_brand_query(keyword, brand_set=BRAND_KEYWORDS):
+     """
+     Check if keyword is a brand/navigational query
+     These are harder to rank for if you're not that brand
+     """
+     if not keyword:
+         return False
+
+     keyword_lower = keyword.lower()
+
+     # Check if any brand name appears in keyword
+     for brand in brand_set:
+         if brand in keyword_lower:
+             return True
+
+     # Check for domains (.com, .edu, etc.)
+     if re.search(r"\.(com|edu|org|net|gov|io)\b", keyword_lower):
+         return True
+
+     return False
+
+ def classify_search_intent(keyword):
+     """
+     Classify keyword by search intent:
+     - informational: seeking information
+     - commercial: researching before buying
+     - transactional: ready to take action
+     - navigational: looking for specific site/brand
+     """
+     if not keyword:
+         return "informational"
+
+     keyword_lower = keyword.lower()
+
+     # Informational intent signals
+     if any(signal in keyword_lower for signal in [
+         "how to", "what is", "why", "are", "do ", "does ", "can ",
+         "guide", "tutorial", "learn", "definition", "meaning"
+     ]):
+         return "informational"
+
+     # Transactional intent signals
+     if any(signal in keyword_lower for signal in [
+         "buy", "price", "cost", "apply", "register", "admission",
+         "apply now", "enroll", "join", "signup", "book", "order"
+     ]):
+         return "transactional"
+
+     # Commercial intent signals
+     if any(signal in keyword_lower for signal in [
+         "best", "top", "compare", "vs", "reviews", "review",
+         "cheap", "affordable", "discount", "deal"
+     ]):
+         return "commercial"
+
+     # Navigational intent (brand queries)
+     if is_brand_query(keyword):
+         return "navigational"
+
+     # Default to informational
+     return "informational"
+
+ def classify_keyword_tail(keyword):
+     """
+     Classify keyword by tail length:
+     - short-tail: 1-2 words (high competition, high volume)
+     - mid-tail: 3 words (moderate competition/volume)
+     - long-tail: 4+ words (low competition, low volume)
+     """
+     if not keyword:
+         return "short-tail"
+
+     word_count = len(str(keyword).split())
+
+     if word_count >= 4:
+         return "long-tail"
+     elif word_count == 3:
+         return "mid-tail"
+     else:
+         return "short-tail"
+
+ def format_large_number(number):
+     """Format large numbers with commas for readability"""
+     try:
+         return f"{int(number):,}"
+     except (ValueError, TypeError):
+         return str(number)
+
+ def clean_and_process_dataframe(df, seed_keyword):
+     """Main processing function to clean and enhance the dataframe"""
+
+     # Make a copy to avoid modifying original
+     df = df.copy()
+
+     print("🧹 Cleaning and processing data...")
+
+     # 1. Normalize keywords and remove duplicates
+     df["Keyword"] = df["Keyword"].astype(str).apply(normalize_keyword)
+
+     # Remove empty keywords
+     df = df[df["Keyword"].str.len() > 0]
+
+     # Sort by Opportunity Score and remove duplicates (keep highest score)
+     df = df.sort_values(by="Opportunity Score", ascending=False)
+     df = df.drop_duplicates(subset=["Keyword"], keep="first")
+
+     # 2. Fix data types and handle missing values
+
+     # Monthly Searches: convert to int, fill missing with 0
+     df["Monthly Searches"] = pd.to_numeric(df["Monthly Searches"], errors="coerce").fillna(0).astype(int)
+
+     # Competition: round to 4 decimal places
+     df["Competition"] = pd.to_numeric(df["Competition"], errors="coerce").fillna(0.0).round(4)
+
+     # Opportunity Score: round to 2 decimal places for readability
+     df["Opportunity Score"] = pd.to_numeric(df["Opportunity Score"], errors="coerce").fillna(0.0).round(2)
+
+     # Google Results: clean and convert to int
+     if "Google Results" in df.columns:
+         # Remove any non-digit characters and convert to int
+         df["Google Results"] = df["Google Results"].astype(str).str.replace(r"[^\d]", "", regex=True)
+         df["Google Results"] = pd.to_numeric(df["Google Results"], errors="coerce").fillna(0).astype(int)
+
+     # Ads Shown: convert to int
+     if "Ads Shown" in df.columns:
+         df["Ads Shown"] = pd.to_numeric(df["Ads Shown"], errors="coerce").fillna(0).astype(int)
+
+     # 3. Add enhancement columns
+     print("πŸ“Š Adding analysis columns...")
+
+     df["Intent"] = df["Keyword"].apply(classify_search_intent)
+     df["Tail"] = df["Keyword"].apply(classify_keyword_tail)
+     df["Is Brand/Navigational"] = df["Keyword"].apply(lambda x: "Yes" if is_brand_query(x) else "No")
+
+     # 4. Reorder columns for better presentation
+     column_order = [
+         "Keyword",
+         "Intent",
+         "Tail",
+         "Is Brand/Navigational",
+         "Monthly Searches",
+         "Competition",
+         "Opportunity Score",
+         "Google Results",
+         "Ads Shown",
+         "Featured Snippet?",
+         "PAA Available?",
+         "Knowledge Graph?"
+     ]
+
+     # Only include columns that exist in the dataframe
+     available_columns = [col for col in column_order if col in df.columns]
+     df = df[available_columns]
+
+     # 5. Final sort by Opportunity Score
+     df = df.sort_values(by="Opportunity Score", ascending=False).reset_index(drop=True)
+
+     print(f"βœ… Processing complete! {len(df)} keywords ready")
+     return df
+
+ def save_processed_results(df, seed_keyword, output_dir=OUTPUT_DIR):
+     """Save processed results in multiple formats with metadata"""
+
+     # Create output directory
+     os.makedirs(output_dir, exist_ok=True)
+
+     # Generate safe filename from seed keyword
+     today = date.today().isoformat()
+     safe_seed = re.sub(r"[^\w\s-]", "", seed_keyword).strip().replace(" ", "_")[:50]
+     base_filename = f"keywords_{safe_seed}_{today}"
+
+     # File paths
+     csv_path = os.path.join(output_dir, f"{base_filename}.csv")
+     excel_path = os.path.join(output_dir, f"{base_filename}.xlsx")
+     meta_path = os.path.join(output_dir, f"{base_filename}.meta.json")
+
+     # Save CSV
+     df.to_csv(csv_path, index=False)
+     print(f"πŸ’Ύ Saved CSV: {csv_path}")
+
+     # Save Excel with multiple sheets (if openpyxl is available)
+     if EXCEL_AVAILABLE:
+         try:
+             with pd.ExcelWriter(excel_path, engine="openpyxl") as writer:
+                 # Top 50 sheet
+                 df.head(50).to_excel(writer, sheet_name="Top_50", index=False)
+                 # All results sheet
+                 df.to_excel(writer, sheet_name="All_Keywords", index=False)
+                 # Summary sheet
+                 summary_data = {
+                     "Metric": [
+                         "Total Keywords",
+                         "Informational Keywords",
+                         "Commercial Keywords",
+                         "Transactional Keywords",
+                         "Navigational Keywords",
+                         "Long-tail Keywords",
+                         "Brand/Navigational Keywords"
+                     ],
+                     "Count": [
+                         len(df),
+                         len(df[df["Intent"] == "informational"]),
+                         len(df[df["Intent"] == "commercial"]),
+                         len(df[df["Intent"] == "transactional"]),
+                         len(df[df["Intent"] == "navigational"]),
+                         len(df[df["Tail"] == "long-tail"]),
+                         len(df[df["Is Brand/Navigational"] == "Yes"])
+                     ]
+                 }
+                 pd.DataFrame(summary_data).to_excel(writer, sheet_name="Summary", index=False)
+
+             print(f"πŸ“Š Saved Excel: {excel_path}")
+         except Exception as e:
+             print(f"⚠️ Could not save Excel file: {e}")
+     else:
+         print("πŸ“Š Excel export skipped (install openpyxl to enable)")
+
+     # Save metadata
+     metadata = {
+         "seed_keyword": seed_keyword,
+         "generated_at": datetime.utcnow().isoformat() + "Z",
+         "total_keywords": len(df),
+         "data_source": "SerpApi with heuristic search volumes",
+         "methodology": "Opportunity Score = log10(volume+1) / (competition + 0.01)",
+         "notes": [
+             "Brand/navigational queries are flagged for filtering",
+             "Search volumes are estimated - replace with real API data for production",
+             "Competition scores based on SERP feature analysis"
+         ],
+         "intent_breakdown": {
+             "informational": int(len(df[df["Intent"] == "informational"])),
+             "commercial": int(len(df[df["Intent"] == "commercial"])),
+             "transactional": int(len(df[df["Intent"] == "transactional"])),
+             "navigational": int(len(df[df["Intent"] == "navigational"]))
+         },
+         "tail_breakdown": {
+             "short-tail": int(len(df[df["Tail"] == "short-tail"])),
+             "mid-tail": int(len(df[df["Tail"] == "mid-tail"])),
+             "long-tail": int(len(df[df["Tail"] == "long-tail"]))
+         }
+     }
+
+     with open(meta_path, "w", encoding="utf-8") as f:
+         json.dump(metadata, f, indent=2, ensure_ascii=False)
+
+ print(f"πŸ“‹ Saved metadata: {meta_path}")
288
+
289
+ return csv_path, excel_path, meta_path
290
+
291
+ def display_results_preview(df, top_n=10):
292
+ """Display a nice preview of the top results"""
293
+
294
+ if df.empty:
295
+ print("❌ No results to display!")
296
+ return
297
+
298
+ print(f"\nπŸ† Top {min(top_n, len(df))} Keywords:")
299
+
300
+ # Prepare data for display
301
+ preview_df = df.head(top_n).copy()
302
+
303
+ # Format large numbers for readability
304
+ if "Monthly Searches" in preview_df.columns:
305
+ preview_df["Monthly Searches"] = preview_df["Monthly Searches"].apply(format_large_number)
306
+
307
+ if "Google Results" in preview_df.columns:
308
+ preview_df["Google Results"] = preview_df["Google Results"].apply(format_large_number)
309
+
310
+ # Display using tabulate if available
311
+ if TABULATE_AVAILABLE:
312
+ print(tabulate(preview_df, headers="keys", tablefmt="github", showindex=False))
313
+ else:
314
+ # Fallback display
315
+ for i, row in preview_df.iterrows():
316
+ print(f"{i+1}. {row['Keyword']} | Score: {row['Opportunity Score']} | "
317
+ f"Volume: {row['Monthly Searches']} | Competition: {row['Competition']} | "
318
+ f"Intent: {row['Intent']} | Tail: {row['Tail']}")
319
+
320
+ def postprocess_keywords(csv_file_path, seed_keyword):
321
+ """
322
+ Main postprocessing function
323
+ Call this after your ranking.py generates the initial CSV
324
+ """
325
+
326
+ print(f"πŸš€ Starting postprocessing for: '{seed_keyword}'")
327
+ print(f"πŸ“ Input file: {csv_file_path}")
328
+
329
+ try:
330
+ # Load the CSV from ranking.py
331
+ df = pd.read_csv(csv_file_path)
332
+ print(f"πŸ“Š Loaded {len(df)} keywords from CSV")
333
+
334
+ # Clean and process the data
335
+ processed_df = clean_and_process_dataframe(df, seed_keyword)
336
+
337
+ # Save in multiple formats
338
+ csv_path, excel_path, meta_path = save_processed_results(processed_df, seed_keyword)
339
+
340
+ # Display preview
341
+ display_results_preview(processed_df, top_n=10)
342
+
343
+ # Summary stats
344
+ print(f"\nπŸ“ˆ Summary Statistics:")
345
+ print(f"β€’ Total keywords analyzed: {len(processed_df)}")
346
+ print(f"β€’ Long-tail opportunities: {len(processed_df[processed_df['Tail'] == 'long-tail'])}")
347
+ print(f"β€’ Non-brand keywords: {len(processed_df[processed_df['Is Brand/Navigational'] == 'No'])}")
348
+ print(f"β€’ High opportunity (score > 50): {len(processed_df[processed_df['Opportunity Score'] > 50])}")
349
+
350
+ return csv_path, excel_path, meta_path, processed_df
351
+
352
+ except Exception as e:
353
+ print(f"❌ Error during postprocessing: {e}")
354
+ raise
355
+
356
+ # Example usage
357
+ if __name__ == "__main__":
358
+ # Example: process a CSV file generated by ranking.py
359
+ input_csv = "best_keywords_2025-09-23.csv" # Replace with your actual file
360
+ seed_keyword = "global internship"
361
+
362
+ if os.path.exists(input_csv):
363
+ postprocess_keywords(input_csv, seed_keyword)
364
+ else:
365
+ print(f"❌ Input file not found: {input_csv}")
366
+ print("Run your ranking.py script first to generate the initial CSV")
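For reference, the filename sanitization used in `save_processed_results` can be exercised on its own. This is a hedged standalone sketch — the regex and slicing match the code above, but the helper name `make_base_filename` is ours:

```python
import re
from datetime import date

def make_base_filename(seed_keyword: str) -> str:
    # Drop punctuation (keep word chars, spaces, hyphens), replace spaces
    # with underscores, cap at 50 chars, then append today's ISO date.
    safe = re.sub(r"[^\w\s-]", "", seed_keyword).strip().replace(" ", "_")[:50]
    return f"keywords_{safe}_{date.today().isoformat()}"

print(make_base_filename("global internship: 2025 (remote)"))
```

The cap at 50 characters keeps very long seed phrases from producing unwieldy (or invalid) file paths.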
ranking.py ADDED
@@ -0,0 +1,569 @@
+"""
+Professional Keyword Research Tool
+
+A comprehensive tool for analyzing keyword opportunities using SerpApi.
+Calculates competition scores and opportunity rankings based on SERP analysis.
+
+Requirements:
+    pip install google-search-results tabulate python-dotenv
+
+Setup:
+    1. Create a .env file with your SerpApi key: SERPAPI_KEY=your_key_here
+    2. Run the script with your desired seed keyword
+"""
+
+import os
+import math
+import csv
+import re
+import logging
+from datetime import date
+from typing import List, Dict, Optional, Tuple, Any
+from dataclasses import dataclass
+from dotenv import load_dotenv
+from serpapi import GoogleSearch
+
+# Optional dependency for better table formatting
+try:
+    from tabulate import tabulate
+    HAS_TABULATE = True
+except ImportError:
+    HAS_TABULATE = False
+    print("πŸ’‘ Tip: Install 'tabulate' for prettier output: pip install tabulate")
+
+# Configure logging
+logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class KeywordMetrics:
+    """Container for keyword analysis results."""
+    keyword: str
+    monthly_searches: int
+    competition_score: float
+    opportunity_score: float
+    total_results: int
+    ads_count: int
+    has_featured_snippet: bool
+    has_people_also_ask: bool
+    has_knowledge_graph: bool
+
+
+class Config:
+    """Configuration settings for the keyword research tool."""
+
+    def __init__(self):
+        load_dotenv()
+        self.serpapi_key = os.getenv("SERPAPI_KEY")
+        self.default_location = "United States"
+        self.results_per_query = 10
+        self.max_related_keywords = 150
+        self.top_keywords_to_save = 50
+        self.progress_update_interval = 10
+
+        if not self.serpapi_key:
+            raise ValueError("SERPAPI_KEY not found in environment variables")
+
+
+class CompetitionCalculator:
+    """Calculates keyword competition scores based on SERP features."""
+
+    # Scoring weights for different competition factors
+    WEIGHTS = {
+        'total_results': 0.50,
+        'ads': 0.25,
+        'featured_snippet': 0.15,
+        'people_also_ask': 0.07,
+        'knowledge_graph': 0.03
+    }
+
+    @staticmethod
+    def extract_total_results(search_info: Dict[str, Any]) -> int:
+        """
+        Extract total results count from SerpApi response.
+
+        Args:
+            search_info: Search information dictionary from SerpApi
+
+        Returns:
+            Total number of results as integer, 0 if not found
+        """
+        if not search_info:
+            return 0
+
+        # Try different possible field names
+        total = (search_info.get("total_results") or
+                 search_info.get("total_results_raw") or
+                 search_info.get("total"))
+
+        if isinstance(total, int):
+            return total
+
+        if isinstance(total, str):
+            # Extract only digits (remove commas, spaces, etc.)
+            numbers_only = re.sub(r"[^\d]", "", total)
+            try:
+                return int(numbers_only) if numbers_only else 0
+            except ValueError:
+                return 0
+
+        return 0
+
+    def calculate_score(self, search_results: Dict[str, Any]) -> Tuple[float, Dict[str, Any]]:
+        """
+        Calculate competition score based on SERP features.
+
+        Args:
+            search_results: Complete search results from SerpApi
+
+        Returns:
+            Tuple of (competition_score, analysis_breakdown)
+            Score ranges from 0-1 where 1 = very competitive
+        """
+        search_info = search_results.get("search_information", {})
+
+        # Factor 1: Total number of results (normalized using log scale)
+        total_results = self.extract_total_results(search_info)
+        normalized_results = min(math.log10(total_results + 1) / 7, 1.0)
+
+        # Factor 2: Number of ads (more ads = more competition)
+        ads = search_results.get("ads_results", [])
+        ads_count = len(ads) if ads else 0
+        ads_score = min(ads_count / 3, 1.0)
+
+        # Factor 3: SERP features that make ranking more difficult
+        has_featured_snippet = bool(
+            search_results.get("featured_snippet") or
+            search_results.get("answer_box")
+        )
+
+        has_people_also_ask = bool(
+            search_results.get("related_questions") or
+            search_results.get("people_also_ask")
+        )
+
+        has_knowledge_graph = bool(search_results.get("knowledge_graph"))
+
+        # Calculate weighted competition score
+        competition_score = (
+            self.WEIGHTS['total_results'] * normalized_results +
+            self.WEIGHTS['ads'] * ads_score +
+            self.WEIGHTS['featured_snippet'] * has_featured_snippet +
+            self.WEIGHTS['people_also_ask'] * has_people_also_ask +
+            self.WEIGHTS['knowledge_graph'] * has_knowledge_graph
+        )
+
+        # Ensure score stays within bounds
+        competition_score = max(0.0, min(1.0, competition_score))
+
+        # Create analysis breakdown for reporting
+        breakdown = {
+            "total_results": total_results,
+            "ads_count": ads_count,
+            "has_featured_snippet": has_featured_snippet,
+            "has_people_also_ask": has_people_also_ask,
+            "has_knowledge_graph": has_knowledge_graph
+        }
+
+        return competition_score, breakdown
+
+
+class SearchVolumeEstimator:
+    """Handles search volume estimation and integration with volume APIs."""
+
+    def get_search_volume(self, keyword: str) -> Optional[int]:
+        """
+        Get search volume for a keyword.
+
+        TODO: Integrate with DataForSEO, Google Keyword Planner, or similar API
+
+        Args:
+            keyword: The keyword to get volume for
+
+        Returns:
+            Monthly search volume or None if unavailable
+        """
+        # Placeholder for real volume API integration
+        # Examples of what you might implement:
+        # - return self._call_dataforseo_api(keyword)
+        # - return self._call_google_ads_api(keyword)
+        return None
+
+    def estimate_volume(self, keyword: str) -> int:
+        """
+        Estimate search volume using simple heuristics.
+
+        Args:
+            keyword: The keyword to estimate volume for
+
+        Returns:
+            Estimated monthly search volume
+        """
+        # Simple heuristic: longer phrases typically have lower volume
+        word_count = len(keyword.split())
+        # This is rough estimation - replace with real data when possible
+        return max(10, 10000 // (word_count + 1))
+
+
+class KeywordDiscovery:
+    """Discovers related keywords from search results."""
+
+    def __init__(self, config: Config):
+        self.config = config
+
+    def find_related_keywords(self, seed_keyword: str) -> List[str]:
+        """
+        Find related keywords from Google's suggestions and related searches.
+
+        Args:
+            seed_keyword: The base keyword to find related terms for
+
+        Returns:
+            List of related keyword candidates
+        """
+        logger.info(f"Discovering related keywords for: '{seed_keyword}'")
+
+        search_params = {
+            "engine": "google",
+            "q": seed_keyword,
+            "api_key": self.config.serpapi_key,
+            "hl": "en",
+            "gl": "us"
+        }
+
+        try:
+            search = GoogleSearch(search_params)
+            results = search.get_dict()
+        except Exception as e:
+            logger.error(f"Failed to get related keywords: {e}")
+            return []
+
+        keyword_candidates = set()
+
+        # Extract keywords from different sources
+        self._extract_from_related_searches(results, keyword_candidates)
+        self._extract_from_people_also_ask(results, keyword_candidates)
+        self._extract_from_organic_titles(results, keyword_candidates)
+
+        # Convert to list and limit results
+        final_keywords = list(keyword_candidates)[:self.config.max_related_keywords]
+        logger.info(f"Found {len(final_keywords)} keyword candidates")
+
+        return final_keywords
+
+    def _extract_from_related_searches(self, results: Dict[str, Any],
+                                       candidates: set) -> None:
+        """Extract keywords from 'related searches' section."""
+        related_searches = results.get("related_searches", [])
+        for item in related_searches:
+            query = item.get("query") or item.get("suggestion")
+            if query and len(query.strip()) > 0:
+                candidates.add(query.strip())
+
+    def _extract_from_people_also_ask(self, results: Dict[str, Any],
+                                      candidates: set) -> None:
+        """Extract keywords from 'People also ask' questions."""
+        related_questions = results.get("related_questions", [])
+        for item in related_questions:
+            question = item.get("question") or item.get("query")
+            if question and len(question.strip()) > 0:
+                candidates.add(question.strip())
+
+    def _extract_from_organic_titles(self, results: Dict[str, Any],
+                                     candidates: set) -> None:
+        """Extract potential keywords from organic result titles."""
+        organic_results = results.get("organic_results", [])
+        for result in organic_results[:10]:  # Only top 10 results
+            title = result.get("title", "")
+            if title and len(title.strip()) > 0:
+                candidates.add(title.strip())
+
+
+class KeywordAnalyzer:
+    """Main class for analyzing keywords and calculating opportunity scores."""
+
+    def __init__(self, config: Config):
+        self.config = config
+        self.competition_calc = CompetitionCalculator()
+        self.volume_estimator = SearchVolumeEstimator()
+        self.keyword_discovery = KeywordDiscovery(config)
+
+    def search_google(self, keyword: str) -> Dict[str, Any]:
+        """
+        Fetch search results for a keyword using SerpApi.
+
+        Args:
+            keyword: The keyword to search for
+
+        Returns:
+            Search results dictionary from SerpApi
+        """
+        search_params = {
+            "engine": "google",
+            "q": keyword,
+            "api_key": self.config.serpapi_key,
+            "hl": "en",
+            "gl": "us",
+            "num": self.config.results_per_query
+        }
+
+        try:
+            search = GoogleSearch(search_params)
+            return search.get_dict()
+        except Exception as e:
+            logger.error(f"Search failed for '{keyword}': {e}")
+            return {}
+
+    def analyze_keyword(self, keyword: str, use_volume_api: bool = False) -> Optional[KeywordMetrics]:
+        """
+        Analyze a single keyword and calculate its opportunity score.
+
+        Args:
+            keyword: The keyword to analyze
+            use_volume_api: Whether to use real volume API (not implemented yet)
+
+        Returns:
+            KeywordMetrics object or None if analysis failed
+        """
+        # Get search results
+        search_results = self.search_google(keyword)
+        if not search_results:
+            return None
+
+        # Calculate competition score
+        competition_score, breakdown = self.competition_calc.calculate_score(search_results)
+
+        # Get or estimate search volume
+        if use_volume_api:
+            search_volume = self.volume_estimator.get_search_volume(keyword)
+        else:
+            search_volume = None
+
+        if search_volume is None:
+            search_volume = self.volume_estimator.estimate_volume(keyword)
+
+        # Calculate opportunity score
+        # Higher volume = better, lower competition = better
+        volume_score = math.log10(search_volume + 1)
+        opportunity_score = volume_score / (competition_score + 0.01)  # Avoid division by zero
+
+        return KeywordMetrics(
+            keyword=keyword,
+            monthly_searches=search_volume,
+            competition_score=round(competition_score, 4),
+            opportunity_score=round(opportunity_score, 2),
+            total_results=breakdown["total_results"],
+            ads_count=breakdown["ads_count"],
+            has_featured_snippet=breakdown["has_featured_snippet"],
+            has_people_also_ask=breakdown["has_people_also_ask"],
+            has_knowledge_graph=breakdown["has_knowledge_graph"]
+        )
+
+    def analyze_keywords_batch(self, keywords: List[str],
+                               use_volume_api: bool = False) -> List[KeywordMetrics]:
+        """
+        Analyze multiple keywords and return sorted results.
+
+        Args:
+            keywords: List of keywords to analyze
+            use_volume_api: Whether to use real volume API
+
+        Returns:
+            List of KeywordMetrics sorted by opportunity score (highest first)
+        """
+        logger.info(f"Analyzing {len(keywords)} keywords...")
+        analyzed_keywords = []
+
+        for i, keyword in enumerate(keywords, 1):
+            if i % self.config.progress_update_interval == 0:
+                logger.info(f"Progress: {i}/{len(keywords)} keywords processed")
+
+            metrics = self.analyze_keyword(keyword, use_volume_api)
+            if metrics:
+                analyzed_keywords.append(metrics)
+
+        # Sort by opportunity score (highest first)
+        analyzed_keywords.sort(key=lambda x: x.opportunity_score, reverse=True)
+
+        logger.info(f"Analysis complete! {len(analyzed_keywords)} keywords analyzed")
+        return analyzed_keywords
+
+
+class ResultsExporter:
+    """Handles exporting results to various formats."""
+
+    def save_to_csv(self, keyword_metrics: List[KeywordMetrics],
+                    base_filename: str = "keyword_analysis",
+                    top_count: int = 50) -> Optional[str]:
+        """
+        Save keyword analysis results to CSV file.
+
+        Args:
+            keyword_metrics: List of analyzed keyword metrics
+            base_filename: Base name for the output file
+            top_count: Number of top results to save
+
+        Returns:
+            Filename if successful, None if failed
+        """
+        if not keyword_metrics:
+            logger.warning("No data to save!")
+            return None
+
+        # Create filename with timestamp
+        today = date.today()
+        filename = f"{base_filename}_{today}.csv"
+
+        try:
+            with open(filename, "w", newline='', encoding='utf-8') as file:
+                writer = csv.writer(file)
+
+                # Write header
+                headers = [
+                    "Keyword", "Monthly Searches", "Competition Score",
+                    "Opportunity Score", "Total Results", "Ads Count",
+                    "Featured Snippet", "People Also Ask", "Knowledge Graph"
+                ]
+                writer.writerow(headers)
+
+                # Write data rows
+                for metrics in keyword_metrics[:top_count]:
+                    row = [
+                        metrics.keyword,
+                        metrics.monthly_searches,
+                        metrics.competition_score,
+                        metrics.opportunity_score,
+                        metrics.total_results,
+                        metrics.ads_count,
+                        "Yes" if metrics.has_featured_snippet else "No",
+                        "Yes" if metrics.has_people_also_ask else "No",
+                        "Yes" if metrics.has_knowledge_graph else "No"
+                    ]
+                    writer.writerow(row)
+
+            saved_count = min(top_count, len(keyword_metrics))
+            logger.info(f"βœ… Results saved to {filename} ({saved_count} keywords)")
+            return filename
+
+        except Exception as e:
+            logger.error(f"Failed to save CSV: {e}")
+            return None
+
+    def display_top_results(self, keyword_metrics: List[KeywordMetrics],
+                            top_count: int = 5) -> None:
+        """
+        Display top results in formatted table.
+
+        Args:
+            keyword_metrics: List of analyzed keyword metrics
+            top_count: Number of top results to display
+        """
+        if not keyword_metrics:
+            logger.warning("No results to display!")
+            return
+
+        top_results = keyword_metrics[:top_count]
+
+        print(f"\nπŸ† Top {len(top_results)} Keyword Opportunities:")
+
+        if HAS_TABULATE:
+            # Create table data
+            table_data = []
+            for metrics in top_results:
+                table_data.append([
+                    metrics.keyword,
+                    f"{metrics.monthly_searches:,}",
+                    f"{metrics.competition_score:.3f}",
+                    f"{metrics.opportunity_score:.2f}",
+                    f"{metrics.total_results:,}",
+                    metrics.ads_count
+                ])
+
+            headers = ["Keyword", "Volume", "Competition", "Score", "Results", "Ads"]
+            print(tabulate(table_data, headers=headers, tablefmt="pretty"))
+        else:
+            # Fallback to simple format
+            for i, metrics in enumerate(top_results, 1):
+                print(f"{i}. {metrics.keyword}")
+                print(f"   Score: {metrics.opportunity_score}, "
+                      f"Volume: {metrics.monthly_searches:,}, "
+                      f"Competition: {metrics.competition_score:.3f}")
+
+
+class KeywordResearchTool:
+    """Main application class that orchestrates the keyword research process."""
+
+    def __init__(self, seed_keyword: str):
+        self.seed_keyword = seed_keyword
+        self.config = Config()
+        self.analyzer = KeywordAnalyzer(self.config)
+        self.exporter = ResultsExporter()
+
+    def run_analysis(self, use_volume_api: bool = False) -> None:
+        """
+        Run the complete keyword research analysis.
+
+        Args:
+            use_volume_api: Whether to use real volume API (requires implementation)
+        """
+        print("πŸ” Starting keyword research analysis...")
+        print(f"Seed keyword: '{self.seed_keyword}'")
+
+        try:
+            # Step 1: Discover related keywords
+            related_keywords = self.analyzer.keyword_discovery.find_related_keywords(
+                self.seed_keyword
+            )
+
+            if not related_keywords:
+                logger.error("No keyword candidates found. Check your SerpApi key.")
+                return
+
+            # Step 2: Analyze keywords and calculate scores
+            analyzed_keywords = self.analyzer.analyze_keywords_batch(
+                related_keywords, use_volume_api
+            )
+
+            if not analyzed_keywords:
+                logger.error("No keywords were successfully analyzed.")
+                return
+
+            # Step 3: Save results to file
+            self.exporter.save_to_csv(
+                analyzed_keywords,
+                base_filename=f"keywords_{self.seed_keyword.replace(' ', '_')}",
+                top_count=self.config.top_keywords_to_save
+            )
+
+            # Step 4: Display top results
+            self.exporter.display_top_results(analyzed_keywords, top_count=5)
+
+        except Exception as e:
+            logger.error(f"Analysis failed: {e}")
+            raise
+
+
+def main():
+    """Main entry point for the keyword research tool."""
+    # Configuration
+    SEED_KEYWORD = "global internship"
+    USE_VOLUME_API = False  # Set to True when you implement get_search_volume()
+
+    try:
+        tool = KeywordResearchTool(SEED_KEYWORD)
+        tool.run_analysis(use_volume_api=USE_VOLUME_API)
+
+    except ValueError as e:
+        logger.error(f"Configuration error: {e}")
+        print("\nπŸ’‘ Setup Instructions:")
+        print("1. Create a .env file in the same directory")
+        print("2. Add your SerpApi key: SERPAPI_KEY=your_key_here")
+        print("3. Get your free key at: https://serpapi.com/")
+
+    except Exception as e:
+        logger.error(f"Unexpected error: {e}")
+
+
+if __name__ == "__main__":
+    main()
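The weighted scoring in `CompetitionCalculator.calculate_score` can be sanity-checked in isolation. This standalone sketch repeats the same weights and normalization (the free function `competition` is ours; the sample inputs are illustrative, not real SERP data):

```python
import math

# Same weights as CompetitionCalculator.WEIGHTS in ranking.py
WEIGHTS = {"total_results": 0.50, "ads": 0.25, "featured_snippet": 0.15,
           "people_also_ask": 0.07, "knowledge_graph": 0.03}

def competition(total_results: int, ads_count: int,
                snippet: bool, paa: bool, kg: bool) -> float:
    # Log-normalize the result count (10^7 results saturates this factor)
    normalized_results = min(math.log10(total_results + 1) / 7, 1.0)
    ads_score = min(ads_count / 3, 1.0)  # 3+ ads saturates this factor
    score = (WEIGHTS["total_results"] * normalized_results
             + WEIGHTS["ads"] * ads_score
             + WEIGHTS["featured_snippet"] * snippet
             + WEIGHTS["people_also_ask"] * paa
             + WEIGHTS["knowledge_graph"] * kg)
    return max(0.0, min(1.0, score))

# A saturated SERP (10M results, 3 ads, every feature present) scores ~1.0;
# a sparse SERP with no ads and no features scores far lower.
print(competition(10_000_000, 3, True, True, True))
print(competition(50_000, 0, False, False, False))
```

Because the weights sum to 1.0 and every factor is clamped to [0, 1], the final clamp only guards against floating-point drift.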
requirements.txt ADDED
@@ -0,0 +1,60 @@
+altair==5.5.0
+annotated-types==0.7.0
+anyio==4.11.0
+attrs==25.3.0
+blinker==1.9.0
+cachetools==6.2.0
+certifi==2025.8.3
+charset-normalizer==3.4.3
+click==8.1.8
+colorama==0.4.6
+et_xmlfile==2.0.0
+exceptiongroup==1.3.0
+fastapi==0.118.0
+gitdb==4.0.12
+GitPython==3.1.45
+google_search_results==2.4.2
+gunicorn==23.0.0
+h11==0.16.0
+httptools==0.6.4
+idna==3.10
+Jinja2==3.1.6
+jsonschema==4.25.1
+jsonschema-specifications==2025.9.1
+MarkupSafe==3.0.3
+narwhals==2.6.0
+numpy==2.0.2
+openpyxl==3.1.5
+packaging==25.0
+pandas==2.3.2
+pillow==11.3.0
+plotly==6.3.0
+protobuf==6.32.1
+pyarrow==21.0.0
+pydantic==2.11.9
+pydantic_core==2.33.2
+pydeck==0.9.1
+python-dateutil==2.9.0.post0
+python-dotenv==1.1.1
+pytz==2025.2
+PyYAML==6.0.3
+referencing==0.36.2
+requests==2.32.5
+rpds-py==0.27.1
+six==1.17.0
+smmap==5.0.2
+sniffio==1.3.1
+starlette==0.48.0
+streamlit==1.50.0
+tabulate==0.9.0
+tenacity==9.1.2
+toml==0.10.2
+tornado==6.5.2
+typing-inspection==0.4.1
+typing_extensions==4.15.0
+tzdata==2025.2
+urllib3==2.5.0
+uvicorn==0.37.0
+watchdog==6.0.0
+watchfiles==1.1.0
+websockets==15.0.1
server.py ADDED
@@ -0,0 +1,625 @@
1
+ # src/server.py
2
+ """
3
+ Free-Plan Friendly SEO Keyword Research API
4
+ Optimized to minimize SerpAPI calls while maximizing keyword discovery
5
+
6
+ Key Features:
7
+ - Configurable keyword count (5, 10, 20, 50, etc.)
8
+ - Only 1 SerpAPI call per seed for candidate collection
9
+ - Mock scoring for initial ranking
10
+ - Optional SerpAPI verification for top N results
11
+ - Strict mode for free plan protection (max 5 API calls per request)
12
+ """
13
+
14
+ import os
15
+ import logging
16
+ import time
17
+ import math
18
+ import re
19
+ import io
20
+ from typing import List, Dict, Any, Optional, Tuple
21
+ from datetime import datetime
22
+ from collections import Counter
23
+
24
+ from fastapi import FastAPI, HTTPException, Query, Request
25
+ from fastapi.middleware.cors import CORSMiddleware
26
+ from fastapi.responses import JSONResponse, StreamingResponse
27
+ from pydantic import BaseModel, Field
28
+ from dotenv import load_dotenv
29
+
30
+ try:
31
+ import pandas as pd
32
+ HAS_PANDAS = True
33
+ except ImportError:
34
+ HAS_PANDAS = False
35
+
36
+ try:
37
+ from serpapi import GoogleSearch
38
+ HAS_SERPAPI = True
39
+ except ImportError:
40
+ try:
41
+ from google_search_results import GoogleSearch
42
+ HAS_SERPAPI = True
43
+ except ImportError:
44
+ HAS_SERPAPI = False
45
+
46
+ # Load environment
47
+ load_dotenv()
48
+
49
+ # Configure logging
50
+ logging.basicConfig(
51
+ level=logging.INFO,
52
+ format='%(asctime)s - %(levelname)s - %(message)s'
53
+ )
54
+ logger = logging.getLogger(__name__)
55
+
56
+ # Initialize FastAPI
57
+ app = FastAPI(
58
+ title="Free-Plan Friendly SEO Keyword API",
59
+ description="Efficient keyword research optimized for SerpAPI free plan",
60
+ version="4.0.0",
61
+ docs_url="/docs"
62
+ )
63
+
64
+ # CORS
65
+ app.add_middleware(
66
+ CORSMiddleware,
67
+ allow_origins=["*"],
68
+ allow_credentials=True,
69
+ allow_methods=["GET", "POST", "OPTIONS"],
70
+ allow_headers=["*"],
71
+ )
72
+
73
+ # Configuration
74
+ SERPAPI_KEY = os.getenv("SERPAPI_KEY")
75
+ API_AUTH_KEY = os.getenv("API_AUTH_KEY")
76
+ USE_SERPAPI_STRICT_MODE = os.getenv("USE_SERPAPI_STRICT_MODE", "true").lower() == "true"
77
+ MAX_SERPAPI_CALLS_STRICT = 5 # Maximum API calls in strict mode
78
+ MAX_SERPAPI_CALLS_NORMAL = 20 # Maximum API calls in normal mode
79
+
80
+ # Rate limiting
81
+ REQUEST_TIMES = {}
82
+ RATE_LIMIT_WINDOW = 60
83
+ RATE_LIMIT_MAX_REQUESTS = 30
84
+
85
+ # Request counter for monitoring
86
+ API_CALL_COUNTER = {"total": 0, "session_start": time.time()}
87
+
88
+ class KeywordResponse(BaseModel):
89
+ """API response model."""
90
+ success: bool = True
91
+ seed: str
92
+ requested: int
93
+ returned: int
94
+ results: List[Dict[str, Any]]
95
+ processing_time: float
96
+ api_calls_used: int
97
+ api_budget_remaining: int
98
+ data_source: str
99
+ timestamp: str
100
+
101
+ def count_api_call():
102
+ """Track API usage."""
103
+ API_CALL_COUNTER["total"] += 1
104
+ logger.info(f"API call #{API_CALL_COUNTER['total']} - Session time: {time.time() - API_CALL_COUNTER['session_start']:.1f}s")
105
+
106
+ def get_api_budget() -> int:
107
+ """Calculate remaining API budget for this request."""
108
+ max_calls = MAX_SERPAPI_CALLS_STRICT if USE_SERPAPI_STRICT_MODE else MAX_SERPAPI_CALLS_NORMAL
109
+ used = API_CALL_COUNTER["total"]
110
+ return max(0, max_calls - used)
111
+
112
+ def heuristic_competition_score(keyword: str) -> float:
113
+ """
114
+ Calculate mock competition score based on keyword characteristics.
115
+ Does NOT use any API calls.
116
+ """
117
+ words = keyword.lower().split()
118
+ word_count = len(words)
119
+
120
+ # Base competition by word count
121
+ base_scores = {1: 0.8, 2: 0.6, 3: 0.4, 4: 0.25, 5: 0.2}
122
+ base_score = base_scores.get(word_count, max(0.15, 0.3 - (word_count * 0.02)))
123
+
124
+ # Adjust for question keywords (lower competition)
125
+ question_words = ["how", "what", "why", "when", "where", "who", "which", "can", "should", "is", "are", "does"]
126
+ if any(word in words for word in question_words):
127
+ base_score *= 0.7
128
+
129
+ # Adjust for commercial intent (higher competition)
130
+ commercial_words = ["buy", "best", "top", "review", "price", "cheap", "discount"]
131
+ if any(word in words for word in commercial_words):
132
+ base_score *= 1.3
133
+
134
+ # Adjust for specific/niche keywords (lower competition)
135
+ specific_words = ["beginner", "tutorial", "guide", "explained", "step", "diy", "simple"]
136
+ if any(word in words for word in specific_words):
137
+ base_score *= 0.8
138
+
139
+ # Add keyword-hash variation (stable within one process; str hashes are salted per run)
140
+ variation = ((hash(keyword) % 21) - 10) / 100 # -0.10 to +0.10
141
+ base_score += variation
142
+
143
+ return max(0.05, min(0.95, base_score))
144
+
145
+ def heuristic_search_volume(keyword: str) -> int:
146
+ """
147
+ Estimate search volume based on keyword characteristics.
148
+ Does NOT use any API calls.
149
+ """
150
+ words = keyword.lower().split()
151
+ word_count = len(words)
152
+
153
+ # Base volumes
154
+ base_volumes = {1: 10000, 2: 5000, 3: 2000, 4: 800, 5: 400}
155
+ base_volume = base_volumes.get(word_count, max(100, 500 - (word_count * 50)))
156
+
157
+ # Adjust for popular terms
158
+ popular_terms = ["free", "online", "best", "how", "tutorial", "guide"]
159
+ if any(term in words for term in popular_terms):
160
+ base_volume = int(base_volume * 1.5)
161
+
162
+ # Adjust for very specific/niche terms
163
+ niche_terms = ["advanced", "professional", "enterprise", "custom"]
164
+ if any(term in words for term in niche_terms):
165
+ base_volume = int(base_volume * 0.6)
166
+
167
+ # Add keyword-hash variation (stable within one process; str hashes are salted per run)
168
+ variation_factor = 1 + ((hash(keyword) % 41) - 20) / 100 # 0.80 to 1.20
169
+ volume = int(base_volume * variation_factor)
170
+
171
+ return max(10, min(100000, volume))
172
+
173
+ def calculate_opportunity_score(volume: int, competition: float) -> float:
174
+ """Calculate opportunity score."""
175
+ volume_score = math.log10(volume + 1)
176
+ return volume_score / (competition + 0.1)
177
+
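As a sanity check on the formula above, here is a standalone sketch of the opportunity math (the keyword volumes and competition values are invented for illustration):

```python
import math

def opportunity(volume: int, competition: float) -> float:
    # Same shape as calculate_opportunity_score above:
    # log-scaled volume divided by competition, floored at 0.1.
    return math.log10(volume + 1) / (competition + 0.1)

# A low-competition long-tail keyword can outscore a high-volume head term.
head = opportunity(10000, 0.8)  # high volume, crowded SERP
tail = opportunity(2000, 0.3)   # lower volume, easier SERP
assert tail > head
```

The `log10` damping is the point of the formula: a 5x volume advantage is worth far less than a large competition gap.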
178
+ def score_keyword_heuristic(keyword: str) -> Dict[str, Any]:
179
+ """
180
+ Score a keyword using only heuristics (NO API calls).
181
+ Fast and free method for initial ranking.
182
+ """
183
+ competition = heuristic_competition_score(keyword)
184
+ volume = heuristic_search_volume(keyword)
185
+ opportunity = calculate_opportunity_score(volume, competition)
186
+
187
+ # Determine difficulty
188
+ if competition < 0.3:
189
+ difficulty = "Easy"
190
+ elif competition < 0.5:
191
+ difficulty = "Medium"
192
+ elif competition < 0.7:
193
+ difficulty = "Hard"
194
+ else:
195
+ difficulty = "Very Hard"
196
+
197
+ # Estimate ranking potential
198
+ if competition < 0.4 and volume >= 300:
199
+ ranking_chance = "High"
200
+ elif competition < 0.6 and volume >= 100:
201
+ ranking_chance = "Medium"
202
+ else:
203
+ ranking_chance = "Low"
204
+
205
+ return {
206
+ "keyword": keyword,
207
+ "monthly_searches": volume,
208
+ "competition_score": round(competition, 4),
209
+ "opportunity_score": round(opportunity, 2),
210
+ "difficulty": difficulty,
211
+ "ranking_chance": ranking_chance,
212
+ "data_source": "heuristic"
213
+ }
214
+
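The threshold tables above can be exercised in isolation; this standalone sketch reproduces them for quick checks:

```python
def classify(volume: int, competition: float):
    # Mirrors the difficulty / ranking-chance thresholds
    # used in score_keyword_heuristic above.
    if competition < 0.3:
        difficulty = "Easy"
    elif competition < 0.5:
        difficulty = "Medium"
    elif competition < 0.7:
        difficulty = "Hard"
    else:
        difficulty = "Very Hard"

    if competition < 0.4 and volume >= 300:
        chance = "High"
    elif competition < 0.6 and volume >= 100:
        chance = "Medium"
    else:
        chance = "Low"
    return difficulty, chance

assert classify(500, 0.35) == ("Medium", "High")
assert classify(50, 0.8) == ("Very Hard", "Low")
```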
215
+ def enrich_with_serpapi(keyword: str) -> Optional[Dict[str, Any]]:
216
+ """
217
+ Enrich a keyword with real SerpAPI data.
218
+ Uses 1 API call per keyword.
219
+ """
220
+ if not HAS_SERPAPI or not SERPAPI_KEY:
221
+ logger.warning("SerpAPI not available for enrichment")
222
+ return None
223
+
224
+ try:
225
+ count_api_call()
226
+
227
+ params = {
228
+ "engine": "google",
229
+ "q": keyword,
230
+ "api_key": SERPAPI_KEY,
231
+ "hl": "en",
232
+ "gl": "us",
233
+ "num": 10
234
+ }
235
+
236
+ search = GoogleSearch(params)
237
+ results = search.get_dict()
238
+
239
+ if "error" in results:
240
+ logger.error(f"SerpAPI error: {results['error']}")
241
+ return None
242
+
243
+ # Extract metrics
244
+ search_info = results.get("search_information", {})
245
+ total_results_raw = search_info.get("total_results") or search_info.get("total_results_raw") or ""
246
+ total_results = 0
247
+ if isinstance(total_results_raw, int):
248
+ total_results = total_results_raw
249
+ elif isinstance(total_results_raw, str):
250
+ nums = re.sub(r"[^\d]", "", total_results_raw)
251
+ total_results = int(nums) if nums else 0
252
+
253
+ ads_count = len(results.get("ads_results", []))
254
+ has_featured_snippet = bool(results.get("featured_snippet") or results.get("answer_box"))
255
+ has_paa = bool(results.get("related_questions") or results.get("people_also_ask"))
256
+ has_kg = bool(results.get("knowledge_graph"))
257
+
258
+ # Calculate real competition
259
+ normalized_results = min(math.log10(total_results + 1) / 7, 1.0) if total_results > 0 else 0
260
+ ads_score = min(ads_count / 3, 1.0)
261
+
262
+ competition = (
263
+ 0.40 * normalized_results +
264
+ 0.25 * ads_score +
265
+ 0.15 * (1 if has_featured_snippet else 0) +
266
+ 0.10 * (1 if has_paa else 0) +
267
+ 0.10 * (1 if has_kg else 0)
268
+ )
269
+ competition = max(0.0, min(1.0, competition))
270
+
271
+ # Estimate volume from signals
272
+ word_count = len(keyword.split())
273
+ base_volume = max(100, 8000 // (word_count + 1))
274
+
275
+ if ads_count > 2:
276
+ base_volume = int(base_volume * 1.5)
277
+ if has_featured_snippet:
278
+ base_volume = int(base_volume * 1.2)
279
+
280
+ volume = min(base_volume, 50000)
281
+ opportunity = calculate_opportunity_score(volume, competition)
282
+
283
+ # Determine difficulty
284
+ if competition < 0.3:
285
+ difficulty = "Easy"
286
+ elif competition < 0.5:
287
+ difficulty = "Medium"
288
+ elif competition < 0.7:
289
+ difficulty = "Hard"
290
+ else:
291
+ difficulty = "Very Hard"
292
+
293
+ # Ranking chance
294
+ if competition < 0.35:
295
+ ranking_chance = "High"
296
+ elif competition < 0.55:
297
+ ranking_chance = "Medium"
298
+ else:
299
+ ranking_chance = "Low"
300
+
301
+ return {
302
+ "keyword": keyword,
303
+ "monthly_searches": volume,
304
+ "competition_score": round(competition, 4),
305
+ "opportunity_score": round(opportunity, 2),
306
+ "difficulty": difficulty,
307
+ "ranking_chance": ranking_chance,
308
+ "total_results": total_results,
309
+ "ads_count": ads_count,
310
+ "featured_snippet": "Yes" if has_featured_snippet else "No",
311
+ "people_also_ask": "Yes" if has_paa else "No",
312
+ "knowledge_graph": "Yes" if has_kg else "No",
313
+ "data_source": "serpapi"
314
+ }
315
+
316
+ except Exception as e:
317
+ logger.error(f"SerpAPI enrichment failed for '{keyword}': {e}")
318
+ return None
319
+
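Two pieces of the enrichment above are easy to verify offline: the `total_results` string parsing and the weighted competition blend. A standalone sketch (the sample SERP values are invented):

```python
import math
import re

def parse_total_results(raw):
    # Mirrors the digit-extraction above: "About 1,230,000 results" -> 1230000
    if isinstance(raw, int):
        return raw
    digits = re.sub(r"[^\d]", "", raw or "")
    return int(digits) if digits else 0

def serp_competition(total_results, ads_count, snippet, paa, kg):
    # Same weighted blend as enrich_with_serpapi above.
    norm = min(math.log10(total_results + 1) / 7, 1.0) if total_results > 0 else 0.0
    score = (0.40 * norm
             + 0.25 * min(ads_count / 3, 1.0)
             + 0.15 * (1 if snippet else 0)
             + 0.10 * (1 if paa else 0)
             + 0.10 * (1 if kg else 0))
    return max(0.0, min(1.0, score))

n = parse_total_results("About 1,230,000 results")
c = serp_competition(n, ads_count=3, snippet=True, paa=True, kg=False)
assert n == 1230000
assert 0.0 <= c <= 1.0
```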
320
+ def collect_candidates_from_seed(seed: str) -> Tuple[List[str], int]:
321
+ """
322
+ Collect keyword candidates using ONLY 1 SerpAPI call.
323
+ Returns (candidates, api_calls_used)
324
+ """
325
+ candidates = set()
326
+ candidates.add(seed) # Always include seed
327
+ api_calls = 0
328
+
329
+ # Generate synthetic candidates (NO API calls)
330
+ question_words = ["how to", "what is", "why", "when", "where", "can i", "should i"]
331
+ modifiers = ["best", "free", "online", "guide", "tutorial", "tips", "examples",
332
+ "for beginners", "explained", "2024", "2025", "cheap", "review"]
333
+
334
+ for q in question_words[:5]:
335
+ candidates.add(f"{q} {seed}")
336
+
337
+ for mod in modifiers[:15]:
338
+ candidates.add(f"{seed} {mod}")
339
+ candidates.add(f"{mod} {seed}")
340
+
341
+ # Make ONE SerpAPI call to get real related keywords
342
+ if HAS_SERPAPI and SERPAPI_KEY:
343
+ try:
344
+ count_api_call()
345
+ api_calls = 1
346
+
347
+ params = {
348
+ "engine": "google",
349
+ "q": seed,
350
+ "api_key": SERPAPI_KEY,
351
+ "hl": "en",
352
+ "gl": "us"
353
+ }
354
+
355
+ search = GoogleSearch(params)
356
+ results = search.get_dict()
357
+
358
+ if "error" not in results:
359
+ # Extract related searches
360
+ for item in results.get("related_searches", [])[:20]:
361
+ query = item.get("query", "")
362
+ if query and len(query.split()) <= 6:
363
+ candidates.add(query.lower().strip())
364
+
365
+ # Extract PAA questions
366
+ for item in results.get("related_questions", [])[:15]:
367
+ question = item.get("question", "")
368
+ if question:
369
+ candidates.add(question.lower().strip())
370
+
371
+ logger.info(f"SerpAPI call successful: collected real suggestions")
372
+ else:
373
+ logger.warning(f"SerpAPI error: {results.get('error')}")
374
+
375
+ except Exception as e:
376
+ logger.error(f"SerpAPI collection failed: {e}")
377
+
378
+ final_candidates = list(candidates)
379
+ logger.info(f"Collected {len(final_candidates)} candidates ({api_calls} API call)")
380
+
381
+ return final_candidates, api_calls
382
+
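The zero-cost part of the collection step (synthetic expansion before the single SerpAPI call) can be sketched standalone; the word lists here are a subset of those above:

```python
def synthetic_candidates(seed: str):
    # API-free expansion, as in collect_candidates_from_seed above.
    questions = ["how to", "what is", "why", "when", "where"]
    modifiers = ["best", "free", "online", "guide", "tutorial"]
    out = {seed}
    out.update(f"{q} {seed}" for q in questions)
    for mod in modifiers:
        out.add(f"{seed} {mod}")
        out.add(f"{mod} {seed}")
    return sorted(out)

cands = synthetic_candidates("seo tools")
assert "seo tools" in cands
assert len(cands) == 16  # 1 seed + 5 questions + 2 * 5 modifiers
```

Using a set means duplicate phrasings collapse automatically before scoring.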
383
+ def check_rate_limit(client_ip: str) -> bool:
384
+ """Rate limiting."""
385
+ current_time = time.time()
386
+
387
+ if client_ip not in REQUEST_TIMES:
388
+ REQUEST_TIMES[client_ip] = []
389
+
390
+ REQUEST_TIMES[client_ip] = [
391
+ t for t in REQUEST_TIMES[client_ip]
392
+ if current_time - t < RATE_LIMIT_WINDOW
393
+ ]
394
+
395
+ if len(REQUEST_TIMES[client_ip]) >= RATE_LIMIT_MAX_REQUESTS:
396
+ return False
397
+
398
+ REQUEST_TIMES[client_ip].append(current_time)
399
+ return True
400
+
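The limiter above keeps a per-client list of timestamps inside a 60-second window; the same idea as a standalone class (deque-based, with an injectable clock so it can be tested without sleeping):

```python
import time
from collections import defaultdict, deque
from typing import Optional

class SlidingWindowLimiter:
    """Standalone sketch of the per-IP limiter in check_rate_limit above."""

    def __init__(self, window: float = 60.0, max_requests: int = 30):
        self.window = window
        self.max_requests = max_requests
        self.hits = defaultdict(deque)

    def allow(self, client_ip: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        q = self.hits[client_ip]
        while q and now - q[0] >= self.window:
            q.popleft()  # evict timestamps that fell out of the window
        if len(q) >= self.max_requests:
            return False
        q.append(now)
        return True

limiter = SlidingWindowLimiter(window=60, max_requests=3)
assert [limiter.allow("1.2.3.4", now=t) for t in (0, 1, 2, 3)] == [True, True, True, False]
assert limiter.allow("1.2.3.4", now=61)  # the t=0 hit has expired
```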
401
+ @app.on_event("startup")
402
+ async def startup():
403
+ """Startup logging."""
404
+ logger.info("=" * 60)
405
+ logger.info("SEO Keyword API - Free Plan Optimized")
406
+ logger.info(f"Strict Mode: {USE_SERPAPI_STRICT_MODE}")
407
+ logger.info(f"Max API calls per request: {MAX_SERPAPI_CALLS_STRICT if USE_SERPAPI_STRICT_MODE else MAX_SERPAPI_CALLS_NORMAL}")
408
+ logger.info(f"SerpAPI Available: {HAS_SERPAPI and bool(SERPAPI_KEY)}")
409
+ logger.info("=" * 60)
410
+
411
+ @app.get("/")
412
+ async def root():
413
+ """Root endpoint."""
414
+ return {
415
+ "service": "Free-Plan Friendly SEO Keyword API",
416
+ "version": "4.0.0",
417
+ "strict_mode": USE_SERPAPI_STRICT_MODE,
418
+ "max_api_calls": MAX_SERPAPI_CALLS_STRICT if USE_SERPAPI_STRICT_MODE else MAX_SERPAPI_CALLS_NORMAL,
419
+ "strategy": "1 API call for candidate collection + optional enrichment for top N",
420
+ "endpoints": {
421
+ "/keywords": "Main keyword research (configurable count)",
422
+ "/health": "Health check",
423
+ "/stats": "API usage statistics"
424
+ }
425
+ }
426
+
427
+ @app.get("/health")
428
+ async def health():
429
+ """Health check."""
430
+ return {
431
+ "status": "healthy",
432
+ "timestamp": datetime.utcnow().isoformat(),
433
+ "serpapi_available": HAS_SERPAPI and bool(SERPAPI_KEY),
434
+ "strict_mode": USE_SERPAPI_STRICT_MODE,
435
+ "session_api_calls": API_CALL_COUNTER["total"]
436
+ }
437
+
438
+ @app.get("/stats")
439
+ async def stats():
440
+ """API usage statistics."""
441
+ uptime = time.time() - API_CALL_COUNTER["session_start"]
442
+ return {
443
+ "session_start": datetime.fromtimestamp(API_CALL_COUNTER["session_start"]).isoformat(),
444
+ "uptime_seconds": round(uptime, 1),
445
+ "total_api_calls": API_CALL_COUNTER["total"],
446
+ "strict_mode": USE_SERPAPI_STRICT_MODE,
447
+ "max_calls_per_request": MAX_SERPAPI_CALLS_STRICT if USE_SERPAPI_STRICT_MODE else MAX_SERPAPI_CALLS_NORMAL
448
+ }
449
+
450
+ @app.get("/keywords", response_model=KeywordResponse)
451
+ async def get_keywords(
452
+ request: Request,
453
+ seed: str = Query(..., description="Seed keyword", min_length=1, max_length=100),
454
+ top: int = Query(50, description="Number of keywords to return", ge=1, le=100),
455
+ enrich_top: int = Query(4, description="Number of top results to enrich with SerpAPI", ge=0, le=20)
456
+ ):
457
+ """
458
+ Main keyword research endpoint.
459
+
460
+ Strategy:
461
+ 1. Make 1 SerpAPI call to collect candidates from seed
462
+ 2. Score all candidates with heuristics (free)
463
+ 3. Optionally enrich top N with real SerpAPI data
464
+
465
+ Parameters:
466
+ - seed: Your main keyword
467
+ - top: How many keywords you want (e.g., 5, 10, 20, 50)
468
+ - enrich_top: How many of the top results to verify with SerpAPI (0 = none, saves API calls)
469
+
470
+ Example: top=10, enrich_top=3 means:
471
+ - 1 API call to collect candidates
472
+ - Return 10 keywords scored with heuristics
473
+ - Enrich the top 3 with real SerpAPI data (3 more API calls)
474
+ - Total: 4 API calls
475
+ """
476
+ start_time = time.time()
477
+ client_ip = request.client.host if request.client else "unknown"
478
+
479
+ # Authentication
480
+ if API_AUTH_KEY:
481
+ auth = request.headers.get("Authorization", "").replace("Bearer ", "")
482
+ if auth != API_AUTH_KEY:
483
+ raise HTTPException(401, "Invalid or missing API key")
484
+
485
+ # Rate limiting
486
+ if not check_rate_limit(client_ip):
487
+ raise HTTPException(429, "Rate limit exceeded")
488
+
489
+ # Validate
490
+ seed = seed.strip().lower()
491
+ if not seed:
492
+ raise HTTPException(400, "Invalid seed keyword")
493
+
494
+ # Check API budget
495
+ max_calls = MAX_SERPAPI_CALLS_STRICT if USE_SERPAPI_STRICT_MODE else MAX_SERPAPI_CALLS_NORMAL
496
+ if enrich_top > 0:
497
+ required_calls = 1 + enrich_top # 1 for collection + N for enrichment
498
+ if required_calls > max_calls:
499
+ raise HTTPException(
500
+ 400,
501
+ f"Request would use {required_calls} API calls, but budget is {max_calls}. "
502
+ f"Reduce enrich_top to {max_calls - 1} or less."
503
+ )
504
+
505
+ try:
506
+ logger.info(f"Request: seed='{seed}', top={top}, enrich_top={enrich_top}")
507
+
508
+ # Step 1: Collect candidates (1 API call)
509
+ candidates, api_calls_used = collect_candidates_from_seed(seed)
510
+
511
+ if not candidates:
512
+ raise HTTPException(404, "No candidates found")
513
+
514
+ # Step 2: Score all candidates with heuristics (FREE - no API calls)
515
+ logger.info(f"Scoring {len(candidates)} candidates with heuristics...")
516
+ scored_candidates = []
517
+ for candidate in candidates:
518
+ try:
519
+ result = score_keyword_heuristic(candidate)
520
+ scored_candidates.append(result)
521
+ except Exception as e:
522
+ logger.warning(f"Heuristic scoring failed for '{candidate}': {e}")
523
+ continue
524
+
525
+ # Sort by opportunity score (highest first)
526
+ scored_candidates.sort(key=lambda x: x["opportunity_score"], reverse=True)
527
+
528
+ # Get top N requested
529
+ top_results = scored_candidates[:top]
530
+
531
+ # Step 3: Optionally enrich top results with real SerpAPI data
532
+ data_source = "heuristic"
533
+ if enrich_top > 0 and HAS_SERPAPI and SERPAPI_KEY:
534
+ logger.info(f"Enriching top {enrich_top} results with SerpAPI...")
535
+
536
+ for i in range(min(enrich_top, len(top_results))):
537
+ keyword = top_results[i]["keyword"]
538
+
539
+ # Check budget before each call
540
+ if api_calls_used >= max_calls:
541
+ logger.warning(f"API budget exhausted at {api_calls_used} calls")
542
+ break
543
+
544
+ enriched = enrich_with_serpapi(keyword)
545
+ if enriched:
546
+ top_results[i] = enriched
547
+ api_calls_used += 1
548
+ data_source = "mixed"
549
+
550
+ # Small delay between calls
551
+ time.sleep(0.2)
552
+
553
+ logger.info(f"Enrichment complete: {api_calls_used} total API calls used")
554
+
555
+ # Add ranking
556
+ for rank, result in enumerate(top_results, 1):
557
+ result["rank"] = rank
558
+
559
+ processing_time = time.time() - start_time
560
+ budget_remaining = max_calls - api_calls_used
561
+
562
+ logger.info(
563
+ f"SUCCESS: Returned {len(top_results)} keywords, "
564
+ f"API calls: {api_calls_used}/{max_calls}, "
565
+ f"Time: {processing_time:.2f}s"
566
+ )
567
+
568
+ return KeywordResponse(
569
+ success=True,
570
+ seed=seed,
571
+ requested=top,
572
+ returned=len(top_results),
573
+ results=top_results,
574
+ processing_time=round(processing_time, 2),
575
+ api_calls_used=api_calls_used,
576
+ api_budget_remaining=budget_remaining,
577
+ data_source=data_source,
578
+ timestamp=datetime.utcnow().isoformat()
579
+ )
580
+
581
+ except HTTPException:
582
+ raise
583
+ except Exception as e:
584
+ logger.error(f"Request failed: {e}")
585
+ raise HTTPException(500, f"Processing error: {str(e)}")
586
+
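The budget check inside the endpoint reduces to simple arithmetic; this standalone sketch shows how to size `enrich_top` against a call budget (the budget value here is illustrative, not the service's actual strict-mode cap):

```python
def required_api_calls(enrich_top: int) -> int:
    # 1 call to collect candidates, plus one per enriched keyword,
    # matching the check in the /keywords endpoint above.
    return 1 + enrich_top if enrich_top > 0 else 1

def max_enrich_top(budget: int) -> int:
    # Largest enrich_top that still fits the budget: budget - 1, never negative.
    return max(0, budget - 1)

budget = 5  # illustrative cap
assert required_api_calls(3) == 4  # the docstring example: top=10, enrich_top=3
assert max_enrich_top(budget) == 4
assert required_api_calls(max_enrich_top(budget)) <= budget
```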
587
+ @app.get("/export/csv")
588
+ async def export_csv(
589
+ seed: str = Query(...),
590
+ top: int = Query(50),
591
+ enrich_top: int = Query(0)
592
+ ):
593
+ """Export results as CSV."""
594
+ if not HAS_PANDAS:
595
+ raise HTTPException(500, "CSV export unavailable (pandas not installed)")
596
+
597
+ # Reuse the endpoint internally via a synthetic Request. Note: no auth
+ # header is forwarded, so this path returns 401 when API_AUTH_KEY is set.
598
+ response = await get_keywords(Request(scope={"type": "http", "client": ("127.0.0.1", 0), "headers": []}), seed, top, enrich_top)
599
+
600
+ # Convert to DataFrame
601
+ df = pd.DataFrame(response.results)
602
+
603
+ # Create CSV
604
+ output = io.StringIO()
605
+ df.to_csv(output, index=False)
606
+ output.seek(0)
607
+
608
+ return StreamingResponse(
609
+ iter([output.getvalue()]),
610
+ media_type="text/csv",
611
+ headers={"Content-Disposition": f"attachment; filename=keywords_{seed.replace(' ', '_')}.csv"}
612
+ )
613
+
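If pandas is unavailable, the same CSV payload can be produced with the standard library; a pandas-free sketch (the sample rows use made-up values):

```python
import csv
import io

def rows_to_csv(rows):
    """Serialize keyword result dicts to an in-memory CSV string."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

sample = [
    {"keyword": "python tutorial", "monthly_searches": 5000, "competition_score": 0.42},
    {"keyword": "learn python free", "monthly_searches": 1200, "competition_score": 0.21},
]
payload = rows_to_csv(sample)
assert payload.splitlines()[0] == "keyword,monthly_searches,competition_score"
assert len(payload.splitlines()) == 3
```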
614
+ if __name__ == "__main__":
615
+ import uvicorn
616
+
617
+ port = int(os.getenv("PORT", 8000))
618
+ logger.info(f"Starting server on port {port}")
619
+
620
+ uvicorn.run(
621
+ app,
622
+ host="0.0.0.0",
623
+ port=port,
624
+ log_level="info"
625
+ )
tempCodeRunnerFile.py ADDED
@@ -0,0 +1 @@
 
 
1
+ # pip install serpapi tabulate python-dotenv