Ryan committed
Commit 99a81ef · 1 parent: 60c7f79

- add all content to vector db

DEPLOYMENT.md DELETED
@@ -1,123 +0,0 @@
-# Deploying to Hugging Face Spaces
-
-## Prerequisites
-
-1. A Hugging Face account (sign up at https://huggingface.co/)
-2. Qdrant Cloud instance with your data uploaded
-3. OpenAI API key
-
-## Step-by-Step Deployment
-
-### 1. Create a New Space
-
-1. Go to https://huggingface.co/spaces
-2. Click **"Create new Space"**
-3. Fill in the details:
-   - **Owner**: Your username or organization
-   - **Space name**: `80k-rag-qa` (or your preferred name)
-   - **License**: Choose appropriate license (e.g., MIT)
-   - **Space SDK**: Select **"Gradio"**
-   - **Hardware**: Select **"CPU basic"** (free tier) or upgrade if needed
-   - **Visibility**: Choose "Public" or "Private"
-4. Click **"Create Space"**
-
-### 2. Configure Secrets
-
-Before uploading code, set up your API keys:
-
-1. Go to your Space's page
-2. Click **"Settings"** → **"Variables and Secrets"**
-3. Click **"New Secret"** for each of the following:
-   - **Name**: `QDRANT_URL` | **Value**: Your Qdrant instance URL
-   - **Name**: `QDRANT_API_KEY` | **Value**: Your Qdrant API key
-   - **Name**: `OPENAI_API_KEY` | **Value**: Your OpenAI API key
-4. Click **"Save"** for each secret
-
-### 3. Upload Your Code
-
-**Option A: Using Git (Recommended)**
-
-```bash
-# Clone your new Space
-git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
-cd YOUR_SPACE_NAME
-
-# Copy necessary files from this project
-cp /home/ryan/Documents/80k_rag/app.py .
-cp /home/ryan/Documents/80k_rag/rag_chat.py .
-cp /home/ryan/Documents/80k_rag/citation_validator.py .
-cp /home/ryan/Documents/80k_rag/config.py .
-cp /home/ryan/Documents/80k_rag/requirements.txt .
-cp /home/ryan/Documents/80k_rag/README.md .
-
-# Add, commit, and push
-git add .
-git commit -m "Initial deployment"
-git push
-```
-
-**Option B: Using the Web Interface**
-
-1. Go to your Space → **"Files and versions"** tab
-2. Click **"Add file"** → **"Upload files"**
-3. Upload these files:
-   - `app.py`
-   - `rag_chat.py`
-   - `citation_validator.py`
-   - `config.py`
-   - `requirements.txt`
-   - `README.md`
-4. Click **"Commit changes to main"**
-
-### 4. Monitor Deployment
-
-1. Go to the **"App"** tab to see your Space building
-2. Check the **"Logs"** section (click "See logs" if build fails)
-3. Wait for the build to complete (usually 2-5 minutes)
-4. Your app will be live at: `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME`
-
-## Troubleshooting
-
-### Build Fails
-- Check the logs for missing dependencies
-- Ensure all required files are uploaded
-- Verify `requirements.txt` has correct package names
-
-### Runtime Errors
-- Verify secrets are set correctly in Settings
-- Check logs for import errors or missing modules
-- Ensure your Qdrant instance is accessible
-
-### Out of Memory
-- Consider upgrading to a larger hardware tier
-- Optimize model loading and caching
-- Reduce `SOURCE_COUNT` in `rag_chat.py`
-
-## Updating Your Space
-
-To update your deployed app:
-
-```bash
-# Make changes to your local files
-# Then push updates
-git add .
-git commit -m "Update: describe your changes"
-git push
-```
-
-The Space will automatically rebuild with your changes.
-
-## Cost Considerations
-
-- **Hugging Face Space**: Free for CPU basic tier
-- **OpenAI API**: Pay per token (GPT-4o-mini is cost-effective)
-- **Qdrant Cloud**: Has free tier, pay for larger datasets
-- **Estimated cost**: ~$0.01-0.10 per query depending on usage
-
-## Security Notes
-
-- Never commit API keys to git (they should only be in Space Secrets)
-- Use `.gitignore` to exclude sensitive files
-- Regularly rotate API keys
-- Monitor API usage to prevent abuse
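The deleted guide's "Runtime Errors" section says to verify that secrets are set; on a Space those secrets surface as environment variables, so a startup check catches misconfiguration before the first query fails. A minimal sketch, using the three secret names from the doc (the helper itself is hypothetical, not part of the repo):

```python
import os

# Secret names from DEPLOYMENT.md, step 2
REQUIRED_SECRETS = ["QDRANT_URL", "QDRANT_API_KEY", "OPENAI_API_KEY"]

def missing_secrets(env=os.environ):
    """Return the names of required secrets that are absent or empty."""
    return [name for name in REQUIRED_SECRETS if not env.get(name)]
```

Calling this at app startup and printing the result gives an immediate, readable error in the Space logs instead of a stack trace from the first Qdrant or OpenAI call.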
TODO.txt CHANGED
@@ -1,7 +1,2 @@
-Technical:
-- Test source citation
-- Setup demo website
 - Post video on Linkedin & send outreach messages
-- have u ever wondered what to do about AI?
-
-- Fix the dates that trafilatura scrapes
+- have u ever wondered what to do about AI?
chunk_articles_cli.py CHANGED
@@ -1,4 +1,5 @@
 import json
+import os
 from llama_index.core.node_parser import SemanticSplitterNodeParser
 from llama_index.core.schema import Document
 from llama_index.embeddings.huggingface import HuggingFaceEmbedding
@@ -7,6 +8,8 @@ from config import MODEL_NAME
 BUFFER_SIZE = 3
 BREAKPOINT_PERCENTILE_THRESHOLD = 87
 NUMBER_OF_ARTICLES = 86
+INPUT_FOLDER = "extracted_content"
+OUTPUT_FILE = "chunks.jsonl"
 
 def load_articles(json_path="articles.json", n=None):
     """Load articles from JSON file. Optionally load only first N articles."""
@@ -25,7 +28,7 @@ def chunk_text_semantic(text, embed_model):
     nodes = splitter.get_nodes_from_documents([doc])
     return [node.text for node in nodes]
 
-def make_jsonl(articles, out_path="article_chunks.jsonl"):
+def make_jsonl(articles, out_path="chunks.jsonl"):
     """Create JSONL with semantic chunks from multiple articles."""
     print("Loading embedding model for semantic chunking...")
     embed_model = HuggingFaceEmbedding(model_name=MODEL_NAME)
@@ -44,6 +47,54 @@ def make_jsonl(articles, out_path="chunks.jsonl"):
             }
             f.write(json.dumps(record, ensure_ascii=False) + "\n")
 
+def chunk_from_json_files(input_folder=INPUT_FOLDER, output_file=OUTPUT_FILE):
+    """Load articles from JSON files in folder and chunk them to JSONL."""
+    if not os.path.exists(input_folder):
+        print(f"Input folder '{input_folder}' not found")
+        return
+
+    # Load all articles from JSON files
+    all_articles = []
+    json_files = [f for f in os.listdir(input_folder) if f.endswith('.json')]
+
+    if not json_files:
+        print(f"No JSON files found in {input_folder}")
+        return
+
+    for json_file in json_files:
+        json_path = os.path.join(input_folder, json_file)
+        with open(json_path, "r", encoding="utf-8") as f:
+            articles = json.load(f)
+        all_articles.extend(articles)
+        print(f"Loaded {len(articles)} articles from {json_file}")
+
+    if not all_articles:
+        print("No articles found to chunk")
+        return
+
+    print(f"\nTotal articles to chunk: {len(all_articles)}")
+    print("Loading embedding model for semantic chunking...")
+    embed_model = HuggingFaceEmbedding(model_name=MODEL_NAME)
+
+    chunk_count = 0
+    with open(output_file, "w", encoding="utf-8") as f:
+        for idx, article in enumerate(all_articles, 1):
+            print(f"Chunking ({idx}/{len(all_articles)}): {article['title']}")
+            chunks = chunk_text_semantic(article["text"], embed_model)
+            for i, chunk in enumerate(chunks, 1):
+                record = {
+                    "url": article["url"],
+                    "title": article["title"],
+                    "date": article.get("date"),
+                    "chunk_id": i,
+                    "text": chunk,
+                }
+                chunk_count += 1
+                f.write(json.dumps(record, ensure_ascii=False) + "\n")
+
+    print(f"\n✓ Created {chunk_count} chunks from {len(all_articles)} articles")
+    print(f"💾 Saved to {output_file}")
+
 def main():
     articles = load_articles(n=NUMBER_OF_ARTICLES)
     if not articles:
@@ -51,7 +102,7 @@ def main():
         return
 
     make_jsonl(articles)
-    print(f"Chunks from {len(articles)} articles written to article_chunks.jsonl")
+    print(f"Chunks from {len(articles)} articles written to chunks.jsonl")
 
 if __name__ == "__main__":
     main()
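The new `chunk_from_json_files` flow — read every `*.json` in a folder, flatten the article lists, chunk each text, and emit one JSONL record per chunk — is independent of the embedding model. A stdlib-only sketch of the same flow, with a naive blank-line splitter standing in for `SemanticSplitterNodeParser` (helper names are mine, not from the repo):

```python
import json
import os

def split_paragraphs(text):
    """Naive stand-in for semantic chunking: split on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def chunk_folder(input_folder, output_file, splitter=split_paragraphs):
    """Flatten article lists from every *.json file, write one JSONL record per chunk."""
    articles = []
    for name in sorted(os.listdir(input_folder)):
        if name.endswith(".json"):
            with open(os.path.join(input_folder, name), encoding="utf-8") as f:
                articles.extend(json.load(f))
    count = 0
    with open(output_file, "w", encoding="utf-8") as f:
        for article in articles:
            # chunk_id restarts at 1 per article, matching the repo's records
            for i, chunk in enumerate(splitter(article["text"]), 1):
                record = {"url": article["url"], "title": article["title"],
                          "date": article.get("date"), "chunk_id": i, "text": chunk}
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
                count += 1
    return count
```

Swapping `splitter` for a call into `chunk_text_semantic` recovers the real behaviour; the folder-flattening and record layout are the parts this commit actually changes.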
extract_articles_cli.py DELETED
@@ -1,87 +0,0 @@
1
- import requests, re, json
2
- import trafilatura
3
- from typing import List, Dict, Optional
4
-
5
- HEADERS = {"User-Agent": "RAG-80k/0.1 (+your-contact)"}
6
- NUMBER_OF_ARTICLES = 10
7
-
8
- def get_all_article_urls() -> List[str]:
9
- """Extract all article URLs from the sitemap."""
10
- sitemap_url = "https://80000hours.org/article-sitemap.xml"
11
- r = requests.get(sitemap_url, headers=HEADERS, timeout=20)
12
- r.raise_for_status()
13
-
14
- # Find all <loc> tags in the sitemap
15
- urls = re.findall(r"<loc>(.*?)</loc>", r.text)
16
- return urls
17
-
18
- def extract_article(url: str) -> Optional[Dict]:
19
- """Extract article content and metadata from a URL."""
20
- r = requests.get(url, headers=HEADERS, timeout=30)
21
- r.raise_for_status()
22
- data = trafilatura.extract(
23
- r.content,
24
- url=url,
25
- with_metadata=True,
26
- include_links=False,
27
- include_comments=False,
28
- include_formatting=False,
29
- output_format="json",
30
- )
31
- return json.loads(data) if data else None
32
-
33
- def extract_all_articles() -> List[Dict]:
34
- """Extract all articles from the sitemap."""
35
- urls = get_all_article_urls()
36
- print(f"Found {len(urls)} articles in sitemap")
37
-
38
- articles = []
39
- for i, url in enumerate(urls, 1):
40
- print(f"[{i}/{len(urls)}] Extracting: {url}")
41
- record = extract_article(url)
42
- if record and record.get("text"):
43
- articles.append({
44
- "url": url,
45
- "title": record.get("title", ""),
46
- "date": record.get("date"),
47
- "text": record.get("text", "").strip()
48
- })
49
- else:
50
- print(f" Failed to extract: {url}")
51
-
52
- print(f"Successfully extracted {len(articles)} articles")
53
- return articles
54
-
55
- def extract_first_n_articles(n: int) -> List[Dict]:
56
- """Extract the first N articles from the sitemap."""
57
- urls = get_all_article_urls()[:n]
58
- print(f"Extracting first {n} articles")
59
-
60
- articles = []
61
- for i, url in enumerate(urls, 1):
62
- print(f"[{i}/{len(urls)}] Extracting: {url}")
63
- record = extract_article(url)
64
- if record and record.get("text"):
65
- articles.append({
66
- "url": url,
67
- "title": record.get("title", ""),
68
- "date": record.get("date"),
69
- "text": record.get("text", "").strip()
70
- })
71
- else:
72
- print(f" Failed to extract: {url}")
73
-
74
- print(f"Successfully extracted {len(articles)} articles")
75
- return articles
76
-
77
- def main():
78
- articles = extract_all_articles()
79
-
80
- if articles:
81
- # Save to JSON file
82
- output_file = "articles.json"
83
- with open(output_file, "w", encoding="utf-8") as f:
84
- json.dump(articles, f, ensure_ascii=False, indent=2)
85
-
86
- if __name__ == "__main__":
87
- main()
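Both the deleted script and its replacement pull URLs out of sitemaps with a bare `<loc>` regex. The same extraction can be done with stdlib `xml.etree`, which also copes with the sitemap namespace; a sketch (function names are mine, not from the repo):

```python
import re
import xml.etree.ElementTree as ET

def locs_via_regex(xml_text):
    """The repo's approach: regex over the raw sitemap text."""
    return re.findall(r"<loc>(.*?)</loc>", xml_text)

def locs_via_etree(xml_text):
    """Namespace-aware parse: keep any element whose local name is 'loc'."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter()
            if el.tag.rsplit("}", 1)[-1] == "loc"]
```

For well-formed sitemaps the two agree; the parser version additionally refuses malformed XML instead of silently matching partial tags.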
extract_content_cli.py ADDED
@@ -0,0 +1,241 @@
+import requests, re, json
+import trafilatura
+from typing import List, Dict, Optional
+from time import sleep
+from dateutil import parser as date_parser
+from concurrent.futures import ThreadPoolExecutor, as_completed
+import random
+import os
+import threading
+import time
+from requests.adapters import HTTPAdapter
+from urllib3.util.retry import Retry
+
+# Parallel processing settings
+USE_PARALLEL = True
+MAX_WORKERS = 3
+
+# Rate limiting settings
+MIN_DELAY = 1.0
+MAX_DELAY = 3.0
+RATE_LOCK = threading.Lock()
+_next_request_time = 0.0
+
+# Output settings
+OUTPUT_FOLDER = "extracted_content"
+TEST_LIMIT = None
+
+# HTTP settings
+HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
+
+# All content sitemaps (excluding category/author which are just metadata)
+SITEMAPS = {
+    "ai_career_guide_pages": "https://80000hours.org/ai_career_guide_page-sitemap.xml",
+    # "articles": "https://80000hours.org/article-sitemap.xml",
+    # "career_guide_pages": "https://80000hours.org/careerguidepage-sitemap.xml",
+    "career_profiles": "https://80000hours.org/career_profile-sitemap.xml",
+    # "career_reports": "https://80000hours.org/career_report-sitemap.xml",
+    # "case_studies": "https://80000hours.org/case_study-sitemap.xml",
+    "posts": "https://80000hours.org/post-sitemap.xml",
+    "problem_profiles": "https://80000hours.org/problem_profile-sitemap.xml",
+    # "podcasts": "https://80000hours.org/podcast-sitemap.xml",
+    # "podcast_after_hours": "https://80000hours.org/podcast_after_hours-sitemap.xml",
+    "skill_sets": "https://80000hours.org/skill_set-sitemap.xml",
+    # "videos": "https://80000hours.org/video-sitemap.xml",
+}
+
+# Thread-local session with retries and backoff
+thread_local = threading.local()
+
+def get_session():
+    """Get or create a thread-local requests session with retries and connection pooling."""
+    s = getattr(thread_local, "session", None)
+    if s is None:
+        s = requests.Session()
+        s.headers.update(HEADERS)
+        retry = Retry(
+            total=5, connect=3, read=3, status=3,
+            status_forcelist=[429, 500, 502, 503, 504],
+            allowed_methods={"GET", "HEAD"},
+            backoff_factor=0.8,
+            raise_on_status=False,
+            respect_retry_after_header=True,
+        )
+        adapter = HTTPAdapter(
+            max_retries=retry,
+            pool_connections=MAX_WORKERS * 2,
+            pool_maxsize=MAX_WORKERS * 2,
+        )
+        s.mount("http://", adapter)
+        s.mount("https://", adapter)
+        thread_local.session = s
+    return s
+
+def throttle():
+    """Enforce rate limiting across all threads."""
+    global _next_request_time
+    delay = random.uniform(MIN_DELAY, MAX_DELAY)
+    with RATE_LOCK:
+        now = time.monotonic()
+        wait = max(0.0, _next_request_time - now)
+        _next_request_time = max(now, _next_request_time) + delay
+    if wait > 0:
+        time.sleep(wait)
+
+def get_urls_from_sitemap(sitemap_url: str) -> List[str]:
+    """Extract all URLs from a sitemap."""
+    throttle()
+    r = get_session().get(sitemap_url, timeout=20)
+    r.raise_for_status()
+    return re.findall(r"<loc>(.*?)</loc>", r.text)
+
+def parse_custom_date(html_content: str) -> Optional[str]:
+    """
+    Extract and parse publication date from 80,000 Hours HTML content.
+
+    Priority:
+    1. "Updated [date]" if present
+    2. "Published [date]" otherwise
+
+    Returns date in YYYY-MM-DD format, or None if not found.
+    """
+    # Date pattern: month + optional day (with ordinal) + year
+    date_pattern = r'([A-Za-z]+\s+(?:\d{1,2}(?:st|nd|rd|th)?,?\s+)?\d{4})'
+
+    # Try "Updated" first, then "Published"
+    for keyword in ['Updated', 'Published']:
+        match = re.search(f'{keyword}\\s+{date_pattern}', html_content, re.IGNORECASE)
+        if match:
+            try:
+                parsed_date = date_parser.parse(match.group(1), fuzzy=True)
+                return parsed_date.strftime('%Y-%m-%d')
+            except:
+                pass
+
+    return None
+
+def extract_content(url: str) -> Optional[Dict]:
+    """Extract content and metadata from a URL."""
+    try:
+        throttle()
+        r = get_session().get(url, timeout=30)
+        r.raise_for_status()
+    except Exception as e:
+        print(f"  ❌ Request failed: {e}")
+        return None
+
+    data = trafilatura.extract(
+        r.content, url=url, with_metadata=True,
+        include_links=False, include_comments=False,
+        include_formatting=False, output_format="json"
+    )
+
+    if not data:
+        return None
+
+    result = json.loads(data)
+    if custom_date := parse_custom_date(r.text):
+        result['date'] = custom_date
+
+    return result
+
+
+def process_record(record: Optional[Dict], url: str, sitemap_name: str) -> Optional[Dict]:
+    """Convert extraction record to final output format."""
+    if not (record and record.get("text")):
+        return None
+    return {
+        "url": url,
+        "title": record.get("title", ""),
+        "date": record.get("date"),
+        "author": record.get("author"),
+        "text": record.get("text", "").strip(),
+        "content_type": sitemap_name
+    }
+
+def handle_extraction_result(record: Optional[Dict], url: str, sitemap_name: str, index: int, total: int, items: List[Dict]) -> None:
+    """Process extraction result and add to items list if successful."""
+    try:
+        result = process_record(record, url, sitemap_name)
+        if result:
+            items.append(result)
+        status = "✓" if result else "⚠️ Failed:"
+        print(f"[{index}/{total}] {status} {url}")
+    except Exception as e:
+        print(f"[{index}/{total}] ❌ {url}: {e}")
+
+def extract_from_sitemap(sitemap_name: str, sitemap_url: str, limit: int = None, parallel: bool = True, max_workers: int = 5) -> List[Dict]:
+    """Extract content from a sitemap using either parallel or sequential processing."""
+    print(f"\n{'='*80}")
+    print(f"Processing {sitemap_name}...")
+    print(f"{'='*80}")
+
+    urls = get_urls_from_sitemap(sitemap_url)
+    print(f"Found {len(urls)} URLs in sitemap")
+
+    if limit:
+        urls = urls[:limit]
+        print(f"Limiting to first {limit} URL(s)")
+
+    items = []
+
+    if parallel and len(urls) > 1:
+        print(f"🚀 Using parallel processing with {max_workers} workers")
+        completed = 0
+
+        with ThreadPoolExecutor(max_workers=max_workers) as executor:
+            # Submit all tasks
+            future_to_url = {
+                executor.submit(extract_content, url): url
+                for url in urls
+            }
+
+            # Process completed tasks
+            for future in as_completed(future_to_url):
+                url = future_to_url[future]
+                completed += 1
+                handle_extraction_result(future.result(), url, sitemap_name, completed, len(urls), items)
+    else:
+        print("📝 Using sequential processing")
+        for i, url in enumerate(urls, 1):
+            handle_extraction_result(extract_content(url), url, sitemap_name, i, len(urls), items)
+
+    print(f"✓ Successfully extracted {len(items)}/{len(urls)} items")
+    return items
+
+def extract_all_to_json():
+    """Extract all content from sitemaps and save to individual JSON files."""
+    os.makedirs(OUTPUT_FOLDER, exist_ok=True)
+
+    print("Starting 80,000 Hours content extraction...")
+    print(f"Total content types: {len(SITEMAPS)}")
+    print(f"Output folder: {OUTPUT_FOLDER}/")
+    if TEST_LIMIT:
+        print(f"⚠️ TEST MODE: Extracting only {TEST_LIMIT} item(s) per content type\n")
+
+    all_stats = {}
+    for content_type, sitemap_url in SITEMAPS.items():
+        items = extract_from_sitemap(
+            content_type, sitemap_url,
+            limit=TEST_LIMIT, parallel=USE_PARALLEL, max_workers=MAX_WORKERS
+        )
+        all_stats[content_type] = len(items)
+
+        if items:
+            output_file = os.path.join(OUTPUT_FOLDER, f"{content_type}.json")
+            with open(output_file, "w", encoding="utf-8") as f:
+                json.dump(items, f, ensure_ascii=False, indent=2)
+            print(f"💾 Saved to {output_file}")
+
+    print(f"\n{'='*80}\nEXTRACTION COMPLETE\n{'='*80}")
+    print(f"Total items extracted: {sum(all_stats.values())}")
+    print("\nBreakdown by content type:")
+    for content_type, count in sorted(all_stats.items(), key=lambda x: x[1], reverse=True):
+        print(f"  {content_type:25s}: {count:4d} items → {OUTPUT_FOLDER}/{content_type}.json")
+
+def main():
+    extract_all_to_json()
+
+if __name__ == "__main__":
+    main()
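The `parse_custom_date` helper above prefers an "Updated …" byline over "Published …" and normalizes to `YYYY-MM-DD`. A stdlib-only variant of the same logic (no `dateutil`; ordinal suffixes stripped by hand; function name is mine) behaves like this:

```python
import re
from datetime import datetime

# Same pattern as the repo: month + optional day (with ordinal) + year
DATE_PATTERN = r'([A-Za-z]+\s+(?:\d{1,2}(?:st|nd|rd|th)?,?\s+)?\d{4})'

def parse_byline_date(html):
    """Return YYYY-MM-DD from 'Updated <date>' (preferred) or 'Published <date>'."""
    for keyword in ('Updated', 'Published'):
        m = re.search(keyword + r'\s+' + DATE_PATTERN, html, re.IGNORECASE)
        if not m:
            continue
        # "June 1st, 2023" -> "June 1, 2023"
        text = re.sub(r'(?<=\d)(st|nd|rd|th)', '', m.group(1))
        for fmt in ('%B %d, %Y', '%B %d %Y', '%B %Y'):
            try:
                return datetime.strptime(text, fmt).strftime('%Y-%m-%d')
            except ValueError:
                pass
    return None
```

The repo's `dateutil.parser.parse(..., fuzzy=True)` is more forgiving (abbreviated months, stray words), which is why the script takes the extra dependency.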
pipeline.py ADDED
@@ -0,0 +1,37 @@
+"""Combined pipeline: Extract → Chunk → Upload to Qdrant (additive)"""
+from extract_content_cli import extract_all_to_json
+from chunk_articles_cli import chunk_from_json_files
+from upload_to_qdrant_cli import upload_chunks_additive
+
+
+def main():
+    print("="*80)
+    print("80,000 HOURS RAG PIPELINE")
+    print("Extract → Chunk → Upload (Additive)")
+    print("="*80)
+
+    # Step 1: Extract to individual JSON files
+    print("\n" + "="*80)
+    print("STEP 1: EXTRACTING CONTENT")
+    print("="*80)
+    extract_all_to_json()
+
+    # Step 2: Chunk from JSON files
+    print("\n" + "="*80)
+    print("STEP 2: CHUNKING ARTICLES")
+    print("="*80)
+    chunk_from_json_files()
+
+    # Step 3: Upload to Qdrant from chunks file (additive)
+    print("\n" + "="*80)
+    print("STEP 3: UPLOADING TO QDRANT (ADDITIVE)")
+    print("="*80)
+    upload_chunks_additive()
+
+    print("\n" + "="*80)
+    print("PIPELINE COMPLETE ✓")
+    print("="*80)
+
+
+if __name__ == "__main__":
+    main()
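The `throttle()` in `extract_content_cli.py` reserves a global "next request time" under a lock, so parallel workers still respect one site-wide delay between requests. The same scheduling logic, made deterministic with an injectable clock so it can be reasoned about without sleeping (this class is a sketch of mine, not in the repo):

```python
import threading

class Throttle:
    """Cross-thread rate limiter: each call reserves the next request slot."""
    def __init__(self, delay, clock):
        self.delay = delay    # fixed gap between requests (the repo randomizes it)
        self.clock = clock    # injectable monotonic clock, for testability
        self.lock = threading.Lock()
        self.next_time = 0.0

    def wait_time(self):
        """Return how long the caller should sleep before issuing its request."""
        with self.lock:
            now = self.clock()
            wait = max(0.0, self.next_time - now)
            # Reserve the slot after ours, even if we haven't slept yet
            self.next_time = max(now, self.next_time) + self.delay
        return wait
```

The key property is that the reservation happens inside the lock while the sleep happens outside it, so N workers arriving at once are spaced `delay` apart instead of all sleeping the same amount.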
requirements.txt CHANGED
@@ -7,6 +7,8 @@ requests>=2.31.0
 gradio>=4.0.0
 rapidfuzz>=3.0.0
 fuzzysearch>=0.7.3
+trafilatura>=1.6.0
+python-dateutil>=2.8.0
 --extra-index-url https://download.pytorch.org/whl/cpu
 torch>=2.0.0
 
upload_to_qdrant_cli.py CHANGED
@@ -8,7 +8,7 @@ from config import MODEL_NAME, COLLECTION_NAME, EMBEDDING_DIM
 
 load_dotenv()
 
-def load_chunks(jsonl_path="article_chunks.jsonl"):
+def load_chunks(jsonl_path="chunks.jsonl"):
     chunks = []
     with open(jsonl_path, "r", encoding="utf-8") as f:
         for line in f:
@@ -64,9 +64,16 @@ def create_points(chunks, embeddings):
         points.append(point)
     return points
 
-def upload_points(client, points, collection_name=COLLECTION_NAME):
-    print(f"Uploading {len(points)} points...")
-    client.upsert(collection_name=collection_name, points=points)
+def upload_points(client, points, collection_name=COLLECTION_NAME, batch_size=100):
+    print(f"Uploading {len(points)} points in batches of {batch_size}...")
+    total_batches = (len(points) + batch_size - 1) // batch_size
+
+    for i in range(0, len(points), batch_size):
+        batch = points[i:i + batch_size]
+        batch_num = (i // batch_size) + 1
+        print(f"  Batch {batch_num}/{total_batches}: Uploading {len(batch)} points...")
+        client.upsert(collection_name=collection_name, points=batch)
+
     print(f"✓ Uploaded {len(points)} chunks to collection '{collection_name}'")
 
 def verify_upload(client, collection_name=COLLECTION_NAME):
@@ -74,21 +81,63 @@ def verify_upload(client, collection_name=COLLECTION_NAME):
     print(f"Collection now has {collection_info.points_count} points")
     return collection_info.points_count
 
-def main():
-    chunks = load_chunks()
-    print(f"Found {len(chunks)} chunks")
+def ensure_collection_exists(client, collection_name=COLLECTION_NAME, embedding_dim=EMBEDDING_DIM):
+    """Ensure collection exists, create if it doesn't. Returns starting ID for new points."""
+    if not client.collection_exists(collection_name):
+        print(f"Collection '{collection_name}' doesn't exist. Creating...")
+        client.create_collection(
+            collection_name=collection_name,
+            vectors_config=VectorParams(size=embedding_dim, distance=Distance.COSINE),
+        )
+        return 0
+    else:
+        collection_info = client.get_collection(collection_name)
+        point_count = collection_info.points_count
+        print(f"Collection '{collection_name}' exists with {point_count} points")
+        return point_count
+
+def offset_point_ids(points, start_id):
+    """Update point IDs to start from a given offset."""
+    print(f"Setting point IDs starting from {start_id}...")
+    for i, point in enumerate(points):
+        point.id = start_id + i
+    return points
+
+def print_upload_summary(start_id, added_count, new_count):
+    """Print upload summary statistics."""
+    print(f"\n✓ Upload complete!")
+    print(f"  Previous: {start_id} points")
+    print(f"  Added: {added_count} points")
+    print(f"  Total now: {new_count} points")
+
+def upload_chunks_additive(chunks_file="chunks.jsonl"):
+    """Upload chunks to Qdrant additively (preserves existing data)."""
+    if not os.path.exists(chunks_file):
+        print(f"Chunks file '{chunks_file}' not found")
+        return
 
-    client = create_qdrant_client()
-    model = load_embedding_model()
+    chunks = load_chunks(chunks_file)
+    print(f"Found {len(chunks)} chunks")
 
-    if not create_collection(client):
+    if not chunks:
+        print("No chunks to upload")
         return
 
+    client = create_qdrant_client()
+    start_id = ensure_collection_exists(client)
+
+    model = load_embedding_model()
     embeddings = generate_embeddings(model, chunks)
     points = create_points(chunks, embeddings)
+    points = offset_point_ids(points, start_id)
+
     upload_points(client, points)
-    verify_upload(client)
+
+    new_count = verify_upload(client)
+    print_upload_summary(start_id, len(points), new_count)
 
-if __name__ == "__main__":
-    main()
+def main():
+    upload_chunks_additive()
 
+if __name__ == "__main__":
+    main()
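The batched `upload_points` introduced above relies on ceil division for the batch count and slice windows for the batches themselves. The arithmetic in isolation, as a generic helper (my sketch, not from the repo):

```python
def batches(items, batch_size=100):
    """Yield (batch_num, total_batches, slice) with at most batch_size items each."""
    total = (len(items) + batch_size - 1) // batch_size  # ceil division
    for i in range(0, len(items), batch_size):
        yield (i // batch_size) + 1, total, items[i:i + batch_size]
```

Only the final batch can be short, and the slices cover the input exactly once — which is what keeps each `client.upsert` call small enough for Qdrant Cloud's request limits while uploading everything.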