Ryan committed
Commit 99a81ef · 1 parent: 60c7f79

- add all content to vector db

DEPLOYMENT.md DELETED
@@ -1,123 +0,0 @@
-# Deploying to Hugging Face Spaces
-
-## Prerequisites
-
-1. A Hugging Face account (sign up at https://huggingface.co/)
-2. Qdrant Cloud instance with your data uploaded
-3. OpenAI API key
-
-## Step-by-Step Deployment
-
-### 1. Create a New Space
-
-1. Go to https://huggingface.co/spaces
-2. Click **"Create new Space"**
-3. Fill in the details:
-   - **Owner**: Your username or organization
-   - **Space name**: `80k-rag-qa` (or your preferred name)
-   - **License**: Choose appropriate license (e.g., MIT)
-   - **Space SDK**: Select **"Gradio"**
-   - **Hardware**: Select **"CPU basic"** (free tier) or upgrade if needed
-   - **Visibility**: Choose "Public" or "Private"
-4. Click **"Create Space"**
-
-### 2. Configure Secrets
-
-Before uploading code, set up your API keys:
-
-1. Go to your Space's page
-2. Click **"Settings"** → **"Variables and Secrets"**
-3. Click **"New Secret"** for each of the following:
-   - **Name**: `QDRANT_URL` | **Value**: Your Qdrant instance URL
-   - **Name**: `QDRANT_API_KEY` | **Value**: Your Qdrant API key
-   - **Name**: `OPENAI_API_KEY` | **Value**: Your OpenAI API key
-4. Click **"Save"** for each secret
-
-### 3. Upload Your Code
-
-**Option A: Using Git (Recommended)**
-
-```bash
-# Clone your new Space
-git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
-cd YOUR_SPACE_NAME
-
-# Copy necessary files from this project
-cp /home/ryan/Documents/80k_rag/app.py .
-cp /home/ryan/Documents/80k_rag/rag_chat.py .
-cp /home/ryan/Documents/80k_rag/citation_validator.py .
-cp /home/ryan/Documents/80k_rag/config.py .
-cp /home/ryan/Documents/80k_rag/requirements.txt .
-cp /home/ryan/Documents/80k_rag/README.md .
-
-# Add, commit, and push
-git add .
-git commit -m "Initial deployment"
-git push
-```
-
-**Option B: Using the Web Interface**
-
-1. Go to your Space → **"Files and versions"** tab
-2. Click **"Add file"** → **"Upload files"**
-3. Upload these files:
-   - `app.py`
-   - `rag_chat.py`
-   - `citation_validator.py`
-   - `config.py`
-   - `requirements.txt`
-   - `README.md`
-4. Click **"Commit changes to main"**
-
-### 4. Monitor Deployment
-
-1. Go to the **"App"** tab to see your Space building
-2. Check the **"Logs"** section (click "See logs" if build fails)
-3. Wait for the build to complete (usually 2-5 minutes)
-4. Your app will be live at: `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME`
-
-## Troubleshooting
-
-### Build Fails
-- Check the logs for missing dependencies
-- Ensure all required files are uploaded
-- Verify `requirements.txt` has correct package names
-
-### Runtime Errors
-- Verify secrets are set correctly in Settings
-- Check logs for import errors or missing modules
-- Ensure your Qdrant instance is accessible
-
-### Out of Memory
-- Consider upgrading to a larger hardware tier
-- Optimize model loading and caching
-- Reduce `SOURCE_COUNT` in `rag_chat.py`
-
-## Updating Your Space
-
-To update your deployed app:
-
-```bash
-# Make changes to your local files
-# Then push updates
-git add .
-git commit -m "Update: describe your changes"
-git push
-```
-
-The Space will automatically rebuild with your changes.
-
-## Cost Considerations
-
-- **Hugging Face Space**: Free for CPU basic tier
-- **OpenAI API**: Pay per token (GPT-4o-mini is cost-effective)
-- **Qdrant Cloud**: Has free tier, pay for larger datasets
-- **Estimated cost**: ~$0.01-0.10 per query depending on usage
-
-## Security Notes
-
-- Never commit API keys to git (they should only be in Space Secrets)
-- Use `.gitignore` to exclude sensitive files
-- Regularly rotate API keys
-- Monitor API usage to prevent abuse
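The deleted guide's "Runtime Errors" section says to verify that secrets are set; on a Space those secrets surface as environment variables, so a startup check catches misconfiguration before the first query fails. A minimal sketch, using the three secret names from the doc (the helper itself is hypothetical, not part of the repo):

```python
import os

# Secret names from DEPLOYMENT.md, step 2
REQUIRED_SECRETS = ["QDRANT_URL", "QDRANT_API_KEY", "OPENAI_API_KEY"]

def missing_secrets(env=os.environ):
    """Return the names of required secrets that are absent or empty."""
    return [name for name in REQUIRED_SECRETS if not env.get(name)]
```

Calling this at app startup and printing the result gives an immediate, readable error in the Space logs instead of a stack trace from the first Qdrant or OpenAI call.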
TODO.txt CHANGED
@@ -1,7 +1,2 @@
-Technical:
-- Test source citation
-- Setup demo website
 - Post video on Linkedin & send outreach messages
-- have u ever wondered what to do about AI?
-
-- Fix the dates that trafilatura scrapes
+- have u ever wondered what to do about AI?
chunk_articles_cli.py CHANGED
@@ -1,4 +1,5 @@
 import json
+import os
 from llama_index.core.node_parser import SemanticSplitterNodeParser
 from llama_index.core.schema import Document
 from llama_index.embeddings.huggingface import HuggingFaceEmbedding
@@ -7,6 +8,8 @@ from config import MODEL_NAME
 BUFFER_SIZE = 3
 BREAKPOINT_PERCENTILE_THRESHOLD = 87
 NUMBER_OF_ARTICLES = 86
+INPUT_FOLDER = "extracted_content"
+OUTPUT_FILE = "chunks.jsonl"
 
 def load_articles(json_path="articles.json", n=None):
     """Load articles from JSON file. Optionally load only first N articles."""
@@ -25,7 +28,7 @@ def chunk_text_semantic(text, embed_model):
     nodes = splitter.get_nodes_from_documents([doc])
     return [node.text for node in nodes]
 
-def make_jsonl(articles, out_path="article_chunks.jsonl"):
+def make_jsonl(articles, out_path="chunks.jsonl"):
     """Create JSONL with semantic chunks from multiple articles."""
     print("Loading embedding model for semantic chunking...")
     embed_model = HuggingFaceEmbedding(model_name=MODEL_NAME)
@@ -44,6 +47,54 @@ def make_jsonl(articles, out_path="chunks.jsonl"):
             }
             f.write(json.dumps(record, ensure_ascii=False) + "\n")
 
+def chunk_from_json_files(input_folder=INPUT_FOLDER, output_file=OUTPUT_FILE):
+    """Load articles from JSON files in folder and chunk them to JSONL."""
+    if not os.path.exists(input_folder):
+        print(f"Input folder '{input_folder}' not found")
+        return
+
+    # Load all articles from JSON files
+    all_articles = []
+    json_files = [f for f in os.listdir(input_folder) if f.endswith('.json')]
+
+    if not json_files:
+        print(f"No JSON files found in {input_folder}")
+        return
+
+    for json_file in json_files:
+        json_path = os.path.join(input_folder, json_file)
+        with open(json_path, "r", encoding="utf-8") as f:
+            articles = json.load(f)
+        all_articles.extend(articles)
+        print(f"Loaded {len(articles)} articles from {json_file}")
+
+    if not all_articles:
+        print("No articles found to chunk")
+        return
+
+    print(f"\nTotal articles to chunk: {len(all_articles)}")
+    print("Loading embedding model for semantic chunking...")
+    embed_model = HuggingFaceEmbedding(model_name=MODEL_NAME)
+
+    chunk_count = 0
+    with open(output_file, "w", encoding="utf-8") as f:
+        for idx, article in enumerate(all_articles, 1):
+            print(f"Chunking ({idx}/{len(all_articles)}): {article['title']}")
+            chunks = chunk_text_semantic(article["text"], embed_model)
+            for i, chunk in enumerate(chunks, 1):
+                record = {
+                    "url": article["url"],
+                    "title": article["title"],
+                    "date": article.get("date"),
+                    "chunk_id": i,
+                    "text": chunk,
+                }
+                chunk_count += 1
+                f.write(json.dumps(record, ensure_ascii=False) + "\n")
+
+    print(f"\n✓ Created {chunk_count} chunks from {len(all_articles)} articles")
+    print(f"💾 Saved to {output_file}")
+
 def main():
     articles = load_articles(n=NUMBER_OF_ARTICLES)
     if not articles:
@@ -51,7 +102,7 @@ def main():
         return
 
     make_jsonl(articles)
-    print(f"Chunks from {len(articles)} articles written to article_chunks.jsonl")
+    print(f"Chunks from {len(articles)} articles written to chunks.jsonl")
 
 if __name__ == "__main__":
     main()
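The new `chunk_from_json_files` flow — read every `*.json` in a folder, flatten the article lists, chunk each text, and emit one JSONL record per chunk — is independent of the embedding model. A stdlib-only sketch of the same flow, with a naive blank-line splitter standing in for `SemanticSplitterNodeParser` (helper names are mine, not from the repo):

```python
import json
import os

def split_paragraphs(text):
    """Naive stand-in for semantic chunking: split on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def chunk_folder(input_folder, output_file, splitter=split_paragraphs):
    """Flatten article lists from every *.json file, write one JSONL record per chunk."""
    articles = []
    for name in sorted(os.listdir(input_folder)):
        if name.endswith(".json"):
            with open(os.path.join(input_folder, name), encoding="utf-8") as f:
                articles.extend(json.load(f))
    count = 0
    with open(output_file, "w", encoding="utf-8") as f:
        for article in articles:
            # chunk_id restarts at 1 per article, matching the repo's records
            for i, chunk in enumerate(splitter(article["text"]), 1):
                record = {"url": article["url"], "title": article["title"],
                          "date": article.get("date"), "chunk_id": i, "text": chunk}
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
                count += 1
    return count
```

Swapping `splitter` for a call into `chunk_text_semantic` recovers the real behaviour; the folder-flattening and record layout are the parts this commit actually changes.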
extract_articles_cli.py DELETED
@@ -1,87 +0,0 @@
1
- import requests, re, json
2
- import trafilatura
3
- from typing import List, Dict, Optional
4
-
5
- HEADERS = {"User-Agent": "RAG-80k/0.1 (+your-contact)"}
6
- NUMBER_OF_ARTICLES = 10
7
-
8
- def get_all_article_urls() -> List[str]:
9
- """Extract all article URLs from the sitemap."""
10
- sitemap_url = "https://80000hours.org/article-sitemap.xml"
11
- r = requests.get(sitemap_url, headers=HEADERS, timeout=20)
12
- r.raise_for_status()
13
-
14
- # Find all <loc> tags in the sitemap
15
- urls = re.findall(r"<loc>(.*?)</loc>", r.text)
16
- return urls
17
-
18
- def extract_article(url: str) -> Optional[Dict]:
19
- """Extract article content and metadata from a URL."""
20
- r = requests.get(url, headers=HEADERS, timeout=30)
21
- r.raise_for_status()
22
- data = trafilatura.extract(
23
- r.content,
24
- url=url,
25
- with_metadata=True,
26
- include_links=False,
27
- include_comments=False,
28
- include_formatting=False,
29
- output_format="json",
30
- )
31
- return json.loads(data) if data else None
32
-
33
- def extract_all_articles() -> List[Dict]:
34
- """Extract all articles from the sitemap."""
35
- urls = get_all_article_urls()
36
- print(f"Found {len(urls)} articles in sitemap")
37
-
38
- articles = []
39
- for i, url in enumerate(urls, 1):
40
- print(f"[{i}/{len(urls)}] Extracting: {url}")
41
- record = extract_article(url)
42
- if record and record.get("text"):
43
- articles.append({
44
- "url": url,
45
- "title": record.get("title", ""),
46
- "date": record.get("date"),
47
- "text": record.get("text", "").strip()
48
- })
49
- else:
50
- print(f" Failed to extract: {url}")
51
-
52
- print(f"Successfully extracted {len(articles)} articles")
53
- return articles
54
-
55
- def extract_first_n_articles(n: int) -> List[Dict]:
56
- """Extract the first N articles from the sitemap."""
57
- urls = get_all_article_urls()[:n]
58
- print(f"Extracting first {n} articles")
59
-
60
- articles = []
61
- for i, url in enumerate(urls, 1):
62
- print(f"[{i}/{len(urls)}] Extracting: {url}")
63
- record = extract_article(url)
64
- if record and record.get("text"):
65
- articles.append({
66
- "url": url,
67
- "title": record.get("title", ""),
68
- "date": record.get("date"),
69
- "text": record.get("text", "").strip()
70
- })
71
- else:
72
- print(f" Failed to extract: {url}")
73
-
74
- print(f"Successfully extracted {len(articles)} articles")
75
- return articles
76
-
77
- def main():
78
- articles = extract_all_articles()
79
-
80
- if articles:
81
- # Save to JSON file
82
- output_file = "articles.json"
83
- with open(output_file, "w", encoding="utf-8") as f:
84
- json.dump(articles, f, ensure_ascii=False, indent=2)
85
-
86
- if __name__ == "__main__":
87
- main()
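Both the deleted script and its replacement pull URLs out of sitemaps with a bare `<loc>` regex. The same extraction can be done with stdlib `xml.etree`, which also copes with the sitemap namespace; a sketch (function names are mine, not from the repo):

```python
import re
import xml.etree.ElementTree as ET

def locs_via_regex(xml_text):
    """The repo's approach: regex over the raw sitemap text."""
    return re.findall(r"<loc>(.*?)</loc>", xml_text)

def locs_via_etree(xml_text):
    """Namespace-aware parse: keep any element whose local name is 'loc'."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter()
            if el.tag.rsplit("}", 1)[-1] == "loc"]
```

For well-formed sitemaps the two agree; the parser version additionally refuses malformed XML instead of silently matching partial tags.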
extract_content_cli.py ADDED
@@ -0,0 +1,241 @@
+import requests, re, json
+import trafilatura
+from typing import List, Dict, Optional
+from time import sleep
+from dateutil import parser as date_parser
+from concurrent.futures import ThreadPoolExecutor, as_completed
+import random
+import os
+import threading
+import time
+from requests.adapters import HTTPAdapter
+from urllib3.util.retry import Retry
+
+# Parallel processing settings
+USE_PARALLEL = True
+MAX_WORKERS = 3
+
+# Rate limiting settings
+MIN_DELAY = 1.0
+MAX_DELAY = 3.0
+RATE_LOCK = threading.Lock()
+_next_request_time = 0.0
+
+# Output settings
+OUTPUT_FOLDER = "extracted_content"
+TEST_LIMIT = None
+
+# HTTP settings
+HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
+
+# All content sitemaps (excluding category/author which are just metadata)
+SITEMAPS = {
+    "ai_career_guide_pages": "https://80000hours.org/ai_career_guide_page-sitemap.xml",
+    # "articles": "https://80000hours.org/article-sitemap.xml",
+    # "career_guide_pages": "https://80000hours.org/careerguidepage-sitemap.xml",
+    "career_profiles": "https://80000hours.org/career_profile-sitemap.xml",
+    # "career_reports": "https://80000hours.org/career_report-sitemap.xml",
+    # "case_studies": "https://80000hours.org/case_study-sitemap.xml",
+    "posts": "https://80000hours.org/post-sitemap.xml",
+    "problem_profiles": "https://80000hours.org/problem_profile-sitemap.xml",
+    # "podcasts": "https://80000hours.org/podcast-sitemap.xml",
+    # "podcast_after_hours": "https://80000hours.org/podcast_after_hours-sitemap.xml",
+    "skill_sets": "https://80000hours.org/skill_set-sitemap.xml",
+    # "videos": "https://80000hours.org/video-sitemap.xml",
+}
+
+# Thread-local session with retries and backoff
+thread_local = threading.local()
+
+def get_session():
+    """Get or create a thread-local requests session with retries and connection pooling."""
+    s = getattr(thread_local, "session", None)
+    if s is None:
+        s = requests.Session()
+        s.headers.update(HEADERS)
+        retry = Retry(
+            total=5, connect=3, read=3, status=3,
+            status_forcelist=[429, 500, 502, 503, 504],
+            allowed_methods={"GET", "HEAD"},
+            backoff_factor=0.8,
+            raise_on_status=False,
+            respect_retry_after_header=True,
+        )
+        adapter = HTTPAdapter(
+            max_retries=retry,
+            pool_connections=MAX_WORKERS * 2,
+            pool_maxsize=MAX_WORKERS * 2,
+        )
+        s.mount("http://", adapter)
+        s.mount("https://", adapter)
+        thread_local.session = s
+    return s
+
+def throttle():
+    """Enforce rate limiting across all threads."""
+    global _next_request_time
+    delay = random.uniform(MIN_DELAY, MAX_DELAY)
+    with RATE_LOCK:
+        now = time.monotonic()
+        wait = max(0.0, _next_request_time - now)
+        _next_request_time = max(now, _next_request_time) + delay
+    if wait > 0:
+        time.sleep(wait)
+
+def get_urls_from_sitemap(sitemap_url: str) -> List[str]:
+    """Extract all URLs from a sitemap."""
+    throttle()
+    r = get_session().get(sitemap_url, timeout=20)
+    r.raise_for_status()
+    return re.findall(r"<loc>(.*?)</loc>", r.text)
+
+def parse_custom_date(html_content: str) -> Optional[str]:
+    """
+    Extract and parse publication date from 80,000 Hours HTML content.
+
+    Priority:
+    1. "Updated [date]" if present
+    2. "Published [date]" otherwise
+
+    Returns date in YYYY-MM-DD format, or None if not found.
+    """
+    # Date pattern: month + optional day (with ordinal) + year
+    date_pattern = r'([A-Za-z]+\s+(?:\d{1,2}(?:st|nd|rd|th)?,?\s+)?\d{4})'
+
+    # Try "Updated" first, then "Published"
+    for keyword in ['Updated', 'Published']:
+        match = re.search(f'{keyword}\\s+{date_pattern}', html_content, re.IGNORECASE)
+        if match:
+            try:
+                parsed_date = date_parser.parse(match.group(1), fuzzy=True)
+                return parsed_date.strftime('%Y-%m-%d')
+            except:
+                pass
+
+    return None
+
+def extract_content(url: str) -> Optional[Dict]:
+    """Extract content and metadata from a URL."""
+    try:
+        throttle()
+        r = get_session().get(url, timeout=30)
+        r.raise_for_status()
+    except Exception as e:
+        print(f"  ❌ Request failed: {e}")
+        return None
+
+    data = trafilatura.extract(
+        r.content, url=url, with_metadata=True,
+        include_links=False, include_comments=False,
+        include_formatting=False, output_format="json"
+    )
+
+    if not data:
+        return None
+
+    result = json.loads(data)
+    if custom_date := parse_custom_date(r.text):
+        result['date'] = custom_date
+
+    return result
+
+
+def process_record(record: Optional[Dict], url: str, sitemap_name: str) -> Optional[Dict]:
+    """Convert extraction record to final output format."""
+    if not (record and record.get("text")):
+        return None
+    return {
+        "url": url,
+        "title": record.get("title", ""),
+        "date": record.get("date"),
+        "author": record.get("author"),
+        "text": record.get("text", "").strip(),
+        "content_type": sitemap_name
+    }
+
+def handle_extraction_result(record: Optional[Dict], url: str, sitemap_name: str, index: int, total: int, items: List[Dict]) -> None:
+    """Process extraction result and add to items list if successful."""
+    try:
+        result = process_record(record, url, sitemap_name)
+        if result:
+            items.append(result)
+        status = "✓" if result else "⚠️ Failed:"
+        print(f"[{index}/{total}] {status} {url}")
+    except Exception as e:
+        print(f"[{index}/{total}] ❌ {url}: {e}")
+
+def extract_from_sitemap(sitemap_name: str, sitemap_url: str, limit: int = None, parallel: bool = True, max_workers: int = 5) -> List[Dict]:
+    """Extract content from a sitemap using either parallel or sequential processing."""
+    print(f"\n{'='*80}")
+    print(f"Processing {sitemap_name}...")
+    print(f"{'='*80}")
+
+    urls = get_urls_from_sitemap(sitemap_url)
+    print(f"Found {len(urls)} URLs in sitemap")
+
+    if limit:
+        urls = urls[:limit]
+        print(f"Limiting to first {limit} URL(s)")
+
+    items = []
+
+    if parallel and len(urls) > 1:
+        print(f"🚀 Using parallel processing with {max_workers} workers")
+        completed = 0
+
+        with ThreadPoolExecutor(max_workers=max_workers) as executor:
+            # Submit all tasks
+            future_to_url = {
+                executor.submit(extract_content, url): url
+                for url in urls
+            }
+
+            # Process completed tasks
+            for future in as_completed(future_to_url):
+                url = future_to_url[future]
+                completed += 1
+                handle_extraction_result(future.result(), url, sitemap_name, completed, len(urls), items)
+    else:
+        print("📝 Using sequential processing")
+        for i, url in enumerate(urls, 1):
+            handle_extraction_result(extract_content(url), url, sitemap_name, i, len(urls), items)
+
+    print(f"✓ Successfully extracted {len(items)}/{len(urls)} items")
+    return items
+
+def extract_all_to_json():
+    """Extract all content from sitemaps and save to individual JSON files."""
+    os.makedirs(OUTPUT_FOLDER, exist_ok=True)
+
+    print("Starting 80,000 Hours content extraction...")
+    print(f"Total content types: {len(SITEMAPS)}")
+    print(f"Output folder: {OUTPUT_FOLDER}/")
+    if TEST_LIMIT:
+        print(f"⚠️ TEST MODE: Extracting only {TEST_LIMIT} item(s) per content type\n")
+
+    all_stats = {}
+    for content_type, sitemap_url in SITEMAPS.items():
+        items = extract_from_sitemap(
+            content_type, sitemap_url,
+            limit=TEST_LIMIT, parallel=USE_PARALLEL, max_workers=MAX_WORKERS
+        )
+        all_stats[content_type] = len(items)
+
+        if items:
+            output_file = os.path.join(OUTPUT_FOLDER, f"{content_type}.json")
+            with open(output_file, "w", encoding="utf-8") as f:
+                json.dump(items, f, ensure_ascii=False, indent=2)
+            print(f"💾 Saved to {output_file}")
+
+    print(f"\n{'='*80}\nEXTRACTION COMPLETE\n{'='*80}")
+    print(f"Total items extracted: {sum(all_stats.values())}")
+    print("\nBreakdown by content type:")
+    for content_type, count in sorted(all_stats.items(), key=lambda x: x[1], reverse=True):
+        print(f"  {content_type:25s}: {count:4d} items → {OUTPUT_FOLDER}/{content_type}.json")
+
+def main():
+    extract_all_to_json()
+
+if __name__ == "__main__":
+    main()
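The `parse_custom_date` helper above prefers an "Updated …" byline over "Published …" and normalizes to `YYYY-MM-DD`. A stdlib-only variant of the same logic (no `dateutil`; ordinal suffixes stripped by hand; function name is mine) behaves like this:

```python
import re
from datetime import datetime

# Same pattern as the repo: month + optional day (with ordinal) + year
DATE_PATTERN = r'([A-Za-z]+\s+(?:\d{1,2}(?:st|nd|rd|th)?,?\s+)?\d{4})'

def parse_byline_date(html):
    """Return YYYY-MM-DD from 'Updated <date>' (preferred) or 'Published <date>'."""
    for keyword in ('Updated', 'Published'):
        m = re.search(keyword + r'\s+' + DATE_PATTERN, html, re.IGNORECASE)
        if not m:
            continue
        # "June 1st, 2023" -> "June 1, 2023"
        text = re.sub(r'(?<=\d)(st|nd|rd|th)', '', m.group(1))
        for fmt in ('%B %d, %Y', '%B %d %Y', '%B %Y'):
            try:
                return datetime.strptime(text, fmt).strftime('%Y-%m-%d')
            except ValueError:
                pass
    return None
```

The repo's `dateutil.parser.parse(..., fuzzy=True)` is more forgiving (abbreviated months, stray words), which is why the script takes the extra dependency.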
pipeline.py ADDED
@@ -0,0 +1,37 @@
+"""Combined pipeline: Extract → Chunk → Upload to Qdrant (additive)"""
+from extract_content_cli import extract_all_to_json
+from chunk_articles_cli import chunk_from_json_files
+from upload_to_qdrant_cli import upload_chunks_additive
+
+
+def main():
+    print("="*80)
+    print("80,000 HOURS RAG PIPELINE")
+    print("Extract → Chunk → Upload (Additive)")
+    print("="*80)
+
+    # Step 1: Extract to individual JSON files
+    print("\n" + "="*80)
+    print("STEP 1: EXTRACTING CONTENT")
+    print("="*80)
+    extract_all_to_json()
+
+    # Step 2: Chunk from JSON files
+    print("\n" + "="*80)
+    print("STEP 2: CHUNKING ARTICLES")
+    print("="*80)
+    chunk_from_json_files()
+
+    # Step 3: Upload to Qdrant from chunks file (additive)
+    print("\n" + "="*80)
+    print("STEP 3: UPLOADING TO QDRANT (ADDITIVE)")
+    print("="*80)
+    upload_chunks_additive()
+
+    print("\n" + "="*80)
+    print("PIPELINE COMPLETE ✓")
+    print("="*80)
+
+
+if __name__ == "__main__":
+    main()
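The `throttle()` in `extract_content_cli.py` reserves a global "next request time" under a lock, so parallel workers still respect one site-wide delay between requests. The same scheduling logic, made deterministic with an injectable clock so it can be reasoned about without sleeping (this class is a sketch of mine, not in the repo):

```python
import threading

class Throttle:
    """Cross-thread rate limiter: each call reserves the next request slot."""
    def __init__(self, delay, clock):
        self.delay = delay    # fixed gap between requests (the repo randomizes it)
        self.clock = clock    # injectable monotonic clock, for testability
        self.lock = threading.Lock()
        self.next_time = 0.0

    def wait_time(self):
        """Return how long the caller should sleep before issuing its request."""
        with self.lock:
            now = self.clock()
            wait = max(0.0, self.next_time - now)
            # Reserve the slot after ours, even if we haven't slept yet
            self.next_time = max(now, self.next_time) + self.delay
        return wait
```

The key property is that the reservation happens inside the lock while the sleep happens outside it, so N workers arriving at once are spaced `delay` apart instead of all sleeping the same amount.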
requirements.txt CHANGED
@@ -7,6 +7,8 @@ requests>=2.31.0
 gradio>=4.0.0
 rapidfuzz>=3.0.0
 fuzzysearch>=0.7.3
+trafilatura>=1.6.0
+python-dateutil>=2.8.0
 --extra-index-url https://download.pytorch.org/whl/cpu
 torch>=2.0.0
 
upload_to_qdrant_cli.py CHANGED
@@ -8,7 +8,7 @@ from config import MODEL_NAME, COLLECTION_NAME, EMBEDDING_DIM
 
 load_dotenv()
 
-def load_chunks(jsonl_path="article_chunks.jsonl"):
+def load_chunks(jsonl_path="chunks.jsonl"):
     chunks = []
     with open(jsonl_path, "r", encoding="utf-8") as f:
         for line in f:
@@ -64,9 +64,16 @@ def create_points(chunks, embeddings):
         points.append(point)
     return points
 
-def upload_points(client, points, collection_name=COLLECTION_NAME):
-    print(f"Uploading {len(points)} points...")
-    client.upsert(collection_name=collection_name, points=points)
+def upload_points(client, points, collection_name=COLLECTION_NAME, batch_size=100):
+    print(f"Uploading {len(points)} points in batches of {batch_size}...")
+    total_batches = (len(points) + batch_size - 1) // batch_size
+
+    for i in range(0, len(points), batch_size):
+        batch = points[i:i + batch_size]
+        batch_num = (i // batch_size) + 1
+        print(f"  Batch {batch_num}/{total_batches}: Uploading {len(batch)} points...")
+        client.upsert(collection_name=collection_name, points=batch)
+
     print(f"✓ Uploaded {len(points)} chunks to collection '{collection_name}'")
 
 def verify_upload(client, collection_name=COLLECTION_NAME):
@@ -74,21 +81,63 @@ def verify_upload(client, collection_name=COLLECTION_NAME):
     print(f"Collection now has {collection_info.points_count} points")
     return collection_info.points_count
 
-def main():
-    chunks = load_chunks()
-    print(f"Found {len(chunks)} chunks")
+def ensure_collection_exists(client, collection_name=COLLECTION_NAME, embedding_dim=EMBEDDING_DIM):
+    """Ensure collection exists, create if it doesn't. Returns starting ID for new points."""
+    if not client.collection_exists(collection_name):
+        print(f"Collection '{collection_name}' doesn't exist. Creating...")
+        client.create_collection(
+            collection_name=collection_name,
+            vectors_config=VectorParams(size=embedding_dim, distance=Distance.COSINE),
+        )
+        return 0
+    else:
+        collection_info = client.get_collection(collection_name)
+        point_count = collection_info.points_count
+        print(f"Collection '{collection_name}' exists with {point_count} points")
+        return point_count
+
+def offset_point_ids(points, start_id):
+    """Update point IDs to start from a given offset."""
+    print(f"Setting point IDs starting from {start_id}...")
+    for i, point in enumerate(points):
+        point.id = start_id + i
+    return points
+
+def print_upload_summary(start_id, added_count, new_count):
+    """Print upload summary statistics."""
+    print(f"\n✓ Upload complete!")
+    print(f"  Previous: {start_id} points")
+    print(f"  Added: {added_count} points")
+    print(f"  Total now: {new_count} points")
+
+def upload_chunks_additive(chunks_file="chunks.jsonl"):
+    """Upload chunks to Qdrant additively (preserves existing data)."""
+    if not os.path.exists(chunks_file):
+        print(f"Chunks file '{chunks_file}' not found")
+        return
 
-    client = create_qdrant_client()
-    model = load_embedding_model()
+    chunks = load_chunks(chunks_file)
+    print(f"Found {len(chunks)} chunks")
 
-    if not create_collection(client):
+    if not chunks:
+        print("No chunks to upload")
         return
 
+    client = create_qdrant_client()
+    start_id = ensure_collection_exists(client)
+
+    model = load_embedding_model()
     embeddings = generate_embeddings(model, chunks)
     points = create_points(chunks, embeddings)
+    points = offset_point_ids(points, start_id)
+
     upload_points(client, points)
-    verify_upload(client)
+
+    new_count = verify_upload(client)
+    print_upload_summary(start_id, len(points), new_count)
 
-if __name__ == "__main__":
-    main()
+def main():
+    upload_chunks_additive()
 
+if __name__ == "__main__":
+    main()
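The batched `upload_points` introduced above relies on ceil division for the batch count and slice windows for the batches themselves. The arithmetic in isolation, as a generic helper (my sketch, not from the repo):

```python
def batches(items, batch_size=100):
    """Yield (batch_num, total_batches, slice) with at most batch_size items each."""
    total = (len(items) + batch_size - 1) // batch_size  # ceil division
    for i in range(0, len(items), batch_size):
        yield (i // batch_size) + 1, total, items[i:i + batch_size]
```

Only the final batch can be short, and the slices cover the input exactly once — which is what keeps each `client.upsert` call small enough for Qdrant Cloud's request limits while uploading everything.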