Spaces · Sleeping

Ryan committed · Commit 99a81ef · 1 parent: 60c7f79

add all content to vector db

Files changed:
- DEPLOYMENT.md           +0 -123
- TODO.txt                +1 -6
- chunk_articles_cli.py   +53 -2
- extract_articles_cli.py +0 -87
- extract_content_cli.py  +241 -0
- pipeline.py             +37 -0
- requirements.txt        +2 -0
- upload_to_qdrant_cli.py +62 -13
DEPLOYMENT.md
DELETED
@@ -1,123 +0,0 @@

# Deploying to Hugging Face Spaces

## Prerequisites

1. A Hugging Face account (sign up at https://huggingface.co/)
2. Qdrant Cloud instance with your data uploaded
3. OpenAI API key

## Step-by-Step Deployment

### 1. Create a New Space

1. Go to https://huggingface.co/spaces
2. Click **"Create new Space"**
3. Fill in the details:
   - **Owner**: Your username or organization
   - **Space name**: `80k-rag-qa` (or your preferred name)
   - **License**: Choose appropriate license (e.g., MIT)
   - **Space SDK**: Select **"Gradio"**
   - **Hardware**: Select **"CPU basic"** (free tier) or upgrade if needed
   - **Visibility**: Choose "Public" or "Private"
4. Click **"Create Space"**

### 2. Configure Secrets

Before uploading code, set up your API keys:

1. Go to your Space's page
2. Click **"Settings"** → **"Variables and Secrets"**
3. Click **"New Secret"** for each of the following:
   - **Name**: `QDRANT_URL` | **Value**: Your Qdrant instance URL
   - **Name**: `QDRANT_API_KEY` | **Value**: Your Qdrant API key
   - **Name**: `OPENAI_API_KEY` | **Value**: Your OpenAI API key
4. Click **"Save"** for each secret

### 3. Upload Your Code

**Option A: Using Git (Recommended)**

```bash
# Clone your new Space
git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
cd YOUR_SPACE_NAME

# Copy necessary files from this project
cp /home/ryan/Documents/80k_rag/app.py .
cp /home/ryan/Documents/80k_rag/rag_chat.py .
cp /home/ryan/Documents/80k_rag/citation_validator.py .
cp /home/ryan/Documents/80k_rag/config.py .
cp /home/ryan/Documents/80k_rag/requirements.txt .
cp /home/ryan/Documents/80k_rag/README.md .

# Add, commit, and push
git add .
git commit -m "Initial deployment"
git push
```

**Option B: Using the Web Interface**

1. Go to your Space → **"Files and versions"** tab
2. Click **"Add file"** → **"Upload files"**
3. Upload these files:
   - `app.py`
   - `rag_chat.py`
   - `citation_validator.py`
   - `config.py`
   - `requirements.txt`
   - `README.md`
4. Click **"Commit changes to main"**

### 4. Monitor Deployment

1. Go to the **"App"** tab to see your Space building
2. Check the **"Logs"** section (click "See logs" if build fails)
3. Wait for the build to complete (usually 2-5 minutes)
4. Your app will be live at: `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME`

## Troubleshooting

### Build Fails
- Check the logs for missing dependencies
- Ensure all required files are uploaded
- Verify `requirements.txt` has correct package names

### Runtime Errors
- Verify secrets are set correctly in Settings
- Check logs for import errors or missing modules
- Ensure your Qdrant instance is accessible

### Out of Memory
- Consider upgrading to a larger hardware tier
- Optimize model loading and caching
- Reduce `SOURCE_COUNT` in `rag_chat.py`

## Updating Your Space

To update your deployed app:

```bash
# Make changes to your local files
# Then push updates
git add .
git commit -m "Update: describe your changes"
git push
```

The Space will automatically rebuild with your changes.

## Cost Considerations

- **Hugging Face Space**: Free for CPU basic tier
- **OpenAI API**: Pay per token (GPT-4o-mini is cost-effective)
- **Qdrant Cloud**: Has free tier, pay for larger datasets
- **Estimated cost**: ~$0.01-0.10 per query depending on usage

## Security Notes

- Never commit API keys to git (they should only be in Space Secrets)
- Use `.gitignore` to exclude sensitive files
- Regularly rotate API keys
- Monitor API usage to prevent abuse
TODO.txt
CHANGED
@@ -1,7 +1,2 @@
-Technical:
-- Test source citation
-- Setup demo website
 - Post video on Linkedin & send outreach messages
-- have u ever wondered what to do about AI?
-
-- Fix the dates that trafilatura scrapes
+- have u ever wondered what to do about AI?
chunk_articles_cli.py
CHANGED

@@ -1,4 +1,5 @@
 import json
+import os
 from llama_index.core.node_parser import SemanticSplitterNodeParser
 from llama_index.core.schema import Document
 from llama_index.embeddings.huggingface import HuggingFaceEmbedding
@@ -7,6 +8,8 @@ from config import MODEL_NAME
 BUFFER_SIZE = 3
 BREAKPOINT_PERCENTILE_THRESHOLD = 87
 NUMBER_OF_ARTICLES = 86
+INPUT_FOLDER = "extracted_content"
+OUTPUT_FILE = "chunks.jsonl"

 def load_articles(json_path="articles.json", n=None):
     """Load articles from JSON file. Optionally load only first N articles."""
@@ -25,7 +28,7 @@ def chunk_text_semantic(text, embed_model):
     nodes = splitter.get_nodes_from_documents([doc])
     return [node.text for node in nodes]

-def make_jsonl(articles, out_path="article_chunks.jsonl"):
+def make_jsonl(articles, out_path="chunks.jsonl"):
     """Create JSONL with semantic chunks from multiple articles."""
     print("Loading embedding model for semantic chunking...")
     embed_model = HuggingFaceEmbedding(model_name=MODEL_NAME)
@@ -44,6 +47,54 @@ def make_jsonl(articles, out_path="article_chunks.jsonl"):
         }
         f.write(json.dumps(record, ensure_ascii=False) + "\n")

+def chunk_from_json_files(input_folder=INPUT_FOLDER, output_file=OUTPUT_FILE):
+    """Load articles from JSON files in folder and chunk them to JSONL."""
+    if not os.path.exists(input_folder):
+        print(f"Input folder '{input_folder}' not found")
+        return
+
+    # Load all articles from JSON files
+    all_articles = []
+    json_files = [f for f in os.listdir(input_folder) if f.endswith('.json')]
+
+    if not json_files:
+        print(f"No JSON files found in {input_folder}")
+        return
+
+    for json_file in json_files:
+        json_path = os.path.join(input_folder, json_file)
+        with open(json_path, "r", encoding="utf-8") as f:
+            articles = json.load(f)
+        all_articles.extend(articles)
+        print(f"Loaded {len(articles)} articles from {json_file}")
+
+    if not all_articles:
+        print("No articles found to chunk")
+        return
+
+    print(f"\nTotal articles to chunk: {len(all_articles)}")
+    print("Loading embedding model for semantic chunking...")
+    embed_model = HuggingFaceEmbedding(model_name=MODEL_NAME)
+
+    chunk_count = 0
+    with open(output_file, "w", encoding="utf-8") as f:
+        for idx, article in enumerate(all_articles, 1):
+            print(f"Chunking ({idx}/{len(all_articles)}): {article['title']}")
+            chunks = chunk_text_semantic(article["text"], embed_model)
+            for i, chunk in enumerate(chunks, 1):
+                record = {
+                    "url": article["url"],
+                    "title": article["title"],
+                    "date": article.get("date"),
+                    "chunk_id": i,
+                    "text": chunk,
+                }
+                chunk_count += 1
+                f.write(json.dumps(record, ensure_ascii=False) + "\n")
+
+    print(f"\n✓ Created {chunk_count} chunks from {len(all_articles)} articles")
+    print(f"💾 Saved to {output_file}")
+
 def main():
     articles = load_articles(n=NUMBER_OF_ARTICLES)
     if not articles:
@@ -51,7 +102,7 @@ def main():
         return

     make_jsonl(articles)
-    print(f"Chunks from {len(articles)} articles written to article_chunks.jsonl")
+    print(f"Chunks from {len(articles)} articles written to chunks.jsonl")

 if __name__ == "__main__":
     main()
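Both `make_jsonl` and the new `chunk_from_json_files` write one JSON object per line with `url`, `title`, `date`, `chunk_id`, and `text` keys. A minimal sketch of reading such a JSONL file back, assuming only that record shape; the `read_chunks` helper is illustrative and not part of the repo:

```python
import json

def read_chunks(jsonl_path):
    # One JSON object per line, as written by make_jsonl /
    # chunk_from_json_files above.
    chunks = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                chunks.append(json.loads(line))
    return chunks
```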
extract_articles_cli.py
DELETED
@@ -1,87 +0,0 @@

import requests, re, json
import trafilatura
from typing import List, Dict, Optional

HEADERS = {"User-Agent": "RAG-80k/0.1 (+your-contact)"}
NUMBER_OF_ARTICLES = 10

def get_all_article_urls() -> List[str]:
    """Extract all article URLs from the sitemap."""
    sitemap_url = "https://80000hours.org/article-sitemap.xml"
    r = requests.get(sitemap_url, headers=HEADERS, timeout=20)
    r.raise_for_status()

    # Find all <loc> tags in the sitemap
    urls = re.findall(r"<loc>(.*?)</loc>", r.text)
    return urls

def extract_article(url: str) -> Optional[Dict]:
    """Extract article content and metadata from a URL."""
    r = requests.get(url, headers=HEADERS, timeout=30)
    r.raise_for_status()
    data = trafilatura.extract(
        r.content,
        url=url,
        with_metadata=True,
        include_links=False,
        include_comments=False,
        include_formatting=False,
        output_format="json",
    )
    return json.loads(data) if data else None

def extract_all_articles() -> List[Dict]:
    """Extract all articles from the sitemap."""
    urls = get_all_article_urls()
    print(f"Found {len(urls)} articles in sitemap")

    articles = []
    for i, url in enumerate(urls, 1):
        print(f"[{i}/{len(urls)}] Extracting: {url}")
        record = extract_article(url)
        if record and record.get("text"):
            articles.append({
                "url": url,
                "title": record.get("title", ""),
                "date": record.get("date"),
                "text": record.get("text", "").strip()
            })
        else:
            print(f"  Failed to extract: {url}")

    print(f"Successfully extracted {len(articles)} articles")
    return articles

def extract_first_n_articles(n: int) -> List[Dict]:
    """Extract the first N articles from the sitemap."""
    urls = get_all_article_urls()[:n]
    print(f"Extracting first {n} articles")

    articles = []
    for i, url in enumerate(urls, 1):
        print(f"[{i}/{len(urls)}] Extracting: {url}")
        record = extract_article(url)
        if record and record.get("text"):
            articles.append({
                "url": url,
                "title": record.get("title", ""),
                "date": record.get("date"),
                "text": record.get("text", "").strip()
            })
        else:
            print(f"  Failed to extract: {url}")

    print(f"Successfully extracted {len(articles)} articles")
    return articles

def main():
    articles = extract_all_articles()

    if articles:
        # Save to JSON file
        output_file = "articles.json"
        with open(output_file, "w", encoding="utf-8") as f:
            json.dump(articles, f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    main()
extract_content_cli.py
ADDED
@@ -0,0 +1,241 @@

import requests, re, json
import trafilatura
from typing import List, Dict, Optional
from time import sleep
from dateutil import parser as date_parser
from concurrent.futures import ThreadPoolExecutor, as_completed
import random
import os
import threading
import time
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Parallel processing settings
USE_PARALLEL = True
MAX_WORKERS = 3

# Rate limiting settings
MIN_DELAY = 1.0
MAX_DELAY = 3.0
RATE_LOCK = threading.Lock()
_next_request_time = 0.0

# Output settings
OUTPUT_FOLDER = "extracted_content"
TEST_LIMIT = None

# HTTP settings
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}

# All content sitemaps (excluding category/author which are just metadata)
SITEMAPS = {
    "ai_career_guide_pages": "https://80000hours.org/ai_career_guide_page-sitemap.xml",
    # "articles": "https://80000hours.org/article-sitemap.xml",
    # "career_guide_pages": "https://80000hours.org/careerguidepage-sitemap.xml",
    "career_profiles": "https://80000hours.org/career_profile-sitemap.xml",
    # "career_reports": "https://80000hours.org/career_report-sitemap.xml",
    # "case_studies": "https://80000hours.org/case_study-sitemap.xml",
    "posts": "https://80000hours.org/post-sitemap.xml",
    "problem_profiles": "https://80000hours.org/problem_profile-sitemap.xml",
    # "podcasts": "https://80000hours.org/podcast-sitemap.xml",
    # "podcast_after_hours": "https://80000hours.org/podcast_after_hours-sitemap.xml",
    "skill_sets": "https://80000hours.org/skill_set-sitemap.xml",
    # "videos": "https://80000hours.org/video-sitemap.xml",
}

# Thread-local session with retries and backoff
thread_local = threading.local()

def get_session():
    """Get or create a thread-local requests session with retries and connection pooling."""
    s = getattr(thread_local, "session", None)
    if s is None:
        s = requests.Session()
        s.headers.update(HEADERS)
        retry = Retry(
            total=5, connect=3, read=3, status=3,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods={"GET", "HEAD"},
            backoff_factor=0.8,
            raise_on_status=False,
            respect_retry_after_header=True,
        )
        adapter = HTTPAdapter(
            max_retries=retry,
            pool_connections=MAX_WORKERS * 2,
            pool_maxsize=MAX_WORKERS * 2,
        )
        s.mount("http://", adapter)
        s.mount("https://", adapter)
        thread_local.session = s
    return s

def throttle():
    """Enforce rate limiting across all threads."""
    global _next_request_time
    delay = random.uniform(MIN_DELAY, MAX_DELAY)
    with RATE_LOCK:
        now = time.monotonic()
        wait = max(0.0, _next_request_time - now)
        _next_request_time = max(now, _next_request_time) + delay
    if wait > 0:
        time.sleep(wait)

def get_urls_from_sitemap(sitemap_url: str) -> List[str]:
    """Extract all URLs from a sitemap."""
    throttle()
    r = get_session().get(sitemap_url, timeout=20)
    r.raise_for_status()
    return re.findall(r"<loc>(.*?)</loc>", r.text)

def parse_custom_date(html_content: str) -> Optional[str]:
    """
    Extract and parse publication date from 80,000 Hours HTML content.

    Priority:
    1. "Updated [date]" if present
    2. "Published [date]" otherwise

    Returns date in YYYY-MM-DD format, or None if not found.
    """
    # Date pattern: month + optional day (with ordinal) + year
    date_pattern = r'([A-Za-z]+\s+(?:\d{1,2}(?:st|nd|rd|th)?,?\s+)?\d{4})'

    # Try "Updated" first, then "Published"
    for keyword in ['Updated', 'Published']:
        match = re.search(f'{keyword}\\s+{date_pattern}', html_content, re.IGNORECASE)
        if match:
            try:
                parsed_date = date_parser.parse(match.group(1), fuzzy=True)
                return parsed_date.strftime('%Y-%m-%d')
            except:
                pass

    return None

def extract_content(url: str) -> Optional[Dict]:
    """Extract content and metadata from a URL."""
    try:
        throttle()
        r = get_session().get(url, timeout=30)
        r.raise_for_status()
    except Exception as e:
        print(f"  ❌ Request failed: {e}")
        return None

    data = trafilatura.extract(
        r.content, url=url, with_metadata=True,
        include_links=False, include_comments=False,
        include_formatting=False, output_format="json"
    )

    if not data:
        return None

    result = json.loads(data)
    if custom_date := parse_custom_date(r.text):
        result['date'] = custom_date

    return result


def process_record(record: Optional[Dict], url: str, sitemap_name: str) -> Optional[Dict]:
    """Convert extraction record to final output format."""
    if not (record and record.get("text")):
        return None
    return {
        "url": url,
        "title": record.get("title", ""),
        "date": record.get("date"),
        "author": record.get("author"),
        "text": record.get("text", "").strip(),
        "content_type": sitemap_name
    }

def handle_extraction_result(record: Optional[Dict], url: str, sitemap_name: str, index: int, total: int, items: List[Dict]) -> None:
    """Process extraction result and add to items list if successful."""
    try:
        result = process_record(record, url, sitemap_name)
        if result:
            items.append(result)
        status = "✓" if result else "⚠️ Failed:"
        print(f"[{index}/{total}] {status} {url}")
    except Exception as e:
        print(f"[{index}/{total}] ❌ {url}: {e}")

def extract_from_sitemap(sitemap_name: str, sitemap_url: str, limit: int = None, parallel: bool = True, max_workers: int = 5) -> List[Dict]:
    """Extract content from a sitemap using either parallel or sequential processing."""
    print(f"\n{'='*80}")
    print(f"Processing {sitemap_name}...")
    print(f"{'='*80}")

    urls = get_urls_from_sitemap(sitemap_url)
    print(f"Found {len(urls)} URLs in sitemap")

    if limit:
        urls = urls[:limit]
        print(f"Limiting to first {limit} URL(s)")

    items = []

    if parallel and len(urls) > 1:
        print(f"🚀 Using parallel processing with {max_workers} workers")
        completed = 0

        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            # Submit all tasks
            future_to_url = {
                executor.submit(extract_content, url): url
                for url in urls
            }

            # Process completed tasks
            for future in as_completed(future_to_url):
                url = future_to_url[future]
                completed += 1
                handle_extraction_result(future.result(), url, sitemap_name, completed, len(urls), items)
    else:
        print("📝 Using sequential processing")
        for i, url in enumerate(urls, 1):
            handle_extraction_result(extract_content(url), url, sitemap_name, i, len(urls), items)

    print(f"✓ Successfully extracted {len(items)}/{len(urls)} items")
    return items

def extract_all_to_json():
    """Extract all content from sitemaps and save to individual JSON files."""
    os.makedirs(OUTPUT_FOLDER, exist_ok=True)

    print("Starting 80,000 Hours content extraction...")
    print(f"Total content types: {len(SITEMAPS)}")
    print(f"Output folder: {OUTPUT_FOLDER}/")
    if TEST_LIMIT:
        print(f"⚠️ TEST MODE: Extracting only {TEST_LIMIT} item(s) per content type\n")

    all_stats = {}
    for content_type, sitemap_url in SITEMAPS.items():
        items = extract_from_sitemap(
            content_type, sitemap_url,
            limit=TEST_LIMIT, parallel=USE_PARALLEL, max_workers=MAX_WORKERS
        )
        all_stats[content_type] = len(items)

        if items:
            output_file = os.path.join(OUTPUT_FOLDER, f"{content_type}.json")
            with open(output_file, "w", encoding="utf-8") as f:
                json.dump(items, f, ensure_ascii=False, indent=2)
            print(f"💾 Saved to {output_file}")

    print(f"\n{'='*80}\nEXTRACTION COMPLETE\n{'='*80}")
    print(f"Total items extracted: {sum(all_stats.values())}")
    print("\nBreakdown by content type:")
    for content_type, count in sorted(all_stats.items(), key=lambda x: x[1], reverse=True):
        print(f"  {content_type:25s}: {count:4d} items → {OUTPUT_FOLDER}/{content_type}.json")

def main():
    extract_all_to_json()

if __name__ == "__main__":
    main()
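The `date_pattern` regex in `parse_custom_date` above matches a month name, an optional ordinal day, and a year following an "Updated"/"Published" label. A standalone check of that pattern (the regex is copied from the file; the `find_labeled_date` wrapper and sample strings are made up, and the dateutil parsing step is omitted):

```python
import re

# Regex copied verbatim from parse_custom_date in extract_content_cli.py.
date_pattern = r'([A-Za-z]+\s+(?:\d{1,2}(?:st|nd|rd|th)?,?\s+)?\d{4})'

def find_labeled_date(html, keyword):
    # Same search as parse_custom_date, minus the dateutil parsing step.
    m = re.search(f'{keyword}\\s+{date_pattern}', html, re.IGNORECASE)
    return m.group(1) if m else None
```

The day component is optional, so both "Updated March 3rd, 2024" and "Published January 2019" styles are captured.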
pipeline.py
ADDED
@@ -0,0 +1,37 @@

"""Combined pipeline: Extract → Chunk → Upload to Qdrant (additive)"""
from extract_content_cli import extract_all_to_json
from chunk_articles_cli import chunk_from_json_files
from upload_to_qdrant_cli import upload_chunks_additive


def main():
    print("="*80)
    print("80,000 HOURS RAG PIPELINE")
    print("Extract → Chunk → Upload (Additive)")
    print("="*80)

    # Step 1: Extract to individual JSON files
    print("\n" + "="*80)
    print("STEP 1: EXTRACTING CONTENT")
    print("="*80)
    extract_all_to_json()

    # Step 2: Chunk from JSON files
    print("\n" + "="*80)
    print("STEP 2: CHUNKING ARTICLES")
    print("="*80)
    chunk_from_json_files()

    # Step 3: Upload to Qdrant from chunks file (additive)
    print("\n" + "="*80)
    print("STEP 3: UPLOADING TO QDRANT (ADDITIVE)")
    print("="*80)
    upload_chunks_additive()

    print("\n" + "="*80)
    print("PIPELINE COMPLETE ✓")
    print("="*80)


if __name__ == "__main__":
    main()
requirements.txt
CHANGED

@@ -7,6 +7,8 @@ requests>=2.31.0
 gradio>=4.0.0
 rapidfuzz>=3.0.0
 fuzzysearch>=0.7.3
+trafilatura>=1.6.0
+python-dateutil>=2.8.0
 --extra-index-url https://download.pytorch.org/whl/cpu
 torch>=2.0.0
upload_to_qdrant_cli.py
CHANGED

@@ -8,7 +8,7 @@ from config import MODEL_NAME, COLLECTION_NAME, EMBEDDING_DIM

 load_dotenv()

-def load_chunks(jsonl_path="article_chunks.jsonl"):
+def load_chunks(jsonl_path="chunks.jsonl"):
     chunks = []
     with open(jsonl_path, "r", encoding="utf-8") as f:
         for line in f:
@@ -64,9 +64,16 @@ def create_points(chunks, embeddings):
         points.append(point)
     return points

-def upload_points(client, points, collection_name=COLLECTION_NAME):
-    print(f"Uploading {len(points)} points...")
-    client.upsert(collection_name=collection_name, points=points)
+def upload_points(client, points, collection_name=COLLECTION_NAME, batch_size=100):
+    print(f"Uploading {len(points)} points in batches of {batch_size}...")
+    total_batches = (len(points) + batch_size - 1) // batch_size
+
+    for i in range(0, len(points), batch_size):
+        batch = points[i:i + batch_size]
+        batch_num = (i // batch_size) + 1
+        print(f"  Batch {batch_num}/{total_batches}: Uploading {len(batch)} points...")
+        client.upsert(collection_name=collection_name, points=batch)
+
     print(f"✓ Uploaded {len(points)} chunks to collection '{collection_name}'")

 def verify_upload(client, collection_name=COLLECTION_NAME):
@@ -74,21 +81,63 @@ def verify_upload(client, collection_name=COLLECTION_NAME):
     print(f"Collection now has {collection_info.points_count} points")
     return collection_info.points_count

-def 
-
-
-
-
-if not 
+def ensure_collection_exists(client, collection_name=COLLECTION_NAME, embedding_dim=EMBEDDING_DIM):
+    """Ensure collection exists, create if it doesn't. Returns starting ID for new points."""
+    if not client.collection_exists(collection_name):
+        print(f"Collection '{collection_name}' doesn't exist. Creating...")
+        client.create_collection(
+            collection_name=collection_name,
+            vectors_config=VectorParams(size=embedding_dim, distance=Distance.COSINE),
+        )
+        return 0
+    else:
+        collection_info = client.get_collection(collection_name)
+        point_count = collection_info.points_count
+        print(f"Collection '{collection_name}' exists with {point_count} points")
+        return point_count
+
+def offset_point_ids(points, start_id):
+    """Update point IDs to start from a given offset."""
+    print(f"Setting point IDs starting from {start_id}...")
+    for i, point in enumerate(points):
+        point.id = start_id + i
+    return points
+
+def print_upload_summary(start_id, added_count, new_count):
+    """Print upload summary statistics."""
+    print(f"\n✓ Upload complete!")
+    print(f"  Previous:  {start_id} points")
+    print(f"  Added:     {added_count} points")
+    print(f"  Total now: {new_count} points")
+
+def upload_chunks_additive(chunks_file="chunks.jsonl"):
+    """Upload chunks to Qdrant additively (preserves existing data)."""
+    if not os.path.exists(chunks_file):
+        print(f"Chunks file '{chunks_file}' not found")
+        return

+    chunks = load_chunks(chunks_file)
+    print(f"Found {len(chunks)} chunks")
+
+    if not chunks:
+        print("No chunks to upload")
         return

+    client = create_qdrant_client()
+    start_id = ensure_collection_exists(client)
+
+    model = load_embedding_model()
     embeddings = generate_embeddings(model, chunks)
     points = create_points(chunks, embeddings)
+    points = offset_point_ids(points, start_id)
+
     upload_points(client, points)
-
+
+    new_count = verify_upload(client)
+    print_upload_summary(start_id, len(points), new_count)

+def main():
+    upload_chunks_additive()

+if __name__ == "__main__":
+    main()
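The reworked `upload_points` splits the upload into `(n + batch_size - 1) // batch_size` batches via list slicing. That slicing arithmetic can be checked in isolation, with no Qdrant client involved (the `make_batches` helper is illustrative, not part of the repo):

```python
def make_batches(items, batch_size=100):
    # Mirrors the slicing in upload_points: items[i:i + batch_size]
    # for i = 0, batch_size, 2*batch_size, ...
    total_batches = (len(items) + batch_size - 1) // batch_size
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    # Ceiling division agrees with the number of slices produced.
    assert len(batches) == total_batches
    return batches
```

Batching keeps each `client.upsert` request small, which avoids payload-size and timeout problems when uploading thousands of points at once.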