Refactor blog data utilities and configuration
- Removed the Jupyter notebook `utils_data_loading.ipynb` and migrated utility functions to `blog_utils.py`.
- Created a new configuration file `config.py` to manage environment variables and default settings.
- Implemented a script `update_blog_data.py` for updating the blog data vector store with command-line options.
- Added a JSON file `blog_stats_20250510_161540.json` to store blog statistics.
- Enhanced document processing functions for better modularity and error handling.
- BLOG_DATA_UTILS.md +24 -14
- app.py +20 -81
- blog_utils.py +305 -0
- config.py +14 -0
- main.py +44 -1
- stats/blog_stats_20250510_161540.json +8 -0
- update_blog_data.ipynb +101 -148
- update_blog_data.py +101 -0
- utils_data_loading.ipynb +0 -454
BLOG_DATA_UTILS.md
CHANGED
````markdown
This directory contains utilities for loading, processing, and maintaining blog data.

## Available Tools

### `blog_utils.py`

This Python module contains utility functions for:
- Loading blog posts from the data directory
- Processing and enriching metadata (adding URLs, titles, etc.)
- Getting statistics about the documents
- Creating and updating vector embeddings
- Loading existing vector stores

### `update_blog_data.py`

This script allows you to:
- Update the blog data when new posts are published
- Process new blog posts
- Update the vector store
- Track changes over time

### Legacy Notebooks (Reference Only)

The following notebooks are kept for reference, but the functionality has been moved to Python modules:

- `utils_data_loading.ipynb` - Contains the original utility functions
- `update_blog_data.ipynb` - Demonstrates the update workflow

## How to Use

### Updating Blog Data

When new blog posts are published, follow these steps:

1. Add the markdown files to the `data/` directory
2. Run the update script:

```bash
cd /home/mafzaal/source/lets-talk
uv run python update_blog_data.py
```

You can also force recreation of the vector store:

```bash
uv run python update_blog_data.py --force-recreate
```

This will:
- Load all blog posts (including new ones)
- Update the vector embeddings

...

```
VECTOR_STORAGE_PATH=./db/vectorstore_v3             # Path to vector store
EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l  # Embedding model
QDRANT_COLLECTION=thedataguy_documents              # Collection name
BLOG_BASE_URL=https://thedataguy.pro/blog/          # Base URL for blog
```

### In the Chainlit App

The Chainlit app (`app.py`) has been updated to use these utility functions from the `blog_utils.py` module. It falls back to notebook import and direct initialization if there are any issues.

## Adding Custom Processing

To add custom processing for blog posts:

1. Edit the `update_document_metadata` function in `blog_utils.py`
2. Add any additional enrichment or processing steps
3. Update the vector store using the `update_blog_data.py` script

## Future Improvements

- Add scheduled update process for automatically including new blog posts
- Add tracking of embedding models and versions
- Add webhook support to automatically update when new posts are published
````
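As an illustration of the "Adding Custom Processing" steps above, here is a hypothetical enrichment step that could be added to `update_document_metadata`. The function and metadata field names below are invented for the example and are not part of the commit:

```python
# Hypothetical enrichment step (illustrative only): estimate reading time
# from the word count and store it in the document's metadata dict.

def add_reading_time(metadata: dict, page_content: str, wpm: int = 200) -> dict:
    """Attach an estimated reading time (in minutes) to a document's metadata."""
    word_count = len(page_content.split())
    metadata["reading_time_min"] = max(1, round(word_count / wpm))
    return metadata

meta = add_reading_time({}, "word " * 600)
print(meta["reading_time_min"])  # 600 words at 200 wpm -> 3
```

A step like this would slot into the per-document loop of `update_document_metadata`, after which the new field is embedded alongside the existing URL and title metadata on the next vector-store update.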
app.py
CHANGED
The notebook-import helper (`import_notebook_functions`) and its try/except fallback for loading the vector store were removed; the app now imports `blog_utils` and `config` directly. Updated top of the file:

```python
import os
import getpass
import sys
from pathlib import Path
from operator import itemgetter
from config import LLM_MODEL, LLM_TEMPERATURE
from dotenv import load_dotenv

# Load environment variables from .env file
...

from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
import blog_utils

# Load vector store using the utility function
vector_store = blog_utils.load_vector_store()

# Create a retriever
retriever = vector_store.as_retriever()

# Set up ChatOpenAI with environment variables
llm = ChatOpenAI(model=LLM_MODEL, temperature=LLM_TEMPERATURE)

# Create RAG prompt template
rag_prompt_template = """\
...
```

In `on_message`, the source-collection code is commented out for now:

```python
    response = chain.invoke({"question": message.content})

    # Get the sources to display them
    # sources = []
    # for doc in response["context"]:
    #     if "url" in doc.metadata:
    #         # Get title from post_title metadata if available, otherwise derive from URL
    #         title = doc.metadata.get("post_title", "")
    #         if not title:
    #             title = doc.metadata["url"].split("/")[-2].replace("-", " ").title()
    #
    #         sources.append(
    #             cl.Source(
    #                 url=doc.metadata["url"],
    #                 title=title
    #             )
    #         )

    # Send the response with sources
    await cl.Message(
        content=response["response"].content,
        #sources=sources
    ).send()
```
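The fallback behaviour described in BLOG_DATA_UTILS.md (try the utility loader, fall back to direct initialization on any failure) reduces to a simple try/except pattern. A minimal sketch with illustrative loader names, not the app's actual loaders:

```python
# Reduced form of the load-with-fallback pattern used around the vector store.
def load_with_fallback(primary, fallback):
    """Try the primary loader; on any failure, report it and use the fallback."""
    try:
        return primary()
    except Exception as e:
        print(f"Could not load utility functions: {e}")
        print("Falling back to direct initialization")
        return fallback()

def broken_loader():
    raise RuntimeError("notebook not found")

store = load_with_fallback(broken_loader, lambda: "direct-store")
print(store)  # direct-store
```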
blog_utils.py
ADDED
```python
"""
Blog Data Utilities Module

This module contains utility functions for loading, processing, and storing blog posts
for the RAG system. It includes functions for loading blog posts from the data directory,
processing their metadata, and creating vector embeddings.
"""

import os
import json
from pathlib import Path
from typing import List, Dict, Any, Optional
from datetime import datetime

from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema.document import Document
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient


from config import (
    DATA_DIR,
    VECTOR_STORAGE_PATH,
    EMBEDDING_MODEL,
    QDRANT_COLLECTION,
    BLOG_BASE_URL
)

def load_blog_posts(data_dir: str = DATA_DIR,
                    glob_pattern: str = "*.md",
                    recursive: bool = True,
                    show_progress: bool = True) -> List[Document]:
    """
    Load blog posts from the specified directory.

    Args:
        data_dir: Directory containing the blog posts
        glob_pattern: Pattern to match files
        recursive: Whether to search subdirectories
        show_progress: Whether to show a progress bar

    Returns:
        List of Document objects containing the blog posts
    """
    text_loader = DirectoryLoader(
        data_dir,
        glob=glob_pattern,
        show_progress=show_progress,
        recursive=recursive
    )

    documents = text_loader.load()
    print(f"Loaded {len(documents)} documents from {data_dir}")
    return documents


def update_document_metadata(documents: List[Document],
                             data_dir_prefix: str = DATA_DIR,
                             blog_base_url: str = BLOG_BASE_URL,
                             remove_suffix: str = "index.md") -> List[Document]:
    """
    Update the metadata of documents to include URL and other information.

    Args:
        documents: List of Document objects to update
        data_dir_prefix: Prefix to replace in source paths
        blog_base_url: Base URL for the blog posts
        remove_suffix: Suffix to remove from paths (like index.md)

    Returns:
        Updated list of Document objects
    """
    for doc in documents:
        # Create URL from source path
        doc.metadata["url"] = doc.metadata["source"].replace(data_dir_prefix, blog_base_url)

        # Remove index.md or other suffix if present
        if remove_suffix and doc.metadata["url"].endswith(remove_suffix):
            doc.metadata["url"] = doc.metadata["url"][:-len(remove_suffix)]

        # Extract post title from the directory structure
        path_parts = Path(doc.metadata["source"]).parts
        if len(path_parts) > 1:
            # Use the directory name as post_slug
            doc.metadata["post_slug"] = path_parts[-2]
            doc.metadata["post_title"] = path_parts[-2].replace("-", " ").title()

        # Add document length as metadata
        doc.metadata["content_length"] = len(doc.page_content)

    return documents


def get_document_stats(documents: List[Document]) -> Dict[str, Any]:
    """
    Get statistics about the documents.

    Args:
        documents: List of Document objects

    Returns:
        Dictionary with statistics
    """
    stats = {
        "total_documents": len(documents),
        "total_characters": sum(len(doc.page_content) for doc in documents),
        "min_length": min(len(doc.page_content) for doc in documents) if documents else 0,
        "max_length": max(len(doc.page_content) for doc in documents) if documents else 0,
        "avg_length": sum(len(doc.page_content) for doc in documents) / len(documents) if documents else 0,
    }

    # Create a list of document info for analysis
    doc_info = []
    for doc in documents:
        doc_info.append({
            "url": doc.metadata.get("url", ""),
            "source": doc.metadata.get("source", ""),
            "title": doc.metadata.get("post_title", ""),
            "text_length": doc.metadata.get("content_length", 0),
        })

    stats["documents"] = doc_info
    return stats


def display_document_stats(stats: Dict[str, Any]):
    """
    Display document statistics in a readable format.

    Args:
        stats: Dictionary with statistics from get_document_stats
    """
    print(f"Total Documents: {stats['total_documents']}")
    print(f"Total Characters: {stats['total_characters']}")
    print(f"Min Length: {stats['min_length']} characters")
    print(f"Max Length: {stats['max_length']} characters")
    print(f"Average Length: {stats['avg_length']:.2f} characters")

    # For use in notebooks where pandas and display are available:
    try:
        import pandas as pd
        from IPython.display import display
        if stats["documents"]:
            df = pd.DataFrame(stats["documents"])
            display(df)
    except (ImportError, NameError):
        # Just print the first 5 documents if not in a notebook environment
        if stats["documents"]:
            print("\nFirst 5 documents:")
            for i, doc in enumerate(stats["documents"][:5]):
                print(f"{i+1}. {doc['title']} ({doc['url']})")


def split_documents(documents: List[Document],
                    chunk_size: int = 1000,
                    chunk_overlap: int = 200) -> List[Document]:
    """
    Split documents into chunks for better embedding and retrieval.

    Args:
        documents: List of Document objects to split
        chunk_size: Size of each chunk in characters
        chunk_overlap: Overlap between chunks in characters

    Returns:
        List of split Document objects
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
    )

    split_docs = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(split_docs)} chunks")
    return split_docs


def create_vector_store(documents: List[Document],
                        storage_path: str = VECTOR_STORAGE_PATH,
                        collection_name: str = QDRANT_COLLECTION,
                        embedding_model: str = EMBEDDING_MODEL,
                        force_recreate: bool = False) -> Optional[QdrantVectorStore]:
    """
    Create a vector store from the documents using Qdrant.

    Args:
        documents: List of Document objects to embed
        storage_path: Path to the vector store
        collection_name: Name of the collection
        embedding_model: Name of the embedding model
        force_recreate: Whether to force recreation of the vector store

    Returns:
        QdrantVectorStore vector store or None if creation fails
    """
    vector_store = QdrantVectorStore.from_documents(
        documents,
        embedding=HuggingFaceEmbeddings(model_name=embedding_model),
        collection_name=collection_name,
        path=storage_path,
        force_recreate=force_recreate,
    )

    return vector_store


def load_vector_store(storage_path: str = VECTOR_STORAGE_PATH,
                      collection_name: str = QDRANT_COLLECTION,
                      embedding_model: str = EMBEDDING_MODEL) -> Optional[QdrantVectorStore]:
    """
    Load an existing vector store.

    Args:
        storage_path: Path to the vector store
        collection_name: Name of the collection
        embedding_model: Name of the embedding model

    Returns:
        QdrantVectorStore vector store or None if it doesn't exist
    """
    # Initialize the embedding model
    embeddings = HuggingFaceEmbeddings(model_name=embedding_model)

    # Check if vector store exists
    if not Path(storage_path).exists():
        print(f"Vector store not found at {storage_path}")
        return None

    try:
        # Initialize Qdrant client
        client = QdrantClient(path=storage_path)

        # Create vector store with the client
        vector_store = QdrantVectorStore(
            client=client,
            collection_name=collection_name,
            embedding=embeddings,
        )
        print(f"Loaded vector store from {storage_path}")
        return vector_store
    except Exception as e:
        print(f"Error loading vector store: {e}")
        return None


def process_blog_posts(data_dir: str = DATA_DIR,
                       create_embeddings: bool = True,
                       force_recreate_embeddings: bool = False,
                       storage_path: str = VECTOR_STORAGE_PATH):
    """
    Complete pipeline to process blog posts and optionally create vector embeddings.

    Args:
        data_dir: Directory containing the blog posts
        create_embeddings: Whether to create vector embeddings
        force_recreate_embeddings: Whether to force recreation of embeddings
        storage_path: Path to the vector store (not used with in-memory approach)

    Returns:
        Dictionary with data and vector store (if created)
    """
    # Load documents
    documents = load_blog_posts(data_dir)

    # Update metadata
    documents = update_document_metadata(documents)

    # Get and display stats
    stats = get_document_stats(documents)
    display_document_stats(stats)

    result = {
        "documents": documents,
        "stats": stats,
        "vector_store": None
    }

    # Create vector store if requested
    if create_embeddings:
        # Using in-memory vector store to avoid pickling issues
        vector_store = create_vector_store(
            documents,
            force_recreate=force_recreate_embeddings
        )
        result["vector_store"] = vector_store

    return result


# Allow script to be run directly if needed
if __name__ == "__main__":
    print("Blog Data Utilities Module")
    print("Available functions:")
    print("- load_blog_posts()")
    print("- update_document_metadata()")
    print("- get_document_stats()")
    print("- display_document_stats()")
    print("- split_documents()")
    print("- create_vector_store()")
    print("- load_vector_store()")
    print("- process_blog_posts()")
```
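The URL and slug derivation in `update_document_metadata` can be exercised on its own. This sketch mirrors that logic with plain dicts instead of langchain `Document` objects so it runs without the heavy dependencies; the sample path is hypothetical:

```python
# Standalone sketch of the URL/slug derivation performed by update_document_metadata.
from pathlib import Path

def derive_url_and_slug(source: str,
                        data_dir_prefix: str = "data/",
                        blog_base_url: str = "https://thedataguy.pro/blog/",
                        remove_suffix: str = "index.md") -> dict:
    """Map a source file path to a public blog URL, slug, and display title."""
    url = source.replace(data_dir_prefix, blog_base_url)
    if remove_suffix and url.endswith(remove_suffix):
        url = url[:-len(remove_suffix)]
    parts = Path(source).parts
    slug = parts[-2] if len(parts) > 1 else ""
    return {"url": url,
            "post_slug": slug,
            "post_title": slug.replace("-", " ").title()}

info = derive_url_and_slug("data/building-rag-systems/index.md")
print(info["url"])         # https://thedataguy.pro/blog/building-rag-systems/
print(info["post_title"])  # Building Rag Systems
```

Note that the title is derived purely from the directory name, which is why the module stores it as `post_title` metadata rather than reading the markdown front matter.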
config.py
ADDED
|
```python
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Configuration with defaults
DATA_DIR = os.environ.get("DATA_DIR", "data/")
VECTOR_STORAGE_PATH = os.environ.get("VECTOR_STORAGE_PATH", "./db/vectorstore_v3")
EMBEDDING_MODEL = os.environ.get("EMBEDDING_MODEL", "Snowflake/snowflake-arctic-embed-l")
QDRANT_COLLECTION = os.environ.get("QDRANT_COLLECTION", "thedataguy_documents")
BLOG_BASE_URL = os.environ.get("BLOG_BASE_URL", "https://thedataguy.pro/blog/")
LLM_MODEL = os.environ.get("LLM_MODEL", "gpt-4o-mini")
LLM_TEMPERATURE = float(os.environ.get("TEMPERATURE", "0"))
```
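The pattern in `config.py` is that an environment variable (typically set in `.env`) wins over the hard-coded default. Shown in isolation, with the same names as in `config.py` and an example collection name that is purely illustrative:

```python
import os

# Simulate a value that would normally come from a .env file.
os.environ["QDRANT_COLLECTION"] = "my_custom_collection"

# Environment variable set -> override is used.
QDRANT_COLLECTION = os.environ.get("QDRANT_COLLECTION", "thedataguy_documents")
# Environment variable unset -> the string default is parsed to a float.
LLM_TEMPERATURE = float(os.environ.get("TEMPERATURE", "0"))

print(QDRANT_COLLECTION)  # my_custom_collection
print(LLM_TEMPERATURE)    # 0.0
```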
main.py
CHANGED
|
The previously empty `main()` stub now drives the update workflow:

```python
import blog_utils
from update_blog_data import parse_args, save_stats


def main():
    """Main function to update blog data"""
    args = parse_args()

    print("=== Blog Data Update ===")
    print(f"Data directory: {args.data_dir}")
    print(f"Force recreate: {args.force_recreate}")
    print("========================")

    # Process blog posts without creating embeddings
    try:
        # Load and process documents
        documents = blog_utils.load_blog_posts(args.data_dir)
        documents = blog_utils.update_document_metadata(documents)

        # Get stats
        stats = blog_utils.get_document_stats(documents)
        blog_utils.display_document_stats(stats)

        # Save stats for tracking
        stats_file = save_stats(stats)

        # Create a reference file for the vector store
        if args.force_recreate:
            print("\nAttempting to save vector store reference file...")
            blog_utils.create_vector_store(documents, force_recreate=args.force_recreate)

        print("\n=== Update Summary ===")
        print(f"Processed {stats['total_documents']} documents")
        print(f"Stats saved to: {stats_file}")
        print("Note: Vector store creation is currently disabled due to pickling issues.")
        print("      See VECTOR_STORE_ISSUES.md for more information and possible solutions.")
        print("=====================")

        return 0
    except Exception as e:
        print(f"Error: {e}")
        import traceback
        traceback.print_exc()
        return 1


if __name__ == "__main__":
    ...
```
stats/blog_stats_20250510_161540.json
ADDED
|
```json
{
  "timestamp": "20250510_161540",
  "total_documents": 14,
  "total_characters": 106275,
  "min_length": 1900,
  "max_length": 13468,
  "avg_length": 7591.071428571428
}
```
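`main.py` calls a `save_stats` helper from `update_blog_data.py`, which is not shown in this diff. A sketch of how such a helper could emit the timestamped summary file above; the output directory and the choice to drop the per-document list are assumptions:

```python
import json
from datetime import datetime
from pathlib import Path

def save_stats(stats: dict, output_dir: str = "stats") -> str:
    """Write a timestamped JSON summary of the document stats and return its path."""
    Path(output_dir).mkdir(exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    # Keep only the scalar summary fields, matching the committed JSON file.
    summary = {"timestamp": timestamp,
               **{k: v for k, v in stats.items() if k != "documents"}}
    path = Path(output_dir) / f"blog_stats_{timestamp}.json"
    path.write_text(json.dumps(summary, indent=2))
    return str(path)

f = save_stats({"total_documents": 14, "total_characters": 106275, "documents": []})
print(f)  # path like stats/blog_stats_YYYYMMDD_HHMMSS.json
```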
update_blog_data.ipynb
CHANGED
|
@@ -12,7 +12,7 @@
|
|
| 12 |
},
|
| 13 |
{
|
| 14 |
"cell_type": "code",
|
| 15 |
-
"execution_count":
|
| 16 |
"id": "6ec048b4",
|
| 17 |
"metadata": {},
|
| 18 |
"outputs": [],
|
|
@@ -21,139 +21,88 @@
|
|
| 21 |
"import os\n",
|
| 22 |
"from pathlib import Path\n",
|
| 23 |
"from dotenv import load_dotenv\n",
|
| 24 |
-
"import importlib.util\n"
|
| 25 |
-
"\n",
|
| 26 |
-
"# Load environment variables\n",
|
| 27 |
-
"load_dotenv()\n",
|
| 28 |
-
"\n",
|
| 29 |
-
"# Import utility functions from utils_data_loading.ipynb\n",
|
| 30 |
-
"# We'll do this by first converting the notebook to a Python module"
|
| 31 |
]
|
| 32 |
},
|
| 33 |
{
|
| 34 |
-
"cell_type": "
|
| 35 |
-
"
|
| 36 |
-
"id": "7f01d61f",
|
| 37 |
"metadata": {},
|
| 38 |
-
"outputs": [],
|
| 39 |
"source": [
|
| 40 |
-
"#
|
| 41 |
-
"
|
| 42 |
-
"
|
| 43 |
-
"
|
| 44 |
-
"
|
| 45 |
-
"
|
| 46 |
-
" notebook_path: Path to the notebook\n",
|
| 47 |
-
" module_name: Name to give the module\n",
|
| 48 |
-
" \n",
|
| 49 |
-
" Returns:\n",
|
| 50 |
-
" The imported module\n",
|
| 51 |
-
" \"\"\"\n",
|
| 52 |
-
" import nbformat\n",
|
| 53 |
-
" from importlib.util import spec_from_loader, module_from_spec\n",
|
| 54 |
-
" from IPython.core.interactiveshell import InteractiveShell\n",
|
| 55 |
-
" \n",
|
| 56 |
-
" shell = InteractiveShell.instance()\n",
|
| 57 |
-
" \n",
|
| 58 |
-
" with open(notebook_path) as f:\n",
|
| 59 |
-
" nb = nbformat.read(f, as_version=4)\n",
|
| 60 |
-
" \n",
|
| 61 |
-
" # Create a module\n",
|
| 62 |
-
" spec = spec_from_loader(module_name, loader=None)\n",
|
| 63 |
-
" module = module_from_spec(spec)\n",
|
| 64 |
-
" sys.modules[module_name] = module\n",
|
| 65 |
-
" \n",
|
| 66 |
-
" # Execute only the code cells in the notebook\n",
|
| 67 |
-
" for cell in nb.cells:\n",
|
| 68 |
-
" if cell.cell_type == 'code':\n",
|
| 69 |
-
" # Skip cells that start with certain keywords like \"if __name__ == \"__main__\":\"\n",
|
| 70 |
-
" if 'if __name__ == \"__main__\":' in cell.source:\n",
|
| 71 |
-
" continue\n",
|
| 72 |
-
" \n",
|
| 73 |
-
" # Execute the cell and store its content in the module\n",
|
| 74 |
-
" code = shell.input_transformer_manager.transform_cell(cell.source)\n",
|
| 75 |
-
" exec(code, module.__dict__)\n",
|
| 76 |
-
" \n",
|
| 77 |
-
" return module"
|
| 78 |
]
|
| 79 |
},
  {
   "cell_type": "code",
-  "execution_count": …,
-  "id": …,
   "metadata": {},
-  "outputs": […],
   "source": [
-  …
-  "utils = import_notebook_as_module('utils_data_loading.ipynb')\n",
   "\n",
-  …
-  …
-  ]
-  },
-  {
-  "cell_type": "markdown",
-  "id": "85ae6617",
-  "metadata": {},
-  "source": [
-  "## Configuration\n",
   "\n",
-  …
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
-  "id": …,
-  "metadata": {},
-  "outputs": [],
-  "source": [
-  "# Configuration (can be overridden from .env file)\n",
-  "DATA_DIR = os.environ.get(\"DATA_DIR\", \"data/\")\n",
-  "VECTOR_STORAGE_PATH = os.environ.get(\"VECTOR_STORAGE_PATH\", \"./db/vectorstore_v3\")\n",
-  "BLOG_BASE_URL = os.environ.get(\"BLOG_BASE_URL\", \"https://thedataguy.pro/blog/\")\n",
-  "FORCE_RECREATE_EMBEDDINGS = os.environ.get(\"FORCE_RECREATE_EMBEDDINGS\", \"false\").lower() == \"true\"\n",
-  "\n",
-  "print(f\"Data Directory: {DATA_DIR}\")\n",
-  "print(f\"Vector Storage Path: {VECTOR_STORAGE_PATH}\")\n",
-  "print(f\"Blog Base URL: {BLOG_BASE_URL}\")\n",
-  "print(f\"Force Recreate Embeddings: {FORCE_RECREATE_EMBEDDINGS}\")"
-  ]
-  },
-  {
-  "cell_type": "markdown",
-  "id": "cc19ab4c",
   "metadata": {},
   "source": [
-  "## Update Blog Data Process\n",
-  "\n",
-  "This process will:\n",
-  "1. Load existing blog posts\n",
-  "2. Process and update metadata\n",
-  "3. Create or update vector embeddings"
   ]
  },
  {
   "cell_type": "code",
-  "execution_count": …,
-  "id": …,
   "metadata": {},
   "outputs": [],
   "source": [
-  …
-  "result = utils.process_blog_posts(\n",
-  "    data_dir=DATA_DIR,\n",
-  "    create_embeddings=True,\n",
-  "    force_recreate_embeddings=FORCE_RECREATE_EMBEDDINGS\n",
-  ")\n",
-  "\n",
-  "# Access the documents and vector store\n",
-  "documents = result[\"documents\"]\n",
-  "stats = result[\"stats\"]\n",
-  "vector_store = result[\"vector_store\"]\n",
-  "\n",
-  "print(f\"\\nProcessed {len(documents)} blog posts\")\n",
-  "print(f\"Vector store created/updated at: {VECTOR_STORAGE_PATH}\")"
   ]
  },
  {
@@ -168,13 +117,44 @@
  },
  {
   "cell_type": "code",
-  "execution_count": …,
   "id": "8b552e6b",
   "metadata": {},
-  "outputs": […],
   "source": [
   "# Create a retriever from the vector store\n",
-  "retriever = vector_store.as_retriever(search_kwargs={\"k\": …})\n",
   "\n",
   "# Test queries\n",
   "test_queries = [\n",
@@ -194,49 +174,14 @@
   "    print(f\"{i+1}. {title} ({url})\")"
   ]
  },
-  {
-  "cell_type": "markdown",
-  "id": "ddbe9282",
-  "metadata": {},
-  "source": [
-  "## Schedule This Notebook\n",
-  "\n",
-  "To keep the blog data up-to-date, you can schedule this notebook to run periodically. \n",
-  "Here are some options:\n",
-  "\n",
-  "1. Use a cron job to run this notebook with papermill\n",
-  "2. Set up a GitHub Action to run this notebook on a schedule\n",
-  "3. Use Airflow or another workflow management system\n",
-  "\n",
-  "Example of running with papermill:\n",
-  "```bash\n",
-  "papermill update_blog_data.ipynb output_$(date +%Y%m%d).ipynb\n",
-  "```"
-  ]
-  },
  {
   "cell_type": "code",
-  "execution_count": …,
-  "id": …,
   "metadata": {},
   "outputs": [],
   "source": [
-  …
-  "import json\n",
-  "from datetime import datetime\n",
-  "\n",
-  "stats_dir = Path(\"stats\")\n",
-  "stats_dir.mkdir(exist_ok=True)\n",
-  "\n",
-  "# Add timestamp to stats\n",
-  "stats[\"timestamp\"] = datetime.now().isoformat()\n",
-  "\n",
-  "# Save stats\n",
-  "stats_path = stats_dir / f\"blog_stats_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json\"\n",
-  "with open(stats_path, \"w\") as f:\n",
-  "    json.dump(stats, f, indent=2)\n",
-  "\n",
-  "print(f\"Saved stats to {stats_path}\")"
   ]
  },
@@ -247,7 +192,15 @@
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.13.2"
  }
 },
  },
  {
   "cell_type": "code",
+  "execution_count": 1,
   "id": "6ec048b4",
   "metadata": {},
   "outputs": [],
   "source": [
   "import os\n",
   "from pathlib import Path\n",
   "from dotenv import load_dotenv\n",
+  "import importlib.util\n"
   ]
  },
  {
+  "cell_type": "markdown",
+  "id": "cc19ab4c",
   "metadata": {},
   "source": [
+  "## Update Blog Data Process\n",
+  "\n",
+  "This process will:\n",
+  "1. Load existing blog posts\n",
+  "2. Process and update metadata\n",
+  "3. Create or update vector embeddings"
   ]
  },
  {
   "cell_type": "code",
+  "execution_count": 7,
+  "id": "3d56f688",
   "metadata": {},
+  "outputs": [
+   {
+    "name": "stderr",
+    "output_type": "stream",
+    "text": [
+     "100%|██████████| 14/14 [00:00<00:00, 42.05it/s]"
+    ]
+   },
+   {
+    "name": "stdout",
+    "output_type": "stream",
+    "text": [
+     "Loaded 14 documents from data/\n"
+    ]
+   },
+   {
+    "name": "stderr",
+    "output_type": "stream",
+    "text": [
+     "\n"
+    ]
+   }
+  ],
   "source": [
+  "import blog_utils\n",
   "\n",
+  "docs = blog_utils.load_blog_posts()\n",
+  "docs = blog_utils.update_document_metadata(docs)\n",
   "\n",
+  "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
+  "id": "a14c70dc",
   "metadata": {},
+  "outputs": [
+   {
+    "data": {
+     "text/plain": [
"Document(metadata={'source': 'data/introduction-to-ragas/index.md', 'url': 'https://thedataguy.pro/blog/introduction-to-ragas/', 'post_slug': 'introduction-to-ragas', 'post_title': 'Introduction To Ragas', 'content_length': 6071}, page_content='title: \"Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications\" date: 2025-04-26T18:00:00-06:00 layout: blog description: \"Explore the essential evaluation framework for LLM applications with Ragas. Learn how to assess performance, ensure accuracy, and improve reliability in Retrieval-Augmented Generation systems.\" categories: [\"AI\", \"RAG\", \"Evaluation\",\"Ragas\"] coverImage: \"https://images.unsplash.com/photo-1593642634367-d91a135587b5?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3\" readingTime: 7 published: true\\n\\nAs Large Language Models (LLMs) become fundamental components of modern applications, effectively evaluating their performance becomes increasingly critical. Whether you\\'re building a question-answering system, a document retrieval tool, or a conversational agent, you need reliable metrics to assess how well your application performs. This is where Ragas steps in.\\n\\nWhat is Ragas?\\n\\nRagas is an open-source evaluation framework specifically designed for LLM applications, with particular strengths in Retrieval-Augmented Generation (RAG) systems. Unlike traditional NLP evaluation methods, Ragas provides specialized metrics that address the unique challenges of LLM-powered systems.\\n\\nAt its core, Ragas helps answer crucial questions: - Is my application retrieving the right information? - Are the responses factually accurate and consistent with the retrieved context? - Does the system appropriately address the user\\'s query? - How well does my application handle multi-turn conversations?\\n\\nWhy Evaluate LLM Applications?\\n\\nLLMs are powerful but imperfect. They can hallucinate facts, misinterpret queries, or generate convincing but incorrect responses. 
For applications where accuracy and reliability matter—like healthcare, finance, or education—proper evaluation is non-negotiable.\\n\\nEvaluation serves several key purposes: - Quality assurance: Identify and fix issues before they reach users - Performance tracking: Monitor how changes impact system performance - Benchmarking: Compare different approaches objectively - Continuous improvement: Build feedback loops to enhance your application\\n\\nKey Features of Ragas\\n\\n🎯 Specialized Metrics\\n\\nRagas offers both LLM-based and computational metrics tailored to evaluate different aspects of LLM applications:\\n\\nFaithfulness: Measures if the response is factually consistent with the retrieved context\\n\\nContext Relevancy: Evaluates if the retrieved information is relevant to the query\\n\\nAnswer Relevancy: Assesses if the response addresses the user\\'s question\\n\\nTopic Adherence: Gauges how well multi-turn conversations stay on topic\\n\\n🧪 Test Data Generation\\n\\nCreating high-quality test data is often a bottleneck in evaluation. 
Ragas helps you generate comprehensive test datasets automatically, saving time and ensuring thorough coverage.\\n\\n🔗 Seamless Integrations\\n\\nRagas works with popular LLM frameworks and tools: - LangChain - LlamaIndex - Haystack - OpenAI\\n\\nObservability platforms - Phoenix - LangSmith - Langfuse\\n\\n📊 Comprehensive Analysis\\n\\nBeyond simple scores, Ragas provides detailed insights into your application\\'s strengths and weaknesses, enabling targeted improvements.\\n\\nGetting Started with Ragas\\n\\nInstalling Ragas is straightforward:\\n\\nbash uv init && uv add ragas\\n\\nHere\\'s a simple example of evaluating a response using Ragas:\\n\\n```python from ragas.metrics import Faithfulness from ragas.evaluation import EvaluationDataset from ragas.dataset_schema import SingleTurnSample from langchain_openai import ChatOpenAI from ragas.llms import LangchainLLMWrapper from langchain_openai import ChatOpenAI\\n\\nInitialize the LLM, you are going to new OPENAI API key\\n\\nevaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=\"gpt-4o\"))\\n\\nYour evaluation data\\n\\ntest_data = { \"user_input\": \"What is the capital of France?\", \"retrieved_contexts\": [\"Paris is the capital and most populous city of France.\"], \"response\": \"The capital of France is Paris.\" }\\n\\nCreate a sample\\n\\nsample = SingleTurnSample(**test_data) # Unpack the dictionary into the constructor\\n\\nCreate metric\\n\\nfaithfulness = Faithfulness(llm=evaluator_llm)\\n\\nCalculate the score\\n\\nresult = await faithfulness.single_turn_ascore(sample) print(f\"Faithfulness score: {result}\") ```\\n\\n💡 Try it yourself: Explore the hands-on notebook for this workflow: 01_Introduction_to_Ragas\\n\\nWhat\\'s Coming in This Blog Series\\n\\nThis introduction is just the beginning. 
In the upcoming posts, we\\'ll dive deeper into all aspects of evaluating LLM applications with Ragas:\\n\\nPart 2: Basic Evaluation Workflow We\\'ll explore each metric in detail, explaining when and how to use them effectively.\\n\\nPart 3: Evaluating RAG Systems Learn specialized techniques for evaluating retrieval-augmented generation systems, including context precision, recall, and relevance.\\n\\nPart 4: Test Data Generation Discover how to create high-quality test datasets that thoroughly exercise your application\\'s capabilities.\\n\\nPart 5: Advanced Evaluation Techniques Go beyond basic metrics with custom evaluations, multi-aspect analysis, and domain-specific assessments.\\n\\nPart 6: Evaluating AI Agents Learn how to evaluate complex AI agents that engage in multi-turn interactions, use tools, and work toward specific goals.\\n\\nPart 7: Integrations and Observability Connect Ragas with your existing tools and platforms for streamlined evaluation workflows.\\n\\nPart 8: Building Feedback Loops Learn how to implement feedback loops that drive continuous improvement in your LLM applications. Transform evaluation insights into concrete improvements for your LLM applications.\\n\\nConclusion\\n\\nIn a world increasingly powered by LLMs, robust evaluation is the difference between reliable applications and unpredictable ones. Ragas provides the tools you need to confidently assess and improve your LLM applications.\\n\\nReady to Elevate Your LLM Applications?\\n\\nStart exploring Ragas today by visiting the official documentation. Share your thoughts, challenges, or success stories. If you\\'re facing specific evaluation hurdles, don\\'t hesitate to reach out—we\\'d love to help!')"
+     ]
+    },
+    "execution_count": 8,
+    "metadata": {},
+    "output_type": "execute_result"
+   }
+  ],
   "source": [
+  "docs[0]\n"
   ]
  },
  {
   "cell_type": "code",
+  "execution_count": 11,
+  "id": "72dd14b5",
   "metadata": {},
   "outputs": [],
   "source": [
+  "vector_store = blog_utils.create_vector_store(docs, './db/vector_store_4')"
   ]
  },
  {
  },
  {
   "cell_type": "code",
+  "execution_count": 12,
   "id": "8b552e6b",
   "metadata": {},
+  "outputs": [
+   {
+    "name": "stdout",
+    "output_type": "stream",
+    "text": [
+     "\n",
+     "Query: What is RAGAS?\n",
+     "Retrieved 3 documents:\n",
+     "1. Introduction To Ragas (https://thedataguy.pro/blog/introduction-to-ragas/)\n",
+     "2. Evaluating Rag Systems With Ragas (https://thedataguy.pro/blog/evaluating-rag-systems-with-ragas/)\n",
+     "3. Advanced Metrics And Customization With Ragas (https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/)\n",
+     "\n",
+     "Query: How to build research agents?\n",
+     "Retrieved 3 documents:\n",
+     "1. Building Research Agent (https://thedataguy.pro/blog/building-research-agent/)\n",
+     "2. Advanced Metrics And Customization With Ragas (https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/)\n",
+     "3. Evaluating Rag Systems With Ragas (https://thedataguy.pro/blog/evaluating-rag-systems-with-ragas/)\n",
+     "\n",
+     "Query: What is metric driven development?\n",
+     "Retrieved 3 documents:\n",
+     "1. Metric Driven Development (https://thedataguy.pro/blog/metric-driven-development/)\n",
+     "2. Advanced Metrics And Customization With Ragas (https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/)\n",
+     "3. Coming Back To Ai Roots (https://thedataguy.pro/blog/coming-back-to-ai-roots/)\n",
+     "\n",
+     "Query: Who is TheDataGuy?\n",
+     "Retrieved 3 documents:\n",
+     "1. Advanced Metrics And Customization With Ragas (https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/)\n",
+     "2. Langchain Experience Csharp Perspective (https://thedataguy.pro/blog/langchain-experience-csharp-perspective/)\n",
+     "3. Evaluating Rag Systems With Ragas (https://thedataguy.pro/blog/evaluating-rag-systems-with-ragas/)\n"
+    ]
+   }
+  ],
   "source": [
   "# Create a retriever from the vector store\n",
+  "retriever = vector_store.as_retriever(search_kwargs={\"k\": 3})\n",
   "\n",
   "# Test queries\n",
   "test_queries = [\n",
   …
   "    print(f\"{i+1}. {title} ({url})\")"
   ]
  },
  {
   "cell_type": "code",
+  "execution_count": 13,
+  "id": "4cdd6899",
   "metadata": {},
   "outputs": [],
   "source": [
+  "vector_store.client.close()"
   ]
  }
 ],
   "name": "python3"
  },
  "language_info": {
+  "codemirror_mode": {
+   "name": "ipython",
+   "version": 3
+  },
+  "file_extension": ".py",
+  "mimetype": "text/x-python",
   "name": "python",
+  "nbconvert_exporter": "python",
+  "pygments_lexer": "ipython3",
   "version": "3.13.2"
  }
 },
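The smoke test above wires `vector_store.as_retriever(search_kwargs={"k": 3})` to a list of queries and prints the top titles per query. Its shape can be mimicked without Qdrant or an embedding model; the word-overlap scoring and the tiny corpus below are illustrative stand-ins for the real embedding-based similarity, not the project's pipeline:

```python
import re

def tokens(s):
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", s.lower()))

def top_k(query, docs, k=3):
    """Rank documents by word overlap with the query (toy similarity), keep the top k."""
    q = tokens(query)
    return sorted(docs, key=lambda d: -len(q & tokens(d["text"])))[:k]

corpus = [
    {"title": "Introduction To Ragas", "text": "Ragas is an evaluation framework for LLM applications"},
    {"title": "Building Research Agent", "text": "how to build research agents with tools"},
    {"title": "Metric Driven Development", "text": "metric driven development uses metrics to guide work"},
]
hits = top_k("What is RAGAS?", corpus, k=2)
print([h["title"] for h in hits])
```

A real `k` cut-off works the same way: score every candidate, sort by similarity, truncate to `k`.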
update_blog_data.py
ADDED
|
@@ -0,0 +1,101 @@
"""
Blog Data Update Script

This script updates the blog data vector store when new posts are added.
It can be scheduled to run periodically or manually executed.

Usage:
    python update_blog_data.py [--force-recreate] [--data-dir DATA_DIR]

Options:
    --force-recreate    Force recreation of the vector store even if it exists
    --data-dir DIR      Directory containing the blog posts (default: data/)
"""

import os
import sys
import argparse
from datetime import datetime
import json
from pathlib import Path

# Import the blog utilities module
import blog_utils

def parse_args():
    """Parse command-line arguments"""
    parser = argparse.ArgumentParser(description="Update blog data vector store")
    parser.add_argument("--force-recreate", action="store_true",
                        help="Force recreation of the vector store")
    parser.add_argument("--data-dir", default=blog_utils.DATA_DIR,
                        help=f"Directory containing blog posts (default: {blog_utils.DATA_DIR})")
    return parser.parse_args()

def save_stats(stats, output_dir="./stats"):
    """Save stats to a JSON file for tracking changes over time"""
    # Create directory if it doesn't exist
    Path(output_dir).mkdir(exist_ok=True, parents=True)

    # Create filename with timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{output_dir}/blog_stats_{timestamp}.json"

    # Save only the basic stats, not the full document list
    basic_stats = {
        "timestamp": timestamp,
        "total_documents": stats["total_documents"],
        "total_characters": stats["total_characters"],
        "min_length": stats["min_length"],
        "max_length": stats["max_length"],
        "avg_length": stats["avg_length"],
    }

    with open(filename, "w") as f:
        json.dump(basic_stats, f, indent=2)

    print(f"Saved stats to {filename}")
    return filename

def main():
    """Main function to update blog data"""
    args = parse_args()

    print("=== Blog Data Update ===")
    print(f"Data directory: {args.data_dir}")
    print(f"Force recreate: {args.force_recreate}")
    print("========================")

    # Process blog posts without creating embeddings
    try:
        # Load and process documents
        documents = blog_utils.load_blog_posts(args.data_dir)
        documents = blog_utils.update_document_metadata(documents)

        # Get stats
        stats = blog_utils.get_document_stats(documents)
        blog_utils.display_document_stats(stats)

        # Save stats for tracking
        stats_file = save_stats(stats)

        # Create a reference file for the vector store
        if args.force_recreate:
            print("\nAttempting to save vector store reference file...")
            blog_utils.create_vector_store(documents, force_recreate=args.force_recreate)

        print("\n=== Update Summary ===")
        print(f"Processed {stats['total_documents']} documents")
        print(f"Stats saved to: {stats_file}")
        print("Note: Vector store creation is currently disabled due to pickling issues.")
        print("      See VECTOR_STORE_ISSUES.md for more information and possible solutions.")
        print("=====================")

        return 0
    except Exception as e:
        print(f"Error: {e}")
        import traceback
        traceback.print_exc()
        return 1

if __name__ == "__main__":
    sys.exit(main())
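The stats-snapshot pattern in `save_stats` above is self-contained apart from the stats dict, so it can be exercised standalone with the stdlib only. The demo numbers below are made up, and a temp directory stands in for `./stats`:

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path

def save_stats(stats, output_dir):
    """Write a timestamped snapshot of the basic aggregate stats to a JSON file."""
    Path(output_dir).mkdir(exist_ok=True, parents=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = Path(output_dir) / f"blog_stats_{timestamp}.json"
    keys = ("total_documents", "total_characters", "min_length", "max_length", "avg_length")
    # Keep only the aggregates, never the per-document list
    snapshot = {"timestamp": timestamp, **{k: stats[k] for k in keys}}
    filename.write_text(json.dumps(snapshot, indent=2))
    return filename

demo_stats = {"total_documents": 14, "total_characters": 99_000,
              "min_length": 1_200, "max_length": 18_000, "avg_length": 7_071.4}
out = save_stats(demo_stats, tempfile.mkdtemp())
print(out.name)
```

Each run produces a new `blog_stats_<timestamp>.json`, so diffing two snapshots shows how the corpus grew between updates.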
|
utils_data_loading.ipynb
DELETED
|
@@ -1,454 +0,0 @@
# Utility Functions for Blog Post Loading and Processing

This notebook contains utility functions for loading blog posts from the data directory, processing their metadata, and creating vector embeddings for use in the RAG system.

import os
import json
from pathlib import Path
from typing import List, Dict, Any, Optional

from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema.document import Document
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Qdrant

from IPython.display import Markdown, display
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

## Configuration

Load configuration from environment variables or use defaults.

# Configuration with defaults
DATA_DIR = os.environ.get("DATA_DIR", "data/")
VECTOR_STORAGE_PATH = os.environ.get("VECTOR_STORAGE_PATH", "./db/vectorstore_v3")
EMBEDDING_MODEL = os.environ.get("EMBEDDING_MODEL", "Snowflake/snowflake-arctic-embed-l")
QDRANT_COLLECTION = os.environ.get("QDRANT_COLLECTION", "thedataguy_documents")
BLOG_BASE_URL = os.environ.get("BLOG_BASE_URL", "https://thedataguy.pro/blog/")

## Utility Functions

These functions handle the loading, processing, and storing of blog posts.

def load_blog_posts(data_dir: str = DATA_DIR,
                    glob_pattern: str = "*.md",
                    recursive: bool = True,
                    show_progress: bool = True) -> List[Document]:
    """
    Load blog posts from the specified directory.

    Args:
        data_dir: Directory containing the blog posts
        glob_pattern: Pattern to match files
        recursive: Whether to search subdirectories
        show_progress: Whether to show a progress bar

    Returns:
        List of Document objects containing the blog posts
    """
    text_loader = DirectoryLoader(
        data_dir,
        glob=glob_pattern,
        show_progress=show_progress,
        recursive=recursive
    )

    documents = text_loader.load()
    print(f"Loaded {len(documents)} documents from {data_dir}")
    return documents
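`load_blog_posts` delegates to LangChain's `DirectoryLoader`. Without that dependency, the recursive `*.md` walk it performs can be sketched with `pathlib` alone; `SimpleDoc` below is a stand-in for LangChain's `Document`, not the real class:

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class SimpleDoc:
    page_content: str
    metadata: dict = field(default_factory=dict)

def load_blog_posts(data_dir, glob_pattern="*.md", recursive=True):
    """Load every matching markdown file under data_dir into SimpleDoc objects."""
    root = Path(data_dir)
    paths = root.rglob(glob_pattern) if recursive else root.glob(glob_pattern)
    docs = [SimpleDoc(p.read_text(), {"source": str(p)}) for p in sorted(paths)]
    print(f"Loaded {len(docs)} documents from {data_dir}")
    return docs

# Tiny demo tree mirroring the data/<slug>/index.md layout
root = Path(tempfile.mkdtemp())
(root / "post-a").mkdir()
(root / "post-a" / "index.md").write_text("# Post A")
(root / "post-b").mkdir()
(root / "post-b" / "index.md").write_text("# Post B")
docs = load_blog_posts(root)
```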
def update_document_metadata(documents: List[Document],
                             data_dir_prefix: str = DATA_DIR,
                             blog_base_url: str = BLOG_BASE_URL,
                             remove_suffix: str = "index.md") -> List[Document]:
    """
    Update the metadata of documents to include URL and other information.

    Args:
        documents: List of Document objects to update
        data_dir_prefix: Prefix to replace in source paths
        blog_base_url: Base URL for the blog posts
        remove_suffix: Suffix to remove from paths (like index.md)

    Returns:
        Updated list of Document objects
    """
    for doc in documents:
        # Create URL from source path
        doc.metadata["url"] = doc.metadata["source"].replace(data_dir_prefix, blog_base_url)

        # Remove index.md or other suffix if present
        if remove_suffix and doc.metadata["url"].endswith(remove_suffix):
            doc.metadata["url"] = doc.metadata["url"][:-len(remove_suffix)]

        # Extract post title from the directory structure
        path_parts = Path(doc.metadata["source"]).parts
        if len(path_parts) > 1:
            # Use the directory name as post_slug
            doc.metadata["post_slug"] = path_parts[-2]
            doc.metadata["post_title"] = path_parts[-2].replace("-", " ").title()

        # Add document length as metadata
        doc.metadata["content_length"] = len(doc.page_content)

    return documents
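The path-to-metadata logic above is pure string manipulation, so it can be checked in isolation. This sketch mirrors its steps (URL substitution, `index.md` stripping, slug and title derivation); the sample path matches the `data/<slug>/index.md` layout the notebook assumes:

```python
from pathlib import Path

def enrich(source, data_dir_prefix="data/",
           blog_base_url="https://thedataguy.pro/blog/",
           remove_suffix="index.md"):
    """Derive url, post_slug and post_title from a source path."""
    # Map the on-disk path to the published URL
    url = source.replace(data_dir_prefix, blog_base_url)
    if remove_suffix and url.endswith(remove_suffix):
        url = url[:-len(remove_suffix)]
    meta = {"source": source, "url": url}
    # The parent directory name doubles as the post slug
    parts = Path(source).parts
    if len(parts) > 1:
        meta["post_slug"] = parts[-2]
        meta["post_title"] = parts[-2].replace("-", " ").title()
    return meta

meta = enrich("data/introduction-to-ragas/index.md")
print(meta["url"], meta["post_title"])
```

The result matches the metadata visible in the `docs[0]` output earlier in this commit (`url` ending in a trailing slash, title-cased `post_title`).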
def get_document_stats(documents: List[Document]) -> Dict[str, Any]:
    """
    Get statistics about the documents.

    Args:
        documents: List of Document objects

    Returns:
        Dictionary with statistics
    """
    stats = {
        "total_documents": len(documents),
        "total_characters": sum(len(doc.page_content) for doc in documents),
        "min_length": min(len(doc.page_content) for doc in documents),
        "max_length": max(len(doc.page_content) for doc in documents),
        "avg_length": sum(len(doc.page_content) for doc in documents) / len(documents) if documents else 0,
    }

    # Create a list of document info for analysis
    doc_info = []
    for doc in documents:
        doc_info.append({
            "url": doc.metadata.get("url", ""),
            "source": doc.metadata.get("source", ""),
            "title": doc.metadata.get("post_title", ""),
            "text_length": doc.metadata.get("content_length", 0),
        })

    stats["documents"] = doc_info
    return stats
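The aggregate part of `get_document_stats` reduces to list arithmetic; a stripped-down version over raw strings (no `Document` objects) behaves the same:

```python
def get_document_stats(texts):
    """Compute the same aggregate stats as above, over plain strings."""
    lengths = [len(t) for t in texts]
    return {
        "total_documents": len(texts),
        "total_characters": sum(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": sum(lengths) / len(texts) if texts else 0,
    }

stats = get_document_stats(["a" * 100, "b" * 300])
print(stats)
```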
|
Removed code cell (`display_document_stats`):

```python
def display_document_stats(stats: Dict[str, Any]):
    """
    Display document statistics in a readable format.

    Args:
        stats: Dictionary with statistics from get_document_stats
    """
    print(f"Total Documents: {stats['total_documents']}")
    print(f"Total Characters: {stats['total_characters']}")
    print(f"Min Length: {stats['min_length']} characters")
    print(f"Max Length: {stats['max_length']} characters")
    print(f"Average Length: {stats['avg_length']:.2f} characters")

    # Display documents as a table
    import pandas as pd
    if stats["documents"]:
        df = pd.DataFrame(stats["documents"])
        display(df)
```
Removed code cell (`split_documents`):

```python
def split_documents(documents: List[Document],
                    chunk_size: int = 1000,
                    chunk_overlap: int = 200) -> List[Document]:
    """
    Split documents into chunks for better embedding and retrieval.

    Args:
        documents: List of Document objects to split
        chunk_size: Size of each chunk in characters
        chunk_overlap: Overlap between chunks in characters

    Returns:
        List of split Document objects
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
    )

    split_docs = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(split_docs)} chunks")
    return split_docs
```
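`RecursiveCharacterTextSplitter` splits on separators before falling back to character windows, but the effect of `chunk_size` and `chunk_overlap` can be sketched without the dependency. The `naive_chunks` helper below is illustrative only, not the library's actual algorithm:

```python
def naive_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Fixed-size character windows that step forward by chunk_size - chunk_overlap."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = naive_chunks("a" * 2500, chunk_size=1000, chunk_overlap=200)
print(len(chunks))  # 4
```

Each chunk repeats the last `chunk_overlap` characters of its predecessor, so a sentence cut at a boundary still appears whole in one of the two neighboring chunks.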
Removed code cell (`create_vector_store`):

```python
def create_vector_store(documents: List[Document],
                        storage_path: str = VECTOR_STORAGE_PATH,
                        collection_name: str = QDRANT_COLLECTION,
                        embedding_model: str = EMBEDDING_MODEL,
                        force_recreate: bool = False) -> Qdrant:
    """
    Create a vector store from documents.

    Args:
        documents: List of Document objects to store
        storage_path: Path to the vector store
        collection_name: Name of the collection
        embedding_model: Name of the embedding model
        force_recreate: Whether to force recreation of the vector store

    Returns:
        Qdrant vector store
    """
    # Initialize the embedding model
    embeddings = HuggingFaceEmbeddings(model_name=embedding_model)

    # Create the directory if it doesn't exist
    storage_dir = Path(storage_path).parent
    os.makedirs(storage_dir, exist_ok=True)

    # Check if vector store exists
    vector_store_exists = Path(storage_path).exists() and not force_recreate

    if vector_store_exists:
        print(f"Loading existing vector store from {storage_path}")
        try:
            vector_store = Qdrant(
                path=storage_path,
                embedding_function=embeddings,
                collection_name=collection_name
            )
            return vector_store
        except Exception as e:
            print(f"Error loading existing vector store: {e}")
            print("Creating new vector store...")
            force_recreate = True

    # Create new vector store
    print(f"Creating new vector store at {storage_path}")
    vector_store = Qdrant.from_documents(
        documents=documents,
        embedding=embeddings,
        path=storage_path,
        collection_name=collection_name,
    )

    return vector_store
```
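The control flow in `create_vector_store` is a general load-or-create caching pattern: try the cache, fall back to rebuilding on a miss or load error, and allow a forced rebuild. A dependency-free sketch, with a JSON file standing in for the Qdrant store (`load_or_create` is hypothetical, for illustration only):

```python
import json
import tempfile
from pathlib import Path

def load_or_create(path, build, force_recreate=False):
    """Return cached JSON at `path`; rebuild via `build()` when missing, forced, or corrupt."""
    p = Path(path)
    if p.exists() and not force_recreate:
        try:
            return json.loads(p.read_text())
        except Exception as e:
            print(f"Error loading cache: {e}; rebuilding...")
    data = build()
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(json.dumps(data))
    return data

with tempfile.TemporaryDirectory() as d:
    store = load_or_create(f"{d}/store.json", build=lambda: {"docs": 3})
    again = load_or_create(f"{d}/store.json", build=lambda: {"docs": 99})
    print(store, again)  # both {'docs': 3} -- the second call hits the cache
```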
Removed markdown and code cells:

## Example Usage

Here's how to use these utility functions for processing blog posts.

```python
def process_blog_posts(data_dir: str = DATA_DIR,
                       create_embeddings: bool = True,
                       force_recreate_embeddings: bool = False):
    """
    Complete pipeline to process blog posts and optionally create vector embeddings.

    Args:
        data_dir: Directory containing the blog posts
        create_embeddings: Whether to create vector embeddings
        force_recreate_embeddings: Whether to force recreation of embeddings

    Returns:
        Dictionary with data and vector store (if created)
    """
    # Load documents
    documents = load_blog_posts(data_dir)

    # Update metadata
    documents = update_document_metadata(documents)

    # Get and display stats
    stats = get_document_stats(documents)
    display_document_stats(stats)

    result = {
        "documents": documents,
        "stats": stats,
        "vector_store": None
    }

    # Create vector store if requested
    if create_embeddings:
        vector_store = create_vector_store(
            documents,
            force_recreate=force_recreate_embeddings
        )
        result["vector_store"] = vector_store

    return result
```
Removed code cell (example usage):

```python
# Example usage
if __name__ == "__main__":
    # Process blog posts without creating embeddings
    result = process_blog_posts(create_embeddings=False)

    # Example: Access the documents
    print(f"\nDocument example: {result['documents'][0].metadata}")

    # Create embeddings if needed
    # result = process_blog_posts(create_embeddings=True)

    # Retriever example
    # retriever = result["vector_store"].as_retriever()
    # query = "What is RAGAS?"
    # docs = retriever.invoke(query, k=2)
    # print(f"\nRetrieved {len(docs)} documents for query: {query}")
```
Removed markdown and code cells:

## Function for Loading Existing Vector Store

This function can be used to load an existing vector store without reprocessing all blog posts.

```python
def load_vector_store(storage_path: str = VECTOR_STORAGE_PATH,
                      collection_name: str = QDRANT_COLLECTION,
                      embedding_model: str = EMBEDDING_MODEL) -> Optional[Qdrant]:
    """
    Load an existing vector store.

    Args:
        storage_path: Path to the vector store
        collection_name: Name of the collection
        embedding_model: Name of the embedding model

    Returns:
        Qdrant vector store or None if it doesn't exist
    """
    # Initialize the embedding model
    embeddings = HuggingFaceEmbeddings(model_name=embedding_model)

    # Check if vector store exists
    if not Path(storage_path).exists():
        print(f"Vector store not found at {storage_path}")
        return None

    try:
        vector_store = Qdrant(
            path=storage_path,
            embedding_function=embeddings,
            collection_name=collection_name
        )
        print(f"Loaded vector store from {storage_path}")
        return vector_store
    except Exception as e:
        print(f"Error loading vector store: {e}")
        return None
```
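Once loaded, the store is typically wrapped with `as_retriever()` and queried, as in the commented-out retriever example above. What that similarity search does can be sketched with a toy bag-of-words cosine scorer; this is illustrative only, not Qdrant's vector search, and `retrieve` is a hypothetical helper:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query (toy term-overlap model)."""
    q = Counter(query.lower().split())
    return sorted(docs, key=lambda d: cosine(q, Counter(d.lower().split())),
                  reverse=True)[:k]

docs = ["RAGAS evaluates RAG pipelines", "Qdrant stores vectors", "Cooking pasta at home"]
print(retrieve("what is RAGAS", docs, k=1))  # ['RAGAS evaluates RAG pipelines']
```

A real embedding model replaces the term counts with dense vectors, so the ranking captures semantic similarity rather than exact word overlap.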
Removed notebook JSON footer:

```json
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
```