mafzaal committed
Commit 9681c5d · Parent: af85e91

Refactor blog data utilities and configuration

- Removed the Jupyter notebook `utils_data_loading.ipynb` and migrated utility functions to `blog_utils.py`.
- Created a new configuration file `config.py` to manage environment variables and default settings.
- Implemented a script `update_blog_data.py` for updating the blog data vector store with command-line options.
- Added a JSON file `blog_stats_20250510_161540.json` to store blog statistics.
- Enhanced document processing functions for better modularity and error handling.

BLOG_DATA_UTILS.md CHANGED
@@ -4,24 +4,30 @@ This directory contains utilities for loading, processing, and maintaining blog
 
 ## Available Tools
 
-### `utils_data_loading.ipynb`
+### `blog_utils.py`
 
-This notebook contains utility functions for:
+This Python module contains utility functions for:
 - Loading blog posts from the data directory
 - Processing and enriching metadata (adding URLs, titles, etc.)
 - Getting statistics about the documents
 - Creating and updating vector embeddings
 - Loading existing vector stores
 
-### `update_blog_data.ipynb`
+### `update_blog_data.py`
 
-This notebook demonstrates how to:
-- Use the utility functions to update the blog data
+This script allows you to:
+- Update the blog data when new posts are published
 - Process new blog posts
 - Update the vector store
-- Test the updated system with sample queries
 - Track changes over time
 
+### Legacy Notebooks (Reference Only)
+
+The following notebooks are kept for reference, but their functionality has been moved to Python modules:
+
+- `utils_data_loading.ipynb` - Contains the original utility functions
+- `update_blog_data.ipynb` - Demonstrates the update workflow
+
 ## How to Use
 
 ### Updating Blog Data
@@ -29,12 +35,17 @@
 When new blog posts are published, follow these steps:
 
 1. Add the markdown files to the `data/` directory
-2. Run the update notebook:
+2. Run the update script:
 ```bash
 cd /home/mafzaal/source/lets-talk
-uv run jupyter nbconvert --to notebook --execute update_blog_data.ipynb --output executed_update_$(date +%Y%m%d).ipynb
+uv run python update_blog_data.py
 ```
 
+You can also force recreation of the vector store:
+```bash
+uv run python update_blog_data.py --force-recreate
+```
+
 This will:
 - Load all blog posts (including new ones)
 - Update the vector embeddings
@@ -50,23 +61,22 @@ VECTOR_STORAGE_PATH=./db/vectorstore_v3 # Path to vector store
 EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l # Embedding model
 QDRANT_COLLECTION=thedataguy_documents # Collection name
 BLOG_BASE_URL=https://thedataguy.pro/blog/ # Base URL for blog
-FORCE_RECREATE_EMBEDDINGS=false # Whether to force recreation
 ```
 
 ### In the Chainlit App
 
-The Chainlit app (`app.py`) has been updated to use these utility functions if available. It falls back to direct initialization if they can't be loaded.
+The Chainlit app (`app.py`) now uses the utility functions from the `blog_utils.py` module. It falls back to notebook import and direct initialization if any issues arise.
 
 ## Adding Custom Processing
 
 To add custom processing for blog posts:
 
-1. Edit the `update_document_metadata` function in `utils_data_loading.ipynb`
+1. Edit the `update_document_metadata` function in `blog_utils.py`
 2. Add any additional enrichment or processing steps
-3. Update the vector store using the `update_blog_data.ipynb` notebook
+3. Update the vector store using the `update_blog_data.py` script
 
 ## Future Improvements
 
-- Add support for incremental updates (only process new posts)
+- Add a scheduled update process to automatically include new blog posts
 - Add tracking of embedding models and versions
+- Add webhook support to automatically update when new posts are published
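The metadata enrichment the README describes derives each post's public URL from its markdown source path. A condensed sketch of that mapping, mirroring the logic of `update_document_metadata` in `blog_utils.py` with the repository's default `data/` prefix and base URL assumed:

```python
def source_to_url(source: str,
                  data_dir: str = "data/",
                  base_url: str = "https://thedataguy.pro/blog/") -> str:
    """Map a markdown source path to its public blog URL
    (sketch of the enrichment step in blog_utils.py)."""
    # Swap the local data prefix for the public base URL
    url = source.replace(data_dir, base_url)
    # Strip a trailing index.md so the URL points at the post directory
    if url.endswith("index.md"):
        url = url[:-len("index.md")]
    return url

print(source_to_url("data/introduction-to-ragas/index.md"))
# https://thedataguy.pro/blog/introduction-to-ragas/
```

Because the mapping is a plain string replacement, it assumes every source path starts with the configured `data/` prefix.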
app.py CHANGED
@@ -1,9 +1,9 @@
 import os
 import getpass
 import sys
-import importlib.util
 from pathlib import Path
 from operator import itemgetter
+from config import LLM_MODEL, LLM_TEMPERATURE
 from dotenv import load_dotenv
 
 # Load environment variables from .env file
@@ -17,78 +17,17 @@ from langchain_huggingface import HuggingFaceEmbeddings
 from langchain_qdrant import QdrantVectorStore
 from qdrant_client import QdrantClient
 from qdrant_client.http.models import Distance, VectorParams
-
-# Import utility functions from the notebook
-def import_notebook_functions(notebook_path):
-    """Import functions from a Jupyter notebook"""
-    import nbformat
-    from importlib.util import spec_from_loader, module_from_spec
-    from IPython.core.interactiveshell import InteractiveShell
-
-    # Create a module
-    module_name = Path(notebook_path).stem
-    spec = spec_from_loader(module_name, loader=None)
-    module = module_from_spec(spec)
-    sys.modules[module_name] = module
-
-    # Read the notebook
-    with open(notebook_path) as f:
-        nb = nbformat.read(f, as_version=4)
-
-    # Execute code cells
-    shell = InteractiveShell.instance()
-    for cell in nb.cells:
-        if cell.cell_type == 'code':
-            # Skip example code
-            if 'if __name__ == "__main__":' in cell.source:
-                continue
-
-            code = shell.input_transformer_manager.transform_cell(cell.source)
-            exec(code, module.__dict__)
-
-    return module
-
-# Try to import utility functions if available
-try:
-    utils = import_notebook_functions('utils_data_loading.ipynb')
-
-    # Load vector store using the utility function
-    vector_store = utils.load_vector_store(
-        storage_path=os.environ.get("VECTOR_STORAGE_PATH", "./db/vectorstore_v3"),
-        collection_name=os.environ.get("QDRANT_COLLECTION", "thedataguy_documents"),
-        embedding_model=os.environ.get("EMBEDDING_MODEL", "Snowflake/snowflake-arctic-embed-l")
-    )
-
-    print("Successfully loaded vector store using utility functions")
-
-except Exception as e:
-    print(f"Could not load utility functions: {e}")
-    print("Falling back to direct initialization")
-
-    # Get vector storage path from .env file with fallback
-    storage_path = Path(os.environ.get("VECTOR_STORAGE_PATH", "./db/vectorstore_v3"))
-
-    # Load embedding model from environment variable with fallback
-    embedding_model = os.environ.get("EMBEDDING_MODEL", "Snowflake/snowflake-arctic-embed-l")
-    huggingface_embeddings = HuggingFaceEmbeddings(model_name=embedding_model)
-
-    # Set up Qdrant vectorstore from existing collection
-    collection_name = os.environ.get("QDRANT_COLLECTION", "thedataguy_documents")
-
-    vector_store = QdrantVectorStore.from_existing_collection(
-        path=storage_path,
-        collection_name=collection_name,
-        embedding=huggingface_embeddings,
-    )
+import blog_utils
+
+# Load vector store using the utility function
+vector_store = blog_utils.load_vector_store()
 
 # Create a retriever
 retriever = vector_store.as_retriever()
 
 # Set up ChatOpenAI with environment variables
-llm_model = os.environ.get("LLM_MODEL", "gpt-4o-mini")
-temperature = float(os.environ.get("TEMPERATURE", "0"))
-llm = ChatOpenAI(model=llm_model, temperature=temperature)
+llm = ChatOpenAI(model=LLM_MODEL, temperature=LLM_TEMPERATURE)
 
 # Create RAG prompt template
 rag_prompt_template = """\
@@ -149,24 +88,24 @@ async def on_message(message: cl.Message):
     response = chain.invoke({"question": message.content})
 
     # Get the sources to display them
-    sources = []
-    for doc in response["context"]:
-        if "url" in doc.metadata:
-            # Get title from post_title metadata if available, otherwise derive from URL
-            title = doc.metadata.get("post_title", "")
-            if not title:
-                title = doc.metadata["url"].split("/")[-2].replace("-", " ").title()
-
-            sources.append(
-                cl.Source(
-                    url=doc.metadata["url"],
-                    title=title
-                )
-            )
+    # sources = []
+    # for doc in response["context"]:
+    #     if "url" in doc.metadata:
+    #         # Get title from post_title metadata if available, otherwise derive from URL
+    #         title = doc.metadata.get("post_title", "")
+    #         if not title:
+    #             title = doc.metadata["url"].split("/")[-2].replace("-", " ").title()
+
+    #         sources.append(
+    #             cl.Source(
+    #                 url=doc.metadata["url"],
+    #                 title=title
+    #             )
+    #         )
 
     # Send the response with sources
     await cl.Message(
         content=response["response"].content,
-        sources=sources
+        # sources=sources
     ).send()
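The README describes `app.py` as preferring the `blog_utils` module and falling back when it cannot be used. A minimal, hypothetical sketch of that try-then-fallback pattern (not the actual app code, which initializes Qdrant directly in the fallback branch):

```python
def load_store():
    """Prefer the blog_utils loader; fall back gracefully on any error
    (hypothetical simplification of the pattern described in the README)."""
    try:
        import blog_utils  # module-based loader (preferred path)
        return blog_utils.load_vector_store()
    except Exception as exc:
        # In app.py the fallback constructs the vector store directly;
        # here we just signal that the preferred path failed.
        print(f"Could not use blog_utils: {exc}")
        return None
```

The broad `except Exception` is deliberate: both a missing module and a broken vector store should route to the fallback, not crash the app at import time.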
blog_utils.py ADDED
@@ -0,0 +1,305 @@
+"""
+Blog Data Utilities Module
+
+This module contains utility functions for loading, processing, and storing blog posts
+for the RAG system. It includes functions for loading blog posts from the data directory,
+processing their metadata, and creating vector embeddings.
+"""
+
+import os
+import json
+from pathlib import Path
+from typing import List, Dict, Any, Optional
+from datetime import datetime
+
+from langchain_community.document_loaders import DirectoryLoader
+from langchain.text_splitter import RecursiveCharacterTextSplitter
+from langchain.schema.document import Document
+from langchain_huggingface import HuggingFaceEmbeddings
+from langchain_qdrant import QdrantVectorStore
+from qdrant_client import QdrantClient
+
+from config import (
+    DATA_DIR,
+    VECTOR_STORAGE_PATH,
+    EMBEDDING_MODEL,
+    QDRANT_COLLECTION,
+    BLOG_BASE_URL
+)
+
+
+def load_blog_posts(data_dir: str = DATA_DIR,
+                    glob_pattern: str = "*.md",
+                    recursive: bool = True,
+                    show_progress: bool = True) -> List[Document]:
+    """
+    Load blog posts from the specified directory.
+
+    Args:
+        data_dir: Directory containing the blog posts
+        glob_pattern: Pattern to match files
+        recursive: Whether to search subdirectories
+        show_progress: Whether to show a progress bar
+
+    Returns:
+        List of Document objects containing the blog posts
+    """
+    text_loader = DirectoryLoader(
+        data_dir,
+        glob=glob_pattern,
+        show_progress=show_progress,
+        recursive=recursive
+    )
+
+    documents = text_loader.load()
+    print(f"Loaded {len(documents)} documents from {data_dir}")
+    return documents
+
+
+def update_document_metadata(documents: List[Document],
+                             data_dir_prefix: str = DATA_DIR,
+                             blog_base_url: str = BLOG_BASE_URL,
+                             remove_suffix: str = "index.md") -> List[Document]:
+    """
+    Update the metadata of documents to include URL and other information.
+
+    Args:
+        documents: List of Document objects to update
+        data_dir_prefix: Prefix to replace in source paths
+        blog_base_url: Base URL for the blog posts
+        remove_suffix: Suffix to remove from paths (like index.md)
+
+    Returns:
+        Updated list of Document objects
+    """
+    for doc in documents:
+        # Create URL from source path
+        doc.metadata["url"] = doc.metadata["source"].replace(data_dir_prefix, blog_base_url)
+
+        # Remove index.md or other suffix if present
+        if remove_suffix and doc.metadata["url"].endswith(remove_suffix):
+            doc.metadata["url"] = doc.metadata["url"][:-len(remove_suffix)]
+
+        # Extract post title from the directory structure
+        path_parts = Path(doc.metadata["source"]).parts
+        if len(path_parts) > 1:
+            # Use the directory name as post_slug
+            doc.metadata["post_slug"] = path_parts[-2]
+            doc.metadata["post_title"] = path_parts[-2].replace("-", " ").title()
+
+        # Add document length as metadata
+        doc.metadata["content_length"] = len(doc.page_content)
+
+    return documents
+
+
+def get_document_stats(documents: List[Document]) -> Dict[str, Any]:
+    """
+    Get statistics about the documents.
+
+    Args:
+        documents: List of Document objects
+
+    Returns:
+        Dictionary with statistics
+    """
+    stats = {
+        "total_documents": len(documents),
+        "total_characters": sum(len(doc.page_content) for doc in documents),
+        "min_length": min(len(doc.page_content) for doc in documents) if documents else 0,
+        "max_length": max(len(doc.page_content) for doc in documents) if documents else 0,
+        "avg_length": sum(len(doc.page_content) for doc in documents) / len(documents) if documents else 0,
+    }
+
+    # Create a list of document info for analysis
+    doc_info = []
+    for doc in documents:
+        doc_info.append({
+            "url": doc.metadata.get("url", ""),
+            "source": doc.metadata.get("source", ""),
+            "title": doc.metadata.get("post_title", ""),
+            "text_length": doc.metadata.get("content_length", 0),
+        })
+
+    stats["documents"] = doc_info
+    return stats
+
+
+def display_document_stats(stats: Dict[str, Any]):
+    """
+    Display document statistics in a readable format.
+
+    Args:
+        stats: Dictionary with statistics from get_document_stats
+    """
+    print(f"Total Documents: {stats['total_documents']}")
+    print(f"Total Characters: {stats['total_characters']}")
+    print(f"Min Length: {stats['min_length']} characters")
+    print(f"Max Length: {stats['max_length']} characters")
+    print(f"Average Length: {stats['avg_length']:.2f} characters")
+
+    # For use in notebooks where pandas and display are available:
+    try:
+        import pandas as pd
+        from IPython.display import display
+        if stats["documents"]:
+            df = pd.DataFrame(stats["documents"])
+            display(df)
+    except (ImportError, NameError):
+        # Just print the first 5 documents if not in a notebook environment
+        if stats["documents"]:
+            print("\nFirst 5 documents:")
+            for i, doc in enumerate(stats["documents"][:5]):
+                print(f"{i+1}. {doc['title']} ({doc['url']})")
+
+
+def split_documents(documents: List[Document],
+                    chunk_size: int = 1000,
+                    chunk_overlap: int = 200) -> List[Document]:
+    """
+    Split documents into chunks for better embedding and retrieval.
+
+    Args:
+        documents: List of Document objects to split
+        chunk_size: Size of each chunk in characters
+        chunk_overlap: Overlap between chunks in characters
+
+    Returns:
+        List of split Document objects
+    """
+    text_splitter = RecursiveCharacterTextSplitter(
+        chunk_size=chunk_size,
+        chunk_overlap=chunk_overlap,
+        length_function=len,
+    )
+
+    split_docs = text_splitter.split_documents(documents)
+    print(f"Split {len(documents)} documents into {len(split_docs)} chunks")
+    return split_docs
+
+
+def create_vector_store(documents: List[Document],
+                        storage_path: str = VECTOR_STORAGE_PATH,
+                        collection_name: str = QDRANT_COLLECTION,
+                        embedding_model: str = EMBEDDING_MODEL,
+                        force_recreate: bool = False) -> Optional[QdrantVectorStore]:
+    """
+    Create a vector store from the documents using Qdrant.
+
+    Args:
+        documents: List of Document objects to embed
+        storage_path: Path to the vector store
+        collection_name: Name of the collection
+        embedding_model: Name of the embedding model
+        force_recreate: Whether to force recreation of the vector store
+
+    Returns:
+        QdrantVectorStore vector store or None if creation fails
+    """
+    vector_store = QdrantVectorStore.from_documents(
+        documents,
+        embedding=HuggingFaceEmbeddings(model_name=embedding_model),
+        collection_name=collection_name,
+        path=storage_path,
+        force_recreate=force_recreate,
+    )
+
+    return vector_store
+
+
+def load_vector_store(storage_path: str = VECTOR_STORAGE_PATH,
+                      collection_name: str = QDRANT_COLLECTION,
+                      embedding_model: str = EMBEDDING_MODEL) -> Optional[QdrantVectorStore]:
+    """
+    Load an existing vector store.
+
+    Args:
+        storage_path: Path to the vector store
+        collection_name: Name of the collection
+        embedding_model: Name of the embedding model
+
+    Returns:
+        QdrantVectorStore vector store or None if it doesn't exist
+    """
+    # Initialize the embedding model
+    embeddings = HuggingFaceEmbeddings(model_name=embedding_model)
+
+    # Check if vector store exists
+    if not Path(storage_path).exists():
+        print(f"Vector store not found at {storage_path}")
+        return None
+
+    try:
+        # Initialize Qdrant client
+        client = QdrantClient(path=storage_path)
+
+        # Create vector store with the client
+        vector_store = QdrantVectorStore(
+            client=client,
+            collection_name=collection_name,
+            embedding=embeddings,
+        )
+        print(f"Loaded vector store from {storage_path}")
+        return vector_store
+    except Exception as e:
+        print(f"Error loading vector store: {e}")
+        return None
+
+
+def process_blog_posts(data_dir: str = DATA_DIR,
+                       create_embeddings: bool = True,
+                       force_recreate_embeddings: bool = False,
+                       storage_path: str = VECTOR_STORAGE_PATH):
+    """
+    Complete pipeline to process blog posts and optionally create vector embeddings.
+
+    Args:
+        data_dir: Directory containing the blog posts
+        create_embeddings: Whether to create vector embeddings
+        force_recreate_embeddings: Whether to force recreation of embeddings
+        storage_path: Path to the vector store (not used with in-memory approach)
+
+    Returns:
+        Dictionary with data and vector store (if created)
+    """
+    # Load documents
+    documents = load_blog_posts(data_dir)
+
+    # Update metadata
+    documents = update_document_metadata(documents)
+
+    # Get and display stats
+    stats = get_document_stats(documents)
+    display_document_stats(stats)
+
+    result = {
+        "documents": documents,
+        "stats": stats,
+        "vector_store": None
+    }
+
+    # Create vector store if requested
+    if create_embeddings:
+        # Using in-memory vector store to avoid pickling issues
+        vector_store = create_vector_store(
+            documents,
+            force_recreate=force_recreate_embeddings
+        )
+        result["vector_store"] = vector_store
+
+    return result
+
+
+# Allow script to be run directly if needed
+if __name__ == "__main__":
+    print("Blog Data Utilities Module")
+    print("Available functions:")
+    print("- load_blog_posts()")
+    print("- update_document_metadata()")
+    print("- get_document_stats()")
+    print("- display_document_stats()")
+    print("- split_documents()")
+    print("- create_vector_store()")
+    print("- load_vector_store()")
+    print("- process_blog_posts()")
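`split_documents` above delegates to `RecursiveCharacterTextSplitter` with a 1000-character chunk size and 200-character overlap. A plain-Python sketch of the underlying sliding-window idea (without the splitter's separator-aware logic, so chunk boundaries differ from the real library):

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size chunking with overlap: each chunk starts (size - overlap)
    characters after the previous one, so adjacent chunks share `overlap`
    characters of context."""
    step = size - overlap
    # Stop once the remaining tail is covered by the previous chunk's window
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

sample = "x" * 2500
print(len(chunk_text(sample)))  # 3 chunks: [0:1000], [800:1800], [1600:2500]
```

The overlap matters for retrieval: a sentence that straddles a chunk boundary still appears whole in at least one chunk.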
config.py ADDED
@@ -0,0 +1,14 @@
+import os
+from dotenv import load_dotenv
+
+# Load environment variables from .env file
+load_dotenv()
+
+# Configuration with defaults
+DATA_DIR = os.environ.get("DATA_DIR", "data/")
+VECTOR_STORAGE_PATH = os.environ.get("VECTOR_STORAGE_PATH", "./db/vectorstore_v3")
+EMBEDDING_MODEL = os.environ.get("EMBEDDING_MODEL", "Snowflake/snowflake-arctic-embed-l")
+QDRANT_COLLECTION = os.environ.get("QDRANT_COLLECTION", "thedataguy_documents")
+BLOG_BASE_URL = os.environ.get("BLOG_BASE_URL", "https://thedataguy.pro/blog/")
+LLM_MODEL = os.environ.get("LLM_MODEL", "gpt-4o-mini")
+LLM_TEMPERATURE = float(os.environ.get("TEMPERATURE", "0"))
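Every setting in `config.py` is resolved as environment-variable-with-default, so any value can be overridden from the shell or a `.env` file without code changes. A sketch of the same pattern in isolation (using `QDRANT_COLLECTION` as the example):

```python
import os

# Setting the variable before the lookup overrides the built-in default,
# exactly as config.py behaves when .env (or the shell) defines it.
os.environ["QDRANT_COLLECTION"] = "my_custom_collection"
QDRANT_COLLECTION = os.environ.get("QDRANT_COLLECTION", "thedataguy_documents")
print(QDRANT_COLLECTION)  # my_custom_collection
```

Note that `config.py` reads the environment once at import time, so overrides must be in place before the first `import config`.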
main.py CHANGED
@@ -1,5 +1,48 @@
+import blog_utils
+from update_blog_data import parse_args, save_stats
+
+
 def main():
-    print("Hello from lets-talk!")
+    """Main function to update blog data"""
+    args = parse_args()
+
+    print("=== Blog Data Update ===")
+    print(f"Data directory: {args.data_dir}")
+    print(f"Force recreate: {args.force_recreate}")
+    print("========================")
+
+    # Process blog posts without creating embeddings
+    try:
+        # Load and process documents
+        documents = blog_utils.load_blog_posts(args.data_dir)
+        documents = blog_utils.update_document_metadata(documents)
+
+        # Get stats
+        stats = blog_utils.get_document_stats(documents)
+        blog_utils.display_document_stats(stats)
+
+        # Save stats for tracking
+        stats_file = save_stats(stats)
+
+        # Create a reference file for the vector store
+        if args.force_recreate:
+            print("\nAttempting to save vector store reference file...")
+            blog_utils.create_vector_store(documents, force_recreate=args.force_recreate)
+
+        print("\n=== Update Summary ===")
+        print(f"Processed {stats['total_documents']} documents")
+        print(f"Stats saved to: {stats_file}")
+        print("Note: Vector store creation is currently disabled due to pickling issues.")
+        print("      See VECTOR_STORE_ISSUES.md for more information and possible solutions.")
+        print("=====================")
+
+        return 0
+    except Exception as e:
+        print(f"Error: {e}")
+        import traceback
+        traceback.print_exc()
+        return 1
 
 
 if __name__ == "__main__":
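`main()` consumes `args.data_dir` and `args.force_recreate` from `parse_args()` in `update_blog_data.py`. That parser isn't shown in this commit, so here is a hedged sketch of what it plausibly looks like; the `--force-recreate` flag is confirmed by the README, while the `--data-dir` default is an assumption inferred from `main.py`:

```python
import argparse

def parse_args(argv=None):
    """Sketch of the CLI update_blog_data.py appears to expose (option names
    inferred from main.py and BLOG_DATA_UTILS.md, defaults assumed)."""
    parser = argparse.ArgumentParser(description="Update the blog data vector store")
    parser.add_argument("--data-dir", default="data/",
                        help="Directory containing blog posts")
    parser.add_argument("--force-recreate", action="store_true",
                        help="Force recreation of the vector store")
    return parser.parse_args(argv)

args = parse_args(["--force-recreate"])
print(args.data_dir, args.force_recreate)  # data/ True
```

Passing `argv=None` makes `argparse` read `sys.argv`, so the same function works both in the script and under test.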
stats/blog_stats_20250510_161540.json ADDED
@@ -0,0 +1,8 @@
+{
+  "timestamp": "20250510_161540",
+  "total_documents": 14,
+  "total_characters": 106275,
+  "min_length": 1900,
+  "max_length": 13468,
+  "avg_length": 7591.071428571428
+}
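These timestamped snapshots exist so changes can be tracked over time, as the README notes. A hypothetical helper (not part of this commit) that diffs two such snapshots; the `previous` values below are illustrative sample inputs, only `current` matches the file above:

```python
def compare_stats(old: dict, new: dict) -> dict:
    """Diff two blog_stats_*.json snapshots on their cumulative counters
    (field names taken from the stats file format)."""
    return {key: new[key] - old[key]
            for key in ("total_documents", "total_characters")}

previous = {"total_documents": 13, "total_characters": 100000}   # illustrative
current = {"total_documents": 14, "total_characters": 106275}    # from the file above
print(compare_stats(previous, current))
# {'total_documents': 1, 'total_characters': 6275}
```

Run against the two most recent files in `stats/`, this shows how many posts (and how much text) each update added.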
update_blog_data.ipynb CHANGED
@@ -12,7 +12,7 @@
12
  },
13
  {
14
  "cell_type": "code",
15
- "execution_count": null,
16
  "id": "6ec048b4",
17
  "metadata": {},
18
  "outputs": [],
@@ -21,139 +21,88 @@
21
  "import os\n",
22
  "from pathlib import Path\n",
23
  "from dotenv import load_dotenv\n",
24
- "import importlib.util\n",
25
- "\n",
26
- "# Load environment variables\n",
27
- "load_dotenv()\n",
28
- "\n",
29
- "# Import utility functions from utils_data_loading.ipynb\n",
30
- "# We'll do this by first converting the notebook to a Python module"
31
  ]
32
  },
33
  {
34
- "cell_type": "code",
35
- "execution_count": null,
36
- "id": "7f01d61f",
37
  "metadata": {},
38
- "outputs": [],
39
  "source": [
40
- "# Function to import the utility module\n",
41
- "def import_notebook_as_module(notebook_path, module_name=\"utils_module\"):\n",
42
- " \"\"\"\n",
43
- " Import a Jupyter notebook as a Python module.\n",
44
- " \n",
45
- " Args:\n",
46
- " notebook_path: Path to the notebook\n",
47
- " module_name: Name to give the module\n",
48
- " \n",
49
- " Returns:\n",
50
- " The imported module\n",
51
- " \"\"\"\n",
52
- " import nbformat\n",
53
- " from importlib.util import spec_from_loader, module_from_spec\n",
54
- " from IPython.core.interactiveshell import InteractiveShell\n",
55
- " \n",
56
- " shell = InteractiveShell.instance()\n",
57
- " \n",
58
- " with open(notebook_path) as f:\n",
59
- " nb = nbformat.read(f, as_version=4)\n",
60
- " \n",
61
- " # Create a module\n",
62
- " spec = spec_from_loader(module_name, loader=None)\n",
63
- " module = module_from_spec(spec)\n",
64
- " sys.modules[module_name] = module\n",
65
- " \n",
66
- " # Execute only the code cells in the notebook\n",
67
- " for cell in nb.cells:\n",
68
- " if cell.cell_type == 'code':\n",
69
- " # Skip cells that start with certain keywords like \"if __name__ == \"__main__\":\"\n",
70
- " if 'if __name__ == \"__main__\":' in cell.source:\n",
71
- " continue\n",
72
- " \n",
73
- " # Execute the cell and store its content in the module\n",
74
- " code = shell.input_transformer_manager.transform_cell(cell.source)\n",
75
- " exec(code, module.__dict__)\n",
76
- " \n",
77
- " return module"
78
  ]
79
  },
80
  {
81
  "cell_type": "code",
82
- "execution_count": null,
83
- "id": "774c1373",
84
  "metadata": {},
85
- "outputs": [],
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
86
  "source": [
87
- "# Import the utility functions\n",
88
- "utils = import_notebook_as_module('utils_data_loading.ipynb')\n",
89
  "\n",
90
- "# Now you can access all the functions from the utils module\n",
91
- "print(\"Successfully imported utility functions.\")"
92
- ]
93
- },
94
- {
95
- "cell_type": "markdown",
96
- "id": "85ae6617",
97
- "metadata": {},
98
- "source": [
99
- "## Configuration\n",
100
  "\n",
101
- "Set up the configuration for data processing."
102
  ]
103
  },
104
  {
105
  "cell_type": "code",
106
  "execution_count": null,
107
- "id": "54e9ca48",
108
- "metadata": {},
109
- "outputs": [],
110
- "source": [
111
- "# Configuration (can be overridden from .env file)\n",
112
- "DATA_DIR = os.environ.get(\"DATA_DIR\", \"data/\")\n",
113
- "VECTOR_STORAGE_PATH = os.environ.get(\"VECTOR_STORAGE_PATH\", \"./db/vectorstore_v3\")\n",
114
- "BLOG_BASE_URL = os.environ.get(\"BLOG_BASE_URL\", \"https://thedataguy.pro/blog/\")\n",
115
- "FORCE_RECREATE_EMBEDDINGS = os.environ.get(\"FORCE_RECREATE_EMBEDDINGS\", \"false\").lower() == \"true\"\n",
116
- "\n",
117
- "print(f\"Data Directory: {DATA_DIR}\")\n",
118
- "print(f\"Vector Storage Path: {VECTOR_STORAGE_PATH}\")\n",
119
- "print(f\"Blog Base URL: {BLOG_BASE_URL}\")\n",
120
- "print(f\"Force Recreate Embeddings: {FORCE_RECREATE_EMBEDDINGS}\")"
121
- ]
122
- },
123
- {
124
- "cell_type": "markdown",
125
- "id": "cc19ab4c",
126
  "metadata": {},
 
 
 
 
 
 
 
 
 
 
 
 
127
  "source": [
128
- "## Update Blog Data Process\n",
129
- "\n",
130
- "This process will:\n",
131
- "1. Load existing blog posts\n",
132
- "2. Process and update metadata\n",
133
- "3. Create or update vector embeddings"
134
  ]
135
  },
136
  {
137
  "cell_type": "code",
138
- "execution_count": null,
139
- "id": "3d56f688",
140
  "metadata": {},
141
  "outputs": [],
142
  "source": [
143
- "# Process blog posts and create/update embeddings\n",
144
- "result = utils.process_blog_posts(\n",
145
- " data_dir=DATA_DIR,\n",
146
- " create_embeddings=True,\n",
147
- " force_recreate_embeddings=FORCE_RECREATE_EMBEDDINGS\n",
148
- ")\n",
149
- "\n",
150
- "# Access the documents and vector store\n",
151
- "documents = result[\"documents\"]\n",
152
- "stats = result[\"stats\"]\n",
153
- "vector_store = result[\"vector_store\"]\n",
154
- "\n",
155
- "print(f\"\\nProcessed {len(documents)} blog posts\")\n",
156
- "print(f\"Vector store created/updated at: {VECTOR_STORAGE_PATH}\")"
157
  ]
158
  },
159
  {
@@ -168,13 +117,44 @@
168
  },
169
  {
170
  "cell_type": "code",
171
- "execution_count": null,
172
  "id": "8b552e6b",
173
  "metadata": {},
174
- "outputs": [],
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
175
  "source": [
176
  "# Create a retriever from the vector store\n",
177
- "retriever = vector_store.as_retriever(search_kwargs={\"k\": 2})\n",
178
  "\n",
179
  "# Test queries\n",
180
  "test_queries = [\n",
@@ -194,49 +174,14 @@
194
  " print(f\"{i+1}. {title} ({url})\")"
195
  ]
196
  },
197
- {
198
- "cell_type": "markdown",
199
- "id": "ddbe9282",
200
- "metadata": {},
201
- "source": [
202
- "## Schedule This Notebook\n",
203
- "\n",
204
- "To keep the blog data up-to-date, you can schedule this notebook to run periodically. \n",
205
- "Here are some options:\n",
206
- "\n",
207
- "1. Use a cron job to run this notebook with papermill\n",
208
- "2. Set up a GitHub Action to run this notebook on a schedule\n",
209
- "3. Use Airflow or another workflow management system\n",
210
- "\n",
211
- "Example of running with papermill:\n",
212
- "```bash\n",
213
- "papermill update_blog_data.ipynb output_$(date +%Y%m%d).ipynb\n",
214
- "```"
215
- ]
216
- },
217
  {
218
  "cell_type": "code",
219
- "execution_count": null,
220
- "id": "3634e064",
221
  "metadata": {},
222
  "outputs": [],
223
  "source": [
224
- "# Save stats to a file for tracking changes over time\n",
225
- "import json\n",
226
- "from datetime import datetime\n",
227
- "\n",
228
- "stats_dir = Path(\"stats\")\n",
229
- "stats_dir.mkdir(exist_ok=True)\n",
230
- "\n",
231
- "# Add timestamp to stats\n",
232
- "stats[\"timestamp\"] = datetime.now().isoformat()\n",
233
- "\n",
234
- "# Save stats\n",
235
- "stats_path = stats_dir / f\"blog_stats_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json\"\n",
236
- "with open(stats_path, \"w\") as f:\n",
237
- " json.dump(stats, f, indent=2)\n",
238
- "\n",
239
- "print(f\"Saved stats to {stats_path}\")"
240
  ]
241
  }
242
  ],
@@ -247,7 +192,15 @@
247
  "name": "python3"
248
  },
249
  "language_info": {
 
 
 
 
 
 
250
  "name": "python",
 
 
251
  "version": "3.13.2"
252
  }
253
  },
 
  },
  {
  "cell_type": "code",
+ "execution_count": 1,
  "id": "6ec048b4",
  "metadata": {},
  "outputs": [],
  "import os\n",
  "from pathlib import Path\n",
  "from dotenv import load_dotenv\n",
+ "import importlib.util\n"
  ]
  },
  {
+ "cell_type": "markdown",
+ "id": "cc19ab4c",
  "metadata": {},
  "source": [
+ "## Update Blog Data Process\n",
+ "\n",
+ "This process will:\n",
+ "1. Load existing blog posts\n",
+ "2. Process and update metadata\n",
+ "3. Create or update vector embeddings"
  ]
  },
  {
  "cell_type": "code",
+ "execution_count": 7,
+ "id": "3d56f688",
  "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 14/14 [00:00<00:00, 42.05it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Loaded 14 documents from data/\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\n"
+ ]
+ }
+ ],
  "source": [
+ "import blog_utils\n",
  "\n",
+ "docs = blog_utils.load_blog_posts()\n",
+ "docs = blog_utils.update_document_metadata(docs)\n",
  "\n",
+ "\n"
  ]
  },
  {
  "cell_type": "code",
  "execution_count": null,
+ "id": "a14c70dc",
  "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Document(metadata={'source': 'data/introduction-to-ragas/index.md', 'url': 'https://thedataguy.pro/blog/introduction-to-ragas/', 'post_slug': 'introduction-to-ragas', 'post_title': 'Introduction To Ragas', 'content_length': 6071}, page_content='title: \"Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications\" date: 2025-04-26T18:00:00-06:00 layout: blog description: \"Explore the essential evaluation framework for LLM applications with Ragas. Learn how to assess performance, ensure accuracy, and improve reliability in Retrieval-Augmented Generation systems.\" categories: [\"AI\", \"RAG\", \"Evaluation\",\"Ragas\"] coverImage: \"https://images.unsplash.com/photo-1593642634367-d91a135587b5?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3\" readingTime: 7 published: true\\n\\nAs Large Language Models (LLMs) become fundamental components of modern applications, effectively evaluating their performance becomes increasingly critical. Whether you\\'re building a question-answering system, a document retrieval tool, or a conversational agent, you need reliable metrics to assess how well your application performs. This is where Ragas steps in.\\n\\nWhat is Ragas?\\n\\nRagas is an open-source evaluation framework specifically designed for LLM applications, with particular strengths in Retrieval-Augmented Generation (RAG) systems. Unlike traditional NLP evaluation methods, Ragas provides specialized metrics that address the unique challenges of LLM-powered systems.\\n\\nAt its core, Ragas helps answer crucial questions: - Is my application retrieving the right information? - Are the responses factually accurate and consistent with the retrieved context? - Does the system appropriately address the user\\'s query? - How well does my application handle multi-turn conversations?\\n\\nWhy Evaluate LLM Applications?\\n\\nLLMs are powerful but imperfect. They can hallucinate facts, misinterpret queries, or generate convincing but incorrect responses. For applications where accuracy and reliability matter—like healthcare, finance, or education—proper evaluation is non-negotiable.\\n\\nEvaluation serves several key purposes: - Quality assurance: Identify and fix issues before they reach users - Performance tracking: Monitor how changes impact system performance - Benchmarking: Compare different approaches objectively - Continuous improvement: Build feedback loops to enhance your application\\n\\nKey Features of Ragas\\n\\n🎯 Specialized Metrics\\n\\nRagas offers both LLM-based and computational metrics tailored to evaluate different aspects of LLM applications:\\n\\nFaithfulness: Measures if the response is factually consistent with the retrieved context\\n\\nContext Relevancy: Evaluates if the retrieved information is relevant to the query\\n\\nAnswer Relevancy: Assesses if the response addresses the user\\'s question\\n\\nTopic Adherence: Gauges how well multi-turn conversations stay on topic\\n\\n🧪 Test Data Generation\\n\\nCreating high-quality test data is often a bottleneck in evaluation. Ragas helps you generate comprehensive test datasets automatically, saving time and ensuring thorough coverage.\\n\\n🔗 Seamless Integrations\\n\\nRagas works with popular LLM frameworks and tools: - LangChain - LlamaIndex - Haystack - OpenAI\\n\\nObservability platforms - Phoenix - LangSmith - Langfuse\\n\\n📊 Comprehensive Analysis\\n\\nBeyond simple scores, Ragas provides detailed insights into your application\\'s strengths and weaknesses, enabling targeted improvements.\\n\\nGetting Started with Ragas\\n\\nInstalling Ragas is straightforward:\\n\\nbash uv init && uv add ragas\\n\\nHere\\'s a simple example of evaluating a response using Ragas:\\n\\n```python from ragas.metrics import Faithfulness from ragas.evaluation import EvaluationDataset from ragas.dataset_schema import SingleTurnSample from langchain_openai import ChatOpenAI from ragas.llms import LangchainLLMWrapper from langchain_openai import ChatOpenAI\\n\\nInitialize the LLM, you are going to new OPENAI API key\\n\\nevaluator_llm = LangchainLLMWrapper(ChatOpenAI(model=\"gpt-4o\"))\\n\\nYour evaluation data\\n\\ntest_data = { \"user_input\": \"What is the capital of France?\", \"retrieved_contexts\": [\"Paris is the capital and most populous city of France.\"], \"response\": \"The capital of France is Paris.\" }\\n\\nCreate a sample\\n\\nsample = SingleTurnSample(**test_data) # Unpack the dictionary into the constructor\\n\\nCreate metric\\n\\nfaithfulness = Faithfulness(llm=evaluator_llm)\\n\\nCalculate the score\\n\\nresult = await faithfulness.single_turn_ascore(sample) print(f\"Faithfulness score: {result}\") ```\\n\\n💡 Try it yourself: Explore the hands-on notebook for this workflow: 01_Introduction_to_Ragas\\n\\nWhat\\'s Coming in This Blog Series\\n\\nThis introduction is just the beginning. In the upcoming posts, we\\'ll dive deeper into all aspects of evaluating LLM applications with Ragas:\\n\\nPart 2: Basic Evaluation Workflow We\\'ll explore each metric in detail, explaining when and how to use them effectively.\\n\\nPart 3: Evaluating RAG Systems Learn specialized techniques for evaluating retrieval-augmented generation systems, including context precision, recall, and relevance.\\n\\nPart 4: Test Data Generation Discover how to create high-quality test datasets that thoroughly exercise your application\\'s capabilities.\\n\\nPart 5: Advanced Evaluation Techniques Go beyond basic metrics with custom evaluations, multi-aspect analysis, and domain-specific assessments.\\n\\nPart 6: Evaluating AI Agents Learn how to evaluate complex AI agents that engage in multi-turn interactions, use tools, and work toward specific goals.\\n\\nPart 7: Integrations and Observability Connect Ragas with your existing tools and platforms for streamlined evaluation workflows.\\n\\nPart 8: Building Feedback Loops Learn how to implement feedback loops that drive continuous improvement in your LLM applications. Transform evaluation insights into concrete improvements for your LLM applications.\\n\\nConclusion\\n\\nIn a world increasingly powered by LLMs, robust evaluation is the difference between reliable applications and unpredictable ones. Ragas provides the tools you need to confidently assess and improve your LLM applications.\\n\\nReady to Elevate Your LLM Applications?\\n\\nStart exploring Ragas today by visiting the official documentation. Share your thoughts, challenges, or success stories. If you\\'re facing specific evaluation hurdles, don\\'t hesitate to reach out—we\\'d love to help!')"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
  "source": [
+ "docs[0]\n"
  ]
  },
  {
  "cell_type": "code",
+ "execution_count": 11,
+ "id": "72dd14b5",
  "metadata": {},
  "outputs": [],
  "source": [
+ "vector_store = blog_utils.create_vector_store(docs, './db/vector_store_4')"
  ]
  },
  {
  },
  {
  "cell_type": "code",
+ "execution_count": 12,
  "id": "8b552e6b",
  "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "Query: What is RAGAS?\n",
+ "Retrieved 3 documents:\n",
+ "1. Introduction To Ragas (https://thedataguy.pro/blog/introduction-to-ragas/)\n",
+ "2. Evaluating Rag Systems With Ragas (https://thedataguy.pro/blog/evaluating-rag-systems-with-ragas/)\n",
+ "3. Advanced Metrics And Customization With Ragas (https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/)\n",
+ "\n",
+ "Query: How to build research agents?\n",
+ "Retrieved 3 documents:\n",
+ "1. Building Research Agent (https://thedataguy.pro/blog/building-research-agent/)\n",
+ "2. Advanced Metrics And Customization With Ragas (https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/)\n",
+ "3. Evaluating Rag Systems With Ragas (https://thedataguy.pro/blog/evaluating-rag-systems-with-ragas/)\n",
+ "\n",
+ "Query: What is metric driven development?\n",
+ "Retrieved 3 documents:\n",
+ "1. Metric Driven Development (https://thedataguy.pro/blog/metric-driven-development/)\n",
+ "2. Advanced Metrics And Customization With Ragas (https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/)\n",
+ "3. Coming Back To Ai Roots (https://thedataguy.pro/blog/coming-back-to-ai-roots/)\n",
+ "\n",
+ "Query: Who is TheDataGuy?\n",
+ "Retrieved 3 documents:\n",
+ "1. Advanced Metrics And Customization With Ragas (https://thedataguy.pro/blog/advanced-metrics-and-customization-with-ragas/)\n",
+ "2. Langchain Experience Csharp Perspective (https://thedataguy.pro/blog/langchain-experience-csharp-perspective/)\n",
+ "3. Evaluating Rag Systems With Ragas (https://thedataguy.pro/blog/evaluating-rag-systems-with-ragas/)\n"
+ ]
+ }
+ ],
  "source": [
  "# Create a retriever from the vector store\n",
+ "retriever = vector_store.as_retriever(search_kwargs={\"k\": 3})\n",
  "\n",
  "# Test queries\n",
  "test_queries = [\n",
  " print(f\"{i+1}. {title} ({url})\")"
  ]
  },
  {
  "cell_type": "code",
+ "execution_count": 13,
+ "id": "4cdd6899",
  "metadata": {},
  "outputs": [],
  "source": [
+ "vector_store.client.close()"
  ]
  }
  ],
  "name": "python3"
  },
  "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
  "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
  "version": "3.13.2"
  }
  },
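For readers skimming the notebook diff: the test-query cell added above loops over queries and prints the title and URL stored in each retrieved document's metadata. A minimal standalone sketch of that loop, with a stand-in retriever (hypothetical, since the real Qdrant-backed store isn't available outside the repo):

```python
# Stand-in for vector_store.as_retriever(search_kwargs={"k": 3});
# returns canned dicts shaped like the notebook's document metadata.
class StubRetriever:
    def invoke(self, query):
        return [
            {"post_title": "Introduction To Ragas",
             "url": "https://thedataguy.pro/blog/introduction-to-ragas/"},
        ]

retriever = StubRetriever()
test_queries = ["What is RAGAS?"]

for query in test_queries:
    docs = retriever.invoke(query)
    print(f"\nQuery: {query}")
    print(f"Retrieved {len(docs)} documents:")
    for i, doc in enumerate(docs):
        print(f"{i+1}. {doc['post_title']} ({doc['url']})")
```

With the real store, `retriever.invoke(query)` returns `Document` objects, so the loop body reads `doc.metadata["post_title"]` instead of a plain dict lookup.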
update_blog_data.py ADDED
@@ -0,0 +1,101 @@
+ """
+ Blog Data Update Script
+
+ This script updates the blog data vector store when new posts are added.
+ It can be scheduled to run periodically or manually executed.
+
+ Usage:
+     python update_blog_data.py [--force-recreate] [--data-dir DATA_DIR]
+
+ Options:
+     --force-recreate  Force recreation of the vector store even if it exists
+     --data-dir DIR    Directory containing the blog posts (default: data/)
+ """
+
+ import os
+ import sys
+ import argparse
+ from datetime import datetime
+ import json
+ from pathlib import Path
+
+ # Import the blog utilities module
+ import blog_utils
+
+ def parse_args():
+     """Parse command-line arguments"""
+     parser = argparse.ArgumentParser(description="Update blog data vector store")
+     parser.add_argument("--force-recreate", action="store_true",
+                         help="Force recreation of the vector store")
+     parser.add_argument("--data-dir", default=blog_utils.DATA_DIR,
+                         help=f"Directory containing blog posts (default: {blog_utils.DATA_DIR})")
+     return parser.parse_args()
+
+ def save_stats(stats, output_dir="./stats"):
+     """Save stats to a JSON file for tracking changes over time"""
+     # Create directory if it doesn't exist
+     Path(output_dir).mkdir(exist_ok=True, parents=True)
+
+     # Create filename with timestamp
+     timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+     filename = f"{output_dir}/blog_stats_{timestamp}.json"
+
+     # Save only the basic stats, not the full document list
+     basic_stats = {
+         "timestamp": timestamp,
+         "total_documents": stats["total_documents"],
+         "total_characters": stats["total_characters"],
+         "min_length": stats["min_length"],
+         "max_length": stats["max_length"],
+         "avg_length": stats["avg_length"],
+     }
+
+     with open(filename, "w") as f:
+         json.dump(basic_stats, f, indent=2)
+
+     print(f"Saved stats to {filename}")
+     return filename
+
+ def main():
+     """Main function to update blog data"""
+     args = parse_args()
+
+     print("=== Blog Data Update ===")
+     print(f"Data directory: {args.data_dir}")
+     print(f"Force recreate: {args.force_recreate}")
+     print("========================")
+
+     # Process blog posts without creating embeddings
+     try:
+         # Load and process documents
+         documents = blog_utils.load_blog_posts(args.data_dir)
+         documents = blog_utils.update_document_metadata(documents)
+
+         # Get stats
+         stats = blog_utils.get_document_stats(documents)
+         blog_utils.display_document_stats(stats)
+
+         # Save stats for tracking
+         stats_file = save_stats(stats)
+
+         # Create a reference file for the vector store
+         if args.force_recreate:
+             print("\nAttempting to save vector store reference file...")
+             blog_utils.create_vector_store(documents, force_recreate=args.force_recreate)
+
+         print("\n=== Update Summary ===")
+         print(f"Processed {stats['total_documents']} documents")
+         print(f"Stats saved to: {stats_file}")
+         print("Note: Vector store creation is currently disabled due to pickling issues.")
+         print("      See VECTOR_STORE_ISSUES.md for more information and possible solutions.")
+         print("=====================")
+
+         return 0
+     except Exception as e:
+         print(f"Error: {e}")
+         import traceback
+         traceback.print_exc()
+         return 1
+
+ if __name__ == "__main__":
+     sys.exit(main())
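The script's `parse_args` follows the standard argparse pattern: a boolean flag (`action="store_true"`) plus an option with a default. A self-contained sketch of the same pattern, testable without `blog_utils` (so the default is hard-coded here rather than taken from `blog_utils.DATA_DIR`):

```python
import argparse

def parse_args(argv=None):
    """Parse command-line arguments (argv=None falls back to sys.argv)."""
    parser = argparse.ArgumentParser(description="Update blog data vector store")
    parser.add_argument("--force-recreate", action="store_true",
                        help="Force recreation of the vector store")
    parser.add_argument("--data-dir", default="data/",
                        help="Directory containing blog posts")
    return parser.parse_args(argv)

args = parse_args(["--force-recreate", "--data-dir", "posts/"])
print(args.force_recreate, args.data_dir)  # True posts/
```

Accepting an `argv` list (instead of always reading `sys.argv`) is a small design choice that makes the parser unit-testable, as above.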
utils_data_loading.ipynb DELETED
@@ -1,454 +0,0 @@
- {
- "cells": [
- {
- "cell_type": "markdown",
- "id": "b31c2849",
- "metadata": {},
- "source": [
- "# Utility Functions for Blog Post Loading and Processing\n",
- "\n",
- "This notebook contains utility functions for loading blog posts from the data directory, processing their metadata, and creating vector embeddings for use in the RAG system."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "848b0a86",
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "import json\n",
- "from pathlib import Path\n",
- "from typing import List, Dict, Any, Optional\n",
- "\n",
- "from langchain_community.document_loaders import DirectoryLoader\n",
- "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
- "from langchain.schema.document import Document\n",
- "from langchain_huggingface import HuggingFaceEmbeddings\n",
- "from langchain_community.vectorstores import Qdrant\n",
- "\n",
- "from IPython.display import Markdown, display\n",
- "from dotenv import load_dotenv\n",
- "\n",
- "# Load environment variables from .env file\n",
- "load_dotenv()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "39e32435",
- "metadata": {},
- "source": [
- "## Configuration\n",
- "\n",
- "Load configuration from environment variables or use defaults."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "5a6a5d6d",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Configuration with defaults\n",
- "DATA_DIR = os.environ.get(\"DATA_DIR\", \"data/\")\n",
- "VECTOR_STORAGE_PATH = os.environ.get(\"VECTOR_STORAGE_PATH\", \"./db/vectorstore_v3\")\n",
- "EMBEDDING_MODEL = os.environ.get(\"EMBEDDING_MODEL\", \"Snowflake/snowflake-arctic-embed-l\")\n",
- "QDRANT_COLLECTION = os.environ.get(\"QDRANT_COLLECTION\", \"thedataguy_documents\")\n",
- "BLOG_BASE_URL = os.environ.get(\"BLOG_BASE_URL\", \"https://thedataguy.pro/blog/\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "01454147",
- "metadata": {},
- "source": [
- "## Utility Functions\n",
- "\n",
- "These functions handle the loading, processing, and storing of blog posts."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "25792cd5",
- "metadata": {},
- "outputs": [],
- "source": [
- "def load_blog_posts(data_dir: str = DATA_DIR, \n",
- " glob_pattern: str = \"*.md\", \n",
- " recursive: bool = True, \n",
- " show_progress: bool = True) -> List[Document]:\n",
- " \"\"\"\n",
- " Load blog posts from the specified directory.\n",
- " \n",
- " Args:\n",
- " data_dir: Directory containing the blog posts\n",
- " glob_pattern: Pattern to match files\n",
- " recursive: Whether to search subdirectories\n",
- " show_progress: Whether to show a progress bar\n",
- " \n",
- " Returns:\n",
- " List of Document objects containing the blog posts\n",
- " \"\"\"\n",
- " text_loader = DirectoryLoader(\n",
- " data_dir, \n",
- " glob=glob_pattern, \n",
- " show_progress=show_progress,\n",
- " recursive=recursive\n",
- " )\n",
- " \n",
- " documents = text_loader.load()\n",
- " print(f\"Loaded {len(documents)} documents from {data_dir}\")\n",
- " return documents"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "e7ddba72",
- "metadata": {},
- "outputs": [],
- "source": [
- "def update_document_metadata(documents: List[Document], \n",
- " data_dir_prefix: str = DATA_DIR,\n",
- " blog_base_url: str = BLOG_BASE_URL,\n",
- " remove_suffix: str = \"index.md\") -> List[Document]:\n",
- " \"\"\"\n",
- " Update the metadata of documents to include URL and other information.\n",
- " \n",
- " Args:\n",
- " documents: List of Document objects to update\n",
- " data_dir_prefix: Prefix to replace in source paths\n",
- " blog_base_url: Base URL for the blog posts\n",
- " remove_suffix: Suffix to remove from paths (like index.md)\n",
- " \n",
- " Returns:\n",
- " Updated list of Document objects\n",
- " \"\"\"\n",
- " for doc in documents:\n",
- " # Create URL from source path\n",
- " doc.metadata[\"url\"] = doc.metadata[\"source\"].replace(data_dir_prefix, blog_base_url)\n",
- " \n",
- " # Remove index.md or other suffix if present\n",
- " if remove_suffix and doc.metadata[\"url\"].endswith(remove_suffix):\n",
- " doc.metadata[\"url\"] = doc.metadata[\"url\"][:-len(remove_suffix)]\n",
- " \n",
- " # Extract post title from the directory structure\n",
- " path_parts = Path(doc.metadata[\"source\"]).parts\n",
- " if len(path_parts) > 1:\n",
- " # Use the directory name as post_slug\n",
- " doc.metadata[\"post_slug\"] = path_parts[-2]\n",
- " doc.metadata[\"post_title\"] = path_parts[-2].replace(\"-\", \" \").title()\n",
- " \n",
- " # Add document length as metadata\n",
- " doc.metadata[\"content_length\"] = len(doc.page_content)\n",
- " \n",
- " return documents"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "e0dfe498",
- "metadata": {},
- "outputs": [],
- "source": [
- "def get_document_stats(documents: List[Document]) -> Dict[str, Any]:\n",
- " \"\"\"\n",
- " Get statistics about the documents.\n",
- " \n",
- " Args:\n",
- " documents: List of Document objects\n",
- " \n",
- " Returns:\n",
- " Dictionary with statistics\n",
- " \"\"\"\n",
- " stats = {\n",
- " \"total_documents\": len(documents),\n",
- " \"total_characters\": sum(len(doc.page_content) for doc in documents),\n",
- " \"min_length\": min(len(doc.page_content) for doc in documents),\n",
- " \"max_length\": max(len(doc.page_content) for doc in documents),\n",
- " \"avg_length\": sum(len(doc.page_content) for doc in documents) / len(documents) if documents else 0,\n",
- " }\n",
- " \n",
- " # Create a list of document info for analysis\n",
- " doc_info = []\n",
- " for doc in documents:\n",
- " doc_info.append({\n",
- " \"url\": doc.metadata.get(\"url\", \"\"),\n",
- " \"source\": doc.metadata.get(\"source\", \"\"),\n",
- " \"title\": doc.metadata.get(\"post_title\", \"\"),\n",
- " \"text_length\": doc.metadata.get(\"content_length\", 0),\n",
- " })\n",
- " \n",
- " stats[\"documents\"] = doc_info\n",
- " return stats"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "0ae139c0",
- "metadata": {},
- "outputs": [],
- "source": [
- "def display_document_stats(stats: Dict[str, Any]):\n",
- " \"\"\"\n",
- " Display document statistics in a readable format.\n",
- " \n",
- " Args:\n",
- " stats: Dictionary with statistics from get_document_stats\n",
- " \"\"\"\n",
- " print(f\"Total Documents: {stats['total_documents']}\")\n",
- " print(f\"Total Characters: {stats['total_characters']}\")\n",
- " print(f\"Min Length: {stats['min_length']} characters\")\n",
- " print(f\"Max Length: {stats['max_length']} characters\")\n",
- " print(f\"Average Length: {stats['avg_length']:.2f} characters\")\n",
- " \n",
- " # Display documents as a table\n",
- " import pandas as pd\n",
- " if stats[\"documents\"]:\n",
- " df = pd.DataFrame(stats[\"documents\"])\n",
- " display(df)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "2dcf66b4",
- "metadata": {},
- "outputs": [],
- "source": [
- "def split_documents(documents: List[Document], \n",
- " chunk_size: int = 1000, \n",
- " chunk_overlap: int = 200) -> List[Document]:\n",
- " \"\"\"\n",
- " Split documents into chunks for better embedding and retrieval.\n",
- " \n",
- " Args:\n",
- " documents: List of Document objects to split\n",
- " chunk_size: Size of each chunk in characters\n",
- " chunk_overlap: Overlap between chunks in characters\n",
- " \n",
- " Returns:\n",
- " List of split Document objects\n",
- " \"\"\"\n",
- " text_splitter = RecursiveCharacterTextSplitter(\n",
- " chunk_size=chunk_size,\n",
- " chunk_overlap=chunk_overlap,\n",
- " length_function=len,\n",
- " )\n",
- " \n",
- " split_docs = text_splitter.split_documents(documents)\n",
- " print(f\"Split {len(documents)} documents into {len(split_docs)} chunks\")\n",
- " return split_docs"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "527ad848",
- "metadata": {},
- "outputs": [],
- "source": [
- "def create_vector_store(documents: List[Document], \n",
- " storage_path: str = VECTOR_STORAGE_PATH,\n",
- " collection_name: str = QDRANT_COLLECTION,\n",
- " embedding_model: str = EMBEDDING_MODEL,\n",
- " force_recreate: bool = False) -> Qdrant:\n",
- " \"\"\"\n",
- " Create a vector store from documents.\n",
- " \n",
- " Args:\n",
- " documents: List of Document objects to store\n",
- " storage_path: Path to the vector store\n",
- " collection_name: Name of the collection\n",
- " embedding_model: Name of the embedding model\n",
- " force_recreate: Whether to force recreation of the vector store\n",
- " \n",
- " Returns:\n",
- " Qdrant vector store\n",
- " \"\"\"\n",
- " # Initialize the embedding model\n",
- " embeddings = HuggingFaceEmbeddings(model_name=embedding_model)\n",
- " \n",
- " # Create the directory if it doesn't exist\n",
- " storage_dir = Path(storage_path).parent\n",
- " os.makedirs(storage_dir, exist_ok=True)\n",
- " \n",
- " # Check if vector store exists\n",
- " vector_store_exists = Path(storage_path).exists() and not force_recreate\n",
- " \n",
- " if vector_store_exists:\n",
- " print(f\"Loading existing vector store from {storage_path}\")\n",
- " try:\n",
- " vector_store = Qdrant(\n",
- " path=storage_path,\n",
- " embedding_function=embeddings,\n",
- " collection_name=collection_name\n",
- " )\n",
- " return vector_store\n",
- " except Exception as e:\n",
- " print(f\"Error loading existing vector store: {e}\")\n",
- " print(\"Creating new vector store...\")\n",
- " force_recreate = True\n",
- " \n",
- " # Create new vector store\n",
- " print(f\"Creating new vector store at {storage_path}\")\n",
- " vector_store = Qdrant.from_documents(\n",
- " documents=documents,\n",
- " embedding=embeddings,\n",
- " path=storage_path,\n",
- " collection_name=collection_name,\n",
- " )\n",
- " \n",
- " return vector_store"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c78f99fc",
- "metadata": {},
- "source": [
- "## Example Usage\n",
- "\n",
- "Here's how to use these utility functions for processing blog posts."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "132d32c6",
- "metadata": {},
- "outputs": [],
- "source": [
- "def process_blog_posts(data_dir: str = DATA_DIR,\n",
- " create_embeddings: bool = True,\n",
- " force_recreate_embeddings: bool = False):\n",
- " \"\"\"\n",
- " Complete pipeline to process blog posts and optionally create vector embeddings.\n",
- " \n",
- " Args:\n",
- " data_dir: Directory containing the blog posts\n",
- " create_embeddings: Whether to create vector embeddings\n",
- " force_recreate_embeddings: Whether to force recreation of embeddings\n",
- " \n",
- " Returns:\n",
- " Dictionary with data and vector store (if created)\n",
- " \"\"\"\n",
- " # Load documents\n",
- " documents = load_blog_posts(data_dir)\n",
- " \n",
- " # Update metadata\n",
- " documents = update_document_metadata(documents)\n",
- " \n",
- " # Get and display stats\n",
- " stats = get_document_stats(documents)\n",
- " display_document_stats(stats)\n",
- " \n",
- " result = {\n",
- " \"documents\": documents,\n",
- " \"stats\": stats,\n",
- " \"vector_store\": None\n",
- " }\n",
- " \n",
- " # Create vector store if requested\n",
- " if create_embeddings:\n",
- " vector_store = create_vector_store(\n",
- " documents, \n",
- " force_recreate=force_recreate_embeddings\n",
- " )\n",
- " result[\"vector_store\"] = vector_store\n",
- " \n",
- " return result"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "266d4fb3",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Example usage\n",
- "if __name__ == \"__main__\":\n",
- " # Process blog posts without creating embeddings\n",
- " result = process_blog_posts(create_embeddings=False)\n",
- " \n",
- " # Example: Access the documents\n",
- " print(f\"\\nDocument example: {result['documents'][0].metadata}\")\n",
- " \n",
- " # Create embeddings if needed\n",
- " # result = process_blog_posts(create_embeddings=True)\n",
- " \n",
- " # Retriever example\n",
- " # retriever = result[\"vector_store\"].as_retriever()\n",
- " # query = \"What is RAGAS?\"\n",
- " # docs = retriever.invoke(query, k=2)\n",
- " # print(f\"\\nRetrieved {len(docs)} documents for query: {query}\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "22132649",
- "metadata": {},
- "source": [
- "## Function for Loading Existing Vector Store\n",
- "\n",
- "This function can be used to load an existing vector store without reprocessing all blog posts."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "c24e0c02",
- "metadata": {},
- "outputs": [],
- "source": [
- "def load_vector_store(storage_path: str = VECTOR_STORAGE_PATH,\n",
- " collection_name: str = QDRANT_COLLECTION,\n",
- " embedding_model: str = EMBEDDING_MODEL) -> Optional[Qdrant]:\n",
- " \"\"\"\n",
- " Load an existing vector store.\n",
- " \n",
- " Args:\n",
- " storage_path: Path to the vector store\n",
- " collection_name: Name of the collection\n",
- " embedding_model: Name of the embedding model\n",
- " \n",
- " Returns:\n",
- " Qdrant vector store or None if it doesn't exist\n",
- " \"\"\"\n",
- " # Initialize the embedding model\n",
- " embeddings = HuggingFaceEmbeddings(model_name=embedding_model)\n",
- " \n",
- " # Check if vector store exists\n",
- " if not Path(storage_path).exists():\n",
- " print(f\"Vector store not found at {storage_path}\")\n",
- " return None\n",
- " \n",
- " try:\n",
- " vector_store = Qdrant(\n",
- " path=storage_path,\n",
- " embedding_function=embeddings,\n",
- " collection_name=collection_name\n",
- " )\n",
- " print(f\"Loaded vector store from {storage_path}\")\n",
- " return vector_store\n",
- " except Exception as e:\n",
- " print(f\"Error loading vector store: {e}\")\n",
- " return None"
- ]
- }
- ],
- "metadata": {
- "language_info": {
- "name": "python"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
- }
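The deleted notebook's `update_document_metadata` (now migrated to `blog_utils.py`) derives each post's public URL, slug, and title from its on-disk source path. The core string transformation can be sketched standalone (the function name `derive_metadata` is illustrative, not part of the repo):

```python
from pathlib import Path

def derive_metadata(source: str,
                    data_dir: str = "data/",
                    blog_base_url: str = "https://thedataguy.pro/blog/",
                    remove_suffix: str = "index.md") -> dict:
    """Map a source path like data/<slug>/index.md to url/slug/title metadata."""
    # Swap the on-disk prefix for the public base URL
    url = source.replace(data_dir, blog_base_url)
    # Drop the trailing index.md so the URL ends at the post directory
    if remove_suffix and url.endswith(remove_suffix):
        url = url[:-len(remove_suffix)]
    # The parent directory name doubles as slug and (title-cased) post title
    parts = Path(source).parts
    slug = parts[-2] if len(parts) > 1 else ""
    return {"url": url,
            "post_slug": slug,
            "post_title": slug.replace("-", " ").title()}

meta = derive_metadata("data/introduction-to-ragas/index.md")
print(meta["url"])         # https://thedataguy.pro/blog/introduction-to-ragas/
print(meta["post_title"])  # Introduction To Ragas
```

This matches the metadata visible in the notebook's `docs[0]` output above (`post_slug: 'introduction-to-ragas'`, `post_title: 'Introduction To Ragas'`).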