daniel-simeone committed · Commit 66c4741 · Parent(s): b2be33e

improve quality

Files changed:
- DEPLOYMENT.md (+63 -20)
- README.md (+75 -20)
- app.py (+84 -64)
- example_usage.py (+1 -1)
- ingest_documents.py (+5 -4)
- ingestion.py (+5 -5)
- requirements.txt (+1 -0)
DEPLOYMENT.md
CHANGED

````diff
@@ -6,6 +6,8 @@ This guide will walk you through deploying your RAG chatbot to Hugging Face Spac
 
 1. **Hugging Face Account**: Sign up at https://huggingface.co/join
 2. **Access Token**: Get your token from https://huggingface.co/settings/tokens
+   - Create a token with "Read" permissions
+   - This token is required for the Inference API to work
 
 ## Step-by-Step Deployment
 
@@ -34,13 +36,21 @@ This guide will walk you through deploying your RAG chatbot to Hugging Face Spac
 - `pdfs/` folder (if you want to include sample PDFs)
 - `ingest_documents.py` (optional, for manual ingestion)
 
+3. **Set Up HF_TOKEN Secret (REQUIRED)**:
+   - Go to your Space → **Settings** → **Secrets**
+   - Click **New secret**
+   - Name: `HF_TOKEN`
+   - Value: Paste your Hugging Face access token
+   - Click **Add secret**
+   - **Important**: Without this token, the chatbot will not work as it needs the Inference API
+
+4. **Important Notes**:
+   - **Vector Store**: The `data/vector_store/` folder is in `.gitignore` and won't be uploaded. You have two options:
     - **Option A**: Run `ingest_documents.py` on the Space after deployment (via the Space's terminal)
     - **Option B**: Upload the vector store files manually if they're not too large
   - **PDFs**: If your PDFs are large (>50MB), consider hosting them elsewhere or using Hugging Face Datasets
 
+5. **Wait for Build**: Hugging Face will automatically:
   - Install dependencies from `requirements.txt`
   - Start your Gradio app
   - Your Space will be live at: `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME`
@@ -90,6 +100,22 @@ This guide will walk you through deploying your RAG chatbot to Hugging Face Spac
 
 ## Post-Deployment Setup
 
+### Setting Up HF_TOKEN Secret
+
+**This is REQUIRED for the chatbot to work!**
+
+1. Go to your Space on Hugging Face
+2. Click on **Settings** (gear icon)
+3. Scroll down to **Secrets** section
+4. Click **New secret**
+5. Enter:
+   - **Name**: `HF_TOKEN`
+   - **Value**: Your Hugging Face access token (from https://huggingface.co/settings/tokens)
+6. Click **Add secret**
+7. The token will be automatically available to your app via `os.environ.get("HF_TOKEN")`
+
+**Note**: The token must have "Read" permissions. The app uses it to access the Inference API for Mistral-7B-Instruct.
+
 ### Setting Up the Vector Store on Hugging Face Spaces
 
 Since the vector store isn't included in the repository, you need to create it on the Space:
@@ -101,9 +127,9 @@ Since the vector store isn't included in the repository, you need to create it o
 
 2. **Option B: Upload Vector Store Files**:
    - If your vector store files are small enough:
+     - Upload `data/vector_store/index.faiss`
+     - Upload `data/vector_store/documents.pkl`
+     - Upload `data/vector_store/embeddings.pkl`
    - The app will automatically load them on startup
 
 3. **Option C: Pre-build in a Script**:
@@ -118,25 +144,38 @@ Your Space needs these files:
 - ✅ `ingestion.py` - Document ingestion module
 - ✅ `requirements.txt` - Python dependencies
 - ✅ `README.md` - Documentation (optional but recommended)
-- ⚠️ `vector_store/` - Will be created on the Space
+- ⚠️ `data/vector_store/` - Will be created on the Space
 - ⚠️ `pdfs/` - Optional, include if you want sample PDFs
 
 ## Configuration for Hugging Face Spaces
 
+### Model Configuration
+
+The chatbot now uses **Mistral-7B-Instruct-v0.2** via Hugging Face Inference API. This means:
+- **No local model loading**: Faster startup, no need for GPU
+- **Hosted inference**: The model runs on Hugging Face's infrastructure
+- **Requires HF_TOKEN**: Must be set in Space secrets (see above)
+
+The model is configured in `app.py` and can be changed if needed:
+```python
+chatbot = RAGChatbot(model_name="mistralai/Mistral-7B-Instruct-v0.2")
+```
+
+### App Configuration
+
+The current `app.py` is configured for Spaces:
+
+1. **Port**: Uses `os.environ.get("PORT", 7860)` - Spaces automatically sets this
+2. **Server name**: Uses `0.0.0.0` (required for Spaces)
+3. **Share**: Set to `False` (Spaces provides its own sharing)
+
+The launch code is already configured correctly:
 ```python
+port = int(os.environ.get("PORT", 7860))
 app.launch(
     share=False,
     server_name="0.0.0.0",
+    server_port=port,
     theme=MinimalistTheme()
 )
 ```
@@ -170,21 +209,25 @@ pinned: false
 
 - Verify the vector store files are in the correct location
 - Check file permissions
+- Ensure the path in `app.py` is correct (should be `data/vector_store`)
 
+### Inference API Issues
 
+- **"HF_TOKEN not set" error**: Make sure you've added the `HF_TOKEN` secret in Space settings
+- **API rate limits**: Free tier has rate limits; upgrade if you need more requests
+- **Model access errors**: Verify your token has "Read" permissions
+- **Connection errors**: Check that the Inference API is accessible from your Space
 
 ### Memory Issues
 
 - If you get out-of-memory errors, consider:
-  - Using a smaller embedding model
+  - Using a smaller embedding model (e.g., `all-MiniLM-L6-v2` instead of `all-mpnet-base-v2`)
   - Reducing chunk size in `ingestion.py`
+  - Processing fewer documents at once
   - Upgrading to a Space with more memory
 
+**Note**: Since the model runs via Inference API, memory issues are less likely than with local model loading.
+
 ## Updating Your Space
 
 After making changes locally:
````
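The launch settings the deployment guide walks through reduce to a few lines of logic. A minimal sketch, standard library only; `resolve_launch_kwargs` is a hypothetical helper name (the commit inlines these values directly in `app.py`):

```python
import os

def resolve_launch_kwargs(environ) -> dict:
    """Build the kwargs passed to Gradio's app.launch() on Spaces.

    In app.py this would be called as resolve_launch_kwargs(os.environ);
    plain dicts work too, which makes the logic easy to test.
    """
    port = int(environ.get("PORT", 7860))  # Spaces injects PORT; 7860 is Gradio's default
    return {
        "share": False,            # Spaces serves the app at its own public URL
        "server_name": "0.0.0.0",  # bind all interfaces so the Space proxy can reach the app
        "server_port": port,
    }

print(resolve_launch_kwargs({}))                # falls back to port 7860
print(resolve_launch_kwargs({"PORT": "8080"}))  # the Spaces-provided port wins
```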
README.md
CHANGED

````diff
@@ -129,28 +129,65 @@ pinned: false
 
 ## Configuration
 
+### Setting Up Hugging Face Token (Required)
+
+The chatbot uses Hugging Face Inference API to access high-quality models. You need to set up an API token:
+
+1. **Get your token:**
+   - Go to https://huggingface.co/settings/tokens
+   - Create a new token with "Read" permissions
+   - Copy the token
+
+2. **For local development:**
+   - Set environment variable: `export HF_TOKEN=your_token_here` (Linux/Mac)
+   - Or: `set HF_TOKEN=your_token_here` (Windows)
+   - Or create a `.env` file with `HF_TOKEN=your_token_here`
+
+3. **For Hugging Face Spaces:**
+   - Go to your Space → Settings → Secrets
+   - Add a new secret: Name = `HF_TOKEN`, Value = your token
+   - The app will automatically use this token
+
+### Chatbot Model
+
+The chatbot uses **Mistral-7B-Instruct-v0.2** via Hugging Face Inference API. This is an instruction-tuned model that provides high-quality, coherent answers.
+
+To change the model, edit `app.py`:
 
 ```python
+chatbot = RAGChatbot(model_name="mistralai/Mistral-7B-Instruct-v0.2")
 ```
 
+Other recommended instruction-tuned models:
+- `HuggingFaceH4/zephyr-7b-beta` - Excellent for chat
+- `meta-llama/Meta-Llama-3-8B-Instruct` - High quality (requires access)
+- `microsoft/phi-2` - Smaller, faster option
+
+### Embedding Model
+
+The default embedding model is **all-mpnet-base-v2**, which provides high-quality embeddings for better retrieval.
+
+To change the embedding model, edit both `app.py` and `ingest_documents.py`:
 
 ```python
+# In app.py
+chatbot = RAGChatbot(embedding_model="all-mpnet-base-v2")
+
+# In ingest_documents.py
+ingestion = DocumentIngestion(embedding_model="all-mpnet-base-v2")
 ```
 
+**Note:** If you change the embedding model, you must re-run `ingest_documents.py` to rebuild the vector store.
+
+### Ingestion Parameters
+
+The ingestion system uses optimized parameters:
+- **Chunk size**: 600 characters (for precise retrieval)
+- **Chunk overlap**: 150 characters (to avoid cutting sentences)
+- **Retrieval count**: 5 chunks (for comprehensive context)
+
+These parameters are set in `ingestion.py` and can be adjusted if needed.
+
 ## Project Structure
 
 ```
@@ -162,18 +199,30 @@ chatbot = RAGChatbot(embedding_model="sentence-transformers/all-mpnet-base-v2")
 ├── README.md          # This file
 ├── pdfs/              # Folder for PDF files (add your PDFs here)
 │   └── README.md
+└── data/
+    └── vector_store/  # Saved vector store (created after ingestion)
+        ├── index.faiss
+        ├── documents.pkl
+        └── embeddings.pkl
 ```
 
+## How It Works
+
+The chatbot uses **Retrieval-Augmented Generation (RAG)**:
+
+1. **Document Ingestion**: PDFs and URLs are processed into chunks and embedded using sentence transformers
+2. **Vector Search**: When you ask a question, the system searches for the most relevant document chunks
+3. **Answer Generation**: The retrieved context is sent to Mistral-7B-Instruct via Inference API, which synthesizes a coherent answer based on the context
+
+This approach combines the accuracy of document retrieval with the natural language capabilities of a large language model.
+
 ## Limitations
 
 - Vector store is stored locally (not persistent on Hugging Face Spaces by default)
 - Large documents may take time to process
 - Some URLs may be blocked or require authentication
+- Requires HF_TOKEN for Inference API access (free tier available)
+- If you change embedding model or chunk parameters, you must re-run ingestion
 
 ## Troubleshooting
 
@@ -186,10 +235,16 @@ chatbot = RAGChatbot(embedding_model="sentence-transformers/all-mpnet-base-v2")
 - Some websites block automated requests
 - Try different URLs or use PDF uploads instead
 
+### Inference API Issues
+- Verify your `HF_TOKEN` is set correctly
+- Check that the token has "Read" permissions
+- Ensure you have API access (free tier available)
+- If you get rate limit errors, you may need to upgrade your Hugging Face account
+
+### Ingestion Issues
+- If you changed embedding model or chunk parameters, re-run `ingest_documents.py`
+- Ensure you have enough disk space for the vector store
+- Large documents may take time to process
 
 ## License
````
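The 600/150 chunking described in the README can be illustrated with a plain sliding window. A simplified sketch: the project actually uses LangChain's `RecursiveCharacterTextSplitter`, which additionally prefers paragraph and sentence boundaries; `chunk_text` here only shows how size and overlap interact:

```python
def chunk_text(text: str, chunk_size: int = 600, chunk_overlap: int = 150):
    """Naive sliding-window chunker using the commit's 600/150 parameters.

    Each chunk starts (chunk_size - chunk_overlap) characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = chunk_text("x" * 1000)
print(len(chunks))                           # 2
print(chunks[0][-150:] == chunks[1][:150])   # True: 150 shared characters
```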
app.py
CHANGED

```diff
@@ -6,9 +6,8 @@ from gradio.themes.base import Base
 from gradio.themes.utils import colors, fonts, sizes
 import os
 from typing import List, Tuple
+from huggingface_hub import InferenceClient
 from ingestion import DocumentIngestion
-import torch
 
 
 # Create a clean minimalist theme
@@ -80,36 +79,38 @@ class RAGChatbot:
 
     def __init__(
         self,
+        model_name: str = "mistralai/Mistral-7B-Instruct-v0.2",
+        embedding_model: str = "all-mpnet-base-v2",
         vector_store_path: str = "data/vector_store"
     ):
         """
         Initialize the RAG chatbot.
 
         Args:
+            model_name: Hugging Face model name for the chatbot (via Inference API)
             embedding_model: Model for document embeddings
             vector_store_path: Path to saved vector store
         """
         self.model_name = model_name
-        self.device = "cuda" if torch.cuda.is_available() else "cpu"
 
+        # Initialize Inference API client
+        hf_token = os.environ.get("HF_TOKEN")
+        if not hf_token:
+            print("Warning: HF_TOKEN not set. Inference API calls may fail.")
+            print("Set HF_TOKEN environment variable or add it to Space secrets.")
+        else:
+            print("HF_TOKEN found. Inference API ready.")
+
+        print(f"Initializing Inference API client for model: {model_name}")
         try:
+            self.inference_client = InferenceClient(
+                model=model_name,
+                token=hf_token
             )
-        except Exception as e:
-            print(f"Warning: Could not load {model_name}. Using a simpler pipeline.")
-            self.model = None
-            self.tokenizer = None
-            self.chatbot_pipeline = pipeline(
-                "text-generation",
-                model="gpt2",
-                device=0 if self.device == "cuda" else -1
-            )
+            print("Inference API client initialized successfully")
+        except Exception as e:
+            print(f"Error initializing Inference API client: {e}")
+            self.inference_client = None
 
         # Initialize document ingestion
         self.ingestion = DocumentIngestion(embedding_model=embedding_model)
@@ -126,9 +127,9 @@ class RAGChatbot:
 
         self.chat_history = []
 
+    def generate_response(self, query: str, use_rag: bool = True, num_results: int = 5) -> str:
         """
-        Generate a response to the user query.
+        Generate a response to the user query using RAG and Inference API.
 
         Args:
             query: User's question
@@ -138,61 +139,80 @@
         Returns:
             Generated response
         """
+        if self.inference_client is None:
+            return "Error: Inference API client not initialized. Please check HF_TOKEN configuration."
+
+        # If RAG is enabled and we have a vector store, retrieve context and generate answer
         if use_rag and self.ingestion.index is not None:
             try:
                 results = self.ingestion.search(query, k=num_results)
                 if results:
-                    response_parts.append(f"Based on the documents, here's what I found regarding your question: '{query}'\n\n")
+                    # Build context from retrieved chunks
+                    context_parts = []
                     for i, result in enumerate(results, 1):
-                        text = result['text']
-                        # Clean up the text
-                        text = text.strip()
+                        text = result['text'].strip()
                         if text:
+                            context_parts.append(f"[Context {i}]\n{text}")
+
+                    context = "\n\n".join(context_parts)
+
+                    # Build instruction-tuned prompt
+                    prompt = f"""You are a helpful assistant. Answer the user's question based ONLY on the provided context. If the context doesn't contain enough information to answer the question, say so clearly.
+
+Context:
+{context}
+
+Question: {query}
+
+Answer:"""
+
+                    # Generate response using Inference API
+                    try:
+                        response = self.inference_client.text_generation(
+                            prompt,
+                            max_new_tokens=512,
+                            temperature=0.7,
+                            top_p=0.9,
+                            return_full_text=False
+                        )
+                        return response.strip()
+                    except Exception as api_error:
+                        print(f"Error calling Inference API: {api_error}")
+                        # Fallback: return formatted chunks with note
+                        response_parts = []
+                        response_parts.append("I retrieved relevant information, but couldn't generate a synthesized answer. Here are the relevant chunks:\n\n")
+                        for i, result in enumerate(results, 1):
+                            source = result['metadata']['source']
+                            text = result['text'].strip()
+                            if text:
+                                response_parts.append(f"**Relevant information {i}** (from {source}):\n{text}\n")
+                        return "\n".join(response_parts)
+                else:
+                    # No results found
+                    return "I couldn't find any relevant information in the documents to answer your question. Please try rephrasing or check if the documents contain information about this topic."
            except Exception as e:
                 print(f"Error in RAG retrieval: {e}")
                 return f"I encountered an error while searching the documents: {str(e)}"
 
-            with torch.no_grad():
-                outputs = self.model.generate(
-                    inputs,
-                    max_new_tokens=100,
-                    num_return_sequences=1,
-                    temperature=0.7,
-                    do_sample=True,
-                    pad_token_id=self.tokenizer.eos_token_id,
-                    eos_token_id=self.tokenizer.eos_token_id,
-                )
-
-            # Decode only new tokens
-            input_length = inputs.shape[1]
-            generated_tokens = outputs[0][input_length:]
-            response = self.tokenizer.decode(generated_tokens, skip_special_tokens=True)
-
-            # Clean up
-            response = response.replace("<|endoftext|>", "").strip()
-
-            if not response or len(response.strip()) < 3:
-                return "I understand your question, but I don't have relevant information in my knowledge base. Please enable RAG to search the documents."
+        # If no RAG or no vector store, generate response without context
+        try:
+            prompt = f"""You are a helpful assistant. Answer the following question concisely.
+
+Question: {query}
+
+Answer:"""
+
+            response = self.inference_client.text_generation(
+                prompt,
+                max_new_tokens=256,
+                temperature=0.7,
+                top_p=0.9,
+                return_full_text=False
+            )
+            return response.strip()
+        except Exception as e:
+            print(f"Error generating response: {e}")
+            return f"I encountered an error while generating a response: {str(e)}. Please check your HF_TOKEN configuration."
 
     def chat(self, message: str, history, use_rag: bool):
         """
```
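The context-assembly and prompt format that `generate_response` inlines can be isolated into a pure function for testing. `build_rag_prompt` is a hypothetical helper name (the commit keeps this logic inline); the prompt wording mirrors the diff:

```python
def build_rag_prompt(query: str, results: list) -> str:
    """Assemble the instruction prompt from retrieved chunks.

    `results` is a list of dicts with a 'text' key, matching what
    DocumentIngestion.search() returns; empty chunks are skipped.
    """
    context_parts = []
    for i, result in enumerate(results, 1):
        text = result["text"].strip()
        if text:  # skip whitespace-only chunks
            context_parts.append(f"[Context {i}]\n{text}")
    context = "\n\n".join(context_parts)
    return (
        "You are a helpful assistant. Answer the user's question based ONLY on "
        "the provided context. If the context doesn't contain enough information "
        "to answer the question, say so clearly.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )

prompt = build_rag_prompt(
    "What is required for organic labelling?",
    [{"text": "Organic claims must be certified."}, {"text": "   "}],
)
print(prompt)
```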
example_usage.py
CHANGED

```diff
@@ -32,7 +32,7 @@ def main():
     ingestion.build_vector_store()
 
     # Save vector store
+    ingestion.save("data/vector_store")
 
     # Example search
     query = "What is artificial intelligence?"
```
ingest_documents.py
CHANGED

```diff
@@ -11,7 +11,8 @@ from ingestion import DocumentIngestion
 PDF_FOLDER = "data/pdfs"  # Folder containing PDF files
 URLS = [
     # Add your URLs here, one per line
+    "https://inspection.canada.ca/en/food-labels/organic-products/operating-manual"
+
 ]
 
 
@@ -23,7 +24,7 @@ def main():
 
     # Initialize ingestion system
     print("\nInitializing document ingestion system...")
+    ingestion = DocumentIngestion(embedding_model="all-mpnet-base-v2")
 
     # Collect PDF files
     pdf_paths = []
@@ -67,13 +68,13 @@ def main():
 
     # Save vector store
     print("\nSaving vector store...")
+    ingestion.save("data/vector_store")
 
     print("\n" + "=" * 60)
     print("[SUCCESS] Ingestion complete!")
     print("=" * 60)
     print(f"\nTotal document chunks: {len(documents)}")
+    print(f"Vector store saved to: data/vector_store")
     print("\nYou can now run 'py app.py' to start the chatbot.")
 
 except Exception as e:
```
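The script's PDF-collection step ("Collect PDF files" / `pdf_paths`) can be sketched as a small function. `collect_pdf_paths` is a hypothetical name and the exact matching logic is an assumption; the diff only shows that paths are gathered from `PDF_FOLDER`:

```python
from pathlib import Path

def collect_pdf_paths(folder: str = "data/pdfs") -> list:
    """Gather *.pdf files from the configured folder, sorted for
    deterministic ingestion order. Returns [] if the folder is missing,
    in which case only URLs would be ingested."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(str(p) for p in root.glob("*.pdf"))
```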
ingestion.py
CHANGED

```diff
@@ -17,7 +17,7 @@ import pickle
 class DocumentIngestion:
     """Handles ingestion of PDFs and URLs into a searchable vector store."""
 
+    def __init__(self, embedding_model: str = "all-mpnet-base-v2"):
         """
         Initialize the document ingestion system.
 
@@ -26,8 +26,8 @@ class DocumentIngestion:
         """
         self.embedding_model = SentenceTransformer(embedding_model)
         self.text_splitter = RecursiveCharacterTextSplitter(
+            chunk_size=600,
+            chunk_overlap=150,
             length_function=len,
         )
         self.documents = []
@@ -200,7 +200,7 @@ class DocumentIngestion:
 
         return results
 
+    def save(self, directory: str = "data/vector_store"):
         """Save the vector store to disk."""
         os.makedirs(directory, exist_ok=True)
@@ -216,7 +216,7 @@ class DocumentIngestion:
 
         print(f"Vector store saved to {directory}")
 
+    def load(self, directory: str = "data/vector_store"):
         """Load the vector store from disk."""
         # Load index
         self.index = faiss.read_index(os.path.join(directory, "index.faiss"))
```
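The `search` method above delegates nearest-neighbour lookup to the FAISS index. As a rough sketch of what that lookup computes, assuming an inner-product index such as `IndexFlatIP` (the index type isn't visible in this diff), here is a brute-force NumPy equivalent:

```python
import numpy as np

def search_dense(query_vec, doc_vecs, k=5):
    """Brute-force version of a FAISS inner-product search: score every
    chunk embedding against the query and return the k best indices."""
    scores = doc_vecs @ query_vec      # one similarity score per chunk
    top = np.argsort(-scores)[:k]      # indices of the k highest scores
    return top.tolist(), scores[top].tolist()

docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
idx, scores = search_dense(np.array([1.0, 0.0]), docs, k=2)
print(idx)  # [0, 2]
```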
requirements.txt
CHANGED

```diff
@@ -10,3 +10,4 @@ requests>=2.31.0
 faiss-cpu>=1.7.4
 numpy>=1.24.0
 accelerate>=0.25.0
+huggingface_hub>=0.20.0
```