aankitdas commited on
Commit
939a9f4
·
0 Parent(s):

initial clean commit

Browse files
.dockerignore ADDED
@@ -0,0 +1,35 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Python cache
2
+ __pycache__/
3
+ *.pyc
4
+ *.pyo
5
+ *.pyd
6
+ *.pyd
7
+
8
+ # Virtual environments
9
+ .venv/
10
+ env/
11
+ venv/
12
+
13
+ # Local data, logs, temp
14
+ *.log
15
+ *.sqlite3
16
+ *.db
17
+ tmp/
18
+ temp/
19
+ .cache/
20
+ .huggingface/
21
+ models/
22
+ papers/
23
+
24
+ # Jupyter notebooks and outputs (if not needed in prod)
25
+ .ipynb_checkpoints/
26
+ *.ipynb
27
+ notebooks/
28
+
29
+ # Git and version control
30
+ .git/
31
+ .gitignore
32
+
33
+ # OS generated files
34
+ .DS_Store
35
+ Thumbs.db
.gitignore ADDED
@@ -0,0 +1,26 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Python-generated files
2
+ __pycache__/
3
+ *.py[oc]
4
+ build/
5
+ dist/
6
+ wheels/
7
+ *.egg-info
8
+
9
+ # Virtual environments
10
+ .venv/
11
+ .env
12
+ notebooks/.chromadb_test/
13
+ .chromadb/
14
+ __pycache__/
15
+ *.pyc
16
+ .DS_Store
17
+ .pytest_cache/
18
+ *.egg-info/
19
+ dist/
20
+ build/
21
+ .pytest_cache
22
+ .coverage
23
+ htmlcov/
24
+ node_modules/
25
+ .streamlit/
26
+ papers/
.python-version ADDED
@@ -0,0 +1 @@
 
 
1
+ 3.12
Dockerfile ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ FROM python:3.12-slim
2
+
3
+ WORKDIR /app
4
+
5
+ # Install minimal system deps
6
+ RUN apt-get update && apt-get install -y --no-install-recommends \
7
+ curl \
8
+ && rm -rf /var/lib/apt/lists/*
9
+
10
+ # Copy requirements
11
+ COPY requirements-railway.txt ./
12
+
13
+ # Install CPU-only torch first to avoid downloading CUDA deps
14
+ RUN pip install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
15
+
16
+ # Install dependencies with aggressive caching cleanup
17
+ RUN pip install --no-cache-dir -r requirements-railway.txt && \
18
+ pip cache purge && \
19
+ find /usr/local -type d -name '__pycache__' -exec rm -rf {} + 2>/dev/null || true && \
20
+ find /usr/local -type f -name '*.pyc' -delete && \
21
+ rm -rf /tmp/* /var/tmp/* /root/.cache
22
+
23
+ # Copy application code
24
+ COPY src ./src
25
+ COPY frontend ./frontend
26
+
27
+ # Set environment variables
28
+ ENV PYTHONUNBUFFERED=1
29
+ ENV PYTHONDONTWRITEBYTECODE=1
30
+ ENV HF_HOME=/tmp/huggingface
31
+ ENV TRANSFORMERS_CACHE=/tmp/huggingface
32
+ ENV TORCH_HOME=/tmp/torch
33
+ ENV EMBEDDING_BACKEND=sentence-transformers
34
+
35
+ EXPOSE 7860
36
+
37
+ CMD ["python", "-m", "uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "7860"]
README.md ADDED
@@ -0,0 +1,93 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Document Intelligence RAG System
2
+
3
+ Production-grade Retrieval-Augmented Generation (RAG) system for analyzing research papers and documents with AI.
4
+ Ask questions about your PDFs. Get answers grounded in your documents with source attribution.
5
+
6
+ ## Features
7
+
8
+ - PDF Ingestion: Extract text from PDFs using PDFProcessor
9
+ - Document Chunking: Split documents into smaller chunks for better context
10
+ - Embedding: Convert text chunks into vector embeddings using Ollama
11
+ - Vector Storage: Store embeddings in ChromaDB for efficient retrieval
12
+ - LLM Integration: Use Groq LLM for generating answers
13
+ - Source Attribution: Track document origins for citation
14
+ - FastAPI Integration: Build a REST API for easy access
15
+ - Docker Support: Containerize the system for easy deployment
16
+ - PDF Processing: Extract text from PDFs using PDFProcessor
17
+ - Document Chunking: Split documents into smaller chunks for better context
18
+ - Embedding: Convert text chunks into vector embeddings using Ollama
19
+ - Vector Storage: Store embeddings in ChromaDB for efficient retrieval
20
+ - LLM Integration: Use Groq LLM for generating answers
21
+ - Source Attribution: Track document origins for citation
22
+ - FastAPI Integration: Build a REST API for easy access
23
+ - Docker Support: Containerize the system for easy deployment
24
+
25
+ ## Quickstart
26
+
27
+ ### Prerequisites
28
+
29
+ - Python 3.12
30
+ - Ollama
31
+ - Groq API Key
32
+ - ChromaDB
33
+ - FastAPI
34
+ - Uvicorn
35
+ - PDFProcessor
36
+ - Embeddings
37
+ - LLM
38
+ - Vector Store
39
+
40
+ 1. Setup environment variables
41
+ ```bash
42
+ # Clone repository
43
+ git clone https://github.com/aankitdas/document-intelligence-rag.git
44
+ cd document-intelligence-rag
45
+
46
+ # Install Ollama (one-time setup)
47
+ # Download from https://ollama.ai
48
+ ollama pull nomic-embed-text
49
+
50
+ # Start Ollama server (in background)
51
+ ollama serve
52
+
53
+ # Create Python environment
54
+ uv venv
55
+ source .venv/bin/activate # Windows: .venv\Scripts\activate
56
+
57
+ # Install dependencies
58
+ uv sync
59
+
60
+ # Set API keys
61
+ export GROQ_API_KEY="gsk_..." # Get from https://console.groq.com
62
+ ```
63
+
64
+ 2. Prepare Documents
65
+ ```bash
66
+ # Create a folder for documents
67
+ # Create papers folder
68
+ mkdir papers
69
+
70
+ # Add your PDFs to papers/
71
+ # Example: papers/research_paper.pdf
72
+ ```
73
+ 3. Run API
74
+ ```bash
75
+ # Run API
76
+ uvicorn src.main:app --reload  # app module path matches the Dockerfile CMD
77
+ ```
78
+ 4. Query API
79
+ ```bash
80
+ # Query API
81
+ curl http://localhost:8000/query -X POST -H "Content-Type: application/json" -d '{"query": "What is the main contribution of this paper?", "top_k": 3}'
82
+ ```
83
+
84
+ ## Tech Stack
85
+
86
+ | Component | Technology | Why |
87
+ |------------------|-------------------------------|---------------------------------------------------------------------|
88
+ | Embeddings | Ollama (`nomic-embed-text`) | Local, free, 768-dimensional embeddings |
89
+ | Vector Database | Chroma | Persistent storage, fast similarity search, completely free |
90
+ | LLM | Groq (Llama 3.1) | Free API tier, very fast inference |
91
+ | Backend | FastAPI | Production-grade, async, automatic API docs |
92
+ | Frontend | HTML / CSS / JavaScript | Simple setup, no build tooling required |
93
+ | Package Manager | UV | Fast dependency resolution, deterministic environments |
frontend/index.html ADDED
@@ -0,0 +1,592 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+
4
+ <head>
5
+ <meta charset="UTF-8">
6
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
7
+ <title>Document Intelligence RAG</title>
8
+ <style>
9
+ * {
10
+ margin: 0;
11
+ padding: 0;
12
+ box-sizing: border-box;
13
+ }
14
+
15
+ body {
16
+ font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
17
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
18
+ min-height: 100vh;
19
+ padding: 20px;
20
+ }
21
+
22
+ .container {
23
+ max-width: 1000px;
24
+ margin: 0 auto;
25
+ }
26
+
27
+ header {
28
+ text-align: center;
29
+ color: white;
30
+ margin-bottom: 40px;
31
+ }
32
+
33
+ header h1 {
34
+ font-size: 2.5em;
35
+ margin-bottom: 10px;
36
+ text-shadow: 0 2px 10px rgba(0, 0, 0, 0.2);
37
+ }
38
+
39
+ header p {
40
+ font-size: 1.1em;
41
+ opacity: 0.9;
42
+ }
43
+
44
+ .main-grid {
45
+ display: grid;
46
+ grid-template-columns: 1fr 1fr;
47
+ gap: 20px;
48
+ margin-bottom: 20px;
49
+ }
50
+
51
+ .card {
52
+ background: white;
53
+ border-radius: 12px;
54
+ padding: 25px;
55
+ box-shadow: 0 10px 30px rgba(0, 0, 0, 0.2);
56
+ }
57
+
58
+ .card h2 {
59
+ color: #333;
60
+ margin-bottom: 15px;
61
+ font-size: 1.3em;
62
+ }
63
+
64
+ .upload-area {
65
+ border: 2px dashed #667eea;
66
+ border-radius: 8px;
67
+ padding: 30px;
68
+ text-align: center;
69
+ cursor: pointer;
70
+ transition: all 0.3s;
71
+ }
72
+
73
+ .upload-area:hover {
74
+ border-color: #764ba2;
75
+ background: #f8f9ff;
76
+ }
77
+
78
+ .upload-area.dragover {
79
+ border-color: #764ba2;
80
+ background: #f0f2ff;
81
+ }
82
+
83
+ .upload-area input {
84
+ display: none;
85
+ }
86
+
87
+ .upload-area p {
88
+ color: #666;
89
+ margin-bottom: 10px;
90
+ }
91
+
92
+ .btn {
93
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
94
+ color: white;
95
+ border: none;
96
+ padding: 12px 24px;
97
+ border-radius: 8px;
98
+ cursor: pointer;
99
+ font-size: 1em;
100
+ font-weight: 600;
101
+ transition: transform 0.2s, box-shadow 0.2s;
102
+ }
103
+
104
+ .btn:hover {
105
+ transform: translateY(-2px);
106
+ box-shadow: 0 5px 15px rgba(102, 126, 234, 0.4);
107
+ }
108
+
109
+ .btn:active {
110
+ transform: translateY(0);
111
+ }
112
+
113
+ .btn-secondary {
114
+ background: #f0f0f0;
115
+ color: #333;
116
+ }
117
+
118
+ .btn-secondary:hover {
119
+ background: #e0e0e0;
120
+ box-shadow: 0 5px 15px rgba(0, 0, 0, 0.1);
121
+ }
122
+
123
+ .query-input {
124
+ display: flex;
125
+ gap: 10px;
126
+ margin-bottom: 20px;
127
+ }
128
+
129
+ .query-input input {
130
+ flex: 1;
131
+ padding: 12px;
132
+ border: 2px solid #e0e0e0;
133
+ border-radius: 8px;
134
+ font-size: 1em;
135
+ transition: border-color 0.3s;
136
+ }
137
+
138
+ .query-input input:focus {
139
+ outline: none;
140
+ border-color: #667eea;
141
+ }
142
+
143
+ .status {
144
+ padding: 15px;
145
+ border-radius: 8px;
146
+ margin-bottom: 15px;
147
+ font-size: 0.95em;
148
+ }
149
+
150
+ .status.success {
151
+ background: #d4edda;
152
+ color: #155724;
153
+ border-left: 4px solid #28a745;
154
+ }
155
+
156
+ .status.error {
157
+ background: #f8d7da;
158
+ color: #721c24;
159
+ border-left: 4px solid #f5c6cb;
160
+ }
161
+
162
+ .status.loading {
163
+ background: #e7f3ff;
164
+ color: #004085;
165
+ border-left: 4px solid #0c5ff4;
166
+ }
167
+
168
+ .answer-box {
169
+ background: #f8f9fa;
170
+ border-left: 4px solid #667eea;
171
+ padding: 15px;
172
+ border-radius: 8px;
173
+ margin-bottom: 20px;
174
+ }
175
+
176
+ .answer-box h3 {
177
+ color: #333;
178
+ margin-bottom: 10px;
179
+ }
180
+
181
+ .answer-box p {
182
+ color: #555;
183
+ line-height: 1.6;
184
+ margin-bottom: 15px;
185
+ }
186
+
187
+ .sources {
188
+ background: white;
189
+ border-radius: 8px;
190
+ padding: 15px;
191
+ margin-bottom: 15px;
192
+ }
193
+
194
+ .sources h4 {
195
+ color: #333;
196
+ margin-bottom: 12px;
197
+ font-size: 0.95em;
198
+ }
199
+
200
+ .source-item {
201
+ padding: 10px;
202
+ background: #f8f9fa;
203
+ border-radius: 6px;
204
+ margin-bottom: 8px;
205
+ border-left: 3px solid #667eea;
206
+ font-size: 0.9em;
207
+ }
208
+
209
+ .source-item .relevance {
210
+ color: #667eea;
211
+ font-weight: 600;
212
+ margin-bottom: 5px;
213
+ }
214
+
215
+ .source-item .text {
216
+ color: #555;
217
+ font-style: italic;
218
+ }
219
+
220
+ .stats {
221
+ display: grid;
222
+ grid-template-columns: repeat(2, 1fr);
223
+ gap: 10px;
224
+ margin-bottom: 20px;
225
+ }
226
+
227
+ .stat-box {
228
+ background: #f8f9fa;
229
+ padding: 12px;
230
+ border-radius: 6px;
231
+ text-align: center;
232
+ }
233
+
234
+ .stat-box .number {
235
+ font-size: 1.5em;
236
+ font-weight: bold;
237
+ color: #667eea;
238
+ }
239
+
240
+ .stat-box .label {
241
+ font-size: 0.85em;
242
+ color: #666;
243
+ margin-top: 5px;
244
+ }
245
+
246
+ .status-grid {
247
+ display: grid;
248
+ grid-template-columns: repeat(4, 1fr);
249
+ gap: 10px;
250
+ }
251
+
252
+ .loading-spinner {
253
+ display: inline-block;
254
+ width: 20px;
255
+ height: 20px;
256
+ border: 3px solid #f3f3f3;
257
+ border-top: 3px solid #667eea;
258
+ border-radius: 50%;
259
+ animation: spin 1s linear infinite;
260
+ margin-right: 10px;
261
+ vertical-align: middle;
262
+ }
263
+
264
+ @keyframes spin {
265
+ 0% {
266
+ transform: rotate(0deg);
267
+ }
268
+
269
+ 100% {
270
+ transform: rotate(360deg);
271
+ }
272
+ }
273
+
274
+ .full-width {
275
+ grid-column: 1 / -1;
276
+ }
277
+
278
+ @media (max-width: 768px) {
279
+ .main-grid {
280
+ grid-template-columns: 1fr;
281
+ }
282
+
283
+ header h1 {
284
+ font-size: 1.8em;
285
+ }
286
+
287
+ .stats {
288
+ grid-template-columns: 1fr;
289
+ }
290
+
291
+ .status-grid {
292
+ grid-template-columns: repeat(2, 1fr);
293
+ }
294
+ }
295
+
296
+ .hidden {
297
+ display: none;
298
+ }
299
+ </style>
300
+ </head>
301
+
302
+ <body>
303
+ <div class="container">
304
+ <header>
305
+ <h1>📚 Document Intelligence RAG</h1>
306
+ <p>Ask questions about your research papers</p>
307
+ </header>
308
+
309
+ <div class="main-grid">
310
+ <!-- Upload Section -->
311
+ <div class="card">
312
+ <h2>📤 Upload Documents</h2>
313
+
314
+ <div class="upload-area" id="uploadArea">
315
+ <p>📁 Drag & drop PDFs here or click to browse</p>
316
+ <input type="file" id="fileInput" multiple accept=".pdf">
317
+ <button class="btn" onclick="document.getElementById('fileInput').click()">
318
+ Choose Files
319
+ </button>
320
+ </div>
321
+
322
+ <div id="uploadStatus" class="status hidden"></div>
323
+
324
+ <div id="stats" class="stats">
325
+ <div class="stat-box">
326
+ <div class="number" id="totalChunks">0</div>
327
+ <div class="label">Total Chunks</div>
328
+ </div>
329
+ <div class="stat-box">
330
+ <div class="number" id="docCount">0</div>
331
+ <div class="label">Documents</div>
332
+ </div>
333
+ </div>
334
+
335
+ <button class="btn btn-secondary" onclick="loadStats()">
336
+ 🔄 Refresh Stats
337
+ </button>
338
+
339
+ <button class="btn btn-secondary" style="background: #ff6b6b; color: white; margin-top: 10px;"
340
+ onclick="resetSystem()">
341
+ 🗑️ Delete All Documents
342
+ </button>
343
+
344
+ <p style="font-size: 0.85em; color: #999; margin-top: 10px;">
345
+ 💾 Documents are stored persistently. They remain after restart.
346
+ </p>
347
+ </div>
348
+
349
+ <!-- Query Section -->
350
+ <div class="card">
351
+ <h2>❓ Ask Questions</h2>
352
+
353
+ <div class="query-input">
354
+ <input type="text" id="queryInput" placeholder="What would you like to know about your documents?"
355
+ onkeypress="if(event.key==='Enter') submitQuery()">
356
+ <button class="btn" onclick="submitQuery()">Search</button>
357
+ </div>
358
+
359
+ <div id="queryStatus" class="status hidden"></div>
360
+
361
+ <div id="answerContainer" class="hidden">
362
+ <div class="answer-box">
363
+ <h3>Answer</h3>
364
+ <p id="answerText"></p>
365
+ </div>
366
+
367
+ <div class="sources" id="sourcesBox">
368
+ <h4>📖 Sources Used</h4>
369
+ <div id="sourcesList"></div>
370
+ </div>
371
+ </div>
372
+ </div>
373
+ </div>
374
+
375
+ <!-- Status Indicators -->
376
+ <div class="card full-width">
377
+ <h2>🔧 System Status</h2>
378
+ <div id="healthStatus" class="status-grid">Loading...</div>
379
+ </div>
380
+ </div>
381
+
382
+ <script>
383
+ const API_URL = 'http://localhost:8000';
384
+
385
+ // Upload handlers
386
+ const uploadArea = document.getElementById('uploadArea');
387
+ const fileInput = document.getElementById('fileInput');
388
+
389
+ uploadArea.addEventListener('click', () => fileInput.click());
390
+ uploadArea.addEventListener('dragover', (e) => {
391
+ e.preventDefault();
392
+ uploadArea.classList.add('dragover');
393
+ });
394
+ uploadArea.addEventListener('dragleave', () => {
395
+ uploadArea.classList.remove('dragover');
396
+ });
397
+ uploadArea.addEventListener('drop', (e) => {
398
+ e.preventDefault();
399
+ uploadArea.classList.remove('dragover');
400
+ handleFiles(e.dataTransfer.files);
401
+ });
402
+
403
+ fileInput.addEventListener('change', (e) => {
404
+ handleFiles(e.target.files);
405
+ });
406
+
407
+ async function handleFiles(files) {
408
+ const statusDiv = document.getElementById('uploadStatus');
409
+
410
+ for (const file of files) {
411
+ if (!file.name.endsWith('.pdf')) {
412
+ showStatus(statusDiv, `Skipping ${file.name} - only PDFs supported`, 'error');
413
+ continue;
414
+ }
415
+
416
+ showStatus(statusDiv, `Uploading ${file.name}...`, 'loading');
417
+
418
+ const formData = new FormData();
419
+ formData.append('file', file);
420
+
421
+ try {
422
+ const response = await fetch(`${API_URL}/ingest`, {
423
+ method: 'POST',
424
+ body: formData
425
+ });
426
+
427
+ if (response.ok) {
428
+ const data = await response.json();
429
+ showStatus(
430
+ statusDiv,
431
+ `✓ ${file.name}: ${data.chunks_embedded} chunks ingested`,
432
+ 'success'
433
+ );
434
+ loadStats();
435
+ } else {
436
+ const error = await response.json();
437
+ showStatus(statusDiv, `✗ ${file.name}: ${error.detail}`, 'error');
438
+ }
439
+ } catch (error) {
440
+ showStatus(statusDiv, `✗ Upload failed: ${error.message}`, 'error');
441
+ }
442
+ }
443
+
444
+ fileInput.value = '';
445
+ }
446
+
447
+ async function submitQuery() {
448
+ const query = document.getElementById('queryInput').value.trim();
449
+ if (!query) {
450
+ showStatus(
451
+ document.getElementById('queryStatus'),
452
+ 'Please enter a question',
453
+ 'error'
454
+ );
455
+ return;
456
+ }
457
+
458
+ const statusDiv = document.getElementById('queryStatus');
459
+ showStatus(statusDiv, 'Searching your documents...', 'loading');
460
+
461
+ try {
462
+ const response = await fetch(`${API_URL}/query`, {
463
+ method: 'POST',
464
+ headers: { 'Content-Type': 'application/json' },
465
+ body: JSON.stringify({ query, top_k: 3 })
466
+ });
467
+
468
+ if (response.ok) {
469
+ const data = await response.json();
470
+ displayAnswer(data);
471
+ statusDiv.classList.add('hidden');
472
+ } else {
473
+ const error = await response.json();
474
+ showStatus(statusDiv, error.error || 'Query failed', 'error');
475
+ }
476
+ } catch (error) {
477
+ showStatus(statusDiv, `Error: ${error.message}`, 'error');
478
+ }
479
+ }
480
+
481
+ function displayAnswer(data) {
482
+ document.getElementById('answerText').textContent = data.answer;
483
+
484
+ const sourcesList = document.getElementById('sourcesList');
485
+ sourcesList.innerHTML = data.sources.map(source => `
486
+ <div class="source-item">
487
+ <div class="relevance">📌 Relevance: ${(source.similarity * 100).toFixed(0)}%</div>
488
+ <div class="text">${source.preview}</div>
489
+ </div>
490
+ `).join('');
491
+
492
+ document.getElementById('answerContainer').classList.remove('hidden');
493
+ }
494
+
495
+ async function loadStats() {
496
+ try {
497
+ const response = await fetch(`${API_URL}/stats`);
498
+ if (response.ok) {
499
+ const data = await response.json();
500
+ document.getElementById('totalChunks').textContent = data.total_chunks;
501
+ }
502
+ } catch (error) {
503
+ console.error('Failed to load stats:', error);
504
+ }
505
+ }
506
+
507
+ async function loadHealth() {
508
+ try {
509
+ const response = await fetch(`${API_URL}/health`);
510
+ if (response.ok) {
511
+ const data = await response.json();
512
+
513
+ // Get embedding backend name
514
+ let embeddingName = data.embedding_backend || 'Unknown';
515
+ // Format nicely
516
+ if (embeddingName === 'sentence-transformers') {
517
+ embeddingName = 'Sentence-Transformers';
518
+ } else if (embeddingName === 'ollama') {
519
+ embeddingName = 'Ollama';
520
+ }
521
+
522
+ const healthHtml = `
523
+ <div class="stat-box">
524
+ <div class="number">${data.embedding_backend ? '✓' : '✗'}</div>
525
+ <div class="label">${embeddingName} (Embeddings)</div>
526
+ </div>
527
+ <div class="stat-box">
528
+ <div class="number">${data.groq === '✓' ? '✓' : '✗'}</div>
529
+ <div class="label">Groq (LLM)</div>
530
+ </div>
531
+ <div class="stat-box">
532
+ <div class="number">${data.chroma.status === '✓' ? '✓' : '✗'}</div>
533
+ <div class="label">Chroma (Vector DB)</div>
534
+ </div>
535
+ <div class="stat-box">
536
+ <div class="number">${data.status === 'healthy' ? '✓' : '⚠'}</div>
537
+ <div class="label">Overall Status</div>
538
+ </div>
539
+ `;
540
+ document.getElementById('healthStatus').innerHTML = healthHtml;
541
+ }
542
+ } catch (error) {
543
+ document.getElementById('healthStatus').innerHTML =
544
+ `<div style="grid-column: 1/-1; padding: 15px; background: #f8d7da; color: #721c24; border-radius: 8px;">Cannot connect to API at ${API_URL}</div>`;
545
+ }
546
+ }
547
+
548
+ async function resetSystem() {
549
+ if (!confirm('⚠️ Delete ALL documents and embeddings? This cannot be undone!')) {
550
+ return;
551
+ }
552
+
553
+ const statusDiv = document.getElementById('uploadStatus');
554
+ showStatus(statusDiv, 'Resetting system...', 'loading');
555
+
556
+ try {
557
+ const response = await fetch(`${API_URL}/reset`, {
558
+ method: 'POST',
559
+ headers: {
560
+ 'Content-Type': 'application/json'
561
+ }
562
+ });
563
+
564
+ if (response.ok) {
565
+ const data = await response.json();
566
+ showStatus(statusDiv, '✓ All documents deleted!', 'success');
567
+ loadStats();
568
+ } else {
569
+ const error = await response.json();
570
+ showStatus(statusDiv, `Reset failed: ${error.detail || 'Unknown error'}`, 'error');
571
+ }
572
+ } catch (error) {
573
+ showStatus(statusDiv, `Error: ${error.message}`, 'error');
574
+ }
575
+ }
576
+
577
+ function showStatus(element, message, type) {
578
+ element.textContent = message;
579
+ element.className = `status ${type}`;
580
+ element.classList.remove('hidden');
581
+ }
582
+
583
+ // Load stats and health on page load
584
+ window.addEventListener('load', () => {
585
+ loadStats();
586
+ loadHealth();
587
+ setInterval(loadHealth, 30000); // Refresh every 30s
588
+ });
589
+ </script>
590
+ </body>
591
+
592
+ </html>
main.py ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
def main() -> None:
    """Entry point: print a greeting identifying the project."""
    greeting = "Hello from doc-intelligence-rag!"
    print(greeting)


if __name__ == "__main__":
    main()
notebooks/01_rag_notebook.ipynb ADDED
@@ -0,0 +1,993 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 7,
6
+ "id": "b0f26b66",
7
+ "metadata": {},
8
+ "outputs": [
9
+ {
10
+ "name": "stdout",
11
+ "output_type": "stream",
12
+ "text": [
13
+ "d:\\projects\\doc-intelligence-rag\\.venv\\Scripts\\python.exe\n"
14
+ ]
15
+ }
16
+ ],
17
+ "source": [
18
+ "import sys\n",
19
+ "print(sys.executable)"
20
+ ]
21
+ },
22
+ {
23
+ "cell_type": "code",
24
+ "execution_count": 9,
25
+ "id": "d39fb45d",
26
+ "metadata": {},
27
+ "outputs": [
28
+ {
29
+ "name": "stdout",
30
+ "output_type": "stream",
31
+ "text": [
32
+ "============================================================\n",
33
+ "PART 0: RAG Pipeline Overview\n",
34
+ "============================================================\n",
35
+ "\n",
36
+ "Input: User asks \"What is machine learning?\"\n",
37
+ "\n",
38
+ " ↓\n",
39
+ "\n",
40
+ "1. RETRIEVE: Search similar docs\n",
41
+ " Question embedding → Find top 3 most similar chunks\n",
42
+ "\n",
43
+ "2. AUGMENT: Build context\n",
44
+ " Combine retrieved chunks into a context string\n",
45
+ "\n",
46
+ "3. GENERATE: LLM answers\n",
47
+ " Pass (context + question) → LLM → Answer\n",
48
+ "\n",
49
+ "Output: \"Machine learning is...\"\n",
50
+ "\n",
51
+ "\n",
52
+ "============================================================\n",
53
+ "PART 1: Text Chunking (Why?)\n",
54
+ "============================================================\n",
55
+ "\n",
56
+ "WHY chunk text?\n",
57
+ " • Embeddings work better on ~500 token chunks (not 50k token documents)\n",
58
+ " • Allows granular retrieval (retrieve only relevant section)\n",
59
+ " • Reduces embedding costs\n",
60
+ "\n",
61
+ "CHALLENGE: Chunking strategy matters!\n",
62
+ " • Too small: Lose context (100 tokens)\n",
63
+ " • Too large: Lose precision (5000 tokens)\n",
64
+ " • SOLUTION: Use overlap (e.g., 500 token chunk with 50 token overlap)\n",
65
+ "\n",
66
+ "WHY overlap?\n",
67
+ " • Important info might be at chunk boundary\n",
68
+ " • Overlap ensures semantic continuity\n",
69
+ " • Example: \"The study shows A. B supports A.\" might split badly without overlap\n",
70
+ "\n"
71
+ ]
72
+ }
73
+ ],
74
+ "source": [
75
+ "# RAG System from First Principles\n",
76
+ "# =====================================\n",
77
+ "# This notebook builds a RAG system step-by-step\n",
78
+ "# We'll understand WHY before we code\n",
79
+ "#\n",
80
+ "# Prerequisites with UV:\n",
81
+ "# uv init rag-learning\n",
82
+ "# uv add jupyter numpy pandas requests pydantic groq\n",
83
+ "# uv venv && source .venv/bin/activate\n",
84
+ "#\n",
85
+ "# Free APIs Used:\n",
86
+ "# - Groq (LLM): https://console.groq.com (free API key)\n",
87
+ "# - Ollama (Embeddings): https://ollama.ai (local, completely free)\n",
88
+ "\n",
89
+ "# ========== PART 0: THE PROBLEM ==========\n",
90
+ "# \n",
91
+ "# Problem: How do we make an LLM answer questions about OUR documents?\n",
92
+ "# \n",
93
+ "# Naive approach: Put entire document into prompt\n",
94
+ "# ❌ Problem: Context window is limited (4k, 8k, 128k tokens)\n",
95
+ "# ❌ Problem: Costs scale with document size\n",
96
+ "# ❌ Problem: LLM gets confused with irrelevant information\n",
97
+ "#\n",
98
+ "# Better approach: RAG (Retrieval-Augmented Generation)\n",
99
+ "# ✓ Only pass relevant chunks to LLM\n",
100
+ "# ✓ Reduces costs\n",
101
+ "# ✓ Improves accuracy\n",
102
+ "#\n",
103
+ "# RAG Pipeline:\n",
104
+ "# 1. Split document into chunks\n",
105
+ "# 2. Convert chunks to embeddings (vectors)\n",
106
+ "# 3. Store embeddings in vector database\n",
107
+ "# 4. When user asks question:\n",
108
+ "# a) Convert question to embedding\n",
109
+ "# b) Find most similar chunks (similarity search)\n",
110
+ "# c) Pass those chunks + question to LLM\n",
111
+ "# d) LLM answers based on those chunks\n",
112
+ "\n",
113
+ "print(\"=\" * 60)\n",
114
+ "print(\"PART 0: RAG Pipeline Overview\")\n",
115
+ "print(\"=\" * 60)\n",
116
+ "print(\"\"\"\n",
117
+ "Input: User asks \"What is machine learning?\"\n",
118
+ " \n",
119
+ " ↓\n",
120
+ " \n",
121
+ "1. RETRIEVE: Search similar docs\n",
122
+ " Question embedding → Find top 3 most similar chunks\n",
123
+ " \n",
124
+ "2. AUGMENT: Build context\n",
125
+ " Combine retrieved chunks into a context string\n",
126
+ " \n",
127
+ "3. GENERATE: LLM answers\n",
128
+ " Pass (context + question) → LLM → Answer\n",
129
+ "\n",
130
+ "Output: \"Machine learning is...\"\n",
131
+ "\"\"\")\n",
132
+ "\n",
133
+ "# ========== PART 1: TEXT CHUNKING ==========\n",
134
+ "print(\"\\n\" + \"=\" * 60)\n",
135
+ "print(\"PART 1: Text Chunking (Why?)\")\n",
136
+ "print(\"=\" * 60)\n",
137
+ "\n",
138
+ "print(\"\"\"\n",
139
+ "WHY chunk text?\n",
140
+ " • Embeddings work better on ~500 token chunks (not 50k token documents)\n",
141
+ " • Allows granular retrieval (retrieve only relevant section)\n",
142
+ " • Reduces embedding costs\n",
143
+ "\n",
144
+ "CHALLENGE: Chunking strategy matters!\n",
145
+ " • Too small: Lose context (100 tokens)\n",
146
+ " • Too large: Lose precision (5000 tokens)\n",
147
+ " • SOLUTION: Use overlap (e.g., 500 token chunk with 50 token overlap)\n",
148
+ "\n",
149
+ "WHY overlap?\n",
150
+ " • Important info might be at chunk boundary\n",
151
+ " • Overlap ensures semantic continuity\n",
152
+ " • Example: \"The study shows A. B supports A.\" might split badly without overlap\n",
153
+ "\"\"\")\n"
154
+ ]
155
+ },
156
+ {
157
+ "cell_type": "code",
158
+ "execution_count": 13,
159
+ "id": "0523d02c",
160
+ "metadata": {},
161
+ "outputs": [
162
+ {
163
+ "name": "stdout",
164
+ "output_type": "stream",
165
+ "text": [
166
+ "\n",
167
+ "✓ Split into 5 chunks:\n",
168
+ " Chunk 0: 20 words | Machine learning is a subset of artificial intelligence. It ...\n",
169
+ " Chunk 1: 20 words | on algorithms that learn from data. Deep learning uses neura...\n",
170
+ " Chunk 2: 20 words | networks with multiple layers. Transformers are the backbone...\n",
171
+ " Chunk 3: 12 words | NLP systems. The attention mechanism allows models to focus ...\n",
172
+ " Chunk 4: 2 words | relevant parts....\n"
173
+ ]
174
+ }
175
+ ],
176
+ "source": [
177
+ "\n",
178
+ "# Implement chunking\n",
179
+ "def chunk_text(text, chunk_size=500, overlap=50):\n",
180
+ " \"\"\"\n",
181
+ " Split text into overlapping chunks.\n",
182
+ " \n",
183
+ " Args:\n",
184
+ " text: Raw text to chunk\n",
185
+ " chunk_size: Tokens per chunk (roughly words * 0.75)\n",
186
+ " overlap: Tokens overlap between chunks\n",
187
+ " \n",
188
+ " Returns:\n",
189
+ " List of chunk dicts with text and metadata\n",
190
+ " \"\"\"\n",
191
+ " words = text.split()\n",
192
+ " chunks = []\n",
193
+ " \n",
194
+ " # Calculate stride (how many words to move forward each iteration)\n",
195
+ " # If overlap=50 and chunk_size=500, stride=450\n",
196
+ " stride = chunk_size - overlap\n",
197
+ " \n",
198
+ " for i in range(0, len(words), stride):\n",
199
+ " chunk_words = words[i:i + chunk_size]\n",
200
+ " chunk_text = \" \".join(chunk_words)\n",
201
+ " \n",
202
+ " if chunk_text.strip(): # Skip empty chunks\n",
203
+ " chunks.append({\n",
204
+ " \"text\": chunk_text,\n",
205
+ " \"start_idx\": i,\n",
206
+ " \"word_count\": len(chunk_words),\n",
207
+ " \"chunk_id\": len(chunks) # Simple ID\n",
208
+ " })\n",
209
+ " \n",
210
+ " return chunks\n",
211
+ "\n",
212
+ "# Test chunking\n",
213
+ "sample_text = \"\"\"\n",
214
+ "Machine learning is a subset of artificial intelligence. \n",
215
+ "It focuses on algorithms that learn from data. \n",
216
+ "Deep learning uses neural networks with multiple layers. \n",
217
+ "Transformers are the backbone of modern NLP systems. \n",
218
+ "The attention mechanism allows models to focus on relevant parts.\n",
219
+ "\"\"\"\n",
220
+ "\n",
221
+ "chunks = chunk_text(sample_text, chunk_size=20, overlap=10)\n",
222
+ "print(f\"\\n✓ Split into {len(chunks)} chunks:\")\n",
223
+ "for i, chunk in enumerate(chunks):\n",
224
+ " print(f\" Chunk {i}: {chunk['word_count']} words | {chunk['text'][:60]}...\")\n"
225
+ ]
226
+ },
227
+ {
228
+ "cell_type": "code",
229
+ "execution_count": 14,
230
+ "id": "3464f9c9",
231
+ "metadata": {},
232
+ "outputs": [
233
+ {
234
+ "name": "stdout",
235
+ "output_type": "stream",
236
+ "text": [
237
+ "\n",
238
+ "============================================================\n",
239
+ "PART 2: Embeddings (Why?)\n",
240
+ "============================================================\n",
241
+ "\n",
242
+ "PROBLEM: How do we compare text for similarity?\n",
243
+ " • Can't just use string matching (\"cat\" ≠ \"feline\")\n",
244
+ " • Need semantic understanding\n",
245
+ "\n",
246
+ "SOLUTION: Embeddings (vectors that capture meaning)\n",
247
+ " • Convert text → vector (list of numbers)\n",
248
+ " • Similar texts have similar vectors\n",
249
+ " • We can use math to compare them!\n",
250
+ "\n",
251
+ "EXAMPLE:\n",
252
+ " \"The cat sat on the mat\" → [0.2, -0.5, 0.8, 0.1, ...] (384 dims)\n",
253
+ " \"A feline sat on rug\" → [0.21, -0.48, 0.79, 0.12, ...] (384 dims)\n",
254
+ "\n",
255
+ " Notice: Very similar vectors = similar meaning!\n",
256
+ "\n",
257
+ "HOW DO WE GET EMBEDDINGS?\n",
258
+ " Option 1: Use OpenAI API (costs money)\n",
259
+ " Option 2: Use local Ollama (free, slower)\n",
260
+ "\n",
261
+ " We'll use Ollama because:\n",
262
+ " ✓ Free (no API costs)\n",
263
+ " ✓ Privacy (stays on your machine)\n",
264
+ " ✓ Fast for this use case\n",
265
+ " ✓ Good enough for RAG\n",
266
+ "\n",
267
+ "EMBEDDING MODELS:\n",
268
+ " • nomic-embed-text: 768 dimensions (good for RAG)\n",
269
+ " • all-minilm-l6-v2: 384 dimensions (lighter)\n",
270
+ " • openai embedding: 1536 dimensions (more expensive, higher quality)\n",
271
+ "\n"
272
+ ]
273
+ }
274
+ ],
275
+ "source": [
276
+ "\n",
277
+ "# ========== PART 2: EMBEDDINGS ==========\n",
278
+ "print(\"\\n\" + \"=\" * 60)\n",
279
+ "print(\"PART 2: Embeddings (Why?)\")\n",
280
+ "print(\"=\" * 60)\n",
281
+ "\n",
282
+ "print(\"\"\"\n",
283
+ "PROBLEM: How do we compare text for similarity?\n",
284
+ " • Can't just use string matching (\"cat\" ≠ \"feline\")\n",
285
+ " • Need semantic understanding\n",
286
+ "\n",
287
+ "SOLUTION: Embeddings (vectors that capture meaning)\n",
288
+ " • Convert text → vector (list of numbers)\n",
289
+ " • Similar texts have similar vectors\n",
290
+ " • We can use math to compare them!\n",
291
+ "\n",
292
+ "EXAMPLE:\n",
293
+ " \"The cat sat on the mat\" → [0.2, -0.5, 0.8, 0.1, ...] (384 dims)\n",
294
+ " \"A feline sat on rug\" → [0.21, -0.48, 0.79, 0.12, ...] (384 dims)\n",
295
+ " \n",
296
+ " Notice: Very similar vectors = similar meaning!\n",
297
+ "\n",
298
+ "HOW DO WE GET EMBEDDINGS?\n",
299
+ " Option 1: Use OpenAI API (costs money)\n",
300
+ " Option 2: Use local Ollama (free, slower)\n",
301
+ " \n",
302
+ " We'll use Ollama because:\n",
303
+ " ✓ Free (no API costs)\n",
304
+ " ✓ Privacy (stays on your machine)\n",
305
+ " ✓ Fast for this use case\n",
306
+ " ✓ Good enough for RAG\n",
307
+ "\n",
308
+ "EMBEDDING MODELS:\n",
309
+ " • nomic-embed-text: 768 dimensions (good for RAG)\n",
310
+ " • all-minilm-l6-v2: 384 dimensions (lighter)\n",
311
+ " • openai embedding: 1536 dimensions (more expensive, higher quality)\n",
312
+ "\"\"\")\n"
313
+ ]
314
+ },
315
+ {
316
+ "cell_type": "code",
317
+ "execution_count": 22,
318
+ "id": "5eb96a3e",
319
+ "metadata": {},
320
+ "outputs": [
321
+ {
322
+ "name": "stdout",
323
+ "output_type": "stream",
324
+ "text": [
325
+ "\n",
326
+ "✓ Embedding shape: 384 dimensions\n",
327
+ " Sample values: [-0.019382589036581188, -0.005570373852637493, -0.05710695701024572, -0.04601154504073093, 0.08037614551187885] ...\n"
328
+ ]
329
+ }
330
+ ],
331
+ "source": [
332
+ "\n",
333
+ "import numpy as np\n",
334
+ "\n",
335
+ "def simulate_embedding(text, dims=384):\n",
336
+ " \"\"\"\n",
337
+ " Simulate what an embedding looks like.\n",
338
+ " \n",
339
+ " In reality, Ollama would use a neural network to create this.\n",
340
+ " For learning, we'll just use a hash-based deterministic vector.\n",
341
+ " \"\"\"\n",
342
+ " # Hash the text to get consistent numbers\n",
343
+ " hash_val = hash(text) % 10000\n",
344
+ " np.random.seed(hash_val)\n",
345
+ " \n",
346
+ " # Create random but consistent vector\n",
347
+ " embedding = np.random.randn(dims)\n",
348
+ " \n",
349
+ " # Normalize to unit length (important for cosine similarity)\n",
350
+ " embedding = embedding / np.linalg.norm(embedding)\n",
351
+ " \n",
352
+ " return embedding.tolist()\n",
353
+ "\n",
354
+ "# Demonstrate\n",
355
+ "text1 = \"Machine learning is AI\"\n",
356
+ "text2 = \"Deep learning uses neural networks\"\n",
357
+ "text3 = \"Cooking pasta is delicious\"\n",
358
+ "\n",
359
+ "emb1 = simulate_embedding(text1)\n",
360
+ "emb2 = simulate_embedding(text2)\n",
361
+ "emb3 = simulate_embedding(text3)\n",
362
+ "\n",
363
+ "print(f\"\\n✓ Embedding shape: {len(emb1)} dimensions\")\n",
364
+ "print(f\" Sample values: {emb1[:5]} ...\")\n"
365
+ ]
366
+ },
367
+ {
368
+ "cell_type": "code",
369
+ "execution_count": 28,
370
+ "id": "e15b8066",
371
+ "metadata": {},
372
+ "outputs": [
373
+ {
374
+ "name": "stdout",
375
+ "output_type": "stream",
376
+ "text": [
377
+ "\n",
378
+ "============================================================\n",
379
+ "PART 3: Similarity Search (Why?)\n",
380
+ "============================================================\n",
381
+ "\n",
382
+ "GOAL: Find which chunks are most relevant to a query\n",
383
+ "\n",
384
+ "METHOD: Cosine Similarity\n",
385
+ " • Measure angle between vectors\n",
386
+ " • Values from -1 to 1 (1 = identical direction = same meaning)\n",
387
+ " • Formula: similarity = (A · B) / (|A| * |B|)\n",
388
+ "\n",
389
+ "EXAMPLE:\n",
390
+ " Query: \"What is deep learning?\"\n",
391
+ " Chunk 1: \"Deep learning uses neural networks\" → similarity = 0.92 ✓ Relevant!\n",
392
+ " Chunk 2: \"Cooking pasta...\" → similarity = 0.15 ✗ Not relevant\n",
393
+ " Chunk 3: \"Neural networks have many layers\" → similarity = 0.85 ✓ Relevant!\n",
394
+ "\n",
395
+ " Return top 2: Chunks 1 and 3\n",
396
+ "\n",
397
+ "\n",
398
+ "✓ Query: 'neural networks'\n",
399
+ " Results (sorted by relevance):\n",
400
+ " -0.005 | Machine learning is AI\n",
401
+ " -0.021 | Cooking pasta is delicious\n",
402
+ " -0.077 | Deep learning uses neural networks\n"
403
+ ]
404
+ }
405
+ ],
406
+ "source": [
407
+ "\n",
408
+ "# ========== PART 3: SIMILARITY SEARCH ==========\n",
409
+ "print(\"\\n\" + \"=\" * 60)\n",
410
+ "print(\"PART 3: Similarity Search (Why?)\")\n",
411
+ "print(\"=\" * 60)\n",
412
+ "\n",
413
+ "print(\"\"\"\n",
414
+ "GOAL: Find which chunks are most relevant to a query\n",
415
+ "\n",
416
+ "METHOD: Cosine Similarity\n",
417
+ " • Measure angle between vectors\n",
418
+ " • Values from -1 to 1 (1 = identical direction = same meaning)\n",
419
+ " • Formula: similarity = (A · B) / (|A| * |B|)\n",
420
+ " \n",
421
+ "EXAMPLE:\n",
422
+ " Query: \"What is deep learning?\"\n",
423
+ " Chunk 1: \"Deep learning uses neural networks\" → similarity = 0.92 ✓ Relevant!\n",
424
+ " Chunk 2: \"Cooking pasta...\" → similarity = 0.15 ✗ Not relevant\n",
425
+ " Chunk 3: \"Neural networks have many layers\" → similarity = 0.85 ✓ Relevant!\n",
426
+ " \n",
427
+ " Return top 2: Chunks 1 and 3\n",
428
+ "\"\"\")\n",
429
+ "\n",
430
+ "def cosine_similarity(vec_a, vec_b):\n",
431
+ " \"\"\"\n",
432
+ " Calculate cosine similarity between two vectors.\n",
433
+ " \n",
434
+ " Returns value between -1 and 1 (higher = more similar)\n",
435
+ " \"\"\"\n",
436
+ " a = np.array(vec_a)\n",
437
+ " b = np.array(vec_b)\n",
438
+ " \n",
439
+ " dot_product = np.dot(a, b)\n",
440
+ " norm_a = np.linalg.norm(a)\n",
441
+ " norm_b = np.linalg.norm(b)\n",
442
+ " \n",
443
+ " if norm_a == 0 or norm_b == 0:\n",
444
+ " return 0.0\n",
445
+ " \n",
446
+ " return float(dot_product / (norm_a * norm_b))\n",
447
+ "\n",
448
+ "# Test similarity\n",
449
+ "query = \"neural networks\"\n",
450
+ "query_emb = simulate_embedding(query)\n",
451
+ "\n",
452
+ "similarities = [\n",
453
+ " (\"Deep learning uses neural networks\", cosine_similarity(query_emb, emb2)),\n",
454
+ " (\"Machine learning is AI\", cosine_similarity(query_emb, emb1)),\n",
455
+ " (\"Cooking pasta is delicious\", cosine_similarity(query_emb, emb3)),\n",
456
+ "]\n",
457
+ "\n",
458
+ "# Sort by similarity\n",
459
+ "similarities.sort(key=lambda x: x[1], reverse=True)\n",
460
+ "\n",
461
+ "print(f\"\\n✓ Query: '{query}'\")\n",
462
+ "print(f\" Results (sorted by relevance):\")\n",
463
+ "for text, score in similarities:\n",
464
+ " bar = \"█\" * int(score * 20)\n",
465
+ " print(f\" {bar} {score:.3f} | {text}\")\n"
466
+ ]
467
+ },
468
+ {
469
+ "cell_type": "code",
470
+ "execution_count": 29,
471
+ "id": "33b76510",
472
+ "metadata": {},
473
+ "outputs": [
474
+ {
475
+ "name": "stdout",
476
+ "output_type": "stream",
477
+ "text": [
478
+ "\n",
479
+ "============================================================\n",
480
+ "PART 4: Vector Store (Why?)\n",
481
+ "============================================================\n",
482
+ "\n",
483
+ "PROBLEM: How do we store and retrieve embeddings efficiently?\n",
484
+ "\n",
485
+ "SIMPLE APPROACH: In-memory dictionary\n",
486
+ " ✓ Easy to understand and implement\n",
487
+ " ✓ Fast for small documents (< 10k chunks)\n",
488
+ " ✗ Loses data when program stops\n",
489
+ " ✗ Doesn't scale to millions of vectors\n",
490
+ "\n",
491
+ "PRODUCTION APPROACH: Vector databases\n",
492
+ " • Pinecone, Weaviate, Milvus, Chroma, Qdrant\n",
493
+ " • Optimized for similarity search\n",
494
+ " • Persistent storage\n",
495
+ " • Scales to billions of vectors\n",
496
+ "\n",
497
+ "FOR NOW: We'll use in-memory (simple) but structure it to be replaceable\n",
498
+ "\n",
499
+ "\n",
500
+ "✓ Query: 'transformers attention mechanism'\n",
501
+ " Retrieved 2 chunks:\n",
502
+ " [0.005] on algorithms that learn from data. Deep learning uses neural networks...\n",
503
+ " [-0.012] NLP systems. The attention mechanism allows models to focus on relevan...\n"
504
+ ]
505
+ }
506
+ ],
507
+ "source": [
508
+ "\n",
509
+ "# ========== PART 4: VECTOR STORE ==========\n",
510
+ "print(\"\\n\" + \"=\" * 60)\n",
511
+ "print(\"PART 4: Vector Store (Why?)\")\n",
512
+ "print(\"=\" * 60)\n",
513
+ "\n",
514
+ "print(\"\"\"\n",
515
+ "PROBLEM: How do we store and retrieve embeddings efficiently?\n",
516
+ "\n",
517
+ "SIMPLE APPROACH: In-memory dictionary\n",
518
+ " ✓ Easy to understand and implement\n",
519
+ " ✓ Fast for small documents (< 10k chunks)\n",
520
+ " ✗ Loses data when program stops\n",
521
+ " ✗ Doesn't scale to millions of vectors\n",
522
+ "\n",
523
+ "PRODUCTION APPROACH: Vector databases\n",
524
+ " • Pinecone, Weaviate, Milvus, Chroma, Qdrant\n",
525
+ " • Optimized for similarity search\n",
526
+ " • Persistent storage\n",
527
+ " • Scales to billions of vectors\n",
528
+ "\n",
529
+ "FOR NOW: We'll use in-memory (simple) but structure it to be replaceable\n",
530
+ "\"\"\")\n",
531
+ "\n",
532
+ "class SimpleVectorStore:\n",
533
+ " \"\"\"\n",
534
+ " In-memory vector store.\n",
535
+ " Structure: Can easily swap for Pinecone/Weaviate later.\n",
536
+ " \"\"\"\n",
537
+ " \n",
538
+ " def __init__(self):\n",
539
+ " self.vectors = {} # chunk_id -> embedding vector\n",
540
+ " self.metadata = {} # chunk_id -> {text, source, etc}\n",
541
+ " \n",
542
+ " def add(self, chunk_id, text, embedding):\n",
543
+ " \"\"\"Store a chunk with its embedding.\"\"\"\n",
544
+ " self.vectors[chunk_id] = embedding\n",
545
+ " self.metadata[chunk_id] = {\"text\": text, \"length\": len(text)}\n",
546
+ " \n",
547
+ " def search(self, query_embedding, top_k=3):\n",
548
+ " \"\"\"Find most similar chunks.\"\"\"\n",
549
+ " if not self.vectors:\n",
550
+ " return []\n",
551
+ " \n",
552
+ " results = []\n",
553
+ " for chunk_id, vector in self.vectors.items():\n",
554
+ " similarity = cosine_similarity(query_embedding, vector)\n",
555
+ " results.append({\n",
556
+ " \"chunk_id\": chunk_id,\n",
557
+ " \"similarity\": similarity,\n",
558
+ " \"text\": self.metadata[chunk_id][\"text\"]\n",
559
+ " })\n",
560
+ " \n",
561
+ " # Sort by similarity descending\n",
562
+ " results.sort(key=lambda x: x[\"similarity\"], reverse=True)\n",
563
+ " return results[:top_k]\n",
564
+ "\n",
565
+ "# Test vector store\n",
566
+ "store = SimpleVectorStore()\n",
567
+ "\n",
568
+ "# Add chunks with embeddings\n",
569
+ "for chunk in chunks:\n",
570
+ " emb = simulate_embedding(chunk[\"text\"])\n",
571
+ " store.add(chunk[\"chunk_id\"], chunk[\"text\"], emb)\n",
572
+ "\n",
573
+ "# Search\n",
574
+ "query = \"transformers attention mechanism\"\n",
575
+ "query_emb = simulate_embedding(query)\n",
576
+ "results = store.search(query_emb, top_k=2)\n",
577
+ "\n",
578
+ "print(f\"\\n✓ Query: '{query}'\")\n",
579
+ "print(f\" Retrieved {len(results)} chunks:\")\n",
580
+ "for r in results:\n",
581
+ " print(f\" [{r['similarity']:.3f}] {r['text'][:70]}...\")\n"
582
+ ]
583
+ },
584
+ {
585
+ "cell_type": "code",
586
+ "execution_count": 31,
587
+ "id": "94ca95e9",
588
+ "metadata": {},
589
+ "outputs": [
590
+ {
591
+ "name": "stdout",
592
+ "output_type": "stream",
593
+ "text": [
594
+ "\n",
595
+ "============================================================\n",
596
+ "PART 5: LLM Integration (Why?)\n",
597
+ "============================================================\n",
598
+ "\n",
599
+ "PROBLEM: How do we actually answer the user's question?\n",
600
+ "\n",
601
+ "SOLUTION: Pass context + question to LLM\n",
602
+ " 1. Retrieve relevant chunks (from vector store)\n",
603
+ " 2. Combine into \"context\" string\n",
604
+ " 3. Create prompt: \"Context: [chunks] Question: [query]\"\n",
605
+ " 4. Send to LLM (Groq, OpenAI, etc)\n",
606
+ " 5. LLM answers based on context\n",
607
+ "\n",
608
+ "WHY THIS WORKS:\n",
609
+ " • LLM has knowledge from training\n",
610
+ " • Context grounds answer in YOUR documents\n",
611
+ " • LLM won't hallucinate (ideally) because answer must match context\n",
612
+ "\n",
613
+ "EXAMPLE PROMPT:\n",
614
+ " ---\n",
615
+ " Context:\n",
616
+ " Deep learning uses neural networks with many layers.\n",
617
+ " The attention mechanism focuses on relevant parts.\n",
618
+ "\n",
619
+ " Question: How does deep learning work?\n",
620
+ "\n",
621
+ " Answer: (LLM fills this in)\n",
622
+ " ---\n",
623
+ "\n",
624
+ "\n",
625
+ "✓ Built context from 2 chunks:\n",
626
+ " Length: 271 characters\n",
627
+ " Preview:\n",
628
+ " [Chunk 1 - Relevance: 0.5%]\n",
629
+ "on algorithms that learn from data. Deep learning uses neural networks with multiple layers. Transformers are the backbone of modern\n",
630
+ "\n",
631
+ "[Chunk 2 - Relevance: -1.2%]\n",
632
+ "NLP syste...\n"
633
+ ]
634
+ }
635
+ ],
636
+ "source": [
637
+ "\n",
638
+ "# ========== PART 5: LLM INTEGRATION ==========\n",
639
+ "print(\"\\n\" + \"=\" * 60)\n",
640
+ "print(\"PART 5: LLM Integration (Why?)\")\n",
641
+ "print(\"=\" * 60)\n",
642
+ "\n",
643
+ "print(\"\"\"\n",
644
+ "PROBLEM: How do we actually answer the user's question?\n",
645
+ "\n",
646
+ "SOLUTION: Pass context + question to LLM\n",
647
+ " 1. Retrieve relevant chunks (from vector store)\n",
648
+ " 2. Combine into \"context\" string\n",
649
+ " 3. Create prompt: \"Context: [chunks] Question: [query]\"\n",
650
+ " 4. Send to LLM (Groq, OpenAI, etc)\n",
651
+ " 5. LLM answers based on context\n",
652
+ "\n",
653
+ "WHY THIS WORKS:\n",
654
+ " • LLM has knowledge from training\n",
655
+ " • Context grounds answer in YOUR documents\n",
656
+ " • LLM won't hallucinate (ideally) because answer must match context\n",
657
+ "\n",
658
+ "EXAMPLE PROMPT:\n",
659
+ " ---\n",
660
+ " Context:\n",
661
+ " Deep learning uses neural networks with many layers.\n",
662
+ " The attention mechanism focuses on relevant parts.\n",
663
+ " \n",
664
+ " Question: How does deep learning work?\n",
665
+ " \n",
666
+ " Answer: (LLM fills this in)\n",
667
+ " ---\n",
668
+ "\"\"\")\n",
669
+ "\n",
670
+ "def build_context(retrieved_chunks):\n",
671
+ " \"\"\"\n",
672
+ " Combine retrieved chunks into a context string for the LLM.\n",
673
+ " \n",
674
+ " This is where you control what the LLM sees.\n",
675
+ " \"\"\"\n",
676
+ " context = \"\"\n",
677
+ " for i, chunk in enumerate(retrieved_chunks):\n",
678
+ " score = chunk.get(\"similarity\", 0)\n",
679
+ " context += f\"[Chunk {i+1} - Relevance: {score:.1%}]\\n\"\n",
680
+ " context += chunk[\"text\"] + \"\\n\\n\"\n",
681
+ " \n",
682
+ " return context\n",
683
+ "\n",
684
+ "# Demo\n",
685
+ "context = build_context(results)\n",
686
+ "print(f\"\\n✓ Built context from {len(results)} chunks:\")\n",
687
+ "print(f\" Length: {len(context)} characters\")\n",
688
+ "print(f\" Preview:\\n {context[:200]}...\")\n",
689
+ "\n",
690
+ "\n"
691
+ ]
692
+ },
693
+ {
694
+ "cell_type": "code",
695
+ "execution_count": 34,
696
+ "id": "2b841f01",
697
+ "metadata": {},
698
+ "outputs": [
699
+ {
700
+ "name": "stdout",
701
+ "output_type": "stream",
702
+ "text": [
703
+ "\n",
704
+ "============================================================\n",
705
+ "PART 6: Complete RAG Pipeline\n",
706
+ "============================================================\n",
707
+ "\n",
708
+ "[STEP 1] Chunking document...\n",
709
+ " → Split into 1 chunks\n",
710
+ "\n",
711
+ "[STEP 2] Creating embeddings...\n",
712
+ " → Created 1 embeddings\n",
713
+ "\n",
714
+ "[STEP 3] Embedding query...\n",
715
+ " → Query embedding ready\n",
716
+ "\n",
717
+ "[STEP 4] Searching similar chunks...\n",
718
+ " → Retrieved 1 chunks\n",
719
+ "\n",
720
+ "[STEP 5] Building prompt...\n",
721
+ " → Prompt ready (518 chars)\n",
722
+ "\n",
723
+ "============================================================\n",
724
+ "RESULT SUMMARY\n",
725
+ "============================================================\n",
726
+ "Query: How do transformers work?\n",
727
+ "Chunks created: 1\n",
728
+ "Chunks retrieved: 1\n",
729
+ "\n",
730
+ "Top retrieved chunks:\n",
731
+ " [6.46%] Machine learning is a subset of AI that learns from data. De...\n",
732
+ "\n",
733
+ "Final prompt (to be sent to LLM):\n",
734
+ "You are a helpful assistant. Answer the question based ONLY on the provided context.\n",
735
+ "If the context doesn't contain the answer, say so explicitly.\n",
736
+ "\n",
737
+ "Context:\n",
738
+ "[Chunk 1 - Relevance: 6.5%]\n",
739
+ "Machine learning is a subset of AI that learns from data. Deep learning uses neural networks with many layers. Transformers use attention mechanisms for NLP tasks. The attention mechanism allows models to focus on important parts. Large language models are trained on massive datasets.\n",
740
+ "\n",
741
+ "\n",
742
+ "\n",
743
+ "Question: How do transformers work?\n",
744
+ "\n",
745
+ "Answer:\n",
746
+ "\n",
747
+ "============================================================\n",
748
+ "✓ You now understand RAG!\n",
749
+ "============================================================\n",
750
+ "\n",
751
+ "Next steps:\n",
752
+ " 1. Replace simulate_embedding() with real Ollama calls\n",
753
+ " 2. Replace our Vector Store with Pinecone/Chroma\n",
754
+ " 3. Replace the final prompt with real Groq LLM calls\n",
755
+ " 4. Wrap in FastAPI for production\n",
756
+ "\n",
757
+ "But the LOGIC stays the same!\n",
758
+ "\n"
759
+ ]
760
+ }
761
+ ],
762
+ "source": [
763
+ "def build_prompt(context, query):\n",
764
+ " \"\"\"Build final prompt for LLM.\"\"\"\n",
765
+ " prompt = f\"\"\"You are a helpful assistant. Answer the question based ONLY on the provided context.\n",
766
+ "If the context doesn't contain the answer, say so explicitly.\n",
767
+ "\n",
768
+ "Context:\n",
769
+ "{context}\n",
770
+ "\n",
771
+ "Question: {query}\n",
772
+ "\n",
773
+ "Answer:\"\"\"\n",
774
+ " return prompt\n",
775
+ " \n",
776
+ "# ========== PART 6: PUTTING IT TOGETHER ==========\n",
777
+ "print(\"\\n\" + \"=\" * 60)\n",
778
+ "print(\"PART 6: Complete RAG Pipeline\")\n",
779
+ "print(\"=\" * 60)\n",
780
+ "\n",
781
+ "def rag_pipeline(query, document_text, chunk_size=500, overlap=50, top_k=3):\n",
782
+ " \"\"\"\n",
783
+ " Complete RAG pipeline:\n",
784
+ " 1. Chunk document\n",
785
+ " 2. Create embeddings\n",
786
+ " 3. Build vector store\n",
787
+ " 4. Retrieve relevant chunks\n",
788
+ " 5. Build prompt\n",
789
+ " 6. Return to user (in real system, send to LLM)\n",
790
+ " \"\"\"\n",
791
+ " \n",
792
+ " print(f\"\\n[STEP 1] Chunking document...\")\n",
793
+ " chunks = chunk_text(document_text, chunk_size, overlap)\n",
794
+ " print(f\" → Split into {len(chunks)} chunks\")\n",
795
+ " \n",
796
+ " print(f\"\\n[STEP 2] Creating embeddings...\")\n",
797
+ " store = SimpleVectorStore()\n",
798
+ " for chunk in chunks:\n",
799
+ " emb = simulate_embedding(chunk[\"text\"])\n",
800
+ " store.add(chunk[\"chunk_id\"], chunk[\"text\"], emb)\n",
801
+ " print(f\" → Created {len(chunks)} embeddings\")\n",
802
+ " \n",
803
+ " print(f\"\\n[STEP 3] Embedding query...\")\n",
804
+ " query_emb = simulate_embedding(query)\n",
805
+ " print(f\" → Query embedding ready\")\n",
806
+ " \n",
807
+ " print(f\"\\n[STEP 4] Searching similar chunks...\")\n",
808
+ " retrieved = store.search(query_emb, top_k)\n",
809
+ " print(f\" → Retrieved {len(retrieved)} chunks\")\n",
810
+ " \n",
811
+ " print(f\"\\n[STEP 5] Building prompt...\")\n",
812
+ " context = build_context(retrieved)\n",
813
+ " prompt = build_prompt(context, query)\n",
814
+ " print(f\" → Prompt ready ({len(prompt)} chars)\")\n",
815
+ " \n",
816
+ " return {\n",
817
+ " \"query\": query,\n",
818
+ " \"chunks_created\": len(chunks),\n",
819
+ " \"chunks_retrieved\": len(retrieved),\n",
820
+ " \"context\": context,\n",
821
+ " \"prompt\": prompt,\n",
822
+ " \"retrieved_chunks\": retrieved\n",
823
+ " }\n",
824
+ "\n",
825
+ "# Test full pipeline\n",
826
+ "doc = \"\"\"\n",
827
+ "Machine learning is a subset of AI that learns from data.\n",
828
+ "Deep learning uses neural networks with many layers.\n",
829
+ "Transformers use attention mechanisms for NLP tasks.\n",
830
+ "The attention mechanism allows models to focus on important parts.\n",
831
+ "Large language models are trained on massive datasets.\n",
832
+ "\"\"\"\n",
833
+ "\n",
834
+ "result = rag_pipeline(\"How do transformers work?\", doc, chunk_size=500, top_k=2)\n",
835
+ "\n",
836
+ "print(f\"\\n\" + \"=\" * 60)\n",
837
+ "print(\"RESULT SUMMARY\")\n",
838
+ "print(\"=\" * 60)\n",
839
+ "print(f\"Query: {result['query']}\")\n",
840
+ "print(f\"Chunks created: {result['chunks_created']}\")\n",
841
+ "print(f\"Chunks retrieved: {result['chunks_retrieved']}\")\n",
842
+ "print(f\"\\nTop retrieved chunks:\")\n",
843
+ "for chunk in result['retrieved_chunks']:\n",
844
+ " print(f\" [{chunk['similarity']:.2%}] {chunk['text'][:60]}...\")\n",
845
+ "print(f\"\\nFinal prompt (to be sent to LLM):\")\n",
846
+ "print(result['prompt'])\n",
847
+ "\n",
848
+ "print(\"\\n\" + \"=\" * 60)\n",
849
+ "print(\"✓ You now understand RAG!\")\n",
850
+ "print(\"=\" * 60)\n",
851
+ "print(\"\"\"\n",
852
+ "Next steps:\n",
853
+ " 1. Replace simulate_embedding() with real Ollama calls\n",
854
+ " 2. Replace our Vector Store with Pinecone/Chroma\n",
855
+ " 3. Replace the final prompt with real Groq LLM calls\n",
856
+ " 4. Wrap in FastAPI for production\n",
857
+ " \n",
858
+ "But the LOGIC stays the same!\n",
859
+ "\"\"\")"
860
+ ]
861
+ },
862
+ {
863
+ "cell_type": "code",
864
+ "execution_count": 38,
865
+ "id": "33af125c",
866
+ "metadata": {},
867
+ "outputs": [
868
+ {
869
+ "name": "stdout",
870
+ "output_type": "stream",
871
+ "text": [
872
+ "llama-3.1-8b-instant\n",
873
+ "groq/compound-mini\n",
874
+ "whisper-large-v3\n",
875
+ "moonshotai/kimi-k2-instruct-0905\n",
876
+ "openai/gpt-oss-20b\n",
877
+ "playai-tts-arabic\n",
878
+ "groq/compound\n",
879
+ "whisper-large-v3-turbo\n",
880
+ "openai/gpt-oss-120b\n",
881
+ "meta-llama/llama-prompt-guard-2-86m\n",
882
+ "meta-llama/llama-prompt-guard-2-22m\n",
883
+ "openai/gpt-oss-safeguard-20b\n",
884
+ "moonshotai/kimi-k2-instruct\n",
885
+ "qwen/qwen3-32b\n",
886
+ "meta-llama/llama-4-maverick-17b-128e-instruct\n",
887
+ "allam-2-7b\n",
888
+ "meta-llama/llama-guard-4-12b\n",
889
+ "playai-tts\n",
890
+ "meta-llama/llama-4-scout-17b-16e-instruct\n",
891
+ "llama-3.3-70b-versatile\n"
892
+ ]
893
+ }
894
+ ],
895
+ "source": [
896
+ "from groq import Groq\n",
897
+ "import os\n",
898
+ "\n",
899
+ "client = Groq(api_key=os.getenv(\"GROQ_API_KEY\"))\n",
900
+ "\n",
901
+ "models = client.models.list()\n",
902
+ "for m in models.data:\n",
903
+ " print(m.id)"
904
+ ]
905
+ },
906
+ {
907
+ "cell_type": "code",
908
+ "execution_count": 39,
909
+ "id": "704478dc",
910
+ "metadata": {},
911
+ "outputs": [
912
+ {
913
+ "name": "stdout",
914
+ "output_type": "stream",
915
+ "text": [
916
+ "\n",
917
+ "🤖 LLM ANSWER:\n",
918
+ "According to the provided context, transformers use attention mechanisms for NLP tasks.\n"
919
+ ]
920
+ }
921
+ ],
922
+ "source": [
923
+ "from dotenv import load_dotenv\n",
924
+ "import os\n",
925
+ "from groq import Groq\n",
926
+ "\n",
927
+ "# Load from .env file\n",
928
+ "load_dotenv()\n",
929
+ "\n",
930
+ "# Get API key safely\n",
931
+ "api_key = os.getenv(\"GROQ_API_KEY\")\n",
932
+ "\n",
933
+ "if not api_key:\n",
934
+ " raise ValueError(\"GROQ_API_KEY not found in environment or .env file\")\n",
935
+ "\n",
936
+ "def query_groq(context, query):\n",
937
+ " \"\"\"Get real answer from Groq LLM.\"\"\"\n",
938
+ " client = Groq(api_key=api_key)\n",
939
+ " \n",
940
+ " message = client.chat.completions.create(\n",
941
+ " model=\"llama-3.1-8b-instant\",\n",
942
+ " max_tokens=1024,\n",
943
+ " messages=[{\n",
944
+ " \"role\": \"user\",\n",
945
+ " \"content\": f\"\"\"Based on this context, answer the question.\n",
946
+ "Only use the context provided.\n",
947
+ "\n",
948
+ "Context:\n",
949
+ "{context}\n",
950
+ "\n",
951
+ "Question: {query}\n",
952
+ "\n",
953
+ "Answer:\"\"\"\n",
954
+ " }]\n",
955
+ " )\n",
956
+ " return message.choices[0].message.content\n",
957
+ "\n",
958
+ "# Test it\n",
959
+ "answer = query_groq(result[\"context\"], result[\"query\"])\n",
960
+ "print(f\"\\n🤖 LLM ANSWER:\\n{answer}\")"
961
+ ]
962
+ },
963
+ {
964
+ "cell_type": "code",
965
+ "execution_count": null,
966
+ "id": "8866fa43",
967
+ "metadata": {},
968
+ "outputs": [],
969
+ "source": []
970
+ }
971
+ ],
972
+ "metadata": {
973
+ "kernelspec": {
974
+ "display_name": ".venv",
975
+ "language": "python",
976
+ "name": "python3"
977
+ },
978
+ "language_info": {
979
+ "codemirror_mode": {
980
+ "name": "ipython",
981
+ "version": 3
982
+ },
983
+ "file_extension": ".py",
984
+ "mimetype": "text/x-python",
985
+ "name": "python",
986
+ "nbconvert_exporter": "python",
987
+ "pygments_lexer": "ipython3",
988
+ "version": "3.12.5"
989
+ }
990
+ },
991
+ "nbformat": 4,
992
+ "nbformat_minor": 5
993
+ }
notebooks/02_test_modules.ipynb ADDED
@@ -0,0 +1,596 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 1,
6
+ "id": "6cade155",
7
+ "metadata": {},
8
+ "outputs": [
9
+ {
10
+ "name": "stdout",
11
+ "output_type": "stream",
12
+ "text": [
13
+ "d:\\projects\\doc-intelligence-rag\\.venv\\Scripts\\python.exe\n"
14
+ ]
15
+ }
16
+ ],
17
+ "source": [
18
+ "import sys\n",
19
+ "sys.path.append(\"../\")\n",
20
+ "print(sys.executable)"
21
+ ]
22
+ },
23
+ {
24
+ "cell_type": "code",
25
+ "execution_count": 3,
26
+ "id": "8b183487",
27
+ "metadata": {},
28
+ "outputs": [
29
+ {
30
+ "name": "stdout",
31
+ "output_type": "stream",
32
+ "text": [
33
+ "Loaded .env from: d:\\projects\\doc-intelligence-rag\\notebooks\\..\\src\\rag\\../..\\.env\n",
34
+ "✓ Chunker works: 6 chunks\n",
35
+ "Split into 6 chunks:\n",
36
+ " Chunk 0: 12 words | Machine Learning is a subset of artificial intelligence that\n",
37
+ " Chunk 1: 12 words | training models to make predictions or decisions based on da\n",
38
+ " Chunk 2: 12 words | It is a powerful tool for solving a wide range of problems,\n",
39
+ " Chunk 3: 12 words | of problems, from image recognition to natural language proc\n",
40
+ " Chunk 4: 12 words | this article, we will explore the basics of machine learning\n",
41
+ " Chunk 5: 10 words | and how it can be used to solve real-world problems.\n"
42
+ ]
43
+ }
44
+ ],
45
+ "source": [
46
+ "# Test chunker\n",
47
+ "from src.rag import chunk_text\n",
48
+ "\n",
49
+ "text = \"Machine Learning is a subset of artificial intelligence that involves training models to make predictions or decisions based on data. It is a powerful tool for solving a wide range of problems, from image recognition to natural language processing. In this article, we will explore the basics of machine learning and how it can be used to solve real-world problems.\"\n",
50
+ "chunks = chunk_text(text, chunk_size=12, overlap=2)\n",
51
+ "print(f\"✓ Chunker works: {len(chunks)} chunks\")\n",
52
+ "print(f\"Split into {len(chunks)} chunks:\")\n",
53
+ "for chunk in chunks:\n",
54
+ " print(f\" Chunk {chunk.chunk_id}: {chunk.word_count} words | {chunk.text[:60]}\")"
55
+ ]
56
+ },
57
+ {
58
+ "cell_type": "code",
59
+ "execution_count": 4,
60
+ "id": "f4c201be",
61
+ "metadata": {},
62
+ "outputs": [
63
+ {
64
+ "name": "stderr",
65
+ "output_type": "stream",
66
+ "text": [
67
+ "INFO:src.rag.embeddings:✓ Connected to Ollama at http://localhost:11434\n"
68
+ ]
69
+ },
70
+ {
71
+ "name": "stdout",
72
+ "output_type": "stream",
73
+ "text": [
74
+ "✓ Embeddings work: 768 dimensions\n"
75
+ ]
76
+ }
77
+ ],
78
+ "source": [
79
+ "# check embeddings\n",
80
+ "\n",
81
+ "from src.rag import OllamaEmbeddingClient\n",
82
+ "\n",
83
+ "client = OllamaEmbeddingClient()\n",
84
+ "embedding = client.embed(text)\n",
85
+ "print(f\"✓ Embeddings work: {len(embedding)} dimensions\")"
86
+ ]
87
+ },
88
+ {
89
+ "cell_type": "code",
90
+ "execution_count": 5,
91
+ "id": "04144e58",
92
+ "metadata": {},
93
+ "outputs": [
94
+ {
95
+ "name": "stderr",
96
+ "output_type": "stream",
97
+ "text": [
98
+ "INFO:chromadb.telemetry.product.posthog:Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.\n",
99
+ "INFO:src.rag.vector_store:✓ Initialized Chroma vector store at .chromadb_test (collection: rag)\n",
100
+ "INFO:src.rag.vector_store:Cleared vector store\n"
101
+ ]
102
+ },
103
+ {
104
+ "name": "stdout",
105
+ "output_type": "stream",
106
+ "text": [
107
+ "Chunk ID: chunk1, Similarity: 1.00, Text: ml\n"
108
+ ]
109
+ }
110
+ ],
111
+ "source": [
112
+ "# check vector store\n",
113
+ "\n",
114
+ "from src.rag import ChromaVectorStore\n",
115
+ "\n",
116
+ "store = ChromaVectorStore(persist_directory=\".chromadb_test\")\n",
117
+ "store.clear()\n",
118
+ "# Add chunks\n",
119
+ "store.add(\"chunk1\", \"ml\", embedding, metadata={\"source\": \"test\"})\n",
120
+ "\n",
121
+ "# Retrieve\n",
122
+ "results = store.retrieve(embedding, top_k=1)\n",
123
+ "for r in results:\n",
124
+ " print(f\"Chunk ID: {r.chunk_id}, Similarity: {r.similarity:.2f}, Text: {r.text}\")\n"
125
+ ]
126
+ },
127
+ {
128
+ "cell_type": "code",
129
+ "execution_count": 6,
130
+ "id": "80ccd988",
131
+ "metadata": {},
132
+ "outputs": [
133
+ {
134
+ "name": "stderr",
135
+ "output_type": "stream",
136
+ "text": [
137
+ "INFO:src.rag.llm:Groq LLM client initialized with model: llama-3.1-8b-instant\n",
138
+ "INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/chat/completions \"HTTP/1.1 200 OK\"\n"
139
+ ]
140
+ },
141
+ {
142
+ "name": "stdout",
143
+ "output_type": "stream",
144
+ "text": [
145
+ "✓ LLM works: I am a helpful assistant....\n"
146
+ ]
147
+ }
148
+ ],
149
+ "source": [
150
+ "# test groq llm\n",
151
+ "from src.rag import GroqLLMClient\n",
152
+ "import os\n",
153
+ "from dotenv import load_dotenv\n",
154
+ "load_dotenv()\n",
155
+ "api_key = os.getenv(\"GROQ_API_KEY\")\n",
156
+ "\n",
157
+ "llm = GroqLLMClient(api_key=api_key)\n",
158
+ "answer = llm.query(\"Context: Hello\", \"Query: Who are you?\")\n",
159
+ "print(f\"✓ LLM works: {answer[:50]}...\")"
160
+ ]
161
+ },
162
+ {
163
+ "cell_type": "code",
164
+ "execution_count": 7,
165
+ "id": "6c95cec2",
166
+ "metadata": {},
167
+ "outputs": [
168
+ {
169
+ "name": "stderr",
170
+ "output_type": "stream",
171
+ "text": [
172
+ "INFO:src.rag.pipeline:Initializing RAG Pipeline...\n",
173
+ "INFO:src.rag.embeddings:✓ Connected to Ollama at http://localhost:11434\n",
174
+ "INFO:src.rag.pipeline:✓ Embeddings client ready\n",
175
+ "INFO:src.rag.llm:Groq LLM client initialized with model: llama-3.1-8b-instant\n",
176
+ "INFO:src.rag.pipeline:✓ LLM client ready\n",
177
+ "INFO:chromadb.telemetry.product.posthog:Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.\n",
178
+ "INFO:src.rag.vector_store:✓ Initialized Chroma vector store at .chromadb (collection: rag)\n",
179
+ "INFO:src.rag.pipeline:✓ Vector store ready\n",
180
+ "INFO:src.rag.pipeline:✓ RAG Pipeline initialized\n",
181
+ "INFO:src.rag.pipeline:Ingesting document: doc1\n",
182
+ "INFO:src.rag.pipeline:✓ Chunks created: 1\n",
183
+ "INFO:src.rag.pipeline:✓ Embedded 1/1 chunks\n",
184
+ "INFO:src.rag.pipeline:Querying: What will you explore?\n",
185
+ "INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
186
+ "INFO:src.rag.pipeline:Query complete: success\n"
187
+ ]
188
+ },
189
+ {
190
+ "name": "stdout",
191
+ "output_type": "stream",
192
+ "text": [
193
+ "✓ Pipeline works!\n",
194
+ "Answer: Based on the provided context, it's not explicitly stated what will be explored.\n"
195
+ ]
196
+ }
197
+ ],
198
+ "source": [
199
+ "#test full pipeline\n",
200
+ "from src.rag import RAGPipeline\n",
201
+ "\n",
202
+ "pipeline = RAGPipeline()\n",
203
+ "pipeline.ingest(\"doc1\", text)\n",
204
+ "result = pipeline.query(\"What will you explore?\")\n",
205
+ "print(f\"✓ Pipeline works!\")\n",
206
+ "print(f\"Answer: {result['answer']}\")"
207
+ ]
208
+ },
209
+ {
210
+ "cell_type": "code",
211
+ "execution_count": 6,
212
+ "id": "b6aa9c72",
213
+ "metadata": {},
214
+ "outputs": [
215
+ {
216
+ "name": "stderr",
217
+ "output_type": "stream",
218
+ "text": [
219
+ "INFO:src.rag.pipeline:Initializing RAG Pipeline...\n",
220
+ "INFO:src.rag.embeddings:✓ Connected to Ollama at http://localhost:11434\n",
221
+ "INFO:src.rag.pipeline:✓ Embeddings client ready\n",
222
+ "INFO:src.rag.llm:Groq LLM client initialized with model: llama-3.1-8b-instant\n",
223
+ "INFO:src.rag.pipeline:✓ LLM client ready\n",
224
+ "INFO:src.rag.vector_store:✓ Initialized Chroma vector store at .chromadb (collection: rag)\n",
225
+ "INFO:src.rag.pipeline:✓ Vector store ready\n",
226
+ "INFO:src.rag.pipeline:✓ RAG Pipeline initialized\n",
227
+ "INFO:src.rag.pdf_processor:Processing folder: d:\\projects\\doc-intelligence-rag\\papers\n",
228
+ "INFO:src.rag.pdf_processor:Found 3 PDF files\n",
229
+ "INFO:src.rag.pdf_processor:Processing PDF: CMBFSCNN.pdf\n"
230
+ ]
231
+ },
232
+ {
233
+ "name": "stdout",
234
+ "output_type": "stream",
235
+ "text": [
236
+ "[1] Ingesting all PDFs from 'papers' folder...\n"
237
+ ]
238
+ },
239
+ {
240
+ "name": "stderr",
241
+ "output_type": "stream",
242
+ "text": [
243
+ "INFO:src.rag.pdf_processor:✓ Extracted 26 pages, 90013 chars\n",
244
+ "INFO:src.rag.pdf_processor:Processing PDF: CVPR Version 16723_CMB_ML_A_Cosmic_Microwav.pdf\n",
245
+ "INFO:src.rag.pdf_processor:✓ Extracted 11 pages, 53068 chars\n",
246
+ "INFO:src.rag.pdf_processor:Processing PDF: Petroff20 - Cleaning CMB with ML.pdf\n",
247
+ "INFO:src.rag.pdf_processor:✓ Extracted 11 pages, 43264 chars\n",
248
+ "INFO:src.rag.pdf_processor:✓ Processed 3 PDFs successfully\n",
249
+ "INFO:src.rag.pipeline:Ingesting document: CMBFSCNN\n",
250
+ "INFO:src.rag.pipeline:✓ Chunks created: 33\n",
251
+ "INFO:src.rag.pipeline:✓ Embedded 33/33 chunks\n",
252
+ "INFO:src.rag.pipeline:Ingesting document: CVPR Version 16723_CMB_ML_A_Cosmic_Microwav\n",
253
+ "INFO:src.rag.pipeline:✓ Chunks created: 19\n",
254
+ "INFO:src.rag.pipeline:✓ Embedded 19/19 chunks\n",
255
+ "INFO:src.rag.pipeline:Ingesting document: Petroff20 - Cleaning CMB with ML\n",
256
+ "INFO:src.rag.pipeline:✓ Chunks created: 16\n",
257
+ "INFO:src.rag.pipeline:✓ Embedded 16/16 chunks\n",
258
+ "INFO:src.rag.pipeline:Querying: What are the main findings?\n"
259
+ ]
260
+ },
261
+ {
262
+ "name": "stdout",
263
+ "output_type": "stream",
264
+ "text": [
265
+ "\n",
266
+ "[2] Ingestion Summary:\n",
267
+ " CMBFSCNN: 33 chunks\n",
268
+ " CVPR Version 16723_CMB_ML_A_Cosmic_Microwav: 19 chunks\n",
269
+ " Petroff20 - Cleaning CMB with ML: 16 chunks\n",
270
+ " Total: 68 chunks\n",
271
+ "\n",
272
+ "[3] Querying...\n"
273
+ ]
274
+ },
275
+ {
276
+ "name": "stderr",
277
+ "output_type": "stream",
278
+ "text": [
279
+ "INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
280
+ "INFO:src.rag.pipeline:Query complete: success\n"
281
+ ]
282
+ },
283
+ {
284
+ "name": "stdout",
285
+ "output_type": "stream",
286
+ "text": [
287
+ "\n",
288
+ "📝 Question: What are the main findings?\n",
289
+ "\n",
290
+ "🤖 Answer:\n",
291
+ "The provided context does not explicitly contain information about the main findings. It appears to be a collection of scientific texts discussing the Cosmic Microwave Background (CMB) and its analysis. The main topics covered include the CMB's power spectrum, contamination by foreground radiations, and the process of component separation.\n",
292
+ "\n",
293
+ "📚 Sources (3 chunks):\n",
294
+ " - [57.3%] in- frared background (CIB) is a different diffuse extragalactic source. 0 200 4...\n",
295
+ " - [57.1%] sion [1] and LiteBird [26], are expected to probe some of the 084 deepest myster...\n",
296
+ " - [57.0%] primordial B mode originates from the primordial gravitational waves predicted b...\n"
297
+ ]
298
+ }
299
+ ],
300
+ "source": [
301
+ "# test pdf processor\n",
302
+ "\n",
303
+ "from src.rag import RAGPipeline\n",
304
+ "import logging\n",
305
+ "import os\n",
306
+ "\n",
307
+ "project_root = os.path.abspath(os.path.join(os.getcwd(), \"..\"))\n",
308
+ "papers_path = os.path.join(project_root, \"papers\")\n",
309
+ "\n",
310
+ "logging.basicConfig(level=logging.INFO)\n",
311
+ "\n",
312
+ "# Initialize pipeline\n",
313
+ "pipeline = RAGPipeline()\n",
314
+ "\n",
315
+ "# Option A: Ingest a single PDF\n",
316
+ "# result = pipeline.ingest_pdf(\"papers/your_paper.pdf\")\n",
317
+ "# print(f\"✓ Ingested {result['chunks_embedded']} chunks\")\n",
318
+ "\n",
319
+ "# Option B: Ingest all PDFs from folder (RECOMMENDED)\n",
320
+ "if os.path.exists(papers_path):\n",
321
+ " print(\"[1] Ingesting all PDFs from 'papers' folder...\")\n",
322
+ " results = pipeline.ingest_folder(papers_path)\n",
323
+ " \n",
324
+ " print(f\"\\n[2] Ingestion Summary:\")\n",
325
+ " total_chunks = 0\n",
326
+ " for doc_id, result in results.items():\n",
327
+ " print(f\" {doc_id}: {result['chunks_embedded']} chunks\")\n",
328
+ " total_chunks += result['chunks_embedded']\n",
329
+ " print(f\" Total: {total_chunks} chunks\")\n",
330
+ " \n",
331
+ " # Now query!\n",
332
+ " print(f\"\\n[3] Querying...\")\n",
333
+ " query = \"What are the main findings?\" # Change this to your question\n",
334
+ " result = pipeline.query(query)\n",
335
+ " \n",
336
+ " print(f\"\\n📝 Question: {result['query']}\")\n",
337
+ " print(f\"\\n🤖 Answer:\\n{result['answer']}\")\n",
338
+ " print(f\"\\n📚 Sources ({result['chunks_used']} chunks):\")\n",
339
+ " for source in result['sources']:\n",
340
+ " print(f\" - [{source['similarity']:.1%}] {source['preview'][:80]}...\")\n",
341
+ "else:\n",
342
+ " print(\"❌ No 'papers' folder found. Create one and add PDFs!\")"
343
+ ]
344
+ },
345
+ {
346
+ "cell_type": "code",
347
+ "execution_count": 7,
348
+ "id": "441624d7",
349
+ "metadata": {},
350
+ "outputs": [
351
+ {
352
+ "name": "stderr",
353
+ "output_type": "stream",
354
+ "text": [
355
+ "INFO:src.rag.pipeline:Querying: What is the Cosmic Microwave Background?\n"
356
+ ]
357
+ },
358
+ {
359
+ "name": "stdout",
360
+ "output_type": "stream",
361
+ "text": [
362
+ "\n",
363
+ "============================================================\n",
364
+ "Q: What is the Cosmic Microwave Background?\n",
365
+ "============================================================\n"
366
+ ]
367
+ },
368
+ {
369
+ "name": "stderr",
370
+ "output_type": "stream",
371
+ "text": [
372
+ "INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
373
+ "INFO:src.rag.pipeline:Query complete: success\n",
374
+ "INFO:src.rag.pipeline:Querying: How is machine learning used to analyze CMB data?\n"
375
+ ]
376
+ },
377
+ {
378
+ "name": "stdout",
379
+ "output_type": "stream",
380
+ "text": [
381
+ "\n",
382
+ "A: The Cosmic Microwave Background (CMB) has characteristic modes which are visible as consistently sized lumps and has a first peak at ℓ≈200, which corresponds roughly to 1 deg, about the size of the largest lumps visible in the CMB.\n",
383
+ "\n",
384
+ "Sources (3 chunks):\n",
385
+ " [60.1%] Machine learning is AI. Deep learning uses networks....\n",
386
+ " [46.4%] in- frared background (CIB) is a different diffuse extragala...\n",
387
+ "\n",
388
+ "============================================================\n",
389
+ "Q: How is machine learning used to analyze CMB data?\n",
390
+ "============================================================\n"
391
+ ]
392
+ },
393
+ {
394
+ "name": "stderr",
395
+ "output_type": "stream",
396
+ "text": [
397
+ "INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
398
+ "INFO:src.rag.pipeline:Query complete: success\n",
399
+ "INFO:src.rag.pipeline:Querying: What are foreground contaminations in CMB?\n"
400
+ ]
401
+ },
402
+ {
403
+ "name": "stdout",
404
+ "output_type": "stream",
405
+ "text": [
406
+ "\n",
407
+ "A: Based on the provided context, it appears that machine learning is mentioned as relevant to the analysis of CMB data in Chunk 1, but no specific information on how it is used is provided. However, in Chunk 2, it is mentioned that CMB-ML stands out distinctly because it includes Monte-Carlo simulations, which are not included in other software such as PySM3 and PSM.\n",
408
+ "\n",
409
+ "In Chunk 3, there are several references to machine learning and its applications in cosmology, including the use of deep learning for deblurring and the use of convolutional neural networks for component separation in CMB analysis. Specifically, reference [57] discusses the use of machine learning for foreground cleaning in CMB data, and reference [58] presents a method for CMB component separation using convolutional neural networks.\n",
410
+ "\n",
411
+ "Therefore, while the provided context does not provide a comprehensive answer, it suggests that machine learning is used in various ways to analyze CMB data, including simulation, deblurring, and component separation.\n",
412
+ "\n",
413
+ "Sources (3 chunks):\n",
414
+ " [62.6%] Machine learning is AI. Deep learning uses networks....\n",
415
+ " [59.9%] experiment [17, 60]. 131 Many other examples exist, but all ...\n",
416
+ "\n",
417
+ "============================================================\n",
418
+ "Q: What are foreground contaminations in CMB?\n",
419
+ "============================================================\n"
420
+ ]
421
+ },
422
+ {
423
+ "name": "stderr",
424
+ "output_type": "stream",
425
+ "text": [
426
+ "INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
427
+ "INFO:src.rag.pipeline:Query complete: success\n",
428
+ "INFO:src.rag.pipeline:Querying: What component separation methods are discussed?\n"
429
+ ]
430
+ },
431
+ {
432
+ "name": "stdout",
433
+ "output_type": "stream",
434
+ "text": [
435
+ "\n",
436
+ "A: Foreground contaminations in CMB refer to the various sources of radiation that contaminate the Cosmic Microwave Background (CMB) signal, making it difficult to accurately separate the primordial CMB signal from the foregrounds. These contaminants include Galactic polarized radiation, which tends to be brighter than the primordial B-mode signal, as well as other astrophysical sources such as point sources, diffuse sources, and extragalactic signals.\n",
437
+ "\n",
438
+ "Sources (3 chunks):\n",
439
+ " [63.4%] in- frared background (CIB) is a different diffuse extragala...\n",
440
+ " [59.7%] primordial B mode originates from the primordial gravitation...\n",
441
+ "\n",
442
+ "============================================================\n",
443
+ "Q: What component separation methods are discussed?\n",
444
+ "============================================================\n"
445
+ ]
446
+ },
447
+ {
448
+ "name": "stderr",
449
+ "output_type": "stream",
450
+ "text": [
451
+ "INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/chat/completions \"HTTP/1.1 429 Too Many Requests\"\n",
452
+ "INFO:groq._base_client:Retrying request to /openai/v1/chat/completions in 25.000000 seconds\n",
453
+ "INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
454
+ "INFO:src.rag.pipeline:Query complete: success\n",
455
+ "INFO:src.rag.pipeline:Querying: What deep learning architectures are used?\n"
456
+ ]
457
+ },
458
+ {
459
+ "name": "stdout",
460
+ "output_type": "stream",
461
+ "text": [
462
+ "\n",
463
+ "A: The following component separation methods are discussed:\n",
464
+ "\n",
465
+ "1. Internal Linear Combination (ILC)\n",
466
+ "2. Needlet (NILC)\n",
467
+ "3. Scale Discretized (SILC)\n",
468
+ "4. Hierarchical Morphological Component Analysis (HGMCA)\n",
469
+ "5. Convolutional Neural Network (CNN)-based methods (CMBFSCNN is specifically mentioned)\n",
470
+ "\n",
471
+ "Sources (3 chunks):\n",
472
+ " [55.4%] 497 [4] Yashar Akrami, M Ashdown, Jonathan Aumont, Carlo Bac...\n",
473
+ " [55.0%] in several next generation CMB experiments, such as the CMB-...\n",
474
+ "\n",
475
+ "============================================================\n",
476
+ "Q: What deep learning architectures are used?\n",
477
+ "============================================================\n"
478
+ ]
479
+ },
480
+ {
481
+ "name": "stderr",
482
+ "output_type": "stream",
483
+ "text": [
484
+ "INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/chat/completions \"HTTP/1.1 429 Too Many Requests\"\n",
485
+ "INFO:groq._base_client:Retrying request to /openai/v1/chat/completions in 23.000000 seconds\n",
486
+ "INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
487
+ "INFO:src.rag.pipeline:Query complete: success\n",
488
+ "INFO:src.rag.pipeline:Querying: What datasets are mentioned?\n"
489
+ ]
490
+ },
491
+ {
492
+ "name": "stdout",
493
+ "output_type": "stream",
494
+ "text": [
495
+ "\n",
496
+ "A: Based on the provided context, the following deep learning architectures are mentioned:\n",
497
+ "\n",
498
+ "1. Convolutional Neural Network (CNN) - mentioned in several references, including [51], [56], and [61].\n",
499
+ "2. Multi-scale Convolutional Neural Network - mentioned in [53].\n",
500
+ "3. Spherical Convolutional Neural Network - mentioned in [56] (DeepSphere).\n",
501
+ "4. U-Net - mentioned in [61].\n",
502
+ "\n",
503
+ "Note that PyTorch is also mentioned, but it is a deep learning library rather than a specific architecture.\n",
504
+ "\n",
505
+ "Sources (3 chunks):\n",
506
+ " [63.9%] Machine learning is AI. Deep learning uses networks....\n",
507
+ " [63.3%] Advances in Neural Information Processing Systems 30, ed. I....\n",
508
+ "\n",
509
+ "============================================================\n",
510
+ "Q: What datasets are mentioned?\n",
511
+ "============================================================\n"
512
+ ]
513
+ },
514
+ {
515
+ "name": "stderr",
516
+ "output_type": "stream",
517
+ "text": [
518
+ "INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/chat/completions \"HTTP/1.1 429 Too Many Requests\"\n",
519
+ "INFO:groq._base_client:Retrying request to /openai/v1/chat/completions in 14.000000 seconds\n",
520
+ "INFO:httpx:HTTP Request: POST https://api.groq.com/openai/v1/chat/completions \"HTTP/1.1 200 OK\"\n",
521
+ "INFO:src.rag.pipeline:Query complete: success\n"
522
+ ]
523
+ },
524
+ {
525
+ "name": "stdout",
526
+ "output_type": "stream",
527
+ "text": [
528
+ "\n",
529
+ "A: Based on the provided context, the following datasets are mentioned:\n",
530
+ "\n",
531
+ "1. The CMB-ML dataset, which is a unified framework for dataset generation and model evaluation.\n",
532
+ "2. The CMB-S4 science mission dataset.\n",
533
+ "3. The LiteBird dataset.\n",
534
+ "4. The COBE (Cosmic Background Explorer) dataset.\n",
535
+ "5. The Planck Mission dataset.\n",
536
+ "6. The Cosmic Infrared Background (CIB) dataset.\n",
537
+ "\n",
538
+ "Sources (3 chunks):\n",
539
+ " [62.0%] REVIEW COPY. DO NOT DISTRIBUTE. physics methods, each of the...\n",
540
+ " [58.2%] in- frared background (CIB) is a different diffuse extragala...\n"
541
+ ]
542
+ }
543
+ ],
544
+ "source": [
545
+ "queries = [\n",
546
+ " \"What is the Cosmic Microwave Background?\",\n",
547
+ " \"How is machine learning used to analyze CMB data?\",\n",
548
+ " \"What are foreground contaminations in CMB?\",\n",
549
+ " \"What component separation methods are discussed?\",\n",
550
+ " \"What deep learning architectures are used?\",\n",
551
+ " \"What datasets are mentioned?\",\n",
552
+ "]\n",
553
+ "\n",
554
+ "for query in queries:\n",
555
+ " print(f\"\\n{'='*60}\")\n",
556
+ " print(f\"Q: {query}\")\n",
557
+ " print('='*60)\n",
558
+ " \n",
559
+ " result = pipeline.query(query)\n",
560
+ " print(f\"\\nA: {result['answer']}\")\n",
561
+ " print(f\"\\nSources ({result['chunks_used']} chunks):\")\n",
562
+ " for source in result['sources'][:2]: # Show top 2\n",
563
+ " print(f\" [{source['similarity']:.1%}] {source['preview'][:60]}...\")"
564
+ ]
565
+ },
566
+ {
567
+ "cell_type": "code",
568
+ "execution_count": null,
569
+ "id": "76a5da4b",
570
+ "metadata": {},
571
+ "outputs": [],
572
+ "source": []
573
+ }
574
+ ],
575
+ "metadata": {
576
+ "kernelspec": {
577
+ "display_name": ".venv",
578
+ "language": "python",
579
+ "name": "python3"
580
+ },
581
+ "language_info": {
582
+ "codemirror_mode": {
583
+ "name": "ipython",
584
+ "version": 3
585
+ },
586
+ "file_extension": ".py",
587
+ "mimetype": "text/x-python",
588
+ "name": "python",
589
+ "nbconvert_exporter": "python",
590
+ "pygments_lexer": "ipython3",
591
+ "version": "3.12.5"
592
+ }
593
+ },
594
+ "nbformat": 4,
595
+ "nbformat_minor": 5
596
+ }
pyproject-local.toml ADDED
@@ -0,0 +1,23 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [project]
2
+ name = "doc-intelligence-rag"
3
+ version = "0.1.0"
4
+ description = "Document intelligence RAG API (FastAPI, ChromaDB, Groq)"
5
+ readme = "README.md"
6
+ requires-python = ">=3.12"
7
+ dependencies = [
8
+ "chromadb>=1.3.7",
9
+ "fastapi>=0.124.4",
10
+ "groq>=0.37.1",
11
+ "jupyter>=1.1.1",
12
+ "numpy>=2.3.5",
13
+ "ollama>=0.6.1",
14
+ "pandas>=2.3.3",
15
+ "pdfplumber>=0.11.8",
16
+ "pydantic>=2.12.5",
17
+ "pypdf2>=3.0.1",
18
+ "python-dotenv>=1.2.1",
19
+ "python-multipart>=0.0.20",
20
+ "requests>=2.32.5",
21
+ "sentence-transformers>=5.2.0",
22
+ "uvicorn[standard]>=0.38.0",
23
+ ]
pyproject.toml ADDED
@@ -0,0 +1,21 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [project]
2
+ name = "doc-intelligence-rag"
3
+ version = "0.1.0"
4
+ description = "Document intelligence RAG API (FastAPI, ChromaDB, Groq)"
5
+ readme = "README.md"
6
+ requires-python = ">=3.12"
7
+ dependencies = [
8
+ "chromadb[cpu]>=1.3.7",
9
+ "fastapi>=0.124.4",
10
+ "groq>=0.37.1",
11
+ "numpy>=2.3.5",
12
+ "pandas>=2.3.3",
13
+ "pdfplumber>=0.11.8",
14
+ "pydantic>=2.12.5",
15
+ "pypdf2>=3.0.1",
16
+ "python-dotenv>=1.2.1",
17
+ "python-multipart>=0.0.20",
18
+ "requests>=2.32.5",
19
+ "sentence-transformers>=5.2.0",
20
+ "uvicorn[standard]>=0.38.0"
21
+ ]
requirements-railway.txt ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ fastapi>=0.124.4
2
+ uvicorn[standard]>=0.38.0
3
+ pydantic>=2.12.5
4
+ python-multipart>=0.0.20
5
+ python-dotenv>=1.2.1
6
+ requests>=2.32.5
7
+ chromadb[cpu]>=1.3.7
8
+ groq>=0.37.1
9
+ pdfplumber>=0.11.8
10
+ pypdf2>=3.0.1
11
+ numpy>=2.3.5
12
+ pandas>=2.3.3
13
+ # sentence-transformers is pinned here; its CPU-only torch dependency is installed separately in the Dockerfile
14
+ sentence-transformers>=5.2.0
src/main.py ADDED
@@ -0,0 +1,504 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from fastapi import FastAPI, File, UploadFile, HTTPException
2
+ from fastapi.responses import JSONResponse, FileResponse
3
+ from fastapi.middleware.cors import CORSMiddleware
4
+ from fastapi.staticfiles import StaticFiles
5
+ from pydantic import BaseModel
6
+ import logging
7
+ import os
8
+ from typing import List, Optional
9
+ from datetime import datetime
10
+ import tempfile
11
+ from pathlib import Path
12
+
13
+ from src.rag import RAGPipeline, RAGConfig
14
+
15
# ==================== Setup ====================

# Configure root logging once for the whole process; module loggers
# (src.rag.*) inherit this format and level.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Initialize FastAPI app (interactive docs at /docs, ReDoc at /redoc)
app = FastAPI(
    title="Document Intelligence RAG",
    description="RAG system for analyzing documents with LLM",
    version="1.0.0",
    docs_url="/docs",
    redoc_url="/redoc"
)

# Add CORS middleware.
# NOTE(review): allow_origins=["*"] combined with allow_credentials=True is
# rejected by browsers for credentialed requests — confirm whether cookies /
# credentials are actually required here.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Serve frontend static files, but only when a local "frontend" directory
# exists (it may be absent in API-only deployments).
if os.path.exists("frontend"):
    app.mount("/static", StaticFiles(directory="frontend"), name="static")

# Global pipeline instance; populated by the startup handler below and
# checked for None by every endpoint.
pipeline: Optional[RAGPipeline] = None
48
+
49
+
50
+ # ==================== Pydantic Models ====================
51
+
52
class QueryRequest(BaseModel):
    """Request body for query endpoint."""
    query: str      # natural-language question to answer from ingested docs
    top_k: int = 3  # number of chunks to retrieve as context
56
+
57
+
58
class QueryResponse(BaseModel):
    """Response for query."""
    query: str            # the question as received
    answer: str           # LLM-generated answer
    sources: List[dict]   # retrieved chunks (similarity, preview, ...)
    chunks_used: int      # how many chunks were passed to the LLM
    response_time: float  # end-to-end latency in seconds (rounded)
    status: str           # pipeline status string, e.g. "success"
66
+
67
+
68
class IngestResponse(BaseModel):
    """Response for ingestion."""
    doc_id: str           # identifier assigned by the pipeline
    filename: str         # original uploaded filename
    chunks_created: int   # chunks produced by the text splitter
    chunks_embedded: int  # chunks successfully embedded and stored
    status: str           # pipeline status string
    timestamp: str        # ISO-8601 server time of the response
76
+
77
+
78
class IngestFolderResponse(BaseModel):
    """Response for folder ingestion."""
    total_documents: int   # number of PDFs ingested
    total_chunks: int      # sum of embedded chunks across all documents
    documents: List[dict]  # per-document {"doc_id", "chunks"} summaries
    timestamp: str         # ISO-8601 server time of the response
84
+
85
+
86
class HealthResponse(BaseModel):
    """Response for health check."""
    status: str             # "healthy" or "degraded"
    embedding_backend: str  # configured embeddings backend name
    groq: str               # "✓"/"✗" marker for the LLM client
    chroma: dict            # {"status": "✓"/"✗", "chunks": <count>}
    timestamp: str          # ISO-8601 server time of the response
93
+
94
+
95
class StatsResponse(BaseModel):
    """Response for stats."""
    total_chunks: int  # chunks currently held in the vector store
    config: dict       # pipeline configuration snapshot
    timestamp: str     # ISO-8601 server time of the response
100
+
101
+
102
+ # ==================== Startup/Shutdown ====================
103
+
104
@app.on_event("startup")
async def startup_event():
    """Create the module-level RAGPipeline once when the server boots.

    Leaves ``pipeline`` set on success; re-raises on failure so the server
    refuses to start with a broken pipeline.

    NOTE(review): ``@app.on_event`` is deprecated in recent FastAPI releases
    in favor of lifespan handlers — consider migrating.
    """
    global pipeline

    banner = "=" * 60
    logger.info(banner)
    logger.info("Starting Document Intelligence RAG API")
    logger.info(banner)

    try:
        # RAGConfig reads EMBEDDING_BACKEND from the environment.
        cfg = RAGConfig(chunk_size=500, chunk_overlap=50, top_k=3)
        # The pipeline wires up embeddings, LLM and vector store internally.
        pipeline = RAGPipeline(config=cfg)
        logger.info("✓ Pipeline initialized successfully")
        logger.info(f"✓ Embedding backend: {cfg.embedding_backend}")
        logger.info("✓ API ready at http://localhost:8000")
        logger.info("✓ Interactive docs at http://localhost:8000/docs")
    except Exception as exc:
        logger.error(f"Failed to initialize pipeline: {exc}")
        raise
132
+
133
+
134
@app.on_event("shutdown")
async def shutdown_event():
    """Log a final message as the API process shuts down."""
    logger.info("Shutting down Document Intelligence RAG API")
138
+
139
+
140
+ # ==================== Health & Status ====================
141
+
142
@app.get("/health", response_model=HealthResponse)
async def health_check():
    """
    Check system health.

    Returns:
        Health status of all components (embeddings, LLM, vector store).

    Raises:
        HTTPException: 503 before startup finishes, 500 if a component
        probe itself fails (e.g. the Chroma collection is unreachable).
    """
    if not pipeline:
        raise HTTPException(status_code=503, detail="Pipeline not initialized")

    try:
        # Real presence checks (the original compared size() >= 0, which is
        # always true, and called size() twice per request).
        embeddings_ready = pipeline.embeddings is not None
        llm_ready = pipeline.llm is not None
        # A successful size() call is itself the Chroma liveness probe;
        # failures fall through to the except below.
        chunk_count = pipeline.vector_store.size()

        return HealthResponse(
            status="healthy" if (embeddings_ready and llm_ready) else "degraded",
            embedding_backend=pipeline.config.embedding_backend,
            groq="✓" if llm_ready else "✗",
            chroma={
                "status": "✓",
                "chunks": chunk_count
            },
            timestamp=datetime.now().isoformat()
        )

    except Exception as e:
        logger.error(f"Health check failed: {e}")
        raise HTTPException(status_code=500, detail=str(e))
173
+
174
+
175
@app.get("/stats", response_model=StatsResponse)
async def get_stats():
    """
    Get pipeline statistics.

    Returns:
        Current totals and configuration as reported by the pipeline.

    Raises:
        HTTPException: 503 before startup, 500 on any retrieval failure.
    """
    if not pipeline:
        raise HTTPException(status_code=503, detail="Pipeline not initialized")

    try:
        snapshot = pipeline.get_stats()
        return StatsResponse(
            total_chunks=snapshot['total_chunks'],
            config=snapshot['config'],
            timestamp=datetime.now().isoformat(),
        )
    except Exception as e:
        logger.error(f"Stats retrieval failed: {e}")
        raise HTTPException(status_code=500, detail=str(e))
198
+
199
+
200
+ # ==================== Ingestion Endpoints ====================
201
+
202
@app.post("/ingest", response_model=IngestResponse)
async def ingest_pdf(file: UploadFile = File(...)):
    """
    Upload and ingest a single PDF file.

    Args:
        file: PDF file to upload (any case variant of the .pdf extension).

    Returns:
        Ingestion result with doc_id and chunk count.

    Raises:
        HTTPException: 503 before startup, 400 for non-PDF uploads,
        500 when ingestion itself fails.

    Example:
        curl -X POST "http://localhost:8000/ingest" \
            -F "file=@research_paper.pdf"
    """
    if not pipeline:
        raise HTTPException(status_code=503, detail="Pipeline not initialized")

    # Guard against a missing filename (possible with raw multipart clients)
    # and accept ".PDF"/".Pdf" etc. — the original check was case-sensitive
    # and crashed with AttributeError when filename was None.
    if not file.filename or not file.filename.lower().endswith('.pdf'):
        raise HTTPException(status_code=400, detail="Only PDF files are supported")

    tmp_path = None
    try:
        # Save uploaded file to a temp location the PDF processor can read.
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp_file:
            contents = await file.read()
            tmp_file.write(contents)
            tmp_path = tmp_file.name

        logger.info(f"Processing uploaded PDF: {file.filename}")

        result = pipeline.ingest_pdf(tmp_path)

        return IngestResponse(
            doc_id=result['doc_id'],
            filename=file.filename,
            chunks_created=result['chunks_created'],
            chunks_embedded=result['chunks_embedded'],
            status=result['status'],
            timestamp=datetime.now().isoformat()
        )

    except Exception as e:
        logger.error(f"PDF ingestion failed: {e}")
        raise HTTPException(status_code=500, detail=f"Ingestion failed: {str(e)}")
    finally:
        # Always remove the temp file — the original only cleaned up on the
        # success path, leaking a file per failed ingestion.
        if tmp_path and os.path.exists(tmp_path):
            os.remove(tmp_path)
250
+
251
+
252
@app.post("/ingest-folder", response_model=IngestFolderResponse)
async def ingest_folder(folder_path: str):
    """
    Ingest every PDF found in a folder on the server's filesystem.

    Args:
        folder_path: Path to the folder containing PDFs (query parameter).

    Returns:
        Summary of all ingested documents and total chunk count.

    Raises:
        HTTPException: 503 before startup, 400 for a missing folder or a
        folder with no PDFs, 500 for any other ingestion failure.

    Example:
        curl -X POST "http://localhost:8000/ingest-folder?folder_path=./papers"
    """
    if not pipeline:
        raise HTTPException(status_code=503, detail="Pipeline not initialized")

    try:
        # Validate the path before handing it to the pipeline.
        if not os.path.exists(folder_path):
            raise HTTPException(status_code=400, detail=f"Folder not found: {folder_path}")

        logger.info(f"Ingesting folder: {folder_path}")

        ingestion = pipeline.ingest_folder(folder_path)
        if not ingestion:
            raise HTTPException(status_code=400, detail="No PDFs found in folder")

        # Per-document summaries, then aggregate the chunk total from them.
        doc_summaries = [
            {"doc_id": name, "chunks": info['chunks_embedded']}
            for name, info in ingestion.items()
        ]
        chunk_total = sum(entry["chunks"] for entry in doc_summaries)

        return IngestFolderResponse(
            total_documents=len(ingestion),
            total_chunks=chunk_total,
            documents=doc_summaries,
            timestamp=datetime.now().isoformat(),
        )

    except HTTPException:
        # Let our own 400s through untouched.
        raise
    except Exception as e:
        logger.error(f"Folder ingestion failed: {e}")
        raise HTTPException(status_code=500, detail=f"Ingestion failed: {str(e)}")
306
+
307
+
308
+ # ==================== Query Endpoint ====================
309
+
310
@app.post("/query", response_model=QueryResponse)
async def query(request: QueryRequest):
    """
    Answer a question against the ingested documents.

    Args:
        request: QueryRequest carrying 'query' and optional 'top_k'.

    Returns:
        QueryResponse with the answer, its sources, and timing metadata.

    Raises:
        HTTPException 503: pipeline not initialized.
        HTTPException 400: nothing has been ingested yet.
        HTTPException 500: the pipeline raised while answering.

    Example:
        curl -X POST "http://localhost:8000/query" \
             -H "Content-Type: application/json" \
             -d '{"query": "What is machine learning?", "top_k": 3}'
    """
    if not pipeline:
        raise HTTPException(status_code=503, detail="Pipeline not initialized")

    if pipeline.vector_store.size() == 0:
        raise HTTPException(
            status_code=400,
            detail="No documents ingested yet. Use /ingest endpoint first."
        )

    try:
        import time

        started_at = time.time()
        logger.info(f"Query: {request.query}")

        # Run retrieval + generation through the pipeline.
        answer_payload = pipeline.query(request.query, return_sources=True)
        elapsed = time.time() - started_at

        return QueryResponse(
            query=answer_payload['query'],
            answer=answer_payload['answer'],
            sources=answer_payload['sources'],
            chunks_used=answer_payload['chunks_used'],
            response_time=round(elapsed, 3),
            status=answer_payload['status']
        )

    except Exception as e:
        logger.error(f"Query failed: {e}")
        raise HTTPException(status_code=500, detail=f"Query failed: {str(e)}")
358
+
359
+
360
+ # ==================== Document Management ====================
361
+
362
@app.get("/documents")
async def list_documents():
    """
    Report how many chunks are stored and whether the system is queryable.

    Returns:
        Dict with total_chunks, a ready/empty status, and a timestamp.

    Raises:
        HTTPException 503: pipeline not initialized.
        HTTPException 500: the vector store could not be inspected.
    """
    if not pipeline:
        raise HTTPException(status_code=503, detail="Pipeline not initialized")

    try:
        chunk_count = pipeline.vector_store.size()
        state = "ready" if chunk_count > 0 else "empty"

        return {
            "total_chunks": chunk_count,
            "status": state,
            "timestamp": datetime.now().isoformat(),
        }

    except Exception as e:
        logger.error(f"Failed to list documents: {e}")
        raise HTTPException(status_code=500, detail=str(e))
385
+
386
+
387
@app.delete("/documents/{doc_id}")
async def delete_document(doc_id: str):
    """
    Acknowledge a deletion request for a document.

    NOTE: chunks are not actually removed yet — this endpoint only logs the
    request and reports it as queued; a production implementation would
    track each document's chunks and delete them.

    Args:
        doc_id: Document ID to delete.

    Returns:
        Dict describing the (queued) deletion.
    """
    if not pipeline:
        raise HTTPException(status_code=503, detail="Pipeline not initialized")

    try:
        logger.info(f"Deleting document: {doc_id}")

        return {
            "status": "success",
            "doc_id": doc_id,
            "message": "Document deletion queued",
            "timestamp": datetime.now().isoformat(),
        }

    except Exception as e:
        logger.error(f"Failed to delete document: {e}")
        raise HTTPException(status_code=500, detail=str(e))
416
+
417
+
418
@app.post("/reset")
async def reset_system():
    """
    Clear every stored document and embedding.

    WARNING: destructive — all stored embeddings are deleted.

    Returns:
        Dict confirming the reset.
    """
    global pipeline

    if not pipeline:
        raise HTTPException(status_code=503, detail="Pipeline not initialized")

    try:
        logger.warning("RESET: Clearing all documents and embeddings")

        # Drop everything held by the vector store.
        pipeline.vector_store.clear()

        logger.info("✓ System reset complete")

        return {
            "status": "success",
            "message": "All documents and embeddings cleared",
            "chunks_remaining": 0,
            "timestamp": datetime.now().isoformat(),
        }

    except Exception as e:
        logger.error(f"Reset failed: {e}")
        raise HTTPException(status_code=500, detail=str(e))
451
+
452
+
453
+ # ==================== Error Handlers ====================
454
+
455
@app.exception_handler(HTTPException)
async def http_exception_handler(request, exc):
    """Render HTTPExceptions as the API's uniform JSON error envelope."""
    body = {
        "error": exc.detail,
        "status": "error",
        "timestamp": datetime.now().isoformat(),
    }
    return JSONResponse(status_code=exc.status_code, content=body)
466
+
467
+
468
@app.exception_handler(Exception)
async def general_exception_handler(request, exc):
    """Catch-all handler: log the error, return an opaque 500 envelope."""
    logger.error(f"Unhandled exception: {exc}")
    payload = {
        "error": "Internal server error",
        "status": "error",
        "timestamp": datetime.now().isoformat(),
    }
    return JSONResponse(status_code=500, content=payload)
480
+
481
+
482
+ # ==================== Root Endpoint ====================
483
+
484
@app.get("/", response_class=FileResponse)
async def root():
    """
    Root endpoint — serve the web UI when bundled, else return API info.

    Fix: the no-frontend fallback is now wrapped in an explicit
    JSONResponse. This route declares ``response_class=FileResponse``, so
    returning a bare dict would make FastAPI try to build a FileResponse
    from it and fail; returning a Response instance bypasses
    ``response_class`` and serves the JSON correctly.
    """
    frontend_path = "frontend/index.html"
    if os.path.exists(frontend_path):
        return FileResponse(frontend_path)

    # If no frontend, return API info
    return JSONResponse(content={
        "name": "Document Intelligence RAG",
        "version": "1.0.0",
        "description": "RAG system for analyzing documents with LLM",
        "docs": "http://localhost:8000/docs",
        "health": "http://localhost:8000/health",
        "embedding_backend": pipeline.config.embedding_backend if pipeline else "initializing",
        "timestamp": datetime.now().isoformat()
    })
501
+
502
# Local development entry point; production deployments should launch
# uvicorn directly (e.g. `uvicorn app:app --host 0.0.0.0 --port 8000`).
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
src/rag/__init__.py ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
"""
RAG Package
===========

Modular Retrieval-Augmented Generation system with Ollama + Groq
"""

from .chunker import chunk_text, Chunk, chunk_documents
from .embeddings import OllamaEmbeddingClient, cosine_similarity
from .vector_store import ChromaVectorStore, RetrievalResult
from .llm import GroqLLMClient, build_context_string
from .pdf_processor import PDFProcessor
from .pipeline import RAGPipeline, RAGConfig

# Public API of the package.
# Fix: removed "SimpleVectorStore" — it was listed here but never imported
# above, so `from rag import *` raised AttributeError on that name.
__all__ = [
    # Chunking
    "chunk_text",
    "Chunk",
    "chunk_documents",
    # Embeddings
    "OllamaEmbeddingClient",
    "cosine_similarity",
    # Vector Store
    "ChromaVectorStore",
    "RetrievalResult",
    # LLM
    "GroqLLMClient",
    "build_context_string",
    # PDF Processing
    "PDFProcessor",
    # Pipeline
    "RAGPipeline",
    "RAGConfig",
]

__version__ = "0.1.0"
src/rag/chunker.py ADDED
@@ -0,0 +1,97 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Chunker module
3
+ --------------
4
+ Purpose: Split text into smaller chunks.
5
+ """
6
+
7
+ from typing import List, Dict
8
+ from dataclasses import dataclass
9
+
10
@dataclass
class Chunk:
    # A contiguous window of words cut from a source document.
    text: str        # the chunk's content, words re-joined with single spaces
    chunk_id: int    # sequential id within the source text, starting at 0
    start_idx: int   # word offset of the chunk's first word in the source
    word_count: int  # number of words in this chunk (<= chunk_size)


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[Chunk]:
    """
    Split text into overlapping word-window chunks.

    Args:
        text: The text to split into chunks.
        chunk_size: Number of words per chunk (the last chunk may be shorter).
        overlap: Number of words shared between consecutive chunks.

    Returns:
        List of Chunk objects covering the whole text in order.

    Raises:
        ValueError: if chunk_size <= 0, or overlap >= chunk_size.
            Fix: previously overlap == chunk_size raised an opaque
            ``range() arg 3 must not be zero`` and overlap > chunk_size
            silently returned [] (all text dropped); both are now rejected
            explicitly up front.
    """
    if chunk_size <= 0:
        raise ValueError(f"chunk_size must be positive, got {chunk_size}")
    if overlap >= chunk_size:
        raise ValueError(
            f"overlap ({overlap}) must be smaller than chunk_size ({chunk_size})"
        )

    words = text.split()
    if not words:
        return []

    # Step between chunk starts; positive by the guard above.
    stride = chunk_size - overlap

    chunks = []
    next_id = 0
    for start in range(0, len(words), stride):
        window = words[start:start + chunk_size]
        joined = ' '.join(window)

        # Skip windows that would produce an empty chunk.
        if not joined.strip():
            continue

        chunks.append(Chunk(
            text=joined,
            chunk_id=next_id,
            start_idx=start,
            word_count=len(window)
        ))
        next_id += 1

    return chunks
56
+
57
+
58
def chunk_documents(
    documents: Dict[str, str],
    chunk_size: int = 500,
    overlap: int = 50
) -> Dict[str, List[Chunk]]:
    """
    Chunk a collection of documents, one entry per document.

    Args:
        documents: Dict of {doc_id: text}
        chunk_size: Words per chunk
        overlap: Word overlap between consecutive chunks

    Returns:
        Dict of {doc_id: [chunks]}

    Example:
        >>> docs = {"doc1": "Text 1", "doc2": "Text 2"}
        >>> chunked = chunk_documents(docs)
        >>> "doc1" in chunked
        True
    """
    # Delegate the actual splitting to chunk_text, keyed by document id.
    return {
        doc_id: chunk_text(body, chunk_size, overlap)
        for doc_id, body in documents.items()
    }
87
+
88
# Demo/smoke test: chunk a short paragraph with a small window so that
# several overlapping chunks are produced and printed.
if __name__ == "__main__":
    text = """
    Machine Learning is a subset of artificial intelligence that involves training models to make predictions or decisions based on data. It is a powerful tool for solving a wide range of problems, from image recognition to natural language processing. In this article, we will explore the basics of machine learning and how it can be used to solve real-world problems.
    """

    chunks = chunk_text(text, chunk_size=50, overlap=10)
    print(f"Split into {len(chunks)} chunks:")
    for chunk in chunks:
        print(f" Chunk {chunk.chunk_id}: {chunk.word_count} words | {chunk.text[:60]}...")
97
+
src/rag/embeddings.py ADDED
@@ -0,0 +1,302 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Embeddings module
3
+ ----------------
4
+ Purpose: Convert text to vector embeddings using local Ollama or Sentence-Transformers
5
+ """
6
+ import requests
7
+ import numpy as np
8
+ from typing import List
9
+ import logging
10
+
11
+ logger = logging.getLogger(__name__)
12
+
13
+
14
class OllamaEmbeddingClient:
    """
    Client for the Ollama embedding HTTP service.

    Requires: `ollama serve` running (default: localhost:11434).
    Model: nomic-embed-text by default.
    NOTE(review): nomic-embed-text produces 768-dim vectors, not 384 as
    previously documented here — confirm against the installed model.
    """

    def __init__(
        self,
        base_url: str = "http://localhost:11434",
        model: str = "nomic-embed-text",
        timeout: int = 30
    ):
        """
        Initialize the Ollama embedding client and verify connectivity.

        Args:
            base_url: Ollama server URL
            model: Embedding model name
            timeout: Per-request timeout in seconds (the connection probe
                uses its own fixed 5s timeout)

        Raises:
            ConnectionError: if the Ollama server is unreachable.
        """
        self.base_url = base_url
        self.model = model
        self.timeout = timeout

        # Fail fast at construction time rather than on the first embed call.
        self._test_connection()

    def _test_connection(self) -> None:
        """Probe GET /api/tags to confirm Ollama is up; raise ConnectionError otherwise."""
        try:
            response = requests.get(
                f"{self.base_url}/api/tags",
                timeout=5
            )
            if response.status_code != 200:
                raise ConnectionError(f"Ollama returned {response.status_code}")

            logger.info(f"✓ Connected to Ollama at {self.base_url}")
        except requests.exceptions.ConnectionError:
            raise ConnectionError(
                f"Cannot connect to Ollama at {self.base_url}. "
                "Start it with: ollama serve"
            )

    def embed(self, text: str) -> List[float]:
        """
        Get the embedding vector for a single text.

        Args:
            text: Text to embed

        Returns:
            List of floats; the length is model-dependent (768 for
            nomic-embed-text — confirm for other models).

        Raises:
            RuntimeError: non-200 response from Ollama
            TimeoutError: request exceeded self.timeout
            ConnectionError: connection lost mid-request
            ValueError: response JSON missing the expected "embeddings" key
        """
        try:
            # /api/embed is Ollama's batch endpoint: "input" may be a string
            # or a list, and the response wraps results in an "embeddings"
            # array — hence the [0] below for our single input.
            response = requests.post(
                f"{self.base_url}/api/embed",
                json={
                    "model": self.model,
                    "input": text
                },
                timeout=self.timeout
            )

            if response.status_code != 200:
                raise RuntimeError(
                    f"Ollama error {response.status_code}: {response.text}"
                )

            # Extract embedding from response
            embedding = response.json()["embeddings"][0]
            return embedding

        except requests.exceptions.Timeout:
            raise TimeoutError(
                f"Ollama request timed out after {self.timeout}s"
            )
        except requests.exceptions.ConnectionError:
            raise ConnectionError(
                f"Lost connection to Ollama at {self.base_url}"
            )
        except KeyError as e:
            raise ValueError(f"Unexpected Ollama response format: {e}")

    def embed_batch(self, texts: List[str]) -> List[List[float]]:
        """
        Embed multiple texts by looping over embed().

        Args:
            texts: List of texts to embed

        Returns:
            List of embeddings, one per input text (order preserved).

        Note: issues one HTTP request per text. /api/embed accepts a list,
        so true server-side batching is possible if this becomes a
        bottleneck.
        """
        embeddings = []
        for text in texts:
            try:
                emb = self.embed(text)
                embeddings.append(emb)
            except Exception as e:
                # Fail the whole batch on the first error so callers never
                # receive a partial, misaligned list of vectors.
                logger.error(f"Failed to embed text: {e}")
                raise

        return embeddings
+
132
+
133
class SentenceTransformerEmbeddingClient:
    """
    Local embedding client backed by sentence-transformers (no service needed).

    Default model: all-mpnet-base-v2 (768-dim vectors).
    NOTE(review): earlier docs here referenced all-MiniLM-L6-v2 / 384 dims,
    but the actual default below is all-mpnet-base-v2 — keep docs, config,
    and any stored vector dimensions in sync.

    Install with: pip install sentence-transformers
    """

    def __init__(self, model_name: str = "all-mpnet-base-v2"):
        """
        Load the sentence-transformers model.

        Args:
            model_name: HuggingFace model name
                Default: all-mpnet-base-v2 (768 dims)

        Raises:
            ImportError: sentence-transformers is not installed.

        Note: the first initialization downloads the model weights.
        """
        logger.info(f"Initializing Sentence-Transformers (model: {model_name})")

        try:
            # Imported lazily so the module is usable without this optional dep.
            from sentence_transformers import SentenceTransformer
            self.model = SentenceTransformer(model_name)
            logger.info(f"✓ Loaded Sentence-Transformer model: {model_name}")
        except ImportError:
            raise ImportError(
                "sentence-transformers not installed. "
                "Install with: pip install sentence-transformers"
            )
        except Exception as e:
            logger.error(f"Failed to load Sentence-Transformer model: {e}")
            raise

    def embed(self, text: str) -> List[float]:
        """
        Get the embedding for a single text.

        Args:
            text: Text to embed

        Returns:
            List of floats (768 dims for the default all-mpnet-base-v2).
        """
        try:
            embedding = self.model.encode(text, convert_to_numpy=True)
            return embedding.tolist()
        except Exception as e:
            logger.error(f"Failed to embed text: {e}")
            raise

    def embed_batch(self, texts: List[str]) -> List[List[float]]:
        """
        Embed multiple texts in one model call (more efficient than calling
        embed() per text).

        Args:
            texts: List of texts to embed

        Returns:
            List of embeddings, one per input text (order preserved).
        """
        try:
            embeddings = self.model.encode(texts, convert_to_numpy=True)
            return [emb.tolist() for emb in embeddings]
        except Exception as e:
            logger.error(f"Failed to embed batch: {e}")
            raise
+ raise
207
+
208
+
209
def cosine_similarity(vec_a: List[float], vec_b: List[float]) -> float:
    """
    Cosine similarity between two vectors.

    Args:
        vec_a: First vector
        vec_b: Second vector

    Returns:
        Score in [-1, 1] (1 = identical direction); 0.0 when either
        vector has zero magnitude.

    Example:
        >>> cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
        1.0
    """
    u = np.asarray(vec_a, dtype=float)
    v = np.asarray(vec_b, dtype=float)

    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)

    # A zero vector has no direction; define similarity as 0 in that case.
    if norm_u == 0 or norm_v == 0:
        return 0.0

    return float(np.dot(u, v) / (norm_u * norm_v))
239
+
240
+
241
+ # ============ TESTS ============
242
+
243
def test_cosine_similarity():
    """Identical vectors score ~1.0; orthogonal vectors score ~0.0."""
    x_axis = [1.0, 0.0, 0.0]
    assert abs(cosine_similarity(x_axis, [1.0, 0.0, 0.0]) - 1.0) < 0.01

    y_axis = [0.0, 1.0, 0.0]
    assert abs(cosine_similarity([1.0, 0.0, 0.0], y_axis) - 0.0) < 0.01
+ assert abs(cosine_similarity(vec3, vec4) - 0.0) < 0.01
254
+
255
+
256
def test_cosine_similarity_normalized():
    """Explicitly pre-normalized vectors still compare to ~1.0."""
    raw_a = np.array([1.0, 0.0, 0.0])
    raw_b = np.array([1.0, 0.0, 0.0])

    unit_a = raw_a / np.linalg.norm(raw_a)
    unit_b = raw_b / np.linalg.norm(raw_b)

    score = cosine_similarity(unit_a.tolist(), unit_b.tolist())
    assert abs(score - 1.0) < 0.01
267
+
268
+
269
# Manual smoke test: pick the backend from the same env var the pipeline
# uses, embed two texts, and print their similarity. Best-effort: failures
# are reported with a hint instead of a traceback.
if __name__ == "__main__":
    import os

    # Defaults to the dependency-local backend when EMBEDDING_BACKEND unset.
    backend = os.getenv("EMBEDDING_BACKEND", "sentence-transformers").lower()

    try:
        if backend == "ollama":
            print("Testing Ollama embeddings...")
            client = OllamaEmbeddingClient()
        else:
            print("Testing Sentence-Transformers embeddings...")
            client = SentenceTransformerEmbeddingClient()

        # Test single embedding
        text = "Machine learning is AI"
        embedding = client.embed(text)

        print(f"✓ Embedding created: {len(embedding)} dimensions")
        print(f" Sample values: {embedding[:5]}")

        # Test similarity between two related sentences
        text2 = "Deep learning uses networks"
        embedding2 = client.embed(text2)

        sim = cosine_similarity(embedding, embedding2)
        print(f" Similarity between texts: {sim:.3f}")

    except Exception as e:
        print(f"✗ Error: {e}")
        if backend == "ollama":
            print(" Start Ollama with: ollama serve")
        else:
            print(" Install sentence-transformers with: pip install sentence-transformers")
src/rag/llm.py ADDED
@@ -0,0 +1,220 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ LLM Module
3
+ ----------
4
+ Purpose: Query Groq LLM with context for RAG answers
5
+ """
6
+ from groq import Groq
7
+ from typing import List
8
+ import os
9
+ import logging
10
+ logging.basicConfig(level=logging.INFO)
11
+ from dotenv import load_dotenv
12
+
13
# Candidate .env locations, searched in order; the first existing file wins.
env_paths = [
    os.path.join(os.path.dirname(__file__), '../..', '.env'),  # Project root
    os.path.join(os.path.dirname(__file__), '.env'),  # Script directory
]

# Import-time side effect: load environment variables before GroqLLMClient
# reads GROQ_API_KEY. NOTE(review): uses print() rather than the module
# logger (which is defined just below this block) — consider logging instead.
for env_path in env_paths:
    if os.path.exists(env_path):
        load_dotenv(env_path)
        print(f"Loaded .env from: {env_path}")
        break
23
+
24
+ logger = logging.getLogger(__name__)
25
+
26
+
27
class GroqLLMClient:
    """
    Client for querying the Groq LLM with retrieved context (RAG answers).

    Requires: a Groq API key (argument or GROQ_API_KEY env var).
    Model: llama-3.1-8b-instant -> check available models using client.models.list()
    """

    def __init__(
        self,
        api_key: str = None,
        model_name: str = "llama-3.1-8b-instant",
        max_tokens: int = 1024,
        temperature: float = 0.7,
    ):
        """
        Initialize Groq LLM client.

        Args:
            api_key: Groq API key. Fix: now optional — the body always fell
                back to the GROQ_API_KEY env var (``api_key or os.getenv``),
                but the old signature made the parameter mandatory anyway.
                Existing positional callers are unaffected.
            model_name: Groq model name
            max_tokens: Maximum number of tokens to generate
            temperature: 0-1; higher values give more varied/creative output

        Raises:
            ValueError: no key supplied and GROQ_API_KEY is unset.
        """
        self.api_key = api_key or os.getenv("GROQ_API_KEY")

        if not self.api_key:
            raise ValueError("GROQ_API_KEY not found in environment variables")

        self.client = Groq(api_key=self.api_key)
        self.model_name = model_name
        self.max_tokens = max_tokens
        self.temperature = temperature

        logger.info(f"Groq LLM client initialized with model: {self.model_name}")

    def _build_prompt(
        self,
        context: str,
        question: str,
    ) -> str:
        """
        Build the grounded-answer prompt sent to the LLM.

        Args:
            context: Retrieved chunks, already formatted as a single string
            question: Question to ask

        Returns:
            Prompt instructing the model to answer only from the context.
        """
        prompt = f"""You are a helpful assistant. Answer the question based ONLY on the provided context.
If the context doesn't contain enough information to answer, say so explicitly.
Do not make up information.

Context: {context}

Question: {question}

Answer:"""
        return prompt

    def query(
        self,
        context: str,
        query: str,
    ) -> str:
        """
        Query the Groq LLM with context.

        Args:
            context: Retrieved context from the vector store
            query: User's question

        Returns:
            LLM's answer as a string.

        Raises:
            RuntimeError: wraps any Groq API failure.
        """
        try:
            prompt = self._build_prompt(context, query)
            logger.debug(f"Querying Groq with {len(context)} chars context")

            response = self.client.chat.completions.create(
                model=self.model_name,
                messages=[
                    {"role": "user", "content": prompt}
                ],
                max_tokens=self.max_tokens,
                temperature=self.temperature,
            )
            # Single-choice completion: take the first (only) choice.
            answer = response.choices[0].message.content
            logger.debug(f"Groq API response: {answer}")
            return answer
        except Exception as e:
            logger.error(f"Groq query failed: {e}")
            raise RuntimeError(f"LLM query failed: {e}")

    def query_with_sources(
        self,
        context: str,
        query: str,
        sources: List[str] = None
    ) -> dict:
        """
        Query the LLM and return the answer with source attribution.

        Args:
            context: Retrieved context
            query: User's question
            sources: Optional list of source identifiers (chunk IDs, URLs,
                etc.); passed through untouched, defaulting to [].

        Returns:
            Dict with 'answer' and 'sources' keys.

        Example:
            >>> result = client.query_with_sources(
            ...     context="...",
            ...     query="What is ML?",
            ...     sources=["doc1_chunk_0", "doc1_chunk_2"]
            ... )
            >>> print(result["answer"])
            >>> print(result["sources"])
        """
        answer = self.query(context, query)

        return {
            "answer": answer,
            "sources": sources or []
        }
152
+
153
def build_context_string(
    retrieved_results: List,
    include_scores: bool = True
) -> str:
    """
    Format retrieved chunks into one context block for the LLM prompt.

    Args:
        retrieved_results: Retrieval hits; each must expose ``.text`` and
            (when scores are shown) ``.similarity``.
        include_scores: Whether to include relevance percentages.

    Returns:
        Chunks rendered as "[Chunk N ...]" sections separated by blank lines.
    """
    formatted = []

    for position, hit in enumerate(retrieved_results, 1):
        if include_scores:
            header = f"[Chunk {position} - Relevance: {hit.similarity:.1%}]"
        else:
            header = f"[Chunk {position}]"
        formatted.append(f"{header}\n{hit.text}")

    return "\n\n".join(formatted)
176
+
177
+ # ============ TESTS ============
178
+
179
def test_build_context_string():
    """The rendered context contains every chunk's text and formatted score."""
    from .vector_store import RetrievalResult

    hits = [
        RetrievalResult("chunk1", "Text 1", 0.95),
        RetrievalResult("chunk2", "Text 2", 0.87)
    ]

    rendered = build_context_string(hits)

    assert "Text 1" in rendered
    assert "Text 2" in rendered
    assert "95.0%" in rendered
193
+
194
+
195
# Manual smoke test: requires a valid GROQ_API_KEY; builds a tiny context
# from two fake retrieval results and asks the live model one question.
if __name__ == "__main__":
    try:
        # Test Groq client
        client = GroqLLMClient(api_key=os.getenv("GROQ_API_KEY"))

        # Build a context string from hand-made retrieval results.
        from .vector_store import RetrievalResult

        results = [
            RetrievalResult("chunk1", "Machine learning is AI", 0.95),
            RetrievalResult("chunk2", "Deep learning uses neural networks", 0.87)
        ]

        context = build_context_string(results)

        # Query the live API with the assembled context.
        answer = client.query(
            context=context,
            query="What is machine learning?"
        )

        print("✓ Groq query successful!")
        print(f"Answer: {answer[:200]}...")

    except Exception as e:
        print(f"✗ Error: {e}")
src/rag/pdf_processor.py ADDED
@@ -0,0 +1,251 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ PDF Processor
3
+ -------------
4
+ Purpose: Process PDF files and extract text.
5
+ """
6
+
7
+ import os
8
+ from pathlib import Path
9
+ from typing import List, Dict, Tuple
10
+ import logging
11
+ import PyPDF2
12
+
13
+ logger = logging.getLogger(__name__)
14
+
15
def extract_text_pypdf2(pdf_path: str) -> Tuple[str, Dict]:
    """
    Extract text from a PDF using PyPDF2.

    Fix: removed a stray ``@staticmethod`` decorator — this is a
    module-level function, and ``@staticmethod`` outside a class wraps it
    in a descriptor that is not callable before Python 3.10 and is
    misleading afterwards.

    Args:
        pdf_path: Path to PDF file

    Returns:
        Tuple of (text, metadata); metadata includes num_pages, title and
        author (when present), a per-page text dict, and the source filename.

    Raises:
        Exception: re-raises any PyPDF2/IO failure after logging it.

    Note: PyPDF2 works okay for text-based PDFs.
    For scanned PDFs, consider using OCR tools.
    """
    try:
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = PyPDF2.PdfReader(pdf_file)

            # Extract metadata (may be None for some PDFs).
            metadata = pdf_reader.metadata or {}
            num_pages = len(pdf_reader.pages)

            # Extract text from all pages, keeping a per-page copy too.
            text = ""
            page_texts = {}

            for page_num, page in enumerate(pdf_reader.pages):
                page_text = page.extract_text()
                text += f"\n--- Page {page_num + 1} ---\n{page_text}\n"
                page_texts[page_num + 1] = page_text

            result_metadata = {
                "num_pages": num_pages,
                "title": metadata.get('/Title', 'Unknown'),
                "author": metadata.get('/Author', 'Unknown'),
                "page_texts": page_texts,
                "source_file": os.path.basename(pdf_path)
            }

            return text, result_metadata

    except Exception as e:
        logger.error(f"Failed to extract text from {pdf_path}: {e}")
        raise
60
+
61
+
62
def extract_text_pdfplumber(pdf_path: str) -> Tuple[str, Dict]:
    """
    Extract text from a PDF using pdfplumber (better layout handling).

    Args:
        pdf_path: Path to PDF file

    Returns:
        Tuple of (text, metadata) in the same shape as extract_text_pypdf2,
        so callers can use either interchangeably.

    Note: requires ``pip install pdfplumber``; when it is not installed this
    function silently falls back to PyPDF2 extraction.
    """
    try:
        import pdfplumber
    except ImportError:
        # Graceful degradation: same return contract, lower-quality extractor.
        logger.warning("pdfplumber not installed, falling back to PyPDF2")
        return extract_text_pypdf2(pdf_path)

    try:
        with pdfplumber.open(pdf_path) as pdf:
            text = ""
            page_texts = {}

            # Concatenate all pages, keeping a per-page copy as well.
            for page_num, page in enumerate(pdf.pages):
                page_text = page.extract_text()
                text += f"\n--- Page {page_num + 1} ---\n{page_text}\n"
                page_texts[page_num + 1] = page_text

            result_metadata = {
                "num_pages": len(pdf.pages),
                "title": pdf.metadata.get('Title', 'Unknown') if pdf.metadata else 'Unknown',
                "author": pdf.metadata.get('Author', 'Unknown') if pdf.metadata else 'Unknown',
                "page_texts": page_texts,
                "source_file": os.path.basename(pdf_path)
            }

            return text, result_metadata

    except Exception as e:
        logger.error(f"Failed to extract text from {pdf_path}: {e}")
        raise
104
+
105
class PDFProcessor:
    """
    Extract text from PDF files for RAG ingestion.

    Thin wrapper over the module-level extraction helpers that selects
    PyPDF2 or pdfplumber based on configuration and availability.
    """

    def __init__(self, use_pdfplumber: bool = False):
        """
        Configure the processor.

        Args:
            use_pdfplumber: prefer pdfplumber (better extraction) when it is
                installed; silently falls back to PyPDF2 otherwise.
        """
        self.use_pdfplumber = use_pdfplumber

        if use_pdfplumber:
            # Probe the optional dependency up front and downgrade if absent.
            try:
                import pdfplumber
                logger.info("Using pdfplumber for PDF extraction")
            except ImportError:
                logger.warning("pdfplumber not installed, using PyPDF2")
                self.use_pdfplumber = False

    def process_pdf(self, pdf_path: str) -> Tuple[str, Dict]:
        """
        Extract text and metadata from a single PDF.

        Args:
            pdf_path: Path to the PDF file.

        Returns:
            Tuple of (extracted_text, metadata).

        Raises:
            FileNotFoundError: path does not exist.
            ValueError: path does not end in ``.pdf``.

        Example:
            >>> processor = PDFProcessor()
            >>> text, meta = processor.process_pdf("paper.pdf")
            >>> print(f"Extracted {meta['num_pages']} pages")
        """
        path = str(pdf_path)

        if not os.path.exists(path):
            raise FileNotFoundError(f"PDF not found: {path}")

        if not path.lower().endswith('.pdf'):
            raise ValueError(f"Not a PDF file: {path}")

        logger.info(f"Processing PDF: {os.path.basename(path)}")

        extractor = extract_text_pdfplumber if self.use_pdfplumber else extract_text_pypdf2
        text, metadata = extractor(path)

        logger.info(
            f"✓ Extracted {metadata['num_pages']} pages, "
            f"{len(text)} chars"
        )

        return text, metadata

    def process_folder(
        self,
        folder_path: str,
        pattern: str = "*.pdf"
    ) -> Dict[str, Tuple[str, Dict]]:
        """
        Extract every matching PDF under a folder.

        Args:
            folder_path: Path to folder containing PDFs.
            pattern: Glob pattern to match (default: "*.pdf").

        Returns:
            Dict of {filename_stem: (text, metadata)}; individual failures
            are logged and skipped rather than aborting the whole batch.

        Raises:
            FileNotFoundError: the folder does not exist.
        """
        root = Path(folder_path)

        if not root.exists():
            raise FileNotFoundError(f"Folder not found: {root}")

        logger.info(f"Processing folder: {root}")

        matches = list(root.glob(pattern))
        logger.info(f"Found {len(matches)} PDF files")

        extracted = {}
        errors = []

        for candidate in matches:
            try:
                body, meta = self.process_pdf(str(candidate))
                # Key on the filename without its extension.
                extracted[candidate.stem] = (body, meta)
            except Exception as e:
                logger.error(f"Failed to process {candidate.name}: {e}")
                errors.append((candidate.name, str(e)))

        if errors:
            logger.warning(f"Failed to process {len(errors)} files:")
            for name, reason in errors:
                # NOTE(review): "(unknown)" looks like a templating artifact;
                # the filename was probably intended here.
                logger.warning(f" - (unknown): {reason}")

        logger.info(f"✓ Processed {len(extracted)} PDFs successfully")

        return extracted

    def clean_text(self, text: str) -> str:
        """
        Normalize extracted text: collapse blank lines and strip
        non-printable characters (newlines and tabs are preserved).

        Args:
            text: Raw extracted text.

        Returns:
            Cleaned text.
        """
        # Drop empty lines and surrounding whitespace on each line.
        kept_lines = [line.strip() for line in text.split('\n') if line.strip()]
        collapsed = '\n'.join(kept_lines)

        # Filter control characters while keeping newline/tab structure.
        return ''.join(ch for ch in collapsed if ch.isprintable() or ch in '\n\t')
232
+
233
+
234
+ # ============ TESTS ============
235
+
236
def test_pdf_processor_missing_file():
    """process_pdf must raise FileNotFoundError for a nonexistent path."""
    processor = PDFProcessor()

    raised = False
    try:
        processor.process_pdf("nonexistent.pdf")
    except FileNotFoundError:
        raised = True
        print("✓ Correctly raises FileNotFoundError for missing file")
    assert raised, "Should raise FileNotFoundError"
245
+
246
+
247
# Manual smoke test: enable logging and construct the default
# (PyPDF2-backed) processor; no extraction is performed here.
if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    # Example usage
    processor = PDFProcessor(use_pdfplumber=False)
src/rag/pipeline.py ADDED
@@ -0,0 +1,362 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ RAG Pipeline
3
+ ------------
4
+ Purpose: Unify chunking, embedding, retrieval, and answer generation into a single RAG pipeline
5
+ """
6
+
7
+ from typing import List, Dict, Any
8
+ from dataclasses import dataclass
9
+ import logging
10
+ import os
11
+ from dotenv import load_dotenv
12
+ from pathlib import Path
13
+
14
+ from .chunker import chunk_text
15
+ from .vector_store import ChromaVectorStore
16
+ from .llm import GroqLLMClient, build_context_string
17
+ from .pdf_processor import PDFProcessor
18
+
19
+ logging.basicConfig(level=logging.INFO)
20
+ logger = logging.getLogger(__name__)
21
+
22
def load_env():
    """Load environment variables from the first .env file found.

    Checks the project root first, then this module's directory.

    Returns:
        The path of the .env file that was loaded, or None if none exists.
    """
    here = os.path.dirname(__file__)
    candidates = [
        os.path.join(here, '../..', '.env'),
        os.path.join(here, '.env'),
    ]

    for candidate in candidates:
        if not os.path.exists(candidate):
            continue
        load_dotenv(candidate)
        logger.debug(f"Loaded .env from: {candidate}")
        return candidate

    logger.warning("No .env file found")
    return None
37
+
38
+
39
def get_embeddings_client():
    """
    Build an embeddings client selected by the EMBEDDING_BACKEND env var.

    Environment Variables:
        EMBEDDING_BACKEND: "ollama" or "sentence-transformers" (default)
        OLLAMA_BASE_URL: URL for Ollama (default: http://localhost:11434)

    Returns:
        Embeddings client instance
    """
    backend = os.getenv("EMBEDDING_BACKEND", "sentence-transformers").lower()

    if backend != "ollama":
        # sentence-transformers: the default — free, local, works everywhere.
        logger.info("Using Sentence-Transformers embeddings (local)")
        from .embeddings import SentenceTransformerEmbeddingClient
        return SentenceTransformerEmbeddingClient()

    logger.info("Using Ollama embeddings")
    from .embeddings import OllamaEmbeddingClient
    return OllamaEmbeddingClient(
        base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
        model="nomic-embed-text",
    )
65
+
66
+
67
+ @dataclass
68
+ class RAGConfig:
69
+ """Configuration for RAG pipeline."""
70
+ chunk_size: int = 500
71
+ chunk_overlap: int = 50
72
+ top_k: int = 3
73
+ embedding_backend: str = None # Will use env var if None
74
+ groq_api_key: str = None
75
+
76
+ def __post_init__(self):
77
+ """Set embedding_backend from env if not provided."""
78
+ if self.embedding_backend is None:
79
+ self.embedding_backend = os.getenv("EMBEDDING_BACKEND", "sentence-transformers")
80
+
81
+
82
class RAGPipeline:
    """
    End-to-end RAG pipeline.

    Workflow:
        1. Initialize: create embeddings, LLM, and vector-store components
        2. Ingest: chunk and embed documents
        3. Query: retrieve relevant chunks and generate an answer
    """

    def __init__(
        self,
        config: RAGConfig = None,
        embeddings=None,
        llm=None
    ):
        """
        Initialize RAG pipeline with all components.

        Args:
            config: RAGConfig object with settings (defaults used if None)
            embeddings: Optional embeddings client (for dependency injection)
            llm: Optional LLM client (for dependency injection)

        Raises:
            ValueError: If no LLM is injected and no Groq API key is available.
        """
        load_env()
        self.config = config or RAGConfig()
        logger.info("Initializing RAG Pipeline...")

        # Embeddings: an injected client wins; otherwise build from env config.
        if embeddings:
            self.embeddings = embeddings
            logger.info("✓ Using provided embeddings client")
        else:
            try:
                self.embeddings = get_embeddings_client()
                logger.info("✓ Embeddings client ready")
            except Exception as e:
                logger.error(f"Failed to initialize embeddings: {e}")
                raise

        # LLM: an injected client wins; otherwise a Groq API key is required.
        if llm:
            self.llm = llm
            logger.info("✓ Using provided LLM client")
        else:
            try:
                api_key = self.config.groq_api_key or os.getenv("GROQ_API_KEY")
                if not api_key:
                    raise ValueError(
                        "GROQ_API_KEY not provided. Pass it in RAGConfig or set GROQ_API_KEY environment variable."
                    )
                self.llm = GroqLLMClient(api_key=api_key)
                logger.info("✓ LLM client ready")
            except Exception as e:
                logger.error(f"Failed to initialize LLM: {e}")
                raise

        self.vector_store = ChromaVectorStore()
        logger.info("✓ Vector store ready")

        logger.info("✓ RAG Pipeline initialized")

    def ingest_pdf(
        self,
        pdf_path: str
    ) -> Dict[str, Any]:
        """
        Ingest a PDF file: extract text, chunk, and embed.

        Args:
            pdf_path: Path to PDF file

        Returns:
            Ingestion stats (see ingest()), plus a "pdf_metadata" key.

        Example:
            >>> pipeline = RAGPipeline()
            >>> result = pipeline.ingest_pdf("research_paper.pdf")
            >>> print(f"Ingested {result['chunks_embedded']} chunks")
        """
        # Extract text and document metadata from the PDF.
        processor = PDFProcessor(use_pdfplumber=False)
        text, metadata = processor.process_pdf(pdf_path)

        # Filename (without extension) serves as the document id.
        doc_id = Path(pdf_path).stem

        result = self.ingest(doc_id, text)
        result["pdf_metadata"] = metadata
        return result

    def ingest_folder(
        self,
        folder_path: str
    ) -> Dict[str, Dict[str, Any]]:
        """
        Ingest all PDFs from a folder.

        Args:
            folder_path: Path to folder containing PDFs

        Returns:
            Dict of {doc_id: ingestion_result}

        Example:
            >>> pipeline = RAGPipeline()
            >>> results = pipeline.ingest_folder("./papers")
            >>> for doc_id, result in results.items():
            ...     print(f"{doc_id}: {result['chunks_embedded']} chunks")
        """
        processor = PDFProcessor(use_pdfplumber=False)
        documents = processor.process_folder(folder_path)

        results = {}
        for doc_id, (text, metadata) in documents.items():
            result = self.ingest(doc_id, text)
            result["pdf_metadata"] = metadata
            results[doc_id] = result

        return results

    def ingest(
        self,
        doc_id: str,
        text: str
    ) -> Dict[str, Any]:
        """
        Ingest a document: chunk it and embed each chunk.

        Args:
            doc_id: Unique document identifier
            text: Document text

        Returns:
            Ingestion stats dict; always contains "doc_id", "chunks_created"
            and "chunks_embedded" (callers such as ingest_pdf rely on the
            latter, so the too-short path includes it as well).

        Example:
            >>> pipeline = RAGPipeline()
            >>> result = pipeline.ingest(
            ...     "doc1",
            ...     "Machine learning is AI. Deep learning uses networks."
            ... )
            >>> print(f"Ingested {result['chunks_created']} chunks")
        """
        logger.info(f"Ingesting document: {doc_id}")

        # Step 1: split the document into overlapping chunks.
        chunks = chunk_text(text, self.config.chunk_size, self.config.chunk_overlap)
        logger.info(f"✓ Chunks created: {len(chunks)}")

        if not chunks:
            logger.warning("No chunks created. Document may be too short.")
            # Include "chunks_embedded" so callers indexing it don't KeyError.
            return {
                "doc_id": doc_id,
                "chunks_created": 0,
                "chunks_embedded": 0,
                "time_taken": 0,
                "error": "Document too short"
            }

        # Step 2: embed each chunk and store it; failures skip that chunk only.
        chunks_embedded = 0
        for chunk in chunks:
            # Bind the id outside the try so the error log can always use it.
            chunk_id = f"{doc_id}_chunk_{chunk.chunk_id}"
            try:
                embedding = self.embeddings.embed(chunk.text)
                self.vector_store.add(
                    chunk_id=chunk_id,
                    text=chunk.text,
                    embedding=embedding,
                    metadata={
                        "doc_id": doc_id,
                        "chunk_num": chunk.chunk_id,
                        "word_count": chunk.word_count
                    }
                )
                chunks_embedded += 1
            except Exception as e:
                logger.error(f"Failed to embed chunk {chunk_id}: {e}")
                continue

        logger.info(f"✓ Embedded {chunks_embedded}/{len(chunks)} chunks")
        return {
            "doc_id": doc_id,
            "chunks_created": len(chunks),
            "chunks_embedded": chunks_embedded,
            "status": "success" if chunks_embedded > 0 else "partial"
        }

    def query(
        self,
        query: str,
        return_sources: bool = True
    ) -> Dict[str, Any]:
        """
        Query the RAG system: retrieve relevant chunks and generate answer.

        Args:
            query: User's question
            return_sources: Include source chunks in response

        Returns:
            Dictionary with 'query', 'answer', 'sources', etc.

        Raises:
            ValueError: If vector store is empty

        Example:
            >>> pipeline = RAGPipeline()
            >>> pipeline.ingest("doc1", "Machine learning is...")
            >>> result = pipeline.query("What is ML?")
            >>> print(result["answer"])
        """
        logger.info(f"Querying: {query}")

        # Refuse to answer with nothing ingested.
        if self.vector_store.size() == 0:
            raise ValueError("No documents in vector store")

        # Step 1: embed the query with the same backend used at ingest time.
        query_embedding = self.embeddings.embed(query)
        logger.debug(" → Query embedded")

        # Step 2: retrieve the top-k most similar chunks.
        retrieved_chunks = self.vector_store.retrieve(
            query_embedding,
            top_k=self.config.top_k
        )
        logger.debug(f" → Retrieved {len(retrieved_chunks)} chunks")
        if not retrieved_chunks:
            return {
                "query": query,
                "answer": "No relevant documents found.",
                "sources": [],
                "status": "no_results"
            }

        # Step 3: assemble the retrieved chunks into an LLM context string.
        context = build_context_string(retrieved_chunks)
        logger.debug(f" → Built context ({len(context)} chars)")

        # Step 4: ask the LLM, grounded on the retrieved context.
        try:
            answer = self.llm.query(context=context, query=query)
            logger.debug(f" → LLM responded ({len(answer)} chars)")
        except Exception as e:
            logger.error(f"LLM query failed: {e}")
            raise

        # Step 5: format the response, optionally with source previews.
        sources = [
            {
                "chunk_id": r.chunk_id,
                "similarity": round(r.similarity, 3),
                "preview": r.text[:100] + "..." if len(r.text) > 100 else r.text
            }
            for r in retrieved_chunks
        ] if return_sources else []

        result = {
            "query": query,
            "answer": answer,
            "sources": sources,
            "chunks_used": len(retrieved_chunks),
            "status": "success"
        }

        logger.info(f"Query complete: {result['status']}")
        return result

    def get_stats(self) -> Dict[str, Any]:
        """Get pipeline statistics (store size and active config)."""
        return {
            "total_chunks": self.vector_store.size(),
            "config": {
                "chunk_size": self.config.chunk_size,
                "chunk_overlap": self.config.chunk_overlap,
                "top_k": self.config.top_k
            }
        }
src/rag/vector_store.py ADDED
@@ -0,0 +1,297 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Vector Store Module
3
+ ===================
4
+
5
+ Purpose: Store embeddings and retrieve similar ones
6
+
7
+ This module uses Chroma for persistent, efficient vector storage.
8
+ Chroma is free, local, and production-ready.
9
+
10
+ Key Concepts:
11
+ • Vector storage: Persistent storage mapping chunk_id → embedding
12
+ • Metadata: Source info, text preview, etc.
13
+ • Retrieval: Find top-k most similar vectors using cosine similarity
14
+ • Persistence: Data survives application restarts
15
+ """
16
+
17
+ from typing import List, Dict, Any
18
+ from dataclasses import dataclass, field
19
+ import logging
20
+ import chromadb
21
+ import os
22
+
23
+ logger = logging.getLogger(__name__)
24
+
25
+
26
@dataclass
class RetrievalResult:
    """One chunk returned by a similarity search."""
    # Unique identifier of the stored chunk.
    chunk_id: str
    # The chunk's original text content.
    text: str
    # Cosine similarity to the query (higher is more similar).
    similarity: float
    # Arbitrary metadata stored alongside the chunk.
    metadata: Dict[str, Any] = field(default_factory=dict)
33
+
34
+
35
class ChromaVectorStore:
    """
    Vector store using Chroma (persistent, free, production-ready).

    Chroma is a modern vector database that:
    • Stores embeddings persistently on disk
    • Provides similarity search
    • Is completely free and open source
    • Works locally (no API calls)

    This is the recommended implementation for production RAG systems.
    """

    def __init__(self, persist_directory: str = ".chromadb", collection_name: str = "rag"):
        """
        Initialize Chroma vector store.

        Args:
            persist_directory: Where to store vectors on disk
            collection_name: Name of the collection (namespace)

        Raises:
            Exception: Propagates any chromadb initialization failure.

        Example:
            >>> store = ChromaVectorStore(persist_directory="./data/vectors")
        """
        self.persist_directory = persist_directory
        self.collection_name = collection_name

        # Ensure persist directory exists before chromadb touches it.
        os.makedirs(persist_directory, exist_ok=True)

        try:
            # Persistent client: writes collections to disk automatically.
            self.client = chromadb.PersistentClient(path=persist_directory)

            # Get or create collection; cosine distance matches our
            # similarity conversion in retrieve().
            self.collection = self.client.get_or_create_collection(
                name=collection_name,
                metadata={"hnsw:space": "cosine"}
            )

            logger.info(
                f"✓ Initialized Chroma vector store at {persist_directory} "
                f"(collection: {collection_name})"
            )
        except Exception as e:
            logger.error(f"Failed to initialize Chroma: {e}")
            raise

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # NOTE(review): chromadb.PersistentClient persists on every write and
        # exposes no persist()/shutdown() methods; the previous body called
        # those nonexistent methods inside a blanket try/except, making it a
        # silent no-op. Made the no-op explicit instead.
        return None

    def add(
        self,
        chunk_id: str,
        text: str,
        embedding: List[float],
        metadata: Dict[str, Any] = None
    ) -> None:
        """
        Add a chunk with its embedding to the store.

        Args:
            chunk_id: Unique identifier for chunk
            text: Original text content
            embedding: Vector representation (list of floats)
            metadata: Optional metadata (source, page number, etc.)

        Raises:
            Exception: Propagates any chromadb failure after logging it.

        Example:
            >>> store.add(
            ...     "doc1_chunk_0",
            ...     "Machine learning is AI",
            ...     [0.1, 0.2, ..., 0.384],
            ...     metadata={"doc_id": "doc1", "page": 1}
            ... )
        """
        try:
            self.collection.add(
                ids=[chunk_id],
                documents=[text],
                embeddings=[embedding],
                metadatas=[metadata or {}]
            )
            logger.debug(f"Added chunk {chunk_id} ({len(text)} chars)")
        except Exception as e:
            logger.error(f"Failed to add chunk {chunk_id}: {e}")
            raise

    def retrieve(
        self,
        query_embedding: List[float],
        top_k: int = 5
    ) -> List[RetrievalResult]:
        """
        Find most similar chunks to query.

        Args:
            query_embedding: Query vector
            top_k: Number of results to return

        Returns:
            List of RetrievalResult objects, sorted by similarity (highest
            first); empty list when the store is empty or nothing matches.

        Example:
            >>> results = store.retrieve(query_embedding, top_k=3)
            >>> for r in results:
            ...     print(f"{r.similarity:.3f} | {r.text[:60]}")
        """
        try:
            if self.collection.count() == 0:
                logger.warning("Vector store is empty")
                return []

            results = self.collection.query(
                query_embeddings=[query_embedding],
                n_results=top_k
            )

            if not results["ids"] or not results["ids"][0]:
                logger.debug("No results found for query")
                return []

            # Chroma returns cosine *distances*; convert to similarity
            # via 1 - distance.
            retrieval_results = []
            for i, chunk_id in enumerate(results["ids"][0]):
                distance = results["distances"][0][i]
                similarity = 1 - distance

                result = RetrievalResult(
                    chunk_id=chunk_id,
                    text=results["documents"][0][i],
                    similarity=similarity,
                    metadata=results["metadatas"][0][i]
                )
                retrieval_results.append(result)

            logger.debug(f"Retrieved {len(retrieval_results)} chunks")
            return retrieval_results

        except Exception as e:
            logger.error(f"Retrieval failed: {e}")
            raise

    def size(self) -> int:
        """Return number of chunks in store (0 on error)."""
        try:
            return self.collection.count()
        except Exception as e:
            logger.error(f"Failed to get store size: {e}")
            return 0

    def delete(self, chunk_id: str) -> bool:
        """
        Delete a chunk from the store.

        Args:
            chunk_id: ID of chunk to delete

        Returns:
            True if the delete call succeeded, False on error.
        """
        try:
            self.collection.delete(ids=[chunk_id])
            logger.debug(f"Deleted chunk {chunk_id}")
            return True
        except Exception as e:
            logger.error(f"Failed to delete chunk {chunk_id}: {e}")
            return False

    def clear(self) -> None:
        """Clear all vectors from store.

        Raises:
            Exception: Propagates any chromadb failure after logging it.
        """
        try:
            # Chroma has no truncate; fetch all ids and delete them.
            all_data = self.collection.get()
            if all_data["ids"]:
                self.collection.delete(ids=all_data["ids"])
            logger.info("Cleared vector store")
        except Exception as e:
            logger.error(f"Failed to clear store: {e}")
            raise
225
+
226
+
227
+
228
+
229
+
230
+ # ============ TESTS ============
231
+
232
+ import tempfile
233
+ import shutil
234
+ import time
235
+
236
def test_chroma_vector_store():
    """Smoke-test add/retrieve against a throwaway Chroma directory."""
    temp_dir = tempfile.mkdtemp()

    store = ChromaVectorStore(persist_directory=temp_dir)

    try:
        # Three vectors: two near each other, one orthogonal.
        fixtures = [
            ("chunk1", "Machine learning", [1.0, 0.0, 0.0]),
            ("chunk2", "Deep learning networks", [0.9, 0.1, 0.0]),
            ("chunk3", "Cooking recipes", [0.0, 1.0, 0.0]),
        ]
        for cid, text, vec in fixtures:
            store.add(cid, text, vec, metadata={"source": "test"})

        # Querying with chunk1's own vector must rank it first.
        hits = store.retrieve([1.0, 0.0, 0.0], top_k=2)
        assert len(hits) == 2
        assert hits[0].chunk_id == "chunk1"
        print("✓ Chroma test passed!")

    finally:
        # Release Chroma's file handles before removing the directory.
        try:
            if hasattr(store, "client"):
                store.client.close()
            del store.client
            del store.collection
        except Exception as exc:
            logger.warning(f"Error closing Chroma client: {exc}")

        # Windows may keep the handles briefly; give it a moment.
        time.sleep(1.0)

        # Deletion can fail transiently on Windows — retry a few times.
        for attempt in range(1, 6):
            try:
                shutil.rmtree(temp_dir)
                break
            except PermissionError:
                if attempt < 5:
                    time.sleep(0.5)
                else:
                    logger.warning(f"Could not delete temp directory {temp_dir}, skipping")
+
285
+
286
+
287
+ if __name__ == "__main__":
288
+ logging.basicConfig(level=logging.INFO)
289
+
290
+ # Test Chroma
291
+ try:
292
+ test_chroma_vector_store()
293
+ except ImportError:
294
+ print("Chroma not installed, skipping test")
295
+
296
+ # Test SimpleVectorStore
297
+ test_simple_vector_store()
uv.lock ADDED
The diff for this file is too large to render. See raw diff