Spaces:
Sleeping
Sleeping
| title: RagCore | |
| emoji: 🔍 | |
| colorFrom: indigo | |
| colorTo: purple | |
| sdk: docker | |
| app_port: 7860 | |
| pinned: false | |
| # RagCore | |
| **A production-ready Retrieval-Augmented Generation system with hybrid search, metadata filtering, and a conversational UI.** | |
| RagCore solves the problem of querying unstructured documents (PDFs, text files, HTML pages) using natural language. It ingests documents, splits them into semantically meaningful chunks, indexes them in both a vector database and a BM25 keyword index, then retrieves and reranks the most relevant passages to generate grounded, citation-backed answers using Google Gemini. | |
| Unlike naive RAG implementations that rely solely on vector similarity, RagCore combines dense (semantic) and sparse (keyword) retrieval using Reciprocal Rank Fusion, applies a cross-encoder reranker to promote the most relevant passages, and uses an intelligent query analyzer that automatically extracts filters (date ranges, document types, sources) from natural language queries. | |
| --- | |
| ## Table of Contents | |
| 1. [Architecture Overview](#architecture-overview) | |
| 2. [Tech Stack](#tech-stack) | |
| 3. [Project Structure](#project-structure) | |
| 4. [Core Components Deep Dive](#core-components-deep-dive) | |
| 5. [Data Models](#data-models) | |
| 6. [API Reference](#api-reference) | |
| 7. [UI Guide](#ui-guide) | |
| 8. [Setup and Installation](#setup-and-installation) | |
| 9. [Deployment](#deployment) | |
| 10. [Configuration Reference](#configuration-reference) | |
| 11. [How It Works End-to-End](#how-it-works-end-to-end) | |
| 12. [Testing](#testing) | |
| 13. [CI/CD](#cicd) | |
| 14. [Performance and Limits](#performance-and-limits) | |
| 15. [Troubleshooting](#troubleshooting) | |
| --- | |
| ## Architecture Overview | |
| RagCore is built as a FastAPI application with two main pipelines: **Ingestion** and **Query**. A Gradio-based UI is mounted directly onto the FastAPI app at `/ui`. | |
| ### Ingestion Pipeline | |
| ``` | |
| +------------------+ +----------------+ +-------------------+ | |
| | File Upload | --> | Parser | --> | Text Cleaner | | |
| | (PDF/TXT/HTML) | | (pypdf/bs4) | | (regex cleanup) | | |
| +------------------+ +----------------+ +-------------------+ | |
| | | |
| v | |
| +------------------+ +----------------+ +-------------------+ | |
| | Qdrant Cloud | <-- | Embedder | <-- | Chunker | | |
| | (vector store) | | (MiniLM-L6-v2) | | (sentence-aware) | | |
| +------------------+ +----------------+ +-------------------+ | |
| | | | |
| | v | |
| | +-------------------+ | |
| +------------------------------------> | BM25 Index | | |
| | (in-memory) | | |
| +-------------------+ | |
| ^ | |
| | | |
| +-------------------+ | |
| | Metadata Extractor| | |
| | (title/dates/tags)| | |
| +-------------------+ | |
| ``` | |
| **Step-by-step flow:** | |
| 1. User uploads a file via the `/api/ingest` endpoint or the Gradio UI. | |
| 2. The **Parser** detects file type by extension and extracts raw text (pypdf for PDFs, BeautifulSoup for HTML, direct decoding for TXT). | |
| 3. The **Text Cleaner** normalizes whitespace, collapses blank lines, and trims each line. | |
| 4. The **Metadata Extractor** pulls out the document title (first non-empty line), dates (via regex patterns), and tags (frequent capitalized phrases). | |
| 5. The **Chunker** splits text into overlapping chunks at sentence boundaries, respecting a configurable word-count limit. | |
| 6. The **Embedder** encodes each chunk into a 384-dimensional vector using the `all-MiniLM-L6-v2` sentence transformer. | |
| 7. Chunks with their vectors and payload metadata are upserted into **Qdrant Cloud** in batches of 100. | |
| 8. The same chunks are added to the in-memory **BM25 index** for keyword search. | |
| ### Query Pipeline | |
| ``` | |
| +------------------+ +-------------------+ +------------------+ | |
| | User Query | --> | Query Analyzer | --> | Hybrid Retriever| | |
| | "What is RAG | | (intent, filters, | | | | |
| | from PDFs?" | | cleaned query) | | +----------+ | | |
| +------------------+ +-------------------+ | |Dense | | | |
| | |(Qdrant) | | | |
| | +----------+ | | |
| | | | | |
| | +----------+ | | |
| | |Sparse | | | |
| | |(BM25) | | | |
| | +----------+ | | |
| | | | | |
| | +----------+ | | |
| | |RRF Fusion| | | |
| | +----------+ | | |
| +------------------+ | |
| | | |
| v | |
| +-------------------+ +------------------+ | |
| | Answer Generator | <-- | Reranker | | |
| | (Gemini Flash) | | (FlashRank) | | |
| +-------------------+ +------------------+ | |
| | | |
| v | |
| +-------------------+ | |
| | Cited Answer | | |
| | with Sources | | |
| +-------------------+ | |
| ``` | |
| **Step-by-step flow:** | |
| 1. User submits a natural language query. | |
| 2. The **Query Analyzer** classifies intent (factual, summarize, comparative, list, explanatory), extracts inline filters (doc type, date range, source filename), and produces a cleaned query. | |
| 3. The **Hybrid Retriever** runs two parallel searches: | |
| - **Dense search**: encodes the query with the same embedding model, queries Qdrant with cosine similarity, fetching `top_k * 2` results. | |
| - **Sparse search**: tokenizes the query and scores all chunks via BM25Okapi, also fetching `top_k * 2` results. | |
| 4. Results are fused using **Reciprocal Rank Fusion (RRF)** with configurable weights (default: 0.6 dense, 0.4 sparse). | |
| 5. The top-K fused results are passed to the **Reranker** (FlashRank cross-encoder), which rescores and selects the best 5 passages. | |
| 6. The **Answer Generator** builds a prompt with numbered context passages and sends it to **Google Gemini Flash**, which generates a cited, markdown-formatted answer. | |
| 7. The answer is returned with source references (streaming or non-streaming). | |
| --- | |
| ## Tech Stack | |
| | Technology | Version | Purpose | | |
| |---|---|---| | |
| | **Python** | 3.12 | Runtime language. Chosen for its ML/NLP ecosystem. | | |
| | **FastAPI** | >=0.110 | Async web framework. High performance, automatic OpenAPI docs, dependency injection. | | |
| | **Uvicorn** | >=0.29 | ASGI server for running FastAPI in production. | | |
| | **Pydantic** | >=2.6 | Data validation and serialization for all request/response models. | | |
| | **pydantic-settings** | >=2.2 | Environment-based configuration with `.env` file support. | | |
| | **sentence-transformers** | >=2.6 | Embedding model loading and inference (`all-MiniLM-L6-v2`). Chosen for fast CPU inference and high quality at 384 dimensions. | | |
| | **qdrant-client** | >=1.8 | Client for Qdrant vector database. Chosen for its generous free tier (1GB), filtering support, and payload storage. | | |
| | **rank-bm25** | >=0.2.2 | BM25Okapi implementation for sparse keyword retrieval. Lightweight, pure-Python, no external dependencies. | | |
| | **FlashRank** | >=0.2 | Ultra-fast cross-encoder reranker (`ms-marco-MiniLM-L-12-v2`). Runs on CPU, no GPU required. | | |
| | **google-generativeai** | >=0.5 | Official Google Gemini SDK. Gemini 2.0 Flash offers a free tier with 15 RPM. | | |
| | **Gradio** | >=4.20 | Web UI framework mounted directly on FastAPI. Two-tab interface for Q&A and document management. | | |
| | **pypdf** | >=4.1 | PDF text extraction. Handles most PDF formats without external system dependencies. | | |
| | **beautifulsoup4** | >=4.12 | HTML parsing with tag stripping (removes scripts, styles, nav, footer, header). | | |
| | **httpx** | >=0.27 | Async/sync HTTP client used by the Gradio UI to call the FastAPI backend. | | |
| | **python-multipart** | >=0.0.9 | Required by FastAPI for file upload support. | | |
| | **python-dateutil** | >=2.9 | Fuzzy date parsing for the query analyzer's absolute date extraction. | | |
| | **Ruff** | >=0.3 | Fast Python linter. Used in CI for code quality checks. | | |
| | **pytest** | >=8.0 | Test framework. Unit tests for chunker, parsers, query analyzer, retrieval, and API. | | |
| | **Docker** | - | Containerization. Pre-downloads ML models in the build step for fast cold starts. | | |
| --- | |
| ## Project Structure | |
| ``` | |
| ragcore/ | |
| |-- .github/ | |
| | +-- workflows/ | |
| | +-- ci.yml # GitHub Actions CI pipeline (lint + test) | |
| |-- app/ | |
| | |-- __init__.py | |
| | |-- config.py # Settings class with all env vars, setup_logging() | |
| | |-- main.py # FastAPI app creation, lifespan, middleware, routing | |
| | |-- api/ | |
| | | |-- __init__.py | |
| | | |-- deps.py # Dependency injection factories for all services | |
| | | +-- routes/ | |
| | | |-- __init__.py | |
| | | |-- health.py # GET /health endpoint | |
| | | |-- ingest.py # POST /api/ingest, GET /api/documents, DELETE /api/documents/{id} | |
| | | +-- query.py # POST /api/search, POST /api/ask (with streaming) | |
| | |-- core/ | |
| | | |-- __init__.py | |
| | | |-- bm25.py # BM25 index: tokenization, search, rebuild from vectorstore | |
| | | |-- chunker.py # Sentence-aware text chunking with overlap | |
| | | |-- embedder.py # SentenceTransformer embedding service | |
| | | |-- generator.py # Answer generation with prompt templates and streaming | |
| | | |-- llm.py # Gemini API client with rate limiting | |
| | | |-- metadata.py # Metadata extraction (title, dates, tags) | |
| | | |-- query_analyzer.py # Query intent classification and filter extraction | |
| | | |-- reranker.py # FlashRank cross-encoder reranking | |
| | | |-- retriever.py # Hybrid retriever with RRF fusion | |
| | | +-- vectorstore.py # Qdrant client wrapper (CRUD, search, filtering) | |
| | |-- models/ | |
| | | |-- __init__.py | |
| | | |-- document.py # DocumentMetadata, Chunk, Document models | |
| | | +-- schemas.py # API request/response schemas (IngestResponse, QueryRequest, etc.) | |
| | |-- ui/ | |
| | | |-- __init__.py | |
| | | +-- gradio_app.py # Gradio Blocks UI (Ask tab, Documents tab) | |
| | +-- utils/ | |
| | |-- __init__.py | |
| | |-- helpers.py # generate_id, clean_text, count_words, timer, retry_with_backoff | |
| | +-- parsers.py # File parsing (PDF, TXT, HTML) and page count extraction | |
| |-- tests/ | |
| | |-- __init__.py | |
| | |-- conftest.py # Shared fixtures (TestClient, sample_text) | |
| | |-- test_api.py # API integration tests (health, redirect, docs) | |
| | |-- test_chunker.py # Chunker unit tests (empty, single, multiple, overlap) | |
| | |-- test_parsers.py # Parser unit tests (UTF-8, Latin-1, HTML, unsupported) | |
| | |-- test_query_analyzer.py # Query analyzer tests (intents, filters, dates) | |
| | +-- test_retrieval.py # RRF fusion tests (basic, empty, weights, filters) | |
| |-- .dockerignore | |
| |-- .env # Environment variables (not committed to git) | |
| |-- .gitignore | |
| |-- Dockerfile # Python 3.12-slim, pre-downloads ML models | |
| |-- docker-compose.yml # Single-service compose with env_file | |
| +-- requirements.txt # All Python dependencies with version constraints | |
| ``` | |
| --- | |
| ## Core Components Deep Dive | |
| ### Parsers (`app/utils/parsers.py`) | |
| **What it does:** Extracts raw text from uploaded files based on their extension. | |
| **Supported formats:** `.pdf`, `.txt`, `.html`, `.htm` | |
| **How it works internally:** | |
| - `parse_document(file_bytes, filename)` is the main dispatcher. It reads the file extension and calls the appropriate parser. | |
| - **PDF parsing** uses `pypdf.PdfReader` to iterate over all pages, extract text from each, and join them with double newlines. | |
| - **HTML parsing** uses `BeautifulSoup` with the `html.parser` backend. Before extracting text, it decomposes `<script>`, `<style>`, `<nav>`, `<footer>`, and `<header>` tags to remove boilerplate content. Text is extracted with `get_text(separator="\n")`. | |
| - **TXT parsing** attempts UTF-8 decoding first, falling back to Latin-1 for non-UTF-8 files. | |
| - All parsers pass their output through `clean_text()` for normalization. | |
| **Key functions:** | |
| ```python | |
| def parse_document(file_bytes: bytes, filename: str) -> str | |
| def parse_pdf(file_bytes: bytes, filename: str) -> str | |
| def parse_text(file_bytes: bytes, filename: str) -> str | |
| def parse_html(file_bytes: bytes, filename: str) -> str | |
| def get_page_count(file_bytes: bytes, filename: str) -> int | None | |
| ``` | |
| **Configuration:** No direct configuration. File size is validated at the API layer (`max_file_size_mb`). | |
| --- | |
| ### Chunker (`app/core/chunker.py`) | |
| **What it does:** Splits raw text into overlapping chunks at sentence boundaries, sized by word count. | |
| **How it works internally:** | |
| 1. Text is split into sentences using the regex pattern `(?<=[.!?])\s+` (splits after sentence-ending punctuation followed by whitespace). | |
| 2. Sentences are accumulated word-by-word into the current chunk. | |
| 3. When adding the next sentence would exceed `chunk_size` words, the current chunk is finalized. | |
| 4. Overlap is implemented by retaining the last `chunk_overlap` words from the previous chunk as the start of the new chunk. | |
| 5. Each chunk records its `text`, `start_char`, `end_char`, and `chunk_index`. | |
| **Key function:** | |
| ```python | |
| def chunk_text( | |
| text: str, | |
| chunk_size: int = 512, # Maximum words per chunk | |
| chunk_overlap: int = 50, # Number of overlapping words between consecutive chunks | |
| ) -> list[dict] | |
| ``` | |
| **Return format:** Each dict contains `{"text": str, "start_char": int, "end_char": int, "chunk_index": int}`. | |
| **Configuration:** | |
| | Setting | Default | Description | | |
| |---|---|---| | |
| | `CHUNK_SIZE` | 512 | Maximum number of words per chunk | | |
| | `CHUNK_OVERLAP` | 50 | Number of overlapping words between consecutive chunks | | |
| **Design note:** Sentence-aware splitting avoids cutting mid-sentence, which improves both retrieval relevance and answer generation quality compared to fixed-character splitting. | |
| --- | |
| ### Metadata Extractor (`app/core/metadata.py`) | |
| **What it does:** Automatically extracts structured metadata from raw document text. | |
| **How it works internally:** | |
| - **Title extraction:** Scans lines from the top of the document, returning the first non-empty line with more than 3 characters (truncated to 200 chars). | |
| - **Date extraction:** Searches the first 2000 characters for dates using three regex patterns: | |
| - `YYYY-MM-DD` (ISO format) | |
| - `MM/DD/YYYY` (US format) | |
| - `Month DD, YYYY` (long format, e.g., "January 15, 2024") | |
| - **Tag extraction:** Finds all capitalized phrases (e.g., "Machine Learning", "Neural Network") using regex, counts their occurrences, and returns the top 10 that appear at least twice. Tags are lowercased before returning. | |
| - **Doc type:** Derived from the file extension (e.g., "pdf", "html", "txt"). | |
| **Key function:** | |
| ```python | |
| def extract_metadata(raw_text: str, filename: str, page_count: int | None = None) -> DocumentMetadata | |
| ``` | |
| **Supporting functions:** | |
| ```python | |
| def extract_title(text: str) -> str | None | |
| def extract_dates(text: str) -> datetime | None | |
| def extract_tags(text: str, max_tags: int = 10) -> list[str] | |
| ``` | |
| --- | |
| ### Embedder (`app/core/embedder.py`) | |
| **What it does:** Converts text into dense vector representations using a sentence transformer model. | |
| **How it works internally:** | |
| - Uses `sentence-transformers` to load the `all-MiniLM-L6-v2` model on CPU at startup. | |
| - Encodes text in batches of 64 with L2 normalization enabled (so cosine similarity is equivalent to dot product). | |
| - The model produces 384-dimensional embeddings. | |
| - Singleton pattern via `get_embedder()` ensures the model is loaded only once. | |
| **Key class:** `EmbedderService` | |
| ```python | |
| class EmbedderService: | |
| EMBEDDING_DIM = 384 | |
| def __init__(self, model_name: str) | |
| def embed_texts(self, texts: list[str]) -> list[list[float]] # Batch embedding | |
| def embed_query(self, query: str) -> list[float] # Single query embedding | |
| ``` | |
| **Configuration:** | |
| | Setting | Default | Description | | |
| |---|---|---| | |
| | `EMBEDDING_MODEL` | `all-MiniLM-L6-v2` | HuggingFace sentence-transformers model name | | |
| | `EMBEDDING_DIM` | 384 | Embedding vector dimensionality | | |
| --- | |
| ### Vector Store -- Qdrant (`app/core/vectorstore.py`) | |
| **What it does:** Manages all interactions with the Qdrant vector database: collection management, upserting chunks, searching, filtering, scrolling, and deleting. | |
| **How it works internally:** | |
| - On initialization, connects to Qdrant Cloud using the provided URL and API key. | |
| - `ensure_collection()` checks if the collection exists; if not, creates it with cosine distance and the configured vector size. | |
| - **Upsert:** Chunks are uploaded in batches of 100 as `PointStruct` objects, with the chunk text and all metadata stored in the payload. | |
| - **Search:** Uses `query_points()` with an optional `Filter` object built from `SearchFilters`. Over-fetches `top_k * 2` results to give the fusion step more candidates. | |
| - **Filtering:** Supports exact match on `source`, `doc_type`, `MatchAny` on `tags`, and `Range` on `created_date`. | |
| - **Scroll:** Iterates through all points in the collection using offset-based pagination (batch size 100). Used to rebuild the BM25 index on startup. | |
| - **Document listing:** Aggregates all points by `document_id` to return a list of unique documents with chunk counts. | |
| **Key class:** `VectorStoreService` | |
| ```python | |
| class VectorStoreService: | |
| def __init__(self, url: str, api_key: str, collection_name: str) | |
| def ensure_collection(self, vector_size: int = 384) -> None | |
| def upsert_chunks(self, chunks: list[Chunk], embeddings: list[list[float]]) -> None | |
| def search(self, query_vector: list[float], limit: int = 10, filters: SearchFilters | None = None) -> list[dict] | |
| def delete_document(self, document_id: str) -> int | |
| def scroll_all(self, batch_size: int = 100) -> list[dict] | |
| def get_document_ids(self) -> list[dict] | |
| def count(self) -> int | |
| ``` | |
| **Payload schema stored per point:** | |
| ```json | |
| { | |
| "text": "chunk text content", | |
| "document_id": "uuid-string", | |
| "chunk_index": 0, | |
| "source": "filename.pdf", | |
| "doc_type": "pdf", | |
| "title": "Document Title or null", | |
| "created_date": "2024-01-15T00:00:00 or null", | |
| "tags": ["machine learning", "neural networks"], | |
| "page_count": 12 | |
| } | |
| ``` | |
| **Configuration:** | |
| | Setting | Default | Description | | |
| |---|---|---| | |
| | `QDRANT_URL` | (required) | Qdrant Cloud cluster URL | | |
| | `QDRANT_API_KEY` | (required) | Qdrant Cloud API key | | |
| | `QDRANT_COLLECTION` | `ragcore_docs` | Collection name in Qdrant | | |
| --- | |
| ### BM25 Index (`app/core/bm25.py`) | |
| **What it does:** Maintains an in-memory BM25 keyword index for sparse retrieval alongside the dense vector search. | |
| **How it works internally:** | |
| - **Tokenization:** Text is lowercased, split into words via `\b\w+\b`, then filtered to remove stop words (58 common English words) and single-character tokens. | |
| - Uses `rank_bm25.BM25Okapi`, which implements the Okapi BM25 scoring formula: | |
| ``` | |
| score(D, Q) = SUM[ IDF(q) * (f(q,D) * (k1+1)) / (f(q,D) + k1 * (1 - b + b * |D|/avgdl)) ] | |
| ``` | |
| - On startup, the index is rebuilt from all existing points in Qdrant via `rebuild_from_vectorstore()`, which scrolls through all stored chunks. | |
| - When new documents are ingested, `add_documents()` appends them and rebuilds the full BM25 corpus (the index is not incremental -- it rebuilds from the full document list). | |
| - Search returns scored results filtered to only those with `score > 0`. | |
| **Key class:** `BM25Index` | |
| ```python | |
| class BM25Index: | |
| def __init__(self) | |
| def build_index(self, chunks: list[Chunk]) -> None | |
| def add_documents(self, chunks: list[Chunk]) -> None | |
| def search(self, query: str, top_k: int = 10) -> list[dict] | |
| def rebuild_from_vectorstore(self, vectorstore) -> None | |
| @property | |
| def doc_count(self) -> int | |
| ``` | |
| **Tokenization function:** | |
| ```python | |
| def tokenize(text: str) -> list[str] | |
| ``` | |
| **Design note:** The in-memory approach means the BM25 index is rebuilt on every application restart (from Qdrant data). This is acceptable for small-to-medium collections (thousands of chunks) but would need a persistent store for larger deployments. | |
| --- | |
| ### Hybrid Retriever with RRF (`app/core/retriever.py`) | |
| **What it does:** Combines dense (vector) and sparse (BM25) retrieval results using Reciprocal Rank Fusion. | |
| **How it works internally:** | |
| 1. Embeds the query using the same `EmbedderService`. | |
| 2. Runs a dense search via Qdrant, fetching `top_k * 2` candidates (over-fetch to give fusion more options). | |
| 3. Runs a BM25 search, also fetching `top_k * 2` candidates. | |
| 4. If filters were provided, applies them post-hoc to BM25 results (since BM25 does not natively support metadata filtering). | |
| 5. Fuses both result lists using the **RRF formula**: | |
| ``` | |
| RRF_score(d) = SUM_over_lists[ weight_i * 1 / (k + rank_i(d)) ] | |
| ``` | |
| Where `k = 60` (smoothing constant), `rank_i(d)` is the rank of document `d` in list `i` (0-indexed), and `weight_i` is the list weight (default: 0.6 for dense, 0.4 for sparse). | |
| 6. Deduplicates by `chunk_id` and returns the top-K results as `RetrievedChunk` objects. | |
| **Key class:** `HybridRetriever` | |
| ```python | |
| class HybridRetriever: | |
| def __init__(self, vectorstore: VectorStoreService, bm25: BM25Index, embedder: EmbedderService) | |
| def retrieve(self, query: str, top_k: int = 10, filters: SearchFilters | None = None, | |
| dense_weight: float = 0.6, sparse_weight: float = 0.4) -> list[RetrievedChunk] | |
| @staticmethod | |
| def rrf_fuse(result_lists: list[list[dict]], k: int = 60, | |
| weights: list[float] | None = None) -> list[dict] | |
| @staticmethod | |
| def _apply_filters(results: list[dict], filters: SearchFilters) -> list[dict] | |
| ``` | |
| **Configuration:** | |
| | Setting | Default | Description | | |
| |---|---|---| | |
| | `TOP_K` | 10 | Number of chunks to return from retrieval | | |
| | `DENSE_WEIGHT` | 0.6 | Weight for dense (vector) search in RRF | | |
| | `SPARSE_WEIGHT` | 0.4 | Weight for sparse (BM25) search in RRF | | |
| **Why RRF?** Reciprocal Rank Fusion is a score-agnostic fusion method. Since BM25 scores and cosine similarity scores are on different scales, RRF uses only rank positions, making it a robust choice for combining heterogeneous retrieval signals. | |
| --- | |
| ### Reranker (`app/core/reranker.py`) | |
| **What it does:** Rescores retrieved chunks using a cross-encoder model to improve ranking precision. | |
| **How it works internally:** | |
| - Uses FlashRank with the `ms-marco-MiniLM-L-12-v2` model, which is a lightweight cross-encoder trained on the MS MARCO passage ranking dataset. | |
| - Unlike embedding models (which encode query and document independently), cross-encoders process the query-document pair jointly, allowing richer interaction signals. | |
| - Input: the query string and a list of `RetrievedChunk` objects from the hybrid retriever. | |
| - Output: the top `rerank_top_k` chunks reordered by cross-encoder score. | |
| - The reranker model is cached in `./flashrank_cache/` to avoid re-downloading on each startup. | |
| **Key class:** `RerankerService` | |
| ```python | |
| class RerankerService: | |
| def __init__(self) | |
| def rerank(self, query: str, chunks: list[RetrievedChunk], top_k: int = 5) -> list[RetrievedChunk] | |
| ``` | |
| **Configuration:** | |
| | Setting | Default | Description | | |
| |---|---|---| | |
| | `RERANK_TOP_K` | 5 | Number of chunks to keep after reranking | | |
| --- | |
| ### LLM Client (`app/core/llm.py`) | |
| **What it does:** Manages all communication with the Google Gemini API, including rate limiting and streaming. | |
| **How it works internally:** | |
| - Configures the `google.generativeai` library with the provided API key. | |
| - Instantiates a `GenerativeModel` for the configured model name (default: `gemini-2.0-flash`). | |
| - **Rate limiting:** Enforces a minimum interval between API calls based on `rpm_limit`. For the free tier (15 RPM), the minimum interval is 4 seconds. Uses `time.sleep()` for synchronous calls and `asyncio.sleep()` for async calls. | |
| - **Synchronous generation:** `generate(prompt, temperature, max_tokens)` returns the full response text. | |
| - **Streaming generation:** `generate_stream(prompt, temperature, max_tokens)` is an async generator that yields text chunks as they arrive from the API. | |
| **Key class:** `GeminiService` | |
| ```python | |
| class GeminiService: | |
| def __init__(self, api_key: str, model_name: str, rpm_limit: int = 15) | |
| def generate(self, prompt: str, temperature: float = 0.3, max_tokens: int = 2048) -> str | |
| async def generate_stream(self, prompt: str, temperature: float = 0.3, | |
| max_tokens: int = 2048) -> AsyncGenerator[str, None] | |
| ``` | |
| **Configuration:** | |
| | Setting | Default | Description | | |
| |---|---|---| | |
| | `GEMINI_API_KEY` | (required) | Google Gemini API key | | |
| | `GEMINI_MODEL` | `gemini-2.0-flash` | Gemini model identifier | | |
| | `GEMINI_RPM_LIMIT` | 15 | Requests per minute limit | | |
| | `GEMINI_TEMPERATURE` | 0.3 | Generation temperature (lower = more deterministic) | | |
| | `GEMINI_MAX_TOKENS` | 2048 | Maximum output tokens per generation | | |
| --- | |
| ### Query Analyzer (`app/core/query_analyzer.py`) | |
| **What it does:** Parses natural language queries to extract intent, metadata filters, and a cleaned query string. | |
| **How it works internally:** | |
| The analyzer performs multiple regex-based extractions in sequence: | |
| 1. **Document type extraction:** Matches patterns like "PDFs", "pdf", "HTML", "text files", "txt" and sets the `doc_type` filter. | |
| 2. **Relative date extraction:** Matches temporal phrases like "last week", "last month", "this year", "today", "yesterday" and converts them to `date_from`/`date_to` datetime ranges. | |
| 3. **Absolute date extraction:** Matches "after {date}" and "before {date}" patterns. Uses `python-dateutil` for fuzzy parsing of the date string. | |
| 4. **Source extraction:** Matches "from {filename.ext}" patterns to filter by specific source file. | |
| 5. **Query cleaning:** Removes all matched filter phrases from the query, collapses whitespace, and strips dangling prepositions (about, from, in, on). | |
| 6. **Intent classification:** Matches the original query against patterns for five intent types: | |
| - `summarize` -- "summarize", "summary", "overview" | |
| - `comparative` -- "compare", "difference", "vs", "versus" | |
| - `list` -- "list", "enumerate", "what are all" | |
| - `explanatory` -- starts with "why", "how", "explain" | |
| - `factual` -- starts with "what", "who", "when", "where", "how many/much" (default fallback) | |
| 7. **Confidence scoring:** Starts at 0.5, incremented by 0.1 for each filter successfully extracted, capped at 1.0. | |
| **Key class:** `QueryAnalyzer` | |
| ```python | |
| class QueryAnalyzer: | |
| def analyze(self, query: str) -> AnalyzedQuery | |
| ``` | |
| **Example:** | |
| Input: `"summarize PDFs from last month"` | |
| Output: | |
| ```json | |
| { | |
| "original_query": "summarize PDFs from last month", | |
| "clean_query": "summarize", | |
| "intent": "summarize", | |
| "extracted_filters": { | |
| "doc_type": "pdf", | |
| "date_from": "2026-02-17T00:00:00", | |
| "date_to": "2026-03-17T00:00:00" | |
| }, | |
| "confidence": 0.7 | |
| } | |
| ``` | |
| --- | |
| ### Answer Generator (`app/core/generator.py`) | |
| **What it does:** Builds a prompt from retrieved chunks and generates a cited answer using the LLM. | |
| **How it works internally:** | |
| 1. **Reranking:** Calls the `RerankerService` to narrow the retrieved chunks to `rerank_top_k`. | |
| 2. **Context building:** Formats each reranked chunk as a numbered passage with its source filename: | |
| ``` | |
| [1] (Source: report.pdf) | |
| Chunk text content here... | |
| [2] (Source: notes.txt) | |
| Another chunk text... | |
| ``` | |
| 3. **Prompt selection:** Uses `SYSTEM_PROMPT` for most intents and `SUMMARY_PROMPT` when the intent is "summarize". | |
| 4. **Prompt rules instruct the LLM to:** | |
| - Answer based ONLY on the provided context | |
| - Cite sources inline using [1], [2], etc. | |
| - Admit when context is insufficient | |
| - Use markdown formatting | |
| 5. **Streaming:** The `generate_answer_stream()` async generator yields text chunks during generation, then yields a final `GeneratedAnswer` object with source metadata. | |
| **Key class:** `AnswerGenerator` | |
| ```python | |
| class AnswerGenerator: | |
| def __init__(self, llm: GeminiService, reranker: RerankerService) | |
| def generate_answer(self, query: str, chunks: list[RetrievedChunk], | |
| rerank_top_k: int = 5, intent: str = "factual") -> GeneratedAnswer | |
| async def generate_answer_stream(self, query: str, chunks: list[RetrievedChunk], | |
| rerank_top_k: int = 5, intent: str = "factual") -> AsyncGenerator | |
| ``` | |
| --- | |
| ## Data Models | |
| All models are defined using Pydantic v2 and live in `app/models/`. | |
| ### Core Document Models (`app/models/document.py`) | |
| #### `DocumentMetadata` | |
| Stores extracted metadata for a document or chunk. | |
| | Field | Type | Default | Description | | |
| |---|---|---|---| | |
| | `source` | `str` | `""` | Original filename | | |
| | `doc_type` | `str` | `""` | File type without dot (e.g., "pdf", "html", "txt") | | |
| | `title` | `str \| None` | `None` | Extracted title (first meaningful line) | | |
| | `created_date` | `datetime \| None` | `None` | Extracted date from document content | | |
| | `tags` | `list[str]` | `[]` | Auto-extracted topic tags | | |
| | `page_count` | `int \| None` | `None` | Number of pages (PDFs only) | | |
| #### `Chunk` | |
| Represents a single text chunk derived from a document. | |
| | Field | Type | Default | Description | | |
| |---|---|---|---| | |
| | `chunk_id` | `str` | `uuid4()` | Unique chunk identifier | | |
| | `document_id` | `str` | `""` | Parent document identifier | | |
| | `text` | `str` | `""` | Chunk text content | | |
| | `metadata` | `DocumentMetadata` | `{}` | Inherited document metadata | | |
| | `chunk_index` | `int` | `0` | Position of this chunk in the document | | |
| | `start_char` | `int` | `0` | Start character offset in original text | | |
| | `end_char` | `int` | `0` | End character offset in original text | | |
| #### `Document` | |
| Represents a full ingested document. | |
| | Field | Type | Default | Description | | |
| |---|---|---|---| | |
| | `document_id` | `str` | `uuid4()` | Unique document identifier | | |
| | `filename` | `str` | `""` | Original filename | | |
| | `metadata` | `DocumentMetadata` | `{}` | Extracted metadata | | |
| | `chunks` | `list[Chunk]` | `[]` | Child chunks (populated during ingestion) | | |
| | `raw_text` | `str` | `""` | Full extracted text | | |
| ### API Schemas (`app/models/schemas.py`) | |
| #### `IngestResponse` | |
| Returned after successful document ingestion. | |
| | Field | Type | Description | | |
| |---|---|---| | |
| | `document_id` | `str` | Assigned UUID | | |
| | `filename` | `str` | Original filename | | |
| | `num_chunks` | `int` | Number of chunks created | | |
| | `message` | `str` | Human-readable success message | | |
| #### `SearchFilters` | |
| Used for metadata filtering in search and query operations. | |
| | Field | Type | Default | Description | | |
| |---|---|---|---| | |
| | `source` | `str \| None` | `None` | Filter by exact source filename | | |
| | `doc_type` | `str \| None` | `None` | Filter by document type | | |
| | `date_from` | `datetime \| None` | `None` | Filter documents created on or after this date | | |
| | `date_to` | `datetime \| None` | `None` | Filter documents created on or before this date | | |
| | `tags` | `list[str] \| None` | `None` | Filter by any matching tag | | |
| #### `RetrievedChunk` | |
| A chunk returned from retrieval, with its relevance score and rank. | |
| | Field | Type | Description | | |
| |---|---|---| | |
| | `chunk_id` | `str` | Chunk identifier | | |
| | `document_id` | `str` | Parent document identifier | | |
| | `text` | `str` | Chunk text | | |
| | `score` | `float` | Relevance score (RRF-fused or reranker score) | | |
| | `metadata` | `DocumentMetadata` | Chunk metadata | | |
| | `rank` | `int` | Position in the result list (0-indexed) | | |
| #### `SearchRequest` | |
| Request body for the `/api/search` endpoint. | |
| | Field | Type | Default | Description | | |
| |---|---|---|---| | |
| | `query` | `str` | (required) | Natural language search query | | |
| | `top_k` | `int` | `10` | Number of results to return | | |
| | `filters` | `SearchFilters \| None` | `None` | Optional explicit filters (overrides auto-extraction) | | |
| #### `SearchResponse` | |
| Response from the `/api/search` endpoint. | |
| | Field | Type | Description | | |
| |---|---|---| | |
| | `query` | `str` | Original query | | |
| | `results` | `list[RetrievedChunk]` | Retrieved and ranked chunks | | |
| | `total_results` | `int` | Number of results returned | | |
| | `search_time_ms` | `float` | Total search time in milliseconds | | |
| #### `QueryRequest` | |
| Request body for the `/api/ask` endpoint. | |
| | Field | Type | Default | Description | | |
| |---|---|---|---| | |
| | `query` | `str` | (required) | Natural language question | | |
| | `top_k` | `int` | `10` | Number of chunks to retrieve | | |
| | `rerank_top_k` | `int` | `5` | Number of chunks to keep after reranking | | |
| | `filters` | `SearchFilters \| None` | `None` | Optional explicit filters | | |
| | `stream` | `bool` | `False` | Enable Server-Sent Events streaming | | |
| #### `GeneratedAnswer` | |
| Response from the `/api/ask` endpoint (non-streaming). | |
| | Field | Type | Description | | |
| |---|---|---| | |
| | `query` | `str` | Original question | | |
| | `answer` | `str` | Generated markdown answer with inline citations | | |
| | `sources` | `list[RetrievedChunk]` | Source chunks used for generation | | |
| | `generation_time_ms` | `float` | Total generation time in milliseconds | | |
| | `model` | `str` | LLM model name used | | |
| #### `AnalyzedQuery` | |
| Internal model from the query analyzer (not directly exposed via API). | |
| | Field | Type | Default | Description | | |
| |---|---|---|---| | |
| | `original_query` | `str` | - | The raw user query | | |
| | `clean_query` | `str` | - | Query with filter phrases removed | | |
| | `intent` | `str` | `"factual"` | Classified intent | | |
| | `extracted_filters` | `SearchFilters` | `{}` | Automatically extracted filters | | |
| | `confidence` | `float` | `0.5` | Confidence in filter extraction | | |
| --- | |
| ## API Reference | |
| The FastAPI app automatically generates interactive API documentation at `/docs` (Swagger UI) and `/redoc` (ReDoc). | |
| ### Health Check | |
| ``` | |
| GET /health | |
| ``` | |
| Returns the status of all system components. | |
| **Response:** | |
| ```json | |
| { | |
| "status": "ok", | |
| "components": { | |
| "embedder": "loaded", | |
| "bm25": "142 documents", | |
| "vectorstore": "connected" | |
| } | |
| } | |
| ``` | |
| **curl example:** | |
| ```bash | |
| curl http://localhost:7860/health | |
| ``` | |
| --- | |
| ### Ingest Document | |
| ``` | |
| POST /api/ingest | |
| Content-Type: multipart/form-data | |
| ``` | |
| Uploads and indexes a document. The file is parsed, chunked, embedded, and stored in both the vector database and the BM25 index. | |
| **Request:** Multipart form with a `file` field. | |
| **Constraints:** | |
| - Supported extensions: `.pdf`, `.txt`, `.html`, `.htm` | |
| - Maximum file size: 10 MB (configurable via `MAX_FILE_SIZE_MB`) | |
| **Response (200):** | |
| ```json | |
| { | |
| "document_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890", | |
| "filename": "report.pdf", | |
| "num_chunks": 47, | |
| "message": "Successfully ingested 'report.pdf' with 47 chunks" | |
| } | |
| ``` | |
| **Error responses:** | |
| - `400` -- Missing filename or unsupported file type | |
| - `413` -- File exceeds maximum size | |
| - `422` -- Could not extract text from file | |
| **curl example:** | |
| ```bash | |
| curl -X POST http://localhost:7860/api/ingest \ | |
| -F "file=@/path/to/document.pdf" | |
| ``` | |
| --- | |
| ### List Documents | |
| ``` | |
| GET /api/documents | |
| ``` | |
| Returns all indexed documents with their metadata and chunk counts. | |
| **Response (200):** | |
| ```json | |
| { | |
| "documents": [ | |
| { | |
| "document_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890", | |
| "source": "report.pdf", | |
| "title": "Annual Report 2024", | |
| "doc_type": "pdf", | |
| "num_chunks": 47 | |
| } | |
| ], | |
| "total": 1 | |
| } | |
| ``` | |
| **curl example:** | |
| ```bash | |
| curl http://localhost:7860/api/documents | |
| ``` | |
| --- | |
| ### Delete Document | |
| ``` | |
| DELETE /api/documents/{document_id} | |
| ``` | |
| Removes all chunks for the given document from Qdrant and rebuilds the BM25 index. | |
| **Response (200):** | |
| ```json | |
| { | |
| "message": "Document 'a1b2c3d4-e5f6-7890-abcd-ef1234567890' deleted successfully" | |
| } | |
| ``` | |
| **curl example:** | |
| ```bash | |
| curl -X DELETE http://localhost:7860/api/documents/a1b2c3d4-e5f6-7890-abcd-ef1234567890 | |
| ``` | |
| --- | |
| ### Search (Retrieval Only) | |
| ``` | |
| POST /api/search | |
| Content-Type: application/json | |
| ``` | |
| Performs hybrid retrieval without LLM generation. Useful for inspecting which chunks would be retrieved for a given query. | |
| **Request body:** | |
| ```json | |
| { | |
| "query": "What is retrieval-augmented generation?", | |
| "top_k": 10, | |
| "filters": { | |
| "doc_type": "pdf", | |
| "tags": ["machine learning"] | |
| } | |
| } | |
| ``` | |
| **Response (200):** | |
| ```json | |
| { | |
| "query": "What is retrieval-augmented generation?", | |
| "results": [ | |
| { | |
| "chunk_id": "uuid", | |
| "document_id": "uuid", | |
| "text": "Retrieval-Augmented Generation (RAG) is...", | |
| "score": 0.0234, | |
| "metadata": { | |
| "source": "report.pdf", | |
| "doc_type": "pdf", | |
| "title": "Annual Report", | |
| "created_date": null, | |
| "tags": ["machine learning"], | |
| "page_count": 12 | |
| }, | |
| "rank": 0 | |
| } | |
| ], | |
| "total_results": 10, | |
| "search_time_ms": 142.5 | |
| } | |
| ``` | |
| **curl example:** | |
| ```bash | |
| curl -X POST http://localhost:7860/api/search \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"query": "What is RAG?", "top_k": 5}' | |
| ``` | |
| --- | |
| ### Ask (Full RAG Pipeline) | |
| ``` | |
| POST /api/ask | |
| Content-Type: application/json | |
| ``` | |
| Runs the full pipeline: query analysis, hybrid retrieval, reranking, and LLM answer generation. | |
| **Request body:** | |
| ```json | |
| { | |
| "query": "What are the key findings in the report?", | |
| "top_k": 10, | |
| "rerank_top_k": 5, | |
| "filters": null, | |
| "stream": false | |
| } | |
| ``` | |
| **Response (200, non-streaming):** | |
| ```json | |
| { | |
| "query": "What are the key findings in the report?", | |
| "answer": "Based on the provided documents, the key findings are:\n\n1. **Finding one** [1]...\n2. **Finding two** [2]...", | |
| "sources": [ | |
| { | |
| "chunk_id": "uuid", | |
| "document_id": "uuid", | |
| "text": "chunk text...", | |
| "score": 0.892, | |
| "metadata": { "source": "report.pdf", "..." : "..." }, | |
| "rank": 0 | |
| } | |
| ], | |
| "generation_time_ms": 3420.5, | |
| "model": "gemini-2.0-flash" | |
| } | |
| ``` | |
| **Streaming response (`"stream": true`):** | |
| Returns `text/event-stream` with Server-Sent Events: | |
| ``` | |
| data: {"text": "Based on"} | |
| data: {"text": " the provided"} | |
| data: {"text": " documents..."} | |
| data: {"done": true, "sources": [...], "model": "gemini-2.0-flash", "time_ms": 3420.5} | |
| ``` | |
| **curl examples:** | |
| ```bash | |
| # Non-streaming | |
| curl -X POST http://localhost:7860/api/ask \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"query": "Summarize the report", "stream": false}' | |
| # Streaming | |
| curl -X POST http://localhost:7860/api/ask \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"query": "What is RAG?", "stream": true}' \ | |
| --no-buffer | |
| ``` | |
| --- | |
| ## UI Guide | |
| RagCore includes a Gradio web interface mounted at `/ui` (the root `/` redirects there automatically). | |
| ### Ask Tab | |
| The primary interaction surface for querying your documents. | |
| **Components:** | |
| - **Query input** -- A text box where you type your question in natural language. Supports pressing Enter to submit. | |
| - **Document Type filter** -- Dropdown to restrict results to a specific file type: All, PDF, TXT, or HTML. | |
| - **Stream response toggle** -- Checkbox (default: on) to enable real-time streaming of the answer as it is generated. | |
| - **Ask button** -- Submits the query. | |
| - **Answer area** -- Displays the generated answer with markdown formatting, followed by a "Sources" section listing each referenced chunk with its filename, relevance score, and a text snippet. | |
| - **Example queries** -- Pre-filled example questions you can click to populate the query input. | |
| ### Documents Tab | |
| Manages the document collection. | |
| **Components:** | |
| - **File upload zone** -- Drag-and-drop or click to select a file (`.pdf`, `.txt`, `.html`, `.htm`). | |
| - **Upload & Index button** -- Triggers the ingestion pipeline. Shows a status card with filename, chunk count, and document ID on success. | |
| - **Indexed Documents table** -- Displays all ingested documents with their filename, type, chunk count, and truncated document ID. Click "Refresh" to update. | |
| - **Delete section** -- Paste a full document ID and click "Delete" to remove a document and all its chunks. | |
| ### Stats Bar | |
| At the top of every tab, a card shows the current count of indexed documents and total chunks. | |
| --- | |
| ## Setup and Installation | |
| ### Prerequisites | |
| - Python 3.12 or later | |
| - A Qdrant Cloud account (free tier) | |
| - A Google AI Studio account (free tier Gemini API key) | |
| - (Optional) Docker and Docker Compose | |
| ### Step 1: Get API Keys | |
| **Qdrant Cloud (vector database):** | |
| 1. Go to [https://cloud.qdrant.io](https://cloud.qdrant.io) and create a free account. | |
| 2. Create a new cluster (the free tier provides 1 GB of storage). | |
| 3. Copy the cluster URL (e.g., `https://abc123-xyz.us-east4-0.gcp.cloud.qdrant.io:6333`). | |
| 4. Generate an API key from the cluster dashboard. | |
| **Google Gemini (LLM):** | |
| 1. Go to [https://aistudio.google.com/apikey](https://aistudio.google.com/apikey). | |
| 2. Click "Create API key" and select or create a Google Cloud project. | |
| 3. Copy the generated API key. The free tier allows 15 requests per minute for Gemini 2.0 Flash. | |
| ### Step 2: Clone and Configure | |
| ```bash | |
| git clone <repository-url> | |
| cd ragcore | |
| ``` | |
| Create a `.env` file in the `ragcore/` directory: | |
| ```env | |
| # Required | |
| GEMINI_API_KEY=your-gemini-api-key-here | |
| QDRANT_URL=https://your-cluster.cloud.qdrant.io:6333 | |
| QDRANT_API_KEY=your-qdrant-api-key-here | |
| # Optional (these are the defaults) | |
| EMBEDDING_MODEL=all-MiniLM-L6-v2 | |
| EMBEDDING_DIM=384 | |
| QDRANT_COLLECTION=ragcore_docs | |
| CHUNK_SIZE=512 | |
| CHUNK_OVERLAP=50 | |
| TOP_K=10 | |
| RERANK_TOP_K=5 | |
| DENSE_WEIGHT=0.6 | |
| SPARSE_WEIGHT=0.4 | |
| GEMINI_MODEL=gemini-2.0-flash | |
| GEMINI_RPM_LIMIT=15 | |
| GEMINI_TEMPERATURE=0.3 | |
| GEMINI_MAX_TOKENS=2048 | |
| LOG_LEVEL=INFO | |
| MAX_FILE_SIZE_MB=10 | |
| ``` | |
| ### Step 3: Running Locally | |
| ```bash | |
| # Create and activate a virtual environment | |
| python -m venv .venv | |
| source .venv/bin/activate # On Linux/macOS | |
| # .venv\Scripts\activate # On Windows | |
| # Install dependencies | |
| pip install -r requirements.txt | |
| # Start the server | |
| uvicorn app.main:app --host 0.0.0.0 --port 7860 | |
| ``` | |
| The first startup will download two ML models (~90 MB for the embedding model, ~50 MB for the reranker). Subsequent startups use cached models. | |
| Once running: | |
| - Web UI: [http://localhost:7860/ui](http://localhost:7860/ui) | |
| - API docs: [http://localhost:7860/docs](http://localhost:7860/docs) | |
| - Health check: [http://localhost:7860/health](http://localhost:7860/health) | |
| ### Step 4: Running with Docker | |
| ```bash | |
| # Build and run | |
| docker compose up --build | |
| # Or build and run in detached mode | |
| docker compose up --build -d | |
| ``` | |
| The Docker build pre-downloads both ML models into the image layer, so container startup is faster. The app is exposed on port 8000 (mapped from container port 7860). | |
| Once running: [http://localhost:8000/ui](http://localhost:8000/ui) | |
| --- | |
| ## Deployment | |
| ### Deploying to HuggingFace Spaces | |
| HuggingFace Spaces provides free hosting for Gradio and Docker-based applications. RagCore is pre-configured for deployment there. | |
| **Step-by-step:** | |
| 1. **Create a HuggingFace account** at [https://huggingface.co](https://huggingface.co) if you do not have one. | |
| 2. **Create a new Space:** | |
| - Go to [https://huggingface.co/new-space](https://huggingface.co/new-space). | |
| - Choose a name (e.g., `ragcore`). | |
| - Select **Docker** as the SDK. | |
| - Choose the **Free** CPU basic tier. | |
| - Click "Create Space". | |
| 3. **Configure secrets:** | |
| - Go to your Space's Settings > Repository secrets. | |
| - Add the following secrets: | |
| - `GEMINI_API_KEY` -- your Google Gemini API key | |
| - `QDRANT_URL` -- your Qdrant Cloud cluster URL | |
| - `QDRANT_API_KEY` -- your Qdrant Cloud API key | |
| 4. **Push the code:** | |
| ```bash | |
| cd ragcore | |
| git remote add space https://huggingface.co/spaces/YOUR_USERNAME/ragcore | |
| git push space main | |
| ``` | |
| Alternatively, upload files via the HuggingFace web interface. | |
| 5. **Wait for the build** -- the Docker image will be built on HuggingFace's infrastructure. The first build takes 5-10 minutes due to model downloads. The Space will show "Running" when ready. | |
| 6. **Access your app** at `https://YOUR_USERNAME-ragcore.hf.space`. | |
| **Important notes:** | |
| - HuggingFace Spaces exposes port 7860 by default, which matches the Dockerfile's `EXPOSE 7860`. | |
| - The free tier has 2 vCPU and 16 GB RAM, which is sufficient for RagCore. | |
| - Spaces may sleep after inactivity. The first request after sleep triggers a cold start (30-60 seconds). | |
| --- | |
| ## Configuration Reference | |
| All settings are managed via environment variables, loaded from a `.env` file by `pydantic-settings`. | |
| | Variable | Type | Default | Description | | |
| |---|---|---|---| | |
| | `GEMINI_API_KEY` | string | `""` | **Required.** Google Gemini API key for LLM generation. | | |
| | `QDRANT_URL` | string | `""` | **Required.** Full URL of the Qdrant Cloud cluster (including port). | | |
| | `QDRANT_API_KEY` | string | `""` | **Required.** Qdrant Cloud API key for authentication. | | |
| | `EMBEDDING_MODEL` | string | `all-MiniLM-L6-v2` | HuggingFace model name for sentence-transformers. | | |
| | `EMBEDDING_DIM` | integer | `384` | Dimensionality of the embedding vectors. Must match the model. | | |
| | `QDRANT_COLLECTION` | string | `ragcore_docs` | Name of the Qdrant collection to use. Created automatically if missing. | | |
| | `CHUNK_SIZE` | integer | `512` | Maximum number of words per text chunk. | | |
| | `CHUNK_OVERLAP` | integer | `50` | Number of words overlapping between consecutive chunks. | | |
| | `TOP_K` | integer | `10` | Number of chunks retrieved by the hybrid retriever. | | |
| | `RERANK_TOP_K` | integer | `5` | Number of chunks kept after cross-encoder reranking. | | |
| | `DENSE_WEIGHT` | float | `0.6` | Weight for dense (vector) search in RRF fusion. Range: 0.0-1.0. | | |
| | `SPARSE_WEIGHT` | float | `0.4` | Weight for sparse (BM25) search in RRF fusion. Range: 0.0-1.0. | | |
| | `GEMINI_MODEL` | string | `gemini-2.0-flash` | Gemini model identifier. | | |
| | `GEMINI_RPM_LIMIT` | integer | `15` | Maximum requests per minute to the Gemini API. | | |
| | `GEMINI_TEMPERATURE` | float | `0.3` | LLM generation temperature. Lower values produce more deterministic output. | | |
| | `GEMINI_MAX_TOKENS` | integer | `2048` | Maximum number of output tokens per LLM generation. | | |
| | `LOG_LEVEL` | string | `INFO` | Logging level. Valid values: DEBUG, INFO, WARNING, ERROR, CRITICAL. | | |
| | `MAX_FILE_SIZE_MB` | integer | `10` | Maximum allowed file size for upload in megabytes. | | |
| --- | |
| ## How It Works End-to-End | |
| This section traces a complete user interaction: uploading a PDF and then asking a question about it. | |
| ### Phase 1: Document Ingestion | |
| **User action:** Uploads `annual-report-2024.pdf` (2.1 MB, 45 pages) via the Gradio Documents tab. | |
| 1. The Gradio UI reads the file and sends it as a multipart POST to `http://localhost:7860/api/ingest`. | |
| 2. **Validation** (`ingest.py`): | |
| - Filename is checked: extension `.pdf` is in `SUPPORTED_EXTENSIONS`. | |
| - File size 2.1 MB is under the 10 MB limit. | |
| 3. **Parsing** (`parsers.py`): | |
| - `parse_pdf()` creates a `PdfReader` from the bytes. | |
| - Iterates over all 45 pages, extracting text from each. | |
| - Joins page texts with double newlines. | |
| - `clean_text()` normalizes whitespace: collapses 3+ consecutive newlines to 2, collapses horizontal whitespace to single spaces, trims each line. | |
| - Result: ~85,000 characters of cleaned text. | |
| 4. **Metadata extraction** (`metadata.py`): | |
| - `extract_title()` returns `"Annual Report 2024 - Acme Corporation"` (first meaningful line). | |
| - `extract_dates()` finds `"2024-03-15"` in the first 2000 chars, parses it to `datetime(2024, 3, 15)`. | |
| - `extract_tags()` finds frequent capitalized phrases: `["acme corporation", "revenue growth", "machine learning", ...]`. | |
| - `get_page_count()` returns `45`. | |
| - Final `DocumentMetadata`: source="annual-report-2024.pdf", doc_type="pdf", title="Annual Report 2024 - Acme Corporation", created_date=2024-03-15, tags=[...], page_count=45. | |
| 5. **Chunking** (`chunker.py`): | |
| - Splits the ~85,000 chars into sentences via `(?<=[.!?])\s+`. | |
| - Accumulates sentences until the word count exceeds 512. | |
| - Produces ~32 chunks, each with 50-word overlap with the next. | |
| - Each chunk records start_char, end_char, and chunk_index. | |
| 6. **Embedding** (`embedder.py`): | |
| - `embed_texts()` encodes all 32 chunk texts in a single batch (batch_size=64). | |
| - Returns 32 vectors, each of dimension 384, L2-normalized. | |
| 7. **Vector storage** (`vectorstore.py`): | |
| - `upsert_chunks()` creates 32 `PointStruct` objects with the vectors and payload. | |
| - Since 32 < 100, they are uploaded in a single batch. | |
| - Each point's payload includes text, document_id, chunk_index, source, doc_type, title, created_date, tags, page_count. | |
| 8. **BM25 indexing** (`bm25.py`): | |
| - `add_documents()` tokenizes each chunk (lowercase, remove stop words, remove single chars). | |
| - Appends to the document list and rebuilds the full BM25Okapi index. | |
| 9. **Response:** Returns `IngestResponse` with document_id, filename, num_chunks=32, and success message. | |
| ### Phase 2: Querying | |
| **User action:** Types `"What was the revenue growth last year from PDFs?"` in the Ask tab with streaming enabled. | |
| 1. The Gradio UI sends a POST to `http://localhost:7860/api/ask` with: | |
| ```json | |
| {"query": "What was the revenue growth last year from PDFs?", "top_k": 10, "rerank_top_k": 5, "stream": true, "filters": {"doc_type": "pdf"}} | |
| ``` | |
| (Note: the UI sets `doc_type` filter from the dropdown if not "All".) | |
| 2. **Query analysis** (`query_analyzer.py`): | |
| - Doc type extraction: matches "PDFs" -> `filters.doc_type = "pdf"`. | |
| - Date extraction: matches "last year" -> `filters.date_from = 2025-03-17`, `filters.date_to = 2026-03-17`. | |
| - Clean query: removes "last year" and "PDFs" -> `"What was the revenue growth"`. | |
| - Intent: matches `^(?:what|...)` -> `"factual"`. | |
| - Confidence: 0.5 + 0.1 (doc_type) + 0.1 (date) = 0.7. | |
| 3. **Hybrid retrieval** (`retriever.py`): | |
| - Embeds the clean query `"What was the revenue growth"` to a 384-dim vector. | |
| - **Dense search:** Queries Qdrant with the vector, limit=20 (top_k * 2), with filters for doc_type="pdf" and date range. Returns 20 results ranked by cosine similarity. | |
| - **Sparse search:** Tokenizes query to `["what", "revenue", "growth"]` (stop words removed), scores all BM25 documents, returns top 20 by BM25 score. Post-filters by doc_type="pdf". | |
| - **RRF fusion:** For each chunk, computes `score = 0.6 * 1/(60+dense_rank) + 0.4 * 1/(60+sparse_rank)`. Chunks appearing in both lists get boosted scores. | |
| - Deduplicates by chunk_id, takes top 10. | |
| 4. **Reranking** (`reranker.py`): | |
| - Creates passage pairs: (query, chunk_text) for all 10 retrieved chunks. | |
| - The FlashRank cross-encoder scores each pair jointly. | |
| - Returns the top 5 by cross-encoder score, with updated scores and ranks. | |
| 5. **Answer generation** (`generator.py`): | |
| - Builds context with numbered passages: | |
| ``` | |
| [1] (Source: annual-report-2024.pdf) | |
| Revenue increased by 23% year-over-year... | |
| [2] (Source: annual-report-2024.pdf) | |
| The growth was primarily driven by... | |
| ``` | |
| - Constructs the SYSTEM_PROMPT with context and query. | |
| - Calls `llm.generate_stream()` which respects the rate limit, then yields text chunks. | |
| 6. **Streaming response** (`query.py`): | |
| - Each text chunk from Gemini is wrapped as `data: {"text": "..."}\n\n` (SSE format). | |
| - The Gradio UI accumulates text and renders it progressively in the answer area. | |
| - Final SSE event includes `{"done": true, "sources": [...], "model": "gemini-2.0-flash", "time_ms": 3420}`. | |
| - Gradio formats the sources as styled cards showing filename, score, and snippet. | |
| --- | |
| ## Testing | |
| ### Running Tests | |
| ```bash | |
| # Run all unit tests (excluding integration tests) | |
| pytest tests/ -v --ignore=tests/test_integration.py -x | |
| # Run a specific test file | |
| pytest tests/test_chunker.py -v | |
| # Run with coverage (install pytest-cov first) | |
| pytest tests/ -v --ignore=tests/test_integration.py --cov=app | |
| ``` | |
| ### Test Coverage | |
| | Test File | Module Under Test | What Is Tested | | |
| |---|---|---| | |
| | `test_chunker.py` | `app.core.chunker` | Empty input, single sentence, multiple chunks, overlap behavior, chunk size limits | | |
| | `test_parsers.py` | `app.utils.parsers` | UTF-8 text, Latin-1 fallback, HTML tag stripping, unsupported extensions, empty files, extension-based dispatch | | |
| | `test_query_analyzer.py` | `app.core.query_analyzer` | Intent classification (factual, comparative, summarize, explanatory), doc type extraction, date extraction, clean query preservation | | |
| | `test_retrieval.py` | `app.core.retriever` | RRF fusion (basic, empty lists, single list, weighted), metadata filter application | | |
| | `test_api.py` | `app.main` (FastAPI) | Health endpoint returns 200 with components, root redirects to `/ui`, `/docs` page loads | | |
| ### Test Fixtures | |
| Defined in `tests/conftest.py`: | |
| - `client` -- A `FastAPI TestClient` instance for API testing. | |
| - `sample_text` -- A paragraph about RAG for use in unit tests. | |
| **Note:** Unit tests mock or avoid external dependencies (Qdrant, Gemini). The CI pipeline sets dummy API keys via environment variables. Integration tests (if present in `tests/test_integration.py`) are excluded from the default test run. | |
| --- | |
| ## CI/CD | |
| ### GitHub Actions Pipeline (`.github/workflows/ci.yml`) | |
| The CI pipeline runs on every push to `main` and on every pull request targeting `main`. | |
| **Pipeline steps:** | |
| | Step | Description | | |
| |---|---| | |
| | Checkout | Clones the repository using `actions/checkout@v4` | | |
| | Set up Python | Installs Python 3.12 via `actions/setup-python@v5` | | |
| | Install dependencies | Runs `pip install -r requirements.txt` | | |
| | Lint | Runs `ruff check .` for code style and quality | | |
| | Unit tests | Runs `pytest tests/ -v --ignore=tests/test_integration.py -x` | | |
| **Environment variables set during testing:** | |
| ```yaml | |
| env: | |
| GEMINI_API_KEY: "test" | |
| QDRANT_URL: "http://localhost:6333" | |
| QDRANT_API_KEY: "test" | |
| ``` | |
| These are dummy values that allow the application to initialize its settings without connecting to real services. Tests that would require live connections are either mocked or skipped. | |
| The `-x` flag causes pytest to stop on the first failure for faster feedback. | |
| --- | |
| ## Performance and Limits | |
| ### Free Tier Limits | |
| | Service | Limit | Impact | | |
| |---|---|---| | |
| | **Qdrant Cloud** (free tier) | 1 GB storage | Approximately 500,000-700,000 chunks at 384 dimensions. More than sufficient for thousands of documents. | | |
| | **Google Gemini** (free tier) | 15 requests per minute | RagCore enforces this with built-in rate limiting (4-second minimum interval between calls). Each question costs 1 API call. | | |
| | **HuggingFace Spaces** (free tier) | 2 vCPU, 16 GB RAM | Sufficient for running the embedding model, reranker, and BM25 index concurrently. | | |
| ### Expected Latency | |
| | Operation | Typical Latency | Notes | | |
| |---|---|---| | |
| | Document ingestion (10-page PDF) | 3-8 seconds | Dominated by embedding time on CPU | | |
| | Document ingestion (50-page PDF) | 10-20 seconds | Linear with number of chunks | | |
| | Query (hybrid retrieval only) | 100-300 ms | Embedding + Qdrant + BM25 + RRF | | |
| | Query (full RAG with answer) | 3-8 seconds | Dominated by Gemini API call | | |
| | Query (streaming, time to first token) | 1-3 seconds | Reranking + Gemini startup | | |
| | BM25 rebuild on startup | 50-500 ms | Depends on collection size (scrolls all points from Qdrant) | | |
| | Embedding model cold load | 2-5 seconds | First request only; cached thereafter | | |
| | Reranker model cold load | 1-3 seconds | First request only; cached thereafter | | |
| ### Capacity Guidelines | |
| - **Small deployment** (< 100 documents, < 5,000 chunks): Everything runs comfortably within free tiers. | |
| - **Medium deployment** (100-1,000 documents, 5,000-50,000 chunks): BM25 index may use 50-500 MB RAM. Qdrant free tier still has ample space. | |
| - **Large deployment** (> 1,000 documents): Consider upgrading Qdrant to a paid tier and running the embedder on GPU for faster ingestion. | |
| --- | |
| ## Troubleshooting | |
| ### Common Errors and Fixes | |
| **Error: `"Unsupported file type '.docx'"` or similar** | |
| Only PDF, TXT, and HTML files are supported. Convert other formats to one of these before uploading. For DOCX files, export to PDF from your word processor. | |
| --- | |
| **Error: `"File too large. Maximum size is 10MB"`** | |
| Increase the limit by setting `MAX_FILE_SIZE_MB` in your `.env` file, or split the file into smaller parts. | |
| --- | |
| **Error: `"Could not extract text from file"`** | |
| The PDF may be image-based (scanned document) without an embedded text layer. pypdf cannot extract text from images. Use an OCR tool (e.g., Tesseract) to add a text layer first. | |
| --- | |
| **Error: Qdrant connection timeout or `"Connection refused"`** | |
| - Verify your `QDRANT_URL` includes the port (typically `:6333`). | |
| - Verify your `QDRANT_API_KEY` is correct. | |
| - Check that your Qdrant Cloud cluster is active (free clusters may be paused after inactivity). | |
| --- | |
| **Error: `"Gemini generation failed"` or `"429 Too Many Requests"`** | |
| You have exceeded the Gemini API rate limit. RagCore has built-in rate limiting, but if multiple users are sharing the same API key, collisions can occur. Solutions: | |
| - Wait a few seconds and retry. | |
| - Reduce `GEMINI_RPM_LIMIT` to add more buffer between calls. | |
| - Upgrade to a paid Gemini plan for higher limits. | |
| --- | |
| **Error: `"Embedder initialization deferred"`** | |
| This warning during startup means the embedding model could not be loaded immediately. This usually resolves on the first request. If it persists: | |
| - Check internet connectivity (the model needs to be downloaded on first use). | |
| - Ensure sufficient disk space (~200 MB for cached models). | |
| - Check if the `EMBEDDING_MODEL` name is correct. | |
| --- | |
| **BM25 index shows 0 documents after restart** | |
| This is expected on first startup with a fresh Qdrant collection. The BM25 index rebuilds from Qdrant on startup. If Qdrant has data but BM25 shows 0, check the Qdrant connection settings. | |
| --- | |
| **Gradio UI not loading or showing "Connecting..."** | |
| - Ensure the server is running on port 7860 (or whichever port you configured). | |
| - The Gradio UI communicates with the API via `http://localhost:7860`. If running in Docker, this internal URL is correct. If running behind a reverse proxy, the UI may need adjustment. | |
| --- | |
| **Slow first request after startup** | |
| The first request triggers lazy loading of the reranker model. This is a one-time cost of 1-3 seconds. Subsequent requests are fast. | |
| --- | |
| **Docker build fails at model download step** | |
| The Dockerfile pre-downloads ML models during build. This requires internet access during `docker build`. If building behind a corporate proxy, configure Docker's proxy settings. If the download fails, the build will fail. Retry usually resolves transient network issues. | |