---
title: RagCore
emoji: 🔍
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# RagCore
**A production-ready Retrieval-Augmented Generation system with hybrid search, metadata filtering, and a conversational UI.**
RagCore solves the problem of querying unstructured documents (PDFs, text files, HTML pages) using natural language. It ingests documents, splits them into semantically meaningful chunks, indexes them in both a vector database and a BM25 keyword index, then retrieves and reranks the most relevant passages to generate grounded, citation-backed answers using Google Gemini.
Unlike naive RAG implementations that rely solely on vector similarity, RagCore combines dense (semantic) and sparse (keyword) retrieval using Reciprocal Rank Fusion, applies a cross-encoder reranker to promote the most relevant passages, and uses a regex-based query analyzer that automatically extracts filters (date ranges, document types, sources) from natural language queries.
---
## Table of Contents
1. [Architecture Overview](#architecture-overview)
2. [Tech Stack](#tech-stack)
3. [Project Structure](#project-structure)
4. [Core Components Deep Dive](#core-components-deep-dive)
5. [Data Models](#data-models)
6. [API Reference](#api-reference)
7. [UI Guide](#ui-guide)
8. [Setup and Installation](#setup-and-installation)
9. [Deployment](#deployment)
10. [Configuration Reference](#configuration-reference)
11. [How It Works End-to-End](#how-it-works-end-to-end)
12. [Testing](#testing)
13. [CI/CD](#cicd)
14. [Performance and Limits](#performance-and-limits)
15. [Troubleshooting](#troubleshooting)
---
## Architecture Overview
RagCore is built as a FastAPI application with two main pipelines: **Ingestion** and **Query**. A Gradio-based UI is mounted directly onto the FastAPI app at `/ui`.
### Ingestion Pipeline
```
+------------------+     +----------------+     +-------------------+
|   File Upload    | --> |     Parser     | --> |   Text Cleaner    |
|  (PDF/TXT/HTML)  |     |  (pypdf/bs4)   |     |  (regex cleanup)  |
+------------------+     +----------------+     +-------------------+
                                                          |
                                                          v
+------------------+     +----------------+     +-------------------+
|   Qdrant Cloud   | <-- |    Embedder    | <-- |      Chunker      |
|  (vector store)  |     | (MiniLM-L6-v2) |     | (sentence-aware)  |
+------------------+     +----------------+     +-------------------+
         |                                                |
         |                                                v
         |                                      +-------------------+
         +------------------------------------> |    BM25 Index     |
                                                |    (in-memory)    |
                                                +-------------------+
                                                          ^
                                                          |
                                                +-------------------+
                                                | Metadata Extractor|
                                                | (title/dates/tags)|
                                                +-------------------+
```
**Step-by-step flow:**
1. User uploads a file via the `/api/ingest` endpoint or the Gradio UI.
2. The **Parser** detects file type by extension and extracts raw text (pypdf for PDFs, BeautifulSoup for HTML, direct decoding for TXT).
3. The **Text Cleaner** normalizes whitespace, collapses blank lines, and trims each line.
4. The **Metadata Extractor** pulls out the document title (first non-empty line), dates (via regex patterns), and tags (frequent capitalized phrases).
5. The **Chunker** splits text into overlapping chunks at sentence boundaries, respecting a configurable word-count limit.
6. The **Embedder** encodes each chunk into a 384-dimensional vector using the `all-MiniLM-L6-v2` sentence transformer.
7. Chunks with their vectors and payload metadata are upserted into **Qdrant Cloud** in batches of 100.
8. The same chunks are added to the in-memory **BM25 index** for keyword search.
### Query Pipeline
```
+------------------+     +-------------------+     +------------------+
|   User Query     | --> |  Query Analyzer   | --> | Hybrid Retriever |
|  "What is RAG    |     | (intent, filters, |     |                  |
|   from PDFs?"    |     |  cleaned query)   |     |  +----------+    |
+------------------+     +-------------------+     |  |Dense     |    |
                                                   |  |(Qdrant)  |    |
                                                   |  +----------+    |
                                                   |       |          |
                                                   |  +----------+    |
                                                   |  |Sparse    |    |
                                                   |  |(BM25)    |    |
                                                   |  +----------+    |
                                                   |       |          |
                                                   |  +----------+    |
                                                   |  |RRF Fusion|    |
                                                   |  +----------+    |
                                                   +------------------+
                                                            |
                                                            v
                         +-------------------+     +------------------+
                         | Answer Generator  | <-- |    Reranker      |
                         |  (Gemini Flash)   |     |   (FlashRank)    |
                         +-------------------+     +------------------+
                                   |
                                   v
                         +-------------------+
                         |   Cited Answer    |
                         |   with Sources    |
                         +-------------------+
```
**Step-by-step flow:**
1. User submits a natural language query.
2. The **Query Analyzer** classifies intent (factual, summarize, comparative, list, explanatory), extracts inline filters (doc type, date range, source filename), and produces a cleaned query.
3. The **Hybrid Retriever** runs two parallel searches:
   - **Dense search**: encodes the query with the same embedding model, queries Qdrant with cosine similarity, fetching `top_k * 2` results.
   - **Sparse search**: tokenizes the query and scores all chunks via BM25Okapi, also fetching `top_k * 2` results.
4. Results are fused using **Reciprocal Rank Fusion (RRF)** with configurable weights (default: 0.6 dense, 0.4 sparse).
5. The top-K fused results are passed to the **Reranker** (FlashRank cross-encoder), which rescores them and keeps the top `rerank_top_k` passages (default: 5).
6. The **Answer Generator** builds a prompt with numbered context passages and sends it to **Google Gemini Flash**, which generates a cited, markdown-formatted answer.
7. The answer is returned with source references (streaming or non-streaming).
---
## Tech Stack
| Technology | Version | Purpose |
|---|---|---|
| **Python** | 3.12 | Runtime language. Chosen for its ML/NLP ecosystem. |
| **FastAPI** | >=0.110 | Async web framework. High performance, automatic OpenAPI docs, dependency injection. |
| **Uvicorn** | >=0.29 | ASGI server for running FastAPI in production. |
| **Pydantic** | >=2.6 | Data validation and serialization for all request/response models. |
| **pydantic-settings** | >=2.2 | Environment-based configuration with `.env` file support. |
| **sentence-transformers** | >=2.6 | Embedding model loading and inference (`all-MiniLM-L6-v2`). Chosen for fast CPU inference and high quality at 384 dimensions. |
| **qdrant-client** | >=1.8 | Client for Qdrant vector database. Chosen for its generous free tier (1GB), filtering support, and payload storage. |
| **rank-bm25** | >=0.2.2 | BM25Okapi implementation for sparse keyword retrieval. Lightweight, pure-Python, no external dependencies. |
| **FlashRank** | >=0.2 | Ultra-fast cross-encoder reranker (`ms-marco-MiniLM-L-12-v2`). Runs on CPU, no GPU required. |
| **google-generativeai** | >=0.5 | Official Google Gemini SDK. Gemini 2.0 Flash offers a free tier with 15 RPM. |
| **Gradio** | >=4.20 | Web UI framework mounted directly on FastAPI. Two-tab interface for Q&A and document management. |
| **pypdf** | >=4.1 | PDF text extraction. Handles most PDF formats without external system dependencies. |
| **beautifulsoup4** | >=4.12 | HTML parsing with tag stripping (removes scripts, styles, nav, footer, header). |
| **httpx** | >=0.27 | Async/sync HTTP client used by the Gradio UI to call the FastAPI backend. |
| **python-multipart** | >=0.0.9 | Required by FastAPI for file upload support. |
| **python-dateutil** | >=2.9 | Fuzzy date parsing for the query analyzer's absolute date extraction. |
| **Ruff** | >=0.3 | Fast Python linter. Used in CI for code quality checks. |
| **pytest** | >=8.0 | Test framework. Unit tests for chunker, parsers, query analyzer, retrieval, and API. |
| **Docker** | - | Containerization. Pre-downloads ML models in the build step for fast cold starts. |
---
## Project Structure
```
ragcore/
|-- .github/
|   +-- workflows/
|       +-- ci.yml  # GitHub Actions CI pipeline (lint + test)
|-- app/
|   |-- __init__.py
|   |-- config.py  # Settings class with all env vars, setup_logging()
|   |-- main.py  # FastAPI app creation, lifespan, middleware, routing
|   |-- api/
|   |   |-- __init__.py
|   |   |-- deps.py  # Dependency injection factories for all services
|   |   +-- routes/
|   |       |-- __init__.py
|   |       |-- health.py  # GET /health endpoint
|   |       |-- ingest.py  # POST /api/ingest, GET /api/documents, DELETE /api/documents/{id}
|   |       +-- query.py  # POST /api/search, POST /api/ask (with streaming)
|   |-- core/
|   |   |-- __init__.py
|   |   |-- bm25.py  # BM25 index: tokenization, search, rebuild from vectorstore
|   |   |-- chunker.py  # Sentence-aware text chunking with overlap
|   |   |-- embedder.py  # SentenceTransformer embedding service
|   |   |-- generator.py  # Answer generation with prompt templates and streaming
|   |   |-- llm.py  # Gemini API client with rate limiting
|   |   |-- metadata.py  # Metadata extraction (title, dates, tags)
|   |   |-- query_analyzer.py  # Query intent classification and filter extraction
|   |   |-- reranker.py  # FlashRank cross-encoder reranking
|   |   |-- retriever.py  # Hybrid retriever with RRF fusion
|   |   +-- vectorstore.py  # Qdrant client wrapper (CRUD, search, filtering)
|   |-- models/
|   |   |-- __init__.py
|   |   |-- document.py  # DocumentMetadata, Chunk, Document models
|   |   +-- schemas.py  # API request/response schemas (IngestResponse, QueryRequest, etc.)
|   |-- ui/
|   |   |-- __init__.py
|   |   +-- gradio_app.py  # Gradio Blocks UI (Ask tab, Documents tab)
|   +-- utils/
|       |-- __init__.py
|       |-- helpers.py  # generate_id, clean_text, count_words, timer, retry_with_backoff
|       +-- parsers.py  # File parsing (PDF, TXT, HTML) and page count extraction
|-- tests/
|   |-- __init__.py
|   |-- conftest.py  # Shared fixtures (TestClient, sample_text)
|   |-- test_api.py  # API integration tests (health, redirect, docs)
|   |-- test_chunker.py  # Chunker unit tests (empty, single, multiple, overlap)
|   |-- test_parsers.py  # Parser unit tests (UTF-8, Latin-1, HTML, unsupported)
|   |-- test_query_analyzer.py  # Query analyzer tests (intents, filters, dates)
|   +-- test_retrieval.py  # RRF fusion tests (basic, empty, weights, filters)
|-- .dockerignore
|-- .env  # Environment variables (not committed to git)
|-- .gitignore
|-- Dockerfile  # Python 3.12-slim, pre-downloads ML models
|-- docker-compose.yml  # Single-service compose with env_file
+-- requirements.txt  # All Python dependencies with version constraints
```
---
## Core Components Deep Dive
### Parsers (`app/utils/parsers.py`)
**What it does:** Extracts raw text from uploaded files based on their extension.
**Supported formats:** `.pdf`, `.txt`, `.html`, `.htm`
**How it works internally:**
- `parse_document(file_bytes, filename)` is the main dispatcher. It reads the file extension and calls the appropriate parser.
- **PDF parsing** uses `pypdf.PdfReader` to iterate over all pages, extract text from each, and join them with double newlines.
- **HTML parsing** uses `BeautifulSoup` with the `html.parser` backend. Before extracting text, it decomposes `<script>`, `<style>`, `<nav>`, `<footer>`, and `<header>` tags to remove boilerplate content. Text is extracted with `get_text(separator="\n")`.
- **TXT parsing** attempts UTF-8 decoding first, falling back to Latin-1 for non-UTF-8 files.
- All parsers pass their output through `clean_text()` for normalization.
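
The TXT fallback described above can be sketched in a few lines. This is a minimal illustration, not the project's actual `parse_text`: the helper name `decode_text` is hypothetical, and the real parser also runs `clean_text()` on the result.

```python
def decode_text(file_bytes: bytes) -> str:
    """Decode raw bytes as UTF-8, falling back to Latin-1 (hypothetical helper)."""
    try:
        return file_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every possible byte to a code point, so this never raises.
        return file_bytes.decode("latin-1")


print(decode_text("café".encode("utf-8")))    # decoded on the first attempt
print(decode_text("café".encode("latin-1")))  # decoded via the fallback
```

Because Latin-1 accepts any byte sequence, the fallback cannot fail, but it may produce mojibake for files in other encodings.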
**Key functions:**
```python
def parse_document(file_bytes: bytes, filename: str) -> str
def parse_pdf(file_bytes: bytes, filename: str) -> str
def parse_text(file_bytes: bytes, filename: str) -> str
def parse_html(file_bytes: bytes, filename: str) -> str
def get_page_count(file_bytes: bytes, filename: str) -> int | None
```
**Configuration:** No direct configuration. File size is validated at the API layer (`max_file_size_mb`).
---
### Chunker (`app/core/chunker.py`)
**What it does:** Splits raw text into overlapping chunks at sentence boundaries, sized by word count.
**How it works internally:**
1. Text is split into sentences using the regex pattern `(?<=[.!?])\s+` (splits after sentence-ending punctuation followed by whitespace).
2. Sentences are accumulated into the current chunk one at a time, with chunk size measured in words.
3. When adding the next sentence would exceed `chunk_size` words, the current chunk is finalized.
4. Overlap is implemented by retaining the last `chunk_overlap` words from the previous chunk as the start of the new chunk.
5. Each chunk records its `text`, `start_char`, `end_char`, and `chunk_index`.
**Key function:**
```python
def chunk_text(
    text: str,
    chunk_size: int = 512,    # Maximum words per chunk
    chunk_overlap: int = 50,  # Number of overlapping words between consecutive chunks
) -> list[dict]
```
**Return format:** Each dict contains `{"text": str, "start_char": int, "end_char": int, "chunk_index": int}`.
**Configuration:**
| Setting | Default | Description |
|---|---|---|
| `CHUNK_SIZE` | 512 | Maximum number of words per chunk |
| `CHUNK_OVERLAP` | 50 | Number of overlapping words between consecutive chunks |
**Design note:** Sentence-aware splitting avoids cutting mid-sentence, which improves both retrieval relevance and answer generation quality compared to fixed-character splitting.
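
As a rough illustration of the algorithm above, the sketch below splits on the same regex and carries the last `chunk_overlap` words into the next chunk. It is a simplified approximation under stated assumptions, not the code in `chunker.py`: the name `chunk_text_sketch` is hypothetical, and character offsets (`start_char`, `end_char`) are omitted.

```python
import re


def chunk_text_sketch(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[dict]:
    """Sentence-aware chunking by word count (hypothetical, simplified)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[dict] = []
    current: list[str] = []  # words of the chunk being built

    for sentence in sentences:
        words = sentence.split()
        if current and len(current) + len(words) > chunk_size:
            chunks.append({"text": " ".join(current), "chunk_index": len(chunks)})
            # Carry the tail of the finished chunk over as overlap.
            current = current[-chunk_overlap:] if chunk_overlap > 0 else []
        current.extend(words)

    if current:
        chunks.append({"text": " ".join(current), "chunk_index": len(chunks)})
    return chunks


sample = "One two three. Four five six. Seven eight nine."
for chunk in chunk_text_sketch(sample, chunk_size=6, chunk_overlap=2):
    print(chunk)
```

Note that in this sketch a single sentence longer than `chunk_size` still becomes one oversized chunk, since sentences are never split.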
---
### Metadata Extractor (`app/core/metadata.py`)
**What it does:** Automatically extracts structured metadata from raw document text.
**How it works internally:**
- **Title extraction:** Scans lines from the top of the document, returning the first non-empty line with more than 3 characters (truncated to 200 chars).
- **Date extraction:** Searches the first 2000 characters for dates using three regex patterns:
  - `YYYY-MM-DD` (ISO format)
  - `MM/DD/YYYY` (US format)
  - `Month DD, YYYY` (long format, e.g., "January 15, 2024")
- **Tag extraction:** Finds all capitalized phrases (e.g., "Machine Learning", "Neural Network") using regex, counts their occurrences, and returns the top 10 that appear at least twice. Tags are lowercased before returning.
- **Doc type:** Derived from the file extension (e.g., "pdf", "html", "txt").
**Key function:**
```python
def extract_metadata(raw_text: str, filename: str, page_count: int | None = None) -> DocumentMetadata
```
**Supporting functions:**
```python
def extract_title(text: str) -> str | None
def extract_dates(text: str) -> datetime | None
def extract_tags(text: str, max_tags: int = 10) -> list[str]
```
---
### Embedder (`app/core/embedder.py`)
**What it does:** Converts text into dense vector representations using a sentence transformer model.
**How it works internally:**
- Uses `sentence-transformers` to load the `all-MiniLM-L6-v2` model on CPU at startup.
- Encodes text in batches of 64 with L2 normalization enabled (so cosine similarity is equivalent to dot product).
- The model produces 384-dimensional embeddings.
- Singleton pattern via `get_embedder()` ensures the model is loaded only once.
**Key class:** `EmbedderService`
```python
class EmbedderService:
    EMBEDDING_DIM = 384
    def __init__(self, model_name: str)
    def embed_texts(self, texts: list[str]) -> list[list[float]]  # Batch embedding
    def embed_query(self, query: str) -> list[float]              # Single query embedding
```
**Configuration:**
| Setting | Default | Description |
|---|---|---|
| `EMBEDDING_MODEL` | `all-MiniLM-L6-v2` | HuggingFace sentence-transformers model name |
| `EMBEDDING_DIM` | 384 | Embedding vector dimensionality |
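
To see why L2 normalization makes cosine similarity equal to the dot product, here is a small pure-Python check. It uses toy 2-dimensional vectors rather than real 384-dimensional embeddings, and the helper names are illustrative only.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
print(cosine(a, b))
print(dot(l2_normalize(a), l2_normalize(b)))  # same value, up to rounding
```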
---
### Vector Store -- Qdrant (`app/core/vectorstore.py`)
**What it does:** Manages all interactions with the Qdrant vector database: collection management, upserting chunks, searching, filtering, scrolling, and deleting.
**How it works internally:**
- On initialization, connects to Qdrant Cloud using the provided URL and API key.
- `ensure_collection()` checks if the collection exists; if not, creates it with cosine distance and the configured vector size.
- **Upsert:** Chunks are uploaded in batches of 100 as `PointStruct` objects, with the chunk text and all metadata stored in the payload.
- **Search:** Uses `query_points()` with an optional `Filter` object built from `SearchFilters`. Over-fetches `top_k * 2` results to give the fusion step more candidates.
- **Filtering:** Supports exact match on `source`, `doc_type`, `MatchAny` on `tags`, and `Range` on `created_date`.
- **Scroll:** Iterates through all points in the collection using offset-based pagination (batch size 100). Used to rebuild the BM25 index on startup.
- **Document listing:** Aggregates all points by `document_id` to return a list of unique documents with chunk counts.
**Key class:** `VectorStoreService`
```python
class VectorStoreService:
    def __init__(self, url: str, api_key: str, collection_name: str)
    def ensure_collection(self, vector_size: int = 384) -> None
    def upsert_chunks(self, chunks: list[Chunk], embeddings: list[list[float]]) -> None
    def search(self, query_vector: list[float], limit: int = 10, filters: SearchFilters | None = None) -> list[dict]
    def delete_document(self, document_id: str) -> int
    def scroll_all(self, batch_size: int = 100) -> list[dict]
    def get_document_ids(self) -> list[dict]
    def count(self) -> int
```
**Payload schema stored per point:**
```json
{
  "text": "chunk text content",
  "document_id": "uuid-string",
  "chunk_index": 0,
  "source": "filename.pdf",
  "doc_type": "pdf",
  "title": "Document Title or null",
  "created_date": "2024-01-15T00:00:00 or null",
  "tags": ["machine learning", "neural networks"],
  "page_count": 12
}
```
**Configuration:**
| Setting | Default | Description |
|---|---|---|
| `QDRANT_URL` | (required) | Qdrant Cloud cluster URL |
| `QDRANT_API_KEY` | (required) | Qdrant Cloud API key |
| `QDRANT_COLLECTION` | `ragcore_docs` | Collection name in Qdrant |
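
Two of the behaviors described in this section, batching upserts in groups of 100 and aggregating points by `document_id`, can be sketched without a Qdrant connection. The helpers `batched` and `summarize_documents` are hypothetical names operating on plain payload dicts; they are not part of `VectorStoreService` or `qdrant-client`.

```python
from collections.abc import Iterator


def batched(items: list, size: int = 100) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def summarize_documents(payloads: list[dict]) -> list[dict]:
    """Group point payloads by document_id and count chunks per document."""
    docs: dict[str, dict] = {}
    for payload in payloads:
        doc_id = payload["document_id"]
        entry = docs.setdefault(doc_id, {
            "document_id": doc_id,
            "source": payload.get("source"),
            "title": payload.get("title"),
            "doc_type": payload.get("doc_type"),
            "num_chunks": 0,
        })
        entry["num_chunks"] += 1
    return list(docs.values())


print([len(batch) for batch in batched(list(range(250)))])
```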
---
### BM25 Index (`app/core/bm25.py`)
**What it does:** Maintains an in-memory BM25 keyword index for sparse retrieval alongside the dense vector search.
**How it works internally:**
- **Tokenization:** Text is lowercased, split into words via `\b\w+\b`, then filtered to remove stop words (58 common English words) and single-character tokens.
- Uses `rank_bm25.BM25Okapi`, which implements the Okapi BM25 scoring formula:
```
score(D, Q) = SUM[ IDF(q) * (f(q,D) * (k1+1)) / (f(q,D) + k1 * (1 - b + b * |D|/avgdl)) ]
```
- On startup, the index is rebuilt from all existing points in Qdrant via `rebuild_from_vectorstore()`, which scrolls through all stored chunks.
- When new documents are ingested, `add_documents()` appends them and rebuilds the full BM25 corpus (the index is not incremental -- it rebuilds from the full document list).
- Search returns scored results filtered to only those with `score > 0`.
**Key class:** `BM25Index`
```python
class BM25Index:
    def __init__(self)
    def build_index(self, chunks: list[Chunk]) -> None
    def add_documents(self, chunks: list[Chunk]) -> None
    def search(self, query: str, top_k: int = 10) -> list[dict]
    def rebuild_from_vectorstore(self, vectorstore) -> None
    @property
    def doc_count(self) -> int
```
**Tokenization function:**
```python
def tokenize(text: str) -> list[str]
```
**Design note:** The in-memory approach means the BM25 index is rebuilt on every application restart (from Qdrant data). This is acceptable for small-to-medium collections (thousands of chunks) but would need a persistent store for larger deployments.
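
The tokenization rules and the per-term part of the scoring formula above can be sketched as follows. The stop-word set here is a small illustrative subset, not the 58-word list used by `bm25.py`, and the `k1`/`b` values are assumed defaults; the IDF value is passed in rather than computed, since its exact form is left to `rank_bm25`.

```python
import re

# Illustrative subset only; the real list has 58 entries.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def tokenize_sketch(text: str) -> list[str]:
    """Lowercase, split on word boundaries, drop stop words and 1-char tokens."""
    words = re.findall(r"\b\w+\b", text.lower())
    return [w for w in words if w not in STOP_WORDS and len(w) > 1]


def bm25_term_score(idf: float, tf: int, doc_len: int, avgdl: float,
                    k1: float = 1.5, b: float = 0.75) -> float:
    """Contribution of one query term to score(D, Q), per the formula above."""
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avgdl))


print(tokenize_sketch("The index is a BM25 x index"))
```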
---
### Hybrid Retriever with RRF (`app/core/retriever.py`)
**What it does:** Combines dense (vector) and sparse (BM25) retrieval results using Reciprocal Rank Fusion.
**How it works internally:**
1. Embeds the query using the same `EmbedderService`.
2. Runs a dense search via Qdrant, fetching `top_k * 2` candidates (over-fetch to give fusion more options).
3. Runs a BM25 search, also fetching `top_k * 2` candidates.
4. If filters were provided, applies them post-hoc to BM25 results (since BM25 does not natively support metadata filtering).
5. Fuses both result lists using the **RRF formula**:
```
RRF_score(d) = SUM_over_lists[ weight_i * 1 / (k + rank_i(d)) ]
```
Where `k = 60` (smoothing constant), `rank_i(d)` is the rank of document `d` in list `i` (0-indexed), and `weight_i` is the list weight (default: 0.6 for dense, 0.4 for sparse).
6. Deduplicates by `chunk_id` and returns the top-K results as `RetrievedChunk` objects.
**Key class:** `HybridRetriever`
```python
class HybridRetriever:
    def __init__(self, vectorstore: VectorStoreService, bm25: BM25Index, embedder: EmbedderService)
    def retrieve(self, query: str, top_k: int = 10, filters: SearchFilters | None = None,
                 dense_weight: float = 0.6, sparse_weight: float = 0.4) -> list[RetrievedChunk]
    @staticmethod
    def rrf_fuse(result_lists: list[list[dict]], k: int = 60,
                 weights: list[float] | None = None) -> list[dict]
    @staticmethod
    def _apply_filters(results: list[dict], filters: SearchFilters) -> list[dict]
```
**Configuration:**
| Setting | Default | Description |
|---|---|---|
| `TOP_K` | 10 | Number of chunks to return from retrieval |
| `DENSE_WEIGHT` | 0.6 | Weight for dense (vector) search in RRF |
| `SPARSE_WEIGHT` | 0.4 | Weight for sparse (BM25) search in RRF |
**Why RRF?** Reciprocal Rank Fusion is a score-agnostic fusion method. Since BM25 scores and cosine similarity scores are on different scales, RRF uses only rank positions, making it a robust choice for combining heterogeneous retrieval signals.
---
### Reranker (`app/core/reranker.py`)
**What it does:** Rescores retrieved chunks using a cross-encoder model to improve ranking precision.
**How it works internally:**
- Uses FlashRank with the `ms-marco-MiniLM-L-12-v2` model, which is a lightweight cross-encoder trained on the MS MARCO passage ranking dataset.
- Unlike embedding models (which encode query and document independently), cross-encoders process the query-document pair jointly, allowing richer interaction signals.
- Input: the query string and a list of `RetrievedChunk` objects from the hybrid retriever.
- Output: the top `rerank_top_k` chunks reordered by cross-encoder score.
- The reranker model is cached in `./flashrank_cache/` to avoid re-downloading on each startup.
**Key class:** `RerankerService`
```python
class RerankerService:
    def __init__(self)
    def rerank(self, query: str, chunks: list[RetrievedChunk], top_k: int = 5) -> list[RetrievedChunk]
```
**Configuration:**
| Setting | Default | Description |
|---|---|---|
| `RERANK_TOP_K` | 5 | Number of chunks to keep after reranking |
---
### LLM Client (`app/core/llm.py`)
**What it does:** Manages all communication with the Google Gemini API, including rate limiting and streaming.
**How it works internally:**
- Configures the `google.generativeai` library with the provided API key.
- Instantiates a `GenerativeModel` for the configured model name (default: `gemini-2.0-flash`).
- **Rate limiting:** Enforces a minimum interval between API calls based on `rpm_limit`. For the free tier (15 RPM), the minimum interval is 4 seconds. Uses `time.sleep()` for synchronous calls and `asyncio.sleep()` for async calls.
- **Synchronous generation:** `generate(prompt, temperature, max_tokens)` returns the full response text.
- **Streaming generation:** `generate_stream(prompt, temperature, max_tokens)` is an async generator that yields text chunks as they arrive from the API.
**Key class:** `GeminiService`
```python
class GeminiService:
    def __init__(self, api_key: str, model_name: str, rpm_limit: int = 15)
    def generate(self, prompt: str, temperature: float = 0.3, max_tokens: int = 2048) -> str
    async def generate_stream(self, prompt: str, temperature: float = 0.3,
                              max_tokens: int = 2048) -> AsyncGenerator[str, None]
```
**Configuration:**
| Setting | Default | Description |
|---|---|---|
| `GEMINI_API_KEY` | (required) | Google Gemini API key |
| `GEMINI_MODEL` | `gemini-2.0-flash` | Gemini model identifier |
| `GEMINI_RPM_LIMIT` | 15 | Requests per minute limit |
| `GEMINI_TEMPERATURE` | 0.3 | Generation temperature (lower = more deterministic) |
| `GEMINI_MAX_TOKENS` | 2048 | Maximum output tokens per generation |
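
The minimum-interval rate limiting described above (60 / 15 RPM = 4 seconds between calls) might be implemented along these lines. `MinIntervalLimiter` is a hypothetical class, not the code in `llm.py`; the injectable `clock` and `sleep` parameters are additions that make the sketch testable without real waiting.

```python
import time
from collections.abc import Callable


class MinIntervalLimiter:
    """Enforce a minimum gap between calls, derived from an RPM limit."""

    def __init__(self, rpm_limit: int = 15,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = 60.0 / rpm_limit
        self._clock = clock
        self._sleep = sleep
        self._last_call: float | None = None

    def wait(self) -> float:
        """Block until the next call is allowed; return the delay applied."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        self._last_call = now + delay
        return delay


limiter = MinIntervalLimiter(rpm_limit=15)
print(limiter.min_interval)
```

An async variant would follow the same logic with `await asyncio.sleep(delay)` in place of the blocking sleep.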
---
### Query Analyzer (`app/core/query_analyzer.py`)
**What it does:** Parses natural language queries to extract intent, metadata filters, and a cleaned query string.
**How it works internally:**
The analyzer performs multiple regex-based extractions in sequence:
1. **Document type extraction:** Matches patterns like "PDFs", "pdf", "HTML", "text files", "txt" and sets the `doc_type` filter.
2. **Relative date extraction:** Matches temporal phrases like "last week", "last month", "this year", "today", "yesterday" and converts them to `date_from`/`date_to` datetime ranges.
3. **Absolute date extraction:** Matches "after {date}" and "before {date}" patterns. Uses `python-dateutil` for fuzzy parsing of the date string.
4. **Source extraction:** Matches "from {filename.ext}" patterns to filter by specific source file.
5. **Query cleaning:** Removes all matched filter phrases from the query, collapses whitespace, and strips dangling prepositions (about, from, in, on).
6. **Intent classification:** Matches the original query against patterns for five intent types:
   - `summarize` -- "summarize", "summary", "overview"
   - `comparative` -- "compare", "difference", "vs", "versus"
   - `list` -- "list", "enumerate", "what are all"
   - `explanatory` -- starts with "why", "how", "explain"
   - `factual` -- starts with "what", "who", "when", "where", "how many/much" (default fallback)
7. **Confidence scoring:** Starts at 0.5, incremented by 0.1 for each filter successfully extracted, capped at 1.0.
**Key class:** `QueryAnalyzer`
```python
class QueryAnalyzer:
    def analyze(self, query: str) -> AnalyzedQuery
```
**Example:**
Input: `"summarize PDFs from last month"`
Output:
```json
{
  "original_query": "summarize PDFs from last month",
  "clean_query": "summarize",
  "intent": "summarize",
  "extracted_filters": {
    "doc_type": "pdf",
    "date_from": "2026-02-17T00:00:00",
    "date_to": "2026-03-17T00:00:00"
  },
  "confidence": 0.7
}
```
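
Intent classification and confidence scoring, as described above, can be approximated with a short ordered list of regex patterns. The patterns, their ordering (including the early `how many/much` check), and the function names below are assumptions for illustration, not the analyzer's actual rules.

```python
import re

# Checked in order; the first match wins. Ordering is an assumption.
INTENT_PATTERNS = [
    ("summarize", re.compile(r"\b(summarize|summary|overview)\b", re.I)),
    ("comparative", re.compile(r"\b(compare|difference|vs|versus)\b", re.I)),
    ("list", re.compile(r"\b(list|enumerate|what are all)\b", re.I)),
    ("factual", re.compile(r"^\s*how\s+(many|much)\b", re.I)),
    ("explanatory", re.compile(r"^\s*(why|how|explain)\b", re.I)),
]


def classify_intent(query: str) -> str:
    for intent, pattern in INTENT_PATTERNS:
        if pattern.search(query):
            return intent
    return "factual"  # default fallback


def confidence(num_filters: int) -> float:
    """Start at 0.5, add 0.1 per extracted filter, cap at 1.0."""
    return round(min(1.0, 0.5 + 0.1 * num_filters), 2)


print(classify_intent("summarize PDFs from last month"), confidence(2))
```

With two extracted filters (`doc_type` and the date range), the score matches the `0.7` shown in the example output.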
---
### Answer Generator (`app/core/generator.py`)
**What it does:** Builds a prompt from retrieved chunks and generates a cited answer using the LLM.
**How it works internally:**
1. **Reranking:** Calls the `RerankerService` to narrow the retrieved chunks to `rerank_top_k`.
2. **Context building:** Formats each reranked chunk as a numbered passage with its source filename:
```
[1] (Source: report.pdf)
Chunk text content here...
[2] (Source: notes.txt)
Another chunk text...
```
3. **Prompt selection:** Uses `SYSTEM_PROMPT` for most intents and `SUMMARY_PROMPT` when the intent is "summarize".
4. **Prompt rules instruct the LLM to:**
   - Answer based ONLY on the provided context
   - Cite sources inline using [1], [2], etc.
   - Admit when context is insufficient
   - Use markdown formatting
5. **Streaming:** The `generate_answer_stream()` async generator yields text chunks during generation, then yields a final `GeneratedAnswer` object with source metadata.
**Key class:** `AnswerGenerator`
```python
class AnswerGenerator:
    def __init__(self, llm: GeminiService, reranker: RerankerService)
    def generate_answer(self, query: str, chunks: list[RetrievedChunk],
                        rerank_top_k: int = 5, intent: str = "factual") -> GeneratedAnswer
    async def generate_answer_stream(self, query: str, chunks: list[RetrievedChunk],
                                     rerank_top_k: int = 5, intent: str = "factual") -> AsyncGenerator
```
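
The context-building step can be sketched as a small formatter that numbers each passage so the LLM can cite it as `[1]`, `[2]`, and so on. `build_context` is a hypothetical helper working on plain dicts; the blank line between passages is an assumption, and the real generator works with `RetrievedChunk` objects.

```python
def build_context(chunks: list[dict]) -> str:
    """Format chunks as numbered passages labeled with their source filename."""
    passages = []
    for number, chunk in enumerate(chunks, start=1):
        passages.append(f"[{number}] (Source: {chunk['source']})\n{chunk['text']}")
    # Assumed separator: one blank line between passages.
    return "\n\n".join(passages)


print(build_context([
    {"source": "report.pdf", "text": "Chunk text content here..."},
    {"source": "notes.txt", "text": "Another chunk text..."},
]))
```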
---
## Data Models
All models are defined using Pydantic v2 and live in `app/models/`.
### Core Document Models (`app/models/document.py`)
#### `DocumentMetadata`
Stores extracted metadata for a document or chunk.
| Field | Type | Default | Description |
|---|---|---|---|
| `source` | `str` | `""` | Original filename |
| `doc_type` | `str` | `""` | File type without dot (e.g., "pdf", "html", "txt") |
| `title` | `str \| None` | `None` | Extracted title (first meaningful line) |
| `created_date` | `datetime \| None` | `None` | Extracted date from document content |
| `tags` | `list[str]` | `[]` | Auto-extracted topic tags |
| `page_count` | `int \| None` | `None` | Number of pages (PDFs only) |
#### `Chunk`
Represents a single text chunk derived from a document.
| Field | Type | Default | Description |
|---|---|---|---|
| `chunk_id` | `str` | `uuid4()` | Unique chunk identifier |
| `document_id` | `str` | `""` | Parent document identifier |
| `text` | `str` | `""` | Chunk text content |
| `metadata` | `DocumentMetadata` | `{}` | Inherited document metadata |
| `chunk_index` | `int` | `0` | Position of this chunk in the document |
| `start_char` | `int` | `0` | Start character offset in original text |
| `end_char` | `int` | `0` | End character offset in original text |
#### `Document`
Represents a full ingested document.
| Field | Type | Default | Description |
|---|---|---|---|
| `document_id` | `str` | `uuid4()` | Unique document identifier |
| `filename` | `str` | `""` | Original filename |
| `metadata` | `DocumentMetadata` | `{}` | Extracted metadata |
| `chunks` | `list[Chunk]` | `[]` | Child chunks (populated during ingestion) |
| `raw_text` | `str` | `""` | Full extracted text |
### API Schemas (`app/models/schemas.py`)
#### `IngestResponse`
Returned after successful document ingestion.
| Field | Type | Description |
|---|---|---|
| `document_id` | `str` | Assigned UUID |
| `filename` | `str` | Original filename |
| `num_chunks` | `int` | Number of chunks created |
| `message` | `str` | Human-readable success message |
#### `SearchFilters`
Used for metadata filtering in search and query operations.
| Field | Type | Default | Description |
|---|---|---|---|
| `source` | `str \| None` | `None` | Filter by exact source filename |
| `doc_type` | `str \| None` | `None` | Filter by document type |
| `date_from` | `datetime \| None` | `None` | Filter documents created on or after this date |
| `date_to` | `datetime \| None` | `None` | Filter documents created on or before this date |
| `tags` | `list[str] \| None` | `None` | Filter by any matching tag |
#### `RetrievedChunk`
A chunk returned from retrieval, with its relevance score and rank.
| Field | Type | Description |
|---|---|---|
| `chunk_id` | `str` | Chunk identifier |
| `document_id` | `str` | Parent document identifier |
| `text` | `str` | Chunk text |
| `score` | `float` | Relevance score (RRF-fused or reranker score) |
| `metadata` | `DocumentMetadata` | Chunk metadata |
| `rank` | `int` | Position in the result list (0-indexed) |
#### `SearchRequest`
Request body for the `/api/search` endpoint.
| Field | Type | Default | Description |
|---|---|---|---|
| `query` | `str` | (required) | Natural language search query |
| `top_k` | `int` | `10` | Number of results to return |
| `filters` | `SearchFilters \| None` | `None` | Optional explicit filters (overrides auto-extraction) |
#### `SearchResponse`
Response from the `/api/search` endpoint.
| Field | Type | Description |
|---|---|---|
| `query` | `str` | Original query |
| `results` | `list[RetrievedChunk]` | Retrieved and ranked chunks |
| `total_results` | `int` | Number of results returned |
| `search_time_ms` | `float` | Total search time in milliseconds |
#### `QueryRequest`
Request body for the `/api/ask` endpoint.
| Field | Type | Default | Description |
|---|---|---|---|
| `query` | `str` | (required) | Natural language question |
| `top_k` | `int` | `10` | Number of chunks to retrieve |
| `rerank_top_k` | `int` | `5` | Number of chunks to keep after reranking |
| `filters` | `SearchFilters \| None` | `None` | Optional explicit filters |
| `stream` | `bool` | `False` | Enable Server-Sent Events streaming |
#### `GeneratedAnswer`
Response from the `/api/ask` endpoint (non-streaming).
| Field | Type | Description |
|---|---|---|
| `query` | `str` | Original question |
| `answer` | `str` | Generated markdown answer with inline citations |
| `sources` | `list[RetrievedChunk]` | Source chunks used for generation |
| `generation_time_ms` | `float` | Total generation time in milliseconds |
| `model` | `str` | LLM model name used |
#### `AnalyzedQuery`
Internal model from the query analyzer (not directly exposed via API).
| Field | Type | Default | Description |
|---|---|---|---|
| `original_query` | `str` | - | The raw user query |
| `clean_query` | `str` | - | Query with filter phrases removed |
| `intent` | `str` | `"factual"` | Classified intent |
| `extracted_filters` | `SearchFilters` | `{}` | Automatically extracted filters |
| `confidence` | `float` | `0.5` | Confidence in filter extraction |
---
## API Reference
The FastAPI app automatically generates interactive API documentation at `/docs` (Swagger UI) and `/redoc` (ReDoc).
### Health Check
```
GET /health
```
Returns the status of all system components.
**Response:**
```json
{
"status": "ok",
"components": {
"embedder": "loaded",
"bm25": "142 documents",
"vectorstore": "connected"
}
}
```
**curl example:**
```bash
curl http://localhost:7860/health
```
---
### Ingest Document
```
POST /api/ingest
Content-Type: multipart/form-data
```
Uploads and indexes a document. The file is parsed, chunked, embedded, and stored in both the vector database and the BM25 index.
**Request:** Multipart form with a `file` field.
**Constraints:**
- Supported extensions: `.pdf`, `.txt`, `.html`, `.htm`
- Maximum file size: 10 MB (configurable via `MAX_FILE_SIZE_MB`)
**Response (200):**
```json
{
"document_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"filename": "report.pdf",
"num_chunks": 47,
"message": "Successfully ingested 'report.pdf' with 47 chunks"
}
```
**Error responses:**
- `400` -- Missing filename or unsupported file type
- `413` -- File exceeds maximum size
- `422` -- Could not extract text from file
**curl example:**
```bash
curl -X POST http://localhost:7860/api/ingest \
-F "file=@/path/to/document.pdf"
```
---
### List Documents
```
GET /api/documents
```
Returns all indexed documents with their metadata and chunk counts.
**Response (200):**
```json
{
"documents": [
{
"document_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"source": "report.pdf",
"title": "Annual Report 2024",
"doc_type": "pdf",
"num_chunks": 47
}
],
"total": 1
}
```
**curl example:**
```bash
curl http://localhost:7860/api/documents
```
---
### Delete Document
```
DELETE /api/documents/{document_id}
```
Removes all chunks for the given document from Qdrant and rebuilds the BM25 index.
**Response (200):**
```json
{
"message": "Document 'a1b2c3d4-e5f6-7890-abcd-ef1234567890' deleted successfully"
}
```
**curl example:**
```bash
curl -X DELETE http://localhost:7860/api/documents/a1b2c3d4-e5f6-7890-abcd-ef1234567890
```
---
### Search (Retrieval Only)
```
POST /api/search
Content-Type: application/json
```
Performs hybrid retrieval without LLM generation. Useful for inspecting which chunks would be retrieved for a given query.
**Request body:**
```json
{
"query": "What is retrieval-augmented generation?",
"top_k": 10,
"filters": {
"doc_type": "pdf",
"tags": ["machine learning"]
}
}
```
**Response (200):**
```json
{
"query": "What is retrieval-augmented generation?",
"results": [
{
"chunk_id": "uuid",
"document_id": "uuid",
"text": "Retrieval-Augmented Generation (RAG) is...",
"score": 0.0234,
"metadata": {
"source": "report.pdf",
"doc_type": "pdf",
"title": "Annual Report",
"created_date": null,
"tags": ["machine learning"],
"page_count": 12
},
"rank": 0
}
],
"total_results": 10,
"search_time_ms": 142.5
}
```
**curl example:**
```bash
curl -X POST http://localhost:7860/api/search \
-H "Content-Type: application/json" \
-d '{"query": "What is RAG?", "top_k": 5}'
```
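
The same call can be made from Python with only the standard library. This is a minimal sketch that assumes the server runs at `http://localhost:7860`; `build_search_request` and `search` are illustrative names, not part of RagCore. Building the request separately from sending it makes the payload easy to inspect before it goes over the wire.

```python
import json
import urllib.request
from typing import Any, Optional


def build_search_request(
    query: str,
    top_k: int = 10,
    filters: Optional[dict] = None,
    base_url: str = "http://localhost:7860",  # assumed local deployment
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/search."""
    payload: dict[str, Any] = {"query": query, "top_k": top_k}
    if filters is not None:
        payload["filters"] = filters
    return urllib.request.Request(
        f"{base_url}/api/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, **kwargs: Any) -> dict:
    """Send the request and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(build_search_request(query, **kwargs)) as response:
        return json.load(response)
```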
---
### Ask (Full RAG Pipeline)
```
POST /api/ask
Content-Type: application/json
```
Runs the full pipeline: query analysis, hybrid retrieval, reranking, and LLM answer generation.
**Request body:**
```json
{
"query": "What are the key findings in the report?",
"top_k": 10,
"rerank_top_k": 5,
"filters": null,
"stream": false
}
```
**Response (200, non-streaming):**
```json
{
"query": "What are the key findings in the report?",
"answer": "Based on the provided documents, the key findings are:\n\n1. **Finding one** [1]...\n2. **Finding two** [2]...",
"sources": [
{
"chunk_id": "uuid",
"document_id": "uuid",
"text": "chunk text...",
"score": 0.892,
"metadata": { "source": "report.pdf", "..." : "..." },
"rank": 0
}
],
"generation_time_ms": 3420.5,
"model": "gemini-2.0-flash"
}
```
**Streaming response (`"stream": true`):**
Returns `text/event-stream` with Server-Sent Events:
```
data: {"text": "Based on"}
data: {"text": " the provided"}
data: {"text": " documents..."}
data: {"done": true, "sources": [...], "model": "gemini-2.0-flash", "time_ms": 3420.5}
```
**curl examples:**
```bash
# Non-streaming
curl -X POST http://localhost:7860/api/ask \
-H "Content-Type: application/json" \
-d '{"query": "Summarize the report", "stream": false}'
# Streaming
curl -X POST http://localhost:7860/api/ask \
-H "Content-Type: application/json" \
-d '{"query": "What is RAG?", "stream": true}' \
--no-buffer
```
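
A client consuming the stream has to strip the `data:` prefix, decode each JSON event, and stop at the `done` event. The function below is a minimal sketch of that loop, assuming the event shapes shown above; `consume_sse` is a hypothetical name, and the input can be any iterable of text lines (for example, decoded lines from an HTTP response).

```python
import json
from typing import Iterable


def consume_sse(lines: Iterable[str]) -> tuple[str, dict]:
    """Accumulate streamed text; return (answer, final 'done' event)."""
    parts: list[str] = []
    final: dict = {}
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything else
        event = json.loads(line[len("data:"):].strip())
        if event.get("done"):
            final = event  # carries sources, model, and time_ms
            break
        parts.append(event.get("text", ""))
    return "".join(parts), final
```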
---
## UI Guide
RagCore includes a Gradio web interface mounted at `/ui` (the root `/` redirects there automatically).
### Ask Tab
The primary interaction surface for querying your documents.
**Components:**
- **Query input** -- A text box where you type your question in natural language. Supports pressing Enter to submit.
- **Document Type filter** -- Dropdown to restrict results to a specific file type: All, PDF, TXT, or HTML.
- **Stream response toggle** -- Checkbox (default: on) to enable real-time streaming of the answer as it is generated.
- **Ask button** -- Submits the query.
- **Answer area** -- Displays the generated answer with markdown formatting, followed by a "Sources" section listing each referenced chunk with its filename, relevance score, and a text snippet.
- **Example queries** -- Pre-filled example questions you can click to populate the query input.
### Documents Tab
Manages the document collection.
**Components:**
- **File upload zone** -- Drag-and-drop or click to select a file (`.pdf`, `.txt`, `.html`, `.htm`).
- **Upload & Index button** -- Triggers the ingestion pipeline. Shows a status card with filename, chunk count, and document ID on success.
- **Indexed Documents table** -- Displays all ingested documents with their filename, type, chunk count, and truncated document ID. Click "Refresh" to update.
- **Delete section** -- Paste a full document ID and click "Delete" to remove a document and all its chunks.
### Stats Bar
At the top of every tab, a card shows the current count of indexed documents and total chunks.
---
## Setup and Installation
### Prerequisites
- Python 3.12 or later
- A Qdrant Cloud account (free tier)
- A Google AI Studio account (free tier Gemini API key)
- (Optional) Docker and Docker Compose
### Step 1: Get API Keys
**Qdrant Cloud (vector database):**
1. Go to [https://cloud.qdrant.io](https://cloud.qdrant.io) and create a free account.
2. Create a new cluster (the free tier provides 1 GB of storage).
3. Copy the cluster URL (e.g., `https://abc123-xyz.us-east4-0.gcp.cloud.qdrant.io:6333`).
4. Generate an API key from the cluster dashboard.
**Google Gemini (LLM):**
1. Go to [https://aistudio.google.com/apikey](https://aistudio.google.com/apikey).
2. Click "Create API key" and select or create a Google Cloud project.
3. Copy the generated API key. The free tier allows 15 requests per minute for Gemini 2.0 Flash.
### Step 2: Clone and Configure
```bash
git clone <repository-url>
cd ragcore
```
Create a `.env` file in the `ragcore/` directory:
```env
# Required
GEMINI_API_KEY=your-gemini-api-key-here
QDRANT_URL=https://your-cluster.cloud.qdrant.io:6333
QDRANT_API_KEY=your-qdrant-api-key-here
# Optional (these are the defaults)
EMBEDDING_MODEL=all-MiniLM-L6-v2
EMBEDDING_DIM=384
QDRANT_COLLECTION=ragcore_docs
CHUNK_SIZE=512
CHUNK_OVERLAP=50
TOP_K=10
RERANK_TOP_K=5
DENSE_WEIGHT=0.6
SPARSE_WEIGHT=0.4
GEMINI_MODEL=gemini-2.0-flash
GEMINI_RPM_LIMIT=15
GEMINI_TEMPERATURE=0.3
GEMINI_MAX_TOKENS=2048
LOG_LEVEL=INFO
MAX_FILE_SIZE_MB=10
```
### Step 3: Running Locally
```bash
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate # On Linux/macOS
# .venv\Scripts\activate # On Windows
# Install dependencies
pip install -r requirements.txt
# Start the server
uvicorn app.main:app --host 0.0.0.0 --port 7860
```
The first startup will download two ML models (~90 MB for the embedding model, ~50 MB for the reranker). Subsequent startups use cached models.
Once running:
- Web UI: [http://localhost:7860/ui](http://localhost:7860/ui)
- API docs: [http://localhost:7860/docs](http://localhost:7860/docs)
- Health check: [http://localhost:7860/health](http://localhost:7860/health)
### Step 4: Running with Docker
```bash
# Build and run
docker compose up --build
# Or build and run in detached mode
docker compose up --build -d
```
The Docker build pre-downloads both ML models into the image layer, so container startup is faster. The app is exposed on port 8000 (mapped from container port 7860).
Once running: [http://localhost:8000/ui](http://localhost:8000/ui)
---
## Deployment
### Deploying to HuggingFace Spaces
HuggingFace Spaces provides free hosting for Gradio and Docker-based applications. RagCore is pre-configured for deployment there.
**Step-by-step:**
1. **Create a HuggingFace account** at [https://huggingface.co](https://huggingface.co) if you do not have one.
2. **Create a new Space:**
- Go to [https://huggingface.co/new-space](https://huggingface.co/new-space).
- Choose a name (e.g., `ragcore`).
- Select **Docker** as the SDK.
- Choose the **Free** CPU basic tier.
- Click "Create Space".
3. **Configure secrets:**
- Go to your Space's Settings > Repository secrets.
- Add the following secrets:
- `GEMINI_API_KEY` -- your Google Gemini API key
- `QDRANT_URL` -- your Qdrant Cloud cluster URL
- `QDRANT_API_KEY` -- your Qdrant Cloud API key
4. **Push the code:**
```bash
cd ragcore
git remote add space https://huggingface.co/spaces/YOUR_USERNAME/ragcore
git push space main
```
Alternatively, upload files via the HuggingFace web interface.
5. **Wait for the build** -- the Docker image will be built on HuggingFace's infrastructure. The first build takes 5-10 minutes due to model downloads. The Space will show "Running" when ready.
6. **Access your app** at `https://YOUR_USERNAME-ragcore.hf.space`.
**Important notes:**
- HuggingFace Spaces exposes port 7860 by default, which matches the Dockerfile's `EXPOSE 7860`.
- The free tier has 2 vCPU and 16 GB RAM, which is sufficient for RagCore.
- Spaces may sleep after inactivity. The first request after sleep triggers a cold start (30-60 seconds).
---
## Configuration Reference
All settings are managed via environment variables, loaded from a `.env` file by `pydantic-settings`.
| Variable | Type | Default | Description |
|---|---|---|---|
| `GEMINI_API_KEY` | string | `""` | **Required.** Google Gemini API key for LLM generation. |
| `QDRANT_URL` | string | `""` | **Required.** Full URL of the Qdrant Cloud cluster (including port). |
| `QDRANT_API_KEY` | string | `""` | **Required.** Qdrant Cloud API key for authentication. |
| `EMBEDDING_MODEL` | string | `all-MiniLM-L6-v2` | HuggingFace model name for sentence-transformers. |
| `EMBEDDING_DIM` | integer | `384` | Dimensionality of the embedding vectors. Must match the model. |
| `QDRANT_COLLECTION` | string | `ragcore_docs` | Name of the Qdrant collection to use. Created automatically if missing. |
| `CHUNK_SIZE` | integer | `512` | Maximum number of words per text chunk. |
| `CHUNK_OVERLAP` | integer | `50` | Number of words overlapping between consecutive chunks. |
| `TOP_K` | integer | `10` | Number of chunks retrieved by the hybrid retriever. |
| `RERANK_TOP_K` | integer | `5` | Number of chunks kept after cross-encoder reranking. |
| `DENSE_WEIGHT` | float | `0.6` | Weight for dense (vector) search in RRF fusion. Range: 0.0-1.0. |
| `SPARSE_WEIGHT` | float | `0.4` | Weight for sparse (BM25) search in RRF fusion. Range: 0.0-1.0. |
| `GEMINI_MODEL` | string | `gemini-2.0-flash` | Gemini model identifier. |
| `GEMINI_RPM_LIMIT` | integer | `15` | Maximum requests per minute to the Gemini API. |
| `GEMINI_TEMPERATURE` | float | `0.3` | LLM generation temperature. Lower values produce more deterministic output. |
| `GEMINI_MAX_TOKENS` | integer | `2048` | Maximum number of output tokens per LLM generation. |
| `LOG_LEVEL` | string | `INFO` | Logging level. Valid values: DEBUG, INFO, WARNING, ERROR, CRITICAL. |
| `MAX_FILE_SIZE_MB` | integer | `10` | Maximum allowed file size for upload in megabytes. |
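
To make the variable-to-setting mapping concrete, here is a standard-library stand-in for what `pydantic-settings` does: read an environment variable, coerce it to the type of the default, and fall back to the default when it is unset. This is only a sketch covering a subset of the table; the class layout, `load_settings`, and `missing_required` are illustrative assumptions rather than RagCore's actual settings module.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    # Defaults copied from the table above (subset only).
    gemini_api_key: str = ""
    qdrant_url: str = ""
    qdrant_api_key: str = ""
    chunk_size: int = 512
    chunk_overlap: int = 50
    top_k: int = 10
    rerank_top_k: int = 5
    dense_weight: float = 0.6
    sparse_weight: float = 0.4
    gemini_rpm_limit: int = 15


REQUIRED = ("gemini_api_key", "qdrant_url", "qdrant_api_key")


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from env vars named after the upper-cased field names."""
    source = os.environ if env is None else env
    overrides = {}
    for f in fields(Settings):
        raw = source.get(f.name.upper())
        if raw is not None:
            # Coerce the string using the type of the field's default value.
            overrides[f.name] = type(f.default)(raw)
    return Settings(**overrides)


def missing_required(settings: Settings) -> list[str]:
    """List required variables that are still empty."""
    return [name.upper() for name in REQUIRED if not getattr(settings, name)]
```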
---
## How It Works End-to-End
This section traces a complete user interaction: uploading a PDF and then asking a question about it.
### Phase 1: Document Ingestion
**User action:** Uploads `annual-report-2024.pdf` (2.1 MB, 45 pages) via the Gradio Documents tab.
1. The Gradio UI reads the file and sends it as a multipart POST to `http://localhost:7860/api/ingest`.
2. **Validation** (`ingest.py`):
- Filename is checked: extension `.pdf` is in `SUPPORTED_EXTENSIONS`.
- File size 2.1 MB is under the 10 MB limit.
3. **Parsing** (`parsers.py`):
- `parse_pdf()` creates a `PdfReader` from the bytes.
- Iterates over all 45 pages, extracting text from each.
- Joins page texts with double newlines.
- `clean_text()` normalizes whitespace: collapses 3+ consecutive newlines to 2, collapses horizontal whitespace to single spaces, trims each line.
- Result: ~85,000 characters of cleaned text.
4. **Metadata extraction** (`metadata.py`):
- `extract_title()` returns `"Annual Report 2024 - Acme Corporation"` (first meaningful line).
- `extract_dates()` finds `"2024-03-15"` in the first 2000 chars, parses it to `datetime(2024, 3, 15)`.
- `extract_tags()` finds frequent capitalized phrases: `["acme corporation", "revenue growth", "machine learning", ...]`.
- `get_page_count()` returns `45`.
- Final `DocumentMetadata`: source="annual-report-2024.pdf", doc_type="pdf", title="Annual Report 2024 - Acme Corporation", created_date=2024-03-15, tags=[...], page_count=45.
5. **Chunking** (`chunker.py`):
- Splits the ~85,000 chars into sentences via `(?<=[.!?])\s+`.
- Accumulates sentences until the word count exceeds 512.
- Produces ~32 chunks, each with 50-word overlap with the next.
- Each chunk records start_char, end_char, and chunk_index.
6. **Embedding** (`embedder.py`):
- `embed_texts()` encodes all 32 chunk texts in a single batch (batch_size=64).
- Returns 32 vectors, each of dimension 384, L2-normalized.
7. **Vector storage** (`vectorstore.py`):
- `upsert_chunks()` creates 32 `PointStruct` objects with the vectors and payload.
- Since 32 < 100, they are uploaded in a single batch.
- Each point's payload includes text, document_id, chunk_index, source, doc_type, title, created_date, tags, page_count.
8. **BM25 indexing** (`bm25.py`):
- `add_documents()` tokenizes each chunk (lowercase, remove stop words, remove single chars).
- Appends to the document list and rebuilds the full BM25Okapi index.
9. **Response:** Returns `IngestResponse` with document_id, filename, num_chunks=32, and success message.
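
The cleaning and chunking steps above can be sketched in a few lines. This is an approximation built from the description in steps 3 and 5, not the code in `parsers.py` or `chunker.py`: the function bodies, the order of the cleaning passes, and the way overlap is carried forward (the last `overlap` words of the previous chunk) are assumptions, and character offsets are omitted for brevity.

```python
import re

# Sentence boundary pattern quoted in step 5.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def clean_text(text: str) -> str:
    """Collapse horizontal whitespace, trim lines, cap blank-line runs."""
    text = re.sub(r"[ \t]+", " ", text)
    text = "\n".join(line.strip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Group sentences into chunks of roughly chunk_size words with overlap."""
    sentences = [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []  # words in the chunk being built
    fresh = 0                # words in `current` not carried over as overlap
    for sentence in sentences:
        words = sentence.split()
        # Close the chunk if this sentence would push it past the limit.
        if fresh and len(current) + len(words) > chunk_size:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
            fresh = 0
        current.extend(words)
        fresh += len(words)
    if fresh:
        chunks.append(" ".join(current))
    return chunks
```

In this sketch a single sentence longer than `chunk_size` still becomes its own oversized chunk, since sentences are never split.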
### Phase 2: Querying
**User action:** Types `"What was the revenue growth last year from PDFs?"` in the Ask tab with streaming enabled.
1. The Gradio UI sends a POST to `http://localhost:7860/api/ask` with:
```json
{"query": "What was the revenue growth last year from PDFs?", "top_k": 10, "rerank_top_k": 5, "stream": true, "filters": {"doc_type": "pdf"}}
```
(Note: the UI sets `doc_type` filter from the dropdown if not "All".)
2. **Query analysis** (`query_analyzer.py`):
- Doc type extraction: matches "PDFs" -> `filters.doc_type = "pdf"`.
- Date extraction: matches "last year" -> `filters.date_from = 2025-03-17`, `filters.date_to = 2026-03-17` (a one-year window ending on the query date, 2026-03-17 in this example).
- Clean query: removes "last year" and "PDFs" -> `"What was the revenue growth"`.
- Intent: matches `^(?:what|...)` -> `"factual"`.
- Confidence: 0.5 + 0.1 (doc_type) + 0.1 (date) = 0.7.
3. **Hybrid retrieval** (`retriever.py`):
- Embeds the clean query `"What was the revenue growth"` to a 384-dim vector.
- **Dense search:** Queries Qdrant with the vector, limit=20 (top_k * 2), with filters for doc_type="pdf" and date range. Returns 20 results ranked by cosine similarity.
- **Sparse search:** Tokenizes query to `["what", "revenue", "growth"]` (stop words removed), scores all BM25 documents, returns top 20 by BM25 score. Post-filters by doc_type="pdf".
- **RRF fusion:** For each chunk, computes `score = 0.6 * 1/(60+dense_rank) + 0.4 * 1/(60+sparse_rank)`. A chunk that appears in only one list receives only that list's term, so chunks returned by both searches accumulate higher scores.
- Deduplicates by chunk_id, takes top 10.
4. **Reranking** (`reranker.py`):
- Creates passage pairs: (query, chunk_text) for all 10 retrieved chunks.
- The FlashRank cross-encoder scores each pair jointly.
- Returns the top 5 by cross-encoder score, with updated scores and ranks.
5. **Answer generation** (`generator.py`):
- Builds context with numbered passages:
```
[1] (Source: annual-report-2024.pdf)
Revenue increased by 23% year-over-year...
[2] (Source: annual-report-2024.pdf)
The growth was primarily driven by...
```
- Constructs the SYSTEM_PROMPT with context and query.
- Calls `llm.generate_stream()` which respects the rate limit, then yields text chunks.
6. **Streaming response** (`query.py`):
- Each text chunk from Gemini is wrapped as `data: {"text": "..."}\n\n` (SSE format).
- The Gradio UI accumulates text and renders it progressively in the answer area.
- Final SSE event includes `{"done": true, "sources": [...], "model": "gemini-2.0-flash", "time_ms": 3420}`.
- Gradio formats the sources as styled cards showing filename, score, and snippet.
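
The fusion formula in step 3 can be written as a short function. This is a minimal sketch rather than the code in `retriever.py`: `weighted_rrf` is a hypothetical name, inputs are plain lists of chunk IDs ordered best first, and ranks are assumed to start at 1 (the README does not say whether the implementation counts from 0 or 1). Summing into a dictionary keyed by chunk ID also takes care of deduplication.

```python
def weighted_rrf(
    dense_ids: list[str],
    sparse_ids: list[str],
    dense_weight: float = 0.6,
    sparse_weight: float = 0.4,
    k: int = 60,
    top_k: int = 10,
) -> list[tuple[str, float]]:
    """Fuse two rankings with weighted Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for weight, ranking in ((dense_weight, dense_ids), (sparse_weight, sparse_ids)):
        # Assumption: ranks are 1-based.
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank)
    # Highest fused score first; a chunk absent from a list adds nothing for it.
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ordered[:top_k]
```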
---
## Testing
### Running Tests
```bash
# Run all unit tests (excluding integration tests)
pytest tests/ -v --ignore=tests/test_integration.py -x
# Run a specific test file
pytest tests/test_chunker.py -v
# Run with coverage (install pytest-cov first)
pytest tests/ -v --ignore=tests/test_integration.py --cov=app
```
### Test Coverage
| Test File | Module Under Test | What Is Tested |
|---|---|---|
| `test_chunker.py` | `app.core.chunker` | Empty input, single sentence, multiple chunks, overlap behavior, chunk size limits |
| `test_parsers.py` | `app.utils.parsers` | UTF-8 text, Latin-1 fallback, HTML tag stripping, unsupported extensions, empty files, extension-based dispatch |
| `test_query_analyzer.py` | `app.core.query_analyzer` | Intent classification (factual, comparative, summarize, explanatory), doc type extraction, date extraction, clean query preservation |
| `test_retrieval.py` | `app.core.retriever` | RRF fusion (basic, empty lists, single list, weighted), metadata filter application |
| `test_api.py` | `app.main` (FastAPI) | Health endpoint returns 200 with components, root redirects to `/ui`, `/docs` page loads |
### Test Fixtures
Defined in `tests/conftest.py`:
- `client` -- A FastAPI `TestClient` instance for API testing.
- `sample_text` -- A paragraph about RAG for use in unit tests.
**Note:** Unit tests mock or avoid external dependencies (Qdrant, Gemini). The CI pipeline sets dummy API keys via environment variables. Integration tests (if present in `tests/test_integration.py`) are excluded from the default test run.
---
## CI/CD
### GitHub Actions Pipeline (`.github/workflows/ci.yml`)
The CI pipeline runs on every push to `main` and on every pull request targeting `main`.
**Pipeline steps:**
| Step | Description |
|---|---|
| Checkout | Clones the repository using `actions/checkout@v4` |
| Set up Python | Installs Python 3.12 via `actions/setup-python@v5` |
| Install dependencies | Runs `pip install -r requirements.txt` |
| Lint | Runs `ruff check .` for code style and quality |
| Unit tests | Runs `pytest tests/ -v --ignore=tests/test_integration.py -x` |
**Environment variables set during testing:**
```yaml
env:
GEMINI_API_KEY: "test"
QDRANT_URL: "http://localhost:6333"
QDRANT_API_KEY: "test"
```
These are dummy values that allow the application to initialize its settings without connecting to real services. Tests that would require live connections are either mocked or skipped.
The `-x` flag causes pytest to stop on the first failure for faster feedback.
---
## Performance and Limits
### Free Tier Limits
| Service | Limit | Impact |
|---|---|---|
| **Qdrant Cloud** (free tier) | 1 GB storage | Approximately 500,000-700,000 chunks at 384 dimensions. More than sufficient for thousands of documents. |
| **Google Gemini** (free tier) | 15 requests per minute | RagCore enforces this with built-in rate limiting (4-second minimum interval between calls). Each question costs 1 API call. |
| **HuggingFace Spaces** (free tier) | 2 vCPU, 16 GB RAM | Sufficient for running the embedding model, reranker, and BM25 index concurrently. |
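
The 4-second interval follows directly from the limit: 60 seconds / 15 requests = 4 seconds per request. A minimum-interval limiter of that kind might look like the sketch below. The class name and structure are assumptions, not RagCore's implementation; the clock and sleep functions are injectable so the behavior can be tested without real waiting.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Enforce a minimum gap between calls, derived from a per-minute limit."""

    def __init__(
        self,
        rpm_limit: int = 15,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 60.0 / rpm_limit  # 15 RPM -> 4.0 seconds
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds waited."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        # Record when the call is actually released.
        self._last_call = now + delay
        return delay
```

This sketch is not thread-safe; an application serving concurrent requests would need a lock around `wait()`.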
### Expected Latency
| Operation | Typical Latency | Notes |
|---|---|---|
| Document ingestion (10-page PDF) | 3-8 seconds | Dominated by embedding time on CPU |
| Document ingestion (50-page PDF) | 10-20 seconds | Linear with number of chunks |
| Query (hybrid retrieval only) | 100-300 ms | Embedding + Qdrant + BM25 + RRF |
| Query (full RAG with answer) | 3-8 seconds | Dominated by Gemini API call |
| Query (streaming, time to first token) | 1-3 seconds | Reranking + Gemini startup |
| BM25 rebuild on startup | 50-500 ms | Depends on collection size (scrolls all points from Qdrant) |
| Embedding model cold load | 2-5 seconds | First request only; cached thereafter |
| Reranker model cold load | 1-3 seconds | First request only; cached thereafter |
### Capacity Guidelines
- **Small deployment** (< 100 documents, < 5,000 chunks): Everything runs comfortably within free tiers.
- **Medium deployment** (100-1,000 documents, 5,000-50,000 chunks): BM25 index may use 50-500 MB RAM. Qdrant free tier still has ample space.
- **Large deployment** (> 1,000 documents): Consider upgrading Qdrant to a paid tier and running the embedder on GPU for faster ingestion.
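
For a rough sense of where chunk-count estimates come from, the helper below divides a storage budget by the raw size of one float32 vector plus an optional payload allowance. It is a back-of-the-envelope sketch only: `max_vectors` is a hypothetical helper, the payload size is a guess you supply, and index overhead and Qdrant's storage configuration are ignored, so real capacity will differ.

```python
def max_vectors(
    storage_bytes: int,
    dim: int = 384,
    bytes_per_float: int = 4,  # float32
    payload_bytes: int = 0,    # assumed per-chunk text and metadata size
) -> int:
    """Estimate how many points fit in a storage budget (no index overhead)."""
    per_point = dim * bytes_per_float + payload_bytes
    return storage_bytes // per_point


# Vectors only, 1 GB taken as 10**9 bytes.
vectors_only = max_vectors(10**9)
# Same budget with an assumed ~3 KB of chunk text and metadata per point.
with_payload = max_vectors(10**9, payload_bytes=3000)
```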
---
## Troubleshooting
### Common Errors and Fixes
**Error: `"Unsupported file type '.docx'"` or similar**
Only PDF, TXT, and HTML files are supported. Convert other formats to one of these before uploading. For DOCX files, export to PDF from your word processor.
---
**Error: `"File too large. Maximum size is 10MB"`**
Increase the limit by setting `MAX_FILE_SIZE_MB` in your `.env` file, or split the file into smaller parts.
---
**Error: `"Could not extract text from file"`**
The PDF may be image-based (scanned document) without an embedded text layer. pypdf cannot extract text from images. Use an OCR tool (e.g., Tesseract) to add a text layer first.
---
**Error: Qdrant connection timeout or `"Connection refused"`**
- Verify your `QDRANT_URL` includes the port (typically `:6333`).
- Verify your `QDRANT_API_KEY` is correct.
- Check that your Qdrant Cloud cluster is active (free clusters may be paused after inactivity).
---
**Error: `"Gemini generation failed"` or `"429 Too Many Requests"`**
You have exceeded the Gemini API rate limit. RagCore has built-in rate limiting, but it cannot account for other users or applications sharing the same API key, so the combined request rate can still exceed the limit. Solutions:
- Wait a few seconds and retry.
- Reduce `GEMINI_RPM_LIMIT` to add more buffer between calls.
- Upgrade to a paid Gemini plan for higher limits.
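
The "wait and retry" advice can be automated on the client side with exponential backoff. The sketch below is illustrative only: `RateLimitError` is a hypothetical exception standing in for whatever your client raises on a 429, and the 4-second base delay simply mirrors the default interval implied by `GEMINI_RPM_LIMIT=15`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a 429 response from the API."""


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 4.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, doubling the wait after each rate-limit failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            sleep(base_delay * 2 ** attempt)  # 4 s, 8 s, 16 s, ...
    raise AssertionError("unreachable")
```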
---
**Warning: `"Embedder initialization deferred"`**
This warning during startup means the embedding model could not be loaded immediately. This usually resolves on the first request. If it persists:
- Check internet connectivity (the model needs to be downloaded on first use).
- Ensure sufficient disk space (~200 MB for cached models).
- Check if the `EMBEDDING_MODEL` name is correct.
---
**BM25 index shows 0 documents after restart**
This is expected on first startup with a fresh Qdrant collection. The BM25 index rebuilds from Qdrant on startup. If Qdrant has data but BM25 shows 0, check the Qdrant connection settings.
---
**Gradio UI not loading or showing "Connecting..."**
- Ensure the server is running on port 7860 (or whichever port you configured).
- The Gradio UI communicates with the API via `http://localhost:7860`. If running in Docker, this internal URL is correct. If running behind a reverse proxy, the UI may need adjustment.
---
**Slow first request after startup**
The first request triggers lazy loading of the reranker model. This is a one-time cost of 1-3 seconds. Subsequent requests are fast.
---
**Docker build fails at model download step**
The Dockerfile pre-downloads ML models during the build, which requires internet access while `docker build` runs. If you are building behind a corporate proxy, configure Docker's proxy settings. A failed download fails the build; retrying usually resolves transient network issues.