---
title: RagCore
emoji: 🔍
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# RagCore

**A production-ready Retrieval-Augmented Generation system with hybrid search, metadata filtering, and a conversational UI.**

RagCore solves the problem of querying unstructured documents (PDFs, text files, HTML pages) using natural language. It ingests documents, splits them into semantically meaningful chunks, indexes them in both a vector database and a BM25 keyword index, then retrieves and reranks the most relevant passages to generate grounded, citation-backed answers using Google Gemini.

Unlike naive RAG implementations that rely solely on vector similarity, RagCore combines dense (semantic) and sparse (keyword) retrieval using Reciprocal Rank Fusion, applies a cross-encoder reranker to promote the most relevant passages, and uses an intelligent query analyzer that automatically extracts filters (date ranges, document types, sources) from natural language queries.

---

## Table of Contents

1. [Architecture Overview](#architecture-overview)
2. [Tech Stack](#tech-stack)
3. [Project Structure](#project-structure)
4. [Core Components Deep Dive](#core-components-deep-dive)
5. [Data Models](#data-models)
6. [API Reference](#api-reference)
7. [UI Guide](#ui-guide)
8. [Setup and Installation](#setup-and-installation)
9. [Deployment](#deployment)
10. [Configuration Reference](#configuration-reference)
11. [How It Works End-to-End](#how-it-works-end-to-end)
12. [Testing](#testing)
13. [CI/CD](#cicd)
14. [Performance and Limits](#performance-and-limits)
15. [Troubleshooting](#troubleshooting)

---

## Architecture Overview

RagCore is built as a FastAPI application with two main pipelines: **Ingestion** and **Query**. A Gradio-based UI is mounted directly onto the FastAPI app at `/ui`.
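
To make the hybrid-retrieval idea concrete before the pipelines are walked through, here is a minimal sketch of weighted Reciprocal Rank Fusion. It is an illustration only, not RagCore's actual code: the function name `rrf_fuse`, the chunk IDs, and the smoothing constant `k=60` are assumptions, while the 0.6/0.4 weights mirror the defaults described in the query pipeline.

```python
from collections import defaultdict


def rrf_fuse(dense_ids, sparse_ids, dense_weight=0.6, sparse_weight=0.4, k=60):
    """Fuse two ranked lists of chunk IDs with weighted Reciprocal Rank Fusion.

    Each list contributes weight / (k + rank) to a chunk's score, where rank
    is 1-based. `k` is a smoothing constant (60 is an assumed, commonly used
    value, not one taken from RagCore).
    """
    scores = defaultdict(float)
    for weight, ranking in ((dense_weight, dense_ids), (sparse_weight, sparse_ids)):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical chunk IDs, as if returned by the dense and sparse searches.
dense = ["c1", "c2", "c3"]
sparse = ["c3", "c1", "c4"]
fused = rrf_fuse(dense, sparse)
print(fused)
```

A chunk that appears in both lists (such as `c1` or `c3` here) accumulates score from each, which is why fusion tends to favor passages that both retrievers agree on.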

### Ingestion Pipeline

```
+------------------+     +----------------+     +-------------------+
|   File Upload    | --> |     Parser     | --> |   Text Cleaner    |
|  (PDF/TXT/HTML)  |     |  (pypdf/bs4)   |     |  (regex cleanup)  |
+------------------+     +----------------+     +-------------------+
                                                          |
                                                          v
+------------------+     +----------------+     +-------------------+
|   Qdrant Cloud   | <-- |    Embedder    | <-- |      Chunker      |
|  (vector store)  |     | (MiniLM-L6-v2) |     | (sentence-aware)  |
+------------------+     +----------------+     +-------------------+
         |                                                |
         |                                                v
         |                                      +-------------------+
         +------------------------------------> |    BM25 Index     |
                                                |    (in-memory)    |
                                                +-------------------+
                                                          ^
                                                          |
                                                +-------------------+
                                                | Metadata Extractor|
                                                | (title/dates/tags)|
                                                +-------------------+
```

**Step-by-step flow:**

1. User uploads a file via the `/api/ingest` endpoint or the Gradio UI.
2. The **Parser** detects file type by extension and extracts raw text (pypdf for PDFs, BeautifulSoup for HTML, direct decoding for TXT).
3. The **Text Cleaner** normalizes whitespace, collapses blank lines, and trims each line.
4. The **Metadata Extractor** pulls out the document title (first non-empty line), dates (via regex patterns), and tags (frequent capitalized phrases).
5. The **Chunker** splits text into overlapping chunks at sentence boundaries, respecting a configurable word-count limit.
6. The **Embedder** encodes each chunk into a 384-dimensional vector using the `all-MiniLM-L6-v2` sentence transformer.
7. Chunks with their vectors and payload metadata are upserted into **Qdrant Cloud** in batches of 100.
8. The same chunks are added to the in-memory **BM25 index** for keyword search.

### Query Pipeline

```
+------------------+     +-------------------+     +------------------+
|   User Query     | --> |  Query Analyzer   | --> |  Hybrid Retriever|
|  "What is RAG    |     | (intent, filters, |     |                  |
|   from PDFs?"    |     |  cleaned query)   |     |  +----------+    |
+------------------+     +-------------------+     |  |Dense     |    |
                                                   |  |(Qdrant)  |    |
                                                   |  +----------+    |
                                                   |       |          |
                                                   |  +----------+    |
                                                   |  |Sparse    |    |
                                                   |  |(BM25)    |    |
                                                   |  +----------+    |
                                                   |       |          |
                                                   |  +----------+    |
                                                   |  |RRF Fusion|    |
                                                   |  +----------+    |
                                                   +------------------+
                                                            |
                                                            v
                         +-------------------+     +------------------+
                         | Answer Generator  | <-- |    Reranker      |
                         |  (Gemini Flash)   |     |   (FlashRank)    |
                         +-------------------+     +------------------+
                                   |
                                   v
                         +-------------------+
                         |   Cited Answer    |
                         |   with Sources    |
                         +-------------------+
```

**Step-by-step flow:**

1. User submits a natural language query.
2. The **Query Analyzer** classifies intent (factual, summarize, comparative, list, explanatory), extracts inline filters (doc type, date range, source filename), and produces a cleaned query.
3. The **Hybrid Retriever** runs two parallel searches:
   - **Dense search**: encodes the query with the same embedding model, queries Qdrant with cosine similarity, fetching `top_k * 2` results.
   - **Sparse search**: tokenizes the query and scores all chunks via BM25Okapi, also fetching `top_k * 2` results.
4. Results are fused using **Reciprocal Rank Fusion (RRF)** with configurable weights (default: 0.6 dense, 0.4 sparse).
5. The top-K fused results are passed to the **Reranker** (FlashRank cross-encoder), which rescores and selects the best 5 passages.
6. The **Answer Generator** builds a prompt with numbered context passages and sends it to **Google Gemini Flash**, which generates a cited, markdown-formatted answer.
7. The answer is returned with source references (streaming or non-streaming).

---

## Tech Stack

| Technology | Version | Purpose |
|---|---|---|
| **Python** | 3.12 | Runtime language. Chosen for its ML/NLP ecosystem. |
| **FastAPI** | >=0.110 | Async web framework. High performance, automatic OpenAPI docs, dependency injection. |
| **Uvicorn** | >=0.29 | ASGI server for running FastAPI in production. |
| **Pydantic** | >=2.6 | Data validation and serialization for all request/response models. |
| **pydantic-settings** | >=2.2 | Environment-based configuration with `.env` file support. |
| **sentence-transformers** | >=2.6 | Embedding model loading and inference (`all-MiniLM-L6-v2`). Chosen for fast CPU inference and high quality at 384 dimensions. |
| **qdrant-client** | >=1.8 | Client for Qdrant vector database. Chosen for its generous free tier (1GB), filtering support, and payload storage. |
| **rank-bm25** | >=0.2.2 | BM25Okapi implementation for sparse keyword retrieval. Lightweight and pure-Python; its only dependency is NumPy. |
| **FlashRank** | >=0.2 | Ultra-fast cross-encoder reranker (`ms-marco-MiniLM-L-12-v2`). Runs on CPU, no GPU required. |
| **google-generativeai** | >=0.5 | Official Google Gemini SDK. Gemini 2.0 Flash offers a free tier with 15 RPM. |
| **Gradio** | >=4.20 | Web UI framework mounted directly on FastAPI. Two-tab interface for Q&A and document management. |
| **pypdf** | >=4.1 | PDF text extraction. Handles most PDF formats without external system dependencies. |
| **beautifulsoup4** | >=4.12 | HTML parsing with tag stripping (removes scripts, styles, nav, footer, header). |
| **httpx** | >=0.27 | Async/sync HTTP client used by the Gradio UI to call the FastAPI backend. |
| **python-multipart** | >=0.0.9 | Required by FastAPI for file upload support. |
| **python-dateutil** | >=2.9 | Fuzzy date parsing for the query analyzer's absolute date extraction. |
| **Ruff** | >=0.3 | Fast Python linter. Used in CI for code quality checks. |
| **pytest** | >=8.0 | Test framework. Unit tests for chunker, parsers, query analyzer, retrieval, and API. |
| **Docker** | - | Containerization. Pre-downloads ML models in the build step for fast cold starts. |

---

## Project Structure

```
ragcore/
|-- .github/
|   +-- workflows/
|       +-- ci.yml              # GitHub Actions CI pipeline (lint + test)
|-- app/
|   |-- __init__.py
|   |-- config.py               # Settings class with all env vars, setup_logging()
|   |-- main.py                 # FastAPI app creation, lifespan, middleware, routing
|   |-- api/
|   |   |-- __init__.py
|   |   |-- deps.py             # Dependency injection factories for all services
|   |   +-- routes/
|   |       |-- __init__.py
|   |       |-- health.py       # GET /health endpoint
|   |       |-- ingest.py       # POST /api/ingest, GET /api/documents, DELETE /api/documents/{id}
|   |       +-- query.py        # POST /api/search, POST /api/ask (with streaming)
|   |-- core/
|   |   |-- __init__.py
|   |   |-- bm25.py             # BM25 index: tokenization, search, rebuild from vectorstore
|   |   |-- chunker.py          # Sentence-aware text chunking with overlap
|   |   |-- embedder.py         # SentenceTransformer embedding service
|   |   |-- generator.py        # Answer generation with prompt templates and streaming
|   |   |-- llm.py              # Gemini API client with rate limiting
|   |   |-- metadata.py         # Metadata extraction (title, dates, tags)
|   |   |-- query_analyzer.py   # Query intent classification and filter extraction
|   |   |-- reranker.py         # FlashRank cross-encoder reranking
|   |   |-- retriever.py        # Hybrid retriever with RRF fusion
|   |   +-- vectorstore.py      # Qdrant client wrapper (CRUD, search, filtering)
|   |-- models/
|   |   |-- __init__.py
|   |   |-- document.py         # DocumentMetadata, Chunk, Document models
|   |   +-- schemas.py          # API request/response schemas (IngestResponse, QueryRequest, etc.)
|   |-- ui/
|   |   |-- __init__.py
|   |   +-- gradio_app.py       # Gradio Blocks UI (Ask tab, Documents tab)
|   +-- utils/
|       |-- __init__.py
|       |-- helpers.py          # generate_id, clean_text, count_words, timer, retry_with_backoff
|       +-- parsers.py          # File parsing (PDF, TXT, HTML) and page count extraction
|-- tests/
|   |-- __init__.py
|   |-- conftest.py             # Shared fixtures (TestClient, sample_text)
|   |-- test_api.py             # API integration tests (health, redirect, docs)
|   |-- test_chunker.py         # Chunker unit tests (empty, single, multiple, overlap)
|   |-- test_parsers.py         # Parser unit tests (UTF-8, Latin-1, HTML, unsupported)
|   |-- test_query_analyzer.py  # Query analyzer tests (intents, filters, dates)
|   +-- test_retrieval.py       # RRF fusion tests (basic, empty, weights, filters)
|-- .dockerignore
|-- .env                        # Environment variables (not committed to git)
|-- .gitignore
|-- Dockerfile                  # Python 3.12-slim, pre-downloads ML models
|-- docker-compose.yml          # Single-service compose with env_file
+-- requirements.txt            # All Python dependencies with version constraints
```

---

## Core Components Deep Dive

### Parsers (`app/utils/parsers.py`)

**What it does:** Extracts raw text from uploaded files based on their extension.

**Supported formats:** `.pdf`, `.txt`, `.html`, `.htm`

**How it works internally:**

- `parse_document(file_bytes, filename)` is the main dispatcher. It reads the file extension and calls the appropriate parser.
- **PDF parsing** uses `pypdf.PdfReader` to iterate over all pages, extract text from each, and join them with double newlines.
- **HTML parsing** uses `BeautifulSoup` with the `html.parser` backend. Before extracting text, it decomposes `