---
title: CV RAG Search demo
emoji: 🌖
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
short_description: RAG search among chunks from resume PDFs
---
# CV RAG Search demo
A production-ready Retrieval-Augmented Generation (RAG) system demonstrating advanced techniques for semantic document search and AI-powered synthesis. Built as a CV/resume search assistant, this project showcases enterprise-grade patterns for building intelligent document retrieval systems.
## Two-Part Architecture
This project consists of two distinct components that work together:
| Component | Script | Purpose |
|---|---|---|
| 1. Indexer (Vectorizer) | `scripts/indexer.py` | Converts PDFs to markdown, chunks documents, extracts metadata, and builds the vector database |
| 2. Search Agent | `app.py` | Provides an AI-powered chat interface with a RAG search tool for querying the indexed documents |
## Part 1: Indexer (Vectorizer)

The indexer transforms raw PDF documents into a searchable vector database. Run it with `python scripts/indexer.py`.
### PDF to Markdown Conversion

The system supports two strategies for converting PDFs to structured Markdown, controlled by the `MARKDOWN_CREATION_MODE` environment variable:
| Mode | Description | Best For |
|---|---|---|
| `llm` (default) | Sends extracted PDF text to the LLM for intelligent structural analysis | Complex layouts, CVs with varied formatting |
| `pymupdf` | Uses the `pymupdf4llm` library for local conversion | Simple documents with clear visual hierarchy |
Why LLM-based conversion is the default:
Traditional PDF-to-Markdown tools like pymupdf4llm rely on visual layout analysis, which struggles with documents that don't follow a strict tree structure. The LLM-based approach understands semantic meaning and produces consistently better hierarchical structure, especially for CVs/resumes where formatting varies widely.
```python
# The LLM receives extracted text and generates structured markdown
system_prompt = """You are an advanced pdf-to-markdown generator...
1. Generate a hierarchical structure using appropriate markdown headers (#, ##, ###)
2. Maintain all text content accurately without summarizing or omitting information
3. Respond only the pure generated markdown text, no preamble or explanation."""
```
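
For reference, a minimal sketch of how the two modes might be dispatched; `convert_pdf_to_markdown` and `generate_markdown_with_llm` are illustrative names introduced here, while `pymupdf4llm.to_markdown()` and the PyMuPDF text extraction are existing library calls:

```python
import pymupdf          # PyMuPDF, used for raw text extraction
import pymupdf4llm      # layout-based markdown conversion

def convert_pdf_to_markdown(pdf_path: str, mode: str = "llm") -> str:
    """Illustrative dispatcher for MARKDOWN_CREATION_MODE."""
    if mode == "pymupdf":
        # Local conversion based on visual layout analysis
        return pymupdf4llm.to_markdown(pdf_path)
    # LLM mode: extract plain text, then ask the LLM to impose structure
    raw_text = "\n".join(page.get_text() for page in pymupdf.open(pdf_path))
    return generate_markdown_with_llm(raw_text)  # hypothetical helper wrapping the prompt above
```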
### Metadata Prepending for Context Preservation

A critical challenge in RAG systems is context loss during chunking: when documents are split into smaller pieces, the chunks lose awareness of their position within the document hierarchy.

This project addresses this with a two-pass splitting strategy combined with metadata enrichment:
#### Step 1: Structure-Aware Splitting
```python
# First pass: Split by document headers to preserve hierarchy
MarkdownHeaderTextSplitter(headers_to_split_on=[
    ("#", "h1"), ("##", "h2"), ("###", "h3")
])

# Second pass: Size-based splitting with overlap
RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
```
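
Chained together, the two passes look roughly like this (import paths assume the `langchain-text-splitters` package; `markdown_text` is the converted document from the previous step):

```python
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
size_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# First pass: one Document per logical section, with h1/h2/h3 stored in metadata
sections = header_splitter.split_text(markdown_text)
# Second pass: split oversized sections into overlapping chunks, keeping that metadata
chunks = size_splitter.split_documents(sections)
```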
#### Step 2: Breadcrumb Construction

Each chunk receives a contextual breadcrumb path from its parent headers:
```python
breadcrumb = " > ".join([h1, h2, h3])  # e.g., "Experience > Technical Skills > Python"
```
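
Since deeper header levels may be missing for a given chunk, a guarded variant (a sketch using the `h1`/`h2`/`h3` metadata keys from Step 1) skips empty segments:

```python
def build_breadcrumb(metadata: dict) -> str:
    """Join whichever header levels are present, e.g. 'Experience > Technical Skills > Python'."""
    parts = [metadata[key] for key in ("h1", "h2", "h3") if metadata.get(key)]
    return " > ".join(parts)
```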
#### Step 3: Content Prepending

Metadata is prepended directly to chunk content before embedding:
```
[Name: Alice Smith | Section: Experience > Technical Skills > Python]
5 years of Python development focusing on web applications...
```
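
A minimal sketch of this enrichment step, assuming each chunk is a LangChain `Document` whose metadata already carries the person's name:

```python
for chunk in chunks:
    breadcrumb = build_breadcrumb(chunk.metadata)            # from Step 2
    person = chunk.metadata.get("name", "Unknown")
    header = f"[Name: {person} | Section: {breadcrumb}]"
    chunk.page_content = f"{header}\n{chunk.page_content}"   # embedded together with the content
```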
Benefits:
- Semantic search includes contextual keywords in vector embeddings
- Reduces hallucination by providing explicit context to the LLM
- Enables filtering and attribution in search results
- Improves relevance in multi-document scenarios
### LLM-Based Metadata Extraction

The indexer uses the LLM itself to extract structured metadata from unstructured documents:
```python
def extract_person_name_from_cv(markdown_content: str) -> str | None:
    """Uses LLM to extract person's name from CV content."""
    prompt = f"""Extract the person's full name from this CV...
    CV Content (first 2000 chars):
    {markdown_content[:2000]}
    """
    return llm.generate(prompt)
```
This extracted name is then included in every chunk's metadata, enabling person-specific queries.
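
A hedged sketch of that propagation step (variable and metadata key names are illustrative):

```python
person_name = extract_person_name_from_cv(markdown_text)
for chunk in chunks:
    chunk.metadata["name"] = person_name or "Unknown"
    chunk.metadata["source"] = pdf_path   # keeps per-file attribution for search results
```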
### Batch Embedding Generation

Large document sets are processed in configurable batches to prevent memory exhaustion:
```python
for i in range(0, total_chunks, batch_size):
    batch = chunks[i:i + batch_size]
    vectorstore.add_documents(batch)
```
### Indexing Pipeline
```text
PDF Documents
                 ↓
┌─────────────────────────────────┐
│  PDF → Markdown Conversion      │  LLM-based (default) or pymupdf4llm
└─────────────────────────────────┘
                 ↓
┌─────────────────────────────────┐
│  MarkdownHeaderTextSplitter     │  Header-aware logical sections
└─────────────────────────────────┘
                 ↓
┌─────────────────────────────────┐
│  RecursiveCharacterTextSplitter │  Size-based chunks with overlap
└─────────────────────────────────┘
                 ↓
┌─────────────────────────────────┐
│  LLM Metadata Extraction        │  Extract names, entities
└─────────────────────────────────┘
                 ↓
┌─────────────────────────────────┐
│  Metadata Prepending            │  Embed context into content
└─────────────────────────────────┘
                 ↓
┌─────────────────────────────────┐
│  Embedding Model                │  Generate vector representations
└─────────────────────────────────┘
                 ↓
┌─────────────────────────────────┐
│  ChromaDB                       │  Persist vectors with metadata
└─────────────────────────────────┘
```
## Part 2: Search Agent

The search agent provides an interactive chat interface powered by smolagents' `CodeAgent`. Run it with `python app.py`.
### RAG Search Tool

The agent has access to a `rag_search()` tool that performs semantic similarity search against the indexed documents:
```python
@tool
def rag_search(query: str) -> str:
    """Searches through indexed CVs and returns relevant chunks with metadata."""
    results = vectorstore.similarity_search(query, k=TOP_K_RESULTS)
    return format_results_for_llm(results)
```
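
For context, this is roughly how such a tool can be attached to a smolagents `CodeAgent`; the actual `app.py` adds streaming and the Gradio UI on top:

```python
from smolagents import CodeAgent

agent = CodeAgent(tools=[rag_search], model=get_llm_model())
answer = agent.run("Which candidates have Kubernetes experience?")
```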
### Query Decomposition Guidance

The system prompt guides the LLM to decompose complex comparative queries:
Query: "Compare Alice's and Bob's Python skills"
→ rag_search("Alice Python experience")
→ rag_search("Bob Python experience")
→ Synthesize comparison from both results
### Dynamic Result Formatting

Search results are formatted with human-readable metadata headers for LLM consumption:
```
[Name: Alice Smith | Section: Technical Skills | Source: alice_cv.pdf]
Content of the chunk...
```
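
A sketch of what `format_results_for_llm()` might look like, assuming the metadata keys shown above (the project's actual key names may differ):

```python
def format_results_for_llm(results) -> str:
    """Render retrieved chunks with a human-readable metadata header per chunk."""
    blocks = []
    for doc in results:
        meta = doc.metadata
        header = (
            f"[Name: {meta.get('name', 'Unknown')} | "
            f"Section: {meta.get('section', 'N/A')} | "
            f"Source: {meta.get('source', 'unknown')}]"
        )
        blocks.append(f"{header}\n{doc.page_content}")
    return "\n\n".join(blocks)
```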
### Vectorstore Caching

ChromaDB instances are cached per category using a singleton pattern, avoiding repeated initialization overhead:
```python
_vectorstore_cache: dict[str, Chroma] = {}

def _get_vectorstore(category: str) -> Chroma:
    if category not in _vectorstore_cache:
        _vectorstore_cache[category] = Chroma(persist_directory=...)
    return _vectorstore_cache[category]
```
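
Usage then looks like this; the category name mirrors the `rag/sources/cvs/` folder used by the indexer:

```python
cv_store = _get_vectorstore("cvs")      # first call builds and caches the Chroma instance
same_store = _get_vectorstore("cvs")    # later calls reuse the cached instance
assert cv_store is same_store
```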
### Query Pipeline
```text
User Question
                 ↓
┌─────────────────────────────────┐
│  CodeAgent (smolagents)         │  Autonomous reasoning agent
└─────────────────────────────────┘
                 ↓
┌─────────────────────────────────┐
│  rag_search() Tool              │  Vectorstore similarity search
└─────────────────────────────────┘
                 ↓
┌─────────────────────────────────┐
│  Result Formatting              │  Structured output with metadata
└─────────────────────────────────┘
                 ↓
┌─────────────────────────────────┐
│  LLM Synthesis                  │  Generate contextual response
└─────────────────────────────────┘
                 ↓
┌─────────────────────────────────┐
│  Gradio Streaming UI            │  Real-time response display
└─────────────────────────────────┘
```
## Shared Infrastructure

### Abstract Model Provider Architecture

The system implements a factory pattern for model providers, enabling seamless switching between inference backends without code changes:
```python
from typing import Literal

# Supports three LLM backends
MODEL_PROVIDER_TYPE = Literal["hf-local", "hf-remote", "openai"]
```
| Provider | Implementation | Use Case |
|---|---|---|
| `hf-local` | `TransformersModel` | Local GPU/CPU inference with downloaded models |
| `hf-remote` | `InferenceClientModel` | HuggingFace Inference API |
| `openai` | `OpenAIServerModel` | OpenAI API or compatible services (Ollama, LM Studio) |
Both the LLM and the embedding model use a singleton pattern with lazy initialization, ensuring efficient resource usage:
```python
def get_llm_model() -> TransformersModel | InferenceClientModel | OpenAIServerModel:
    global _llm_model
    if _llm_model is None:
        _llm_model = _create_model_based_on_config()
    return _llm_model
```
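
A hedged sketch of what the factory behind `_create_model_based_on_config()` might dispatch on, using the smolagents model classes from the table above (the `config` attributes and constructor arguments are illustrative):

```python
from smolagents import InferenceClientModel, OpenAIServerModel, TransformersModel

import config  # project configuration module (config.py)

def _create_model_based_on_config():
    provider = config.LLM_MODEL_PROVIDER  # "hf-local" | "hf-remote" | "openai"
    if provider == "hf-local":
        return TransformersModel(model_id=config.LLM_MODEL)
    if provider == "hf-remote":
        return InferenceClientModel(model_id=config.LLM_MODEL, token=config.HF_TOKEN)
    return OpenAIServerModel(
        model_id=config.LLM_MODEL,
        api_base=config.LOCAL_LLM_BASE,
        api_key=config.OPENAI_API_KEY,
    )
```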
## Tech Stack
| Component | Technology |
|---|---|
| Vector Database | ChromaDB |
| Text Processing | LangChain (splitters, embeddings) |
| Embeddings | HuggingFace / OpenAI |
| LLM Framework | smolagents (CodeAgent) |
| PDF Processing | LLM-based conversion (default) / pymupdf4llm |
| Web Interface | Gradio |
## Project Structure
```text
rag-tool-demo/
├── app.py                    # Gradio chat interface & agent setup
├── config.py                 # Centralized configuration management
├── app_types.py              # Type definitions
├── lib/
│   └── model_provider.py     # Abstract model provider factory
├── tools/
│   └── rag_search.py         # RAG search tool implementation
├── scripts/
│   └── indexer.py            # Document indexing pipeline
└── rag/
    ├── sources/              # Source documents (PDFs)
    └── chroma_db/            # Persisted vector database
```
## Configuration

All parameters are configurable via environment variables or a `.env` file:
```bash
# Model Providers
LLM_MODEL_PROVIDER=openai          # hf-local | hf-remote | openai
EMBEDDING_MODEL_PROVIDER=hf-local

# Models
LLM_MODEL=google/gemma-3-4b-it
EMBEDDING_MODEL=Qwen/Qwen3-Embedding-0.6B

# RAG Parameters
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
TOP_K_RESULTS=5

# PDF Conversion
MARKDOWN_CREATION_MODE=llm         # llm (default) | pymupdf

# API Configuration
OPENAI_API_KEY=your-key
LOCAL_LLM_BASE=http://localhost:1234/v1
HF_TOKEN=your-token
```
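
A sketch of how `config.py` might read these values with `python-dotenv` (variable names mirror the keys above; the defaults are illustrative):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # pulls values from .env into the process environment

LLM_MODEL_PROVIDER = os.getenv("LLM_MODEL_PROVIDER", "openai")
LLM_MODEL = os.getenv("LLM_MODEL", "google/gemma-3-4b-it")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "200"))
TOP_K_RESULTS = int(os.getenv("TOP_K_RESULTS", "5"))
```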
## Getting Started

### Prerequisites

- Python 3.11+
- CUDA-compatible GPU (optional, for local inference)
### Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/rag-tool-demo.git
cd rag-tool-demo

# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your settings
```
### Index Documents

```bash
# Place PDF documents in rag/sources/cvs/
python scripts/indexer.py
```
### Run the Application

```bash
python app.py
```

The Gradio interface will be available at http://localhost:7860.
## Design Patterns
| Pattern | Application |
|---|---|
| Factory | Model provider selection based on configuration |
| Singleton | Cached model and vectorstore instances |
| Strategy | Pluggable text splitters and embedding models |
| Decorator | @tool decorator for RAG search function |
## Author

## License
MIT