---
title: CV RAG Search demo
emoji: 🌖
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
short_description: RAG search among chunks from resume PDFs
---

# CV RAG Search demo

A production-ready Retrieval-Augmented Generation (RAG) system demonstrating advanced techniques for semantic document search and AI-powered synthesis. Built as a CV/resume search assistant, this project showcases enterprise-grade patterns for building intelligent document retrieval systems.

## Two-Part Architecture

This project consists of two distinct components that work together:

| Component | Script | Purpose |
| --- | --- | --- |
| 1. Indexer (Vectorizer) | `scripts/indexer.py` | Converts PDFs to markdown, chunks documents, extracts metadata, and builds the vector database |
| 2. Search Agent | `app.py` | Provides an AI-powered chat interface with a RAG search tool for querying the indexed documents |

## Part 1: Indexer (Vectorizer)

The indexer transforms raw PDF documents into a searchable vector database. Run it with `python scripts/indexer.py`.

### PDF to Markdown Conversion

The system supports two strategies for converting PDFs to structured Markdown, controlled by the `MARKDOWN_CREATION_MODE` environment variable:

| Mode | Description | Best For |
| --- | --- | --- |
| `llm` (default) | Sends extracted PDF text to the LLM for intelligent structural analysis | Complex layouts, CVs with varied formatting |
| `pymupdf` | Uses the `pymupdf4llm` library for local conversion | Simple documents with a clear visual hierarchy |

**Why LLM-based conversion is the default:**

Traditional PDF-to-Markdown tools like `pymupdf4llm` rely on visual layout analysis, which struggles with documents that don't follow a strict tree structure. The LLM-based approach understands semantic meaning and produces consistently better hierarchical structure, especially for CVs/resumes where formatting varies widely.

```python
# The LLM receives extracted text and generates structured markdown
system_prompt = """You are an advanced pdf-to-markdown generator...
1. Generate a hierarchical structure using appropriate markdown headers (#, ##, ###)
2. Maintain all text content accurately without summarizing or omitting information
3. Respond only the pure generated markdown text, no preamble or explanation."""
```
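For comparison, the `pymupdf` mode hands conversion entirely to the library; a minimal sketch of that path (the file path is illustrative, matching the `rag/sources/cvs/` layout described below):

```python
import pymupdf4llm

# pymupdf mode: local, layout-based conversion with no LLM call involved
markdown_text = pymupdf4llm.to_markdown("rag/sources/cvs/alice_cv.pdf")
```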

### Metadata Prepending for Context Preservation

A critical challenge in RAG systems is context loss during chunking: when documents are split into smaller pieces, the chunks lose awareness of their position within the document hierarchy.

This project addresses that with a two-pass splitting strategy plus metadata enrichment:

#### Step 1: Structure-Aware Splitting

```python
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

# First pass: split by document headers to preserve hierarchy
header_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[
    ("#", "h1"), ("##", "h2"), ("###", "h3")
])

# Second pass: size-based splitting with overlap
size_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
```
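A sketch of how the two passes might be chained (variable names are illustrative):

```python
# First pass yields one Document per header section, with h1/h2/h3 in its metadata
sections = header_splitter.split_text(markdown_text)

# Second pass breaks oversized sections into ~1000-character chunks with overlap
chunks = size_splitter.split_documents(sections)
```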

#### Step 2: Breadcrumb Construction

Each chunk receives a contextual breadcrumb path from its parent headers:

```python
breadcrumb = " > ".join([h1, h2, h3])  # e.g., "Experience > Technical Skills > Python"
```
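In practice the header values live in the metadata written by the first-pass splitter, and not every chunk carries all three levels; a minimal sketch, assuming the `h1`/`h2`/`h3` keys configured above:

```python
def build_breadcrumb(metadata: dict) -> str:
    """Joins whichever header levels are present into a breadcrumb path."""
    parts = (metadata.get(key) for key in ("h1", "h2", "h3"))
    return " > ".join(part for part in parts if part)
```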

#### Step 3: Content Prepending

Metadata is prepended directly to chunk content before embedding:

```
[Name: Alice Smith | Section: Experience > Technical Skills > Python]

5 years of Python development focusing on web applications...
```
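A minimal sketch of that step, reusing the hypothetical `build_breadcrumb` helper above on a LangChain chunk:

```python
def prepend_context(chunk, person_name: str) -> str:
    """Builds the text that actually gets embedded: context header plus original content."""
    header = f"[Name: {person_name} | Section: {build_breadcrumb(chunk.metadata)}]"
    return f"{header}\n\n{chunk.page_content}"
```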

**Benefits:**

- Semantic search includes contextual keywords in vector embeddings
- Reduces hallucination by providing explicit context to the LLM
- Enables filtering and attribution in search results
- Improves relevance in multi-document scenarios

### LLM-Based Metadata Extraction

The indexer uses the LLM itself to extract structured metadata from unstructured documents:

```python
def extract_person_name_from_cv(markdown_content: str) -> str | None:
    """Uses LLM to extract person's name from CV content."""
    prompt = f"""Extract the person's full name from this CV...
    CV Content (first 2000 chars):
    {markdown_content[:2000]}
    """
    return llm.generate(prompt)
```

This extracted name is then included in every chunk's metadata, enabling person-specific queries.
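A sketch of how that extraction might be wired into the chunk metadata (the `person_name` key is illustrative):

```python
person_name = extract_person_name_from_cv(markdown_text)

for chunk in chunks:
    # Stored alongside the section/source metadata so results can be attributed per person
    chunk.metadata["person_name"] = person_name or "Unknown"
```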

### Batch Embedding Generation

Large document sets are processed in configurable batches to prevent memory exhaustion:

```python
for i in range(0, total_chunks, batch_size):
    # Embed and persist one bounded slice at a time
    batch = chunks[i:i + batch_size]
    vectorstore.add_documents(batch)
```

### Indexing Pipeline

```
PDF Documents
      ↓
┌─────────────────────────────────┐
│  PDF → Markdown Conversion      │  LLM-based (default) or pymupdf4llm
└─────────────────────────────────┘
      ↓
┌─────────────────────────────────┐
│  MarkdownHeaderTextSplitter     │  Header-aware logical sections
└─────────────────────────────────┘
      ↓
┌─────────────────────────────────┐
│  RecursiveCharacterTextSplitter │  Size-based chunks with overlap
└─────────────────────────────────┘
      ↓
┌─────────────────────────────────┐
│  LLM Metadata Extraction        │  Extract names, entities
└─────────────────────────────────┘
      ↓
┌─────────────────────────────────┐
│  Metadata Prepending            │  Embed context into content
└─────────────────────────────────┘
      ↓
┌─────────────────────────────────┐
│  Embedding Model                │  Generate vector representations
└─────────────────────────────────┘
      ↓
┌─────────────────────────────────┐
│  ChromaDB                       │  Persist vectors with metadata
└─────────────────────────────────┘
```

## Part 2: Search Agent

The search agent provides an interactive chat interface powered by smolagents' `CodeAgent`. Run it with `python app.py`.

### RAG Search Tool

The agent has access to a `rag_search()` tool that performs semantic similarity search against the indexed documents:

```python
from smolagents import tool

@tool
def rag_search(query: str) -> str:
    """Searches through indexed CVs and returns relevant chunks with metadata.

    Args:
        query: Natural-language query to run against the vector store.
    """
    results = vectorstore.similarity_search(query, k=TOP_K_RESULTS)
    return format_results_for_llm(results)
```
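A sketch of how the tool might be handed to the agent (`CodeAgent(tools=..., model=...)` is the smolagents constructor; `get_llm_model()` is this project's provider helper, shown further below):

```python
from smolagents import CodeAgent

# The agent decides on its own when and how often to call rag_search()
agent = CodeAgent(tools=[rag_search], model=get_llm_model())
answer = agent.run("Which candidates have recent Python experience?")
```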

### Query Decomposition Guidance

The system prompt guides the LLM to decompose complex comparative queries:

```
Query: "Compare Alice's and Bob's Python skills"
→ rag_search("Alice Python experience")
→ rag_search("Bob Python experience")
→ Synthesize comparison from both results
```
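The guidance itself is plain prompt text; a hypothetical fragment of what such an instruction might look like (wording is illustrative, not the project's actual prompt):

```python
DECOMPOSITION_GUIDANCE = """
When a question compares several people or topics, do not issue a single combined
query. Call rag_search() once per person or topic with a focused query, then
synthesize the final answer from all returned chunks.
"""
```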

### Dynamic Result Formatting

Search results are formatted with human-readable metadata headers for LLM consumption:

```
[Name: Alice Smith | Section: Technical Skills | Source: alice_cv.pdf]

Content of the chunk...
```
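A minimal sketch of such a formatter, assuming the metadata keys used during indexing (`person_name`, `h1`/`h2`/`h3`, `source`):

```python
def format_results_for_llm(results) -> str:
    """Renders each retrieved chunk under a human-readable metadata header."""
    blocks = []
    for doc in results:
        meta = doc.metadata
        section = " > ".join(meta[k] for k in ("h1", "h2", "h3") if meta.get(k))
        header = (
            f"[Name: {meta.get('person_name', 'Unknown')}"
            f" | Section: {section or 'N/A'}"
            f" | Source: {meta.get('source', 'unknown')}]"
        )
        blocks.append(f"{header}\n\n{doc.page_content}")
    return "\n\n---\n\n".join(blocks)
```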

### Vectorstore Caching

ChromaDB instances are cached per category using a singleton pattern, avoiding repeated initialization overhead:

```python
_vectorstore_cache: dict[str, Chroma] = {}

def _get_vectorstore(category: str) -> Chroma:
    if category not in _vectorstore_cache:
        _vectorstore_cache[category] = Chroma(persist_directory=...)
    return _vectorstore_cache[category]
```

### Query Pipeline

```
User Question
      ↓
┌─────────────────────────────────┐
│  CodeAgent (smolagents)         │  Autonomous reasoning agent
└─────────────────────────────────┘
      ↓
┌─────────────────────────────────┐
│  rag_search() Tool              │  Vectorstore similarity search
└─────────────────────────────────┘
      ↓
┌─────────────────────────────────┐
│  Result Formatting              │  Structured output with metadata
└─────────────────────────────────┘
      ↓
┌─────────────────────────────────┐
│  LLM Synthesis                  │  Generate contextual response
└─────────────────────────────────┘
      ↓
┌─────────────────────────────────┐
│  Gradio Streaming UI            │  Real-time response display
└─────────────────────────────────┘
```

## Shared Infrastructure

### Abstract Model Provider Architecture

The system implements a factory pattern for model providers, enabling seamless switching between inference backends without code changes:

```python
from typing import Literal

# Supports three LLM backends
MODEL_PROVIDER_TYPE = Literal["hf-local", "hf-remote", "openai"]
```

| Provider | Implementation | Use Case |
| --- | --- | --- |
| `hf-local` | `TransformersModel` | Local GPU/CPU inference with downloaded models |
| `hf-remote` | `InferenceClientModel` | HuggingFace Inference API |
| `openai` | `OpenAIServerModel` | OpenAI API or compatible services (Ollama, LM Studio) |

Both the LLM and embedding models use a singleton pattern with lazy initialization, ensuring efficient resource usage:

```python
def get_llm_model() -> TransformersModel | InferenceClientModel | OpenAIServerModel:
    global _llm_model
    if _llm_model is None:
        _llm_model = _create_model_based_on_config()
    return _llm_model
```
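A sketch of what the configuration-driven factory behind it might look like (the model classes are real smolagents classes; the `config` attribute names mirror the environment variables below and are illustrative):

```python
from smolagents import InferenceClientModel, OpenAIServerModel, TransformersModel

import config  # this project's centralized configuration module

def _create_model_based_on_config():
    """Selects the backend from LLM_MODEL_PROVIDER so callers stay backend-agnostic."""
    if config.LLM_MODEL_PROVIDER == "hf-local":
        return TransformersModel(model_id=config.LLM_MODEL)
    if config.LLM_MODEL_PROVIDER == "hf-remote":
        return InferenceClientModel(model_id=config.LLM_MODEL, token=config.HF_TOKEN)
    return OpenAIServerModel(
        model_id=config.LLM_MODEL,
        api_base=config.LOCAL_LLM_BASE,
        api_key=config.OPENAI_API_KEY,
    )
```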

## Tech Stack

| Component | Technology |
| --- | --- |
| Vector Database | ChromaDB |
| Text Processing | LangChain (splitters, embeddings) |
| Embeddings | HuggingFace / OpenAI |
| LLM Framework | smolagents (CodeAgent) |
| PDF Processing | LLM-based conversion (default) / pymupdf4llm |
| Web Interface | Gradio |

## Project Structure

```
rag-tool-demo/
├── app.py                 # Gradio chat interface & agent setup
├── config.py              # Centralized configuration management
├── app_types.py           # Type definitions
├── lib/
│   └── model_provider.py  # Abstract model provider factory
├── tools/
│   └── rag_search.py      # RAG search tool implementation
├── scripts/
│   └── indexer.py         # Document indexing pipeline
└── rag/
    ├── sources/           # Source documents (PDFs)
    └── chroma_db/         # Persisted vector database
```

## Configuration

All parameters are configurable via environment variables or a `.env` file:

```bash
# Model Providers
LLM_MODEL_PROVIDER=openai          # hf-local | hf-remote | openai
EMBEDDING_MODEL_PROVIDER=hf-local

# Models
LLM_MODEL=google/gemma-3-4b-it
EMBEDDING_MODEL=Qwen/Qwen3-Embedding-0.6B

# RAG Parameters
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
TOP_K_RESULTS=5

# PDF Conversion
MARKDOWN_CREATION_MODE=llm         # llm (default) | pymupdf

# API Configuration
OPENAI_API_KEY=your-key
LOCAL_LLM_BASE=http://localhost:1234/v1
HF_TOKEN=your-token
```
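A sketch of how `config.py` might expose these values (attribute names mirror the variables above; the use of `python-dotenv` is an assumption):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # read .env into the process environment, if present

LLM_MODEL_PROVIDER = os.getenv("LLM_MODEL_PROVIDER", "openai")
LLM_MODEL = os.getenv("LLM_MODEL", "google/gemma-3-4b-it")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "Qwen/Qwen3-Embedding-0.6B")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "200"))
TOP_K_RESULTS = int(os.getenv("TOP_K_RESULTS", "5"))
MARKDOWN_CREATION_MODE = os.getenv("MARKDOWN_CREATION_MODE", "llm")
```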

## Getting Started

### Prerequisites

- Python 3.11+
- CUDA-compatible GPU (optional, for local inference)

### Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/rag-tool-demo.git
cd rag-tool-demo

# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your settings
```

### Index Documents

```bash
# Place PDF documents in rag/sources/cvs/
python scripts/indexer.py
```

### Run the Application

```bash
python app.py
```

The Gradio interface will be available at http://localhost:7860.

## Design Patterns

| Pattern | Application |
| --- | --- |
| Factory | Model provider selection based on configuration |
| Singleton | Cached model and vectorstore instances |
| Strategy | Pluggable text splitters and embedding models |
| Decorator | `@tool` decorator for the RAG search function |

## Author

nlac

## License

MIT