Implementation Guide: Intelligent Documentation Crawler & RAG Assistant

This guide walks through the SRS implementation for the Documentation Crawler and RAG Assistant system.

Architecture Overview

The system follows the modular architecture specified in SRS Section 6:

[Crawler] ──> [Document Extractor] ──> [Text Splitter] ──> [Embeddings] ──> [Vector DB]
                                                                                 │
[Query] ──> [Text Splitter] ──> [Embeddings] ──> [Semantic Search] ──────────────┘
                                                         │
                                                         ▼
                                                  [LLM + Prompt] ──> [Answer]

Components

1. Crawler Module (src/crawler.py)

Implements the requirements from SRS Section 3.1 - Data Ingestion & Crawling:

  • Targeted Ingestion: Recursively crawls URLs up to max_depth
  • Rate Limiting: Configurable delay between requests (default 0.5s)
  • Politeness: Respects domain constraints
  • State Management: Tracks visited URLs to prevent loops
  • Boilerplate Removal: Strips navbars, footers, sidebars using BeautifulSoup

Usage:

from src.crawler import DocumentationCrawler

crawler = DocumentationCrawler(
    base_url="https://docs.python.org",
    max_depth=3,
    delay=0.5,
    max_pages=100
)

documents = crawler.crawl()
# Returns list of {'url': str, 'content': str}
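
The crawl behaviour listed above (depth limit, visited-URL tracking, staying on one domain, a delay between requests) can be pictured as a breadth-first loop. The following is a minimal sketch, not the code in src/crawler.py: crawl_sketch and the fetch_links callable are hypothetical names, "domain constraints" is assumed to mean "same host as base_url", and page fetching is injected so the loop runs without network access.

```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl_sketch(base_url, fetch_links, max_depth=3, max_pages=100,
                 delay=0.5, sleep=time.sleep):
    """Breadth-first crawl bounded by depth, page count, and domain.

    fetch_links(url) is a hypothetical callable returning (content, hrefs).
    """
    domain = urlparse(base_url).netloc
    visited = set()                    # state management: prevents loops
    queue = deque([(base_url, 0)])     # (url, depth) pairs
    documents = []

    while queue and len(documents) < max_pages:
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        content, hrefs = fetch_links(url)
        documents.append({"url": url, "content": content})

        if depth < max_depth:
            for href in hrefs:
                # Resolve relative links and drop fragments
                absolute = urljoin(url, href).split("#")[0]
                same_domain = urlparse(absolute).netloc == domain
                if same_domain and absolute not in visited:
                    queue.append((absolute, depth + 1))

        sleep(delay)                   # rate limiting between requests

    return documents
```

Injecting fetch_links and sleep keeps the traversal logic separate from HTTP and HTML parsing, which also makes it easy to exercise with an in-memory site map.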

2. Enhanced App Module (src/app_enhanced.py)

Implements requirements from SRS Sections 3.2-3.4:

Features:

  • Structure-Aware Parsing: Code blocks preserved in chunks
  • Code-Preserving Chunking:
    • Chunk size: 1000 characters (SRS spec)
    • Overlap: 200 characters (SRS spec)
  • Local Vectorization: Uses sentence-transformers/all-MiniLM-L6-v2
  • Semantic Search: Top-K retrieval (k=3)
  • Multi-source Loading: PDFs + URLs + Crawler output
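
To make the chunk size and overlap figures concrete, here is a minimal character-window sketch. It only illustrates the 1000/200 arithmetic; it does not attempt the code-preserving behaviour described above, and chunk_text is a hypothetical helper rather than a function from src/app_enhanced.py.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character windows that share `overlap` chars."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    step = chunk_size - overlap        # each window starts 800 chars after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break                      # the last window already reached the end
    return chunks
```

With the default values, a 2500-character document yields three chunks starting at offsets 0, 800, and 1600, and each chunk repeats the final 200 characters of the previous one.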

Usage:

from src.app_enhanced import answer_question, index_crawler_results

# Option 1: Query with URLs
answer = answer_question(
    "How to use async/await?",
    urls=["https://docs.python.org"]
)

# Option 2: Index crawler results
# (crawler is a DocumentationCrawler instance, as in the Crawler Module usage)
docs = crawler.crawl()
index_crawler_results(docs)
answer = answer_question("How to use async/await?")
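
For intuition about the top-K semantic search step (k=3), the sketch below ranks chunks by cosine similarity in plain Python. In the real system this work is presumably delegated to the vector database; cosine_similarity, top_k, and the (text, vector) pair layout are illustrative assumptions, not part of src/app_enhanced.py.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0                     # avoid division by zero for empty vectors
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=3):
    """Return the k chunk texts most similar to the query.

    indexed is a list of (chunk_text, embedding_vector) pairs.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]
```

The retrieved texts would then be placed into the LLM prompt as context for the answer.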

3. FastAPI Server (src/api.py)

Implements SRS Section 4.1 - Performance & Latency and SRS Section 5 - Technical Stack:

Endpoints:

/health (GET)

Health check endpoint

curl http://localhost:8000/health

/query (POST)

Single query with complete response

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How to use decorators?",
    "urls": ["https://docs.python.org"]
  }'
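
The same request can be built from Python with only the standard library. This is a sketch under assumptions: build_query_request is a hypothetical helper, the payload fields are taken from the curl example, and the response schema is not documented here, so the send step is left commented out.

```python
import json
import urllib.request


def build_query_request(question, urls=None, base="http://localhost:8000"):
    """Build a POST request for the /query endpoint (does not send it)."""
    payload = {"question": question}
    if urls:
        payload["urls"] = urls
    return urllib.request.Request(
        f"{base}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending requires the API server to be running:
# req = build_query_request("How to use decorators?", ["https://docs.python.org"])
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```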

/query/stream (POST)

Streaming response (tokens streamed as generated)

curl -X POST http://localhost:8000/query/stream \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is async/await?",
    "urls": ["https://docs.python.org"]
  }'

/crawl/prepare (POST)

Crawl a website (non-blocking status)

curl -X POST http://localhost:8000/crawl/prepare \
  -H "Content-Type: application/json" \
  -d '{
    "base_url": "https://docs.python.org",
    "max_depth": 3,
    "max_pages": 100
  }'

/index/from-crawl (POST)

Crawl and automatically index

curl -X POST http://localhost:8000/index/from-crawl \
  -H "Content-Type: application/json" \
  -d '{
    "base_url": "https://docs.python.org",
    "max_depth": 2
  }'

4. Gradio Dashboard (ui/gradio-dashboard.py)

User-friendly interface with URL input for web scraping.

Usage:

python ui/gradio-dashboard.py

Then open the URL shown (typically http://localhost:7860)

System Non-Functional Requirements (SRS Section 4)

Performance (SRS 4.1)

  • ✓ Retrieval Time: Semantic search < 200ms (ChromaDB optimized)
  • ✓ Streaming: Implemented in /query/stream endpoint
  • ✓ TTFT (time to first token): Token streaming reduces perceived latency

Scalability (SRS 4.2)

  • ✓ Memory Efficiency: Local Chroma DB runs on consumer hardware
  • ✓ Modular Architecture: Crawler runs independently from query API
  • ✓ Indexing: Can run offline without interrupting queries

Reliability (SRS 4.3)

  • ✓ Graceful Degradation: Failed URLs logged, processing continues
  • ✓ Error Handling: try/except blocks preserve system stability
  • ✓ Retry Logic: Built into crawler with exponential backoff
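
Retry with exponential backoff generally means doubling the wait after each failed attempt. The sketch below shows the pattern in isolation; retry_with_backoff, its parameters, and the default values are hypothetical and are not taken from the crawler's actual implementation.

```python
import time


def retry_with_backoff(func, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying on any exception with doubling delays.

    Waits base_delay, then 2x, then 4x, ... between attempts and re-raises
    the last exception once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise                  # out of attempts: surface the error
            sleep(base_delay * (2 ** attempt))
```

A caller that wants graceful degradation can wrap this in its own try/except, log the failed URL, and move on to the next page.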

Setup & Installation

1. Install Dependencies

pip install -r requiements.txt

2. Ensure Ollama is Running

# On Windows/Mac/Linux, start Ollama service
ollama serve

# In another terminal, pull the model
ollama pull llama3

3. Run the API Server

python -m src.api

Server runs on http://localhost:8000

4. Run the Dashboard

python ui/gradio-dashboard.py

Dashboard runs on http://localhost:7860

Workflow Examples

Example 1: Index Python Docs and Query

from src.app_enhanced import answer_question, index_crawler_results
from src.crawler import DocumentationCrawler

# Crawl Python documentation
crawler = DocumentationCrawler(
    base_url="https://docs.python.org/3",
    max_depth=2,
    max_pages=50
)
docs = crawler.crawl()

# Index the results
index_crawler_results(docs)

# Query
answer = answer_question("What is a context manager and how do I use it?")
print(answer)

Example 2: Query Multiple URLs via API

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How do I handle exceptions?",
    "urls": [
      "https://docs.python.org",
      "https://realpython.com"
    ]
  }'

Example 3: Stream Response

curl -X POST http://localhost:8000/query/stream \
  -H "Content-Type: application/json" \
  -d '{"question": "Explain generators in Python"}' \
  | while IFS= read -r line; do
      echo "$line" | jq -r '.content // .answer // .'
    done
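
A Python consumer can apply roughly the same per-line logic as the jq filter above. The wire format is an assumption inferred from that filter (one JSON object per line carrying a content or answer field, with raw text as a fallback); extract_text is a hypothetical helper.

```python
import json


def extract_text(line):
    """Roughly mirror the jq filter '.content // .answer // .' for one line."""
    line = line.strip()
    if not line:
        return ""
    try:
        value = json.loads(line)
    except json.JSONDecodeError:
        return line                    # not JSON: pass the raw text through
    if isinstance(value, dict):
        for key in ("content", "answer"):
            candidate = value.get(key)
            # jq's // operator skips only null and false
            if candidate is not None and candidate is not False:
                return str(candidate)
        return json.dumps(value)       # neither field present: show the object
    return str(value)
```

Each line read from the streaming response can be passed through extract_text and printed immediately, so tokens appear as they arrive.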

Evaluation Metrics (SRS Section 7)

The system will be evaluated on:

  1. Faithfulness Rate (Target > 85%)

    • Measures adherence to crawled context
    • Tests: Query on known documentation with expected answers
  2. Answer Relevance (Target > 90%)

    • Does the answer address the developer's question?
    • Tests: Ask domain-specific questions
  3. Context Precision (Target > 80%)

    • Were the retrieved chunks actually useful?
    • Tests: Verify top-K results contain answer hints
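
As one way to quantify the third metric, context precision can be approximated as the fraction of retrieved chunks judged relevant. This is a simplified, unweighted sketch; evaluation frameworks may define the metric differently (for example with rank weighting), and context_precision and the is_relevant callback are hypothetical names.

```python
def context_precision(retrieved_chunks, is_relevant):
    """Fraction of retrieved chunks that the judge marks as relevant."""
    if not retrieved_chunks:
        return 0.0
    hits = sum(1 for chunk in retrieved_chunks if is_relevant(chunk))
    return hits / len(retrieved_chunks)


# Example: a keyword-based judge standing in for a human or LLM rater
chunks = [
    "A context manager defines __enter__ and __exit__.",
    "List comprehensions build lists concisely.",
    "The contextlib module simplifies context managers.",
]
score = context_precision(chunks, lambda c: "context" in c.lower())
meets_target = score > 0.80            # compare against the stated target
```

Here two of the three retrieved chunks are relevant, so the score falls below the 80% target.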

File Structure

fd/
├── app_enhanced.py         # Enhanced RAG with crawler integration
├── crawler.py              # Web crawler module
├── api.py                  # FastAPI server with streaming
├── gradio-dashboard.py     # Gradio UI
├── my_docs/                # PDFs go here
├── crawler_docs.json       # Crawler output (auto-generated)
├── requiements.txt         # Dependencies
├── srs.md                  # Original SRS
└── README.md               # This file

Troubleshooting

"No documents found" error

  • Ensure PDFs are in my_docs/ folder, OR
  • Provide URLs via API/dashboard, OR
  • Run crawler first and index results

"Ollama connection error"

  • Ensure Ollama is running: ollama serve
  • Check model is available: ollama list
  • Pull model if needed: ollama pull llama3

Crawler timeout

  • Increase timeout parameter in crawler
  • Increase delay to avoid rate limiting
  • Reduce max_pages to crawl fewer pages

Slow retrieval

  • The first query loads embeddings (~30s on first run)
  • Subsequent queries use the cache and are faster
  • Consider using a GPU for faster embedding computation

Next Steps

  1. Test with real documentation: Try crawling your target site
  2. Evaluate metrics: Run test queries and measure quality
  3. Fine-tune parameters: Adjust chunk size, overlap, k-value
  4. Deploy: Use a production-grade server (Gunicorn with Uvicorn workers)
  5. Monitor: Log query metrics and relevance feedback