Implementation Guide: Intelligent Documentation Crawler & RAG Assistant

This guide walks through the SRS implementation for the Documentation Crawler and RAG Assistant system.

Architecture Overview

The system follows the modular architecture specified in SRS Section 6:

[Crawler] ──> [Document Extractor] ──> [Text Splitter] ──> [Embeddings] ──> [Vector DB]
                                                                                    │
[Query] ──> [Text Splitter] ──> [Embeddings] ──> [Semantic Search] ──────────────┘
                                                         │
                                                         ▼
                                                  [LLM + Prompt] ──> [Answer]

Components

1. Crawler Module (`src/crawler.py`)

Implements the requirements from SRS Section 3.1 - Data Ingestion & Crawling:

Targeted Ingestion: Recursively crawls URLs up to max_depth
Rate Limiting: Configurable delay between requests (default 0.5s)
Politeness: Respects domain constraints
State Management: Tracks visited URLs to prevent loops
Boilerplate Removal: Strips navbars, footers, sidebars using BeautifulSoup
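The traversal behind these bullets can be pictured as a breadth-first loop with a visited set, a depth limit, and a pause between requests. This is a minimal sketch, not the code in `src/crawler.py`: the `crawl` function, the injected `fetch_links` callable, and the same-host check used for the domain constraint are all assumptions made for illustration.

```python
import time
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(base_url, fetch_links, max_depth=3, delay=0.5, max_pages=100, sleep=time.sleep):
    """Breadth-first crawl sketch.

    fetch_links(url) is a hypothetical callable that returns the hrefs found
    on a page; a real crawler would download and parse the page there.
    """
    host = urlparse(base_url).netloc
    visited = set()                      # state management: prevents loops
    pages = []
    queue = deque([(urldefrag(base_url)[0], 0)])

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        pages.append(url)

        links = fetch_links(url)
        if depth < max_depth:            # only expand links above the depth limit
            for href in links:
                # Resolve relative links and drop #fragments
                link = urldefrag(urljoin(url, href))[0]
                if urlparse(link).netloc == host and link not in visited:
                    queue.append((link, depth + 1))

        sleep(delay)                     # rate limiting between requests
    return pages
```

Passing `sleep` and `fetch_links` as parameters keeps the loop testable without network access.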

Usage:

from src.crawler import DocumentationCrawler

crawler = DocumentationCrawler(
    base_url="https://docs.python.org",
    max_depth=3,
    delay=0.5,
    max_pages=100
)

documents = crawler.crawl()
# Returns list of {'url': str, 'content': str}
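The boilerplate removal step can be illustrated without extra dependencies. The guide states that the crawler uses BeautifulSoup; the sketch below swaps in the standard library's `html.parser` instead, and the tag list and the `strip_boilerplate` name are assumptions rather than the project's actual code.

```python
from html.parser import HTMLParser

# Tags whose content is treated as boilerplate (an assumed list)
SKIP_TAGS = {"nav", "footer", "aside", "header", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return "\n".join(self._parts)


def strip_boilerplate(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```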

2. Enhanced App Module (`src/app_enhanced.py`)

Implements requirements from SRS Sections 3.2-3.4:

Features:

Structure-Aware Parsing: Code blocks preserved in chunks
Code-Preserving Chunking:
- Chunk size: 1000 characters (SRS spec)
- Overlap: 200 characters (SRS spec)
Local Vectorization: Uses sentence-transformers/all-MiniLM-L6-v2
Semantic Search: Top-K retrieval (k=3)
Multi-source Loading: PDFs + URLs + Crawler output

Usage:

from src.app_enhanced import answer_question, index_crawler_results

# Option 1: Query with URLs
answer = answer_question(
    "How to use async/await?",
    urls=["https://docs.python.org"]
)

# Option 2: Index crawler results
# (crawler is the DocumentationCrawler instance created in the Crawler Module usage above)
docs = crawler.crawl()
index_crawler_results(docs)
answer = answer_question("How to use async/await?")
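The top-K retrieval listed under Features (k=3) amounts to ranking chunk embeddings by similarity to the query embedding. In the application the vector database performs this step; the sketch below only illustrates the idea with plain Python lists, and `cosine` and `top_k` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]
```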

3. FastAPI Server (`src/api.py`)

Implements SRS Section 4.1 - Performance & Latency and SRS Section 5 - Technical Stack:

Endpoints:

`/health` (GET)

Health check endpoint

curl http://localhost:8000/health

`/query` (POST)

Single query with complete response

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How to use decorators?",
    "urls": ["https://docs.python.org"]
  }'
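The same request can be issued from Python with only the standard library. This is a sketch: `build_query_request` and `send` are hypothetical helpers, and `send` assumes the endpoint replies with a JSON body, which the guide does not spell out.

```python
import json
import urllib.request


def build_query_request(question, urls=None, base="http://localhost:8000"):
    """Build the POST request for /query without sending it."""
    payload = {"question": question}
    if urls:
        payload["urls"] = urls
    return urllib.request.Request(
        f"{base}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and parse the reply (assumed to be JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `send(build_query_request("How to use decorators?", ["https://docs.python.org"]))` performs the call.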

`/query/stream` (POST)

Streaming response (tokens streamed as generated)

curl -X POST http://localhost:8000/query/stream \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is async/await?",
    "urls": ["https://docs.python.org"]
  }'

`/crawl/prepare` (POST)

Crawl a website (non-blocking status)

curl -X POST http://localhost:8000/crawl/prepare \
  -H "Content-Type: application/json" \
  -d '{
    "base_url": "https://docs.python.org",
    "max_depth": 3,
    "max_pages": 100
  }'

`/index/from-crawl` (POST)

Crawl and automatically index

curl -X POST http://localhost:8000/index/from-crawl \
  -H "Content-Type: application/json" \
  -d '{
    "base_url": "https://docs.python.org",
    "max_depth": 2
  }'

4. Gradio Dashboard (`ui/gradio-dashboard.py`)

User-friendly interface with URL input for web scraping.

Usage:

python ui/gradio-dashboard.py

Then open the URL shown (typically http://localhost:7860)

System Non-Functional Requirements (SRS Section 4)

Performance (SRS 4.1)

✓ Retrieval Time: Semantic search < 200ms (ChromaDB optimized)
✓ Streaming: Implemented in /query/stream endpoint
✓ TTFT: Token streaming reduces perceived latency
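One way to check the 200 ms retrieval target is to time the search call directly. A minimal sketch: `timed` is a hypothetical helper, and the function passed to it stands in for whatever retrieval call the application exposes.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms
```

Timing several queries after the first one gives a fairer picture, since the first call also pays for loading the embedding model.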

Scalability (SRS 4.2)

✓ Memory Efficiency: Local ChromaDB instance runs on consumer hardware
✓ Modular Architecture: Crawler runs independently from query API
✓ Indexing: Can be done offline, querying uninterrupted

Reliability (SRS 4.3)

✓ Graceful Degradation: Failed URLs logged, processing continues
✓ Error Handling: try/except blocks preserve system stability
✓ Retry Logic: Built into crawler with exponential backoff
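Retry with exponential backoff can be sketched as follows. The helper name, the attempt count, and the delay values are assumptions for illustration, not the crawler's actual settings.

```python
import time


def retry_with_backoff(func, max_attempts=4, base_delay=0.5, factor=2.0, sleep=time.sleep):
    """Call func until it succeeds, waiting base_delay * factor**attempt between tries."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:                # a real crawler would catch only network errors
            if attempt == max_attempts - 1:
                raise                    # out of attempts: surface the last error
            sleep(base_delay * factor ** attempt)
```

With these defaults the waits are 0.5 s, 1 s, and 2 s before the fourth and final attempt.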

Setup & Installation

1. Install Dependencies

pip install -r requiements.txt

2. Ensure Ollama is Running

# On Windows/Mac/Linux, start Ollama service
ollama serve

# In another terminal, pull the model
ollama pull llama3

3. Run the API Server

python -m src.api

Server runs on http://localhost:8000

4. Run the Dashboard

python ui/gradio-dashboard.py

Dashboard runs on http://localhost:7860

Workflow Examples

Example 1: Index Python Docs and Query

from src.app_enhanced import answer_question, index_crawler_results
from src.crawler import DocumentationCrawler

# Crawl Python documentation
crawler = DocumentationCrawler(
    base_url="https://docs.python.org/3",
    max_depth=2,
    max_pages=50
)
docs = crawler.crawl()

# Index the results
index_crawler_results(docs)

# Query
answer = answer_question("What is a context manager and how do I use it?")
print(answer)

Example 2: Query Multiple URLs via API

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How do I handle exceptions?",
    "urls": [
      "https://docs.python.org",
      "https://realpython.com"
    ]
  }'

Example 3: Stream Response

curl -X POST http://localhost:8000/query/stream \
  -H "Content-Type: application/json" \
  -d '{"question": "Explain generators in Python"}' \
  | while IFS= read -r line; do
      echo "$line" | jq -r '.content // .answer // .'
    done
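For a Python client, the `jq` filter above can be approximated with a small function applied to each streamed line. This is a sketch: the exact format of the stream is not documented here, so `extract_text` (a hypothetical helper) accepts JSON objects with a `content` or `answer` field and passes anything else through. Unlike `jq`, it also tolerates lines that are not valid JSON.

```python
import json


def extract_text(line):
    """Return the displayable text for one streamed line."""
    line = line.strip()
    if not line:
        return ""
    try:
        value = json.loads(line)
    except json.JSONDecodeError:
        return line                      # plain-text token: pass through unchanged

    if isinstance(value, dict):
        for key in ("content", "answer"):
            field = value.get(key)
            if field is not None and field is not False:
                return str(field)
    # No usable field: fall back to the whole value, like the final `.` in the filter
    return value if isinstance(value, str) else json.dumps(value)
```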

Evaluation Metrics (SRS Section 7)

The system will be evaluated on:

Faithfulness Rate (Target > 85%)
- Measures adherence to crawled context
- Tests: Query on known documentation with expected answers
Answer Relevance (Target > 90%)
- Does the answer address the developer's question?
- Tests: Ask domain-specific questions
Context Precision (Target > 80%)
- Were the retrieved chunks actually useful?
- Tests: Verify top-K results contain answer hints
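Once each test query has been judged, the three targets can be checked mechanically. The sketch below assumes every metric is scored as the share of passing judgments; how those judgments are produced (manual review or an automated judge) is outside this guide, and the `rate` and `evaluate` names are hypothetical.

```python
# Targets taken from the list above, expressed as fractions
TARGETS = {
    "faithfulness": 0.85,
    "answer_relevance": 0.90,
    "context_precision": 0.80,
}


def rate(flags):
    """Share of truthy flags; 0.0 for an empty list."""
    flags = list(flags)
    return sum(bool(flag) for flag in flags) / len(flags) if flags else 0.0


def evaluate(judgments):
    """judgments maps a metric name to a list of per-query pass/fail flags."""
    report = {}
    for metric, target in TARGETS.items():
        score = rate(judgments.get(metric, []))
        report[metric] = {"score": score, "target": target, "passed": score > target}
    return report
```

The comparison is strict because the targets are stated as "greater than", so a score exactly at the threshold does not pass.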

File Structure

fd/
├── app_enhanced.py         # Enhanced RAG with crawler integration
├── crawler.py              # Web crawler module
├── api.py                  # FastAPI server with streaming
├── gradio-dashboard.py     # Gradio UI
├── my_docs/                # PDFs go here
├── crawler_docs.json       # Crawler output (auto-generated)
├── requiements.txt         # Dependencies
├── srs.md                  # Original SRS
└── README.md               # This file

Troubleshooting

"No documents found" error

Ensure PDFs are in my_docs/ folder, OR
Provide URLs via API/dashboard, OR
Run crawler first and index results

"Ollama connection error"

Ensure Ollama is running: ollama serve
Check model is available: ollama list
Pull model if needed: ollama pull llama3
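These checks can also be scripted against Ollama's local HTTP API, which lists installed models at `/api/tags` (default port 11434). A minimal sketch: the helper names are hypothetical, and matching on the part of the name before the tag (so that `llama3:latest` counts as `llama3`) is an assumption about how the model should be matched.

```python
import json
import urllib.error
import urllib.request


def model_available(tags_response, model="llama3"):
    """Check a parsed /api/tags response for the model, ignoring the :tag suffix."""
    names = [entry.get("name", "") for entry in tags_response.get("models", [])]
    return any(name == model or name.split(":")[0] == model for name in names)


def check_ollama(base="http://localhost:11434", model="llama3", timeout=5):
    """Return a short status message about the Ollama service and model."""
    try:
        with urllib.request.urlopen(f"{base}/api/tags", timeout=timeout) as resp:
            tags = json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError) as exc:
        return f"Ollama not reachable: {exc}"
    if model_available(tags, model):
        return f"{model} is available"
    return f"{model} not found; run: ollama pull {model}"
```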

Crawler timeout

Increase timeout parameter in crawler
Increase delay to avoid rate limiting
Reduce max_pages to crawl fewer pages

Slow retrieval

The first query loads the embedding model (~30s on first run)
Subsequent queries reuse the loaded model and are faster
Consider using GPU for faster embeddings

Next Steps

Test with real documentation: Try crawling your target site
Evaluate metrics: Run test queries and measure quality
Fine-tune parameters: Adjust chunk size, overlap, k-value
Deploy: Use a production-grade server (for example, Gunicorn with Uvicorn workers)
Monitor: Log query metrics and relevance feedback
