Implementation Guide: Intelligent Documentation Crawler & RAG Assistant
This guide walks through how the Documentation Crawler and RAG Assistant system implements its Software Requirements Specification (SRS).
Architecture Overview
The system follows the modular architecture specified in SRS Section 6:
[Crawler] ──> [Document Extractor] ──> [Text Splitter] ──> [Embeddings] ──> [Vector DB]
                                                                                 │
[Query] ──> [Text Splitter] ──> [Embeddings] ──> [Semantic Search] ──────────────┘
                                                         │
                                                         ▼
                                                  [LLM + Prompt] ──> [Answer]
Components
1. Crawler Module (src/crawler.py)
Implements the requirements from SRS Section 3.1 - Data Ingestion & Crawling:
- Targeted Ingestion: Recursively crawls URLs up to max_depth
- Rate Limiting: Configurable delay between requests (default 0.5s)
- Politeness: Respects domain constraints
- State Management: Tracks visited URLs to prevent loops
- Boilerplate Removal: Strips navbars, footers, sidebars using BeautifulSoup
Usage:
from src.crawler import DocumentationCrawler
crawler = DocumentationCrawler(
    base_url="https://docs.python.org",
    max_depth=3,
    delay=0.5,
    max_pages=100
)
documents = crawler.crawl()
# Returns list of {'url': str, 'content': str}
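The crawl behaviour listed above (depth limit, visited-URL tracking, same-domain restriction, delay between requests) can be sketched with the standard library alone. This is not the code in src/crawler.py: `crawl_site`, `LinkParser`, and the injected `fetch` callable are hypothetical names, and `fetch` is passed in so the loop can be exercised without network access. The sketch stores raw HTML as the content and leaves boilerplate removal out.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl_site(base_url, fetch, max_depth=3, max_pages=100, delay=0.5):
    """Breadth-first crawl that stays on base_url's domain.

    fetch(url) is a hypothetical callable that returns the page HTML
    as a string, or raises on failure.
    """
    domain = urlparse(base_url).netloc
    visited = set()                      # state management: prevents loops
    queue = deque([(base_url, 0)])
    documents = []

    while queue and len(documents) < max_pages:
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            html = fetch(url)
        except Exception:
            continue                     # skip failed pages and keep crawling

        documents.append({"url": url, "content": html})

        if depth < max_depth:
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                # Resolve relative links and drop "#fragment" suffixes.
                link = urldefrag(urljoin(url, href)).url
                if urlparse(link).netloc == domain and link not in visited:
                    queue.append((link, depth + 1))

        if delay:
            time.sleep(delay)            # rate limiting between requests

    return documents
```

A breadth-first queue is used here so that pages closest to the base URL are fetched first when max_pages cuts the crawl short; the real module may order its traversal differently.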
2. Enhanced App Module (src/app_enhanced.py)
Implements requirements from SRS Sections 3.2-3.4:
Features:
- Structure-Aware Parsing: Code blocks preserved in chunks
- Code-Preserving Chunking:
  - Chunk size: 1000 characters (SRS spec)
  - Overlap: 200 characters (SRS spec)
- Local Vectorization: Uses sentence-transformers/all-MiniLM-L6-v2
- Semantic Search: Top-K retrieval (k=3)
- Multi-source Loading: PDFs + URLs + Crawler output
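To make the chunk size and overlap figures concrete, here is a minimal sliding-window sketch. It illustrates only the 1000/200 character arithmetic; it does not keep code blocks intact, and `chunk_text` is a hypothetical helper rather than the splitter used in src/app_enhanced.py.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into windows of chunk_size characters.

    Each window starts (chunk_size - overlap) characters after the
    previous one, so neighbouring chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break                        # the last window reached the end
        start += step
    return chunks
```

With the defaults, a 2500-character document yields windows starting at offsets 0, 800, and 1600, so the final chunk holds the remaining 900 characters.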
Usage:
from src.app_enhanced import answer_question, index_crawler_results
# Option 1: Query with URLs
answer = answer_question(
    "How to use async/await?",
    urls=["https://docs.python.org"]
)
# Option 2: Index crawler results
docs = crawler.crawl()
index_crawler_results(docs)
answer = answer_question("How to use async/await?")
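The top-K retrieval step (k=3) listed among the features can be pictured as ranking stored chunk vectors by cosine similarity to the query vector. The sketch below uses plain Python lists and toy vectors; it is an illustration of the idea, not how the vector database performs the search, and `cosine` and `top_k` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=3):
    """Return the k (score, text) pairs most similar to query_vec.

    indexed is a list of (text, vector) pairs.
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    toy_index = [
        ("async basics", [1.0, 0.0]),
        ("decorators", [0.0, 1.0]),
        ("async generators", [1.0, 1.0]),
    ]
    for score, text in top_k([1.0, 0.0], toy_index, k=2):
        print(f"{score:.2f}  {text}")
```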
3. FastAPI Server (src/api.py)
Implements SRS Section 4.1 - Performance & Latency and SRS Section 5 - Technical Stack:
Endpoints:
/health (GET)
Health check endpoint
curl http://localhost:8000/health
/query (POST)
Single query with complete response
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How to use decorators?",
    "urls": ["https://docs.python.org"]
  }'
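The same request can be issued from Python with the standard library. This is a sketch under two assumptions that the guide does not confirm: that the endpoint returns a JSON body, and that a 120-second timeout is long enough for the model to answer. `build_query_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request


def build_query_request(question, urls=None, base_url="http://localhost:8000"):
    """Build the POST /query request shown in the curl example."""
    payload = {"question": question}
    if urls:
        payload["urls"] = urls
    return urllib.request.Request(
        f"{base_url}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, urls=None, base_url="http://localhost:8000", timeout=120):
    """Send the request and decode the reply (assumed to be JSON)."""
    request = build_query_request(question, urls, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `ask("How to use decorators?", urls=["https://docs.python.org"])` requires the API server to be running on port 8000.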
/query/stream (POST)
Streaming response (tokens streamed as generated)
curl -X POST http://localhost:8000/query/stream \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is async/await?",
    "urls": ["https://docs.python.org"]
  }'
/crawl/prepare (POST)
Crawl a website (non-blocking status)
curl -X POST http://localhost:8000/crawl/prepare \
  -H "Content-Type: application/json" \
  -d '{
    "base_url": "https://docs.python.org",
    "max_depth": 3,
    "max_pages": 100
  }'
/index/from-crawl (POST)
Crawl and automatically index
curl -X POST http://localhost:8000/index/from-crawl \
  -H "Content-Type: application/json" \
  -d '{
    "base_url": "https://docs.python.org",
    "max_depth": 2
  }'
4. Gradio Dashboard (ui/gradio-dashboard.py)
User-friendly interface with URL input for web scraping.
Usage:
python ui/gradio-dashboard.py
Then open the URL shown (typically http://localhost:7860)
System Non-Functional Requirements (SRS Section 4)
Performance (SRS 4.1)
- ✅ Retrieval Time: Semantic search < 200ms (ChromaDB optimized)
- ✅ Streaming: Implemented in the /query/stream endpoint
- ✅ TTFT (time to first token): Token streaming reduces perceived latency
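One way to check the streaming claim is to time how long the first token takes to arrive compared with the whole answer. The helper below is a sketch that works on any iterable of token strings; `measure_stream` is a hypothetical name and is not part of the project.

```python
import time


def measure_stream(tokens):
    """Consume an iterable of token strings.

    Returns (full_text, ttft_seconds, total_seconds). ttft_seconds is
    None if the stream produced no tokens.
    """
    start = time.perf_counter()
    ttft = None
    parts = []
    for token in tokens:
        if ttft is None:
            ttft = time.perf_counter() - start   # first token arrived
        parts.append(token)
    total = time.perf_counter() - start
    return "".join(parts), ttft, total
```

Wrapping the retrieval call in the same `time.perf_counter()` pattern gives a figure to compare against the 200ms target.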
Scalability (SRS 4.2)
- ✅ Memory Efficiency: Local ChromaDB runs on consumer hardware
- ✅ Modular Architecture: Crawler runs independently from query API
- ✅ Indexing: Can be done offline without interrupting queries
Reliability (SRS 4.3)
- ✅ Graceful Degradation: Failed URLs logged, processing continues
- ✅ Error Handling: try/except blocks preserve system stability
- ✅ Retry Logic: Built into crawler with exponential backoff
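Retry with exponential backoff means waiting twice as long after each failed attempt before trying again. The following sketch shows the pattern in isolation; the attempt count, base delay, and the name `retry_with_backoff` are assumptions, not values taken from the crawler.

```python
import time


def retry_with_backoff(func, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    Waits base_delay, then 2 * base_delay, then 4 * base_delay, and so
    on between attempts. The last exception is re-raised on final failure.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise                    # out of attempts: surface the error
            sleep(base_delay * (2 ** attempt))
```

The `sleep` parameter exists only so the waiting can be replaced in tests; callers would normally leave it at its default.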
Setup & Installation
1. Install Dependencies
pip install -r requiements.txt
2. Ensure Ollama is Running
# On Windows/Mac/Linux, start Ollama service
ollama serve
# In another terminal, pull the model
ollama pull llama3
3. Run the API Server
python -m src.api
Server runs on http://localhost:8000
4. Run the Dashboard
python ui/gradio-dashboard.py
Dashboard runs on http://localhost:7860
Workflow Examples
Example 1: Index Python Docs and Query
from src.app_enhanced import answer_question, index_crawler_results
from src.crawler import DocumentationCrawler
# Crawl Python documentation
crawler = DocumentationCrawler(
    base_url="https://docs.python.org/3",
    max_depth=2,
    max_pages=50
)
docs = crawler.crawl()
# Index the results
index_crawler_results(docs)
# Query
answer = answer_question("What is a context manager and how do I use it?")
print(answer)
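Because crawling is the slow step, it can help to save the crawler output and index it later without crawling again. The sketch below assumes the output is a JSON list of {'url', 'content'} dictionaries, matching what crawler.crawl() is described as returning; the on-disk format of crawler_docs.json is not documented here, and `save_documents` and `load_documents` are hypothetical helpers.

```python
import json
from pathlib import Path


def save_documents(documents, path="crawler_docs.json"):
    """Write crawled documents to disk as a JSON list."""
    Path(path).write_text(
        json.dumps(documents, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )


def load_documents(path="crawler_docs.json"):
    """Read documents back, keeping only entries with a URL and content."""
    documents = json.loads(Path(path).read_text(encoding="utf-8"))
    return [
        doc for doc in documents
        if isinstance(doc, dict) and doc.get("url") and doc.get("content")
    ]
```

The loaded list can then be passed to index_crawler_results in place of a fresh crawl.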
Example 2: Query Multiple URLs via API
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How do I handle exceptions?",
    "urls": [
      "https://docs.python.org",
      "https://realpython.com"
    ]
  }'
Example 3: Stream Response
curl -X POST http://localhost:8000/query/stream \
  -H "Content-Type: application/json" \
  -d '{"question": "Explain generators in Python"}' \
  | while IFS= read -r line; do
      echo "$line" | jq -r '.content // .answer // .'
    done
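The jq filter above takes the content field if present, then the answer field, and otherwise prints the whole value. A Python client can apply the same fallback to each streamed line. This sketch loosely mirrors that filter; returning non-JSON lines unchanged is an added assumption, and `extract_text` is a hypothetical name.

```python
import json


def extract_text(line):
    """Return the displayable text carried by one streamed line."""
    line = line.strip()
    if not line:
        return ""
    try:
        value = json.loads(line)
    except json.JSONDecodeError:
        return line                      # assumption: a plain-text chunk

    if isinstance(value, dict):
        # Same order as the jq filter: .content, then .answer, then the object.
        for key in ("content", "answer"):
            candidate = value.get(key)
            if candidate is not None and candidate is not False:
                value = candidate
                break

    return value if isinstance(value, str) else json.dumps(value)
```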
Evaluation Metrics (SRS Section 7)
The system will be evaluated on:
Faithfulness Rate (Target > 85%)
- Measures adherence to crawled context
- Tests: Query on known documentation with expected answers
Answer Relevance (Target > 90%)
- Does the answer address the developer's question?
- Tests: Ask domain-specific questions
Context Precision (Target > 80%)
- Were the retrieved chunks actually useful?
- Tests: Verify top-K results contain answer hints
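As a starting point for measuring context precision, the fraction of retrieved chunks that a reviewer marked as useful can be computed per query and averaged. This is a simplified set-based sketch; evaluation frameworks may define the metric differently, and the chunk IDs and function names are hypothetical.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that were judged relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant)
    return hits / len(retrieved_ids)


def meets_target(scores, target):
    """True if the mean of the per-query scores exceeds the target."""
    if not scores:
        return False
    return sum(scores) / len(scores) > target


if __name__ == "__main__":
    per_query = [
        context_precision(["c1", "c2", "c3"], {"c1", "c3"}),
        context_precision(["c4", "c5", "c6"], {"c4", "c5", "c6"}),
    ]
    print(meets_target(per_query, target=0.80))
```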
File Structure
fd/
├── app_enhanced.py       # Enhanced RAG with crawler integration
├── crawler.py            # Web crawler module
├── api.py                # FastAPI server with streaming
├── gradio-dashboard.py   # Gradio UI
├── my_docs/              # PDFs go here
├── crawler_docs.json     # Crawler output (auto-generated)
├── requiements.txt       # Dependencies
├── srs.md                # Original SRS
└── README.md             # This file
Troubleshooting
"No documents found" error
- Ensure PDFs are in the my_docs/ folder, OR
- Provide URLs via API/dashboard, OR
- Run crawler first and index results
"Ollama connection error"
- Ensure Ollama is running: ollama serve
- Check model is available: ollama list
- Pull model if needed: ollama pull llama3
Crawler timeout
- Increase the timeout parameter in the crawler
- Increase delay to avoid rate limiting
- Reduce max_pages to crawl fewer pages
Slow retrieval
- The first query loads the embedding model (~30s on first run)
- Subsequent queries reuse the loaded model and are faster
- Consider using GPU for faster embeddings
Next Steps
- Test with real documentation: Try crawling your target site
- Evaluate metrics: Run test queries and measure quality
- Fine-tune parameters: Adjust chunk size, overlap, k-value
- Deploy: Use production-grade server (Gunicorn + Uvicorn)
- Monitor: Log query metrics and relevance feedback
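For the deployment step, Gunicorn reads its settings from a Python file. The fragment below is a sketch of such a file; the file name gunicorn.conf.py, the port, the worker cap, and the timeout are all assumptions to adjust, and it presumes that src/api.py exposes its FastAPI instance under the name `app`.

```python
# gunicorn.conf.py (hypothetical file name and values)
import multiprocessing

# Listen on the same port the development server uses.
bind = "0.0.0.0:8000"

# Each worker process may load its own copy of the embedding model,
# so the worker count is capped to keep memory use modest.
workers = min(4, multiprocessing.cpu_count())

# Run the ASGI app through Uvicorn's Gunicorn worker class.
worker_class = "uvicorn.workers.UvicornWorker"

# LLM answers can take a while; allow more than Gunicorn's 30s default.
timeout = 120
```

Under those assumptions the server would be started with `gunicorn -c gunicorn.conf.py src.api:app`.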