# Implementation Guide: Intelligent Documentation Crawler & RAG Assistant
This guide walks through how the Documentation Crawler and RAG Assistant implements its Software Requirements Specification (SRS).
## Architecture Overview
The system follows the modular architecture specified in SRS Section 6:
```
[Crawler] ──> [Document Extractor] ──> [Text Splitter] ──> [Embeddings] ──> [Vector DB]
                                                                                 │
[Query] ──> [Text Splitter] ──> [Embeddings] ──> [Semantic Search] ──────────────┘
                                                         │
                                                         ▼
                                                  [LLM + Prompt] ──> [Answer]
```
## Components
### 1. **Crawler Module** (`src/crawler.py`)
Implements the requirements from **SRS Section 3.1 - Data Ingestion & Crawling**:
- **Targeted Ingestion**: Recursively crawls URLs up to `max_depth`
- **Rate Limiting**: Configurable delay between requests (default 0.5s)
- **Politeness**: Respects domain constraints
- **State Management**: Tracks visited URLs to prevent loops
- **Boilerplate Removal**: Strips navbars, footers, sidebars using BeautifulSoup
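The boilerplate-removal step can be illustrated without third-party dependencies. The sketch below swaps BeautifulSoup for the standard library's `html.parser`, and the tag list and the `strip_boilerplate` name are assumptions for illustration, not the code in `src/crawler.py`.

```python
from html.parser import HTMLParser

# Tags treated as boilerplate are an assumption; adjust for the target site.
SKIP_TAGS = {"nav", "footer", "aside", "header", "script", "style"}


class _ContentExtractor(HTMLParser):
    """Collects text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a boilerplate element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def strip_boilerplate(html: str) -> str:
    """Return the page text with navigation, footer, and sidebar content removed."""
    parser = _ContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Counting nesting depth rather than toggling a flag keeps the extractor correct when one boilerplate element contains another, such as a `<nav>` inside a `<header>`.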
**Usage**:
```python
from src.crawler import DocumentationCrawler
crawler = DocumentationCrawler(
    base_url="https://docs.python.org",
    max_depth=3,
    delay=0.5,
    max_pages=100
)
documents = crawler.crawl()
# Returns list of {'url': str, 'content': str}
```
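To show how depth limiting, visited-URL tracking, domain restriction, and the request delay might fit together, here is a minimal breadth-first sketch. It is not the `DocumentationCrawler` implementation: the `crawl` function and the injected `fetch(url)` callable, which returns `(content, links)`, are hypothetical and exist only so the loop runs without network access.

```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl(base_url, fetch, max_depth=3, max_pages=100, delay=0.5, sleep=time.sleep):
    """Breadth-first crawl restricted to the domain of base_url.

    `fetch(url)` is a hypothetical callable returning (content, links).
    """
    domain = urlparse(base_url).netloc
    visited = set()                      # prevents loops between pages
    queue = deque([(base_url, 0)])       # (url, depth) pairs
    documents = []

    while queue and len(documents) < max_pages:
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        content, links = fetch(url)
        documents.append({"url": url, "content": content})

        if depth < max_depth:
            for link in links:
                absolute = urljoin(url, link).split("#")[0]  # drop fragments
                if urlparse(absolute).netloc == domain and absolute not in visited:
                    queue.append((absolute, depth + 1))

        sleep(delay)  # rate limiting between requests

    return documents
```

The output uses the same `{'url': str, 'content': str}` shape that the usage example above reports.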
### 2. **Enhanced App Module** (`src/app_enhanced.py`)
Implements requirements from **SRS Sections 3.2-3.4**:
**Features**:
- **Structure-Aware Parsing**: Code blocks preserved in chunks
- **Code-Preserving Chunking**:
  - Chunk size: 1000 characters (SRS spec)
  - Overlap: 200 characters (SRS spec)
- **Local Vectorization**: Uses `sentence-transformers/all-MiniLM-L6-v2`
- **Semantic Search**: Top-K retrieval (k=3)
- **Multi-source Loading**: PDFs + URLs + Crawler output
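The chunking parameters above can be made concrete with a small sketch. It assumes a plain sliding window over characters and treats Markdown-style fenced code blocks as indivisible. Whether `src/app_enhanced.py` does this itself or delegates to a library splitter is not stated here, so the function names are illustrative.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker
_CODE_BLOCK = re.compile("(" + re.escape(FENCE) + ".*?" + re.escape(FENCE) + ")", re.DOTALL)


def split_with_overlap(text, chunk_size=1000, overlap=200):
    """Slide a chunk_size window over text, stepping by chunk_size - overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def chunk_preserving_code(text, chunk_size=1000, overlap=200):
    """Chunk prose with overlap, but emit each fenced code block as one chunk."""
    chunks = []
    for segment in _CODE_BLOCK.split(text):
        if segment.startswith(FENCE):
            chunks.append(segment)  # never split inside a code block
        elif segment.strip():
            chunks.extend(split_with_overlap(segment.strip(), chunk_size, overlap))
    return chunks
```

With the SRS values, a 2500-character prose passage yields chunks of 1000, 1000, and 900 characters, each sharing 200 characters with its neighbor.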
**Usage**:
```python
from src.app_enhanced import answer_question, index_crawler_results
# Option 1: Query with URLs
answer = answer_question(
    "How to use async/await?",
    urls=["https://docs.python.org"]
)

# Option 2: Index crawler results
# (`crawler` is the DocumentationCrawler instance from the Crawler Module usage above)
docs = crawler.crawl()
index_crawler_results(docs)
answer = answer_question("How to use async/await?")
```
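Top-K semantic search reduces to ranking stored chunk embeddings by similarity to the query embedding. The sketch below uses hand-written toy vectors and pure-Python cosine similarity in place of the real embedding model and vector database, so `top_k` and its data layout are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed, k=3):
    """Return the k (chunk, score) pairs most similar to query_vec.

    `indexed` is a list of (chunk_text, embedding) pairs.
    """
    scored = [(chunk, cosine_similarity(query_vec, vec)) for chunk, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

A vector database performs the same ranking with an index structure instead of a full scan, which is what keeps retrieval fast as the corpus grows.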
### 3. **FastAPI Server** (`src/api.py`)
Implements **SRS Section 4.1 - Performance & Latency** and **SRS Section 5 - Technical Stack**:
**Endpoints**:
#### `/health` (GET)
Health check endpoint
```bash
curl http://localhost:8000/health
```
#### `/query` (POST)
Single query with complete response
```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How to use decorators?",
    "urls": ["https://docs.python.org"]
  }'
```
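The same request can be issued from Python. This sketch uses only the standard library and mirrors the curl payload above. The helper names are hypothetical, and because the response schema is not specified in this guide, `query` simply returns the decoded JSON.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # default address used throughout this guide


def build_query_request(question, urls=None, endpoint="/query"):
    """Build (but do not send) a POST request matching the curl example."""
    payload = {"question": question}
    if urls:
        payload["urls"] = urls
    return urllib.request.Request(
        API_BASE + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query(question, urls=None, timeout=60):
    """Send the request and return the decoded JSON body."""
    request = build_query_request(question, urls)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending makes the payload easy to inspect or unit-test without a running server.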
#### `/query/stream` (POST)
Streaming response (tokens streamed as generated)
```bash
curl -X POST http://localhost:8000/query/stream \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is async/await?",
    "urls": ["https://docs.python.org"]
  }'
```
#### `/crawl/prepare` (POST)
Crawl a website (non-blocking status)
```bash
curl -X POST http://localhost:8000/crawl/prepare \
  -H "Content-Type: application/json" \
  -d '{
    "base_url": "https://docs.python.org",
    "max_depth": 3,
    "max_pages": 100
  }'
```
#### `/index/from-crawl` (POST)
Crawl and automatically index
```bash
curl -X POST http://localhost:8000/index/from-crawl \
  -H "Content-Type: application/json" \
  -d '{
    "base_url": "https://docs.python.org",
    "max_depth": 2
  }'
```
### 4. **Gradio Dashboard** (`ui/gradio-dashboard.py`)
User-friendly interface with URL input for web scraping.
**Usage**:
```bash
python ui/gradio-dashboard.py
```
Then open the URL shown (typically `http://localhost:7860`)
## System Non-Functional Requirements (SRS Section 4)
### Performance (SRS 4.1)
- ✓ **Retrieval Time**: Semantic search < 200ms (ChromaDB optimized)
- ✓ **Streaming**: Implemented in `/query/stream` endpoint
- ✓ **TTFT**: Token streaming reduces perceived latency
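Time to first token (TTFT) can be checked against any token iterator. The helper below is a hypothetical measurement sketch rather than part of the API. It accepts an injectable clock so it can be exercised deterministically.

```python
import time


def measure_ttft(token_stream, clock=time.perf_counter):
    """Return (ttft_seconds, total_seconds, text) for an iterable of tokens.

    ttft_seconds is None when the stream yields nothing.
    """
    start = clock()
    first = None
    parts = []
    for token in token_stream:
        if first is None:
            first = clock() - start  # elapsed time when the first token arrived
        parts.append(token)
    total = clock() - start
    return first, total, "".join(parts)
```

Comparing `ttft_seconds` with `total_seconds` shows how much of the wait streaming hides from the user.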
### Scalability (SRS 4.2)
- ✓ **Memory Efficiency**: Local Chroma DB runs on consumer hardware
- ✓ **Modular Architecture**: Crawler runs independently from query API
- ✓ **Indexing**: Can run offline without interrupting queries
### Reliability (SRS 4.3)
- ✓ **Graceful Degradation**: Failed URLs logged, processing continues
- ✓ **Error Handling**: `try`/`except` blocks preserve system stability
- ✓ **Retry Logic**: Built into crawler with exponential backoff
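Retry with exponential backoff and graceful degradation can be combined in a few lines. This is a sketch under assumptions: the retry count, base delay, and the `fetch_with_retry` name are illustrative, and `fetch` is any callable that raises on failure.

```python
import logging
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on failure wait base_delay * 2**attempt and try again.

    Returns None once all attempts fail so the caller can skip the URL
    and keep processing the rest of the crawl.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:
            if attempt == retries - 1:
                logging.warning("Giving up on %s: %s", url, exc)
                return None
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Returning `None` instead of re-raising is what lets a single bad page be logged and skipped rather than aborting the run.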
## Setup & Installation
### 1. Install Dependencies
```bash
pip install -r requiements.txt
```
### 2. Ensure Ollama is Running
```bash
# On Windows/Mac/Linux, start Ollama service
ollama serve
# In another terminal, pull the model
ollama pull llama3
```
### 3. Run the API Server
```bash
python -m src.api
```
Server runs on `http://localhost:8000`
### 4. Run the Dashboard
```bash
python ui/gradio-dashboard.py
```
Dashboard runs on `http://localhost:7860`
## Workflow Examples
### Example 1: Index Python Docs and Query
```python
from src.app_enhanced import answer_question, index_crawler_results
from src.crawler import DocumentationCrawler
# Crawl Python documentation
crawler = DocumentationCrawler(
    base_url="https://docs.python.org/3",
    max_depth=2,
    max_pages=50
)
docs = crawler.crawl()
# Index the results
index_crawler_results(docs)
# Query
answer = answer_question("What is a context manager and how do I use it?")
print(answer)
```
### Example 2: Query Multiple URLs via API
```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How do I handle exceptions?",
    "urls": [
      "https://docs.python.org",
      "https://realpython.com"
    ]
  }'
```
### Example 3: Stream Response
```bash
# -s hides the progress meter; -N disables curl's output buffering so chunks print as they arrive
curl -sN -X POST http://localhost:8000/query/stream \
  -H "Content-Type: application/json" \
  -d '{"question": "Explain generators in Python"}' \
  | while IFS= read -r line; do
      echo "$line" | jq -r '.content // .answer // .'
    done
```
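For clients written in Python, the `jq` filter above can be mirrored with a small function. The sketch assumes the stream delivers one JSON value or plain-text chunk per line. The field names `content` and `answer` come from the filter itself, and the actual stream format is not specified in this guide.

```python
import json


def extract_text(line):
    """Mirror the jq filter `.content // .answer // .` for one streamed line."""
    line = line.strip()
    if not line:
        return ""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return line  # plain-text chunk, pass through unchanged
    if isinstance(event, dict):
        for key in ("content", "answer"):
            value = event.get(key)
            # jq's `//` falls through only on null or false
            if value is not None and value is not False:
                return str(value)
        return json.dumps(event)
    return str(event)
```

Unlike the shell pipeline, this version tolerates lines that are not valid JSON instead of reporting a parse error.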
## Evaluation Metrics (SRS Section 7)
The system will be evaluated on:
1. **Faithfulness Rate** (Target > 85%)
   - Measures adherence to crawled context
   - Tests: Query on known documentation with expected answers
2. **Answer Relevance** (Target > 90%)
   - Does the answer address the developer's question?
   - Tests: Ask domain-specific questions
3. **Context Precision** (Target > 80%)
   - Were the retrieved chunks actually useful?
   - Tests: Verify top-K results contain answer hints
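One way to track these targets is to record a pass/fail judgement per test query and compare the pass rate with each threshold. How judgements are produced (manual review, an LLM judge, or otherwise) is not specified here, so the binary scoring and the names below are assumptions.

```python
# Thresholds taken from the targets listed above.
TARGETS = {"faithfulness": 0.85, "answer_relevance": 0.90, "context_precision": 0.80}


def rate(judgements):
    """Fraction of True values in a list of per-query pass/fail judgements."""
    return sum(judgements) / len(judgements) if judgements else 0.0


def evaluate(results):
    """Map each metric to (score, meets_target) given {metric: [bool, ...]}."""
    report = {}
    for name, target in TARGETS.items():
        score = rate(results.get(name, []))
        report[name] = (score, score > target)  # targets are strict ("> 85%")
    return report
```

A metric with no recorded judgements scores 0.0 and is reported as missing its target, which keeps untested metrics from passing silently.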
## File Structure
```
fd/
├── app_enhanced.py         # Enhanced RAG with crawler integration
├── crawler.py              # Web crawler module
├── api.py                  # FastAPI server with streaming
├── gradio-dashboard.py     # Gradio UI
├── my_docs/                # PDFs go here
├── crawler_docs.json       # Crawler output (auto-generated)
├── requiements.txt         # Dependencies
├── srs.md                  # Original SRS
└── README.md               # This file
```
## Troubleshooting
### "No documents found" error
- Ensure PDFs are in `my_docs/` folder, OR
- Provide URLs via API/dashboard, OR
- Run crawler first and index results
### "Ollama connection error"
- Ensure Ollama is running: `ollama serve`
- Check model is available: `ollama list`
- Pull model if needed: `ollama pull llama3`
### Crawler timeout
- Increase `timeout` parameter in crawler
- Increase `delay` to avoid rate limiting
- Reduce `max_pages` to crawl fewer pages
### Slow retrieval
- The first query loads the embedding model (~30s on first run)
- Subsequent queries reuse the loaded model and are faster
- Consider using a GPU for faster embeddings
## Next Steps
1. **Test with real documentation**: Try crawling your target site
2. **Evaluate metrics**: Run test queries and measure quality
3. **Fine-tune parameters**: Adjust chunk size, overlap, k-value
4. **Deploy**: Use a production-grade server (Gunicorn with Uvicorn workers)
5. **Monitor**: Log query metrics and relevance feedback