	# Implementation Guide: Intelligent Documentation Crawler & RAG Assistant

	This guide walks through the SRS implementation for the Documentation Crawler and RAG Assistant system.

	## Architecture Overview

	The system follows the modular architecture specified in SRS Section 6:

	```
	[Crawler] ──> [Document Extractor] ──> [Text Splitter] ──> [Embeddings] ──> [Vector DB]
	                                                                                 │
	[Query] ──> [Text Splitter] ──> [Embeddings] ──> [Semantic Search] ──────────────┘
	                                                         │
	                                                         ▼
	                                                  [LLM + Prompt] ──> [Answer]
	```

	## Components

	### 1. Crawler Module (`src/crawler.py`)

	Implements the requirements from SRS Section 3.1 - Data Ingestion & Crawling:

	- Targeted Ingestion: Recursively crawls URLs up to `max_depth`
	- Rate Limiting: Configurable delay between requests (default 0.5s)
	- Politeness: Respects domain constraints
	- State Management: Tracks visited URLs to prevent loops
	- Boilerplate Removal: Strips navbars, footers, sidebars using BeautifulSoup

	Usage:
	```python
	from src.crawler import DocumentationCrawler

	crawler = DocumentationCrawler(
	    base_url="https://docs.python.org",
	    max_depth=3,
	    delay=0.5,
	    max_pages=100,
	)

	documents = crawler.crawl()
	# Returns list of {'url': str, 'content': str}
	```
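
The traversal rules listed above (depth limit, same-domain constraint, visited-URL tracking, delay between requests) can be illustrated with a short sketch. This is not the code in `src/crawler.py`; `crawl_order` and the `get_links` callable are hypothetical names, and an in-memory dictionary stands in for real HTTP fetches so the example runs offline.

```python
from collections import deque
from urllib.parse import urljoin, urlparse
import time


def crawl_order(base_url, get_links, max_depth=3, max_pages=100, delay=0.0):
    """Breadth-first, depth-limited traversal restricted to base_url's domain.

    get_links(url) is a hypothetical callable returning the hrefs found on a page.
    """
    domain = urlparse(base_url).netloc
    visited = set()
    order = []
    queue = deque([(base_url, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        if url in visited:
            continue  # already crawled: prevents loops
        visited.add(url)
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for href in get_links(url):
            target = urljoin(url, href).split("#")[0]
            if urlparse(target).netloc == domain and target not in visited:
                queue.append((target, depth + 1))
        time.sleep(delay)  # rate limiting between page fetches
    return order


# A tiny in-memory "site" (assumed data) replaces network access.
SITE = {
    "https://example.com/": ["/a", "/b", "https://other.org/x"],
    "https://example.com/a": ["/", "/c"],
    "https://example.com/b": ["/a"],
    "https://example.com/c": [],
}
pages = crawl_order("https://example.com/", lambda u: SITE.get(u, []), max_depth=1)
print(pages)
```

With `max_depth=1` the page `/c` is never reached, and the link to `other.org` is dropped by the domain check.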

	### 2. Enhanced App Module (`src/app_enhanced.py`)

	Implements requirements from SRS Sections 3.2-3.4:

	Features:
	- Structure-Aware Parsing: Code blocks preserved in chunks
	- Code-Preserving Chunking:
	  - Chunk size: 1000 characters (SRS spec)
	  - Overlap: 200 characters (SRS spec)
	- Local Vectorization: Uses `sentence-transformers/all-MiniLM-L6-v2`
	- Semantic Search: Top-K retrieval (k=3)
	- Multi-source Loading: PDFs + URLs + Crawler output
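
To make the chunk size and overlap figures concrete, here is a minimal sliding-window splitter. It is a sketch under simplifying assumptions: `split_text` is a hypothetical helper, it works on raw characters only, and it does not attempt the code-block preservation that `src/app_enhanced.py` is described as doing.

```python
def split_text(text, chunk_size=1000, overlap=200):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears whole in one of the two chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "x" * 2500
chunks = split_text(sample)
print([len(c) for c in chunks])
```

With the SRS values, each new chunk starts 800 characters after the previous one, so a 2500-character document yields three chunks.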

	Usage:
	```python
	from src.app_enhanced import answer_question, index_crawler_results

	# Option 1: Query with URLs
	answer = answer_question(
	    "How to use async/await?",
	    urls=["https://docs.python.org"],
	)

	# Option 2: Index crawler results
	# (`crawler` is a DocumentationCrawler instance, created as in the Crawler Module usage)
	docs = crawler.crawl()
	index_crawler_results(docs)
	answer = answer_question("How to use async/await?")
	```
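
Top-K retrieval amounts to ranking stored chunk vectors by similarity to the query vector and keeping the best `k`. The sketch below shows that idea with cosine similarity in plain Python. The names `cosine` and `top_k` are hypothetical, and the 3-dimensional toy vectors are stand-ins for real `all-MiniLM-L6-v2` embeddings, which the application would obtain from the model and store in the vector database.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed, k=3):
    """Return the k (chunk, score) pairs most similar to query_vec."""
    scored = [(chunk, cosine(query_vec, vec)) for chunk, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors (assumed data) in place of real sentence embeddings.
index = [
    ("async def / await syntax", [0.9, 0.1, 0.0]),
    ("asyncio event loop", [0.8, 0.3, 0.1]),
    ("list comprehensions", [0.0, 0.2, 0.9]),
    ("context managers", [0.1, 0.9, 0.2]),
]
results = top_k([1.0, 0.0, 0.0], index, k=3)
for chunk, score in results:
    print(f"{score:.2f}  {chunk}")
```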

	### 3. FastAPI Server (`src/api.py`)

	Implements SRS Section 4.1 - Performance & Latency and SRS Section 5 - Technical Stack:

	Endpoints:

	#### `/health` (GET)
	Health check endpoint
	```bash
	curl http://localhost:8000/health
	```

	#### `/query` (POST)
	Single query with complete response
	```bash
	curl -X POST http://localhost:8000/query \
	  -H "Content-Type: application/json" \
	  -d '{
	    "question": "How to use decorators?",
	    "urls": ["https://docs.python.org"]
	  }'
	```
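
The same request can be issued from Python with only the standard library. This is a sketch: `build_query_request` and `ask` are hypothetical helper names, the payload fields are taken from the curl example above, and the shape of the JSON response is not specified in this guide, so `ask` simply returns whatever the server sends back.

```python
import json
import urllib.request


def build_query_request(question, urls=None, base="http://localhost:8000"):
    """Build the POST request for /query without sending it."""
    payload = {"question": question}
    if urls:
        payload["urls"] = list(urls)
    return urllib.request.Request(
        f"{base}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, urls=None, timeout=120):
    """Send the request and decode the JSON body (the API server must be running)."""
    request = build_query_request(question, urls)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Inspect the request locally; call ask(...) once the server is up.
req = build_query_request("How to use decorators?", ["https://docs.python.org"])
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```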

	#### `/query/stream` (POST)
	Streaming response (tokens streamed as generated)
	```bash
	curl -X POST http://localhost:8000/query/stream \
	  -H "Content-Type: application/json" \
	  -d '{
	    "question": "What is async/await?",
	    "urls": ["https://docs.python.org"]
	  }'
	```

	#### `/crawl/prepare` (POST)
	Crawl a website (non-blocking status)
	```bash
	curl -X POST http://localhost:8000/crawl/prepare \
	  -H "Content-Type: application/json" \
	  -d '{
	    "base_url": "https://docs.python.org",
	    "max_depth": 3,
	    "max_pages": 100
	  }'
	```

	#### `/index/from-crawl` (POST)
	Crawl and automatically index
	```bash
	curl -X POST http://localhost:8000/index/from-crawl \
	  -H "Content-Type: application/json" \
	  -d '{
	    "base_url": "https://docs.python.org",
	    "max_depth": 2
	  }'
	```

	### 4. Gradio Dashboard (`ui/gradio-dashboard.py`)

	User-friendly interface with URL input for web scraping.

	Usage:
	```bash
	python ui/gradio-dashboard.py
	```

	Then open the URL shown (typically `http://localhost:7860`)

	## System Non-Functional Requirements (SRS Section 4)

	### Performance (SRS 4.1)
	- ✓ Retrieval Time: Semantic search < 200ms (ChromaDB optimized)
	- ✓ Streaming: Implemented in `/query/stream` endpoint
	- ✓ TTFT: Token streaming reduces perceived latency

	### Scalability (SRS 4.2)
	- ✓ Memory Efficiency: Local Chroma DB runs on consumer hardware
	- ✓ Modular Architecture: Crawler runs independently from query API
	- ✓ Indexing: Can be done offline, querying uninterrupted

	### Reliability (SRS 4.3)
	- ✓ Graceful Degradation: Failed URLs logged, processing continues
	- ✓ Error Handling: Try/except blocks preserve system stability
	- ✓ Retry Logic: Built into crawler with exponential backoff
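
Retry with exponential backoff and graceful degradation can be sketched together as follows. The helpers `fetch_with_retry` and `fetch_all` are hypothetical and not taken from the crawler source; the retry count and base delay are assumed values, and a fake fetcher replaces real network calls so the behavior is reproducible.

```python
import logging
import time

logger = logging.getLogger("crawler-sketch")


def fetch_with_retry(fetch, url, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch(url); after each failure wait base_delay * 2**attempt and retry."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            if attempt == retries:
                logger.warning("giving up on %s: %s", url, exc)
                return None  # signal failure instead of raising
            sleep(base_delay * 2 ** attempt)


def fetch_all(fetch, urls, **kwargs):
    """Fetch every URL, logging and skipping the ones that keep failing."""
    results = {}
    for url in urls:
        content = fetch_with_retry(fetch, url, **kwargs)
        if content is not None:
            results[url] = content
    return results


# Fake fetcher: "flaky" fails twice and then succeeds, "dead" always fails.
calls = {}


def fake_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    if url == "dead" or (url == "flaky" and calls[url] < 3):
        raise ConnectionError("simulated failure")
    return f"content of {url}"


waits = []  # records the backoff delays instead of actually sleeping
pages = fetch_all(fake_fetch, ["ok", "flaky", "dead"], retries=3, sleep=waits.append)
print(pages)
print(waits)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; the failing URL is logged and omitted while the remaining pages are still returned.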

	## Setup & Installation

	### 1. Install Dependencies
	```bash
	pip install -r requiements.txt
	```

	### 2. Ensure Ollama is Running
	```bash
	# On Windows/Mac/Linux, start Ollama service
	ollama serve

	# In another terminal, pull the model
	ollama pull llama3
	```

	### 3. Run the API Server
	```bash
	python -m src.api
	```

	Server runs on `http://localhost:8000`

	### 4. Run the Dashboard
	```bash
	python ui/gradio-dashboard.py
	```

	Dashboard runs on `http://localhost:7860`

	## Workflow Examples

	### Example 1: Index Python Docs and Query

	```python
	from src.app_enhanced import answer_question, index_crawler_results
	from src.crawler import DocumentationCrawler

	# Crawl Python documentation
	crawler = DocumentationCrawler(
	    base_url="https://docs.python.org/3",
	    max_depth=2,
	    max_pages=50,
	)
	docs = crawler.crawl()

	# Index the results
	index_crawler_results(docs)

	# Query
	answer = answer_question("What is a context manager and how do I use it?")
	print(answer)
	```

	### Example 2: Query Multiple URLs via API

	```bash
	curl -X POST http://localhost:8000/query \
	  -H "Content-Type: application/json" \
	  -d '{
	    "question": "How do I handle exceptions?",
	    "urls": [
	      "https://docs.python.org",
	      "https://realpython.com"
	    ]
	  }'
	```

	### Example 3: Stream Response

	```bash
	curl -X POST http://localhost:8000/query/stream \
	  -H "Content-Type: application/json" \
	  -d '{"question": "Explain generators in Python"}' \
	  | while IFS= read -r line; do
	      echo "$line" | jq -r '.content // .answer // .'
	    done
	```
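
For clients written in Python, the `jq` filter above can be approximated by a small parsing function. The wire format of `/query/stream` is not documented in this guide, so this sketch assumes what the shell example implies: one JSON value or plain-text token per line, with the text under a `content` or `answer` key. `extract_text` is a hypothetical helper name.

```python
import json


def extract_text(line):
    """Approximate the jq filter `.content // .answer // .` for one streamed line."""
    line = line.strip()
    if not line:
        return ""
    try:
        data = json.loads(line)
    except json.JSONDecodeError:
        return line  # plain-text token: pass it through unchanged
    if isinstance(data, dict):
        for key in ("content", "answer"):
            value = data.get(key)
            # jq's `//` falls through only on null or false
            if value is not None and value is not False:
                return str(value)
        return line  # neither key present: keep the raw JSON
    return str(data)


# Sample lines (assumed format) standing in for a live response body.
stream = [
    '{"content": "Generators "}',
    '{"content": "yield values lazily."}',
    '{"answer": " Done."}',
]
print("".join(extract_text(line) for line in stream))
```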

	## Evaluation Metrics (SRS Section 7)

	The system will be evaluated on:

	1. Faithfulness Rate (Target > 85%)
	   - Measures adherence to crawled context
	   - Tests: Query on known documentation with expected answers

	2. Answer Relevance (Target > 90%)
	   - Does the answer address the developer's question?
	   - Tests: Ask domain-specific questions

	3. Context Precision (Target > 80%)
	   - Were the retrieved chunks actually useful?
	   - Tests: Verify top-K results contain answer hints
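
As an illustration of how one of these metrics might be computed, the sketch below scores context precision as the plain fraction of retrieved chunks that a reviewer marked as useful. This is a simplified, unweighted definition chosen for clarity, not necessarily the one the SRS intends; the function names and the labelled cases are hypothetical.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks marked relevant (0.0 when nothing was retrieved)."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant)
    return hits / len(retrieved_ids)


def mean_score(scores):
    """Average of per-query scores (0.0 for an empty test set)."""
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labelled test set: retrieved chunk ids vs. ids judged useful.
cases = [
    (["c1", "c2", "c3"], {"c1", "c3"}),
    (["c4", "c5", "c6"], {"c4", "c5", "c6"}),
]
scores = [context_precision(retrieved, relevant) for retrieved, relevant in cases]
overall = mean_score(scores)
print(f"context precision: {overall:.2%} (target > 80%)")
```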

	## File Structure

	```
	fd/
	├── app_enhanced.py      # Enhanced RAG with crawler integration
	├── crawler.py           # Web crawler module
	├── api.py               # FastAPI server with streaming
	├── gradio-dashboard.py  # Gradio UI
	├── my_docs/             # PDFs go here
	├── crawler_docs.json    # Crawler output (auto-generated)
	├── requiements.txt      # Dependencies
	├── srs.md               # Original SRS
	└── README.md            # This file
	```

	## Troubleshooting

	### "No documents found" error
	- Ensure PDFs are in `my_docs/` folder, OR
	- Provide URLs via API/dashboard, OR
	- Run crawler first and index results

	### "Ollama connection error"
	- Ensure Ollama is running: `ollama serve`
	- Check model is available: `ollama list`
	- Pull model if needed: `ollama pull llama3`
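
These checks can also be scripted. The sketch below queries Ollama's HTTP endpoint for listing local models (`/api/tags`) and assumes the default address `http://localhost:11434`; `ollama_models` is a hypothetical helper, not part of this project.

```python
import json
import urllib.error
import urllib.request


def ollama_models(base="http://localhost:11434", timeout=3):
    """Return the names of locally available models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base}/api/tags", timeout=timeout) as resp:
            data = json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        return None
    return [model.get("name", "") for model in data.get("models", [])]


models = ollama_models()
if models is None:
    print("Ollama is not reachable; start it with `ollama serve`")
elif not any(name.split(":")[0] == "llama3" for name in models):
    print("llama3 is missing; run `ollama pull llama3`")
else:
    print("Ollama is running and llama3 is available")
```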

	### Crawler timeout
	- Increase `timeout` parameter in crawler
	- Increase `delay` to avoid rate limiting
	- Reduce `max_pages` to crawl fewer pages

	### Slow retrieval
	- The first query loads the embedding model (~30s on first run)
	- Subsequent queries are cached and faster
	- Consider using GPU for faster embeddings

	## Next Steps

	1. Test with real documentation: Try crawling your target site
	2. Evaluate metrics: Run test queries and measure quality
	3. Fine-tune parameters: Adjust chunk size, overlap, k-value
	4. Deploy: Use production-grade server (Gunicorn + Uvicorn)
	5. Monitor: Log query metrics and relevance feedback