# Implementation Guide: Intelligent Documentation Crawler & RAG Assistant

This guide walks through the implementation of the SRS for the Documentation Crawler and RAG Assistant system.

## Architecture Overview

The system follows the modular architecture specified in SRS Section 6:

```
[Crawler] ──> [Document Extractor] ──> [Text Splitter] ──> [Embeddings] ──> [Vector DB]
                                                                                 │
[Query] ──> [Text Splitter] ──> [Embeddings] ──> [Semantic Search] ──────────────┘
                                                         │
                                                         ▼
                                                  [LLM + Prompt] ──> [Answer]
```

## Components

### 1. **Crawler Module** (`src/crawler.py`)

Implements the requirements from **SRS Section 3.1 - Data Ingestion & Crawling**:

- **Targeted Ingestion**: Recursively crawls URLs up to `max_depth`
- **Rate Limiting**: Configurable delay between requests (default 0.5s)
- **Politeness**: Respects domain constraints
- **State Management**: Tracks visited URLs to prevent loops
- **Boilerplate Removal**: Strips navbars, footers, and sidebars using BeautifulSoup

**Usage**:

```python
from src.crawler import DocumentationCrawler

crawler = DocumentationCrawler(
    base_url="https://docs.python.org",
    max_depth=3,
    delay=0.5,
    max_pages=100
)
documents = crawler.crawl()
# Returns list of {'url': str, 'content': str}
```

### 2. **Enhanced App Module** (`src/app_enhanced.py`)

Implements the requirements from **SRS Sections 3.2-3.4**:

**Features**:

- **Structure-Aware Parsing**: Code blocks preserved in chunks
- **Code-Preserving Chunking**:
  - Chunk size: 1000 characters (SRS spec)
  - Overlap: 200 characters (SRS spec)
- **Local Vectorization**: Uses `sentence-transformers/all-MiniLM-L6-v2`
- **Semantic Search**: Top-K retrieval (k=3)
- **Multi-source Loading**: PDFs + URLs + Crawler output

**Usage**:

```python
from src.app_enhanced import answer_question, index_crawler_results

# Option 1: Query with URLs
answer = answer_question(
    "How to use async/await?",
    urls=["https://docs.python.org"]
)

# Option 2: Index crawler results
# `crawler` is a DocumentationCrawler instance (see the Crawler Module usage)
docs = crawler.crawl()
index_crawler_results(docs)
answer = answer_question("How to use async/await?")
```

### 3. **FastAPI Server** (`src/api.py`)

Implements **SRS Section 4.1 - Performance & Latency** and **SRS Section 5 - Technical Stack**:

**Endpoints**:

#### `/health` (GET)

Health check endpoint.

```bash
curl http://localhost:8000/health
```

#### `/query` (POST)

Single query with a complete response.

```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How to use decorators?",
    "urls": ["https://docs.python.org"]
  }'
```

#### `/query/stream` (POST)

Streaming response (tokens are streamed as they are generated).

```bash
curl -X POST http://localhost:8000/query/stream \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is async/await?",
    "urls": ["https://docs.python.org"]
  }'
```

#### `/crawl/prepare` (POST)

Crawl a website (non-blocking status).

```bash
curl -X POST http://localhost:8000/crawl/prepare \
  -H "Content-Type: application/json" \
  -d '{
    "base_url": "https://docs.python.org",
    "max_depth": 3,
    "max_pages": 100
  }'
```

#### `/index/from-crawl` (POST)

Crawl a website and automatically index the results.

```bash
curl -X POST http://localhost:8000/index/from-crawl \
  -H "Content-Type: application/json" \
  -d '{
    "base_url": "https://docs.python.org",
    "max_depth": 2
  }'
```

### 4. **Gradio Dashboard** (`ui/gradio-dashboard.py`)

User-friendly interface with URL input for web scraping.

**Usage**:

```bash
python ui/gradio-dashboard.py
```

Then open the URL shown (typically `http://localhost:7860`).

## System Non-Functional Requirements (SRS Section 4)

### Performance (SRS 4.1)

- ✓ **Retrieval Time**: Semantic search < 200ms (ChromaDB optimized)
- ✓ **Streaming**: Implemented in the `/query/stream` endpoint
- ✓ **TTFT**: Token streaming reduces perceived latency

### Scalability (SRS 4.2)

- ✓ **Memory Efficiency**: Local Chroma DB runs on consumer hardware
- ✓ **Modular Architecture**: Crawler runs independently from the query API
- ✓ **Indexing**: Can be done offline without interrupting queries

### Reliability (SRS 4.3)

- ✓ **Graceful Degradation**: Failed URLs are logged and processing continues
- ✓ **Error Handling**: `try`/`except` blocks preserve system stability
- ✓ **Retry Logic**: Built into the crawler with exponential backoff

## Setup & Installation

### 1. Install Dependencies

```bash
pip install -r requiements.txt
```

### 2. Ensure Ollama is Running

```bash
# On Windows/Mac/Linux, start Ollama service
ollama serve

# In another terminal, pull the model
ollama pull llama3
```

### 3. Run the API Server

```bash
python -m src.api
```

The server runs on `http://localhost:8000`.

### 4. Run the Dashboard

```bash
python ui/gradio-dashboard.py
```

The dashboard runs on `http://localhost:7860`.

## Workflow Examples

### Example 1: Index Python Docs and Query

```python
from src.app_enhanced import answer_question, index_crawler_results
from src.crawler import DocumentationCrawler

# Crawl Python documentation
crawler = DocumentationCrawler(
    base_url="https://docs.python.org/3",
    max_depth=2,
    max_pages=50
)
docs = crawler.crawl()

# Index the results
index_crawler_results(docs)

# Query
answer = answer_question("What is a context manager and how do I use it?")
print(answer)
```

### Example 2: Query Multiple URLs via API

```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How do I handle exceptions?",
    "urls": [
      "https://docs.python.org",
      "https://realpython.com"
    ]
  }'
```

### Example 3: Stream Response

```bash
curl -X POST http://localhost:8000/query/stream \
  -H "Content-Type: application/json" \
  -d '{"question": "Explain generators in Python"}' \
  | while IFS= read -r line; do
      echo "$line" | jq -r '.content // .answer // .'
    done
```

## Evaluation Metrics (SRS Section 7)

The system will be evaluated on:

1. **Faithfulness Rate** (Target > 85%)
   - Measures adherence to crawled context
   - Tests: Query on known documentation with expected answers

2. **Answer Relevance** (Target > 90%)
   - Does the answer address the developer's question?
   - Tests: Ask domain-specific questions

3. **Context Precision** (Target > 80%)
   - Were the retrieved chunks actually useful?
   - Tests: Verify top-K results contain answer hints

## File Structure

```
fd/
├── app_enhanced.py       # Enhanced RAG with crawler integration
├── crawler.py            # Web crawler module
├── api.py                # FastAPI server with streaming
├── gradio-dashboard.py   # Gradio UI
├── my_docs/              # PDFs go here
├── crawler_docs.json     # Crawler output (auto-generated)
├── requiements.txt       # Dependencies
├── srs.md                # Original SRS
└── README.md             # This file
```

## Troubleshooting

### "No documents found" error

- Ensure PDFs are in the `my_docs/` folder, OR
- Provide URLs via the API or dashboard, OR
- Run the crawler first and index the results

### "Ollama connection error"

- Ensure Ollama is running: `ollama serve`
- Check that the model is available: `ollama list`
- Pull the model if needed: `ollama pull llama3`

### Crawler timeout

- Increase the `timeout` parameter of the crawler
- Increase `delay` to avoid rate limiting
- Reduce `max_pages` to crawl fewer pages

### Slow retrieval

- The first query loads embeddings (~30s on first run)
- Subsequent queries are cached and faster
- Consider using a GPU for faster embeddings

## Next Steps

1. **Test with real documentation**: Try crawling your target site
2. **Evaluate metrics**: Run test queries and measure quality
3. **Fine-tune parameters**: Adjust chunk size, overlap, and k-value
4. **Deploy**: Use a production-grade server (Gunicorn with Uvicorn workers)
5. **Monitor**: Log query metrics and relevance feedback
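
When tuning chunk size and overlap, it can help to see how the two values interact before re-indexing. The sketch below is a plain fixed-width character splitter written for illustration only; it is not the splitter used in `src/app_enhanced.py` (this guide does not show that code), and `chunk_text` is a hypothetical helper name. The defaults mirror the 1000/200 values from the SRS.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into character windows where neighbours share `overlap` characters.

    Illustrative only: a real code-preserving splitter would also avoid
    cutting through code blocks, which this sketch does not attempt.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text,
        # so no trailing chunk made purely of overlap is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks
```

A larger overlap shrinks the step between windows, so the same document produces more chunks and a larger index. That trade-off is worth checking against the retrieval-time target whenever these parameters change.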