| # Implementation Guide: Intelligent Documentation Crawler & RAG Assistant | |
| This guide walks through the SRS implementation for the Documentation Crawler and RAG Assistant system. | |
| ## Architecture Overview | |
| The system follows the modular architecture specified in SRS Section 6: | |
| ``` | |
| [Crawler] ──> [Document Extractor] ──> [Text Splitter] ──> [Embeddings] ──> [Vector DB] | |
|                                                                                  │ | |
| [Query] ──> [Text Splitter] ──> [Embeddings] ──> [Semantic Search] ──────────────┘ | |
|                                                          │ | |
|                                                          ▼ | |
|                                                   [LLM + Prompt] ──> [Answer] | |
| ``` | |
| ## Components | |
| ### 1. **Crawler Module** (`src/crawler.py`) | |
| Implements the requirements from **SRS Section 3.1 - Data Ingestion & Crawling**: | |
| - **Targeted Ingestion**: Recursively crawls URLs up to `max_depth` | |
| - **Rate Limiting**: Configurable delay between requests (default 0.5s) | |
| - **Politeness**: Respects domain constraints | |
| - **State Management**: Tracks visited URLs to prevent loops | |
| - **Boilerplate Removal**: Strips navbars, footers, sidebars using BeautifulSoup | |
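As an illustration of the boilerplate-removal step, here is a minimal sketch. The guide states that the crawler uses BeautifulSoup; this sketch deliberately swaps in the standard-library `html.parser` so it runs without extra dependencies. The `SKIP_TAGS` set and the names `BoilerplateStripper` and `strip_boilerplate` are assumptions for illustration, not the crawler's actual code.

```python
from html.parser import HTMLParser

# Tags treated as boilerplate in this sketch (an assumption, not the crawler's real list).
SKIP_TAGS = {"nav", "footer", "aside", "header", "script", "style"}


class BoilerplateStripper(HTMLParser):
    """Collects text that is not nested inside any of SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a boilerplate element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def strip_boilerplate(html: str) -> str:
    """Return the page text with navbars, footers, and sidebars removed."""
    parser = BoilerplateStripper()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Tracking a nesting depth rather than a boolean keeps the sketch correct when boilerplate elements are nested, for example a `<nav>` inside a `<header>`.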
| **Usage**: | |
| ```python | |
| from src.crawler import DocumentationCrawler | |
| crawler = DocumentationCrawler( | |
|     base_url="https://docs.python.org", | |
|     max_depth=3, | |
|     delay=0.5, | |
|     max_pages=100 | |
| ) | |
| documents = crawler.crawl() | |
| # Returns list of {'url': str, 'content': str} | |
| ``` | |
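To make the interaction between `max_depth`, `max_pages`, `delay`, the visited set, and the domain constraint concrete, the following is a rough sketch of a breadth-first crawl loop. It is not the code in `src/crawler.py`. The `fetch` callable is hypothetical: it stands in for the HTTP request plus HTML parsing and is assumed to return `(content, links)`. The `sleep` parameter is also an assumption, added so the delay can be stubbed out.

```python
import time
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(base_url, fetch, max_depth=3, max_pages=100, delay=0.5, sleep=time.sleep):
    """Breadth-first crawl restricted to base_url's domain.

    fetch(url) is a hypothetical callable returning (content, links).
    """
    domain = urlparse(base_url).netloc
    visited = set()
    queue = deque([(base_url, 0)])
    documents = []

    while queue and len(documents) < max_pages:
        url, depth = queue.popleft()
        url = urldefrag(url).url  # drop #fragments so each page is visited once
        if url in visited or urlparse(url).netloc != domain:
            continue
        visited.add(url)

        try:
            content, links = fetch(url)
        except Exception:
            continue  # skip failed pages and keep crawling
        documents.append({"url": url, "content": content})

        if depth < max_depth:
            for link in links:
                queue.append((urljoin(url, link), depth + 1))
        sleep(delay)  # rate limiting between requests

    return documents
```

In this sketch a page at depth `max_depth` is still fetched, but its links are not followed.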
| ### 2. **Enhanced App Module** (`src/app_enhanced.py`) | |
| Implements requirements from **SRS Sections 3.2-3.4**: | |
| **Features**: | |
| - **Structure-Aware Parsing**: Code blocks preserved in chunks | |
| - **Code-Preserving Chunking**: | |
| - Chunk size: 1000 characters (SRS spec) | |
| - Overlap: 200 characters (SRS spec) | |
| - **Local Vectorization**: Uses `sentence-transformers/all-MiniLM-L6-v2` | |
| - **Semantic Search**: Top-K retrieval (k=3) | |
| - **Multi-source Loading**: PDFs + URLs + Crawler output | |
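The chunk size and overlap listed above interact as a sliding window: with 1000-character chunks and a 200-character overlap, each new chunk starts 800 characters after the previous one. The sketch below shows only that arithmetic. It is an assumption-level illustration and does not implement the code-preserving behaviour; `chunk_text` is a hypothetical helper, not a function from `src/app_enhanced.py`.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks
```

For a 2000-character input this produces three chunks starting at offsets 0, 800, and 1600.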
| **Usage**: | |
| ```python | |
| from src.app_enhanced import answer_question, index_crawler_results | |
| # Option 1: Query with URLs | |
| answer = answer_question( | |
|     "How to use async/await?", | |
|     urls=["https://docs.python.org"] | |
| ) | |
| # Option 2: Index crawler results | |
| # (`crawler` is the DocumentationCrawler instance created in section 1) | |
| docs = crawler.crawl() | |
| index_crawler_results(docs) | |
| answer = answer_question("How to use async/await?") | |
| ``` | |
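For readers unfamiliar with top-K retrieval, the idea can be sketched in a few lines: score every indexed chunk against the query embedding and keep the `k` best. The example below uses cosine similarity on plain Python lists. This is an assumption for illustration only: the vector store may use a different distance metric, and `top_k` is a hypothetical helper rather than part of the app.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed, k=3):
    """indexed is a list of (chunk_text, embedding) pairs.

    Returns the k highest-scoring (score, chunk_text) pairs, best first.
    """
    scored = [(cosine_similarity(query_vec, vec), chunk) for chunk, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The retrieved chunk texts are what would be placed into the LLM prompt as context.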
| ### 3. **FastAPI Server** (`src/api.py`) | |
| Implements **SRS Section 4.1 - Performance & Latency** and **SRS Section 5 - Technical Stack**: | |
| **Endpoints**: | |
| #### `/health` (GET) | |
| Health check endpoint | |
| ```bash | |
| curl http://localhost:8000/health | |
| ``` | |
| #### `/query` (POST) | |
| Single query with complete response | |
| ```bash | |
| curl -X POST http://localhost:8000/query \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "question": "How to use decorators?", | |
| "urls": ["https://docs.python.org"] | |
| }' | |
| ``` | |
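The same `/query` call can be made from Python. The sketch below uses only the standard library and mirrors the `curl` command above. The helper names `build_query_request` and `ask` are hypothetical, and the response is assumed to be JSON; the guide does not document the response schema.

```python
import json
import urllib.request


def build_query_request(question, urls=None, base="http://localhost:8000"):
    """Build (but do not send) a POST request for the /query endpoint."""
    payload = {"question": question}
    if urls:
        payload["urls"] = urls
    return urllib.request.Request(
        f"{base}/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, urls=None):
    """Send the request and decode the body, assuming the server replies with JSON."""
    request = build_query_request(question, urls)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `ask("How to use decorators?", ["https://docs.python.org"])` requires the API server to be running on port 8000.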
| #### `/query/stream` (POST) | |
| Streaming response (tokens streamed as generated) | |
| ```bash | |
| curl -X POST http://localhost:8000/query/stream \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "question": "What is async/await?", | |
| "urls": ["https://docs.python.org"] | |
| }' | |
| ``` | |
| #### `/crawl/prepare` (POST) | |
| Crawl a website (non-blocking status) | |
| ```bash | |
| curl -X POST http://localhost:8000/crawl/prepare \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "base_url": "https://docs.python.org", | |
| "max_depth": 3, | |
| "max_pages": 100 | |
| }' | |
| ``` | |
| #### `/index/from-crawl` (POST) | |
| Crawl and automatically index | |
| ```bash | |
| curl -X POST http://localhost:8000/index/from-crawl \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "base_url": "https://docs.python.org", | |
| "max_depth": 2 | |
| }' | |
| ``` | |
| ### 4. **Gradio Dashboard** (`ui/gradio-dashboard.py`) | |
| User-friendly interface with URL input for web scraping. | |
| **Usage**: | |
| ```bash | |
| python ui/gradio-dashboard.py | |
| ``` | |
| Then open the URL shown (typically `http://localhost:7860`) | |
| ## System Non-Functional Requirements (SRS Section 4) | |
| ### Performance (SRS 4.1) | |
| - ✅ **Retrieval Time**: Semantic search < 200ms (ChromaDB optimized) | |
| - ✅ **Streaming**: Implemented in `/query/stream` endpoint | |
| - ✅ **TTFT**: Token streaming reduces perceived latency | |
| ### Scalability (SRS 4.2) | |
| - ✅ **Memory Efficiency**: Local Chroma DB runs on consumer hardware | |
| - ✅ **Modular Architecture**: Crawler runs independently from query API | |
| - ✅ **Indexing**: Can be done offline, querying uninterrupted | |
| ### Reliability (SRS 4.3) | |
| - ✅ **Graceful Degradation**: Failed URLs logged, processing continues | |
| - ✅ **Error Handling**: `try`/`except` blocks preserve system stability | |
| - ✅ **Retry Logic**: Built into crawler with exponential backoff | |
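Retry with exponential backoff, as mentioned in the list above, means doubling the wait after each failed attempt. The following is a minimal sketch of that pattern, not the crawler's actual implementation. The function name, the default retry count, and the base delay are assumptions; the `sleep` parameter exists only so the waits can be stubbed.

```python
import time


def retry_with_backoff(func, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)
```

With the defaults, the waits between attempts are 0.5s, 1s, and 2s before the final error is raised.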
| ## Setup & Installation | |
| ### 1. Install Dependencies | |
| ```bash | |
| pip install -r requiements.txt | |
| ``` | |
| ### 2. Ensure Ollama is Running | |
| ```bash | |
| # On Windows/Mac/Linux, start Ollama service | |
| ollama serve | |
| # In another terminal, pull the model | |
| ollama pull llama3 | |
| ``` | |
| ### 3. Run the API Server | |
| ```bash | |
| python -m src.api | |
| ``` | |
| Server runs on `http://localhost:8000` | |
| ### 4. Run the Dashboard | |
| ```bash | |
| python ui/gradio-dashboard.py | |
| ``` | |
| Dashboard runs on `http://localhost:7860` | |
| ## Workflow Examples | |
| ### Example 1: Index Python Docs and Query | |
| ```python | |
| from src.app_enhanced import answer_question, index_crawler_results | |
| from src.crawler import DocumentationCrawler | |
| # Crawl Python documentation | |
| crawler = DocumentationCrawler( | |
|     base_url="https://docs.python.org/3", | |
|     max_depth=2, | |
|     max_pages=50 | |
| ) | |
| docs = crawler.crawl() | |
| # Index the results | |
| index_crawler_results(docs) | |
| # Query | |
| answer = answer_question("What is a context manager and how do I use it?") | |
| print(answer) | |
| ``` | |
| ### Example 2: Query Multiple URLs via API | |
| ```bash | |
| curl -X POST http://localhost:8000/query \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "question": "How do I handle exceptions?", | |
| "urls": [ | |
| "https://docs.python.org", | |
| "https://realpython.com" | |
| ] | |
| }' | |
| ``` | |
| ### Example 3: Stream Response | |
| ```bash | |
| curl -X POST http://localhost:8000/query/stream \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"question": "Explain generators in Python"}' \ | |
| | while IFS= read -r line; do | |
| echo "$line" | jq -r '.content // .answer // .' | |
| done | |
| ``` | |
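The `jq` filter in Example 3 picks `.content`, then `.answer`, then the whole value. A Python client could apply the same fallback to each streamed line. The sketch below assumes the stream is line-delimited and that lines may be either JSON objects or plain text; the guide does not specify the wire format, so treat both as assumptions. `extract_text` is a hypothetical helper.

```python
import json


def extract_text(line):
    """Return the displayable text from one streamed line."""
    line = line.strip()
    if not line:
        return ""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return line  # not JSON: treat it as a plain-text token
    if isinstance(event, dict):
        # Same fallback order as the jq filter: content, then answer, then the raw line.
        for key in ("content", "answer"):
            value = event.get(key)
            if value is not None and value is not False:
                return str(value)
        return line
    return str(event)
```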
| ## Evaluation Metrics (SRS Section 7) | |
| The system will be evaluated on: | |
| 1. **Faithfulness Rate** (Target > 85%) | |
| - Measures adherence to crawled context | |
| - Tests: Query on known documentation with expected answers | |
| 2. **Answer Relevance** (Target > 90%) | |
| - Does the answer address the developer's question? | |
| - Tests: Ask domain-specific questions | |
| 3. **Context Precision** (Target > 80%) | |
| - Were the retrieved chunks actually useful? | |
| - Tests: Verify top-K results contain answer hints | |
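One way to track these targets is to record a pass/fail judgment per test query (or per retrieved chunk, for context precision) and compare the pass rate against each threshold. The sketch below assumes exactly that. How the judgments are produced (manual review, an LLM judge, or something else) is not specified in this guide, and the names `TARGETS`, `pass_rate`, and `evaluate` are hypothetical.

```python
# Targets from the list above, expressed as fractions.
TARGETS = {"faithfulness": 0.85, "answer_relevance": 0.90, "context_precision": 0.80}


def pass_rate(judgments):
    """Fraction of True values in a list of per-item pass/fail judgments."""
    return sum(judgments) / len(judgments) if judgments else 0.0


def evaluate(results):
    """results maps metric name -> list of booleans.

    Returns metric name -> (rate, meets_target), where the target must be exceeded.
    """
    report = {}
    for name, target in TARGETS.items():
        rate = pass_rate(results.get(name, []))
        report[name] = (rate, rate > target)
    return report
```

The comparison is strict because the targets are stated as "greater than", so a rate of exactly 90% does not meet the answer-relevance target.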
| ## File Structure | |
| ``` | |
| fd/ | |
| ├── app_enhanced.py # Enhanced RAG with crawler integration | |
| ├── crawler.py # Web crawler module | |
| ├── api.py # FastAPI server with streaming | |
| ├── gradio-dashboard.py # Gradio UI | |
| ├── my_docs/ # PDFs go here | |
| ├── crawler_docs.json # Crawler output (auto-generated) | |
| ├── requiements.txt # Dependencies | |
| ├── srs.md # Original SRS | |
| └── README.md # This file | |
| ``` | |
| ## Troubleshooting | |
| ### "No documents found" error | |
| - Ensure PDFs are in `my_docs/` folder, OR | |
| - Provide URLs via API/dashboard, OR | |
| - Run crawler first and index results | |
| ### "Ollama connection error" | |
| - Ensure Ollama is running: `ollama serve` | |
| - Check model is available: `ollama list` | |
| - Pull model if needed: `ollama pull llama3` | |
| ### Crawler timeout | |
| - Increase `timeout` parameter in crawler | |
| - Increase `delay` to avoid rate limiting | |
| - Reduce `max_pages` to crawl fewer pages | |
| ### Slow retrieval | |
| - The first query loads embeddings (~30s on first run) | |
| - Subsequent queries are cached and faster | |
| - Consider using GPU for faster embeddings | |
| ## Next Steps | |
| 1. **Test with real documentation**: Try crawling your target site | |
| 2. **Evaluate metrics**: Run test queries and measure quality | |
| 3. **Fine-tune parameters**: Adjust chunk size, overlap, k-value | |
| 4. **Deploy**: Use production-grade server (Gunicorn + Uvicorn) | |
| 5. **Monitor**: Log query metrics and relevance feedback | |