# Implementation Guide: Intelligent Documentation Crawler & RAG Assistant

This guide walks through the SRS implementation for the Documentation Crawler and RAG Assistant system.

## Architecture Overview

The system follows the modular architecture specified in SRS Section 6:

```
[Crawler] ──> [Document Extractor] ──> [Text Splitter] ──> [Embeddings] ──> [Vector DB]
                                                                                    │
[Query] ──> [Text Splitter] ──> [Embeddings] ──> [Semantic Search] ──────────────┘
                                                         │
                                                         ▼
                                                  [LLM + Prompt] ──> [Answer]
```

## Components

### 1. **Crawler Module** (`src/crawler.py`)

Implements the requirements from **SRS Section 3.1 - Data Ingestion & Crawling**:

- **Targeted Ingestion**: Recursively crawls URLs up to `max_depth`
- **Rate Limiting**: Configurable delay between requests (default 0.5s)
- **Politeness**: Respects domain constraints
- **State Management**: Tracks visited URLs to prevent loops
- **Boilerplate Removal**: Strips navbars, footers, sidebars using BeautifulSoup

**Usage**:
```python
from src.crawler import DocumentationCrawler

crawler = DocumentationCrawler(
    base_url="https://docs.python.org",
    max_depth=3,
    delay=0.5,
    max_pages=100
)

documents = crawler.crawl()
# Returns list of {'url': str, 'content': str}
```
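
The depth limit, visited-set tracking, domain restriction, and request delay listed above can be sketched as a breadth-first traversal. This is a minimal illustration, not the code in `src/crawler.py`: `crawl_bfs` and the `fetch_links` callable (which stands in for downloading a page and extracting its links) are hypothetical names introduced here.

```python
import time
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag, urljoin, urlparse


def crawl_bfs(
    base_url: str,
    fetch_links: Callable[[str], Iterable[str]],  # hypothetical: returns hrefs found on a page
    max_depth: int = 3,
    max_pages: int = 100,
    delay: float = 0.5,
) -> list[str]:
    """Return same-domain URLs in breadth-first visit order."""
    domain = urlparse(base_url).netloc
    visited: set[str] = set()
    order: list[str] = []
    queue = deque([(base_url, 0)])

    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        if url in visited:
            continue  # state management: never process a URL twice
        visited.add(url)
        order.append(url)

        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit

        for href in fetch_links(url):
            # Resolve relative links and drop #fragments so duplicates collapse.
            link, _ = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == domain and link not in visited:
                queue.append((link, depth + 1))

        if delay:
            time.sleep(delay)  # rate limiting between requests

    return order
```

Passing the link extractor in as a callable keeps the traversal logic testable without network access.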

### 2. **Enhanced App Module** (`src/app_enhanced.py`)

Implements requirements from **SRS Sections 3.2-3.4**:

**Features**:
- **Structure-Aware Parsing**: Code blocks preserved in chunks
- **Code-Preserving Chunking**: 
  - Chunk size: 1000 characters (SRS spec)
  - Overlap: 200 characters (SRS spec)
- **Local Vectorization**: Uses `sentence-transformers/all-MiniLM-L6-v2`
- **Semantic Search**: Top-K retrieval (k=3)
- **Multi-source Loading**: PDFs + URLs + Crawler output

**Usage**:
```python
from src.app_enhanced import answer_question, index_crawler_results

# Option 1: Query with URLs
answer = answer_question(
    "How to use async/await?",
    urls=["https://docs.python.org"]
)

# Option 2: Index crawler results
# (`crawler` is the DocumentationCrawler instance from the Crawler Module example)
docs = crawler.crawl()
index_crawler_results(docs)
answer = answer_question("How to use async/await?")
```
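
To show how the 1000-character chunk size and 200-character overlap listed under Features interact, here is a plain sliding-window splitter. It is only a sketch: `chunk_text` is a hypothetical helper, and unlike the splitter described above it makes no attempt to keep code blocks intact.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks where neighbours share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # each chunk starts this far after the previous one
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

With the default values each new chunk starts 800 characters after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.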

### 3. **FastAPI Server** (`src/api.py`)

Implements **SRS Section 4.1 - Performance & Latency** and **SRS Section 5 - Technical Stack**:

**Endpoints**:

#### `/health` (GET)
Health check endpoint
```bash
curl http://localhost:8000/health
```

#### `/query` (POST)
Single query with complete response
```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How to use decorators?",
    "urls": ["https://docs.python.org"]
  }'
```

#### `/query/stream` (POST)
Streaming response (tokens streamed as generated)
```bash
curl -X POST http://localhost:8000/query/stream \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is async/await?",
    "urls": ["https://docs.python.org"]
  }'
```

#### `/crawl/prepare` (POST)
Crawl a website (non-blocking status)
```bash
curl -X POST http://localhost:8000/crawl/prepare \
  -H "Content-Type: application/json" \
  -d '{
    "base_url": "https://docs.python.org",
    "max_depth": 3,
    "max_pages": 100
  }'
```

#### `/index/from-crawl` (POST)
Crawl and automatically index
```bash
curl -X POST http://localhost:8000/index/from-crawl \
  -H "Content-Type: application/json" \
  -d '{
    "base_url": "https://docs.python.org",
    "max_depth": 2
  }'
```

### 4. **Gradio Dashboard** (`ui/gradio-dashboard.py`)

User-friendly interface with URL input for web scraping.

**Usage**:
```bash
python ui/gradio-dashboard.py
```

Then open the URL printed in the terminal (typically `http://localhost:7860`).

## System Non-Functional Requirements (SRS Section 4)

### Performance (SRS 4.1)
- ✓ **Retrieval Time**: Semantic search < 200ms (ChromaDB optimized)
- ✓ **Streaming**: Implemented in `/query/stream` endpoint
- ✓ **TTFT**: Token streaming reduces perceived latency
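
One way to check the 200 ms retrieval budget is to time the search call directly. The helper below is a generic sketch; `timed_ms` is a hypothetical name, and the function you pass in would be whatever performs the semantic search in your setup.

```python
import time
from typing import Any, Callable


def timed_ms(func: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Call func and return (result, elapsed wall-clock time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms
```

Run it several times and discard the first measurement, since the initial call usually includes model loading.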

### Scalability (SRS 4.2)
- ✓ **Memory Efficiency**: Local Chroma DB runs on consumer hardware
- ✓ **Modular Architecture**: Crawler runs independently from query API
- ✓ **Indexing**: Can be done offline, querying uninterrupted

### Reliability (SRS 4.3)
- ✓ **Graceful Degradation**: Failed URLs logged, processing continues
- ✓ **Error Handling**: `try`/`except` blocks preserve system stability
- ✓ **Retry Logic**: Built into crawler with exponential backoff
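
Retry with exponential backoff, as mentioned in the last item, can be sketched as follows. This is an illustration under assumptions, not the crawler's actual code: `retry_with_backoff`, its defaults, and the injectable `sleep` parameter are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> T:
    """Call func until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1.0s, 2.0s, ...
    raise RuntimeError("max_attempts must be at least 1")
```

A caller that wants graceful degradation can catch the final exception, log the failed URL, and continue with the next one.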

## Setup & Installation

### 1. Install Dependencies
```bash
pip install -r requiements.txt
```

### 2. Ensure Ollama is Running
```bash
# On Windows/Mac/Linux, start Ollama service
ollama serve

# In another terminal, pull the model
ollama pull llama3
```

### 3. Run the API Server
```bash
python -m src.api
```

Server runs on `http://localhost:8000`

### 4. Run the Dashboard
```bash
python ui/gradio-dashboard.py
```

Dashboard runs on `http://localhost:7860`

## Workflow Examples

### Example 1: Index Python Docs and Query

```python
from src.app_enhanced import answer_question, index_crawler_results
from src.crawler import DocumentationCrawler

# Crawl Python documentation
crawler = DocumentationCrawler(
    base_url="https://docs.python.org/3",
    max_depth=2,
    max_pages=50
)
docs = crawler.crawl()

# Index the results
index_crawler_results(docs)

# Query
answer = answer_question("What is a context manager and how do I use it?")
print(answer)
```

### Example 2: Query Multiple URLs via API

```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How do I handle exceptions?",
    "urls": [
      "https://docs.python.org",
      "https://realpython.com"
    ]
  }'
```

### Example 3: Stream Response

```bash
curl -X POST http://localhost:8000/query/stream \
  -H "Content-Type: application/json" \
  -d '{"question": "Explain generators in Python"}' \
  | while IFS= read -r line; do
      echo "$line" | jq -r '.content // .answer // .'
    done
```
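
The same per-line extraction can be done in Python. The sketch below mirrors the `jq` filter `.content // .answer // .` and assumes, as the shell pipeline above does, that the stream emits one JSON value per line; the source does not confirm the exact response format, and `extract_text` is a hypothetical helper.

```python
import json
from typing import Any


def extract_text(line: str) -> Any:
    """Return .content, else .answer, else the whole parsed value for one line."""
    payload = json.loads(line)
    if isinstance(payload, dict):
        for key in ("content", "answer"):
            value = payload.get(key)
            # Like jq's `//`, skip values that are null or false.
            if value is not None and value is not False:
                return value
    return payload
```

Apply it to each line read from the HTTP response body with whichever client library you already use.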

## Evaluation Metrics (SRS Section 7)

The system will be evaluated on:

1. **Faithfulness Rate** (Target > 85%)
   - Measures adherence to crawled context
   - Tests: Query on known documentation with expected answers

2. **Answer Relevance** (Target > 90%)
   - Does the answer address the developer's question?
   - Tests: Ask domain-specific questions

3. **Context Precision** (Target > 80%)
   - Were the retrieved chunks actually useful?
   - Tests: Verify top-K results contain answer hints
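
As a rough proxy for the "answer hints" test, you can count how many retrieved chunks mention at least one expected term. This keyword check is an assumption made for illustration, not the SRS definition of Context Precision, and `context_precision` is a hypothetical helper.

```python
def context_precision(retrieved_chunks: list[str], expected_terms: list[str]) -> float:
    """Fraction of retrieved chunks that contain at least one expected term."""
    if not retrieved_chunks:
        return 0.0
    terms = [term.lower() for term in expected_terms]
    useful = sum(
        1
        for chunk in retrieved_chunks
        if any(term in chunk.lower() for term in terms)
    )
    return useful / len(retrieved_chunks)
```

Averaging this score over a set of test questions gives a number to compare against the 80% target, bearing in mind that keyword matching misses paraphrases.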

## File Structure

```
fd/
├── app_enhanced.py         # Enhanced RAG with crawler integration
├── crawler.py              # Web crawler module
├── api.py                  # FastAPI server with streaming
├── gradio-dashboard.py     # Gradio UI
├── my_docs/                # PDFs go here
├── crawler_docs.json       # Crawler output (auto-generated)
├── requiements.txt         # Dependencies
├── srs.md                  # Original SRS
└── README.md               # This file
```

## Troubleshooting

### "No documents found" error
- Ensure PDFs are in `my_docs/` folder, OR
- Provide URLs via API/dashboard, OR
- Run crawler first and index results

### "Ollama connection error"
- Ensure Ollama is running: `ollama serve`
- Check model is available: `ollama list`
- Pull model if needed: `ollama pull llama3`

### Crawler timeout
- Increase `timeout` parameter in crawler
- Increase `delay` to avoid rate limiting
- Reduce `max_pages` to crawl fewer pages

### Slow retrieval
- The first query loads the embedding model (~30s on first run)
- Subsequent queries reuse the loaded model and are faster
- Consider using GPU for faster embeddings

## Next Steps

1. **Test with real documentation**: Try crawling your target site
2. **Evaluate metrics**: Run test queries and measure quality
3. **Fine-tune parameters**: Adjust chunk size, overlap, k-value
4. **Deploy**: Use a production-grade server (Gunicorn + Uvicorn)
5. **Monitor**: Log query metrics and relevance feedback