# RAG API Documentation

A fast API endpoint for querying the product design RAG system, targeting response times under 3 seconds.

## Quick Start

### Deploy the API

```bash
# Deploy to Modal
modal deploy src/rag/rag_api.py

# The deploy output prints the endpoint URL; list deployed apps to confirm
modal app list
```

### Use the API

```python
from src.rag.api_client import RAGAPIClient

client = RAGAPIClient(base_url="https://your-api-url.modal.run")
result = client.query("What are the three product tiers?")
print(result['answer'])
```

## API Endpoints

### Health Check

```http
GET /health
```

**Response:**
```json
{
  "status": "healthy",
  "service": "rag-api"
}
```

### Query

```http
POST /query
Content-Type: application/json

{
  "question": "What are the three product tiers?",
  "top_k": 5,
  "max_tokens": 1024
}
```

**Response:**
```json
{
  "answer": "The three product tiers are...",
  "retrieval_time": 0.45,
  "generation_time": 1.23,
  "total_time": 1.68,
  "sources": [
    {
      "content": "...",
      "metadata": {...}
    }
  ],
  "success": true
}
```
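
The response shape above can be wrapped in a small typed container so that callers do not index into raw dictionaries. This is a minimal sketch, not part of the API: the `QueryResult` class is hypothetical, and the `error` field is assumed from the error-handling usage later in this document rather than from a documented schema.

```python
from dataclasses import dataclass, field, fields


@dataclass
class QueryResult:
    """Hypothetical typed view of a /query response."""

    success: bool = False
    answer: str = ""
    retrieval_time: float = 0.0
    generation_time: float = 0.0
    total_time: float = 0.0
    sources: list = field(default_factory=list)
    error: str = ""  # assumed to be present only on failure

    @classmethod
    def from_dict(cls, payload):
        """Build a QueryResult, ignoring keys this sketch does not know about."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in known})
```

Ignoring unknown keys keeps the wrapper tolerant of fields the server may add later.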

## Performance Optimization

### Target: <3 Second Responses

The API is optimized for fast responses:

1. **Warm Containers**: `min_containers=1` keeps one container running, avoiding cold starts
2. **Optimized LLM**: `max_tokens` reduced to 1024 (from 1536)
3. **Limited Context**: Top 3 documents, 800 characters each
4. **Prefix Caching**: Enabled for faster generation
5. **Concurrent Requests**: Up to 10 concurrent requests

### Response Time Breakdown

- **Retrieval**: 0.3-0.8 seconds
- **Generation**: 1.0-2.0 seconds
- **Total**: 1.5-3.0 seconds (target: <3s)
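
The timing fields in each query response make this breakdown checkable per request. The sketch below relies only on the documented `retrieval_time`, `generation_time`, and `total_time` fields; the helper name `timing_breakdown` and its `target_seconds` parameter are hypothetical.

```python
def timing_breakdown(response, target_seconds=3.0):
    """Split total_time into retrieval, generation, and unattributed overhead."""
    total = response["total_time"]
    retrieval = response["retrieval_time"]
    generation = response["generation_time"]
    return {
        "retrieval": retrieval,
        "generation": generation,
        # Time not covered by the two measured phases; clamped at zero
        # to absorb floating-point rounding.
        "overhead": max(0.0, total - retrieval - generation),
        "within_target": total < target_seconds,
    }
```

A consistently large `overhead` would suggest time is being spent outside retrieval and generation.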

## Usage Examples

### Python Client

```python
from src.rag.api_client import RAGAPIClient

# Initialize
client = RAGAPIClient(base_url="https://your-api-url.modal.run")

# Health check
health = client.health_check()
print(health)

# Query
result = client.query("What are the premium ranges?")
print(result['answer'])

# Fast query (optimized for speed)
result = client.query_fast("What are the three tiers?")
print(result['answer'])
```

### cURL

```bash
# Health check
curl https://your-api-url.modal.run/health

# Query
curl -X POST https://your-api-url.modal.run/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the three product tiers?",
    "top_k": 5,
    "max_tokens": 1024
  }'
```

### JavaScript/TypeScript

```javascript
const response = await fetch('https://your-api-url.modal.run/query', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    question: 'What are the three product tiers?',
    top_k: 5,
    max_tokens: 1024
  })
});

const data = await response.json();
console.log(data.answer);
```

## Configuration

### Environment Variables

- `MODAL_APP_NAME`: App name (default: "insurance-rag-api")
- `MODAL_VOLUME_NAME`: Volume name (default: "mcp-hack-ins-products")

### API Parameters

- `question` (required): The question to ask
- `top_k` (optional, default: 5): Number of documents to retrieve
- `max_tokens` (optional, default: 1024): Maximum response length
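
As an illustration of how these parameters combine into a request body, the following sketch applies the documented defaults and rejects obviously invalid values before anything is sent. The helper name `build_payload` is hypothetical, and the positive-integer checks are an assumption: this document does not state which constraints the server enforces.

```python
def build_payload(question, top_k=5, max_tokens=1024):
    """Return a /query request body using the documented defaults."""
    if not isinstance(question, str) or not question.strip():
        raise ValueError("question is required and must be a non-empty string")
    for name, value in (("top_k", top_k), ("max_tokens", max_tokens)):
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer")
    return {
        "question": question.strip(),
        "top_k": top_k,
        "max_tokens": max_tokens,
    }
```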

## Performance Tips

1. **Use Fast Query**: For speed-critical applications, use the `query_fast()` method
2. **Reduce top_k**: Lower `top_k` (e.g., 3) for faster retrieval
3. **Reduce max_tokens**: Lower `max_tokens` (e.g., 512) for faster generation
4. **Cache Results**: Cache common queries client-side
5. **Batch Requests**: If possible, batch multiple queries
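
Client-side caching (tip 4) can be sketched as a thin wrapper around any query callable. This is one possible approach, not something the client library is documented to provide: the `CachedQuery` class, the 300-second TTL, and the whitespace and case normalization of questions are all assumptions.

```python
import time


class CachedQuery:
    """Wrap a query callable with a small in-memory TTL cache (client-side only)."""

    def __init__(self, query_fn, ttl_seconds=300.0, clock=time.monotonic):
        self._query_fn = query_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # normalized question -> (expiry, result)

    def __call__(self, question):
        # Normalize case and whitespace so trivially different phrasings share an entry.
        key = " ".join(question.lower().split())
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        result = self._query_fn(question)
        # Only cache successful answers so transient errors are retried.
        if result.get("success"):
            self._entries[key] = (now + self._ttl, result)
        return result
```

With the client from the examples above, usage would look like `cached = CachedQuery(client.query)` followed by `cached("What are the three tiers?")`.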

## Error Handling

```python
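from src.rag.api_client import RAGAPIClient

client = RAGAPIClient(base_url="https://your-api-url.modal.run")
result = client.query("your question")

# `success` is falsy on failure; `error`, when present, describes the problem
if result.get("success"):
    print(result['answer'])
else:
    print(f"Error: {result.get('error', 'Unknown error')}")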
```

## Monitoring

### Response Times

Monitor the `total_time` field in responses:
- Under 2s: Excellent
- 2-3s: Good (target)
- Over 3s: May need optimization
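
These thresholds translate directly into a small classifier that can feed logs or dashboards. The function name and the returned labels are hypothetical; treating exactly 3.0 seconds as "good" is an assumption, since the ranges above do not say which bucket the boundary belongs to.

```python
from collections import Counter


def classify_total_time(total_time):
    """Map a response's total_time (seconds) onto the bands listed above."""
    if total_time < 2.0:
        return "excellent"
    if total_time <= 3.0:
        return "good"
    return "needs optimization"


def summarize(responses):
    """Count how many responses fall into each band."""
    return Counter(classify_total_time(r["total_time"]) for r in responses)
```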

### Health Monitoring

```python
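from src.rag.api_client import RAGAPIClient

client = RAGAPIClient(base_url="https://your-api-url.modal.run")
health = client.health_check()
if health.get("status") != "healthy":
    # Handle the unhealthy state, e.g. log it and alert or retry later
    print(f"RAG API unhealthy: {health}")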
```
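
For startup scripts or deploy checks, a single health probe is often not enough. The sketch below polls a health callable with exponential backoff until it reports `"healthy"`. The helper name, the retry count, and the delays are assumptions, and because this document does not say which exceptions `client.health_check` may raise, the sketch treats any exception as "not healthy yet".

```python
import time


def wait_until_healthy(health_check, attempts=5, delay=1.0, sleep=time.sleep):
    """Poll health_check until it reports "healthy"; return True on success."""
    for attempt in range(attempts):
        try:
            if health_check().get("status") == "healthy":
                return True
        except Exception:
            # Treat transport errors as "not healthy yet" and retry.
            pass
        if attempt < attempts - 1:
            sleep(delay * (2 ** attempt))  # exponential backoff: 1s, 2s, 4s, ...
    return False
```

Usage would be `wait_until_healthy(client.health_check)`, which returns `False` once all attempts are exhausted.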

## Deployment

### Modal Deployment

```bash
# Deploy
modal deploy src/rag/rag_api.py

# The deploy output prints the endpoint URL; list deployed apps to confirm
modal app list
```

### Local Testing

```bash
# Serve an ephemeral dev app with live reload (the code still runs on Modal)
modal serve src/rag/rag_api.py
```

## Rate Limiting

The API supports up to 10 concurrent requests. For higher throughput:
- Deploy multiple instances
- Use a load balancer
- Implement client-side rate limiting
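
Because the limit is expressed as concurrent requests, one simple client-side control is to cap the number of requests in flight rather than requests per second. The sketch below does this with a bounded thread pool. The helper name `query_many` is hypothetical, and it assumes the query callable (for example `client.query`) is safe to call from multiple threads, which this document does not confirm.

```python
from concurrent.futures import ThreadPoolExecutor


def query_many(query_fn, questions, max_concurrency=10):
    """Run query_fn over questions with at most max_concurrency in flight."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # map() yields results in the same order as the input questions.
        return list(pool.map(query_fn, questions))
```

Keeping `max_concurrency` at or below 10 keeps a single client within the limit stated above. Several clients sharing one deployment would need to coordinate separately.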

## Security

- Add authentication if needed
- Use HTTPS in production
- Implement rate limiting
- Validate input questions

## Troubleshooting

### Slow Responses (>3s)
- Check if container is warm (`min_containers=1`)
- Reduce `max_tokens`
- Reduce `top_k`
- Check network latency
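
To separate network latency from server-side processing, compare the wall-clock time measured by the client with the `total_time` the server reports. This is a rough sketch: the helper name is hypothetical, and the difference also includes any queueing, cold-start, and serialization time, not only network transit.

```python
import time


def measure_client_overhead(query_fn, question, clock=time.perf_counter):
    """Compare client-side elapsed time with the server-reported total_time."""
    start = clock()
    result = query_fn(question)
    elapsed = clock() - start
    server_time = result.get("total_time", 0.0)
    return {
        "client_elapsed": elapsed,
        "server_total_time": server_time,
        # Everything the server did not account for: network, queueing, cold start.
        "overhead": max(0.0, elapsed - server_time),
    }
```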

### Errors
- Verify documents are indexed
- Check Modal app status
- Review the error message in the response