| # RAG API Documentation | |
| Fast API endpoint for querying the product design RAG system, targeting response times under 3 seconds. | |
| ## Quick Start | |
| ### Deploy the API | |
| ```bash | |
| # Deploy to Modal | |
| modal deploy src/rag/rag_api.py | |
| # Get the URL | |
| modal app list | |
| ``` | |
| ### Use the API | |
| ```python | |
| from src.rag.api_client import RAGAPIClient | |
| client = RAGAPIClient(base_url="https://your-modal-url.modal.run") | |
| result = client.query("What are the three product tiers?") | |
| print(result['answer']) | |
| ``` | |
| ## API Endpoints | |
| ### Health Check | |
| ```http | |
| GET /health | |
| ``` | |
| **Response:** | |
| ```json | |
| { | |
|   "status": "healthy", | |
|   "service": "rag-api" | |
| } | |
| ``` | |
| ### Query | |
| ```http | |
| POST /query | |
| Content-Type: application/json | |
| { | |
|   "question": "What are the three product tiers?", | |
|   "top_k": 5, | |
|   "max_tokens": 1024 | |
| } | |
| ``` | |
| **Response:** | |
| ```json | |
| { | |
|   "answer": "The three product tiers are...", | |
|   "retrieval_time": 0.45, | |
|   "generation_time": 1.23, | |
|   "total_time": 1.68, | |
|   "sources": [ | |
|     { | |
|       "content": "...", | |
|       "metadata": {...} | |
|     } | |
|   ], | |
|   "success": true | |
| } | |
| ``` | |
| ## Performance Optimization | |
| ### Target: <3 Second Responses | |
| The API is tuned for fast responses in five ways: | |
| 1. **Warm Containers**: `min_containers=1` keeps one container ready, avoiding cold starts | |
| 2. **Optimized LLM**: `max_tokens` reduced from 1536 to 1024 | |
| 3. **Limited Context**: Top 3 documents, limited to 800 characters each | |
| 4. **Prefix Caching**: Enabled for faster generation | |
| 5. **Concurrent Requests**: Up to 10 concurrent requests | |
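The context limit in item 3 can be pictured as a small truncation step that runs before the prompt is built. The following is a minimal sketch, not the code in `rag_api.py`: `build_context` and its parameters are hypothetical names, and it assumes the retrieved documents arrive best-first as dicts with a `content` key, as in the `sources` field of the query response.

```python
from typing import Dict, List


def build_context(documents: List[Dict], max_docs: int = 3, max_chars: int = 800) -> str:
    """Join the top-ranked documents into one context string.

    Hypothetical helper: assumes `documents` is sorted best-first and that
    each item carries its text under a "content" key.
    """
    snippets = []
    for doc in documents[:max_docs]:
        # Keep only the first `max_chars` characters of each document.
        snippets.append(doc.get("content", "")[:max_chars])
    # Separate documents with a blank line.
    return "\n\n".join(snippets)
```

A shorter context means fewer prompt tokens for the LLM to process, which is the usual reason this kind of cap reduces generation time.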
| ### Response Time Breakdown | |
| - **Retrieval**: 0.3-0.8 seconds | |
| - **Generation**: 1.0-2.0 seconds | |
| - **Total**: 1.5-3.0 seconds (target: <3s) | |
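The per-stage ranges above add up to 1.3-2.8 seconds, slightly less than the stated total of 1.5-3.0 seconds. One plausible reading, which is an assumption and not something the API documents, is that the remainder is request overhead. A small sketch for measuring that remainder from a response (`timing_overhead` is a hypothetical helper name):

```python
def timing_overhead(response: dict) -> float:
    """Return the seconds of `total_time` not covered by the two stages.

    Uses the `retrieval_time`, `generation_time`, and `total_time` fields
    from the query response.
    """
    accounted = response["retrieval_time"] + response["generation_time"]
    # Round to milliseconds and clamp at zero to absorb floating-point noise.
    return max(0.0, round(response["total_time"] - accounted, 3))
```

For the sample response in the Query section (0.45 + 1.23 = 1.68), the two stages account for the whole total.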
| ## Usage Examples | |
| ### Python Client | |
| ```python | |
| from src.rag.api_client import RAGAPIClient | |
| # Initialize | |
| client = RAGAPIClient(base_url="https://your-api-url.modal.run") | |
| # Health check | |
| health = client.health_check() | |
| print(health) | |
| # Query | |
| result = client.query("What are the premium ranges?") | |
| print(result['answer']) | |
| # Fast query (optimized for speed) | |
| result = client.query_fast("What are the three tiers?") | |
| print(result['answer']) | |
| ``` | |
| ### cURL | |
| ```bash | |
| # Health check | |
| curl https://your-api-url.modal.run/health | |
| # Query | |
| curl -X POST https://your-api-url.modal.run/query \ | |
|   -H "Content-Type: application/json" \ | |
|   -d '{ | |
|     "question": "What are the three product tiers?", | |
|     "top_k": 5, | |
|     "max_tokens": 1024 | |
|   }' | |
| ``` | |
| ### JavaScript/TypeScript | |
| ```javascript | |
| const response = await fetch('https://your-api-url.modal.run/query', { | |
|   method: 'POST', | |
|   headers: { | |
|     'Content-Type': 'application/json', | |
|   }, | |
|   body: JSON.stringify({ | |
|     question: 'What are the three product tiers?', | |
|     top_k: 5, | |
|     max_tokens: 1024 | |
|   }) | |
| }); | |
| const data = await response.json(); | |
| console.log(data.answer); | |
| ``` | |
| ## Configuration | |
| ### Environment Variables | |
| - `MODAL_APP_NAME`: App name (default: "insurance-rag-api") | |
| - `MODAL_VOLUME_NAME`: Volume name (default: "mcp-hack-ins-products") | |
| ### API Parameters | |
| - `question` (required): The question to ask | |
| - `top_k` (optional, default: 5): Number of documents to retrieve | |
| - `max_tokens` (optional, default: 1024): Maximum response length | |
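The parameters and defaults above can be assembled into a request body as follows. This is a sketch under assumptions: `build_query_payload` is a hypothetical helper, and the checks it performs (non-empty question, positive integers) are illustrative. The server may enforce different limits.

```python
from typing import Any, Dict


def build_query_payload(question: str, top_k: int = 5, max_tokens: int = 1024) -> Dict[str, Any]:
    """Build the JSON body for POST /query using the documented defaults."""
    question = question.strip()
    if not question:
        raise ValueError("question is required and must not be empty")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    return {"question": question, "top_k": top_k, "max_tokens": max_tokens}
```

The returned dict can be serialized with `json.dumps` and sent as the body of the `POST /query` request.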
| ## Performance Tips | |
| 1. **Use Fast Query**: For speed-critical applications, use the `query_fast()` method | |
| 2. **Reduce top_k**: Lower `top_k` (e.g., 3) for faster retrieval | |
| 3. **Reduce max_tokens**: Lower `max_tokens` (e.g., 512) for faster generation | |
| 4. **Cache Results**: Cache common queries client-side | |
| 5. **Batch Requests**: If possible, batch multiple queries | |
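Tip 4 (client-side caching) could look roughly like the sketch below. `CachedQuery` is a hypothetical wrapper, not part of `api_client`. It assumes the wrapped function accepts `top_k` and `max_tokens` as keyword arguments and returns a dict with a `success` flag, as in the response format above. Whether `RAGAPIClient.query` accepts those keywords is not confirmed here.

```python
from typing import Callable, Dict, Tuple


class CachedQuery:
    """Memoize successful answers for repeated questions (hypothetical helper)."""

    def __init__(self, query_fn: Callable[..., dict], max_entries: int = 256):
        self._query_fn = query_fn
        self._max_entries = max_entries
        self._cache: Dict[Tuple[str, int, int], dict] = {}

    def query(self, question: str, top_k: int = 5, max_tokens: int = 1024) -> dict:
        # Normalize the question so trivial variations share one cache entry.
        key = (question.strip().lower(), top_k, max_tokens)
        if key in self._cache:
            return self._cache[key]
        result = self._query_fn(question, top_k=top_k, max_tokens=max_tokens)
        # Cache only successful responses so failed requests are retried.
        if result.get("success"):
            if len(self._cache) >= self._max_entries:
                # Evict the oldest entry (dicts preserve insertion order).
                self._cache.pop(next(iter(self._cache)))
            self._cache[key] = result
        return result
```

Usage would be along the lines of `cached = CachedQuery(client.query)` followed by `cached.query("What are the three product tiers?")`. Cached answers go stale when the indexed documents change, so a real deployment would likely also need an expiry policy.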
| ## Error Handling | |
| ```python | |
| result = client.query("your question") | |
| if result.get("success"): | |
|     print(result['answer']) | |
| else: | |
|     print(f"Error: {result.get('error', 'Unknown error')}") | |
| ``` | |
| ## Monitoring | |
| ### Response Times | |
| Monitor the `total_time` field in responses: | |
| - < 2s: Excellent | |
| - 2-3s: Good (target) | |
| - > 3s: May need optimization | |
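The thresholds above translate directly into a small classifier that can be applied to each response. `rate_response_time` is a hypothetical helper name. Treating exactly 3.0 seconds as "good" is an assumption, since the bands above do not say which side the boundary falls on.

```python
def rate_response_time(total_time: float) -> str:
    """Map a response's `total_time` (seconds) onto the bands listed above."""
    if total_time < 2.0:
        return "excellent"
    if total_time <= 3.0:
        return "good"
    return "needs optimization"
```

For example, `rate_response_time(result["total_time"])` could be logged alongside each query to track how often responses miss the 3-second target.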
| ### Health Monitoring | |
| ```python | |
| health = client.health_check() | |
| if health.get("status") != "healthy": | |
|     # Handle unhealthy state | |
|     pass | |
| ``` | |
| ## Deployment | |
| ### Modal Deployment | |
| ```bash | |
| # Deploy | |
| modal deploy src/rag/rag_api.py | |
| # Confirm the app is deployed (the endpoint URL is printed by `modal deploy`) | |
| modal app list | |
| ``` | |
| ### Local Testing | |
| ```bash | |
| # Run locally (for development) | |
| modal serve src/rag/rag_api.py | |
| ``` | |
| ## Rate Limiting | |
| The API handles up to 10 concurrent requests. For higher throughput: | |
| - Deploy multiple instances | |
| - Put a load balancer in front of them | |
| - Throttle requests on the client side | |
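One way to stay within the concurrency limit from the client side is to send queries through a bounded thread pool. The sketch below is an illustration under assumptions: `run_limited` is a hypothetical helper, and it assumes the function passed in (for example `client.query`) is safe to call from multiple threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_limited(query_fn: Callable[[str], dict], questions: List[str],
                max_concurrent: int = 10) -> List[dict]:
    """Run queries with at most `max_concurrent` requests in flight.

    Results are returned in the same order as `questions`.
    """
    # The pool size caps how many calls to `query_fn` run at once.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(query_fn, questions))
```

This caps the load from a single client process only. Several clients running at once can still exceed the limit together, which is where the server-side options above come in.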
| ## Security | |
| - Add authentication if needed | |
| - Use HTTPS in production | |
| - Implement rate limiting | |
| - Validate input questions | |
| ## Troubleshooting | |
| ### Slow Responses (>3s) | |
| - Check if container is warm (`min_containers=1`) | |
| - Reduce `max_tokens` | |
| - Reduce `top_k` | |
| - Check network latency | |
| ### Errors | |
| - Verify documents are indexed | |
| - Check Modal app status | |
| - Review error messages in response | |