Spaces:

MCP-1st-Birthday
/

sdlc-agent

Runtime error

App Files Files Community

sdlc-agent / docs /api /RAG_API.md

Veeru-c

initial commit

06bd253 17 days ago

preview code

raw

history blame contribute delete

4.68 kB

	# RAG API Documentation

	Fast API endpoint for querying the product design RAG system with <3 second response times.

	## Quick Start

	### Deploy the API

	```bash
	# Deploy to Modal
	modal deploy src/rag/rag_api.py

	# Get the URL
	modal app list
	```

	### Use the API

	```python
	from src.rag.api_client import RAGAPIClient

	client = RAGAPIClient(base_url="https://your-modal-url.modal.run")
	result = client.query("What are the three product tiers?")
	print(result['answer'])
	```

	## API Endpoints

	### Health Check

	```http
	GET /health
	```

	Response:
	```json
	{
	"status": "healthy",
	"service": "rag-api"
	}
	```

	### Query

	```http
	POST /query
	Content-Type: application/json

	{
	"question": "What are the three product tiers?",
	"top_k": 5,
	"max_tokens": 1024
	}
	```

	Response:
	```json
	{
	"answer": "The three product tiers are...",
	"retrieval_time": 0.45,
	"generation_time": 1.23,
	"total_time": 1.68,
	"sources": [
	{
	"content": "...",
	"metadata": {...}
	}
	],
	"success": true
	}
	```

	## Performance Optimization

	### Target: <3 Second Responses

	The API is optimized for fast responses:

	1. Warm Containers: `min_containers=1` keeps a container ready
	2. Optimized LLM: Reduced max_tokens (1024 vs 1536)
	3. Limited Context: Top 3 documents, 800 chars each
	4. Prefix Caching: Enabled for faster generation
	5. Concurrent Requests: Up to 10 concurrent requests

	### Response Time Breakdown

	- Retrieval: 0.3-0.8 seconds
	- Generation: 1.0-2.0 seconds
	- Total: 1.5-3.0 seconds (target: <3s)

	## Usage Examples

	### Python Client

	```python
	from src.rag.api_client import RAGAPIClient

	# Initialize
	client = RAGAPIClient(base_url="https://your-api-url.modal.run")

	# Health check
	health = client.health_check()
	print(health)

	# Query
	result = client.query("What are the premium ranges?")
	print(result['answer'])

	# Fast query (optimized for speed)
	result = client.query_fast("What are the three tiers?")
	print(result['answer'])
	```

	### cURL

	```bash
	# Health check
	curl https://your-api-url.modal.run/health

	# Query
	curl -X POST https://your-api-url.modal.run/query \
	-H "Content-Type: application/json" \
	-d '{
	"question": "What are the three product tiers?",
	"top_k": 5,
	"max_tokens": 1024
	}'
	```

	### JavaScript/TypeScript

	```javascript
	const response = await fetch('https://your-api-url.modal.run/query', {
	method: 'POST',
	headers: {
	'Content-Type': 'application/json',
	},
	body: JSON.stringify({
	question: 'What are the three product tiers?',
	top_k: 5,
	max_tokens: 1024
	})
	});

	const data = await response.json();
	console.log(data.answer);
	```

	## Configuration

	### Environment Variables

	- `MODAL_APP_NAME`: App name (default: "insurance-rag-api")
	- `MODAL_VOLUME_NAME`: Volume name (default: "mcp-hack-ins-products")

	### API Parameters

	- `question` (required): The question to ask
	- `top_k` (optional, default: 5): Number of documents to retrieve
	- `max_tokens` (optional, default: 1024): Maximum response length

	## Performance Tips

	1. Use Fast Query: For speed-critical applications, use `query_fast()` method
	2. Reduce top_k: Lower `top_k` (e.g., 3) for faster retrieval
	3. Reduce max_tokens: Lower `max_tokens` (e.g., 512) for faster generation
	4. Cache Results: Cache common queries client-side
	5. Batch Requests: If possible, batch multiple queries

	## Error Handling

	```python
	result = client.query("your question")

	if result.get("success"):
	print(result['answer'])
	else:
	print(f"Error: {result.get('error', 'Unknown error')}")
	```

	## Monitoring

	### Response Times

	Monitor the `total_time` field in responses:
	- < 2s: Excellent
	- 2-3s: Good (target)
	- > 3s: May need optimization

	### Health Monitoring

	```python
	health = client.health_check()
	if health.get("status") != "healthy":
	# Handle unhealthy state
	pass
	```

	## Deployment

	### Modal Deployment

	```bash
	# Deploy
	modal deploy src/rag/rag_api.py

	# Get URL
	modal app show insurance-rag-api
	```

	### Local Testing

	```bash
	# Run locally (for development)
	modal serve src/rag/rag_api.py
	```

	## Rate Limiting

	The API supports up to 10 concurrent requests. For higher throughput:
	- Deploy multiple instances
	- Use load balancer
	- Implement client-side rate limiting

	## Security

	- Add authentication if needed
	- Use HTTPS in production
	- Implement rate limiting
	- Validate input questions

	## Troubleshooting

	### Slow Responses (>3s)
	- Check if container is warm (`min_containers=1`)
	- Reduce `max_tokens`
	- Reduce `top_k`
	- Check network latency

	### Errors
	- Verify documents are indexed
	- Check Modal app status
	- Review error messages in response