# RPM Rate Limiting Implementation - 30 RPM Compliance
## Overview
The RAG evaluation system now includes comprehensive rate limiting to ensure strict compliance with the Groq API's **30 requests per minute (RPM)** limit during evaluation.
## What Was Implemented
### 1. **Enhanced Rate Limiter**
- Tracks requests within 1-minute windows
- Automatically waits when approaching/reaching limit
- Provides detailed logging of current request rate
- Safe recursive retry after waiting period
### 2. **Safety Margin Configuration**
```python
# config.py
groq_rpm_limit: int = 30 # API limit
rate_limit_delay: float = 2.5 # Safety delay (increased from 2.0)
```
**Why 2.5 seconds?**
- 30 RPM = 2.0 seconds minimum between requests
- 2.5 seconds = ~24 actual RPM (20% safety margin below limit)
- Prevents accidental RPM violations due to network delays
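The margin can be verified with simple arithmetic. A standalone sketch (the `effective_rpm` helper is illustrative, not part of the project):

```python
# Effective request rate implied by a fixed inter-request delay.
def effective_rpm(delay_seconds: float) -> float:
    return 60.0 / delay_seconds

print(effective_rpm(2.0))  # 30.0 -> exactly at the limit, no margin
print(effective_rpm(2.5))  # 24.0 -> 20% below the 30 RPM limit
```

This is why 2.5 s was chosen over the mathematical minimum of 2.0 s: it leaves headroom for clock skew and network jitter.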
### 3. **Dual-Layer Rate Limiting**
#### Layer 1: Request Tracking (RateLimiter.acquire_sync)
- Tracks request timestamps in 60-second window
- Waits when 30 requests already made in last 60 seconds
- Logs current rate: "Current: X requests in last minute (Limit: 30 RPM)"
#### Layer 2: Safety Delay (time.sleep)
- 2.5 second delay after each successful API call
- Ensures even under load, we stay well below 30 RPM
- Configurable via `rate_limit_delay` setting
## How It Works
### Single Evaluation Flow
```
1. User starts evaluation
   ↓
2. advanced_rag_evaluator.evaluate()
   ↓
3. _get_gpt_labels() is called
   ├─ [EVALUATION] Making GPT labeling API call...
   ├─ [EVALUATION] This respects the 30 RPM rate limit
   ↓
4. llm_client.generate()
   ├─ [RATE LIMIT] Applying rate limiting (RPM: 30, delay: 2.5s)
   ├─ Calls rate_limiter.acquire_sync()
   │  ├─ [RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
   │  └─ If at limit: [RATE LIMIT] Waiting X.XXs before next request...
   │
   ├─ Makes API call to Groq
   ├─ [LLM RESPONSE] {...}
   ├─ [RATE LIMIT] Adding safety delay: 2.5s
   ├─ time.sleep(2.5)
   ↓
5. Evaluation continues...
```
### Batch Evaluation Flow (Multiple Evaluations)
```
Eval 1:   0.0s  [API call ~1-3s] + 2.5s delay
Eval 2:  ~4.5s  [API call] + 2.5s delay
Eval 3:  ~9.0s  [API call] + 2.5s delay
...
Eval 13: ~54.0s [API call] + 2.5s delay

Result: roughly 12-13 evaluations per 60 seconds (~12-13 RPM once
API response time is included; the 2.5s delay alone caps the rate
at 24 RPM). Well below the 30 RPM limit with safety margin.
```
## Configuration Options
### In config.py
```python
class Settings(BaseSettings):
    # Rate Limiting
    # 30 RPM = 2 seconds minimum between requests to stay under the limit
    groq_rpm_limit: int = 30        # API limit (required)
    rate_limit_delay: float = 2.5   # Safety delay in seconds
```
### Adjusting the Settings
**To be more aggressive (higher risk):**
```python
groq_rpm_limit: int = 30
rate_limit_delay: float = 2.0 # Closer to mathematical minimum
# Result: ~30 actual RPM (risky, no safety margin)
```
**To be more conservative (lower risk):**
```python
groq_rpm_limit: int = 30
rate_limit_delay: float = 3.0 # More safety margin
# Result: ~20 actual RPM (very safe, but slower)
```
**To use environment variables:**
```bash
# .env file
GROQ_RPM_LIMIT=30
RATE_LIMIT_DELAY=2.5
```
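Assuming the project's `Settings` extends pydantic's `BaseSettings` (as shown above), these variables map onto the fields by name automatically. The equivalent behavior can be sketched with plain `os.getenv` (the `load_rate_limit_settings` helper below is illustrative, not project code):

```python
import os

# Minimal env-override sketch; pydantic BaseSettings performs this
# mapping (including type coercion) automatically.
def load_rate_limit_settings() -> dict:
    return {
        "groq_rpm_limit": int(os.getenv("GROQ_RPM_LIMIT", "30")),
        "rate_limit_delay": float(os.getenv("RATE_LIMIT_DELAY", "2.5")),
    }

os.environ["RATE_LIMIT_DELAY"] = "3.0"  # simulate a .env override
print(load_rate_limit_settings())
```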
## Rate Limiting in Action
### Console Output Example
```
[EVALUATION] Making GPT labeling API call...
[EVALUATION] This respects the 30 RPM rate limit
[RATE LIMIT] Applying rate limiting (RPM limit: 30, delay: 2.5s)
[RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
[RATE LIMIT] Adding safety delay: 2.5s
[LLM RESPONSE] {"relevance_explanation": "...", ...}
[Waits 2.5 seconds]
[EVALUATION] Evaluation complete
```
### When Limit Is Reached
```
[EVALUATION] Eval 29 starting...
[RATE LIMIT] Applying rate limiting (RPM limit: 30, delay: 2.5s)
[RATE LIMIT] Current: 30 requests in last minute (Limit: 30 RPM)
[RATE LIMIT] At 30 RPM limit. Waiting 45.32s before next request...
[System waits ~45 seconds for oldest request to age out]
[RATE LIMIT] Current: 1 requests in last minute (Limit: 30 RPM)
[API call made]
[RATE LIMIT] Adding safety delay: 2.5s
```
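The 45.32s figure above is illustrative; the wait time is simply 60 seconds minus the age of the oldest request still in the window. A quick sketch (the `wait_time` helper is hypothetical, mirroring the logic in `acquire_sync`):

```python
# Wait time when the window is full: the oldest tracked request
# must age past 60 seconds before a new one is allowed.
def wait_time(oldest_age_seconds: float) -> float:
    return max(0.0, 60.0 - oldest_age_seconds)

print(f"[RATE LIMIT] Waiting {wait_time(14.68):.2f}s...")  # 45.32s
print(wait_time(61.0))  # 0.0 -> oldest already aged out, no wait
```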
## Performance Impact
### Time Per Evaluation
| Phase | Duration | Notes |
|-------|----------|-------|
| Rate limit check | < 1ms | Checking request history |
| API call | 1-3s | Network + Groq processing |
| Safety delay | 2.5s | Consistent across all calls |
| **Total per eval** | **~3.5-5.5s** | Includes API response time |
### Batch Processing Times
| Num Evals | Min Time | Max Time | Actual Rate |
|-----------|----------|----------|-------------|
| 10 | 35s | 55s | ~12-17 RPM |
| 20 | 70s | 110s | ~11-17 RPM |
| 30 | 105s | 165s | ~11-17 RPM |
| 50 | 175s | 275s | ~11-17 RPM |
**Key Insight:** Actual RPM is well below 30 due to:
- 2.5s safety delay
- Time for API responses
- Network latency
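The table above can be reproduced with a quick estimate. The API time range is an assumption based on typical observed latency, not a guarantee:

```python
# Estimate batch duration: each evaluation = API call time + safety delay.
SAFETY_DELAY = 2.5           # seconds (rate_limit_delay)
API_TIME_RANGE = (1.0, 3.0)  # seconds per call, assumed typical latency

def batch_duration(num_evals: int) -> tuple[float, float]:
    lo = num_evals * (API_TIME_RANGE[0] + SAFETY_DELAY)
    hi = num_evals * (API_TIME_RANGE[1] + SAFETY_DELAY)
    return lo, hi

print(batch_duration(10))  # (35.0, 55.0) -> matches the table's first row
print(batch_duration(50))  # (175.0, 275.0)
```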
## Implementation Details
### RateLimiter Class (llm_client.py)
```python
import asyncio
import time
from collections import deque
from datetime import datetime, timedelta

class RateLimiter:
    """Rate limiter for API calls to respect RPM limits."""

    def __init__(self, max_requests_per_minute: int = 30):
        self.max_requests = max_requests_per_minute
        self.request_times = deque()  # Timestamps of recent requests
        self.lock = asyncio.Lock()

    def acquire_sync(self):
        """Synchronous rate limit check before an API call."""
        now = datetime.now()
        # Remove requests older than 1 minute from the window
        while self.request_times and (now - self.request_times[0]) > timedelta(minutes=1):
            self.request_times.popleft()
        # If at the limit, wait until the oldest request ages out
        if len(self.request_times) >= self.max_requests:
            oldest = self.request_times[0]
            wait_time = 60 - (now - oldest).total_seconds()
            if wait_time > 0:
                print(f"[RATE LIMIT] Waiting {wait_time:.2f}s...")
                time.sleep(wait_time)
            return self.acquire_sync()  # Retry with a fresh window
        # Record this request
        self.request_times.append(now)
        current_rpm = len(self.request_times)
        print(f"[RATE LIMIT] Current: {current_rpm} requests in last minute")
```
### Usage in GroqLLMClient (llm_client.py)
```python
def generate(self, prompt: str, ...) -> str:
    """Generate with rate limiting."""
    # Step 1: Apply rate limiting (waits if the 60s window is full)
    self.rate_limiter.acquire_sync()
    # Step 2: Make the API call
    response = self.client.chat.completions.create(...)
    # Step 3: Add the safety delay
    time.sleep(self.rate_limit_delay)
    return response.choices[0].message.content
```
### Integration in Evaluation (advanced_rag_evaluator.py)
```python
def _get_gpt_labels(self, question, response, documents):
    """Evaluate with GPT labeling (rate limited)."""
    print("[EVALUATION] Making GPT labeling API call...")
    print("[EVALUATION] This respects the 30 RPM rate limit")
    # This call internally applies rate limiting
    llm_response = self.llm_client.generate(
        prompt=prompt,
        max_tokens=2048,
        temperature=0.0,
    )
    # Processing continues after rate limiting/delay
```
## Best Practices
### For Development
```python
# Use default settings for most cases
settings = Settings() # Uses 30 RPM limit, 2.5s delay
# Check actual rate being used
print(f"RPM Limit: {settings.groq_rpm_limit}")
print(f"Safety Delay: {settings.rate_limit_delay}")
```
### For Batch Processing
```python
# Process evaluations - rate limiting is automatic
for test_case in test_cases:
    scores = evaluator.evaluate(
        question=test_case["question"],
        response=test_case["response"],
        retrieved_documents=test_case["documents"],
    )
    # No need to add manual delays - handled automatically
```
### For Monitoring
```python
# Check console output for rate limit messages
# [RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
# [RATE LIMIT] Adding safety delay: 2.5s
# If you see "Waiting X.XXs" - system is managing load correctly
```
### Avoid These Mistakes
❌ **Don't add additional delays:**
```python
# NOT NEEDED - rate limiting already applied
result = llm_client.generate(prompt)
time.sleep(5) # ❌ Don't add this
```
❌ **Don't override settings:**
```python
# NOT RECOMMENDED - could exceed RPM limit
groq_rpm_limit = 50 # ❌ Don't change without understanding impact
rate_limit_delay = 0.5 # ❌ Too aggressive
```
✅ **Do let the system handle it:**
```python
# ✓ System automatically respects limits
evaluator.evaluate(...)
# Rate limiting is transparent
```
## Troubleshooting
### Evaluations Are Very Slow
**Symptom:** Each evaluation takes 5+ seconds
**Cause:** Rate limiting is working correctly
- Each API call: ~1-3s
- Safety delay: 2.5s
- Total: 3.5-5.5s per evaluation
**Solution:** This is expected with the 30 RPM limit. Reduce the delay only if you accept a thinner safety margin:
```python
rate_limit_delay = 1.5  # Slightly faster, but less safety margin
```
### "Waiting X.XXs" Messages Appear
**Symptom:** Frequent waiting messages during batch evaluation
**Cause:** Approaching or hitting the 30 RPM limit
**Solution:** Normal behavior - system is protecting the API
- Wait time decreases as requests age out of 60-second window
- Continue processing - evaluation will complete after wait
### Evaluation Takes Longer Than Expected
**Symptom:** 50 evaluations taking 5+ minutes
**Cause:** 30 RPM limit (by design)
- 50 evals × 5.5s = 275s ≈ 4.6 minutes
**Calculation:**
```
50 evaluations ÷ 30 requests/minute = 1.67 minutes minimum
With 2.5s delays: ~4-5 minutes typical
```
**Solution:** This is acceptable for compliance. No action needed.
## Files Modified
- ✅ **config.py** - Updated rate_limit_delay to 2.5s (safety margin)
- ✅ **llm_client.py** - Enhanced RateLimiter with logging
- ✅ **llm_client.py** - Enhanced generate() with rate limit messaging
- ✅ **advanced_rag_evaluator.py** - Added evaluation-level logging
## Testing Rate Limiting
### Manual Test
```python
import time
from llm_client import RateLimiter

limiter = RateLimiter(max_requests_per_minute=3)  # Set low for testing

# Make 4 rapid requests
for i in range(4):
    print(f"\nRequest {i+1}:")
    limiter.acquire_sync()
    print("Making API call...")
    time.sleep(0.1)

# Output will show the waiting message on the 4th request
```
### Batch Test
```python
# Run batch evaluation and check logs
# Look for: [RATE LIMIT] messages showing rate compliance
results = evaluator.evaluate_batch(test_cases)
# Should see messages like:
# [RATE LIMIT] Current: 1 requests in last minute
# [RATE LIMIT] Current: 2 requests in last minute
# ... up to 30
```
## Summary
✅ **Automatic Compliance:** Rate limiting is transparent and automatic
✅ **Safety Margin:** 2.5s delay keeps the rate well below the 30 RPM limit
✅ **Detailed Logging:** Console shows rate limiting in action
✅ **Configurable:** Settings can be adjusted if needed
✅ **Zero Code Changes:** Works with existing evaluation code
The system will never exceed the 30 RPM limit during evaluation.