# RPM Rate Limiting Implementation - 30 RPM Compliance
## Overview
The RAG evaluation system now includes comprehensive rate limiting to ensure strict compliance with the Groq API's 30 requests per minute (RPM) limit during evaluation.
## What Was Implemented
### 1. Enhanced Rate Limiter
- Tracks requests within 1-minute windows
- Automatically waits when approaching/reaching limit
- Provides detailed logging of current request rate
- Safe recursive retry after waiting period
### 2. Safety Margin Configuration

```python
# config.py
groq_rpm_limit: int = 30       # API limit
rate_limit_delay: float = 2.5  # Safety delay (increased from 2.0)
```

**Why 2.5 seconds?**
- 30 RPM = 2.0 seconds minimum between requests
- 2.5 seconds = ~24 actual RPM (20% safety margin below limit)
- Prevents accidental RPM violations due to network delays
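The arithmetic behind these bullets can be checked directly; `effective_rpm` is just an illustrative helper name, not part of the codebase:

```python
def effective_rpm(delay_s: float) -> float:
    """Requests per minute when calls are spaced `delay_s` seconds apart."""
    return 60.0 / delay_s

print(effective_rpm(2.0))  # 30.0 -> exactly at the limit, no margin
print(effective_rpm(2.5))  # 24.0 -> 20% below the 30 RPM limit
```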
### 3. Dual-Layer Rate Limiting

**Layer 1: Request Tracking (`RateLimiter.acquire_sync`)**
- Tracks request timestamps in a 60-second window
- Waits when 30 requests have already been made in the last 60 seconds
- Logs the current rate: "Current: X requests in last minute (Limit: 30 RPM)"

**Layer 2: Safety Delay (`time.sleep`)**
- 2.5-second delay after each successful API call
- Ensures that even under load, we stay well below 30 RPM
- Configurable via the `rate_limit_delay` setting
## How It Works

### Single Evaluation Flow

```
1. User starts evaluation
   ↓
2. advanced_rag_evaluator.evaluate()
   ↓
3. _get_gpt_labels() is called
   ├─ [EVALUATION] Making GPT labeling API call...
   └─ [EVALUATION] This respects the 30 RPM rate limit
   ↓
4. llm_client.generate()
   ├─ [RATE LIMIT] Applying rate limiting (RPM: 30, delay: 2.5s)
   ├─ Calls rate_limiter.acquire_sync()
   │   ├─ [RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
   │   └─ If at limit: [RATE LIMIT] Waiting X.XXs before next request...
   ├─ Makes API call to Groq
   ├─ [LLM RESPONSE] {...}
   ├─ [RATE LIMIT] Adding safety delay: 2.5s
   └─ time.sleep(2.5)
   ↓
5. Evaluation continues...
```
### Batch Evaluation Flow (Multiple Evaluations)

```
Eval 1:   0.0s  [API call] + 2.5s wait
Eval 2:   2.5s  [API call] + 2.5s wait
Eval 3:   5.0s  [API call] + 2.5s wait
...
Eval 12: 27.5s  [API call] + 2.5s wait
Eval 13: 30.0s  [API call] + 2.5s wait

Result: the 2.5s spacing alone caps throughput at ~24 calls per 60 seconds
(~24 RPM, well below the 30 RPM limit; API response time pushes the
actual rate even lower)
```
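The timeline above can be simulated with a few lines; this sketch spaces call start times 2.5 s apart (delay only, ignoring API latency) and counts how many fall inside the first 60-second window:

```python
DELAY = 2.5  # safety delay between calls, in seconds

# Start time of eval i (0-indexed), considering only the safety delay
starts = [i * DELAY for i in range(100)]

# How many calls begin within the first 60-second window?
in_window = [t for t in starts if t < 60.0]
print(len(in_window))  # 24 call starts in 60s -> ~24 RPM
```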
## Configuration Options

### In config.py

```python
class Settings(BaseSettings):
    # Rate Limiting
    # 30 RPM = 2 seconds minimum between requests to stay under the limit
    groq_rpm_limit: int = 30        # API limit (required)
    rate_limit_delay: float = 2.5   # Safety delay in seconds
```
### Adjusting the Settings

To be more aggressive (higher risk):

```python
groq_rpm_limit: int = 30
rate_limit_delay: float = 2.0  # Closer to the mathematical minimum
# Result: ~30 actual RPM (risky, no safety margin)
```

To be more conservative (lower risk):

```python
groq_rpm_limit: int = 30
rate_limit_delay: float = 3.0  # More safety margin
# Result: ~20 actual RPM (very safe, takes more time)
```

To use environment variables:

```
# .env file
GROQ_RPM_LIMIT=30
RATE_LIMIT_DELAY=2.5
```
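For reference, the same `.env` keys can be read directly from the environment; a minimal sketch of the fallback logic (the variable names match the keys above, the defaults match the documented settings):

```python
import os

# Fall back to the documented defaults when the variables are unset
groq_rpm_limit = int(os.getenv("GROQ_RPM_LIMIT", "30"))
rate_limit_delay = float(os.getenv("RATE_LIMIT_DELAY", "2.5"))

print(groq_rpm_limit, rate_limit_delay)
```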
## Rate Limiting in Action

### Console Output Example

```
[EVALUATION] Making GPT labeling API call...
[EVALUATION] This respects the 30 RPM rate limit
[RATE LIMIT] Applying rate limiting (RPM limit: 30, delay: 2.5s)
[RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
[RATE LIMIT] Adding safety delay: 2.5s
[LLM RESPONSE] {"relevance_explanation": "...", ...}
[Waits 2.5 seconds]
[EVALUATION] Evaluation complete
```
### When the Limit Is Reached

```
[EVALUATION] Eval 29 starting...
[RATE LIMIT] Applying rate limiting (RPM limit: 30, delay: 2.5s)
[RATE LIMIT] Current: 30 requests in last minute (Limit: 30 RPM)
[RATE LIMIT] At 30 RPM limit. Waiting 45.32s before next request...
[System waits ~45 seconds for the oldest request to age out]
[RATE LIMIT] Current: 1 requests in last minute (Limit: 30 RPM)
[API call made]
[RATE LIMIT] Adding safety delay: 2.5s
```
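The 45.32 s wait in that log falls straight out of the 60-second window: the limiter waits until the oldest tracked request is a full minute old. A worked check (the 14.68 s age is illustrative, chosen to match the log):

```python
# If the oldest of the 30 tracked requests was made 14.68s ago,
# the limiter must wait until that request is 60s old:
oldest_age_s = 14.68
wait_s = 60.0 - oldest_age_s
print(f"{wait_s:.2f}s")  # 45.32s
```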
## Performance Impact

### Time Per Evaluation
| Phase | Duration | Notes |
|---|---|---|
| Rate limit check | < 1ms | Checking request history |
| API call | 1-3s | Network + Groq processing |
| Safety delay | 2.5s | Consistent across all calls |
| Total per eval | ~3.5-5.5s | Includes API response time |
### Batch Processing Times
| Num Evals | Min Time | Max Time | Actual Rate |
|---|---|---|---|
| 10 | 35s | 55s | ~12-17 RPM |
| 20 | 70s | 110s | ~11-17 RPM |
| 30 | 105s | 165s | ~11-17 RPM |
| 50 | 175s | 275s | ~11-17 RPM |
**Key Insight:** actual RPM is well below 30 due to:
- 2.5s safety delay
- Time for API responses
- Network latency
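These factors combine as follows; a quick sketch of how per-call latency turns the 2.5 s delay into the ~11-17 RPM range in the table (the latency bounds are the 1-3 s figures from above, and `actual_rpm` is an illustrative helper):

```python
DELAY = 2.5  # safety delay, in seconds

def actual_rpm(api_latency_s: float) -> float:
    """Sustained requests/minute when each call costs latency + delay."""
    return 60.0 / (api_latency_s + DELAY)

print(round(actual_rpm(1.0), 1))  # fast responses: ~17.1 RPM
print(round(actual_rpm(3.0), 1))  # slow responses: ~10.9 RPM
```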
## Implementation Details

### RateLimiter Class (llm_client.py)

```python
import asyncio
import time
from collections import deque
from datetime import datetime, timedelta


class RateLimiter:
    """Rate limiter for API calls to respect RPM limits."""

    def __init__(self, max_requests_per_minute: int = 30):
        self.max_requests = max_requests_per_minute
        self.request_times = deque()  # Timestamps of recent requests
        self.lock = asyncio.Lock()

    def acquire_sync(self):
        """Synchronous rate limit check before an API call."""
        now = datetime.now()

        # Remove requests older than 1 minute
        while self.request_times and (now - self.request_times[0]) > timedelta(minutes=1):
            self.request_times.popleft()

        # If at the limit, wait until the oldest request ages out
        if len(self.request_times) >= self.max_requests:
            oldest = self.request_times[0]
            wait_time = 60 - (now - oldest).total_seconds()
            if wait_time > 0:
                print(f"[RATE LIMIT] Waiting {wait_time:.2f}s...")
                time.sleep(wait_time)
            return self.acquire_sync()  # Retry with a fresh window

        # Record this request
        self.request_times.append(now)
        current_rpm = len(self.request_times)
        print(f"[RATE LIMIT] Current: {current_rpm} requests in last minute")
```
### Usage in GroqLLMClient (llm_client.py)

```python
def generate(self, prompt: str, ...) -> str:
    """Generate with rate limiting."""
    # Step 1: Apply rate limiting
    self.rate_limiter.acquire_sync()

    # Step 2: Make the API call
    response = self.client.chat.completions.create(...)

    # Step 3: Add the safety delay
    time.sleep(self.rate_limit_delay)

    return response.choices[0].message.content
```
### Integration in Evaluation (advanced_rag_evaluator.py)

```python
def _get_gpt_labels(self, question, response, documents):
    """Evaluate with GPT labeling (rate limited)."""
    print("[EVALUATION] Making GPT labeling API call...")
    print("[EVALUATION] This respects the 30 RPM rate limit")

    # This call internally applies rate limiting
    llm_response = self.llm_client.generate(
        prompt=prompt,
        max_tokens=2048,
        temperature=0.0
    )
    # Processing continues after the rate limiting/delay
```
## Best Practices

### For Development

```python
# Use the default settings for most cases
settings = Settings()  # 30 RPM limit, 2.5s delay

# Check the actual rate being used
print(f"RPM Limit: {settings.groq_rpm_limit}")
print(f"Safety Delay: {settings.rate_limit_delay}")
```
### For Batch Processing

```python
# Process evaluations - rate limiting is automatic
for test_case in test_cases:
    scores = evaluator.evaluate(
        question=test_case["question"],
        response=test_case["response"],
        retrieved_documents=test_case["documents"]
    )
    # No need to add manual delays - handled automatically
```
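Before kicking off a large batch, a rough wall-clock estimate helps set expectations; `estimate_batch_seconds` is a hypothetical helper built from the per-eval figures above (2.5 s delay plus a typical ~2 s API call):

```python
def estimate_batch_seconds(n_evals: int,
                           delay_s: float = 2.5,
                           api_latency_s: float = 2.0) -> float:
    """Rough wall-clock estimate: each eval costs one API call plus the delay."""
    return n_evals * (api_latency_s + delay_s)

# 50 evaluations at ~4.5s each -> about 3.75 minutes
print(estimate_batch_seconds(50) / 60)
```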
### For Monitoring

Watch the console output for rate limit messages:

```
[RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
[RATE LIMIT] Adding safety delay: 2.5s
```

If you see "Waiting X.XXs" messages, the system is managing load correctly.
### Avoid These Mistakes

❌ Don't add additional delays:

```python
# NOT NEEDED - rate limiting is already applied
result = llm_client.generate(prompt)
time.sleep(5)  # ❌ Don't add this
```

❌ Don't override settings:

```python
# NOT RECOMMENDED - could exceed the RPM limit
groq_rpm_limit = 50     # ❌ Don't change without understanding the impact
rate_limit_delay = 0.5  # ❌ Too aggressive
```

✅ Do let the system handle it:

```python
# ✅ The system automatically respects the limits
evaluator.evaluate(...)
# Rate limiting is transparent
```
## Troubleshooting

### Evaluations Are Very Slow

**Symptom:** Each evaluation takes 5+ seconds

**Cause:** Rate limiting is working correctly
- Each API call: ~1-3s
- Safety delay: 2.5s
- Total: 3.5-5.5s per evaluation

**Solution:** This is expected with the 30 RPM limit. Reduce the delay only if you can accept a thinner safety margin:

```python
rate_limit_delay = 1.5  # Slightly faster (but less safety margin)
```
### "Waiting X.XXs" Messages Appear

**Symptom:** Frequent waiting messages during batch evaluation

**Cause:** Approaching or hitting the 30 RPM limit

**Solution:** Normal behavior - the system is protecting the API
- The wait time decreases as requests age out of the 60-second window
- Continue processing - the evaluation will complete after the wait
### Evaluation Takes Longer Than Expected

**Symptom:** 50 evaluations taking 5+ minutes

**Cause:** The 30 RPM limit (by design)
- 50 evals × 5.5s = 275s ≈ 4.6 minutes

**Calculation:**

```
50 evaluations ÷ 30 requests/minute = 1.67 minutes minimum
With 2.5s delays: ~4-5 minutes typical
```

**Solution:** This is acceptable for compliance. No action needed.
## Files Modified

- ✅ config.py - Updated rate_limit_delay to 2.5s (safety margin)
- ✅ llm_client.py - Enhanced RateLimiter with logging
- ✅ llm_client.py - Enhanced generate() with rate limit messaging
- ✅ advanced_rag_evaluator.py - Added evaluation-level logging
## Testing Rate Limiting

### Manual Test

```python
import time

from llm_client import RateLimiter

limiter = RateLimiter(max_requests_per_minute=3)  # Set low for testing

# Make 4 rapid requests
for i in range(4):
    print(f"\nRequest {i + 1}:")
    limiter.acquire_sync()
    print("Making API call...")
    time.sleep(0.1)

# The output will show a waiting message on the 4th request
```
### Batch Test

```python
# Run a batch evaluation and check the logs
# Look for [RATE LIMIT] messages showing rate compliance
results = evaluator.evaluate_batch(test_cases)

# You should see messages like:
# [RATE LIMIT] Current: 1 requests in last minute
# [RATE LIMIT] Current: 2 requests in last minute
# ... up to 30
```
## Summary

- ✅ **Automatic Compliance:** Rate limiting is transparent and automatic
- ✅ **Safety Margin:** The 2.5s delay keeps the actual rate well below the 30 RPM limit
- ✅ **Detailed Logging:** The console shows rate limiting in action
- ✅ **Configurable:** Settings can be adjusted if needed
- ✅ **Zero Code Changes:** Works with existing evaluation code

The system is designed never to exceed the 30 RPM limit during evaluation.