# RPM Rate Limiting Implementation - 30 RPM Compliance

## Overview

The RAG evaluation system now includes comprehensive rate limiting to ensure strict compliance with the Groq API's **30 requests per minute (RPM)** limit during evaluation.

## What Was Implemented

### 1. **Enhanced Rate Limiter**

- Tracks requests within 1-minute windows
- Automatically waits when approaching/reaching the limit
- Provides detailed logging of the current request rate
- Safe recursive retry after the waiting period

### 2. **Safety Margin Configuration**

```python
# config.py
groq_rpm_limit: int = 30       # API limit
rate_limit_delay: float = 2.5  # Safety delay (increased from 2.0)
```

**Why 2.5 seconds?**

- 30 RPM = 2.0 seconds minimum between requests
- 2.5 seconds = ~24 actual RPM (20% safety margin below the limit)
- Prevents accidental RPM violations due to network delays

### 3. **Dual-Layer Rate Limiting**

#### Layer 1: Request Tracking (RateLimiter.acquire_sync)

- Tracks request timestamps in a 60-second window
- Waits when 30 requests have already been made in the last 60 seconds
- Logs the current rate: "Current: X requests in last minute (Limit: 30 RPM)"

#### Layer 2: Safety Delay (time.sleep)

- 2.5-second delay after each successful API call
- Ensures that even under load, we stay well below 30 RPM
- Configurable via the `rate_limit_delay` setting

## How It Works

### Single Evaluation Flow

```
1. User starts evaluation
   ↓
2. advanced_rag_evaluator.evaluate()
   ↓
3. _get_gpt_labels() is called
   ├─ [EVALUATION] Making GPT labeling API call...
   ├─ [EVALUATION] This respects the 30 RPM rate limit
   ↓
4. llm_client.generate()
   ├─ [RATE LIMIT] Applying rate limiting (RPM: 30, delay: 2.5s)
   ├─ Calls rate_limiter.acquire_sync()
   │   ├─ [RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
   │   └─ If at limit: [RATE LIMIT] Waiting X.XXs before next request...
   │
   ├─ Makes API call to Groq
   ├─ [LLM RESPONSE] {...}
   ├─ [RATE LIMIT] Adding safety delay: 2.5s
   ├─ time.sleep(2.5)
   ↓
5. Evaluation continues...
```

### Batch Evaluation Flow (Multiple Evaluations)

```
Eval 1:   0s     [API call] + 2.5s wait
Eval 2:   2.5s   [API call] + 2.5s wait
Eval 3:   5.0s   [API call] + 2.5s wait
...
Eval 12:  27.5s  [API call] + 2.5s wait
Eval 13:  30s    [API call] + 2.5s wait

Result: ~24 evaluations per 60 seconds = ~24 RPM
(Well below the 30 RPM limit; API response time lowers the actual rate further)
```

## Configuration Options

### In config.py

```python
class Settings(BaseSettings):
    # Rate Limiting
    # 30 RPM = 2 seconds minimum between requests to stay under the limit
    groq_rpm_limit: int = 30       # API limit (required)
    rate_limit_delay: float = 2.5  # Safety delay in seconds
```

### Adjusting the Settings

**To be more aggressive (higher risk):**

```python
groq_rpm_limit: int = 30
rate_limit_delay: float = 2.0  # Closer to the mathematical minimum
# Result: ~30 actual RPM (risky, no safety margin)
```

**To be more conservative (lower risk):**

```python
groq_rpm_limit: int = 30
rate_limit_delay: float = 3.0  # More safety margin
# Result: ~20 actual RPM (very safe, more time)
```

**To use environment variables:**

```bash
# .env file
GROQ_RPM_LIMIT=30
RATE_LIMIT_DELAY=2.5
```

## Rate Limiting in Action

### Console Output Example

```
[EVALUATION] Making GPT labeling API call...
[EVALUATION] This respects the 30 RPM rate limit
[RATE LIMIT] Applying rate limiting (RPM limit: 30, delay: 2.5s)
[RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
[RATE LIMIT] Adding safety delay: 2.5s
[LLM RESPONSE] {"relevance_explanation": "...", ...}
[Waits 2.5 seconds]
[EVALUATION] Evaluation complete
```

### When Limit Is Reached

```
[EVALUATION] Eval 29 starting...
[RATE LIMIT] Applying rate limiting (RPM limit: 30, delay: 2.5s)
[RATE LIMIT] Current: 30 requests in last minute (Limit: 30 RPM)
[RATE LIMIT] At 30 RPM limit. Waiting 45.32s before next request...
[System waits ~45 seconds for oldest request to age out]
[RATE LIMIT] Current: 1 requests in last minute (Limit: 30 RPM)
[API call made]
[RATE LIMIT] Adding safety delay: 2.5s
```

## Performance Impact

### Time Per Evaluation

| Phase | Duration | Notes |
|-------|----------|-------|
| Rate limit check | < 1ms | Checking request history |
| API call | 1-3s | Network + Groq processing |
| Safety delay | 2.5s | Consistent across all calls |
| **Total per eval** | **~3.5-5.5s** | Includes API response time |

### Batch Processing Times

| Num Evals | Min Time | Max Time | Actual Rate |
|-----------|----------|----------|-------------|
| 10 | 35s | 55s | ~11-17 RPM |
| 20 | 70s | 110s | ~11-17 RPM |
| 30 | 105s | 165s | ~11-17 RPM |
| 50 | 175s | 275s | ~11-17 RPM |

**Key Insight:** Actual RPM is well below 30 due to:

- The 2.5s safety delay
- Time for API responses
- Network latency

## Implementation Details

### RateLimiter Class (llm_client.py)

```python
import asyncio
import time
from collections import deque
from datetime import datetime, timedelta


class RateLimiter:
    """Rate limiter for API calls to respect RPM limits."""

    def __init__(self, max_requests_per_minute: int = 30):
        self.max_requests = max_requests_per_minute
        self.request_times = deque()  # Tracks request times
        self.lock = asyncio.Lock()

    def acquire_sync(self):
        """Synchronous rate limit check before API call."""
        now = datetime.now()

        # Remove requests older than 1 minute
        while self.request_times and (now - self.request_times[0]) > timedelta(minutes=1):
            self.request_times.popleft()

        # If at limit, wait for the oldest request to age out, then retry
        if len(self.request_times) >= self.max_requests:
            oldest = self.request_times[0]
            wait_time = 60 - (now - oldest).total_seconds()
            if wait_time > 0:
                print(f"[RATE LIMIT] At {self.max_requests} RPM limit. "
                      f"Waiting {wait_time:.2f}s before next request...")
                time.sleep(wait_time)
            return self.acquire_sync()  # Retry

        # Record this request
        self.request_times.append(now)
        current_rpm = len(self.request_times)
        print(f"[RATE LIMIT] Current: {current_rpm} requests in last minute "
              f"(Limit: {self.max_requests} RPM)")
```

### Usage in GroqLLMClient (llm_client.py)

```python
def generate(self, prompt: str, ...) -> str:
    """Generate with rate limiting."""
    # Step 1: Apply rate limiting
    self.rate_limiter.acquire_sync()

    # Step 2: Make API call
    response = self.client.chat.completions.create(...)

    # Step 3: Add safety delay
    time.sleep(self.rate_limit_delay)

    return response.choices[0].message.content
```

### Integration in Evaluation (advanced_rag_evaluator.py)

```python
def _get_gpt_labels(self, question, response, documents):
    """Evaluate with GPT labeling (rate limited)."""
    print("[EVALUATION] Making GPT labeling API call...")
    print("[EVALUATION] This respects the 30 RPM rate limit")

    # This call internally uses rate limiting
    llm_response = self.llm_client.generate(
        prompt=prompt,
        max_tokens=2048,
        temperature=0.0
    )
    # Processing continues after rate limiting/delay
```

## Best Practices

### For Development

```python
# Use default settings for most cases
settings = Settings()  # Uses 30 RPM limit, 2.5s delay

# Check the actual rate being used
print(f"RPM Limit: {settings.groq_rpm_limit}")
print(f"Safety Delay: {settings.rate_limit_delay}")
```

### For Batch Processing

```python
# Process evaluations - rate limiting is automatic
for test_case in test_cases:
    scores = evaluator.evaluate(
        question=test_case["question"],
        response=test_case["response"],
        retrieved_documents=test_case["documents"]
    )
    # No need to add manual delays - handled automatically
```

### For Monitoring

```python
# Check console output for rate limit messages:
# [RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
# [RATE LIMIT] Adding safety delay: 2.5s
# If you see "Waiting X.XXs" - the system is managing load correctly
```

### Avoid These Mistakes

❌ **Don't add additional delays:**

```python
# NOT NEEDED - rate limiting already applied
result = llm_client.generate(prompt)
time.sleep(5)  # ❌ Don't add this
```

❌ **Don't override settings:**

```python
# NOT RECOMMENDED - could exceed the RPM limit
groq_rpm_limit = 50     # ❌ Don't change without understanding the impact
rate_limit_delay = 0.5  # ❌ Too aggressive
```

✅ **Do let the system handle it:**

```python
# ✓ System automatically respects limits
evaluator.evaluate(...)  # Rate limiting is transparent
```

## Troubleshooting

### Evaluations Are Very Slow

**Symptom:** Each evaluation takes 5+ seconds

**Cause:** Rate limiting is working correctly

- Each API call: ~1-3s
- Safety delay: 2.5s
- Total: 3.5-5.5s per evaluation

**Solution:** This is expected with the 30 RPM limit. Reduce the delay only if needed:

```python
rate_limit_delay = 1.5  # Slightly faster (but a smaller safety margin)
```

### "Waiting X.XXs" Messages Appear

**Symptom:** Frequent waiting messages during batch evaluation

**Cause:** Approaching or hitting the 30 RPM limit

**Solution:** Normal behavior - the system is protecting the API

- Wait time decreases as requests age out of the 60-second window
- Continue processing - evaluation will complete after the wait

### Evaluation Takes Longer Than Expected

**Symptom:** 50 evaluations taking 5+ minutes

**Cause:** 30 RPM limit (by design)

- 50 evals × 5.5s = 275s ≈ 4.6 minutes

**Calculation:**

```
50 evaluations ÷ 30 requests/minute = 1.67 minutes minimum
With 2.5s delays: ~4-5 minutes typical
```

**Solution:** This is acceptable for compliance. No action needed.
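The timing arithmetic above can be sanity-checked with a short script. This is a sketch; `batch_duration_estimate` is a hypothetical helper written for illustration, not part of the evaluation code:

```python
def batch_duration_estimate(num_evals: int,
                            rpm_limit: int = 30,
                            safety_delay: float = 2.5,
                            api_time: tuple = (1.0, 3.0)) -> tuple:
    """Estimate batch wall-clock time in seconds.

    Returns (rpm_floor, typical_min, typical_max):
    - rpm_floor: hard minimum imposed by the RPM limit alone
    - typical_min/max: per-call API response time plus the safety delay
    """
    rpm_floor = num_evals / rpm_limit * 60
    typical_min = num_evals * (api_time[0] + safety_delay)
    typical_max = num_evals * (api_time[1] + safety_delay)
    return rpm_floor, typical_min, typical_max

floor, lo, hi = batch_duration_estimate(50)
print(f"50 evals: >= {floor:.0f}s by RPM limit alone, typically {lo:.0f}-{hi:.0f}s")
```

For 50 evaluations this reproduces the numbers in the troubleshooting section: a 100-second floor from the RPM limit (50 ÷ 30 × 60), and 175-275 seconds once the 2.5s delay and 1-3s API responses are included.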
## Files Modified

- ✅ **config.py** - Updated rate_limit_delay to 2.5s (safety margin)
- ✅ **llm_client.py** - Enhanced RateLimiter with logging
- ✅ **llm_client.py** - Enhanced generate() with rate limit messaging
- ✅ **advanced_rag_evaluator.py** - Added evaluation-level logging

## Testing Rate Limiting

### Manual Test

```python
from llm_client import RateLimiter
import time

limiter = RateLimiter(max_requests_per_minute=3)  # Set low for testing

# Make 4 rapid requests
for i in range(4):
    print(f"\nRequest {i+1}:")
    limiter.acquire_sync()
    print("Making API call...")
    time.sleep(0.1)

# Output will show a waiting message on the 4th request
```

### Batch Test

```python
# Run a batch evaluation and check the logs
# Look for: [RATE LIMIT] messages showing rate compliance
results = evaluator.evaluate_batch(test_cases)

# Should see messages like:
# [RATE LIMIT] Current: 1 requests in last minute
# [RATE LIMIT] Current: 2 requests in last minute
# ... up to 30
```

## Summary

✅ **Automatic Compliance:** Rate limiting is transparent and automatic
✅ **Safety Margin:** The 2.5s delay keeps throughput well below the 30 RPM limit
✅ **Detailed Logging:** Console output shows rate limiting in action
✅ **Configurable:** Settings can be adjusted if needed
✅ **Zero Code Changes:** Works with existing evaluation code

The system is designed so that evaluation never exceeds the 30 RPM limit.
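As a complement to the manual test above (which really sleeps), the sliding-window logic can also be unit-tested deterministically by injecting a fake clock. The sketch below mirrors the `RateLimiter` behavior described earlier; the `clock` and `sleeper` parameters are test hooks of this standalone sketch, not part of the real `llm_client` API:

```python
import time
from collections import deque

class WindowedLimiterSketch:
    """Sliding 60-second window limiter (standalone sketch for testing)."""

    def __init__(self, max_requests_per_minute=30,
                 clock=time.monotonic, sleeper=time.sleep):
        self.max_requests = max_requests_per_minute
        self.request_times = deque()
        self.clock = clock      # injectable for deterministic tests
        self.sleeper = sleeper  # injectable for deterministic tests

    def acquire_sync(self):
        while True:
            now = self.clock()
            # Evict timestamps that have aged out of the 60s window
            while self.request_times and now - self.request_times[0] >= 60:
                self.request_times.popleft()
            if len(self.request_times) < self.max_requests:
                self.request_times.append(now)
                return
            # At the limit: sleep until the oldest request ages out
            self.sleeper(60 - (now - self.request_times[0]))

# Deterministic test with a fake clock: the 4th call must wait 60s
fake_now = [0.0]
waits = []
def fake_clock():
    return fake_now[0]
def fake_sleep(seconds):
    waits.append(seconds)
    fake_now[0] += seconds

limiter = WindowedLimiterSketch(max_requests_per_minute=3,
                                clock=fake_clock, sleeper=fake_sleep)
for _ in range(4):
    limiter.acquire_sync()
print(waits)  # → [60.0]
```

Injecting the clock keeps the test instant and repeatable, which matters when the real wait can be up to a minute per window.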