# RPM Rate Limiting Implementation - 30 RPM Compliance

## Overview

The RAG evaluation system now includes comprehensive rate limiting to ensure strict compliance with the Groq API's **30 requests per minute (RPM)** limit during evaluation.

## What Was Implemented

### 1. **Enhanced Rate Limiter**

- Tracks requests within 1-minute windows
- Automatically waits when approaching/reaching the limit
- Provides detailed logging of the current request rate
- Safe recursive retry after the waiting period
### 2. **Safety Margin Configuration**

```python
# config.py
groq_rpm_limit: int = 30       # API limit
rate_limit_delay: float = 2.5  # Safety delay (increased from 2.0)
```

**Why 2.5 seconds?**

- 30 RPM = 2.0 seconds minimum between requests
- 2.5 seconds = ~24 actual RPM (20% safety margin below the limit)
- Prevents accidental RPM violations due to network delays
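The arithmetic behind these bullets can be checked directly (`effective_rpm` is an illustrative helper, not part of the codebase):

```python
def effective_rpm(delay_seconds: float) -> float:
    """Requests per minute if exactly one request is sent per delay window."""
    return 60.0 / delay_seconds

assert effective_rpm(2.0) == 30.0  # mathematical minimum spacing for 30 RPM
assert effective_rpm(2.5) == 24.0  # 20% safety margin below the limit
```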
### 3. **Dual-Layer Rate Limiting**

#### Layer 1: Request Tracking (`RateLimiter.acquire_sync`)

- Tracks request timestamps in a 60-second window
- Waits when 30 requests have already been made in the last 60 seconds
- Logs the current rate: "Current: X requests in last minute (Limit: 30 RPM)"

#### Layer 2: Safety Delay (`time.sleep`)

- 2.5-second delay after each successful API call
- Ensures that even under load we stay well below 30 RPM
- Configurable via the `rate_limit_delay` setting
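How the two layers compose can be sketched in one function (a minimal illustration, not the actual client code; `call_api`, `request_times`, and the monotonic-clock bookkeeping here are assumptions):

```python
import time
from collections import deque

def rate_limited_call(call_api, request_times: deque,
                      max_rpm: int = 30, delay: float = 2.5):
    """Layer 1: block until fewer than max_rpm calls fall in the last 60 s.
    Layer 2: sleep a fixed safety delay after the call returns."""
    while True:
        now = time.monotonic()
        while request_times and now - request_times[0] > 60:
            request_times.popleft()          # drop calls outside the window
        if len(request_times) < max_rpm:
            break
        time.sleep(60 - (now - request_times[0]) + 0.05)  # let the oldest age out
    request_times.append(time.monotonic())
    result = call_api()                      # Layer 1 satisfied: make the call
    time.sleep(delay)                        # Layer 2: fixed safety delay
    return result
```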
## How It Works

### Single Evaluation Flow

```
1. User starts evaluation
   ↓
2. advanced_rag_evaluator.evaluate()
   ↓
3. _get_gpt_labels() is called
   ├─ [EVALUATION] Making GPT labeling API call...
   └─ [EVALUATION] This respects the 30 RPM rate limit
   ↓
4. llm_client.generate()
   ├─ [RATE LIMIT] Applying rate limiting (RPM: 30, delay: 2.5s)
   ├─ Calls rate_limiter.acquire_sync()
   │   ├─ [RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
   │   └─ If at limit: [RATE LIMIT] Waiting X.XXs before next request...
   │
   ├─ Makes API call to Groq
   ├─ [LLM RESPONSE] {...}
   ├─ [RATE LIMIT] Adding safety delay: 2.5s
   └─ time.sleep(2.5)
   ↓
5. Evaluation continues...
```
### Batch Evaluation Flow (Multiple Evaluations)

```
Eval 1:  0s     [API call] + 2.5s wait
Eval 2:  2.5s   [API call] + 2.5s wait
Eval 3:  5.0s   [API call] + 2.5s wait
...
Eval 12: 27.5s  [API call] + 2.5s wait
Eval 13: 30s    [API call] + 2.5s wait

Result: at most ~24 API calls per 60 seconds = ~24 RPM
(Well below the 30 RPM limit; API response time lowers the real rate further)
```
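A rough throughput estimate follows from treating each evaluation as one API call of `api_seconds` latency plus the fixed delay (`estimated_batch_seconds` and `estimated_rpm` are illustrative helpers, not part of the codebase):

```python
def estimated_batch_seconds(n_evals: int, api_seconds: float,
                            delay: float = 2.5) -> float:
    """Total wall-clock time for n sequential rate-limited evaluations."""
    return n_evals * (api_seconds + delay)

def estimated_rpm(api_seconds: float, delay: float = 2.5) -> float:
    """Effective requests per minute given per-call latency plus the delay."""
    return 60.0 / (api_seconds + delay)

# With 1-3s API latency the effective rate lands roughly at 11-17 RPM:
assert round(estimated_rpm(1.0)) == 17
assert round(estimated_rpm(3.0)) == 11
```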
## Configuration Options

### In config.py

```python
class Settings(BaseSettings):
    # Rate Limiting
    # 30 RPM = 2 seconds minimum between requests to stay under the limit
    groq_rpm_limit: int = 30       # API limit (required)
    rate_limit_delay: float = 2.5  # Safety delay in seconds
```

### Adjusting the Settings

**To be more aggressive (higher risk):**

```python
groq_rpm_limit: int = 30
rate_limit_delay: float = 2.0  # Closer to the mathematical minimum
# Result: ~30 actual RPM (risky, no safety margin)
```

**To be more conservative (lower risk):**

```python
groq_rpm_limit: int = 30
rate_limit_delay: float = 3.0  # More safety margin
# Result: ~20 actual RPM (very safe, more time)
```

**To use environment variables:**

```bash
# .env file
GROQ_RPM_LIMIT=30
RATE_LIMIT_DELAY=2.5
```
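Pydantic's `BaseSettings` (as used in `config.py`) maps environment variables onto fields by name, so those two variables override the defaults automatically. A minimal sketch, assuming the pydantic-settings v2 package and that the project points it at `.env`:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    # GROQ_RPM_LIMIT and RATE_LIMIT_DELAY in the environment (or .env)
    # override these defaults.
    groq_rpm_limit: int = 30
    rate_limit_delay: float = 2.5

settings = Settings()
```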
## Rate Limiting in Action

### Console Output Example

```
[EVALUATION] Making GPT labeling API call...
[EVALUATION] This respects the 30 RPM rate limit
[RATE LIMIT] Applying rate limiting (RPM limit: 30, delay: 2.5s)
[RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
[RATE LIMIT] Adding safety delay: 2.5s
[LLM RESPONSE] {"relevance_explanation": "...", ...}
[Waits 2.5 seconds]
[EVALUATION] Evaluation complete
```

### When the Limit Is Reached

```
[EVALUATION] Eval 29 starting...
[RATE LIMIT] Applying rate limiting (RPM limit: 30, delay: 2.5s)
[RATE LIMIT] Current: 30 requests in last minute (Limit: 30 RPM)
[RATE LIMIT] At 30 RPM limit. Waiting 45.32s before next request...
[System waits ~45 seconds for the oldest request to age out]
[RATE LIMIT] Current: 1 requests in last minute (Limit: 30 RPM)
[API call made]
[RATE LIMIT] Adding safety delay: 2.5s
```
## Performance Impact

### Time Per Evaluation

| Phase | Duration | Notes |
|-------|----------|-------|
| Rate limit check | < 1 ms | Checking request history |
| API call | 1-3 s | Network + Groq processing |
| Safety delay | 2.5 s | Consistent across all calls |
| **Total per eval** | **~3.5-5.5 s** | Includes API response time |

### Batch Processing Times

| Num Evals | Min Time | Max Time | Actual Rate |
|-----------|----------|----------|-------------|
| 10 | 35 s | 55 s | ~11-17 RPM |
| 20 | 70 s | 110 s | ~11-17 RPM |
| 30 | 105 s | 165 s | ~11-17 RPM |
| 50 | 175 s | 275 s | ~11-17 RPM |

**Key Insight:** The actual RPM stays well below 30 due to:

- the 2.5 s safety delay
- time spent waiting for API responses
- network latency
## Implementation Details

### RateLimiter Class (llm_client.py)

```python
import asyncio
import time
from collections import deque
from datetime import datetime, timedelta

class RateLimiter:
    """Rate limiter for API calls to respect RPM limits."""

    def __init__(self, max_requests_per_minute: int = 30):
        self.max_requests = max_requests_per_minute
        self.request_times = deque()  # Timestamps of recent requests
        self.lock = asyncio.Lock()

    def acquire_sync(self):
        """Synchronous rate limit check before an API call."""
        now = datetime.now()

        # Remove requests older than 1 minute
        while self.request_times and (now - self.request_times[0]) > timedelta(minutes=1):
            self.request_times.popleft()

        # If at the limit, wait for the oldest request to age out
        if len(self.request_times) >= self.max_requests:
            oldest = self.request_times[0]
            wait_time = 60 - (now - oldest).total_seconds()
            if wait_time > 0:
                print(f"[RATE LIMIT] Waiting {wait_time:.2f}s...")
                # Small margin so the oldest entry has definitely aged out
                time.sleep(wait_time + 0.05)
            return self.acquire_sync()  # Retry after waiting

        # Record this request
        self.request_times.append(now)
        current_rpm = len(self.request_times)
        print(f"[RATE LIMIT] Current: {current_rpm} requests in last minute")
```
### Usage in GroqLLMClient (llm_client.py)

```python
def generate(self, prompt: str, ...) -> str:
    """Generate with rate limiting."""
    # Step 1: Apply rate limiting
    self.rate_limiter.acquire_sync()

    # Step 2: Make the API call
    response = self.client.chat.completions.create(...)

    # Step 3: Add the safety delay
    time.sleep(self.rate_limit_delay)

    return response.choices[0].message.content
```
### Integration in Evaluation (advanced_rag_evaluator.py)

```python
def _get_gpt_labels(self, question, response, documents):
    """Evaluate with GPT labeling (rate limited)."""
    print("[EVALUATION] Making GPT labeling API call...")
    print("[EVALUATION] This respects the 30 RPM rate limit")

    # This call internally applies rate limiting
    llm_response = self.llm_client.generate(
        prompt=prompt,
        max_tokens=2048,
        temperature=0.0
    )
    # Processing continues after the rate limiting/delay
```
## Best Practices

### For Development

```python
# Use the default settings for most cases
settings = Settings()  # Uses the 30 RPM limit and 2.5s delay

# Check the actual rate being used
print(f"RPM Limit: {settings.groq_rpm_limit}")
print(f"Safety Delay: {settings.rate_limit_delay}")
```

### For Batch Processing

```python
# Process evaluations - rate limiting is automatic
for test_case in test_cases:
    scores = evaluator.evaluate(
        question=test_case["question"],
        response=test_case["response"],
        retrieved_documents=test_case["documents"]
    )
    # No need to add manual delays - handled automatically
```

### For Monitoring

```python
# Check the console output for rate limit messages:
# [RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
# [RATE LIMIT] Adding safety delay: 2.5s
# If you see "Waiting X.XXs" - the system is managing load correctly
```
### Avoid These Mistakes

❌ **Don't add additional delays:**

```python
# NOT NEEDED - rate limiting is already applied
result = llm_client.generate(prompt)
time.sleep(5)  # ❌ Don't add this
```

❌ **Don't override the settings:**

```python
# NOT RECOMMENDED - could exceed the RPM limit
groq_rpm_limit = 50      # ❌ Don't change without understanding the impact
rate_limit_delay = 0.5   # ❌ Too aggressive
```

✅ **Do let the system handle it:**

```python
# ✅ The system automatically respects the limits
evaluator.evaluate(...)
# Rate limiting is transparent
```
## Troubleshooting

### Evaluations Are Very Slow

**Symptom:** Each evaluation takes 5+ seconds

**Cause:** Rate limiting is working correctly

- Each API call: ~1-3s
- Safety delay: 2.5s
- Total: 3.5-5.5s per evaluation

**Solution:** This is expected with the 30 RPM limit. Lower the delay only if you accept a thinner safety margin:

```python
rate_limit_delay = 1.5  # Slightly faster (but less safety margin)
```

### "Waiting X.XXs" Messages Appear

**Symptom:** Frequent waiting messages during batch evaluation

**Cause:** Approaching or hitting the 30 RPM limit

**Solution:** Normal behavior - the system is protecting the API

- The wait time decreases as requests age out of the 60-second window
- Continue processing - the evaluation will complete after the wait

### Evaluation Takes Longer Than Expected

**Symptom:** 50 evaluations taking 5+ minutes

**Cause:** The 30 RPM limit (by design)

- 50 evals × 5.5s = 275s ≈ 4.6 minutes

**Calculation:**

```
50 evaluations ÷ 30 requests/minute = 1.67 minutes minimum
With 2.5s delays: ~4-5 minutes typical
```

**Solution:** This is acceptable for compliance. No action needed.
## Files Modified

- ✅ **config.py** - Updated rate_limit_delay to 2.5s (safety margin)
- ✅ **llm_client.py** - Enhanced RateLimiter with logging
- ✅ **llm_client.py** - Enhanced generate() with rate limit messaging
- ✅ **advanced_rag_evaluator.py** - Added evaluation-level logging
## Testing Rate Limiting

### Manual Test

```python
from llm_client import RateLimiter
import time

limiter = RateLimiter(max_requests_per_minute=3)  # Set low for testing

# Make 4 rapid requests
for i in range(4):
    print(f"\nRequest {i + 1}:")
    limiter.acquire_sync()
    print("Making API call...")
    time.sleep(0.1)

# The output will show a waiting message on the 4th request
```

### Batch Test

```python
# Run a batch evaluation and check the logs
# Look for [RATE LIMIT] messages showing rate compliance
results = evaluator.evaluate_batch(test_cases)

# You should see messages like:
# [RATE LIMIT] Current: 1 requests in last minute
# [RATE LIMIT] Current: 2 requests in last minute
# ... up to 30
```
## Summary

- ✅ **Automatic Compliance:** Rate limiting is transparent and automatic
- ✅ **Safety Margin:** The 2.5s delay keeps the actual rate well below the 30 RPM limit
- ✅ **Detailed Logging:** The console shows rate limiting in action
- ✅ **Configurable:** Settings can be adjusted if needed
- ✅ **Zero Code Changes:** Works with existing evaluation code

With both layers active, the system stays safely below the 30 RPM limit throughout evaluation.