CapStoneRAG10 / docs /RPM_RATE_LIMITING.md

RPM Rate Limiting Implementation - 30 RPM Compliance

Overview

The RAG evaluation system now includes comprehensive rate limiting to ensure strict compliance with the Groq API's 30 requests per minute (RPM) limit during evaluation.

What Was Implemented

1. Enhanced Rate Limiter

  • Tracks requests within 1-minute windows
  • Automatically waits when approaching/reaching limit
  • Provides detailed logging of current request rate
  • Safe recursive retry after waiting period

2. Safety Margin Configuration

# config.py
groq_rpm_limit: int = 30        # API limit
rate_limit_delay: float = 2.5   # Safety delay (increased from 2.0)

Why 2.5 seconds?

  • 30 RPM = 2.0 seconds minimum between requests
  • 2.5 seconds = ~24 actual RPM (20% safety margin below limit)
  • Prevents accidental RPM violations due to network delays
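The margin arithmetic above can be checked in a couple of lines (`effective_rpm` is a hypothetical helper, not part of the codebase):

```python
# Effective request rate implied by a fixed delay between requests,
# optionally including an average API response time.
def effective_rpm(delay_s: float, avg_api_s: float = 0.0) -> float:
    return 60.0 / (delay_s + avg_api_s)

print(effective_rpm(2.0))  # 30.0 - the mathematical minimum spacing
print(effective_rpm(2.5))  # 24.0 - the ~20% safety margin in use
```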

3. Dual-Layer Rate Limiting

Layer 1: Request Tracking (RateLimiter.acquire_sync)

  • Tracks request timestamps in 60-second window
  • Waits when 30 requests already made in last 60 seconds
  • Logs current rate: "Current: X requests in last minute (Limit: 30 RPM)"

Layer 2: Safety Delay (time.sleep)

  • 2.5 second delay after each successful API call
  • Ensures even under load, we stay well below 30 RPM
  • Configurable via rate_limit_delay setting
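Put together, the two layers wrap every call in the same order. A minimal sketch, where `limiter` and `call_api` are illustrative stand-ins for the project's objects:

```python
import time

def rate_limited_call(limiter, call_api, delay_s: float = 2.5):
    """Apply both layers around one API request."""
    limiter.acquire_sync()   # Layer 1: sliding-window check (may block)
    result = call_api()      # the actual API request
    time.sleep(delay_s)      # Layer 2: fixed safety delay after success
    return result
```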

How It Works

Single Evaluation Flow

1. User starts evaluation
   ↓
2. advanced_rag_evaluator.evaluate()
   ↓
3. _get_gpt_labels() is called
   ├─ [EVALUATION] Making GPT labeling API call...
   ├─ [EVALUATION] This respects the 30 RPM rate limit
   ↓
4. llm_client.generate()
   ├─ [RATE LIMIT] Applying rate limiting (RPM: 30, delay: 2.5s)
   ├─ Calls rate_limiter.acquire_sync()
   │  ├─ [RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
   │  └─ If at limit: [RATE LIMIT] Waiting X.XXs before next request...
   │
   ├─ Makes API call to Groq
   ├─ [LLM RESPONSE] {...}
   ├─ [RATE LIMIT] Adding safety delay: 2.5s
   ├─ time.sleep(2.5)
   ↓
   ↓
5. Evaluation continues...

Batch Evaluation Flow (Multiple Evaluations)

Eval 1: 0s    [API call] + 2.5s wait
Eval 2: 2.5s  [API call] + 2.5s wait  
Eval 3: 5.0s  [API call] + 2.5s wait
...
Eval 12: 27.5s [API call] + 2.5s wait
Eval 13: 30s  [API call] + 2.5s wait

Result: the 2.5s delay alone caps throughput at 24 requests per 60 seconds (~24 RPM). With API response time (~1-3s per call) added, the practical rate is closer to 12-13 evaluations per minute, comfortably below the 30 RPM limit.
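The throughput figure can be reproduced with a small timing model (`avg_api_s` is an assumed average response time, not a measured one):

```python
def batch_seconds(n_evals: int, delay_s: float = 2.5, avg_api_s: float = 2.0) -> float:
    """Wall-clock seconds for n sequential evaluations:
    one API call plus the safety delay each."""
    return n_evals * (avg_api_s + delay_s)

# With a ~2s average API call, roughly 13 evaluations fit per minute:
print(round(60 / (2.0 + 2.5), 1))  # 13.3
```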

Configuration Options

In config.py

class Settings(BaseSettings):
    # Rate Limiting
    # 30 RPM = 2 seconds minimum between requests to stay under limit
    groq_rpm_limit: int = 30                    # API limit (required)
    rate_limit_delay: float = 2.5               # Safety delay in seconds

Adjusting the Settings

To be more aggressive (higher risk):

groq_rpm_limit: int = 30
rate_limit_delay: float = 2.0  # Closer to mathematical minimum
# Result: up to ~30 RPM (risky, no safety margin)

To be more conservative (lower risk):

groq_rpm_limit: int = 30
rate_limit_delay: float = 3.0  # More safety margin
# Result: ~20 actual RPM (very safe, but slower)

To use environment variables:

# .env file
GROQ_RPM_LIMIT=30
RATE_LIMIT_DELAY=2.5
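The Settings class above picks these up automatically through pydantic's BaseSettings; a dependency-free sketch of the equivalent lookup:

```python
import os

# Environment variables override the defaults; names match the .env keys.
groq_rpm_limit = int(os.environ.get("GROQ_RPM_LIMIT", "30"))
rate_limit_delay = float(os.environ.get("RATE_LIMIT_DELAY", "2.5"))
```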

Rate Limiting in Action

Console Output Example

[EVALUATION] Making GPT labeling API call...
[EVALUATION] This respects the 30 RPM rate limit
[RATE LIMIT] Applying rate limiting (RPM limit: 30, delay: 2.5s)
[RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
[RATE LIMIT] Adding safety delay: 2.5s

[LLM RESPONSE] {"relevance_explanation": "...", ...}

[Waits 2.5 seconds]

[EVALUATION] Evaluation complete

When Limit Is Reached

[EVALUATION] Eval 29 starting...
[RATE LIMIT] Applying rate limiting (RPM limit: 30, delay: 2.5s)
[RATE LIMIT] Current: 30 requests in last minute (Limit: 30 RPM)
[RATE LIMIT] At 30 RPM limit. Waiting 45.32s before next request...

[System waits ~45 seconds for oldest request to age out]

[RATE LIMIT] Current: 1 requests in last minute (Limit: 30 RPM)
[API call made]
[RATE LIMIT] Adding safety delay: 2.5s

Performance Impact

Time Per Evaluation

| Phase            | Duration  | Notes                      |
|------------------|-----------|----------------------------|
| Rate limit check | < 1ms     | Checking request history   |
| API call         | 1-3s      | Network + Groq processing  |
| Safety delay     | 2.5s      | Consistent across all calls|
| Total per eval   | ~3.5-5.5s | Includes API response time |

Batch Processing Times

| Num Evals | Min Time | Max Time | Actual Rate |
|-----------|----------|----------|-------------|
| 10        | 35s      | 55s      | ~11-17 RPM  |
| 20        | 70s      | 110s     | ~11-17 RPM  |
| 30        | 105s     | 165s     | ~11-17 RPM  |
| 50        | 175s     | 275s     | ~11-17 RPM  |

Key Insight: Actual RPM is well below 30 due to:

  • 2.5s safety delay
  • Time for API responses
  • Network latency
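The "Actual Rate" column follows directly from batch size and elapsed time; a hypothetical helper to verify it:

```python
def actual_rpm(n_evals: int, total_seconds: float) -> float:
    """Observed request rate for a completed batch."""
    return n_evals / (total_seconds / 60.0)

print(round(actual_rpm(10, 55), 1))  # 10.9 - slow end of the range
print(round(actual_rpm(10, 35), 1))  # 17.1 - fast end of the range
```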

Implementation Details

RateLimiter Class (llm_client.py)

import asyncio
import time
from collections import deque
from datetime import datetime, timedelta

class RateLimiter:
    """Rate limiter for API calls to respect RPM limits."""
    
    def __init__(self, max_requests_per_minute: int = 30):
        self.max_requests = max_requests_per_minute
        self.request_times = deque()  # Timestamps of recent requests
        self.lock = asyncio.Lock()    # For async callers (not used by acquire_sync)
    
    def acquire_sync(self):
        """Synchronous rate limit check before API call."""
        now = datetime.now()
        
        # Remove requests older than 1 minute
        while self.request_times and (now - self.request_times[0]) > timedelta(minutes=1):
            self.request_times.popleft()
        
        # If at limit, wait until the oldest request ages out of the window
        if len(self.request_times) >= self.max_requests:
            oldest = self.request_times[0]
            wait_time = 60 - (now - oldest).total_seconds()
            if wait_time > 0:
                print(f"[RATE LIMIT] Waiting {wait_time:.2f}s...")
                time.sleep(wait_time)
                return self.acquire_sync()  # Retry with a fresh window
        
        # Record this request
        self.request_times.append(now)
        current_rpm = len(self.request_times)
        print(f"[RATE LIMIT] Current: {current_rpm} requests in last minute (Limit: {self.max_requests} RPM)")

Usage in GroqLLMClient (llm_client.py)

def generate(self, prompt: str, ...) -> str:
    """Generate with rate limiting."""
    # Step 1: Apply rate limiting
    self.rate_limiter.acquire_sync()
    
    # Step 2: Make API call
    response = self.client.chat.completions.create(...)
    
    # Step 3: Add safety delay
    time.sleep(self.rate_limit_delay)
    
    return response.choices[0].message.content

Integration in Evaluation (advanced_rag_evaluator.py)

def _get_gpt_labels(self, question, response, documents):
    """Evaluate with GPT labeling (rate limited)."""
    print("[EVALUATION] Making GPT labeling API call...")
    print("[EVALUATION] This respects the 30 RPM rate limit")
    
    # `prompt` is built from question, response, and documents
    # (construction elided here)
    # This call internally applies rate limiting
    llm_response = self.llm_client.generate(
        prompt=prompt,
        max_tokens=2048,
        temperature=0.0
    )
    
    # Processing continues after rate limiting/delay

Best Practices

For Development

# Use default settings for most cases
settings = Settings()  # Uses 30 RPM limit, 2.5s delay

# Check actual rate being used
print(f"RPM Limit: {settings.groq_rpm_limit}")
print(f"Safety Delay: {settings.rate_limit_delay}")

For Batch Processing

# Process evaluations - rate limiting is automatic
for test_case in test_cases:
    scores = evaluator.evaluate(
        question=test_case["question"],
        response=test_case["response"],
        retrieved_documents=test_case["documents"]
    )
    # No need to add manual delays - handled automatically

For Monitoring

# Check console output for rate limit messages
# [RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
# [RATE LIMIT] Adding safety delay: 2.5s

# If you see "Waiting X.XXs" - system is managing load correctly
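Beyond eyeballing the console, the window counts can be scraped from captured output (a monitoring convenience, not part of the project's code):

```python
import re

def window_counts(log_lines):
    """Pull the 'Current: N requests' counts out of [RATE LIMIT] log lines."""
    pat = re.compile(r"\[RATE LIMIT\] Current: (\d+) requests? in last minute")
    return [int(m.group(1)) for line in log_lines if (m := pat.search(line))]

print(window_counts([
    "[RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)",
    "[RATE LIMIT] Adding safety delay: 2.5s",
]))  # [5]
```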

Avoid These Mistakes

❌ Don't add additional delays:

# NOT NEEDED - rate limiting already applied
result = llm_client.generate(prompt)
time.sleep(5)  # ❌ Don't add this

❌ Don't override settings:

# NOT RECOMMENDED - could exceed RPM limit
groq_rpm_limit = 50  # ❌ Don't change without understanding impact
rate_limit_delay = 0.5  # ❌ Too aggressive

✅ Do let the system handle it:

# ✓ System automatically respects limits
evaluator.evaluate(...)
# Rate limiting is transparent

Troubleshooting

Evaluations Are Very Slow

Symptom: Each evaluation takes 5+ seconds

Cause: Rate limiting is working correctly

  • Each API call: ~1-3s
  • Safety delay: 2.5s
  • Total: 3.5-5.5s per evaluation

Solution: This is expected with the 30 RPM limit. Reduce the delay only if you accept a smaller safety margin:

rate_limit_delay = 1.5  # Faster, but leaves less margin below the limit

"Waiting X.XXs" Messages Appear

Symptom: Frequent waiting messages during batch evaluation

Cause: Approaching or hitting the 30 RPM limit

Solution: Normal behavior - system is protecting the API

  • Wait time decreases as requests age out of 60-second window
  • Continue processing - evaluation will complete after wait

Evaluation Takes Longer Than Expected

Symptom: 50 evaluations taking 5+ minutes

Cause: 30 RPM limit (by design)

  • 50 evals × 5.5s = 275s ≈ 4.6 minutes

Calculation:

50 evaluations ÷ 30 requests/minute = 1.67 minutes minimum
With 2.5s delays: ~4-5 minutes typical
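The same numbers, checked in code (using the ~5.5s per-evaluation estimate from the performance table):

```python
n_evals = 50
floor_minutes = n_evals / 30           # hard floor imposed by the 30 RPM limit
typical_minutes = n_evals * 5.5 / 60   # delay + API time per evaluation

print(round(floor_minutes, 2))    # 1.67
print(round(typical_minutes, 1))  # 4.6
```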

Solution: This is acceptable for compliance. No action needed.

Files Modified

  • ✅ config.py - Updated rate_limit_delay to 2.5s (safety margin)
  • ✅ llm_client.py - Enhanced RateLimiter with logging
  • ✅ llm_client.py - Enhanced generate() with rate limit messaging
  • ✅ advanced_rag_evaluator.py - Added evaluation-level logging

Testing Rate Limiting

Manual Test

from llm_client import RateLimiter
import time

limiter = RateLimiter(max_requests_per_minute=3)  # Set low for testing

# Make 4 rapid requests
for i in range(4):
    print(f"\nRequest {i+1}:")
    limiter.acquire_sync()
    print("Making API call...")
    time.sleep(0.1)

# Output will show waiting message on 4th request

Batch Test

# Run batch evaluation and check logs
# Look for: [RATE LIMIT] messages showing rate compliance
results = evaluator.evaluate_batch(test_cases)
# Should see messages like:
# [RATE LIMIT] Current: 1 requests in last minute
# [RATE LIMIT] Current: 2 requests in last minute
# ... up to 30

Summary

✅ Automatic Compliance: Rate limiting is transparent and automatic
✅ Safety Margin: The 2.5s delay keeps throughput well below the 30 RPM limit
✅ Detailed Logging: Console output shows rate limiting in action
✅ Configurable: Settings can be adjusted if needed
✅ Zero Code Changes: Works with existing evaluation code

With these safeguards in place, the system stays below the 30 RPM limit throughout evaluation.