# RPM Rate Limiting Implementation - 30 RPM Compliance

## Overview

The RAG evaluation system now includes comprehensive rate limiting to ensure strict compliance with the Groq API's **30 requests per minute (RPM)** limit during evaluation.

## What Was Implemented

### 1. **Enhanced Rate Limiter**

- Tracks requests within 1-minute windows
- Automatically waits when approaching/reaching the limit
- Provides detailed logging of the current request rate
- Safe recursive retry after the waiting period

### 2. **Safety Margin Configuration**

```python
# config.py
groq_rpm_limit: int = 30       # API limit
rate_limit_delay: float = 2.5  # Safety delay (increased from 2.0)
```

**Why 2.5 seconds?**

- 30 RPM = 2.0 seconds minimum between requests
- 2.5 seconds = ~24 actual RPM (20% safety margin below the limit)
- Prevents accidental RPM violations due to network delays

### 3. **Dual-Layer Rate Limiting**

#### Layer 1: Request Tracking (RateLimiter.acquire_sync)

- Tracks request timestamps in a 60-second window
- Waits when 30 requests have already been made in the last 60 seconds
- Logs the current rate: "Current: X requests in last minute (Limit: 30 RPM)"

#### Layer 2: Safety Delay (time.sleep)

- 2.5-second delay after each successful API call
- Ensures that even under load, we stay well below 30 RPM
- Configurable via the `rate_limit_delay` setting

## How It Works

### Single Evaluation Flow

```
1. User starts evaluation
   ↓
2. advanced_rag_evaluator.evaluate()
   ↓
3. _get_gpt_labels() is called
   ├─ [EVALUATION] Making GPT labeling API call...
   ├─ [EVALUATION] This respects the 30 RPM rate limit
   ↓
4. llm_client.generate()
   ├─ [RATE LIMIT] Applying rate limiting (RPM: 30, delay: 2.5s)
   ├─ Calls rate_limiter.acquire_sync()
   │   ├─ [RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
   │   └─ If at limit: [RATE LIMIT] Waiting X.XXs before next request...
   │
   ├─ Makes API call to Groq
   ├─ [LLM RESPONSE] {...}
   ├─ [RATE LIMIT] Adding safety delay: 2.5s
   ├─ time.sleep(2.5)
   ↓
5. Evaluation continues...
```

### Batch Evaluation Flow (Multiple Evaluations)

```
Eval 1:   0s     [API call] + 2.5s wait
Eval 2:   2.5s   [API call] + 2.5s wait
Eval 3:   5.0s   [API call] + 2.5s wait
...
Eval 12:  27.5s  [API call] + 2.5s wait
Eval 13:  30s    [API call] + 2.5s wait

Result: ~24 evaluations per 60 seconds = ~24 RPM
(Well below the 30 RPM limit; API response time lowers the actual rate further)
```

## Configuration Options

### In config.py

```python
class Settings(BaseSettings):
    # Rate Limiting
    # 30 RPM = 2 seconds minimum between requests to stay under the limit
    groq_rpm_limit: int = 30       # API limit (required)
    rate_limit_delay: float = 2.5  # Safety delay in seconds
```

### Adjusting the Settings

**To be more aggressive (higher risk):**

```python
groq_rpm_limit: int = 30
rate_limit_delay: float = 2.0  # Closer to the mathematical minimum
# Result: ~30 actual RPM (risky, no safety margin)
```

**To be more conservative (lower risk):**

```python
groq_rpm_limit: int = 30
rate_limit_delay: float = 3.0  # More safety margin
# Result: ~20 actual RPM (very safe, more time)
```

**To use environment variables:**

```bash
# .env file
GROQ_RPM_LIMIT=30
RATE_LIMIT_DELAY=2.5
```

## Rate Limiting in Action

### Console Output Example

```
[EVALUATION] Making GPT labeling API call...
[EVALUATION] This respects the 30 RPM rate limit
[RATE LIMIT] Applying rate limiting (RPM limit: 30, delay: 2.5s)
[RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
[RATE LIMIT] Adding safety delay: 2.5s
[LLM RESPONSE] {"relevance_explanation": "...", ...}
[Waits 2.5 seconds]
[EVALUATION] Evaluation complete
```

### When Limit Is Reached

```
[EVALUATION] Eval 29 starting...
[RATE LIMIT] Applying rate limiting (RPM limit: 30, delay: 2.5s)
[RATE LIMIT] Current: 30 requests in last minute (Limit: 30 RPM)
[RATE LIMIT] At 30 RPM limit. Waiting 45.32s before next request...
[System waits ~45 seconds for oldest request to age out]
[RATE LIMIT] Current: 1 requests in last minute (Limit: 30 RPM)
[API call made]
[RATE LIMIT] Adding safety delay: 2.5s
```

## Performance Impact

### Time Per Evaluation

| Phase | Duration | Notes |
|-------|----------|-------|
| Rate limit check | < 1ms | Checking request history |
| API call | 1-3s | Network + Groq processing |
| Safety delay | 2.5s | Consistent across all calls |
| **Total per eval** | **~3.5-5.5s** | Includes API response time |

### Batch Processing Times

| Num Evals | Min Time | Max Time | Actual Rate |
|-----------|----------|----------|-------------|
| 10 | 35s | 55s | ~11-17 RPM |
| 20 | 70s | 110s | ~11-17 RPM |
| 30 | 105s | 165s | ~11-17 RPM |
| 50 | 175s | 275s | ~11-17 RPM |

**Key Insight:** Actual RPM is well below 30 due to:

- The 2.5s safety delay
- Time for API responses
- Network latency

## Implementation Details

### RateLimiter Class (llm_client.py)

```python
import asyncio
import time
from collections import deque
from datetime import datetime, timedelta


class RateLimiter:
    """Rate limiter for API calls to respect RPM limits."""

    def __init__(self, max_requests_per_minute: int = 30):
        self.max_requests = max_requests_per_minute
        self.request_times = deque()  # Tracks request times
        self.lock = asyncio.Lock()

    def acquire_sync(self):
        """Synchronous rate limit check before API call."""
        now = datetime.now()

        # Remove requests older than 1 minute
        while self.request_times and (now - self.request_times[0]) > timedelta(minutes=1):
            self.request_times.popleft()

        # If at limit, wait for the oldest request to age out, then retry
        if len(self.request_times) >= self.max_requests:
            oldest = self.request_times[0]
            wait_time = 60 - (now - oldest).total_seconds()
            if wait_time > 0:
                print(f"[RATE LIMIT] At {self.max_requests} RPM limit. "
                      f"Waiting {wait_time:.2f}s before next request...")
                time.sleep(wait_time)
            return self.acquire_sync()  # Retry

        # Record this request
        self.request_times.append(now)
        current_rpm = len(self.request_times)
        print(f"[RATE LIMIT] Current: {current_rpm} requests in last minute "
              f"(Limit: {self.max_requests} RPM)")
```

### Usage in GroqLLMClient (llm_client.py)

```python
def generate(self, prompt: str, ...) -> str:
    """Generate with rate limiting."""
    # Step 1: Apply rate limiting
    self.rate_limiter.acquire_sync()

    # Step 2: Make API call
    response = self.client.chat.completions.create(...)

    # Step 3: Add safety delay
    time.sleep(self.rate_limit_delay)

    return response.choices[0].message.content
```

### Integration in Evaluation (advanced_rag_evaluator.py)

```python
def _get_gpt_labels(self, question, response, documents):
    """Evaluate with GPT labeling (rate limited)."""
    print("[EVALUATION] Making GPT labeling API call...")
    print("[EVALUATION] This respects the 30 RPM rate limit")

    # This call internally uses rate limiting
    llm_response = self.llm_client.generate(
        prompt=prompt,
        max_tokens=2048,
        temperature=0.0
    )
    # Processing continues after rate limiting/delay
```

## Best Practices

### For Development

```python
# Use default settings for most cases
settings = Settings()  # Uses 30 RPM limit, 2.5s delay

# Check the actual rate being used
print(f"RPM Limit: {settings.groq_rpm_limit}")
print(f"Safety Delay: {settings.rate_limit_delay}")
```

### For Batch Processing

```python
# Process evaluations - rate limiting is automatic
for test_case in test_cases:
    scores = evaluator.evaluate(
        question=test_case["question"],
        response=test_case["response"],
        retrieved_documents=test_case["documents"]
    )
    # No need to add manual delays - handled automatically
```

### For Monitoring

```python
# Check console output for rate limit messages:
# [RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
# [RATE LIMIT] Adding safety delay: 2.5s
# If you see "Waiting X.XXs" - the system is managing load correctly
```

### Avoid These Mistakes

❌ **Don't add additional delays:**

```python
# NOT NEEDED - rate limiting already applied
result = llm_client.generate(prompt)
time.sleep(5)  # ❌ Don't add this
```

❌ **Don't override settings:**

```python
# NOT RECOMMENDED - could exceed the RPM limit
groq_rpm_limit = 50     # ❌ Don't change without understanding the impact
rate_limit_delay = 0.5  # ❌ Too aggressive
```

✅ **Do let the system handle it:**

```python
# ✓ System automatically respects limits
evaluator.evaluate(...)  # Rate limiting is transparent
```

## Troubleshooting

### Evaluations Are Very Slow

**Symptom:** Each evaluation takes 5+ seconds

**Cause:** Rate limiting is working correctly

- Each API call: ~1-3s
- Safety delay: 2.5s
- Total: 3.5-5.5s per evaluation

**Solution:** This is expected with the 30 RPM limit. Reduce the delay only if needed:

```python
rate_limit_delay = 1.5  # Slightly faster (but a smaller safety margin)
```

### "Waiting X.XXs" Messages Appear

**Symptom:** Frequent waiting messages during batch evaluation

**Cause:** Approaching or hitting the 30 RPM limit

**Solution:** Normal behavior - the system is protecting the API

- Wait time decreases as requests age out of the 60-second window
- Continue processing - evaluation will complete after the wait

### Evaluation Takes Longer Than Expected

**Symptom:** 50 evaluations taking 5+ minutes

**Cause:** 30 RPM limit (by design)

- 50 evals × 5.5s = 275s ≈ 4.6 minutes

**Calculation:**

```
50 evaluations ÷ 30 requests/minute = 1.67 minutes minimum
With 2.5s delays: ~4-5 minutes typical
```

**Solution:** This is acceptable for compliance. No action needed.
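The timing arithmetic above can be sanity-checked with a short script. This is a sketch; `batch_duration_estimate` is a hypothetical helper written for illustration, not part of the evaluation code:

```python
def batch_duration_estimate(num_evals: int,
                            rpm_limit: int = 30,
                            safety_delay: float = 2.5,
                            api_time: tuple = (1.0, 3.0)) -> tuple:
    """Estimate batch wall-clock time in seconds.

    Returns (rpm_floor, typical_min, typical_max):
    - rpm_floor: hard minimum imposed by the RPM limit alone
    - typical_min/max: per-call API response time plus the safety delay
    """
    rpm_floor = num_evals / rpm_limit * 60
    typical_min = num_evals * (api_time[0] + safety_delay)
    typical_max = num_evals * (api_time[1] + safety_delay)
    return rpm_floor, typical_min, typical_max

floor, lo, hi = batch_duration_estimate(50)
print(f"50 evals: >= {floor:.0f}s by RPM limit alone, typically {lo:.0f}-{hi:.0f}s")
```

For 50 evaluations this reproduces the numbers in the troubleshooting section: a 100-second floor from the RPM limit (50 ÷ 30 × 60), and 175-275 seconds once the 2.5s delay and 1-3s API responses are included.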
## Files Modified

- ✅ **config.py** - Updated rate_limit_delay to 2.5s (safety margin)
- ✅ **llm_client.py** - Enhanced RateLimiter with logging
- ✅ **llm_client.py** - Enhanced generate() with rate limit messaging
- ✅ **advanced_rag_evaluator.py** - Added evaluation-level logging

## Testing Rate Limiting

### Manual Test

```python
from llm_client import RateLimiter
import time

limiter = RateLimiter(max_requests_per_minute=3)  # Set low for testing

# Make 4 rapid requests
for i in range(4):
    print(f"\nRequest {i+1}:")
    limiter.acquire_sync()
    print("Making API call...")
    time.sleep(0.1)

# Output will show a waiting message on the 4th request
```

### Batch Test

```python
# Run a batch evaluation and check the logs
# Look for: [RATE LIMIT] messages showing rate compliance
results = evaluator.evaluate_batch(test_cases)

# Should see messages like:
# [RATE LIMIT] Current: 1 requests in last minute
# [RATE LIMIT] Current: 2 requests in last minute
# ... up to 30
```

## Summary

✅ **Automatic Compliance:** Rate limiting is transparent and automatic
✅ **Safety Margin:** The 2.5s delay keeps throughput well below the 30 RPM limit
✅ **Detailed Logging:** Console output shows rate limiting in action
✅ **Configurable:** Settings can be adjusted if needed
✅ **Zero Code Changes:** Works with existing evaluation code

The system is designed so that evaluation never exceeds the 30 RPM limit.
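As a complement to the manual test above (which really sleeps), the sliding-window logic can also be unit-tested deterministically by injecting a fake clock. The sketch below mirrors the `RateLimiter` behavior described earlier; the `clock` and `sleeper` parameters are test hooks of this standalone sketch, not part of the real `llm_client` API:

```python
import time
from collections import deque

class WindowedLimiterSketch:
    """Sliding 60-second window limiter (standalone sketch for testing)."""

    def __init__(self, max_requests_per_minute=30,
                 clock=time.monotonic, sleeper=time.sleep):
        self.max_requests = max_requests_per_minute
        self.request_times = deque()
        self.clock = clock      # injectable for deterministic tests
        self.sleeper = sleeper  # injectable for deterministic tests

    def acquire_sync(self):
        while True:
            now = self.clock()
            # Evict timestamps that have aged out of the 60s window
            while self.request_times and now - self.request_times[0] >= 60:
                self.request_times.popleft()
            if len(self.request_times) < self.max_requests:
                self.request_times.append(now)
                return
            # At the limit: sleep until the oldest request ages out
            self.sleeper(60 - (now - self.request_times[0]))

# Deterministic test with a fake clock: the 4th call must wait 60s
fake_now = [0.0]
waits = []
def fake_clock():
    return fake_now[0]
def fake_sleep(seconds):
    waits.append(seconds)
    fake_now[0] += seconds

limiter = WindowedLimiterSketch(max_requests_per_minute=3,
                                clock=fake_clock, sleeper=fake_sleep)
for _ in range(4):
    limiter.acquire_sync()
print(waits)  # → [60.0]
```

Injecting the clock keeps the test instant and repeatable, which matters when the real wait can be up to a minute per window.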