# RPM Rate Limiting Implementation - 30 RPM Compliance
## Overview
The RAG evaluation system now includes comprehensive rate limiting to ensure strict compliance with the Groq API's **30 requests per minute (RPM)** limit during evaluation.
## What Was Implemented
### 1. **Enhanced Rate Limiter**
- Tracks requests within 1-minute windows
- Automatically waits when approaching/reaching limit
- Provides detailed logging of current request rate
- Safe recursive retry after waiting period
### 2. **Safety Margin Configuration**
```python
# config.py
groq_rpm_limit: int = 30 # API limit
rate_limit_delay: float = 2.5 # Safety delay (increased from 2.0)
```
**Why 2.5 seconds?**
- 30 RPM = 2.0 seconds minimum between requests
- 2.5 seconds = ~24 actual RPM (20% safety margin below limit)
- Prevents accidental RPM violations due to network delays
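The margin can be verified with simple arithmetic. A standalone sketch (the `effective_rpm` helper is illustrative, not part of the project):

```python
# Effective request rate implied by a fixed inter-request delay.
def effective_rpm(delay_seconds: float) -> float:
    return 60.0 / delay_seconds

print(effective_rpm(2.0))  # 30.0 -> exactly at the limit, no margin
print(effective_rpm(2.5))  # 24.0 -> 20% below the 30 RPM limit
```

This is why 2.5 s was chosen over the mathematical minimum of 2.0 s: it leaves headroom for clock skew and network jitter.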
### 3. **Dual-Layer Rate Limiting**
#### Layer 1: Request Tracking (RateLimiter.acquire_sync)
- Tracks request timestamps in 60-second window
- Waits when 30 requests already made in last 60 seconds
- Logs current rate: "Current: X requests in last minute (Limit: 30 RPM)"
#### Layer 2: Safety Delay (time.sleep)
- 2.5 second delay after each successful API call
- Ensures even under load, we stay well below 30 RPM
- Configurable via `rate_limit_delay` setting
## How It Works
### Single Evaluation Flow
```
1. User starts evaluation
   ↓
2. advanced_rag_evaluator.evaluate()
   ↓
3. _get_gpt_labels() is called
   ├─ [EVALUATION] Making GPT labeling API call...
   ├─ [EVALUATION] This respects the 30 RPM rate limit
   ↓
4. llm_client.generate()
   ├─ [RATE LIMIT] Applying rate limiting (RPM: 30, delay: 2.5s)
   ├─ Calls rate_limiter.acquire_sync()
   │  ├─ [RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
   │  └─ If at limit: [RATE LIMIT] Waiting X.XXs before next request...
   │
   ├─ Makes API call to Groq
   ├─ [LLM RESPONSE] {...}
   ├─ [RATE LIMIT] Adding safety delay: 2.5s
   ├─ time.sleep(2.5)
   ↓
5. Evaluation continues...
```
### Batch Evaluation Flow (Multiple Evaluations)
```
Eval 1:   0.0s  [API call ~1-3s] + 2.5s delay
Eval 2:  ~4.5s  [API call] + 2.5s delay
Eval 3:  ~9.0s  [API call] + 2.5s delay
...
Eval 13: ~54.0s [API call] + 2.5s delay

Result: roughly 12-13 evaluations per 60 seconds (~12-13 RPM once
API response time is included; the 2.5s delay alone caps the rate
at 24 RPM). Well below the 30 RPM limit with safety margin.
```
## Configuration Options
### In config.py
```python
class Settings(BaseSettings):
    # Rate Limiting
    # 30 RPM = 2 seconds minimum between requests to stay under the limit
    groq_rpm_limit: int = 30        # API limit (required)
    rate_limit_delay: float = 2.5   # Safety delay in seconds
```
### Adjusting the Settings
**To be more aggressive (higher risk):**
```python
groq_rpm_limit: int = 30
rate_limit_delay: float = 2.0 # Closer to mathematical minimum
# Result: ~30 actual RPM (risky, no safety margin)
```
**To be more conservative (lower risk):**
```python
groq_rpm_limit: int = 30
rate_limit_delay: float = 3.0 # More safety margin
# Result: ~20 actual RPM (very safe, but slower)
```
**To use environment variables:**
```bash
# .env file
GROQ_RPM_LIMIT=30
RATE_LIMIT_DELAY=2.5
```
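Assuming the project's `Settings` extends pydantic's `BaseSettings` (as shown above), these variables map onto the fields by name automatically. The equivalent behavior can be sketched with plain `os.getenv` (the `load_rate_limit_settings` helper below is illustrative, not project code):

```python
import os

# Minimal env-override sketch; pydantic BaseSettings performs this
# mapping (including type coercion) automatically.
def load_rate_limit_settings() -> dict:
    return {
        "groq_rpm_limit": int(os.getenv("GROQ_RPM_LIMIT", "30")),
        "rate_limit_delay": float(os.getenv("RATE_LIMIT_DELAY", "2.5")),
    }

os.environ["RATE_LIMIT_DELAY"] = "3.0"  # simulate a .env override
print(load_rate_limit_settings())
```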
## Rate Limiting in Action
### Console Output Example
```
[EVALUATION] Making GPT labeling API call...
[EVALUATION] This respects the 30 RPM rate limit
[RATE LIMIT] Applying rate limiting (RPM limit: 30, delay: 2.5s)
[RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
[RATE LIMIT] Adding safety delay: 2.5s
[LLM RESPONSE] {"relevance_explanation": "...", ...}
[Waits 2.5 seconds]
[EVALUATION] Evaluation complete
```
### When Limit Is Reached
```
[EVALUATION] Eval 29 starting...
[RATE LIMIT] Applying rate limiting (RPM limit: 30, delay: 2.5s)
[RATE LIMIT] Current: 30 requests in last minute (Limit: 30 RPM)
[RATE LIMIT] At 30 RPM limit. Waiting 45.32s before next request...
[System waits ~45 seconds for oldest request to age out]
[RATE LIMIT] Current: 1 requests in last minute (Limit: 30 RPM)
[API call made]
[RATE LIMIT] Adding safety delay: 2.5s
```
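The 45.32s figure above is illustrative; the wait time is simply 60 seconds minus the age of the oldest request still in the window. A quick sketch (the `wait_time` helper is hypothetical, mirroring the logic in `acquire_sync`):

```python
# Wait time when the window is full: the oldest tracked request
# must age past 60 seconds before a new one is allowed.
def wait_time(oldest_age_seconds: float) -> float:
    return max(0.0, 60.0 - oldest_age_seconds)

print(f"[RATE LIMIT] Waiting {wait_time(14.68):.2f}s...")  # 45.32s
print(wait_time(61.0))  # 0.0 -> oldest already aged out, no wait
```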
## Performance Impact
### Time Per Evaluation
| Phase | Duration | Notes |
|-------|----------|-------|
| Rate limit check | < 1ms | Checking request history |
| API call | 1-3s | Network + Groq processing |
| Safety delay | 2.5s | Consistent across all calls |
| **Total per eval** | **~3.5-5.5s** | Includes API response time |
### Batch Processing Times
| Num Evals | Min Time | Max Time | Actual Rate |
|-----------|----------|----------|-------------|
| 10 | 35s | 55s | ~12-17 RPM |
| 20 | 70s | 110s | ~11-17 RPM |
| 30 | 105s | 165s | ~11-17 RPM |
| 50 | 175s | 275s | ~11-17 RPM |
**Key Insight:** Actual RPM is well below 30 due to:
- 2.5s safety delay
- Time for API responses
- Network latency
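The table above can be reproduced with a quick estimate. The API time range is an assumption based on typical observed latency, not a guarantee:

```python
# Estimate batch duration: each evaluation = API call time + safety delay.
SAFETY_DELAY = 2.5           # seconds (rate_limit_delay)
API_TIME_RANGE = (1.0, 3.0)  # seconds per call, assumed typical latency

def batch_duration(num_evals: int) -> tuple[float, float]:
    lo = num_evals * (API_TIME_RANGE[0] + SAFETY_DELAY)
    hi = num_evals * (API_TIME_RANGE[1] + SAFETY_DELAY)
    return lo, hi

print(batch_duration(10))  # (35.0, 55.0) -> matches the table's first row
print(batch_duration(50))  # (175.0, 275.0)
```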
## Implementation Details
### RateLimiter Class (llm_client.py)
```python
import asyncio
import time
from collections import deque
from datetime import datetime, timedelta

class RateLimiter:
    """Rate limiter for API calls to respect RPM limits."""

    def __init__(self, max_requests_per_minute: int = 30):
        self.max_requests = max_requests_per_minute
        self.request_times = deque()  # Timestamps of recent requests
        self.lock = asyncio.Lock()

    def acquire_sync(self):
        """Synchronous rate limit check before an API call."""
        now = datetime.now()
        # Remove requests older than 1 minute from the window
        while self.request_times and (now - self.request_times[0]) > timedelta(minutes=1):
            self.request_times.popleft()
        # If at the limit, wait until the oldest request ages out
        if len(self.request_times) >= self.max_requests:
            oldest = self.request_times[0]
            wait_time = 60 - (now - oldest).total_seconds()
            if wait_time > 0:
                print(f"[RATE LIMIT] Waiting {wait_time:.2f}s...")
                time.sleep(wait_time)
            return self.acquire_sync()  # Retry with a fresh window
        # Record this request
        self.request_times.append(now)
        current_rpm = len(self.request_times)
        print(f"[RATE LIMIT] Current: {current_rpm} requests in last minute")
```
### Usage in GroqLLMClient (llm_client.py)
```python
def generate(self, prompt: str, ...) -> str:
    """Generate with rate limiting."""
    # Step 1: Apply rate limiting (waits if the 60s window is full)
    self.rate_limiter.acquire_sync()
    # Step 2: Make the API call
    response = self.client.chat.completions.create(...)
    # Step 3: Add the safety delay
    time.sleep(self.rate_limit_delay)
    return response.choices[0].message.content
```
### Integration in Evaluation (advanced_rag_evaluator.py)
```python
def _get_gpt_labels(self, question, response, documents):
    """Evaluate with GPT labeling (rate limited)."""
    print("[EVALUATION] Making GPT labeling API call...")
    print("[EVALUATION] This respects the 30 RPM rate limit")
    # This call internally applies rate limiting
    llm_response = self.llm_client.generate(
        prompt=prompt,
        max_tokens=2048,
        temperature=0.0,
    )
    # Processing continues after rate limiting/delay
```
## Best Practices
### For Development
```python
# Use default settings for most cases
settings = Settings() # Uses 30 RPM limit, 2.5s delay
# Check actual rate being used
print(f"RPM Limit: {settings.groq_rpm_limit}")
print(f"Safety Delay: {settings.rate_limit_delay}")
```
### For Batch Processing
```python
# Process evaluations - rate limiting is automatic
for test_case in test_cases:
    scores = evaluator.evaluate(
        question=test_case["question"],
        response=test_case["response"],
        retrieved_documents=test_case["documents"],
    )
    # No need to add manual delays - handled automatically
```
### For Monitoring
```python
# Check console output for rate limit messages
# [RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
# [RATE LIMIT] Adding safety delay: 2.5s
# If you see "Waiting X.XXs" - system is managing load correctly
```
### Avoid These Mistakes
❌ **Don't add additional delays:**
```python
# NOT NEEDED - rate limiting already applied
result = llm_client.generate(prompt)
time.sleep(5) # ❌ Don't add this
```
❌ **Don't override settings:**
```python
# NOT RECOMMENDED - could exceed RPM limit
groq_rpm_limit = 50 # ❌ Don't change without understanding impact
rate_limit_delay = 0.5 # ❌ Too aggressive
```
✅ **Do let the system handle it:**
```python
# ✓ System automatically respects limits
evaluator.evaluate(...)
# Rate limiting is transparent
```
## Troubleshooting
### Evaluations Are Very Slow
**Symptom:** Each evaluation takes 5+ seconds
**Cause:** Rate limiting is working correctly
- Each API call: ~1-3s
- Safety delay: 2.5s
- Total: 3.5-5.5s per evaluation
**Solution:** This is expected with the 30 RPM limit. Reduce the delay only if you accept a thinner safety margin:
```python
rate_limit_delay = 1.5  # Slightly faster, but less safety margin
```
### "Waiting X.XXs" Messages Appear
**Symptom:** Frequent waiting messages during batch evaluation
**Cause:** Approaching or hitting the 30 RPM limit
**Solution:** Normal behavior - system is protecting the API
- Wait time decreases as requests age out of 60-second window
- Continue processing - evaluation will complete after wait
### Evaluation Takes Longer Than Expected
**Symptom:** 50 evaluations taking 5+ minutes
**Cause:** 30 RPM limit (by design)
- 50 evals × 5.5s = 275s ≈ 4.6 minutes
**Calculation:**
```
50 evaluations ÷ 30 requests/minute = 1.67 minutes minimum
With 2.5s delays: ~4-5 minutes typical
```
**Solution:** This is acceptable for compliance. No action needed.
## Files Modified
- ✅ **config.py** - Updated rate_limit_delay to 2.5s (safety margin)
- ✅ **llm_client.py** - Enhanced RateLimiter with logging
- ✅ **llm_client.py** - Enhanced generate() with rate limit messaging
- ✅ **advanced_rag_evaluator.py** - Added evaluation-level logging
## Testing Rate Limiting
### Manual Test
```python
import time
from llm_client import RateLimiter

limiter = RateLimiter(max_requests_per_minute=3)  # Set low for testing

# Make 4 rapid requests
for i in range(4):
    print(f"\nRequest {i+1}:")
    limiter.acquire_sync()
    print("Making API call...")
    time.sleep(0.1)

# Output will show the waiting message on the 4th request
```
### Batch Test
```python
# Run batch evaluation and check logs
# Look for: [RATE LIMIT] messages showing rate compliance
results = evaluator.evaluate_batch(test_cases)
# Should see messages like:
# [RATE LIMIT] Current: 1 requests in last minute
# [RATE LIMIT] Current: 2 requests in last minute
# ... up to 30
```
## Summary
✅ **Automatic Compliance:** Rate limiting is transparent and automatic
✅ **Safety Margin:** 2.5s delay keeps the rate well below the 30 RPM limit
✅ **Detailed Logging:** Console shows rate limiting in action
✅ **Configurable:** Settings can be adjusted if needed
✅ **Zero Code Changes:** Works with existing evaluation code
The system will never exceed the 30 RPM limit during evaluation.