RPM Rate Limiting - Quick Summary
Implementation Complete ✅
The RAG evaluation system now has comprehensive rate limiting to ensure strict compliance with the 30 RPM (requests per minute) limit when using the Groq API.
What Was Changed
1. Configuration (config.py)
# Rate Limiting
groq_rpm_limit: int = 30 # API limit
rate_limit_delay: float = 2.5 # Safety margin (was 2.0)
Why increase to 2.5 seconds?
- 30 RPM = 2.0s mathematical minimum
- 2.5s = ~24 actual RPM (20% safety margin)
- Prevents accidental violations from network delays
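The delay-to-RPM arithmetic above can be checked with a quick sketch (the delay values are the ones from config.py; the helper name is illustrative):

```python
def effective_rpm(delay_seconds: float) -> float:
    """Requests per minute when requests are spaced delay_seconds apart."""
    return 60.0 / delay_seconds

assert effective_rpm(2.0) == 30.0  # mathematical minimum spacing for 30 RPM
assert effective_rpm(2.5) == 24.0  # ~24 RPM, a 20% margin below the limit
```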
2. Enhanced Rate Limiter (llm_client.py)
- Improved logging: [RATE LIMIT] messages track requests in a rolling 60-second window
- Automatically waits when approaching limit
- Shows current rate: "Current: 5 requests in last minute (Limit: 30 RPM)"
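The full RateLimiter in llm_client.py is not reproduced here; a minimal rolling-window sketch of the same idea (class name and internals are illustrative, only acquire_sync comes from the source) looks like this:

```python
import time
from collections import deque


class RollingWindowRateLimiter:
    """Tracks request timestamps in a rolling 60-second window."""

    def __init__(self, rpm_limit: int = 30):
        self.rpm_limit = rpm_limit
        self.timestamps: deque[float] = deque()

    def acquire_sync(self) -> None:
        now = time.monotonic()
        # Drop timestamps that have aged out of the 60-second window.
        while self.timestamps and now - self.timestamps[0] >= 60.0:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.rpm_limit:
            # Window is full: wait until the oldest request ages out.
            wait = 60.0 - (now - self.timestamps[0])
            print(f"[RATE LIMIT] At {self.rpm_limit} RPM limit. "
                  f"Waiting {wait:.2f}s before next request...")
            time.sleep(wait)
        self.timestamps.append(time.monotonic())
```

Because the window rolls, the limiter never blocks while fewer than `rpm_limit` requests fall inside the last 60 seconds.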
3. Enhanced API Call Handler (llm_client.py)
def generate(self, prompt, ...):
    # Before the API call: check the rate limit
    self.rate_limiter.acquire_sync()

    # Make the API call
    response = self.client.chat.completions.create(...)

    # After the API call: add the safety delay
    time.sleep(self.rate_limit_delay)  # 2.5 seconds
4. Evaluation Logging (advanced_rag_evaluator.py)
Added messages to evaluation process:
[EVALUATION] Making GPT labeling API call...
[EVALUATION] This respects the 30 RPM rate limit
How It Works
Single Evaluation Timeline
User starts evaluation
    ↓
[EVALUATION] Making GPT labeling API call...
[EVALUATION] This respects the 30 RPM rate limit
    ↓
[RATE LIMIT] Applying rate limiting (RPM limit: 30, delay: 2.5s)
[RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
    ↓
[API call to Groq] (1-3 seconds)
    ↓
[LLM RESPONSE] {...parsed JSON...}
    ↓
[RATE LIMIT] Adding safety delay: 2.5s
    ↓
[Wait 2.5 seconds]
    ↓
Evaluation continues
Batch Evaluation (50 evaluations)
| Evaluation | Time | Notes |
|---|---|---|
| Eval 1-12 | 0-66s | Sequential: 5.5s each |
| Eval 13-24 | 66-132s | Continues: 5.5s each |
| Eval 25-36 | 132-198s | Continues: 5.5s each |
| Eval 37-50 | 198-275s | Continues: 5.5s each |
Result: 50 evaluations in ~275 seconds = ~11 RPM (well below the 30 RPM limit)
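The batch figures in the table follow from simple arithmetic (5.5s per evaluation is the worst-case total from the timing breakdown; the helper is illustrative):

```python
def batch_estimate(n_evals: int, seconds_per_eval: float = 5.5):
    """Total wall-clock time and effective RPM for a sequential batch."""
    total_seconds = n_evals * seconds_per_eval
    rpm = n_evals / (total_seconds / 60.0)
    return total_seconds, rpm

total, rpm = batch_estimate(50)
assert total == 275.0          # ~275 seconds for 50 evaluations
assert round(rpm, 1) == 10.9   # well below the 30 RPM limit
```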
Rate Limiting in Action
Console Output Example
[EVALUATION] Making GPT labeling API call...
[EVALUATION] This respects the 30 RPM rate limit
[RATE LIMIT] Applying rate limiting (RPM limit: 30, delay: 2.5s)
[RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
[API processes...]
[LLM RESPONSE] {
"relevance_explanation": "...",
"overall_supported": true,
...
}
[RATE LIMIT] Adding safety delay: 2.5s
[waits 2.5 seconds...]
When Limit Is Reached
[RATE LIMIT] Current: 30 requests in last minute (Limit: 30 RPM)
[RATE LIMIT] At 30 RPM limit. Waiting 45.32s before next request...
[System waits 45 seconds...]
[RATE LIMIT] Current: 2 requests in last minute (Limit: 30 RPM)
[Evaluation continues...]
Time Per Evaluation
| Component | Duration | Notes |
|---|---|---|
| Rate limit check | < 1ms | Negligible |
| API call | 1-3s | Network + Groq processing |
| Safety delay | 2.5s | Configured safety margin |
| Total | ~3.5-5.5s | Per evaluation |
Key Point: This is by design. Rate limiting adds ~2.5s per evaluation to stay compliant.
Usage (No Changes Needed!)
Single Evaluation
scores, llm_info = evaluator.evaluate(
question="What is AI?",
response="AI is...",
retrieved_documents=[...]
)
# Rate limiting happens automatically
Batch Evaluation
for test_case in test_cases:
scores = evaluator.evaluate(
question=test_case["question"],
response=test_case["response"],
retrieved_documents=test_case["documents"]
)
# Rate limiting happens automatically
# No manual delays needed!
Verification
Check Rate Limiting is Active
Run evaluation and look for:
✓ [RATE LIMIT] messages in console
✓ [EVALUATION] messages before API calls
✓ Consistent 2.5s delays between evaluations
✓ Actual RPM well below 30
Monitor Current Rate
Watch console during evaluation:
[RATE LIMIT] Current: 1 requests in last minute
[RATE LIMIT] Current: 2 requests in last minute
[RATE LIMIT] Current: 3 requests in last minute
... up to 30
If it reaches 30, the system automatically waits for the oldest request to age out of the window.
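When the window is full, the wait time is simply the time remaining before the oldest tracked request leaves the 60-second window. A sketch of that calculation (the function name and timestamp values are hypothetical):

```python
def wait_for_oldest(oldest_ts: float, now: float, window: float = 60.0) -> float:
    """Seconds until the oldest request ages out of the rolling window."""
    return max(0.0, window - (now - oldest_ts))

# If the oldest of 30 tracked requests was made 14.68s ago,
# the limiter waits the remaining 45.32s of its window.
assert round(wait_for_oldest(oldest_ts=100.0, now=114.68), 2) == 45.32
```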
Configuration Options
Default (Recommended)
groq_rpm_limit: int = 30 # 30 RPM limit
rate_limit_delay: float = 2.5 # ~24 actual RPM
More Aggressive (Higher Risk)
groq_rpm_limit: int = 30 # 30 RPM limit
rate_limit_delay: float = 2.0 # ~30 actual RPM (no safety margin!)
More Conservative (Lower Risk)
groq_rpm_limit: int = 30 # 30 RPM limit
rate_limit_delay: float = 3.0 # ~20 actual RPM (very safe)
Troubleshooting
Q: Why are evaluations slow?
A: By design. Rate limiting adds ~2.5s per evaluation for compliance.
- Each eval: 3.5-5.5 seconds total
- 50 evals: 175-275 seconds (3-5 minutes)
Q: Why do I see "Waiting X.XXs" messages?
A: System is protecting the API by waiting for rate limit to reset.
- This is normal behavior
- Continue processing - evaluation will complete
Q: Can I disable rate limiting?
A: Not recommended, but you can adjust:
rate_limit_delay: float = 1.0 # Faster, but the 30 RPM window check will force waits
Q: Does this affect other API calls?
A: No, only Groq LLM calls:
- Embedding models: Not affected
- ChromaDB operations: Not affected
- Only GPT labeling evaluation: Rate limited
Files Modified
✅ config.py
- rate_limit_delay: 2.0 → 2.5 seconds
✅ llm_client.py
- Enhanced RateLimiter with logging
- Enhanced generate() with rate limit messages
- Added current RPM tracking
✅ advanced_rag_evaluator.py
- Added evaluation-level logging
- Documents rate limiting behavior
✅ docs/RPM_RATE_LIMITING.md (new)
- Comprehensive documentation
- Implementation details
- Troubleshooting guide
Summary
✅ Automatic: Rate limiting is transparent and automatic
✅ Safe: 20% safety margin below the 30 RPM limit
✅ Logged: Detailed console messages show what's happening
✅ Compliant: Never exceeds the 30 RPM limit
✅ No Code Changes: Works with existing evaluation code
The system is now fully compliant with the 30 RPM Groq API limit.