RPM Rate Limiting - Quick Summary

Implementation Complete ✅

The RAG evaluation system now rate-limits every Groq API call so that it stays within the 30 RPM (requests per minute) limit.


What Was Changed

1. Configuration (config.py)

# Rate Limiting
groq_rpm_limit: int = 30                    # API limit
rate_limit_delay: float = 2.5               # Safety margin (was 2.0)

Why increase to 2.5 seconds?

  • 30 RPM = one request every 2.0s at minimum
  • A 2.5s delay alone caps throughput at 24 RPM (20% safety margin); API latency lowers the actual rate further
  • Prevents accidental violations from network delays
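The arithmetic behind these figures can be checked with a short sketch. The function names `max_rpm_for_delay` and `safety_margin` are illustrative and not part of the project; the calculation considers only the fixed delay, so it gives a ceiling rather than the measured rate.

```python
def max_rpm_for_delay(delay_seconds: float) -> float:
    """Upper bound on requests per minute when every request is followed by a fixed delay."""
    if delay_seconds <= 0:
        raise ValueError("delay_seconds must be positive")
    return 60.0 / delay_seconds


def safety_margin(delay_seconds: float, rpm_limit: int = 30) -> float:
    """Fraction by which the delay-imposed ceiling sits below the API limit."""
    return 1.0 - max_rpm_for_delay(delay_seconds) / rpm_limit


print(max_rpm_for_delay(2.0))        # 30.0
print(max_rpm_for_delay(2.5))        # 24.0
print(round(safety_margin(2.5), 2))  # 0.2
```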

2. Enhanced Rate Limiter (llm_client.py)

  • Improved logging: [RATE LIMIT] messages
  • Tracks requests in rolling 60-second window
  • Automatically waits when approaching limit
  • Shows current rate: "Current: 5 requests in last minute (Limit: 30 RPM)"
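A rolling-window limiter of the kind described above might look roughly like the following. This is a minimal sketch, not the project's actual `RateLimiter`: the class name `SlidingWindowLimiter`, its parameters, and the injectable `clock` and `sleep` callables are all assumptions made for illustration and testability.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most max_requests calls in any rolling window of window_seconds."""

    def __init__(
        self,
        max_requests: int = 30,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._sleep = sleep
        self._timestamps: Deque[float] = deque()

    def _evict_expired(self, now: float) -> None:
        # Drop timestamps that have aged out of the rolling window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()

    def acquire(self) -> float:
        """Block until a request slot is free; return the seconds spent waiting."""
        waited = 0.0
        now = self._clock()
        self._evict_expired(now)
        while len(self._timestamps) >= self.max_requests:
            # Wait until the oldest recorded request leaves the window.
            pause = self.window_seconds - (now - self._timestamps[0])
            self._sleep(pause)
            waited += pause
            now = self._clock()
            self._evict_expired(now)
        self._timestamps.append(now)
        return waited
```

In this sketch, `acquire()` would be called immediately before each API request; it returns `0.0` whenever a slot is already free.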

3. Enhanced API Call Handler (llm_client.py)

def generate(self, prompt, ...):
    # Before API call: Check rate limit
    self.rate_limiter.acquire_sync()
    
    # Make API call
    response = self.client.chat.completions.create(...)
    
    # After API call: Add safety delay
    time.sleep(self.rate_limit_delay)  # 2.5 seconds

4. Evaluation Logging (advanced_rag_evaluator.py)

Added messages to evaluation process:

[EVALUATION] Making GPT labeling API call...
[EVALUATION] This respects the 30 RPM rate limit

How It Works

Single Evaluation Timeline

User starts evaluation
    ↓
[EVALUATION] Making GPT labeling API call...
[EVALUATION] This respects the 30 RPM rate limit
    ↓
[RATE LIMIT] Applying rate limiting (RPM limit: 30, delay: 2.5s)
[RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
    ↓
[API Call to Groq] (1-3 seconds)
    ↓
[LLM RESPONSE] {...parsed JSON...}
    ↓
[RATE LIMIT] Adding safety delay: 2.5s
    ↓
[Wait 2.5 seconds]
    ↓
Evaluation continues

Batch Evaluation (50 evaluations)

| Evaluation | Time | Notes |
|---|---|---|
| Eval 1-12 | 0-66s | Sequential: 5.5s each |
| Eval 13-24 | 66-132s | Continues: 5.5s each |
| Eval 25-36 | 132-198s | Continues: 5.5s each |
| Eval 37-50 | 198-275s | Continues: 5.5s each |

Result: at the slowest case of 5.5s per evaluation, 50 evaluations take ~275 seconds = ~11 RPM (well below the 30 RPM limit)
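The batch figures follow from simple arithmetic, sketched below. The helper `estimate_batch` is hypothetical; it assumes strictly sequential evaluations, each consisting of one API call followed by the configured delay.

```python
from typing import Tuple


def estimate_batch(n_evals: int, api_latency_s: float, delay_s: float = 2.5) -> Tuple[float, float]:
    """Return (total_seconds, effective_rpm) for n_evals sequential evaluations."""
    per_eval = api_latency_s + delay_s
    total = n_evals * per_eval
    return total, 60.0 * n_evals / total


worst = estimate_batch(50, api_latency_s=3.0)  # slowest API calls (3s each)
best = estimate_batch(50, api_latency_s=1.0)   # fastest API calls (1s each)
print(f"worst case: {worst[0]:.0f}s, {worst[1]:.1f} RPM")  # worst case: 275s, 10.9 RPM
print(f"best case:  {best[0]:.0f}s, {best[1]:.1f} RPM")    # best case:  175s, 17.1 RPM
```

Even in the fastest case the effective rate stays around 17 RPM, so the rolling-window wait should rarely trigger during a purely sequential batch.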


Rate Limiting in Action

Console Output Example

[EVALUATION] Making GPT labeling API call...
[EVALUATION] This respects the 30 RPM rate limit
[RATE LIMIT] Applying rate limiting (RPM limit: 30, delay: 2.5s)
[RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)

[API processes...]

[LLM RESPONSE] {
  "relevance_explanation": "...",
  "overall_supported": true,
  ...
}

[RATE LIMIT] Adding safety delay: 2.5s
[waits 2.5 seconds...]

When Limit Is Reached

[RATE LIMIT] Current: 30 requests in last minute (Limit: 30 RPM)
[RATE LIMIT] At 30 RPM limit. Waiting 45.32s before next request...
[System waits 45 seconds...]
[RATE LIMIT] Current: 2 requests in last minute (Limit: 30 RPM)
[Evaluation continues...]

Time Per Evaluation

| Component | Duration | Notes |
|---|---|---|
| Rate limit check | < 1ms | Negligible |
| API call | 1-3s | Network + Groq processing |
| Safety delay | 2.5s | Configured safety margin |
| Total | ~3.5-5.5s | Per evaluation |

Key Point: This is by design. Rate limiting adds ~2.5s per evaluation to stay compliant.


Usage (No Changes Needed!)

Single Evaluation

scores, llm_info = evaluator.evaluate(
    question="What is AI?",
    response="AI is...",
    retrieved_documents=[...]
)
# Rate limiting happens automatically

Batch Evaluation

for test_case in test_cases:
    scores, llm_info = evaluator.evaluate(
        question=test_case["question"],
        response=test_case["response"],
        retrieved_documents=test_case["documents"]
    )
    # Rate limiting happens automatically
    # No manual delays needed!

Verification

Check Rate Limiting is Active

Run evaluation and look for:

✓ [RATE LIMIT] messages in console
✓ [EVALUATION] messages before API calls
✓ Consistent 2.5s delays between evaluations
✓ Actual RPM well below 30
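One way to confirm the actual RPM is to record a timestamp for each request and compute the busiest 60-second window. The helper below is a sketch under that assumption; `peak_requests_per_window` is not part of the project, and the sample timestamps are synthetic.

```python
from typing import Sequence


def peak_requests_per_window(timestamps: Sequence[float], window_s: float = 60.0) -> int:
    """Largest number of requests that fall inside any rolling window of window_s seconds."""
    ordered = sorted(timestamps)
    peak = 0
    start = 0
    for end, t in enumerate(ordered):
        # Advance the window start until it lies within window_s of the current request.
        while t - ordered[start] >= window_s:
            start += 1
        peak = max(peak, end - start + 1)
    return peak


# Synthetic request times: one call every 5.5 seconds, as in the batch timeline.
times = [i * 5.5 for i in range(50)]
print(peak_requests_per_window(times))  # 11
```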

Monitor Current Rate

Watch console during evaluation:

[RATE LIMIT] Current: 1 requests in last minute
[RATE LIMIT] Current: 2 requests in last minute
[RATE LIMIT] Current: 3 requests in last minute
... up to 30

If the count reaches 30, the system automatically waits for the oldest request to age out of the 60-second window.


Configuration Options

Default (Recommended)

groq_rpm_limit: int = 30       # 30 RPM limit
rate_limit_delay: float = 2.5  # ~24 actual RPM

More Aggressive (Higher Risk)

groq_rpm_limit: int = 30       # 30 RPM limit
rate_limit_delay: float = 2.0  # ~30 actual RPM (no safety margin!)

More Conservative (Lower Risk)

groq_rpm_limit: int = 30       # 30 RPM limit
rate_limit_delay: float = 3.0  # ~20 actual RPM (very safe)

Troubleshooting

Q: Why are evaluations slow?

A: By design. Rate limiting adds ~2.5s per evaluation for compliance.

  • Each eval: 3.5-5.5 seconds total
  • 50 evals: 175-275 seconds (3-5 minutes)

Q: Why do I see "Waiting X.XXs" messages?

A: The system is waiting for the rate limit window to reset before sending the next request.

  • This is normal behavior
  • No action is needed; the evaluation resumes and completes on its own

Q: Can I disable rate limiting?

A: Not recommended, but you can shorten the safety delay:

rate_limit_delay: float = 1.0  # Faster (but riskier)

Q: Does this affect other API calls?

A: No, only Groq LLM calls:

  • Embedding models: Not affected
  • ChromaDB operations: Not affected
  • Only GPT labeling evaluation: Rate limited

Files Modified

✅ config.py

  • rate_limit_delay: 2.0 → 2.5 seconds

✅ llm_client.py

  • Enhanced RateLimiter with logging
  • Enhanced generate() with rate limit messages
  • Added current RPM tracking

✅ advanced_rag_evaluator.py

  • Added evaluation-level logging
  • Documents rate limiting behavior

✅ docs/RPM_RATE_LIMITING.md (New)

  • Comprehensive documentation
  • Implementation details
  • Troubleshooting guide

Summary

✅ Automatic: Rate limiting is transparent and automatic
✅ Safe: 20% safety margin below 30 RPM limit
✅ Logged: Detailed console messages show what's happening
✅ Compliant: Never exceeds 30 RPM limit
✅ No Code Changes: Works with existing evaluation code

The system is now fully compliant with the 30 RPM Groq API limit.