RPM Rate Limiting - Quick Summary
Implementation Complete ✅
The RAG evaluation system now has comprehensive rate limiting to ensure strict compliance with the 30 RPM (requests per minute) limit when using the Groq API.
What Was Changed
1. Configuration (config.py)
# Rate Limiting
groq_rpm_limit: int = 30 # API limit
rate_limit_delay: float = 2.5 # Safety margin (was 2.0)
Why increase to 2.5 seconds?
- 30 RPM = 2.0s mathematical minimum
- 2.5s = ~24 actual RPM (20% safety margin)
- Prevents accidental violations from network delays
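The delay-to-RPM arithmetic above can be checked with a quick sketch (the delay values are the ones from config.py; the helper name is illustrative):

```python
def effective_rpm(delay_seconds: float) -> float:
    """Requests per minute when requests are spaced delay_seconds apart."""
    return 60.0 / delay_seconds

assert effective_rpm(2.0) == 30.0  # mathematical minimum spacing for 30 RPM
assert effective_rpm(2.5) == 24.0  # ~24 RPM, a 20% margin below the limit
```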
2. Enhanced Rate Limiter (llm_client.py)
- Improved logging: [RATE LIMIT] messages track requests in a rolling 60-second window
- Automatically waits when approaching limit
- Shows current rate: "Current: 5 requests in last minute (Limit: 30 RPM)"
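The full RateLimiter in llm_client.py is not reproduced here; a minimal rolling-window sketch of the same idea (class name and internals are illustrative, only acquire_sync comes from the source) looks like this:

```python
import time
from collections import deque


class RollingWindowRateLimiter:
    """Tracks request timestamps in a rolling 60-second window."""

    def __init__(self, rpm_limit: int = 30):
        self.rpm_limit = rpm_limit
        self.timestamps: deque[float] = deque()

    def acquire_sync(self) -> None:
        now = time.monotonic()
        # Drop timestamps that have aged out of the 60-second window.
        while self.timestamps and now - self.timestamps[0] >= 60.0:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.rpm_limit:
            # Window is full: wait until the oldest request ages out.
            wait = 60.0 - (now - self.timestamps[0])
            print(f"[RATE LIMIT] At {self.rpm_limit} RPM limit. "
                  f"Waiting {wait:.2f}s before next request...")
            time.sleep(wait)
        self.timestamps.append(time.monotonic())
```

Because the window rolls, the limiter never blocks while fewer than `rpm_limit` requests fall inside the last 60 seconds.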
3. Enhanced API Call Handler (llm_client.py)
def generate(self, prompt, ...):
    # Before the API call: check the rate limit
    self.rate_limiter.acquire_sync()

    # Make the API call
    response = self.client.chat.completions.create(...)

    # After the API call: add the safety delay
    time.sleep(self.rate_limit_delay)  # 2.5 seconds
4. Evaluation Logging (advanced_rag_evaluator.py)
Added messages to evaluation process:
[EVALUATION] Making GPT labeling API call...
[EVALUATION] This respects the 30 RPM rate limit
How It Works
Single Evaluation Timeline
User starts evaluation
    ↓
[EVALUATION] Making GPT labeling API call...
[EVALUATION] This respects the 30 RPM rate limit
    ↓
[RATE LIMIT] Applying rate limiting (RPM limit: 30, delay: 2.5s)
[RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
    ↓
[API call to Groq] (1-3 seconds)
    ↓
[LLM RESPONSE] {...parsed JSON...}
    ↓
[RATE LIMIT] Adding safety delay: 2.5s
    ↓
[Wait 2.5 seconds]
    ↓
Evaluation continues
Batch Evaluation (50 evaluations)
| Evaluation | Time | Notes |
|---|---|---|
| Eval 1-12 | 0-66s | Sequential: 5.5s each |
| Eval 13-24 | 66-132s | Continues: 5.5s each |
| Eval 25-36 | 132-198s | Continues: 5.5s each |
| Eval 37-50 | 198-275s | Continues: 5.5s each |
Result: 50 evaluations in ~275 seconds = ~11 RPM (well below the 30 RPM limit)
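The batch figures in the table follow from simple arithmetic (5.5s per evaluation is the worst-case total from the timing breakdown; the helper is illustrative):

```python
def batch_estimate(n_evals: int, seconds_per_eval: float = 5.5):
    """Total wall-clock time and effective RPM for a sequential batch."""
    total_seconds = n_evals * seconds_per_eval
    rpm = n_evals / (total_seconds / 60.0)
    return total_seconds, rpm

total, rpm = batch_estimate(50)
assert total == 275.0          # ~275 seconds for 50 evaluations
assert round(rpm, 1) == 10.9   # well below the 30 RPM limit
```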
Rate Limiting in Action
Console Output Example
[EVALUATION] Making GPT labeling API call...
[EVALUATION] This respects the 30 RPM rate limit
[RATE LIMIT] Applying rate limiting (RPM limit: 30, delay: 2.5s)
[RATE LIMIT] Current: 5 requests in last minute (Limit: 30 RPM)
[API processes...]
[LLM RESPONSE] {
"relevance_explanation": "...",
"overall_supported": true,
...
}
[RATE LIMIT] Adding safety delay: 2.5s
[waits 2.5 seconds...]
When Limit Is Reached
[RATE LIMIT] Current: 30 requests in last minute (Limit: 30 RPM)
[RATE LIMIT] At 30 RPM limit. Waiting 45.32s before next request...
[System waits 45 seconds...]
[RATE LIMIT] Current: 2 requests in last minute (Limit: 30 RPM)
[Evaluation continues...]
Time Per Evaluation
| Component | Duration | Notes |
|---|---|---|
| Rate limit check | < 1ms | Negligible |
| API call | 1-3s | Network + Groq processing |
| Safety delay | 2.5s | Configured safety margin |
| Total | ~3.5-5.5s | Per evaluation |
Key Point: This is by design. Rate limiting adds ~2.5s per evaluation to stay compliant.
Usage (No Changes Needed!)
Single Evaluation
scores, llm_info = evaluator.evaluate(
question="What is AI?",
response="AI is...",
retrieved_documents=[...]
)
# Rate limiting happens automatically
Batch Evaluation
for test_case in test_cases:
scores = evaluator.evaluate(
question=test_case["question"],
response=test_case["response"],
retrieved_documents=test_case["documents"]
)
# Rate limiting happens automatically
# No manual delays needed!
Verification
Check Rate Limiting is Active
Run evaluation and look for:
✓ [RATE LIMIT] messages in console
✓ [EVALUATION] messages before API calls
✓ Consistent 2.5s delays between evaluations
✓ Actual RPM well below 30
Monitor Current Rate
Watch console during evaluation:
[RATE LIMIT] Current: 1 requests in last minute
[RATE LIMIT] Current: 2 requests in last minute
[RATE LIMIT] Current: 3 requests in last minute
... up to 30
If it reaches 30, the system automatically waits for the oldest request to age out of the window.
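When the window is full, the wait time is simply the time remaining before the oldest tracked request leaves the 60-second window. A sketch of that calculation (the function name and timestamp values are hypothetical):

```python
def wait_for_oldest(oldest_ts: float, now: float, window: float = 60.0) -> float:
    """Seconds until the oldest request ages out of the rolling window."""
    return max(0.0, window - (now - oldest_ts))

# If the oldest of 30 tracked requests was made 14.68s ago,
# the limiter waits the remaining 45.32s of its window.
assert round(wait_for_oldest(oldest_ts=100.0, now=114.68), 2) == 45.32
```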
Configuration Options
Default (Recommended)
groq_rpm_limit: int = 30 # 30 RPM limit
rate_limit_delay: float = 2.5 # ~24 actual RPM
More Aggressive (Higher Risk)
groq_rpm_limit: int = 30 # 30 RPM limit
rate_limit_delay: float = 2.0 # ~30 actual RPM (no safety margin!)
More Conservative (Lower Risk)
groq_rpm_limit: int = 30 # 30 RPM limit
rate_limit_delay: float = 3.0 # ~20 actual RPM (very safe)
Troubleshooting
Q: Why are evaluations slow?
A: By design. Rate limiting adds ~2.5s per evaluation for compliance.
- Each eval: 3.5-5.5 seconds total
- 50 evals: 175-275 seconds (3-5 minutes)
Q: Why do I see "Waiting X.XXs" messages?
A: System is protecting the API by waiting for rate limit to reset.
- This is normal behavior
- Continue processing - evaluation will complete
Q: Can I disable rate limiting?
A: Not recommended, but you can adjust:
rate_limit_delay: float = 1.0 # Faster, but the 30 RPM window check will force waits
Q: Does this affect other API calls?
A: No, only Groq LLM calls:
- Embedding models: Not affected
- ChromaDB operations: Not affected
- Only GPT labeling evaluation: Rate limited
Files Modified
✅ config.py
- rate_limit_delay: 2.0 → 2.5 seconds
✅ llm_client.py
- Enhanced RateLimiter with logging
- Enhanced generate() with rate limit messages
- Added current RPM tracking
✅ advanced_rag_evaluator.py
- Added evaluation-level logging
- Documents rate limiting behavior
✅ docs/RPM_RATE_LIMITING.md (new)
- Comprehensive documentation
- Implementation details
- Troubleshooting guide
Summary
✅ Automatic: Rate limiting is transparent and automatic
✅ Safe: 20% safety margin below the 30 RPM limit
✅ Logged: Detailed console messages show what's happening
✅ Compliant: Never exceeds the 30 RPM limit
✅ No Code Changes: Works with existing evaluation code
The system is now fully compliant with the 30 RPM Groq API limit.