# Performance Optimization Summary
## Changes Made
### 1. Increased Concurrent Request Capacity

**File:** `services/ai-service/src/ai_med_extract/services/request_queue.py`

- **Max concurrent requests:** increased from 2 to 6
- **Max queue size:** set to 10 requests
- **Queue timeout:** 20 minutes (1200 s)

**Impact:** The service can now handle 6 simultaneous requests instead of 2, reducing queue wait times significantly.
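For intuition, the slot-limiting behaviour described above can be pictured as a semaphore guarding a fixed number of processing slots plus a bounded waiting area. The sketch below is illustrative only, assuming an asyncio-based service; names such as `run_with_slot` are not taken from the actual `request_queue.py`.

```python
# Conceptual sketch of a bounded request queue (not the real implementation).
import asyncio
import os

MAX_CONCURRENT_REQUESTS = int(os.getenv("MAX_CONCURRENT_REQUESTS", "6"))
MAX_QUEUE_SIZE = int(os.getenv("MAX_QUEUE_SIZE", "10"))
QUEUE_TIMEOUT_S = float(os.getenv("QUEUE_TIMEOUT", "1200"))  # 20 minutes

_slots = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)
_waiting = 0


async def run_with_slot(request_id: str, handler):
    """Run `handler` once one of the concurrent slots is free, or reject/timeout."""
    global _waiting
    if _waiting >= MAX_QUEUE_SIZE:
        raise RuntimeError(f"Queue full, rejecting {request_id}")
    _waiting += 1
    try:
        # Wait up to QUEUE_TIMEOUT_S for a slot; raises TimeoutError otherwise.
        await asyncio.wait_for(_slots.acquire(), timeout=QUEUE_TIMEOUT_S)
    finally:
        _waiting -= 1
    try:
        return await handler()
    finally:
        _slots.release()
```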
### 2. Added Comprehensive Detailed Logging

**New files created:**

- `services/ai-service/src/ai_med_extract/utils/detailed_logging.py`
- `services/ai-service/src/ai_med_extract/utils/model_keepalive.py`
- `services/ai-service/src/ai_med_extract/utils/hf_spaces_optimizations.py`
**Logging enhancements:**

**Request queue logging:**

```
ENQUEUE REQUEST: req_12345
  - Job ID: job_67890
  - Priority: NORMAL
  - Current active: 2/6
  - Current queue: 0/10
REQUEST ACCEPTED (immediate): req_12345
  - Active slots: 2/6
  - Will acquire slot immediately
SLOT ACQUIRED: req_12345
  - Wait time: 0.05s
  - Active slots: 3/6
  - Total processed: 42
SLOT RELEASED: req_12345
  - Processing time: 45.3s
  - Active slots: 2/6
  - Queue size: 0/10
```
**Model loading logging:**

```
================================================================================
EAGER MODEL LOADING - Starting primary model preload...
================================================================================
Model Configuration:
  - Name: microsoft/Phi-3-mini-4k-instruct-gguf
  - Type: gguf
  - Loading Mode: EAGER (not lazy)
Loading model into memory...
PRIMARY MODEL LOADED SUCCESSFULLY
  - Model: microsoft/Phi-3-mini-4k-instruct-gguf
  - Load Time: 23.45s
  - Status: Ready for inference
  - Memory Usage: 2048.5 MB
Total eager loading time: 23.45s
================================================================================
```
**Generation logging:**

```
================================================================================
GENERATION STARTED
  - Model: microsoft/Phi-3-mini-4k-instruct-gguf
  - Timestamp: 2025-11-27T15:19:23+05:30
  - Input length: 1250 characters
  - Input tokens (est): ~312
  - Configuration:
      • max_tokens: 8192
      • temperature: 0.7
      • top_p: 0.9
Generating response...
GENERATION COMPLETED
  - Model: microsoft/Phi-3-mini-4k-instruct-gguf
  - Duration: 12.34s
  - Output length: 2500 characters
  - Output tokens (est): ~625
  - Tokens/second: ~50.6
================================================================================
```
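A small helper could emit the start/completion metrics shown above. The following is a minimal sketch, not the contents of `detailed_logging.py`; the estimate of roughly 4 characters per token is an assumption that matches the numbers in the example log.

```python
# Illustrative generation-metrics logger (hypothetical helper, not the real module).
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("generation")


@contextmanager
def log_generation(model_name: str, prompt: str):
    """Log start/end metrics around a model call; caller fills in result['output']."""
    start = time.monotonic()
    logger.info("GENERATION STARTED - Model: %s - Input length: %d characters",
                model_name, len(prompt))
    result = {"output": ""}
    try:
        yield result
    finally:
        duration = time.monotonic() - start
        out_len = len(result["output"])
        est_tokens = out_len / 4  # rough estimate: ~4 characters per token
        logger.info("GENERATION COMPLETED - Duration: %.2fs - Output: %d chars - ~%.1f tokens/s",
                    duration, out_len, est_tokens / duration if duration > 0 else 0.0)
```

A caller would wrap its model invocation, e.g. `with log_generation(model_name, prompt) as gen: gen["output"] = llm(prompt)`.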
### 3. Eager Model Loading (Disabled Lazy Loading)

**File:** `services/ai-service/src/ai_med_extract/utils/hf_spaces_optimizations.py`

**Changes:**

- Models now preload at application startup
- The primary model (GGUF) loads immediately
- No more cold-start delays on the first request

**Before:**

```python
lazy=True   # Model loads on first use
```

**After:**

```python
lazy=False  # EAGER LOADING - preload at startup
```
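Conceptually, eager loading just means invoking the model loader during startup instead of on the first request. A rough sketch of that idea follows, assuming a `load_gguf_model` callable is passed in; the real loader and function names in the service may differ.

```python
# Sketch of eager preloading at startup (illustrative only).
import logging
import os
import time

EAGER = os.getenv("EAGER_MODEL_LOADING", "true").lower() == "true"
PRIMARY_MODEL = os.getenv("MODEL_NAME", "microsoft/Phi-3-mini-4k-instruct-gguf")

_loaded_models = {}


def preload_primary_model(load_gguf_model):
    """Call at application startup, before serving traffic."""
    if not EAGER:
        logging.info("Lazy loading enabled; skipping preload")
        return None
    start = time.monotonic()
    logging.info("EAGER MODEL LOADING - preloading %s", PRIMARY_MODEL)
    _loaded_models[PRIMARY_MODEL] = load_gguf_model(PRIMARY_MODEL)
    logging.info("PRIMARY MODEL LOADED in %.2fs", time.monotonic() - start)
    return _loaded_models[PRIMARY_MODEL]
```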
### 4. Model Keep-Alive Service

**File:** `services/ai-service/src/ai_med_extract/utils/model_keepalive.py`

**Features:**

- Pings loaded models every 5 minutes
- Prevents models from being unloaded during idle periods
- Tracks ping statistics and errors

**Logging:**

```
Model keep-alive service started (interval: 300s)
Keep-alive ping #1 sent to 1 models (errors: 0)
Keep-alive ping #2 sent to 1 models (errors: 0)
```
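The keep-alive service boils down to a daemon thread that touches each loaded model on a fixed interval. The sketch below illustrates that approach and is not the actual `model_keepalive.py`; `ping` stands in for whatever lightweight operation keeps a model resident.

```python
# Minimal keep-alive thread sketch (hypothetical, for illustration).
import logging
import os
import threading
import time

INTERVAL = int(os.getenv("MODEL_KEEPALIVE_INTERVAL", "300"))


def start_keepalive(loaded_models: dict, ping=lambda model: None):
    """Ping every loaded model on a fixed interval from a background daemon thread."""
    stats = {"pings": 0, "errors": 0}

    def _loop():
        while True:
            time.sleep(INTERVAL)
            errors = 0
            for name, model in loaded_models.items():
                try:
                    ping(model)  # e.g. a tiny generation or no-op forward pass
                except Exception:
                    errors += 1
            stats["pings"] += 1
            stats["errors"] += errors
            logging.info("Keep-alive ping #%d sent to %d models (errors: %d)",
                         stats["pings"], len(loaded_models), errors)

    threading.Thread(target=_loop, daemon=True).start()
    return stats
```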
### 5. Environment Configuration

**File:** `services/ai-service/src/ai_med_extract/utils/hf_spaces_optimizations.py`

**New environment variables:**

```
MAX_CONCURRENT_REQUESTS=6
MAX_QUEUE_SIZE=10
EAGER_MODEL_LOADING=true
MODEL_KEEPALIVE=true
MODEL_KEEPALIVE_INTERVAL=300
DETAILED_LOGGING=true
LOG_MODEL_OPERATIONS=true
LOG_GENERATION_METRICS=true
```
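A plausible shape for `configure_hf_spaces_env()` is to apply these values with `os.environ.setdefault`, so variables already set on the Space take precedence. This is a sketch of that idea, not the file's actual contents.

```python
# Sketch of configure_hf_spaces_env(): defaults only, never overrides existing values.
import os


def configure_hf_spaces_env():
    defaults = {
        "MAX_CONCURRENT_REQUESTS": "6",
        "MAX_QUEUE_SIZE": "10",
        "EAGER_MODEL_LOADING": "true",
        "MODEL_KEEPALIVE": "true",
        "MODEL_KEEPALIVE_INTERVAL": "300",
        "DETAILED_LOGGING": "true",
        "LOG_MODEL_OPERATIONS": "true",
        "LOG_GENERATION_METRICS": "true",
    }
    for key, value in defaults.items():
        os.environ.setdefault(key, value)
```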
### 6. New Monitoring Endpoints

**Added endpoints:**

**`/warmup`** - keeps models warm:

```json
{
  "status": "warm",
  "timestamp": "2025-11-27T15:19:23+05:30",
  "models_loaded": 1,
  "primary_model": "microsoft/Phi-3-mini-4k-instruct-gguf",
  "loaded_model_names": ["microsoft/Phi-3-mini-4k-instruct-gguf"]
}
```

**`/model-status`** - checks loaded models:

```json
{
  "loaded_models": [...],
  "total_loaded": 1,
  "timestamp": "2025-11-27T15:19:23+05:30"
}
```

**`/queue-status`** - checks the request queue:

```json
{
  "active_requests": 3,
  "queue_size": 2,
  "max_concurrent": 6,
  "max_queue_size": 10,
  "total_processed": 156,
  "total_rejected": 2,
  "total_timeout": 0
}
```

**`/keepalive-status`** - checks the keep-alive service:

```json
{
  "running": true,
  "interval_seconds": 300,
  "total_pings": 24,
  "total_errors": 0,
  "uptime_minutes": 120
}
```
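These endpoints are plain JSON status routes. As an illustration only (the service's actual framework and wiring may differ), a Flask-style `/queue-status` handler could look like the following, with `queue_stats` standing in for the shared queue's counters:

```python
# Hypothetical Flask-style status endpoint, not the service's real code.
from datetime import datetime, timezone

from flask import Flask, jsonify

app = Flask(__name__)

# Placeholder counters; in the real service these would come from the request queue.
queue_stats = {
    "active_requests": 0,
    "queue_size": 0,
    "max_concurrent": 6,
    "max_queue_size": 10,
    "total_processed": 0,
    "total_rejected": 0,
    "total_timeout": 0,
}


@app.route("/queue-status")
def queue_status():
    return jsonify({**queue_stats,
                    "timestamp": datetime.now(timezone.utc).isoformat()})
```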
## Expected Performance Improvements
| Metric | Before | After | Improvement |
|---|---|---|---|
| First request (cold) | 2-5 min | 30-60 sec | 75% faster |
| Subsequent requests | 30-60 sec | 30-60 sec | Consistent |
| After 15 min idle | 2-5 min | 30-60 sec | 75% faster |
| Concurrent capacity | 2 requests | 6 requests | 3x capacity |
| Queue capacity | 10 requests | 10 requests | Same |
| Consistency | Variable | Consistent | Much better |
## How to Apply

**Quick integration (add to `app.py`):**

```python
# At the top, after imports
from services.ai_service.src.ai_med_extract.utils.hf_spaces_optimizations import (
    configure_hf_spaces_env,
    apply_hf_spaces_optimizations,
)

# Before creating the app
configure_hf_spaces_env()

# After creating the app
app = create_app(initialize=False)
initialize_agents(app, preload_small_models=False)

# ADD THIS LINE:
apply_hf_spaces_optimizations(app)

logging.info("Application initialized successfully")
```
## Monitoring Your Deployment

### 1. Check Logs for Detailed Information

Look for these log patterns:

**Startup:**

```
Configuring HF Spaces environment variables...
HF Spaces environment variables configured:
  - MAX_CONCURRENT_REQUESTS: 6
  - MAX_QUEUE_SIZE: 10
  - EAGER_MODEL_LOADING: true
  - MODEL_KEEPALIVE: true (interval: 300s)
  - DETAILED_LOGGING: true
```
**Model loading:**

```
EAGER MODEL LOADING - Starting primary model preload...
PRIMARY MODEL LOADED SUCCESSFULLY
  - Model: microsoft/Phi-3-mini-4k-instruct-gguf
  - Load Time: 23.45s
```
**Request processing:**

```
ENQUEUE REQUEST: req_12345
REQUEST ACCEPTED (immediate): req_12345
SLOT ACQUIRED: req_12345
  - Wait time: 0.05s
GENERATION COMPLETED
  - Duration: 12.34s
  - Tokens/second: ~50.6
SLOT RELEASED: req_12345
  - Processing time: 45.3s
```
### 2. Use the Monitoring Endpoints

```bash
# Check if models are warm
curl https://your-space.hf.space/warmup

# Check queue status
curl https://your-space.hf.space/queue-status

# Check model status
curl https://your-space.hf.space/model-status

# Check the keep-alive service
curl https://your-space.hf.space/keepalive-status
```
### 3. Set Up External Monitoring

Use UptimeRobot (free tier):

- Monitor: `https://your-space.hf.space/warmup`
- Interval: every 5 minutes
- This keeps your Space warm and prevents cold starts
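If you prefer not to depend on an external service, a small self-hosted pinger achieves the same effect. The script below is a sketch and is not part of the repository; replace the URL with your own Space.

```python
# Self-hosted alternative to UptimeRobot: ping /warmup every 5 minutes.
import time
import urllib.request

WARMUP_URL = "https://your-space.hf.space/warmup"  # replace with your Space URL

while True:
    try:
        with urllib.request.urlopen(WARMUP_URL, timeout=60) as resp:
            print("warmup ping:", resp.status)
    except Exception as exc:
        print("warmup ping failed:", exc)
    time.sleep(300)  # 5 minutes
```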
## Troubleshooting

### Issue: GPU OOM (Out of Memory)

**Symptoms:** errors about CUDA running out of memory.

**Solution:** reduce the number of concurrent requests.

```python
# In hf_spaces_optimizations.py, line 188:
os.environ.setdefault("MAX_CONCURRENT_REQUESTS", "4")  # Reduce from 6 to 4
```
### Issue: Logs too verbose

**Solution:** disable detailed logging.

```python
# In app.py or the environment:
os.environ["DETAILED_LOGGING"] = "false"
```
### Issue: Keep-alive not working

**Check:**

```bash
curl https://your-space.hf.space/keepalive-status
```

**Expected:**

```json
{
  "running": true,
  "total_pings": 24,
  "total_errors": 0
}
```
## Files Modified/Created

**Created:**

- `services/ai-service/src/ai_med_extract/utils/model_keepalive.py`
- `services/ai-service/src/ai_med_extract/utils/hf_spaces_optimizations.py`
- `services/ai-service/src/ai_med_extract/utils/detailed_logging.py`
- `docs/HF_SPACES_PERFORMANCE_GUIDE.md`
- `docs/QUICK_FIX_PERFORMANCE.md`

**Modified:**

- `services/ai-service/src/ai_med_extract/services/request_queue.py`
  - Increased max_concurrent to 6
  - Added detailed logging throughout
## Next Steps

- Integrate the optimizations into `app.py` (see "How to Apply" above)
- Deploy to HF Spaces
- Monitor using the new endpoints
- Set up external monitoring (UptimeRobot)
- Review the logs to confirm everything is working
**Last updated:** 2025-11-27
**Configuration:** 6 concurrent requests, queue size 10, eager loading, keep-alive enabled
**Expected result:** 75% faster cold starts, 3x concurrent capacity, consistent performance