Performance Optimization Summary

Changes Made

✅ 1. Increased Concurrent Request Capacity

File: services/ai-service/src/ai_med_extract/services/request_queue.py

  • Max Concurrent Requests: Increased from 2 → 6
  • Max Queue Size: Set to 10 requests
  • Queue Timeout: 20 minutes (1200s)

Impact: The service can now handle 6 simultaneous requests instead of 2, significantly reducing queue wait times.
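
To make the pattern concrete, here is a minimal sketch of a bounded request queue built on asyncio. The class name, constants, and log lines are illustrative, not the actual request_queue.py code:

import asyncio
import logging

MAX_CONCURRENT_REQUESTS = 6   # raised from 2
MAX_QUEUE_SIZE = 10
QUEUE_TIMEOUT_S = 1200        # 20 minutes

class BoundedRequestQueue:
    """Sketch: cap in-flight requests with a semaphore and a waiting limit."""

    def __init__(self):
        self._slots = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)
        self._waiting = 0

    async def run(self, request_id, handler):
        # Reject outright once the waiting line is full.
        if self._waiting >= MAX_QUEUE_SIZE:
            raise RuntimeError(f"Queue full, rejecting {request_id}")
        self._waiting += 1
        logging.info("📥 ENQUEUE REQUEST: %s", request_id)
        try:
            # Wait for a free slot, but never longer than the queue timeout.
            await asyncio.wait_for(self._slots.acquire(), timeout=QUEUE_TIMEOUT_S)
        finally:
            self._waiting -= 1
        try:
            logging.info("🚀 SLOT ACQUIRED: %s", request_id)
            return await handler()
        finally:
            self._slots.release()
            logging.info("✅ SLOT RELEASED: %s", request_id)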


✅ 2. Added Comprehensive Detailed Logging

New Files Created:

  • services/ai-service/src/ai_med_extract/utils/detailed_logging.py
  • services/ai-service/src/ai_med_extract/utils/model_keepalive.py
  • services/ai-service/src/ai_med_extract/utils/hf_spaces_optimizations.py

Logging Enhancements:

Request Queue Logging:

📥 ENQUEUE REQUEST: req_12345
   - Job ID: job_67890
   - Priority: NORMAL
   - Current active: 2/6
   - Current queue: 0/10

✅ REQUEST ACCEPTED (immediate): req_12345
   - Active slots: 2/6
   - Will acquire slot immediately

🚀 SLOT ACQUIRED: req_12345
   - Wait time: 0.05s
   - Active slots: 3/6
   - Total processed: 42

✅ SLOT RELEASED: req_12345
   - Processing time: 45.3s
   - Active slots: 2/6
   - Queue size: 0/10

Model Loading Logging:

================================================================================
📥 EAGER MODEL LOADING - Starting primary model preload...
================================================================================
🔧 Model Configuration:
   - Name: microsoft/Phi-3-mini-4k-instruct-gguf
   - Type: gguf
   - Loading Mode: EAGER (not lazy)

⏳ Loading model into memory...
✅ PRIMARY MODEL LOADED SUCCESSFULLY
   - Model: microsoft/Phi-3-mini-4k-instruct-gguf
   - Load Time: 23.45s
   - Status: Ready for inference
   - Memory Usage: 2048.5 MB
⏱️  Total eager loading time: 23.45s
================================================================================

Generation Logging:

================================================================================
🚀 GENERATION STARTED
   - Model: microsoft/Phi-3-mini-4k-instruct-gguf
   - Timestamp: 2025-11-27T15:19:23+05:30
   - Input length: 1250 characters
   - Input tokens (est): ~312
   - Configuration:
     • max_tokens: 8192
     • temperature: 0.7
     • top_p: 0.9
⏳ Generating response...

✅ GENERATION COMPLETED
   - Model: microsoft/Phi-3-mini-4k-instruct-gguf
   - Duration: 12.34s
   - Output length: 2500 characters
   - Output tokens (est): ~625
   - Tokens/second: ~50.6
================================================================================
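
The token counts in these logs are estimates derived from character counts rather than a tokenizer; the sample numbers above are consistent with a rough 4-characters-per-token heuristic (1250 chars ≈ 312 tokens, 2500 chars ≈ 625 tokens). A minimal sketch of that arithmetic, with hypothetical helper names:

import time

CHARS_PER_TOKEN = 4  # rough heuristic for English text

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // CHARS_PER_TOKEN)

def timed_generate(generate_fn, prompt: str) -> str:
    """Run a generation callable and report estimated throughput."""
    start = time.perf_counter()
    output = generate_fn(prompt)
    duration = time.perf_counter() - start
    out_tokens = estimate_tokens(output)
    print(f"Duration: {duration:.2f}s, "
          f"output tokens (est): ~{out_tokens}, "
          f"tokens/second: ~{out_tokens / duration:.1f}")
    return output

At 625 estimated output tokens over 12.34s, this prints ~50.6 tokens/second, matching the sample above.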

✅ 3. Eager Model Loading (Disabled Lazy Loading)

File: services/ai-service/src/ai_med_extract/utils/hf_spaces_optimizations.py

Changes:

  • Models now preload at application startup
  • Primary model (GGUF) loads immediately
  • No more cold start delays on first request

Before:

lazy=True  # Model loads on first use

After:

lazy=False  # EAGER LOADING - preload at startup
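
A minimal sketch of what the startup preload looks like, assuming a loader callable that accepts the lazy flag shown above (the function and its wiring are illustrative; the real loader lives inside ai_med_extract):

import logging
import time

def preload_primary_model(load_model):
    """Load the primary GGUF model at startup instead of on first request.

    `load_model` is whatever callable the service uses to materialize a
    model; it is passed in here because this sketch does not import the
    real loader.
    """
    start = time.time()
    logging.info("📥 EAGER MODEL LOADING - Starting primary model preload...")
    model = load_model("microsoft/Phi-3-mini-4k-instruct-gguf", lazy=False)
    logging.info("✅ PRIMARY MODEL LOADED SUCCESSFULLY in %.2fs", time.time() - start)
    return model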

✅ 4. Model Keep-Alive Service

File: services/ai-service/src/ai_med_extract/utils/model_keepalive.py

Features:

  • Pings loaded models every 5 minutes
  • Prevents models from being unloaded during idle periods
  • Tracks ping statistics and errors

Logging:

🚀 Model keep-alive service started (interval: 300s)
✅ Keep-alive ping #1 sent to 1 models (errors: 0)
✅ Keep-alive ping #2 sent to 1 models (errors: 0)
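
A minimal sketch of such a keep-alive loop on a daemon thread; the "ping" here is a placeholder touch of each model object, standing in for whatever lightweight call model_keepalive.py actually makes:

import logging
import threading

def start_keepalive(get_loaded_models, interval_s: int = 300) -> threading.Event:
    """Periodically ping loaded models so they stay resident in memory.

    `get_loaded_models` is assumed to return the currently loaded model
    objects. Returns an Event; set it to stop the loop.
    """
    stop = threading.Event()

    def loop():
        ping = 0
        while not stop.wait(interval_s):  # wake every interval until stopped
            ping += 1
            models = get_loaded_models()
            errors = 0
            for model in models:
                try:
                    repr(model)  # placeholder ping
                except Exception:
                    errors += 1
            logging.info("✅ Keep-alive ping #%d sent to %d models (errors: %d)",
                         ping, len(models), errors)

    threading.Thread(target=loop, daemon=True).start()
    logging.info("🚀 Model keep-alive service started (interval: %ds)", interval_s)
    return stop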

✅ 5. Environment Configuration

File: services/ai-service/src/ai_med_extract/utils/hf_spaces_optimizations.py

New Environment Variables:

MAX_CONCURRENT_REQUESTS=6
MAX_QUEUE_SIZE=10
EAGER_MODEL_LOADING=true
MODEL_KEEPALIVE=true
MODEL_KEEPALIVE_INTERVAL=300
DETAILED_LOGGING=true
LOG_MODEL_OPERATIONS=true
LOG_GENERATION_METRICS=true
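
These defaults can be applied with os.environ.setdefault (the same call the troubleshooting section below uses), so values set explicitly in the Space's settings still take precedence. A minimal sketch of a configuration function along these lines; the body is illustrative, not the exact configure_hf_spaces_env source:

import os

def configure_hf_spaces_env():
    """Apply performance defaults without overriding explicit settings."""
    defaults = {
        "MAX_CONCURRENT_REQUESTS": "6",
        "MAX_QUEUE_SIZE": "10",
        "EAGER_MODEL_LOADING": "true",
        "MODEL_KEEPALIVE": "true",
        "MODEL_KEEPALIVE_INTERVAL": "300",
        "DETAILED_LOGGING": "true",
        "LOG_MODEL_OPERATIONS": "true",
        "LOG_GENERATION_METRICS": "true",
    }
    for key, value in defaults.items():
        os.environ.setdefault(key, value)  # only set if not already present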

✅ 6. New Monitoring Endpoints

Added Endpoints:

  1. /warmup - Keep models warm

    {
      "status": "warm",
      "timestamp": "2025-11-27T15:19:23+05:30",
      "models_loaded": 1,
      "primary_model": "microsoft/Phi-3-mini-4k-instruct-gguf",
      "loaded_model_names": ["microsoft/Phi-3-mini-4k-instruct-gguf"]
    }
    
  2. /model-status - Check loaded models

    {
      "loaded_models": [...],
      "total_loaded": 1,
      "timestamp": "2025-11-27T15:19:23+05:30"
    }
    
  3. /queue-status - Check request queue

    {
      "active_requests": 3,
      "queue_size": 2,
      "max_concurrent": 6,
      "max_queue_size": 10,
      "total_processed": 156,
      "total_rejected": 2,
      "total_timeout": 0
    }
    
  4. /keepalive-status - Check keep-alive service

    {
      "running": true,
      "interval_seconds": 300,
      "total_pings": 24,
      "total_errors": 0,
      "uptime_minutes": 120
    }
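
For illustration, this is roughly how an endpoint such as /queue-status can be exposed. The sketch assumes a Flask-style app and uses a plain dictionary in place of the real queue's counters:

from flask import Flask, jsonify

app = Flask(__name__)

# Illustrative stand-in for the live counters kept by the request queue.
queue_stats = {
    "active_requests": 0,
    "queue_size": 0,
    "max_concurrent": 6,
    "max_queue_size": 10,
    "total_processed": 0,
    "total_rejected": 0,
    "total_timeout": 0,
}

@app.route("/queue-status")
def queue_status():
    return jsonify(queue_stats)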
    

Expected Performance Improvements

| Metric               | Before      | After       | Improvement   |
|----------------------|-------------|-------------|---------------|
| First request (cold) | 2-5 min     | 30-60 sec   | 75% faster    |
| Subsequent requests  | 30-60 sec   | 30-60 sec   | Consistent    |
| After 15 min idle    | 2-5 min     | 30-60 sec   | 75% faster    |
| Concurrent capacity  | 2 requests  | 6 requests  | 3x capacity   |
| Queue capacity       | 10 requests | 10 requests | Same          |
| Consistency          | ❌ Variable | ✅ Consistent | Much better |

How to Apply

Quick Integration (Add to app.py):

# At the top, after imports
from services.ai_service.src.ai_med_extract.utils.hf_spaces_optimizations import (
    configure_hf_spaces_env,
    apply_hf_spaces_optimizations
)

# Before creating the app
configure_hf_spaces_env()

# After creating the app
app = create_app(initialize=False)
initialize_agents(app, preload_small_models=False)

# ADD THIS LINE:
apply_hf_spaces_optimizations(app)

logging.info("Application initialized successfully")

Monitoring Your Deployment

1. Check Logs for Detailed Information

Look for these log patterns:

Startup:

🔧 Configuring HF Spaces environment variables...
✅ HF Spaces environment variables configured:
   - MAX_CONCURRENT_REQUESTS: 6
   - MAX_QUEUE_SIZE: 10
   - EAGER_MODEL_LOADING: true
   - MODEL_KEEPALIVE: true (interval: 300s)
   - DETAILED_LOGGING: true

Model Loading:

📥 EAGER MODEL LOADING - Starting primary model preload...
✅ PRIMARY MODEL LOADED SUCCESSFULLY
   - Model: microsoft/Phi-3-mini-4k-instruct-gguf
   - Load Time: 23.45s

Request Processing:

📥 ENQUEUE REQUEST: req_12345
✅ REQUEST ACCEPTED (immediate): req_12345
🚀 SLOT ACQUIRED: req_12345
   - Wait time: 0.05s
✅ GENERATION COMPLETED
   - Duration: 12.34s
   - Tokens/second: ~50.6
✅ SLOT RELEASED: req_12345
   - Processing time: 45.3s

2. Use Monitoring Endpoints

# Check if models are warm
curl https://your-space.hf.space/warmup

# Check queue status
curl https://your-space.hf.space/queue-status

# Check model status
curl https://your-space.hf.space/model-status

# Check keep-alive service
curl https://your-space.hf.space/keepalive-status

3. Set Up External Monitoring

Use UptimeRobot (free tier):

  • Monitor: https://your-space.hf.space/warmup
  • Interval: Every 5 minutes
  • This keeps your space warm and prevents cold starts
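
If you would rather run the pinger yourself, a minimal self-hosted alternative (assumes the requests package; replace the placeholder URL with your Space's URL):

import time

import requests

SPACE_URL = "https://your-space.hf.space/warmup"  # placeholder, replace

while True:
    try:
        resp = requests.get(SPACE_URL, timeout=30)
        print(f"warmup -> {resp.status_code}")
    except requests.RequestException as exc:
        print(f"warmup failed: {exc}")
    time.sleep(300)  # every 5 minutes, matching the UptimeRobot interval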

Troubleshooting

Issue: GPU OOM (Out of Memory)

Symptoms: Errors about CUDA out of memory

Solution: Reduce concurrent requests

# In hf_spaces_optimizations.py, line 188:
os.environ.setdefault("MAX_CONCURRENT_REQUESTS", "4")  # Reduce from 6 to 4

Issue: Logs too verbose

Solution: Disable detailed logging

# In app.py or environment:
os.environ["DETAILED_LOGGING"] = "false"

Issue: Keep-alive not working

Check:

curl https://your-space.hf.space/keepalive-status

Expected:

{
  "running": true,
  "total_pings": 24,
  "total_errors": 0
}

Files Modified/Created

Created:

  1. ✅ services/ai-service/src/ai_med_extract/utils/model_keepalive.py
  2. ✅ services/ai-service/src/ai_med_extract/utils/hf_spaces_optimizations.py
  3. ✅ services/ai-service/src/ai_med_extract/utils/detailed_logging.py
  4. ✅ docs/HF_SPACES_PERFORMANCE_GUIDE.md
  5. ✅ docs/QUICK_FIX_PERFORMANCE.md

Modified:

  1. ✅ services/ai-service/src/ai_med_extract/services/request_queue.py
    • Increased max_concurrent to 6
    • Added detailed logging throughout

Next Steps

  1. Integrate the optimizations into app.py (see "How to Apply" above)
  2. Deploy to HF Spaces
  3. Monitor using the new endpoints
  4. Set up external monitoring (UptimeRobot)
  5. Review logs to ensure everything is working

Last Updated: 2025-11-27
Configuration: 6 concurrent requests, 10 queue size, eager loading, keep-alive enabled
Expected Result: 75% faster, 3x capacity, consistent performance