---
title: Lightweight AI Backend
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.26.0
python_version: '3.10'
app_file: app.py
pinned: false
---
# Lightweight Multi-Model AI Backend for Hugging Face Spaces

**Production-Ready for the FREE CPU Tier**
## Quick Start

This is a complete, production-ready Hugging Face Gradio Space optimized for the FREE CPU tier. It requires NO GPU and includes four AI capabilities:

- ✅ General chat (powered by TinyLlama)
- ✅ Code generation (powered by TinyLlama)
- ✅ Text summarization (powered by FLAN-T5-Small)
- ✅ Text-to-image generation (lightweight procedural rendering)
## Features

### Optimization for the CPU Tier

- **Lazy Loading**: Models are loaded only when first needed
- **Memory Efficient**: ~1.5-2GB total RAM usage
- **Fast Responses**: Token limits keep requests under ~10s
- **Queue System**: Handles concurrent requests safely
- **Float32 Precision**: Optimized for CPU computation
### Model Selection

| Feature | Model | Size | Speed |
|---|---|---|---|
| Chat | TinyLlama-1.1B-Chat-v1.0 | 1.1B params | ⚡ Very fast |
| Code | TinyLlama-1.1B-Chat-v1.0 | 1.1B params | ⚡ Very fast |
| Summarization | google/flan-t5-small | ~80M params | ⚡⚡ Fastest |
| Image Gen | Procedural rendering | N/A | ⚡⚡⚡ Instant |
### API Endpoints

All endpoints are exposed through Gradio and are accessible programmatically.

**1. `/generate_chat` - General Chat**

```json
{
  "prompt": "Hello, how are you?",
  "max_tokens": 150,
  "temperature": 0.7
}
```

Returns: the generated chat response.

**2. `/generate_code` - Code Generation**

```json
{
  "prompt": "Write a function to reverse a string",
  "max_tokens": 256,
  "temperature": 0.3
}
```

Returns: the generated Python code.

**3. `/summarize_text` - Text Summarization**

```json
{
  "text": "Long article text here...",
  "max_length": 100
}
```

Returns: the summarized text.

**4. `/generate_image` - Text-to-Image**

```json
{
  "prompt": "A red sunset over mountains",
  "width": 256,
  "height": 256
}
```

Returns: the generated PIL Image.
## Configuration & Tuning

### Model Parameters

**Chat Generation:**

- `max_tokens`: 50-200 (default: 150)
- `temperature`: 0.1-1.0 (default: 0.7)
- `top_p`: fixed at 0.9 for quality

**Code Generation:**

- `max_tokens`: 100-300 (default: 256)
- `temperature`: 0.1-1.0 (default: 0.3; lower for more deterministic code)

**Summarization:**

- `max_length`: 20-150 (default: 100)
- `min_length`: fixed at 20

**Image Generation:**

- `width`: 128-256 (default: 256)
- `height`: 128-256 (default: 256)
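These ranges are worth enforcing server-side before any inference call, so malformed client requests can never exceed the CPU budget. A minimal sketch (the `clamp` and `sanitize_chat_params` helpers are illustrative, not the exact functions in `app.py`; the bounds mirror the chat defaults above):

```python
def clamp(value, low, high):
    """Clamp a numeric parameter into its allowed range."""
    return max(low, min(high, value))

def sanitize_chat_params(max_tokens=150, temperature=0.7):
    """Coerce user-supplied chat generation parameters into the ranges above."""
    return {
        "max_new_tokens": int(clamp(max_tokens, 50, 200)),
        "temperature": clamp(temperature, 0.1, 1.0),
        "top_p": 0.9,  # fixed for quality
    }
```

For example, `sanitize_chat_params(max_tokens=999)` silently caps the request at 200 tokens instead of letting it run long on the free CPU.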
### Memory Optimization

The application includes several memory-saving techniques:

```python
# 1. Lazy loading: models are only loaded when first called
model_manager.load_chat_model()  # called only if needed

# 2. Garbage collection after each inference
gc.collect()

# 3. CPU optimization: limit threading overhead
torch.set_num_threads(4)

# 4. Hard token caps for stability
max_tokens = min(max_tokens, 200)

# 5. Input truncation to prevent OOM on summarization
if len(text) > 1000:
    text = text[:1000]
```
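The lazy-loading technique above can be sketched without the heavy dependencies. This `ModelManager` is an illustrative stand-in for the one in `app.py` (a placeholder string takes the place of a real transformers model object):

```python
class ModelManager:
    """Load each model at most once, on first use (illustrative sketch)."""

    def __init__(self):
        self._models = {}   # feature name -> loaded model
        self.load_count = 0

    def _load(self, name, loader):
        # Pay the load cost (time and RAM) only the first time a feature is used.
        if name not in self._models:
            self._models[name] = loader()
            self.load_count += 1
        return self._models[name]

    def load_chat_model(self):
        # In app.py this loader would build the real TinyLlama pipeline;
        # here a placeholder string stands in for the model object.
        return self._load("chat", lambda: "TinyLlama-1.1B-Chat-v1.0")

manager = ModelManager()
manager.load_chat_model()
manager.load_chat_model()  # the second call reuses the cached model
```

This is why idle memory stays low: a Space that only ever serves summarization requests never pays TinyLlama's ~1.2GB footprint.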
## Performance Benchmarks (Rough Estimates)
On a 2-core CPU with 4GB RAM:
| Operation | Time | RAM Used |
|---|---|---|
| Load TinyLlama | 8-12s | 1.2GB |
| Chat Response (50 tokens) | 3-5s | 1.2GB |
| Load FLAN-T5 | 4-6s | 0.5GB |
| Summarize (100 words) | 2-3s | 0.5GB |
| Generate Image (256x256) | <1s | <100MB |
Total idle memory: ~1.5GB. Max concurrent memory: ~2GB.
## Deployment to Hugging Face Spaces

### Step 1: Create a New Space

- Go to [huggingface.co/spaces](https://huggingface.co/spaces)
- Click "Create new Space"
- Select "Gradio" as the SDK
- Choose "Public" or "Private"

### Step 2: Upload Files

Upload these files to your Space:

- `app.py`
- `requirements.txt`
- `.gitignore` (optional)
### Step 3: Configure Space Settings

- **Docker**: Leave as default (builds from `requirements.txt`)
- **Python requirements**: Auto-detected from `requirements.txt`
- **Persistent storage**: Not needed for this project

The Space restarts automatically after upload. Models are downloaded on first use.
## Advanced Usage

### Using the API Programmatically

```python
import requests

# Call the chat endpoint
response = requests.post(
    "https://your-username-ai-backend.hf.space/api/predict",
    json={
        "data": [
            "Hello, what is Python?",  # prompt
            150,                       # max_tokens
            0.7,                       # temperature
        ]
    },
)
result = response.json()
print(result["data"][0])  # generated response
```
### Using cURL

```bash
curl -X POST https://your-username-ai-backend.hf.space/api/predict \
  -H "Content-Type: application/json" \
  -d '{"data": ["What is machine learning?", 150, 0.7]}'
```
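For repeated calls, it can help to wrap the request/response plumbing once. A stdlib-only client sketch against the same `/api/predict` contract shown above (`BASE_URL`, `build_payload`, and `chat` are illustrative names; substitute your own Space URL):

```python
import json
import urllib.request

BASE_URL = "https://your-username-ai-backend.hf.space"  # placeholder

def build_payload(prompt, max_tokens=150, temperature=0.7):
    """Gradio's /api/predict expects positional inputs under a "data" key."""
    return {"data": [prompt, max_tokens, temperature]}

def chat(prompt, max_tokens=150, temperature=0.7):
    """POST a chat request and return the first output component."""
    req = urllib.request.Request(
        BASE_URL + "/api/predict",
        data=json.dumps(build_payload(prompt, max_tokens, temperature)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"][0]
```

Keeping the payload construction in its own function makes the positional-argument order (prompt, max_tokens, temperature) easy to test without hitting the network.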
### Custom Model Loading

To use different models, modify the model names in `app.py`:

```python
# In model_manager.load_chat_model():
model_name = "different/model-name"  # change here
```

Recommended lightweight alternatives:

- **Chat**: `microsoft/phi-1` or `mosaicml/mpt-7b-instruct` (7B; may be too heavy for the free tier)
- **Summarization**: `google/flan-t5-base` (larger, ~250M params)
- **Code**: the same TinyLlama, or try `Salesforce/codet5-small`
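One way to make these swaps less error-prone is to keep all checkpoint choices in a single task-to-model table rather than scattered through `app.py`. A sketch (the `MODEL_REGISTRY` name and `checkpoint_for` helper are illustrative; the entries mirror the defaults in this README):

```python
# Default checkpoint per task; swap an entry to try the alternatives above.
MODEL_REGISTRY = {
    "chat": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "code": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "summarization": "google/flan-t5-small",
}

def checkpoint_for(task):
    """Resolve a task name to its checkpoint, failing loudly on typos."""
    try:
        return MODEL_REGISTRY[task]
    except KeyError:
        raise ValueError(
            f"Unknown task {task!r}; expected one of {sorted(MODEL_REGISTRY)}"
        )
```

Each model loader can then call `checkpoint_for("chat")` instead of hard-coding a repo id, so trying `google/flan-t5-base` is a one-line change.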
## ⚠️ Troubleshooting

### Out of Memory Errors

**Symptom**: The Space crashes with OOM.

**Solutions**:

- Reduce the `max_tokens` limits in the code
- Reduce the maximum summarization input length
- Increase the queue timeout in `demo.launch()`

### Slow Responses

**Symptom**: Requests take >10 seconds.

**Solutions**:

- Reduce token limits
- Disable some models in production
- Monitor CPU/RAM in the Space logs

### Model Download Failures

**Symptom**: "Cannot download model" error.

**Solutions**:

- Check internet connectivity in the logs
- Models auto-download on the first request (this may take 1-2 minutes)
- Wait for the "Model loaded successfully" message
## Production Checklist

- ✅ Models tested on CPU
- ✅ Error handling for all endpoints
- ✅ Memory cleanup between requests
- ✅ Queue system for concurrency
- ✅ Token limits for stability
- ✅ Float32 precision for CPU
- ✅ Gradio Blocks UI for testing
- ✅ API documentation
- ✅ Lazy loading implemented
- ✅ Optimized requirements.txt
## Scaling Beyond the Free Tier

If you need more performance:

- **Upgrade to a paid GPU tier**: enables larger models (7B+)
- **Self-host inference**: Ollama or vLLM for local deployment
- **Implement caching**: cache popular responses
- **Model distillation**: train smaller task-specific models
## License & Attribution
- TinyLlama: MIT License
- FLAN-T5: Apache 2.0
- Transformers: Apache 2.0
- Gradio: Apache 2.0
## Contributing

To modify or improve this Space:

- Clone the Space locally
- Modify `app.py` or `requirements.txt`
- Test locally with `python app.py`
- Push changes back to the Space
## Support

For issues:

- Check the Space logs (Settings → Logs)
- Review the "Troubleshooting" section above
- Check the Hugging Face Spaces documentation
- Review the model cards on huggingface.co
**Built for Hugging Face Spaces - optimized for the FREE CPU tier.**

Created: 2024 | Last updated: 2024