---
title: Lightweight AI Backend
emoji: πŸ€–
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.26.0
python_version: '3.10'
app_file: app.py
pinned: false
---

Lightweight Multi-Model AI Backend for Hugging Face Spaces

Production-Ready for FREE CPU Tier

πŸš€ Quick Start

This is a complete, production-ready Hugging Face Gradio Space optimized for the FREE CPU tier. It requires NO GPU and includes four AI capabilities:

  • βœ… General chat (powered by TinyLlama)
  • βœ… Code generation (powered by TinyLlama)
  • βœ… Text summarization (powered by FLAN-T5-Small)
  • βœ… Text-to-image generation (lightweight procedural)

πŸ“¦ Features

Optimization for CPU Tier

  • Lazy Loading: Models loaded only when needed
  • Memory Efficient: ~1.5-2GB total RAM usage
  • Fast Responses: Token limits ensure <10s per request
  • Queue System: Handles concurrent requests safely
  • Float32 Precision: Optimized for CPU computation
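The lazy-loading pattern above can be sketched as follows. This is a minimal illustration only; the `ModelManager` internals here are hypothetical stand-ins, not the actual app.py code:

```python
import gc

class ModelManager:
    """Loads each model on first use and caches it (lazy loading)."""

    def __init__(self):
        self._models = {}    # cache: name -> loaded model
        self.load_calls = 0  # counts actual (expensive) loads

    def _load(self, name):
        # Placeholder for the expensive from_pretrained() call.
        self.load_calls += 1
        return f"<{name} weights>"

    def get(self, name):
        if name not in self._models:  # load only when first needed
            self._models[name] = self._load(name)
        return self._models[name]

    def release(self, name):
        # Drop a model and reclaim memory between requests.
        self._models.pop(name, None)
        gc.collect()

manager = ModelManager()
manager.get("chat")  # triggers the load
manager.get("chat")  # served from cache, no second load
```

Startup stays fast because nothing is loaded until the first request that needs it, and `release()` lets you free a model between requests if memory gets tight.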

Model Selection

| Feature       | Model                    | Size        | Speed          |
|---------------|--------------------------|-------------|----------------|
| Chat          | TinyLlama-1.1B-Chat-v1.0 | 1.1B params | ⚑ Very Fast   |
| Code          | TinyLlama-1.1B-Chat-v1.0 | 1.1B params | ⚑ Very Fast   |
| Summarization | google/flan-t5-small     | ~80M params | ⚑⚑ Fastest   |
| Image Gen     | Procedural rendering     | N/A         | ⚑⚑⚑ Instant |

API Endpoints

All endpoints are exposed through Gradio and accessible programmatically:

1. /generate_chat - General Chat

{
    "prompt": "Hello, how are you?",
    "max_tokens": 150,
    "temperature": 0.7
}
# Returns: Generated chat response

2. /generate_code - Code Generation

{
    "prompt": "Write a function to reverse a string",
    "max_tokens": 256,
    "temperature": 0.3
}
# Returns: Generated Python code

3. /summarize_text - Text Summarization

{
    "text": "Long article text here...",
    "max_length": 100
}
# Returns: Summarized text

4. /generate_image - Text-to-Image

{
    "prompt": "A red sunset over mountains",
    "width": 256,
    "height": 256
}
# Returns: Generated PIL Image

πŸ”§ Configuration & Tuning

Model Parameters

Chat Generation:

  • max_tokens: 50-200 (default: 150)
  • temperature: 0.1-1.0 (default: 0.7)
  • top_p: Fixed at 0.9 for quality

Code Generation:

  • max_tokens: 100-300 (default: 256)
  • temperature: 0.1-1.0 (default: 0.3 - lower for deterministic code)

Summarization:

  • max_length: 20-150 (default: 100)
  • min_length: Fixed at 20

Image Generation:

  • width: 128-256 (default: 256)
  • height: 128-256 (default: 256)
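One way to enforce these ranges server-side is to clamp every incoming value before inference. A sketch (the bounds mirror the ranges above; the `clamp` helper name is my own, not from app.py):

```python
def clamp(value, lo, hi, default):
    """Coerce a user-supplied parameter into a safe range."""
    if value is None:
        return default
    return max(lo, min(hi, value))

# Ranges from the configuration above
chat_tokens = clamp(500, 50, 200, default=150)    # oversized request -> 200
code_temp   = clamp(None, 0.1, 1.0, default=0.3)  # missing -> default 0.3
img_width   = clamp(64, 128, 256, default=256)    # too small -> 128
```

Clamping (rather than rejecting) keeps the endpoints forgiving while still guaranteeing the hard caps the free tier depends on.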

Memory Optimization

The application includes several memory-saving techniques:

# 1. Lazy Loading
# Models only loaded when first called
model_manager.load_chat_model()  # Called only if needed

# 2. Garbage Collection
gc.collect()  # Called after each inference

# 3. CPU Optimization
torch.set_num_threads(4)  # Limits threading overhead

# 4. Token Limits
max_tokens = min(max_tokens, 200)  # Hard caps for stability

# 5. Input Truncation
if len(text) > 1000:
    text = text[:1000]  # Prevent OOM on summarization

πŸ“Š Performance Benchmarks (Rough Estimates)

On a 2-core CPU with 4GB RAM:

| Operation                 | Time  | RAM Used |
|---------------------------|-------|----------|
| Load TinyLlama            | 8-12s | 1.2GB    |
| Chat response (50 tokens) | 3-5s  | 1.2GB    |
| Load FLAN-T5              | 4-6s  | 0.5GB    |
| Summarize (100 words)     | 2-3s  | 0.5GB    |
| Generate image (256x256)  | <1s   | <100MB   |

Total idle memory: ~1.5GB
Max concurrent memory: ~2GB
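To reproduce these numbers on your own hardware, a tiny timing wrapper is enough (a sketch; `timed` is a helper of my own, shown here with a stand-in workload instead of a real model call):

```python
import time

def timed(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for one call."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in for a model call; swap in e.g. a chat inference.
_, elapsed = timed(lambda: sum(range(1_000_000)))
print(f"{elapsed:.3f}s")
```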

πŸš€ Deployment to Hugging Face Spaces

Step 1: Create New Space

  1. Go to huggingface.co/spaces
  2. Click "Create new Space"
  3. Select "Gradio" as SDK
  4. Choose "Public" or "Private"

Step 2: Upload Files

Upload these files to your Space:

  • app.py
  • requirements.txt
  • .gitignore (optional)

Step 3: Configure Space Settings

  • Docker: Leave as default (builds from requirements.txt)
  • Python requirements: Auto-detected from requirements.txt
  • Persistent storage: Not needed for this project

Space will auto-restart after upload. Models will be downloaded on first use.

πŸ” Advanced Usage

Using the API Programmatically

from gradio_client import Client

# With Gradio 4.x, the supported programmatic route is the
# gradio_client package rather than raw POSTs to /api/predict.
client = Client("your-username/ai-backend")

result = client.predict(
    "Hello, what is Python?",  # prompt
    150,                       # max_tokens
    0.7,                       # temperature
    api_name="/generate_chat",
)
print(result)  # Generated response

Using with cURL

# Gradio 4.x uses a two-step REST flow: submit, then fetch by event_id
curl -X POST https://your-username-ai-backend.hf.space/call/generate_chat \
  -H "Content-Type: application/json" \
  -d '{"data": ["What is machine learning?", 150, 0.7]}'
# Returns {"event_id": "..."}; then stream the result:
curl -N https://your-username-ai-backend.hf.space/call/generate_chat/EVENT_ID

Custom Model Loading

To use different models, modify the model names in app.py:

# In model_manager.load_chat_model():
model_name = "different/model-name"  # Change here

Recommended lightweight alternatives:

  • Chat: microsoft/phi-1_5 (1.3B), mosaicml/mpt-7b-instruct (7B, likely too heavy for the free tier)
  • Summarization: google/flan-t5-base (larger, ~250M)
  • Code: Same TinyLlama or try Salesforce/codet5-small
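To keep such swaps in one place, you could route every model name through a single mapping instead of editing each loader (a sketch; the registry layout and `model_for` helper are my own, not part of app.py):

```python
# Central registry of Hugging Face model ids; edit here to swap models.
MODEL_REGISTRY = {
    "chat":      "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "code":      "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "summarize": "google/flan-t5-small",
}

def model_for(task):
    try:
        return MODEL_REGISTRY[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}")

# Swapping the summarizer is now a one-line change:
MODEL_REGISTRY["summarize"] = "google/flan-t5-base"
```

Each `load_*_model()` method would then call `model_for(...)` rather than hard-coding a name.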

⚠️ Troubleshooting

Out of Memory Errors

Symptom: Space crashes with OOM
Solution:

  1. Reduce max_tokens limits in code
  2. Reduce max summarization input length
  3. Limit queue concurrency via demo.queue() so fewer requests run at once

Slow Responses

Symptom: Takes >10 seconds per request
Solution:

  1. Reduce token limits
  2. Disable some models in production
  3. Monitor CPU/RAM in Space logs

Model Download Failures

Symptom: "Cannot download model" error
Solution:

  1. Check internet connectivity in logs
  2. Models auto-download on first request (may take 1-2 min)
  3. Wait for "Model loaded successfully" message

🎯 Production Checklist

  • βœ… Models tested on CPU
  • βœ… Error handling for all endpoints
  • βœ… Memory cleanup between requests
  • βœ… Queue system for concurrency
  • βœ… Token limits for stability
  • βœ… Float32 precision for CPU
  • βœ… Gradio Blocks UI for testing
  • βœ… API documentation
  • βœ… Lazy loading implemented
  • βœ… Optimized requirements.txt

πŸ“ˆ Scaling Beyond Free Tier

If you need more performance:

  1. Upgrade to paid GPU tier: Enables larger models (7B+)
  2. Self-host inference: serve models with ollama or vLLM on your own hardware
  3. Implement caching: Cache popular responses
  4. Model distillation: Train smaller task-specific models
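Response caching (item 3) can be prototyped with functools.lru_cache keyed on the prompt and parameters. A sketch under obvious simplifications; a real deployment would also bound memory and handle non-hashable inputs:

```python
from functools import lru_cache

calls = {"n": 0}  # tracks how many real inferences ran

@lru_cache(maxsize=128)
def cached_generate(prompt, max_tokens=150, temperature=0.7):
    """Stand-in for a model call; repeated prompts hit the cache."""
    calls["n"] += 1
    return f"response to: {prompt}"

cached_generate("What is Python?")  # computed once
cached_generate("What is Python?")  # served from the cache
```

Because identical prompts skip inference entirely, popular queries cost near-zero CPU after the first hit.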

πŸ“ License & Attribution

  • TinyLlama: MIT License
  • FLAN-T5: Apache 2.0
  • Transformers: Apache 2.0
  • Gradio: Apache 2.0

🀝 Contributing

To modify or improve:

  1. Clone this Space locally
  2. Modify app.py or requirements.txt
  3. Test locally with python app.py
  4. Push changes back to Space

πŸ“§ Support

For issues:

  • Check Space logs (Settings β†’ Logs)
  • Review "Troubleshooting" section above
  • Check Hugging Face Spaces documentation
  • Review model cards on huggingface.co

Built for Hugging Face Spaces - Optimized for FREE CPU Tier πŸš€

Created: 2024 Β· Last Updated: 2024