---
title: Lightweight AI Backend
emoji: πŸ€–
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.26.0
python_version: '3.10'
app_file: app.py
pinned: false
---

Lightweight Multi-Model AI Backend for Hugging Face Spaces

Production-Ready for FREE CPU Tier

πŸš€ Quick Start

This is a complete, production-ready Hugging Face Gradio Space optimized for the FREE CPU tier. It requires NO GPU and includes four AI capabilities:

  • βœ… General chat (powered by TinyLlama)
  • βœ… Code generation (powered by TinyLlama)
  • βœ… Text summarization (powered by FLAN-T5-Small)
  • βœ… Text-to-image generation (lightweight procedural)

πŸ“¦ Features

Optimization for CPU Tier

  • Lazy Loading: Models loaded only when needed
  • Memory Efficient: ~1.5-2GB total RAM usage
  • Fast Responses: Token limits ensure <10s per request
  • Queue System: Handles concurrent requests safely
  • Float32 Precision: Optimized for CPU computation
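The lazy-loading pattern above can be sketched as follows. This is a minimal illustration only; the `ModelManager` internals here are hypothetical stand-ins, not the actual app.py code:

```python
import gc

class ModelManager:
    """Loads each model on first use and caches it (lazy loading)."""

    def __init__(self):
        self._models = {}    # cache: name -> loaded model
        self.load_calls = 0  # counts actual (expensive) loads

    def _load(self, name):
        # Placeholder for the expensive from_pretrained() call.
        self.load_calls += 1
        return f"<{name} weights>"

    def get(self, name):
        if name not in self._models:  # load only when first needed
            self._models[name] = self._load(name)
        return self._models[name]

    def release(self, name):
        # Drop a model and reclaim memory between requests.
        self._models.pop(name, None)
        gc.collect()

manager = ModelManager()
manager.get("chat")  # triggers the load
manager.get("chat")  # served from cache, no second load
```

Startup stays fast because nothing is loaded until the first request that needs it, and `release()` lets you free a model between requests if memory gets tight.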

Model Selection

| Feature       | Model                    | Size        | Speed          |
|---------------|--------------------------|-------------|----------------|
| Chat          | TinyLlama-1.1B-Chat-v1.0 | 1.1B params | ⚑ Very Fast   |
| Code          | TinyLlama-1.1B-Chat-v1.0 | 1.1B params | ⚑ Very Fast   |
| Summarization | google/flan-t5-small     | ~80M params | ⚑⚑ Fastest   |
| Image Gen     | Procedural rendering     | N/A         | ⚑⚑⚑ Instant |

API Endpoints

All endpoints are exposed through Gradio and accessible programmatically:

1. /generate_chat - General Chat

{
    "prompt": "Hello, how are you?",
    "max_tokens": 150,
    "temperature": 0.7
}
# Returns: Generated chat response

2. /generate_code - Code Generation

{
    "prompt": "Write a function to reverse a string",
    "max_tokens": 256,
    "temperature": 0.3
}
# Returns: Generated Python code

3. /summarize_text - Text Summarization

{
    "text": "Long article text here...",
    "max_length": 100
}
# Returns: Summarized text

4. /generate_image - Text-to-Image

{
    "prompt": "A red sunset over mountains",
    "width": 256,
    "height": 256
}
# Returns: Generated PIL Image

πŸ”§ Configuration & Tuning

Model Parameters

Chat Generation:

  • max_tokens: 50-200 (default: 150)
  • temperature: 0.1-1.0 (default: 0.7)
  • top_p: Fixed at 0.9 for quality

Code Generation:

  • max_tokens: 100-300 (default: 256)
  • temperature: 0.1-1.0 (default: 0.3 - lower for deterministic code)

Summarization:

  • max_length: 20-150 (default: 100)
  • min_length: Fixed at 20

Image Generation:

  • width: 128-256 (default: 256)
  • height: 128-256 (default: 256)
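One way to enforce these ranges server-side is to clamp every incoming value before inference. A sketch (the bounds mirror the ranges above; the `clamp` helper name is my own, not from app.py):

```python
def clamp(value, lo, hi, default):
    """Coerce a user-supplied parameter into a safe range."""
    if value is None:
        return default
    return max(lo, min(hi, value))

# Ranges from the configuration above
chat_tokens = clamp(500, 50, 200, default=150)    # oversized request -> 200
code_temp   = clamp(None, 0.1, 1.0, default=0.3)  # missing -> default 0.3
img_width   = clamp(64, 128, 256, default=256)    # too small -> 128
```

Clamping (rather than rejecting) keeps the endpoints forgiving while still guaranteeing the hard caps the free tier depends on.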

Memory Optimization

The application includes several memory-saving techniques:

# 1. Lazy Loading
# Models only loaded when first called
model_manager.load_chat_model()  # Called only if needed

# 2. Garbage Collection
gc.collect()  # Called after each inference

# 3. CPU Optimization
torch.set_num_threads(4)  # Limits threading overhead

# 4. Token Limits
max_tokens = min(max_tokens, 200)  # Hard caps for stability

# 5. Input Truncation
if len(text) > 1000:
    text = text[:1000]  # Prevent OOM on summarization

πŸ“Š Performance Benchmarks (Rough Estimates)

On a 2-core CPU with 4GB RAM:

| Operation                 | Time  | RAM Used |
|---------------------------|-------|----------|
| Load TinyLlama            | 8-12s | 1.2GB    |
| Chat response (50 tokens) | 3-5s  | 1.2GB    |
| Load FLAN-T5              | 4-6s  | 0.5GB    |
| Summarize (100 words)     | 2-3s  | 0.5GB    |
| Generate image (256x256)  | <1s   | <100MB   |

Total idle memory: ~1.5GB
Max concurrent memory: ~2GB
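To reproduce these numbers on your own hardware, a tiny timing wrapper is enough (a sketch; `timed` is a helper of my own, shown here with a stand-in workload instead of a real model call):

```python
import time

def timed(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for one call."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in for a model call; swap in e.g. a chat inference.
_, elapsed = timed(lambda: sum(range(1_000_000)))
print(f"{elapsed:.3f}s")
```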

πŸš€ Deployment to Hugging Face Spaces

Step 1: Create New Space

  1. Go to huggingface.co/spaces
  2. Click "Create new Space"
  3. Select "Gradio" as SDK
  4. Choose "Public" or "Private"

Step 2: Upload Files

Upload these files to your Space:

  • app.py
  • requirements.txt
  • .gitignore (optional)

Step 3: Configure Space Settings

  • Docker: Leave as default (builds from requirements.txt)
  • Python requirements: Auto-detected from requirements.txt
  • Persistent storage: Not needed for this project

Space will auto-restart after upload. Models will be downloaded on first use.

πŸ” Advanced Usage

Using the API Programmatically

from gradio_client import Client

# With Gradio 4.x, the supported programmatic route is the
# gradio_client package rather than raw POSTs to /api/predict.
client = Client("your-username/ai-backend")

result = client.predict(
    "Hello, what is Python?",  # prompt
    150,                       # max_tokens
    0.7,                       # temperature
    api_name="/generate_chat",
)
print(result)  # Generated response

Using with cURL

# Gradio 4.x uses a two-step REST flow: submit, then fetch by event_id
curl -X POST https://your-username-ai-backend.hf.space/call/generate_chat \
  -H "Content-Type: application/json" \
  -d '{"data": ["What is machine learning?", 150, 0.7]}'
# Returns {"event_id": "..."}; then stream the result:
curl -N https://your-username-ai-backend.hf.space/call/generate_chat/EVENT_ID

Custom Model Loading

To use different models, modify the model names in app.py:

# In model_manager.load_chat_model():
model_name = "different/model-name"  # Change here

Recommended lightweight alternatives:

  • Chat: microsoft/phi-1_5 (1.3B), mosaicml/mpt-7b-instruct (7B, likely too heavy for the free tier)
  • Summarization: google/flan-t5-base (larger, ~250M)
  • Code: Same TinyLlama or try Salesforce/codet5-small
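To keep such swaps in one place, you could route every model name through a single mapping instead of editing each loader (a sketch; the registry layout and `model_for` helper are my own, not part of app.py):

```python
# Central registry of Hugging Face model ids; edit here to swap models.
MODEL_REGISTRY = {
    "chat":      "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "code":      "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "summarize": "google/flan-t5-small",
}

def model_for(task):
    try:
        return MODEL_REGISTRY[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}")

# Swapping the summarizer is now a one-line change:
MODEL_REGISTRY["summarize"] = "google/flan-t5-base"
```

Each `load_*_model()` method would then call `model_for(...)` rather than hard-coding a name.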

⚠️ Troubleshooting

Out of Memory Errors

Symptom: Space crashes with OOM
Solution:

  1. Reduce max_tokens limits in code
  2. Reduce max summarization input length
  3. Limit queue concurrency via demo.queue() so fewer requests run at once

Slow Responses

Symptom: Takes >10 seconds per request
Solution:

  1. Reduce token limits
  2. Disable some models in production
  3. Monitor CPU/RAM in Space logs

Model Download Failures

Symptom: "Cannot download model" error
Solution:

  1. Check internet connectivity in logs
  2. Models auto-download on first request (may take 1-2 min)
  3. Wait for "Model loaded successfully" message

🎯 Production Checklist

  • βœ… Models tested on CPU
  • βœ… Error handling for all endpoints
  • βœ… Memory cleanup between requests
  • βœ… Queue system for concurrency
  • βœ… Token limits for stability
  • βœ… Float32 precision for CPU
  • βœ… Gradio Blocks UI for testing
  • βœ… API documentation
  • βœ… Lazy loading implemented
  • βœ… Optimized requirements.txt

πŸ“ˆ Scaling Beyond Free Tier

If you need more performance:

  1. Upgrade to paid GPU tier: Enables larger models (7B+)
  2. Self-host inference: serve models with ollama or vLLM on your own hardware
  3. Implement caching: Cache popular responses
  4. Model distillation: Train smaller task-specific models
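Response caching (item 3) can be prototyped with functools.lru_cache keyed on the prompt and parameters. A sketch under obvious simplifications; a real deployment would also bound memory and handle non-hashable inputs:

```python
from functools import lru_cache

calls = {"n": 0}  # tracks how many real inferences ran

@lru_cache(maxsize=128)
def cached_generate(prompt, max_tokens=150, temperature=0.7):
    """Stand-in for a model call; repeated prompts hit the cache."""
    calls["n"] += 1
    return f"response to: {prompt}"

cached_generate("What is Python?")  # computed once
cached_generate("What is Python?")  # served from the cache
```

Because identical prompts skip inference entirely, popular queries cost near-zero CPU after the first hit.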

πŸ“ License & Attribution

  • TinyLlama: MIT License
  • FLAN-T5: Apache 2.0
  • Transformers: Apache 2.0
  • Gradio: Apache 2.0

🀝 Contributing

To modify or improve:

  1. Clone this Space locally
  2. Modify app.py or requirements.txt
  3. Test locally with python app.py
  4. Push changes back to Space

πŸ“§ Support

For issues:

  • Check Space logs (Settings β†’ Logs)
  • Review "Troubleshooting" section above
  • Check Hugging Face Spaces documentation
  • Review model cards on huggingface.co

Built for Hugging Face Spaces - Optimized for FREE CPU Tier πŸš€

Created: 2024 Β· Last Updated: 2024