

Deployment Instructions

Deploying to Hugging Face Spaces

Prerequisites

  • A Hugging Face account (free)
  • Git installed locally

Steps

  1. Create a new Space on Hugging Face:

    • Go to https://huggingface.co/spaces
    • Click "Create new Space"
    • Choose a name (e.g., "ai-text-assistant")
    • Select "Gradio" as the SDK
    • Choose visibility (Public or Private)
    • Click "Create Space"
  2. Clone your Space repository:

    git clone https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
    cd YOUR_SPACE_NAME
    
  3. Copy the application files from this project into your Space repository:

    • app.py
    • requirements.txt
    • README.md
    • .gitignore (optional)
  4. Commit and push:

    git add .
    git commit -m "Initial commit: AI Text Assistant"
    git push
    
  5. Wait for deployment:

    • Hugging Face Spaces will automatically detect the changes
    • The build process will install dependencies and start the app
    • This may take 5-10 minutes for the first deployment
    • You can watch the build logs in the Space's "Logs" tab
  6. Access your app:

    • Once deployed, your app will be available at https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME
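Steps 2-4 above can be scripted. Below is a minimal sketch; the function name `deploy_space` and the argument layout are my own, and `YOUR_USERNAME` / `YOUR_SPACE_NAME` remain placeholders you must fill in:

```shell
# Sketch: clone the Space, copy the app files in, commit, and push.
# Usage: deploy_space YOUR_USERNAME YOUR_SPACE_NAME /path/to/this/project
deploy_space() {
  user="$1"; space="$2"; src="$3"
  git clone "https://huggingface.co/spaces/$user/$space" &&
  cd "$space" &&
  cp "$src/app.py" "$src/requirements.txt" "$src/README.md" . &&
  git add . &&
  git commit -m "Initial commit: AI Text Assistant" &&
  git push
}
```

Pushing requires that you are authenticated with Hugging Face (for example via `huggingface-cli login` or a Git credential helper); the script does not handle that step.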

Local Testing

To test locally before deploying:

# Install dependencies
pip install -r requirements.txt

# Run the app
python app.py

The app will be available at http://127.0.0.1:7860

Configuration Options

Hardware

For better performance, you can upgrade your Space's hardware:

  • Go to Space Settings → Hardware
  • Options include CPU (free), GPU T4 (small fee), GPU A10G, etc.
  • The app works on CPU but will be faster with GPU

Environment Variables

You can set these in Space Settings → Variables:

  • TRANSFORMERS_CACHE: Custom cache directory for models
  • HF_HOME: Hugging Face home directory
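For local testing, the same variables can be exported in the shell before running `python app.py`. The paths below are illustrative examples, not required values:

```shell
# Example values only: any writable directory works as the cache root
export HF_HOME="$PWD/.hf-cache"                    # Hugging Face home directory
export TRANSFORMERS_CACHE="$HF_HOME/transformers"  # model cache for transformers
```

Pointing both at a persistent directory means the model downloads from the first run are reused on later runs.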

Troubleshooting

Build fails with memory errors:

  • The models are relatively small, but if you do hit memory limits:
    • Upgrade to a higher hardware tier, or
    • Serve the models through the Hugging Face Inference API instead

App starts slowly:

  • The first run downloads models (~1GB for Qwen, ~1.6GB for BART)
  • Subsequent runs will use cached models
  • Model loading takes 30-60 seconds on CPU

Token alternatives not showing:

  • Make sure you hover over the generated words
  • The tooltip appears on hover with a slight delay
  • Try different browsers if issues persist

Performance Notes

  • First Load: Slow due to model downloads
  • Model Loading: 30-60 seconds on CPU, 5-10 seconds on GPU
  • Generation Speed:
    • Qwen (0.5B): ~10-20 tokens/sec on CPU, ~100+ tokens/sec on GPU
    • BART-large: ~5-10 tokens/sec on CPU, ~50+ tokens/sec on GPU
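As a back-of-envelope check on those throughput figures: at roughly 15 tokens/sec (a midpoint of the quoted CPU range for Qwen), a 200-token response takes about 13 seconds. Both numbers here are illustrative:

```shell
# Rough latency estimate from the quoted CPU rates (illustrative numbers)
TOKENS=200   # length of a typical response
RATE=15      # ~tokens/sec for Qwen 0.5B on CPU (midpoint of 10-20)
echo "approx $((TOKENS / RATE))s to generate $TOKENS tokens"
```

The same arithmetic with BART-large's CPU rate (~7 tokens/sec) gives closer to half a minute, which is why summarization feels noticeably slower than generation on the free tier.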

Support

For issues or questions: