
# HuggingFace Spaces Deployment Guide

## Overview

This application is configured to run on HuggingFace Spaces using local model inference (no external API calls required).


## Quick Setup

### 1. Create a New Space

1. Go to https://huggingface.co/new-space
2. Choose **Gradio** as the SDK
3. Select **GPU** hardware (T4 or better recommended)
4. Name your Space (e.g., `transcriptor-ai`)
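
If you prefer to script this step, the `huggingface_hub` client can create the Space for you. A minimal sketch, assuming you are already logged in (`huggingface-cli login`); the repo id and hardware tier below are illustrative placeholders:

```python
# Sketch: create a Gradio Space programmatically with huggingface_hub.
from huggingface_hub import create_repo

url = create_repo(
    repo_id="your-username/transcriptor-ai",  # hypothetical Space id
    repo_type="space",
    space_sdk="gradio",         # matches step 2 above
    space_hardware="t4-small",  # GPU tier; omit for the free CPU tier
)
print(url)
```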

### 2. Upload Your Code

Upload all files from this directory to your Space, or connect a Git repository.
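
Uploading can also be scripted. A minimal sketch using `huggingface_hub` (the Space id below is a placeholder):

```python
# Sketch: push this project directory's files to the Space.
from huggingface_hub import HfApi

HfApi().upload_folder(
    folder_path=".",                          # this project directory
    repo_id="your-username/transcriptor-ai",  # hypothetical Space id
    repo_type="space",
)
```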

### 3. Configure Space Settings (Optional)

Go to **Settings → Variables** in your Space and add:

| Variable | Value | Description |
|----------|-------|-------------|
| `DEBUG_MODE` | `True` or `False` | Enable detailed logging |
| `LLM_TEMPERATURE` | `0.7` | Model creativity (0.0-1.0) |
| `LLM_TIMEOUT` | `120` | Timeout in seconds |
| `LOCAL_MODEL` | `microsoft/Phi-3-mini-4k-instruct` | Model to use |

**Note:** All settings have sensible defaults; you don't need to set these unless you want to customize.
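
For reference, reading such variables with their documented defaults typically looks like the sketch below; the variable names match the table above, but the exact parsing in app.py may differ:

```python
import os

# Sketch: read Spaces Variables, falling back to the documented defaults.
DEBUG_MODE = os.getenv("DEBUG_MODE", "False").lower() == "true"
LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0.7"))
LLM_TIMEOUT = int(os.getenv("LLM_TIMEOUT", "120"))
LOCAL_MODEL = os.getenv("LOCAL_MODEL", "microsoft/Phi-3-mini-4k-instruct")
```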


## Hardware Requirements

### Recommended: GPU (T4 or better)

- Phi-3-mini-4k-instruct: 3.8B params, ~8GB GPU RAM
- Processing speed: ~30-60 seconds per transcript chunk
- Best for: production use with multiple users

### Alternative: CPU (not recommended)

- Works, but is very slow (5-10 minutes per chunk)
- Only suitable for testing

## Supported Models

You can change the model by setting the `LOCAL_MODEL` variable:

### Small & Fast (Recommended for Free Tier)

```
LOCAL_MODEL=microsoft/Phi-3-mini-4k-instruct    # Default - 3.8B params
```

### Medium (Better Quality, Needs More GPU)

```
LOCAL_MODEL=mistralai/Mistral-7B-Instruct-v0.3  # 7B params
```

### Alternatives

```
LOCAL_MODEL=HuggingFaceH4/zephyr-7b-beta        # 7B params, good instruction following
LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0  # 1.1B params, very fast but lower quality
```
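
Whichever model you choose, loading it via the transformers pipeline generally looks like the sketch below; this is an assumption about the loading pattern, not a copy of llm.py, and older transformers versions may additionally need `trust_remote_code=True` for Phi-3:

```python
import os
from transformers import pipeline

# Sketch: load whichever model LOCAL_MODEL selects.
model_id = os.getenv("LOCAL_MODEL", "microsoft/Phi-3-mini-4k-instruct")
generator = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto",   # place on GPU when the Space has one
    torch_dtype="auto",  # fp16/bf16 on GPU, fp32 on CPU
)
```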

## Configuration Files

### ✅ Required Files

- `app.py` - Main application
- `requirements.txt` - Python dependencies
- `llm.py`, `extractors.py`, etc. - Core modules

### ⚠️ NOT Needed for Spaces

- `.env` file - use Spaces Variables instead
- Local database files
- API keys (unless using external APIs)

## Environment Configuration

The app automatically detects if it's running on HuggingFace Spaces and uses local model inference by default.

**Default configuration (no `.env` needed):**

```
USE_HF_API = False        # Don't use HF Inference API
USE_LMSTUDIO = False      # Don't use LM Studio
LLM_BACKEND = local       # Use local transformers
DEBUG_MODE = False        # Disable debug logs
```

**To override:** set Spaces Variables (**Settings → Variables**).
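
Spaces injects identifying environment variables into the container (e.g. `SPACE_ID`), so the detection can be as simple as the sketch below; whether app.py checks exactly this variable is an assumption:

```python
import os

# Sketch: HuggingFace Spaces sets SPACE_ID in the container environment.
IS_HF_SPACE = os.getenv("SPACE_ID") is not None

if IS_HF_SPACE:
    LLM_BACKEND = "local"  # the default documented above
```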


## Troubleshooting

### Issue: "Out of Memory" Error

**Solution:** switch to a smaller model:

```
LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0
```
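
If you'd rather keep a larger model, 4-bit quantization is another way to cut GPU memory use. This is a general transformers technique (requires `bitsandbytes` in requirements.txt), not something the app necessarily enables out of the box:

```python
# Sketch: load the model in 4-bit to reduce GPU memory substantially.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
```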

### Issue: Very Slow Processing

**Solution:**

1. Make sure you selected GPU hardware (not CPU)
2. Check Space logs for the "Model loaded on cuda" confirmation (a quick check is shown below)
3. If on CPU, upgrade to a GPU tier
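
A quick way to confirm the GPU is actually visible (run in a temporary debug snippet or log line):

```python
import torch

# Sketch: verify CUDA is available before blaming the model.
print(torch.cuda.is_available())          # should be True on GPU hardware
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4"
```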

### Issue: Quality Score 0.00

**Causes:**

1. Model not loaded properly (check logs for "[Local Model] Loading...")
2. GPU out of memory (model falls back to CPU)
3. Timeout too short (increase `LLM_TIMEOUT`)

**Debug steps:**

1. Set `DEBUG_MODE=True` in Spaces Variables
2. Check logs for detailed error messages
3. Look for "[Local Model] ✅ Generated X characters"

### Issue: Model Downloads Every Time

This is expected on the first boot: HuggingFace Spaces caches models automatically, but the first load takes 2-5 minutes.

- Subsequent starts are faster (~30 seconds)
- Don't restart the Space unnecessarily
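
If your Space has persistent storage enabled (mounted at `/data`), pointing the model cache there makes restarts skip the re-download entirely. A hedged sketch; setting `HF_HOME` as a Spaces Variable achieves the same thing:

```python
# Sketch: cache models on persistent storage so restarts skip the download.
# Must run before transformers/huggingface_hub are imported.
import os
os.environ.setdefault("HF_HOME", "/data/.huggingface")
```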

## Performance Optimization

### 1. Reduce Context Window

Edit `llm.py` line 399:

```python
max_length=2000  # Reduce from 3500 for faster processing
```

### 2. Lower Token Limit

Set a Spaces Variable:

```
MAX_TOKENS_PER_REQUEST=800  # Default is 1500
```

### 3. Use a Smaller Model

```
LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

### 4. Disable Debug Mode

```
DEBUG_MODE=False
```

## Monitoring

### View Logs

1. Go to your Space
2. Click the **Logs** tab at the top
3. Look for startup messages:

```
✅ Configuration loaded for HuggingFace Spaces
🚀 TranscriptorAI Enterprise - LLM Backend: local
[Local Model] Loading microsoft/Phi-3-mini-4k-instruct...
[Local Model] ✅ Model loaded on cuda:0
```

### Check Processing

During analysis, you should see:

```
[Local Model] Generating (1500 max tokens, temp=0.7)...
[Local Model] ✅ Generated 1247 characters
[LLM Debug] ✅ Successfully extracted JSON with 7 fields
```

## Cost Estimation

### Free Tier (CPU)

- ⚠️ Very slow but free
- ~5-10 minutes per transcript

### GPU (T4) - ~$0.60/hour

- ⚡ Fast processing
- ~30-60 seconds per transcript
- Space sleeps after inactivity (saves money)

### Persistent GPU (Upgraded)

- Always on for instant access
- Higher cost but the best user experience

## Security Notes

1. **No API keys needed:** everything runs locally
2. **Private processing:** data never leaves your Space
3. **Secrets management:** use Spaces Secrets (not Variables) for sensitive data
4. **Model access:** Phi-3 and most models don't require gated access

## Next Steps

1. ✅ Upload code to your Space
2. ✅ Select GPU hardware
3. ✅ Wait for the first model download (~2-5 min)
4. ✅ Test with a sample transcript
5. 🎉 Share your Space URL!

## Support

*Last Updated: October 2025*