HuggingFace Spaces Deployment Guide

Overview

This application is configured to run on HuggingFace Spaces using local model inference (no external API calls required).
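
In practice, "local model inference" means the Space downloads the model weights once and then runs generation inside its own container with the transformers library. A minimal sketch of that pattern (illustrative only; the project's actual llm.py is authoritative):

```python
# Minimal local-inference sketch (illustrative, not the app's actual code).
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.float16,   # half precision so 3.8B params fit in ~8 GB
    device_map="auto",           # place the model on GPU when one is available
    trust_remote_code=True,      # Phi-3 ships custom modeling code
)

out = generator("Summarize this transcript: ...", max_new_tokens=128)
print(out[0]["generated_text"])
```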


Quick Setup

1. Create a New Space

  1. Go to https://huggingface.co/new-space
  2. Choose Gradio as the SDK
  3. Select GPU hardware (T4 or better recommended)
  4. Name your Space (e.g., transcriptor-ai)
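
If you prefer scripting over the web UI, the same Space can be created with the huggingface_hub library (the repo name below is a placeholder):

```python
# Create the Space programmatically; requires `huggingface-cli login` first.
from huggingface_hub import create_repo

create_repo(
    repo_id="your-username/transcriptor-ai",  # substitute your own namespace
    repo_type="space",
    space_sdk="gradio",
)
```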

2. Upload Your Code

Upload all files from this directory to your Space, or connect a Git repository.
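
Programmatic upload works too, e.g. with huggingface_hub (the paths and repo name below are placeholders):

```python
# Push the whole project directory to the Space in one call.
from huggingface_hub import upload_folder

upload_folder(
    folder_path=".",                          # directory containing app.py etc.
    repo_id="your-username/transcriptor-ai",  # your Space
    repo_type="space",
)
```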

3. Configure Space Settings (Optional)

Go to Settings → Variables in your Space and add:

| Variable | Value | Description |
|----------|-------|-------------|
| DEBUG_MODE | True or False | Enable detailed logging |
| LLM_TEMPERATURE | 0.7 | Model creativity (0.0-1.0) |
| LLM_TIMEOUT | 120 | Timeout in seconds |
| LOCAL_MODEL | microsoft/Phi-3-mini-4k-instruct | Model to use |

Note: All settings have sensible defaults - you don't need to set these unless you want to customize.
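
These Variables reach the app as ordinary environment variables. A sketch of how they might be parsed at startup (the project's actual config code may differ):

```python
# Illustrative parsing of Spaces Variables; defaults match the table above.
import os

DEBUG_MODE = os.getenv("DEBUG_MODE", "False").lower() == "true"
LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0.7"))
LLM_TIMEOUT = int(os.getenv("LLM_TIMEOUT", "120"))
LOCAL_MODEL = os.getenv("LOCAL_MODEL", "microsoft/Phi-3-mini-4k-instruct")
```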


Hardware Requirements

Recommended: GPU (T4 or better)

  • Phi-3-mini-4k-instruct: 3.8B params, ~8GB GPU RAM
  • Processing speed: ~30-60 seconds per transcript chunk
  • Best for: Production use with multiple users

Alternative: CPU (not recommended)

  • Works, but is very slow (5-10 minutes per chunk)
  • Only suitable for testing
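
A quick way to confirm which hardware the Space actually gave you is to check from Python:

```python
# Prints the visible device; "Tesla T4" (or better) means you got a GPU tier.
import torch

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("CPU only - expect 5-10 minutes per chunk")
```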

Supported Models

You can change the model by setting the LOCAL_MODEL variable:

Small & Fast (Recommended for Free Tier)

LOCAL_MODEL=microsoft/Phi-3-mini-4k-instruct  (Default - 3.8B params)

Medium (Better quality, needs more GPU)

LOCAL_MODEL=mistralai/Mistral-7B-Instruct-v0.3  (7B params)

Alternatives

LOCAL_MODEL=HuggingFaceH4/zephyr-7b-beta       (7B params, good instruction following)
LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0 (1.1B params, very fast but lower quality)
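
A rough way to judge whether a model fits your GPU: at float16, weights take about 2 bytes per parameter, plus overhead for activations and the KV cache. A back-of-the-envelope check (the 15% overhead figure is an assumption, not a measurement):

```python
# Rough float16 VRAM estimate: 2 bytes/param + ~15% overhead (assumed).
def vram_gb(params_billion: float) -> float:
    return params_billion * 2 * 1.15

for name, size in [("Phi-3-mini", 3.8), ("Mistral-7B", 7.0), ("TinyLlama", 1.1)]:
    print(f"{name}: ~{vram_gb(size):.1f} GB")  # ~8.7, ~16.1, ~2.5 GB
```

The Mistral-7B figure (~16 GB) explains why it is a tight fit on a 16 GB T4, while Phi-3-mini leaves headroom.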

Configuration Files

✅ Required Files

  • app.py - Main application
  • requirements.txt - Python dependencies
  • llm.py, extractors.py, etc. - Core modules
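
If you are assembling requirements.txt from scratch, a plausible minimal set for this local-inference setup looks like the following; the repo's actual file is authoritative, and these unpinned entries are an assumption:

```
gradio
transformers
torch
accelerate   # needed for device_map="auto" model placement
```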

⚠️ NOT Needed for Spaces

  • .env file - Use Spaces Variables instead
  • Local database files
  • API keys (unless using external APIs)

Environment Configuration

The app automatically detects if it's running on HuggingFace Spaces and uses local model inference by default.

Default Configuration (no .env needed):

USE_HF_API=False       # Don't use the HF Inference API
USE_LMSTUDIO=False     # Don't use LM Studio
LLM_BACKEND=local      # Use local transformers
DEBUG_MODE=False       # Disable debug logs

To override: Set Spaces Variables (Settings → Variables)
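
One common detection pattern (illustrative; the app's actual check may differ) relies on the SPACE_ID environment variable that Spaces injects into every container:

```python
# If SPACE_ID is set, we are running on HuggingFace Spaces.
import os

RUNNING_ON_SPACES = os.getenv("SPACE_ID") is not None
LLM_BACKEND = "local" if RUNNING_ON_SPACES else os.getenv("LLM_BACKEND", "local")
```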


Troubleshooting

Issue: "Out of Memory" Error

Solution: Switch to a smaller model

LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0

Issue: Very Slow Processing

Solution:

  1. Make sure you selected GPU hardware (not CPU)
  2. Check Space logs for "Model loaded on cuda" confirmation
  3. If on CPU, upgrade to GPU tier

Issue: Quality Score 0.00

Causes:

  1. Model not loaded properly (check logs for "[Local Model] Loading...")
  2. GPU out of memory (model falls back to CPU)
  3. Timeout too short (increase LLM_TIMEOUT)

Debug Steps:

  1. Set DEBUG_MODE=True in Spaces Variables
  2. Check logs for detailed error messages
  3. Look for "[Local Model] ✅ Generated X characters"

Issue: Model Downloads Every Time

Solution: No action needed. HuggingFace Spaces caches model weights automatically; only the first load downloads them and takes 2-5 minutes.

  • Subsequent starts are faster (~30 seconds)
  • Don't restart Space unnecessarily

Performance Optimization

1. Reduce Context Window

Edit llm.py line 399:

max_length=2000  # Reduce from 3500 for faster processing
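
What that cap amounts to is truncating the prompt to a fixed number of tokens before generation. A standalone sketch of the same idea (the real llm.py logic may differ):

```python
# Truncate input to at most 2000 tokens before it reaches the model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
ids = tokenizer("very long transcript ...", truncation=True, max_length=2000)["input_ids"]
print(len(ids))  # never exceeds 2000
```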

2. Lower Token Limit

Set Spaces Variable:

MAX_TOKENS_PER_REQUEST=800  # Default is 1500

3. Use Smaller Model

LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0

4. Disable Debug Mode

DEBUG_MODE=False

Monitoring

View Logs

  1. Go to your Space
  2. Click Logs tab at the top
  3. Look for startup messages:
✅ Configuration loaded for HuggingFace Spaces
🚀 TranscriptorAI Enterprise - LLM Backend: local
[Local Model] Loading microsoft/Phi-3-mini-4k-instruct...
[Local Model] ✅ Model loaded on cuda:0

Check Processing

During analysis, you should see:

[Local Model] Generating (1500 max tokens, temp=0.7)...
[Local Model] ✅ Generated 1247 characters
[LLM Debug] ✅ Successfully extracted JSON with 7 fields

Cost Estimation

Free Tier (CPU)

  • ⚠️ Very slow but free
  • ~5-10 minutes per transcript

GPU (T4) - ~$0.60/hour

  • ⚑ Fast processing
  • ~30-60 seconds per transcript
  • Space sleeps after inactivity (saves money)
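
At those numbers, per-transcript cost is roughly $0.60 × 30/3600 ≈ $0.005 to $0.60 × 60/3600 = $0.01, ignoring idle time before the Space sleeps.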

Persistent GPU (Upgraded)

  • Always-on for instant access
  • Higher cost but best user experience

Security Notes

  1. No API Keys Needed: Everything runs locally
  2. Private Processing: Data never leaves your Space
  3. Secrets Management: Use Spaces Secrets (not Variables) for sensitive data
  4. Model Access: Phi-3 and most models don't require gated access
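
For point 3: Secrets reach the running app exactly like Variables do, as environment variables, but their values are hidden in the UI. A minimal read (MY_API_KEY is a hypothetical name):

```python
# Read a Spaces Secret; returns None if it isn't configured.
import os

api_key = os.getenv("MY_API_KEY")
```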

Next Steps

  1. ✅ Upload code to your Space
  2. ✅ Select GPU hardware
  3. ✅ Wait for first model download (~2-5 min)
  4. ✅ Test with a sample transcript
  5. 🎉 Share your Space URL!

Support

For questions or issues, open a discussion in your Space's Community tab.

Last Updated: October 2025