# HuggingFace Spaces Deployment Guide

## Overview
This application is configured to run on HuggingFace Spaces using local model inference (no external API calls required).
## Quick Setup

### 1. Create a New Space
- Go to https://huggingface.co/new-space
- Choose Gradio as the SDK
- Select GPU hardware (T4 or better recommended)
- Name your Space (e.g., `transcriptor-ai`)
### 2. Upload Your Code
Upload all files from this directory to your Space, or connect a Git repository.
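If you'd rather upload from the command line than through the web UI, a minimal sketch using the `huggingface_hub` library looks like this (the `repo_id` is a placeholder for your own username and Space name):

```python
# Minimal upload sketch using huggingface_hub (repo_id is a placeholder).
# Requires a prior `huggingface-cli login` or an HF_TOKEN environment variable.
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path=".",                          # the directory with app.py etc.
    repo_id="your-username/transcriptor-ai",  # replace with your Space
    repo_type="space",
)
```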
### 3. Configure Space Settings (Optional)
Go to Settings → Variables in your Space and add:
| Variable | Value | Description |
|---|---|---|
| `DEBUG_MODE` | `True` or `False` | Enable detailed logging |
| `LLM_TEMPERATURE` | `0.7` | Model creativity (0.0-1.0) |
| `LLM_TIMEOUT` | `120` | Timeout in seconds |
| `LOCAL_MODEL` | `microsoft/Phi-3-mini-4k-instruct` | Model to use |
Note: All settings have sensible defaults; you only need to set these if you want to customize them.
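For reference, Spaces Variables reach the app through the environment. A minimal sketch of how the table above might be read (the actual parsing in `app.py` may differ):

```python
import os

# Illustrative only - the app's real config loading may differ.
DEBUG_MODE = os.environ.get("DEBUG_MODE", "False").lower() == "true"
LLM_TEMPERATURE = float(os.environ.get("LLM_TEMPERATURE", "0.7"))
LLM_TIMEOUT = int(os.environ.get("LLM_TIMEOUT", "120"))
LOCAL_MODEL = os.environ.get("LOCAL_MODEL", "microsoft/Phi-3-mini-4k-instruct")
```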
## Hardware Requirements

### Recommended: GPU (T4 or better)
- Phi-3-mini-4k-instruct: 3.8B params, ~8GB GPU RAM
- Processing speed: ~30-60 seconds per transcript chunk
- Best for: Production use with multiple users
### Alternative: CPU (not recommended)
- Works, but very slowly (5-10 minutes per chunk)
- Only suitable for testing
## Supported Models

You can change the model by setting the `LOCAL_MODEL` variable:
### Small & Fast (Recommended for Free Tier)
`LOCAL_MODEL=microsoft/Phi-3-mini-4k-instruct` (default, 3.8B params)

### Medium (Better Quality, Needs More GPU)
`LOCAL_MODEL=mistralai/Mistral-7B-Instruct-v0.3` (7B params)

### Alternatives
- `LOCAL_MODEL=HuggingFaceH4/zephyr-7b-beta` (7B params, good instruction following)
- `LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0` (1.1B params, very fast but lower quality)
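Any of these models can be loaded the same way with `transformers`. A minimal sketch, assuming a standard text-generation pipeline (older `transformers` releases may additionally need `trust_remote_code=True` for Phi-3):

```python
import os
from transformers import pipeline

# Load whichever model LOCAL_MODEL points at; falls back to the default.
model_id = os.environ.get("LOCAL_MODEL", "microsoft/Phi-3-mini-4k-instruct")
generator = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto",   # GPU when available, otherwise CPU
    torch_dtype="auto",  # half precision on GPUs that support it
)
print(generator("Summarize: the meeting covered Q3 goals.", max_new_tokens=50))
```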
## Configuration Files

### ✅ Required Files
- `app.py` - Main application
- `requirements.txt` - Python dependencies
- `llm.py`, `extractors.py`, etc. - Core modules
### ⚠️ NOT Needed for Spaces
- `.env` file - Use Spaces Variables instead
- Local database files
- API keys (unless using external APIs)
## Environment Configuration

The app automatically detects whether it's running on HuggingFace Spaces and uses local model inference by default.

Default configuration (no `.env` needed):
```
USE_HF_API = False     # Don't use HF Inference API
USE_LMSTUDIO = False   # Don't use LM Studio
LLM_BACKEND = local    # Use local transformers
DEBUG_MODE = False     # Disable debug logs
```
To override: set Spaces Variables (Settings → Variables).
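Spaces sets environment variables such as `SPACE_ID` on every running Space, so the detection can be as simple as the sketch below (illustrative; the app's actual detection logic may differ):

```python
import os

# SPACE_ID is set by HuggingFace Spaces on every running Space.
RUNNING_ON_SPACES = "SPACE_ID" in os.environ

# Force the local backend on Spaces; honor the variable elsewhere.
LLM_BACKEND = "local" if RUNNING_ON_SPACES else os.environ.get("LLM_BACKEND", "local")
```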
## Troubleshooting

### Issue: "Out of Memory" Error
Solution: switch to a smaller model:
```
LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0
```
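If switching models isn't enough, loading weights in half precision roughly halves GPU memory. A sketch assuming a `transformers`-based loader (adjust to match `llm.py`):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 weights: ~2 bytes per parameter
    device_map="auto",          # spill to CPU only if the GPU is full
)
```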
### Issue: Very Slow Processing
Solution:
- Make sure you selected GPU hardware (not CPU)
- Check the Space logs for a "Model loaded on cuda" confirmation (a quick device check is sketched below)
- If on CPU, upgrade to a GPU tier
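A quick device check you can run in a console or add to startup logging:

```python
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```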
### Issue: Quality Score 0.00
Causes:
- Model not loaded properly (check logs for "[Local Model] Loading...")
- GPU out of memory (model falls back to CPU)
- Timeout too short (increase `LLM_TIMEOUT`)
Debug Steps:
- Set `DEBUG_MODE=True` in Spaces Variables
- Check logs for detailed error messages
- Look for "[Local Model] ✅ Generated X characters"
### Issue: Model Downloads Every Time
Solution: HuggingFace Spaces caches models automatically, but the first load takes 2-5 minutes.
- Subsequent starts are faster (~30 seconds)
- Don't restart the Space unnecessarily
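If you want the download to happen at startup rather than on the first request, `huggingface_hub.snapshot_download` pre-fetches the model into the local cache (a sketch; call it wherever your startup code runs):

```python
from huggingface_hub import snapshot_download

# Idempotent: re-uses cached files on subsequent calls.
snapshot_download("microsoft/Phi-3-mini-4k-instruct")
```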
## Performance Optimization

### 1. Reduce Context Window
Edit `llm.py` line 399:
```python
max_length=2000  # Reduce from 3500 for faster processing
```
### 2. Lower Token Limit
Set a Spaces Variable:
```
MAX_TOKENS_PER_REQUEST=800   # Default is 1500
```
### 3. Use a Smaller Model
```
LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

### 4. Disable Debug Mode
```
DEBUG_MODE=False
```
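Putting options 2 and 3 together, a sketch of how the token limit might flow into generation (the exact plumbing inside `llm.py` may differ):

```python
import os
from transformers import pipeline

# Spaces Variables from the options above; defaults mirror this guide.
model_id = os.environ.get("LOCAL_MODEL", "microsoft/Phi-3-mini-4k-instruct")
max_new_tokens = int(os.environ.get("MAX_TOKENS_PER_REQUEST", "1500"))

generator = pipeline("text-generation", model=model_id, device_map="auto")
result = generator("Summarize this transcript: ...", max_new_tokens=max_new_tokens)
```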
## Monitoring

### View Logs
- Go to your Space
- Click the Logs tab at the top
- Look for startup messages:
```
✅ Configuration loaded for HuggingFace Spaces
🚀 TranscriptorAI Enterprise - LLM Backend: local
[Local Model] Loading microsoft/Phi-3-mini-4k-instruct...
[Local Model] ✅ Model loaded on cuda:0
```
### Check Processing
During analysis, you should see:
```
[Local Model] Generating (1500 max tokens, temp=0.7)...
[Local Model] ✅ Generated 1247 characters
[LLM Debug] ✅ Successfully extracted JSON with 7 fields
```
## Cost Estimation

### Free Tier (CPU)
- ⚠️ Very slow but free
- ~5-10 minutes per transcript

### GPU (T4) - ~$0.60/hour
- ⚡ Fast processing
- ~30-60 seconds per transcript
- Space sleeps after inactivity (saves money)

### Persistent GPU (Upgraded)
- Always-on for instant access
- Higher cost but best user experience
## Security Notes
- No API Keys Needed: Everything runs locally
- Private Processing: Data never leaves your Space
- Secrets Management: Use Spaces Secrets (not Variables) for sensitive data
- Model Access: Phi-3 and most models don't require gated access
## Next Steps

- ✅ Upload code to your Space
- ✅ Select GPU hardware
- ✅ Wait for the first model download (~2-5 min)
- ✅ Test with a sample transcript
- 🎉 Share your Space URL!
## Support
- HuggingFace Spaces Docs: https://huggingface.co/docs/hub/spaces
- Transformers Docs: https://huggingface.co/docs/transformers
- GPU Pricing: https://huggingface.co/pricing
Last Updated: October 2025