# HuggingFace Spaces Deployment Guide

## Overview

This application is configured to run on **HuggingFace Spaces** using local model inference (no external API calls required).

---

## Quick Setup

### 1. Create a New Space

1. Go to https://huggingface.co/new-space
2. Choose **Gradio** as the SDK
3. Select **GPU** hardware (T4 or better recommended)
4. Name your Space (e.g., `transcriptor-ai`)

### 2. Upload Your Code

Upload all files from this directory to your Space, or connect a Git repository.

### 3. Configure Space Settings (Optional)

Go to **Settings → Variables** in your Space and add:

| Variable | Value | Description |
|----------|-------|-------------|
| `DEBUG_MODE` | `True` or `False` | Enable detailed logging |
| `LLM_TEMPERATURE` | `0.7` | Model creativity (0.0-1.0) |
| `LLM_TIMEOUT` | `120` | Timeout in seconds |
| `LOCAL_MODEL` | `microsoft/Phi-3-mini-4k-instruct` | Model to use |

**Note:** All settings have sensible defaults; you don't need to set these unless you want to customize.

---

## Hardware Requirements

### Recommended: GPU (T4 or better)

- **Phi-3-mini-4k-instruct**: 3.8B params, ~8GB GPU RAM
- Processing speed: ~30-60 seconds per transcript chunk
- **Best for:** production use with multiple users

### Alternative: CPU (not recommended)

- Works, but is very slow (5-10 minutes per chunk)
- Only suitable for testing

---

## Supported Models

You can change the model by setting the `LOCAL_MODEL` variable:

### Small & Fast (Recommended for Free Tier)

```
LOCAL_MODEL=microsoft/Phi-3-mini-4k-instruct    (default - 3.8B params)
```

### Medium (Better Quality, Needs More GPU)

```
LOCAL_MODEL=mistralai/Mistral-7B-Instruct-v0.3    (7B params)
```

### Alternatives

```
LOCAL_MODEL=HuggingFaceH4/zephyr-7b-beta          (7B params, good instruction following)
LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0    (1.1B params, very fast but lower quality)
```

---

## Configuration Files

### ✅ Required Files

- `app.py` - Main application
- `requirements.txt` - Python dependencies
- `llm.py`, `extractors.py`, etc. - Core modules

### ⚠️ NOT Needed for Spaces

- `.env` file - use Spaces Variables instead
- Local database files
- API keys (unless using external APIs)

---

## Environment Configuration

The app automatically detects when it is running on HuggingFace Spaces and uses local model inference by default.

**Default configuration (no `.env` needed):**

```python
USE_HF_API = False      # Don't use the HF Inference API
USE_LMSTUDIO = False    # Don't use LM Studio
LLM_BACKEND = "local"   # Use local transformers
DEBUG_MODE = False      # Disable debug logs
```

**To override:** set Spaces Variables (Settings → Variables).

---

## Troubleshooting

### Issue: "Out of Memory" Error

**Solution:** switch to a smaller model:

```
LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

### Issue: Very Slow Processing

**Solution:**

1. Make sure you selected **GPU** hardware (not CPU)
2. Check the Space logs for the "Model loaded on cuda" confirmation (a manual check is sketched at the end of this section)
3. If on CPU, upgrade to a GPU tier

### Issue: Quality Score 0.00

**Causes:**

1. Model not loaded properly (check logs for "[Local Model] Loading...")
2. GPU out of memory (model falls back to CPU)
3. Timeout too short (increase `LLM_TIMEOUT`)

**Debug steps:**

1. Set `DEBUG_MODE=True` in Spaces Variables
2. Check the logs for detailed error messages
3. Look for "[Local Model] ✅ Generated X characters"

### Issue: Model Downloads Every Time

**Solution:** HuggingFace Spaces caches models automatically, but the first load takes 2-5 minutes.

- Subsequent starts are faster (~30 seconds)
- Don't restart the Space unnecessarily
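If you want to confirm device placement, caching, and generation outside the app, the sketch below loads whatever `LOCAL_MODEL` points to and generates a few tokens. This is a minimal illustration built only on the public `transformers`/`torch` APIs, not the app's actual loading code in `llm.py`; older `transformers` releases may additionally need `trust_remote_code=True` for Phi-3.

```python
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same default as the LOCAL_MODEL Spaces Variable documented above.
model_id = os.getenv("LOCAL_MODEL", "microsoft/Phi-3-mini-4k-instruct")

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Loading {model_id} (target device: {device})...")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    device_map="auto",  # requires `accelerate`; places weights on GPU if present
)
print(f"Model loaded on {model.device}")  # expect cuda:0 on GPU hardware

# A tiny generation proves the whole load/generate path works.
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The downloaded weights land in the Hub cache (`~/.cache/huggingface` by default), which is why only the first load pays the download cost.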
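Every knob in this guide is a plain environment variable, so Spaces Variables override the defaults without any `.env` file. The variable names below match the ones documented in this guide (including `MAX_TOKENS_PER_REQUEST` from the next section), but the parsing code is an illustrative sketch of the pattern rather than the app's actual configuration module, and the `SPACE_ID` check is just one common way to detect the Spaces runtime.

```python
import os

# Defaults mirror the "Default configuration" block above;
# Spaces Variables (Settings → Variables) override them at startup.
DEBUG_MODE = os.getenv("DEBUG_MODE", "False").strip().lower() == "true"
LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0.7"))
LLM_TIMEOUT = int(os.getenv("LLM_TIMEOUT", "120"))
LOCAL_MODEL = os.getenv("LOCAL_MODEL", "microsoft/Phi-3-mini-4k-instruct")
MAX_TOKENS_PER_REQUEST = int(os.getenv("MAX_TOKENS_PER_REQUEST", "1500"))

# HuggingFace Spaces sets SPACE_ID inside the container, so its presence
# is a reasonable signal for forcing the local backend.
IS_SPACES = "SPACE_ID" in os.environ
LLM_BACKEND = "local" if IS_SPACES else os.getenv("LLM_BACKEND", "local")
```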
---

## Performance Optimization

### 1. Reduce Context Window

Edit `llm.py` line 399:

```python
max_length=2000  # Reduce from 3500 for faster processing
```

### 2. Lower Token Limit

Set a Spaces Variable:

```
MAX_TOKENS_PER_REQUEST=800  # Default is 1500
```

### 3. Use a Smaller Model

```
LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

### 4. Disable Debug Mode

```
DEBUG_MODE=False
```

---

## Monitoring

### View Logs

1. Go to your Space
2. Click the **Logs** tab at the top
3. Look for the startup messages:

```
✅ Configuration loaded for HuggingFace Spaces
🚀 TranscriptorAI Enterprise - LLM Backend: local
[Local Model] Loading microsoft/Phi-3-mini-4k-instruct...
[Local Model] ✅ Model loaded on cuda:0
```

### Check Processing

During analysis, you should see:

```
[Local Model] Generating (1500 max tokens, temp=0.7)...
[Local Model] ✅ Generated 1247 characters
[LLM Debug] ✅ Successfully extracted JSON with 7 fields
```

---

## Cost Estimation

### Free Tier (CPU)

- ⚠️ Very slow but free
- ~5-10 minutes per transcript

### GPU (T4)

- ~$0.60/hour
- ⚡ Fast processing: ~30-60 seconds per transcript
- Space sleeps after inactivity (saves money)

### Persistent GPU (Upgraded)

- Always on for instant access
- Higher cost but the best user experience

---

## Security Notes

1. **No API keys needed:** everything runs locally
2. **Private processing:** data never leaves your Space
3. **Secrets management:** use Spaces Secrets (not Variables) for sensitive data
4. **Model access:** Phi-3 and most listed models don't require gated access

---

## Next Steps

1. ✅ Upload code to your Space
2. ✅ Select GPU hardware
3. ✅ Wait for the first model download (~2-5 min)
4. ✅ Test with a sample transcript
5. 🎉 Share your Space URL!

---

## Support

- **HuggingFace Spaces Docs:** https://huggingface.co/docs/hub/spaces
- **Transformers Docs:** https://huggingface.co/docs/transformers
- **GPU Pricing:** https://huggingface.co/pricing

---

**Last Updated:** October 2025