# HuggingFace Spaces Deployment Guide

## Overview

This application is configured to run on **HuggingFace Spaces** using local model inference (no external API calls required).

---

## Quick Setup

### 1. Create a New Space

1. Go to https://huggingface.co/new-space
2. Choose **Gradio** as the SDK
3. Select **GPU** hardware (T4 or better recommended)
4. Name your Space (e.g., `transcriptor-ai`)

### 2. Upload Your Code

Upload all files from this directory to your Space, or connect a Git repository.

### 3. Configure Space Settings (Optional)

Go to **Settings → Variables** in your Space and add:

| Variable | Value | Description |
|----------|-------|-------------|
| `DEBUG_MODE` | `True` or `False` | Enable detailed logging |
| `LLM_TEMPERATURE` | `0.7` | Model creativity (0.0-1.0) |
| `LLM_TIMEOUT` | `120` | Timeout in seconds |
| `LOCAL_MODEL` | `microsoft/Phi-3-mini-4k-instruct` | Model to use |

**Note:** All settings have sensible defaults; you don't need to set these unless you want to customize.

---

## Hardware Requirements

### Recommended: GPU (T4 or better)

- **Phi-3-mini-4k-instruct**: 3.8B params, ~8GB GPU RAM
- Processing speed: ~30-60 seconds per transcript chunk
- **Best for:** production use with multiple users

### Alternative: CPU (not recommended)

- Works, but is very slow (5-10 minutes per chunk)
- Only suitable for testing

---

## Supported Models

You can change the model by setting the `LOCAL_MODEL` variable:

### Small & Fast (Recommended for Free Tier)

```
LOCAL_MODEL=microsoft/Phi-3-mini-4k-instruct    (default - 3.8B params)
```

### Medium (Better Quality, Needs More GPU)

```
LOCAL_MODEL=mistralai/Mistral-7B-Instruct-v0.3    (7B params)
```

### Alternatives

```
LOCAL_MODEL=HuggingFaceH4/zephyr-7b-beta          (7B params, good instruction following)
LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0    (1.1B params, very fast but lower quality)
```

---

## Configuration Files

### ✅ Required Files

- `app.py` - Main application
- `requirements.txt` - Python dependencies
- `llm.py`, `extractors.py`, etc. - Core modules

### ⚠️ NOT Needed for Spaces

- `.env` file - use Spaces Variables instead
- Local database files
- API keys (unless using external APIs)

---

## Environment Configuration

The app automatically detects when it is running on HuggingFace Spaces and uses local model inference by default.

**Default configuration (no `.env` needed):**

```python
USE_HF_API = False      # Don't use the HF Inference API
USE_LMSTUDIO = False    # Don't use LM Studio
LLM_BACKEND = "local"   # Use local transformers
DEBUG_MODE = False      # Disable debug logs
```

**To override:** set Spaces Variables (Settings → Variables).

---

## Troubleshooting

### Issue: "Out of Memory" Error

**Solution:** switch to a smaller model:

```
LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

### Issue: Very Slow Processing

**Solution:**

1. Make sure you selected **GPU** hardware (not CPU)
2. Check the Space logs for the "Model loaded on cuda" confirmation (a manual check is sketched at the end of this section)
3. If on CPU, upgrade to a GPU tier

### Issue: Quality Score 0.00

**Causes:**

1. Model not loaded properly (check logs for "[Local Model] Loading...")
2. GPU out of memory (model falls back to CPU)
3. Timeout too short (increase `LLM_TIMEOUT`)

**Debug steps:**

1. Set `DEBUG_MODE=True` in Spaces Variables
2. Check the logs for detailed error messages
3. Look for "[Local Model] ✅ Generated X characters"

### Issue: Model Downloads Every Time

**Solution:** HuggingFace Spaces caches models automatically, but the first load takes 2-5 minutes.

- Subsequent starts are faster (~30 seconds)
- Don't restart the Space unnecessarily
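If you want to confirm device placement, caching, and generation outside the app, the sketch below loads whatever `LOCAL_MODEL` points to and generates a few tokens. This is a minimal illustration built only on the public `transformers`/`torch` APIs, not the app's actual loading code in `llm.py`; older `transformers` releases may additionally need `trust_remote_code=True` for Phi-3.

```python
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same default as the LOCAL_MODEL Spaces Variable documented above.
model_id = os.getenv("LOCAL_MODEL", "microsoft/Phi-3-mini-4k-instruct")

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Loading {model_id} (target device: {device})...")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    device_map="auto",  # requires `accelerate`; places weights on GPU if present
)
print(f"Model loaded on {model.device}")  # expect cuda:0 on GPU hardware

# A tiny generation proves the whole load/generate path works.
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The downloaded weights land in the Hub cache (`~/.cache/huggingface` by default), which is why only the first load pays the download cost.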
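Every knob in this guide is a plain environment variable, so Spaces Variables override the defaults without any `.env` file. The variable names below match the ones documented in this guide (including `MAX_TOKENS_PER_REQUEST` from the next section), but the parsing code is an illustrative sketch of the pattern rather than the app's actual configuration module, and the `SPACE_ID` check is just one common way to detect the Spaces runtime.

```python
import os

# Defaults mirror the "Default configuration" block above;
# Spaces Variables (Settings → Variables) override them at startup.
DEBUG_MODE = os.getenv("DEBUG_MODE", "False").strip().lower() == "true"
LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0.7"))
LLM_TIMEOUT = int(os.getenv("LLM_TIMEOUT", "120"))
LOCAL_MODEL = os.getenv("LOCAL_MODEL", "microsoft/Phi-3-mini-4k-instruct")
MAX_TOKENS_PER_REQUEST = int(os.getenv("MAX_TOKENS_PER_REQUEST", "1500"))

# HuggingFace Spaces sets SPACE_ID inside the container, so its presence
# is a reasonable signal for forcing the local backend.
IS_SPACES = "SPACE_ID" in os.environ
LLM_BACKEND = "local" if IS_SPACES else os.getenv("LLM_BACKEND", "local")
```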
---

## Performance Optimization

### 1. Reduce Context Window

Edit `llm.py` line 399:

```python
max_length=2000  # Reduce from 3500 for faster processing
```

### 2. Lower Token Limit

Set a Spaces Variable:

```
MAX_TOKENS_PER_REQUEST=800  # Default is 1500
```

### 3. Use a Smaller Model

```
LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

### 4. Disable Debug Mode

```
DEBUG_MODE=False
```

---

## Monitoring

### View Logs

1. Go to your Space
2. Click the **Logs** tab at the top
3. Look for the startup messages:

```
✅ Configuration loaded for HuggingFace Spaces
🚀 TranscriptorAI Enterprise - LLM Backend: local
[Local Model] Loading microsoft/Phi-3-mini-4k-instruct...
[Local Model] ✅ Model loaded on cuda:0
```

### Check Processing

During analysis, you should see:

```
[Local Model] Generating (1500 max tokens, temp=0.7)...
[Local Model] ✅ Generated 1247 characters
[LLM Debug] ✅ Successfully extracted JSON with 7 fields
```

---

## Cost Estimation

### Free Tier (CPU)

- ⚠️ Very slow but free
- ~5-10 minutes per transcript

### GPU (T4)

- ~$0.60/hour
- ⚡ Fast processing: ~30-60 seconds per transcript
- Space sleeps after inactivity (saves money)

### Persistent GPU (Upgraded)

- Always on for instant access
- Higher cost but the best user experience

---

## Security Notes

1. **No API keys needed:** everything runs locally
2. **Private processing:** data never leaves your Space
3. **Secrets management:** use Spaces Secrets (not Variables) for sensitive data
4. **Model access:** Phi-3 and most listed models don't require gated access

---

## Next Steps

1. ✅ Upload code to your Space
2. ✅ Select GPU hardware
3. ✅ Wait for the first model download (~2-5 min)
4. ✅ Test with a sample transcript
5. 🎉 Share your Space URL!

---

## Support

- **HuggingFace Spaces Docs:** https://huggingface.co/docs/hub/spaces
- **Transformers Docs:** https://huggingface.co/docs/transformers
- **GPU Pricing:** https://huggingface.co/pricing

---

**Last Updated:** October 2025