| # API Layer (Python FastAPI, port 8000) | |
| Local Voxtral inference pipeline. Loads `mistralai/Voxtral-Mini-3B-2507` + `YongkangZOU/evoxtral-lora` (PEFT adapter) locally, runs VAD sentence segmentation, per-segment emotion tagging, and facial emotion recognition (FER) for video inputs. | |
| **Requirements**: Python 3.11+, system **ffmpeg** (`brew install ffmpeg` / `apt install ffmpeg`). GPU with ~8 GB VRAM recommended; CPU fallback supported (expect ~50 s per audio second). | |
| --- | |
| ## Startup | |
| ```bash | |
| cd api | |
| python -m venv .venv | |
| source .venv/bin/activate # Windows: .venv\Scripts\activate | |
| pip install -r requirements.txt | |
| uvicorn main:app --host 0.0.0.0 --port 8000 --reload | |
| ``` | |
| Default port: **8000**. On first start the Voxtral model (~8 GB total) is downloaded from HuggingFace. Set `HF_HUB_DISABLE_XET=1` if the download stalls behind a local proxy. | |
| --- | |
| ## API | |
| ### POST /transcribe | |
| Simple transcription. Audio is converted to WAV and passed to the local Voxtral model. | |
| | | | | |
| |--|--| | |
| | **Content-Type** | `multipart/form-data` | | |
| | **Body** | `audio`: audio or video file (required) | |
| | **Formats** | wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv | | |
| | **Max size** | `MAX_UPLOAD_MB` (default 100 MB) | | |
| **Response (200)** | |
| ```json | |
| { | |
| "text": "Hello! [laughs] How are you?", | |
| "words": [], | |
| "languageCode": "en" | |
| } | |
| ``` | |
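A client can catch the most common rejections before uploading. The sketch below mirrors the format list and size limit from the table above; `preflight` and `SUPPORTED_FORMATS` are hypothetical names, and treating 1 MB as 1024 × 1024 bytes is an assumption (the server may count differently or apply stricter checks).

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the "Formats" row above.
SUPPORTED_FORMATS = {"wav", "mp3", "flac", "ogg", "m4a", "webm", "mp4", "mov", "mkv"}


def preflight(filename: str, size_bytes: int, max_upload_mb: int = 100) -> Optional[int]:
    """Return the HTTP status the API would likely answer with, or None if the upload looks fine."""
    ext = Path(filename).suffix.lower().lstrip(".")
    # Empty files and unsupported extensions map to 400 in the error table.
    if size_bytes == 0 or ext not in SUPPORTED_FORMATS:
        return 400
    # Assumption: 1 MB = 1024 * 1024 bytes.
    if size_bytes > max_upload_mb * 1024 * 1024:
        return 413
    return None
```

For example, `preflight("notes.txt", 1024)` returns `400`, while a 1 KB `audio.m4a` passes and returns `None`.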
| --- | |
| ### POST /transcribe-diarize | |
| Full pipeline: local Voxtral STT + VAD sentence segmentation + per-segment emotion tagging. For video inputs, also runs per-frame FER via MobileViT-XXS ONNX and adds `face_emotion` per segment. | |
| | | | | |
| |--|--| | |
| | **Content-Type** | `multipart/form-data` | | |
| | **Body** | `audio`: audio or video file (required) | |
| | **Formats** | wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv | | |
| | **Max size** | `MAX_UPLOAD_MB` (default 100 MB) | | |
| Segmentation: silence gaps ≥ 0.3 s create a new segment; gaps < 0.3 s are merged. | |
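The merge rule can be illustrated with a small sketch. It is not the service's actual implementation: it assumes the VAD stage yields `(start, end)` speech intervals in seconds, and `merge_by_silence` is a hypothetical helper name.

```python
def merge_by_silence(intervals, min_gap=0.3):
    """Merge (start, end) speech intervals separated by less than min_gap seconds."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] < min_gap:
            # Silence is shorter than the threshold: extend the previous segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # Silence is at least min_gap long (or this is the first interval): new segment.
            merged.append((start, end))
    return merged
```

With intervals `(0.0, 0.5)`, `(0.75, 1.0)` and `(2.0, 3.0)`, the first two are joined (gap of 0.25 s) and the third stays separate (gap of 1.0 s).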
| **Response (200), audio input** | |
| ```json | |
| { | |
| "segments": [ | |
| { | |
| "id": 1, | |
| "speaker": "SPEAKER_00", | |
| "start": 0.96, | |
| "end": 3.23, | |
| "text": "Hello! [laughs] How are you?", | |
| "emotion": "Happy", | |
| "valence": 0.7, | |
| "arousal": 0.6 | |
| } | |
| ], | |
| "duration": 5.65, | |
| "text": "Hello! [laughs] How are you?", | |
| "filename": "audio.m4a", | |
| "diarization_method": "vad", | |
| "has_video": false | |
| } | |
| ``` | |
| **Response (200), video input** (adds `face_emotion` per segment) | |
| ```json | |
| { | |
| "segments": [ | |
| { | |
| "id": 1, | |
| "speaker": "SPEAKER_00", | |
| "start": 0.96, | |
| "end": 3.23, | |
| "text": "Hello!", | |
| "emotion": "Happy", | |
| "valence": 0.7, | |
| "arousal": 0.6, | |
| "face_emotion": "Happy" | |
| } | |
| ], | |
| "duration": 5.65, | |
| "text": "Hello!", | |
| "filename": "video.mov", | |
| "diarization_method": "vad", | |
| "has_video": true | |
| } | |
| ``` | |
| `face_emotion` values: `Anger | Contempt | Disgust | Fear | Happy | Neutral | Sad | Surprise` | |
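One way to consume the video response is to flag segments where the voice-based `emotion` and the `face_emotion` disagree. This is a minimal sketch over the JSON shape shown above; `emotion_mismatches` is a hypothetical helper, and it assumes both fields use comparable labels, which the API description does not guarantee.

```python
def emotion_mismatches(response: dict) -> list:
    """Return ids of segments whose face_emotion is present and differs from emotion."""
    mismatches = []
    for seg in response.get("segments", []):
        face = seg.get("face_emotion")  # absent for audio-only inputs
        if face is not None and face != seg.get("emotion"):
            mismatches.append(seg["id"])
    return mismatches
```

For audio-only responses no segment carries `face_emotion`, so the function returns an empty list.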
| **Errors** | |
| | Status | Meaning | | |
| |--------|---------| | |
| | 400 | No/invalid file, empty, or unsupported format | | |
| | 413 | File exceeds `MAX_UPLOAD_MB` | | |
| | 500 | Transcription or inference error | | |
| --- | |
| ### GET /health | |
| **Response (200)** | |
| ```json | |
| { | |
| "status": "ok", | |
| "model": "mistralai/Voxtral-Mini-3B-2507 + YongkangZOU/evoxtral-lora (local)", | |
| "model_loaded": true, | |
| "ffmpeg": true, | |
| "fer_enabled": true, | |
| "device": "cpu", | |
| "max_upload_mb": 100 | |
| } | |
| ``` | |
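A deployment script might gate traffic on this payload. The sketch below checks only the fields shown in the example response; `is_ready` is a hypothetical helper, and requiring `fer_enabled` only for video workloads is an assumption about how callers would use it.

```python
def is_ready(health: dict, need_fer: bool = False) -> bool:
    """Decide from a /health payload whether the service can accept uploads."""
    ready = (
        health.get("status") == "ok"
        and health.get("model_loaded") is True
        and health.get("ffmpeg") is True
    )
    if need_fer:
        # Video callers also need the FER model to be available.
        ready = ready and health.get("fer_enabled") is True
    return ready
```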
| --- | |
| ### GET /debug-inference | |
| Smoke-test endpoint: synthesizes 0.5 s of silence and runs a minimal `generate()` call. Useful for verifying the model is loaded and functional without uploading a real file. | |
| **Response (200)** | |
| ```json | |
| { | |
| "ok": true, | |
| "text": "", | |
| "dtype": "torch.bfloat16", | |
| "device": "cpu" | |
| } | |
| ``` | |
| --- | |
| ## Environment variables | |
| | Variable | Default | Description | | |
| |----------|---------|-------------| | |
| | `MODEL_ID` | `mistralai/Voxtral-Mini-3B-2507` | Base Voxtral model on HF Hub | | |
| | `ADAPTER_ID` | `YongkangZOU/evoxtral-lora` | PEFT LoRA adapter on HF Hub | | |
| | `FER_MODEL_PATH` | (auto-detected) | Path to `emotion_model_web.onnx`; auto-detects `/app/models/` (Docker) and `../models/` (local) | | |
| | `MAX_UPLOAD_MB` | `100` | Max upload size in MB | | |
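As an illustration of how these variables and their defaults fit together, here is a sketch of a settings loader. It is not taken from the service's code: `load_settings` and the dictionary keys are hypothetical, and `None` for `fer_model_path` stands in for the auto-detection described above.

```python
import os


def load_settings(env=os.environ) -> dict:
    """Read the documented variables from a mapping, falling back to the documented defaults."""
    return {
        "model_id": env.get("MODEL_ID", "mistralai/Voxtral-Mini-3B-2507"),
        "adapter_id": env.get("ADAPTER_ID", "YongkangZOU/evoxtral-lora"),
        # None means "not set": the service then auto-detects the ONNX file.
        "fer_model_path": env.get("FER_MODEL_PATH"),
        "max_upload_mb": int(env.get("MAX_UPLOAD_MB", "100")),
    }
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to inspect, for example `load_settings({"MAX_UPLOAD_MB": "250"})`.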
| --- | |
| ## Usage examples | |
| ```bash | |
| # Health | |
| curl -s http://127.0.0.1:8000/health | |
| # Smoke-test inference | |
| curl -s http://127.0.0.1:8000/debug-inference | |
| # Simple transcription | |
| curl -X POST http://127.0.0.1:8000/transcribe -F "audio=@audio.m4a" | |
| # Full pipeline (audio) | |
| curl -X POST http://127.0.0.1:8000/transcribe-diarize -F "audio=@audio.m4a" | |
| # Full pipeline (video; also returns face_emotion per segment) | |
| curl -X POST http://127.0.0.1:8000/transcribe-diarize -F "audio=@video.mov" | |
| ``` | |
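The same upload can be made from Python without third-party packages. The sketch below builds the `multipart/form-data` body by hand and posts it with `urllib`; `build_multipart` and `post_audio` are hypothetical helper names, the per-part `application/octet-stream` type is an assumption, and the base URL matches the curl examples above.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes, boundary: str):
    """Encode one file as a multipart/form-data body; returns (body, Content-Type header)."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_audio(path: str, endpoint: str = "/transcribe-diarize",
               base_url: str = "http://127.0.0.1:8000") -> dict:
    """Upload a file under the `audio` field and return the parsed JSON response."""
    with open(path, "rb") as f:
        data = f.read()
    body, content_type = build_multipart(
        "audio", os.path.basename(path), data, uuid.uuid4().hex
    )
    req = urllib.request.Request(
        base_url + endpoint,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # Raises urllib.error.HTTPError for the 400/413/500 responses listed earlier.
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `post_audio("audio.m4a")` should return the same JSON as the corresponding curl call.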