API Layer (Python FastAPI — port 8000)
Local Voxtral inference pipeline. Loads mistralai/Voxtral-Mini-3B-2507 with the YongkangZOU/evoxtral-lora PEFT adapter locally, then runs VAD (voice activity detection) sentence segmentation, per-segment emotion tagging, and, for video inputs, facial emotion recognition (FER).
Requirements: Python 3.11+ and a system ffmpeg install (brew install ffmpeg / apt install ffmpeg). A GPU with ~8 GB VRAM is recommended; CPU fallback is supported, but expect roughly 50 s of processing per second of audio.
Startup
cd api
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
Default port: 8000. On first start, the Voxtral model (~8 GB total) is downloaded from the Hugging Face Hub. Set HF_HUB_DISABLE_XET=1 if the download stalls behind a local proxy.
API
POST /transcribe
Simple transcription. Audio is converted to WAV and passed to the local Voxtral model.
| Field | Value |
|---|---|
| Content-Type | multipart/form-data |
| Body | audio — audio or video file (required) |
| Formats | wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv |
| Max size | MAX_UPLOAD_MB (default 100 MB) |
Response (200)
{
"text": "Hello! [laughs] How are you?",
"words": [],
"languageCode": "en"
}
POST /transcribe-diarize
Full pipeline: local Voxtral STT + VAD sentence segmentation + per-segment emotion tagging. For video inputs, also runs per-frame FER via MobileViT-XXS ONNX and adds face_emotion per segment.
| Field | Value |
|---|---|
| Content-Type | multipart/form-data |
| Body | audio — audio or video file (required) |
| Formats | wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv |
| Max size | MAX_UPLOAD_MB (default 100 MB) |
Segmentation: silence gaps ≥ 0.3 s create a new segment; gaps < 0.3 s are merged.
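The merge rule can be illustrated with a short sketch. It assumes the VAD stage yields a list of `(start, end)` speech spans in seconds; `merge_speech_spans` is a hypothetical name and is not necessarily how the server implements segmentation.

```python
def merge_speech_spans(spans, min_gap=0.3):
    """Merge (start, end) speech spans whose silence gap is shorter than min_gap seconds."""
    merged = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] < min_gap:
            # Gap below the threshold: extend the previous segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # First span, or gap >= min_gap: start a new segment.
            merged.append((start, end))
    return merged
```

For example, spans `(0.0, 1.0)` and `(1.1, 2.0)` are separated by about 0.1 s of silence and collapse into `(0.0, 2.0)`, while a following span starting at 2.5 s stays separate.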
Response (200) — audio input
{
"segments": [
{
"id": 1,
"speaker": "SPEAKER_00",
"start": 0.96,
"end": 3.23,
"text": "Hello! [laughs] How are you?",
"emotion": "Happy",
"valence": 0.7,
"arousal": 0.6
}
],
"duration": 5.65,
"text": "Hello! [laughs] How are you?",
"filename": "audio.m4a",
"diarization_method": "vad",
"has_video": false
}
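A client will usually walk the `segments` array of this response. The sketch below turns a parsed response into one readable line per segment; `format_segments` is a hypothetical helper, and it treats `face_emotion` as optional because that field is only present for video inputs.

```python
def format_segments(response: dict) -> list[str]:
    """Render each segment of a /transcribe-diarize response as a single text line."""
    lines = []
    for seg in response["segments"]:
        line = (
            f"[{seg['start']:.2f}-{seg['end']:.2f}] "
            f"{seg['speaker']} ({seg['emotion']}): {seg['text']}"
        )
        # face_emotion is only returned when the upload contained video.
        face = seg.get("face_emotion")
        if face is not None:
            line += f" [face: {face}]"
        lines.append(line)
    return lines
```

Applied to the audio response above, this yields a single line beginning with `[0.96-3.23] SPEAKER_00 (Happy):`.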
Response (200) — video input (adds face_emotion per segment)
{
"segments": [
{
"id": 1,
"speaker": "SPEAKER_00",
"start": 0.96,
"end": 3.23,
"text": "Hello!",
"emotion": "Happy",
"valence": 0.7,
"arousal": 0.6,
"face_emotion": "Happy"
}
],
"duration": 5.65,
"text": "Hello!",
"filename": "video.mov",
"diarization_method": "vad",
"has_video": true
}
face_emotion values: Anger | Contempt | Disgust | Fear | Happy | Neutral | Sad | Surprise
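How per-frame FER predictions are reduced to one `face_emotion` per segment is not documented here. One plausible approach, shown purely as an assumption, is a majority vote over the frames whose timestamps fall inside the segment. `segment_face_emotion` and the `(timestamp, label)` frame format are hypothetical.

```python
from collections import Counter

# Labels documented for the face_emotion field.
FACE_EMOTIONS = (
    "Anger", "Contempt", "Disgust", "Fear",
    "Happy", "Neutral", "Sad", "Surprise",
)


def segment_face_emotion(frames, start, end):
    """Return the most common label among frames inside [start, end], or None.

    frames: iterable of (timestamp_seconds, label) pairs.
    Assumption: majority vote; the server may aggregate differently.
    """
    labels = [
        label
        for timestamp, label in frames
        if start <= timestamp <= end and label in FACE_EMOTIONS
    ]
    if not labels:
        return None  # no usable frame in this segment
    return Counter(labels).most_common(1)[0][0]
```

A majority vote is robust to a few misclassified frames, though averaging per-class probabilities would be a reasonable alternative when the model exposes them.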
Errors
| Status | Meaning |
|---|---|
| 400 | Missing, invalid, or empty file, or unsupported format |
| 413 | File exceeds MAX_UPLOAD_MB |
| 500 | Transcription or inference error |
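On the client side, the status codes above can be mapped to a message and a retry decision. This is a sketch with a hypothetical helper, `describe_status`. Treating 500 as retryable is an assumption: a transient inference failure may succeed on a second attempt, while a file the model cannot process will keep failing.

```python
def describe_status(status_code: int) -> tuple[str, bool]:
    """Return (message, retryable) for a response status from the transcription endpoints."""
    if status_code == 200:
        return ("ok", False)
    if status_code == 400:
        # The request itself is wrong; resending the same file will not help.
        return ("missing, invalid, empty, or unsupported file", False)
    if status_code == 413:
        return ("file exceeds MAX_UPLOAD_MB", False)
    if status_code == 500:
        # Assumption: inference errors may be transient, so allow a retry.
        return ("transcription or inference error", True)
    return (f"unexpected status {status_code}", False)
```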
GET /health
Response (200)
{
"status": "ok",
"model": "mistralai/Voxtral-Mini-3B-2507 + YongkangZOU/evoxtral-lora (local)",
"model_loaded": true,
"ffmpeg": true,
"fer_enabled": true,
"device": "cpu",
"max_upload_mb": 100
}
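The health payload can gate client startup, for instance by waiting until the model has finished loading. A minimal sketch using only the standard library follows; `is_ready` and `fetch_health` are hypothetical helpers, and the base URL is the local address used in the usage examples.

```python
import json
import urllib.request


def is_ready(health: dict, require_fer: bool = False) -> bool:
    """Return True when the payload reports a loaded model and a working ffmpeg."""
    ready = (
        health.get("status") == "ok"
        and health.get("model_loaded") is True
        and health.get("ffmpeg") is True
    )
    if require_fer:
        # Only needed when video inputs with face_emotion are expected.
        ready = ready and health.get("fer_enabled") is True
    return ready


def fetch_health(base_url: str = "http://127.0.0.1:8000", timeout: float = 5.0) -> dict:
    """GET /health and parse the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.load(resp)
```

Calling `is_ready(fetch_health(), require_fer=True)` before submitting video avoids uploads that would come back without `face_emotion`.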
GET /debug-inference
Smoke-test endpoint: synthesizes 0.5 s of silence and runs a minimal generate() call. Useful for verifying the model is loaded and functional without uploading a real file.
Response (200)
{
"ok": true,
"text": "",
"dtype": "torch.bfloat16",
"device": "cpu"
}
Environment variables
| Variable | Default | Description |
|---|---|---|
| MODEL_ID | mistralai/Voxtral-Mini-3B-2507 | Base Voxtral model on HF Hub |
| ADAPTER_ID | YongkangZOU/evoxtral-lora | PEFT LoRA adapter on HF Hub |
| FER_MODEL_PATH | (auto-detected) | Path to emotion_model_web.onnx; auto-detects /app/models/ (Docker) and ../models/ (local) |
| MAX_UPLOAD_MB | 100 | Max upload size in MB |
Usage examples
# Health
curl -s http://127.0.0.1:8000/health
# Smoke-test inference
curl -s http://127.0.0.1:8000/debug-inference
# Simple transcription
curl -X POST http://127.0.0.1:8000/transcribe -F "audio=@audio.m4a"
# Full pipeline (audio)
curl -X POST http://127.0.0.1:8000/transcribe-diarize -F "audio=@audio.m4a"
# Full pipeline (video — also returns face_emotion per segment)
curl -X POST http://127.0.0.1:8000/transcribe-diarize -F "audio=@video.mov"