# API Layer (Python FastAPI — port 8000)
Local Voxtral inference pipeline. Loads `mistralai/Voxtral-Mini-3B-2507` + `YongkangZOU/evoxtral-lora` (PEFT adapter) locally, runs VAD sentence segmentation, per-segment emotion tagging, and facial emotion recognition (FER) for video inputs.
**Requirements**: Python 3.11+, system **ffmpeg** (`brew install ffmpeg` / `apt install ffmpeg`). A GPU with ~8 GB of VRAM is recommended; CPU fallback is supported (expect roughly 50 s of processing per second of audio).
---
## Startup
```bash
cd api
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```
Default port: **8000**. On first start the Voxtral base model and LoRA adapter (~8 GB total) are downloaded from the Hugging Face Hub. Set `HF_HUB_DISABLE_XET=1` if the download stalls behind a local proxy.
---
## API
### POST /transcribe
Simple transcription. Audio is converted to WAV and passed to the local Voxtral model.
| | |
|--|--|
| **Content-Type** | `multipart/form-data` |
| **Body** | `audio` — audio or video file (required) |
| **Formats** | wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv |
| **Max size** | `MAX_UPLOAD_MB` (default 100 MB) |
**Response (200)**
```json
{
  "text": "Hello! [laughs] How are you?",
  "words": [],
  "languageCode": "en"
}
```
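The endpoint can also be called from Python. A minimal client sketch (the `requests` usage and helper names here are illustrative, not part of the API):

```python
import os
import requests

# Formats accepted by /transcribe, per the table above.
ALLOWED_EXTENSIONS = {"wav", "mp3", "flac", "ogg", "m4a", "webm", "mp4", "mov", "mkv"}

def is_supported(path: str) -> bool:
    """Check the file extension against the supported-format list."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return ext in ALLOWED_EXTENSIONS

def transcribe(path: str, base_url: str = "http://127.0.0.1:8000") -> str:
    """POST the file as the `audio` form field and return the transcript text."""
    if not is_supported(path):
        raise ValueError(f"unsupported format: {path}")
    with open(path, "rb") as f:
        resp = requests.post(f"{base_url}/transcribe", files={"audio": f})
    resp.raise_for_status()  # surfaces 400/413/500 as exceptions
    return resp.json()["text"]
```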
---
### POST /transcribe-diarize
Full pipeline: local Voxtral STT + VAD sentence segmentation + per-segment emotion tagging. For video inputs, also runs per-frame FER via MobileViT-XXS ONNX and adds `face_emotion` per segment.
| | |
|--|--|
| **Content-Type** | `multipart/form-data` |
| **Body** | `audio` — audio or video file (required) |
| **Formats** | wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv |
| **Max size** | `MAX_UPLOAD_MB` (default 100 MB) |
Segmentation: silence gaps ≥ 0.3 s create a new segment; gaps < 0.3 s are merged.
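The 0.3 s rule can be sketched as an interval-merging step (a hypothetical helper, not the server's actual implementation): speech intervals from VAD whose silence gap is below the threshold collapse into one segment.

```python
GAP_THRESHOLD = 0.3  # seconds; silence gaps >= 0.3 s start a new segment

def merge_vad_intervals(intervals, gap=GAP_THRESHOLD):
    """Merge (start, end) speech intervals separated by less than `gap` seconds."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] < gap:
            merged[-1][1] = max(merged[-1][1], end)  # gap too short: extend previous
        else:
            merged.append([start, end])  # gap >= threshold: new segment
    return [tuple(seg) for seg in merged]
```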
**Response (200) — audio input**
```json
{
  "segments": [
    {
      "id": 1,
      "speaker": "SPEAKER_00",
      "start": 0.96,
      "end": 3.23,
      "text": "Hello! [laughs] How are you?",
      "emotion": "Happy",
      "valence": 0.7,
      "arousal": 0.6
    }
  ],
  "duration": 5.65,
  "text": "Hello! [laughs] How are you?",
  "filename": "audio.m4a",
  "diarization_method": "vad",
  "has_video": false
}
```
**Response (200) — video input** (adds `face_emotion` per segment)
```json
{
  "segments": [
    {
      "id": 1,
      "speaker": "SPEAKER_00",
      "start": 0.96,
      "end": 3.23,
      "text": "Hello!",
      "emotion": "Happy",
      "valence": 0.7,
      "arousal": 0.6,
      "face_emotion": "Happy"
    }
  ],
  "duration": 5.65,
  "text": "Hello!",
  "filename": "video.mov",
  "diarization_method": "vad",
  "has_video": true
}
```
`face_emotion` values: `Anger | Contempt | Disgust | Fear | Happy | Neutral | Sad | Surprise`
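Having both an audio `emotion` and a facial `face_emotion` per segment allows simple cross-modal checks on the client side. A hypothetical post-processing sketch (not something the API computes):

```python
def emotion_agreement(segments):
    """Fraction of video segments whose audio emotion matches the FER label.

    Returns None if no segment carries a `face_emotion` field.
    """
    pairs = [(s["emotion"], s["face_emotion"]) for s in segments
             if s.get("face_emotion") is not None]
    if not pairs:
        return None
    return sum(audio == face for audio, face in pairs) / len(pairs)
```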
**Errors**
| Status | Meaning |
|--------|---------|
| 400 | No/invalid file, empty, or unsupported format |
| 413 | File exceeds `MAX_UPLOAD_MB` |
| 500 | Transcription or inference error |
---
### GET /health
**Response (200)**
```json
{
"status": "ok",
"model": "mistralai/Voxtral-Mini-3B-2507 + YongkangZOU/evoxtral-lora (local)",
"model_loaded": true,
"ffmpeg": true,
"fer_enabled": true,
"device": "cpu",
"max_upload_mb": 100
}
```
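A readiness probe can gate traffic on the `status`, `model_loaded`, and `ffmpeg` flags from this response. A minimal sketch (field names as documented above):

```python
def is_ready(health: dict) -> bool:
    """True when the service reports ok status with the model and ffmpeg available."""
    return (health.get("status") == "ok"
            and health.get("model_loaded") is True
            and health.get("ffmpeg") is True)
```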
---
### GET /debug-inference
Smoke-test endpoint: synthesizes 0.5 s of silence and runs a minimal `generate()` call. Useful for verifying the model is loaded and functional without uploading a real file.
**Response (200)**
```json
{
  "ok": true,
  "text": "",
  "dtype": "torch.bfloat16",
  "device": "cpu"
}
```
---
## Environment variables
| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_ID` | `mistralai/Voxtral-Mini-3B-2507` | Base Voxtral model on HF Hub |
| `ADAPTER_ID` | `YongkangZOU/evoxtral-lora` | PEFT LoRA adapter on HF Hub |
| `FER_MODEL_PATH` | (auto-detected) | Path to `emotion_model_web.onnx`; auto-detects `/app/models/` (Docker) and `../models/` (local) |
| `MAX_UPLOAD_MB` | `100` | Max upload size in MB |
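A sketch of how these variables might be read at startup, with the documented defaults (illustrative; the actual handling in the server code may differ):

```python
import os

# Defaults mirror the table above.
MODEL_ID = os.environ.get("MODEL_ID", "mistralai/Voxtral-Mini-3B-2507")
ADAPTER_ID = os.environ.get("ADAPTER_ID", "YongkangZOU/evoxtral-lora")
MAX_UPLOAD_MB = int(os.environ.get("MAX_UPLOAD_MB", "100"))
MAX_UPLOAD_BYTES = MAX_UPLOAD_MB * 1024 * 1024  # limit enforced on uploads (413 if exceeded)
```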
---
## Usage examples
```bash
# Health
curl -s http://127.0.0.1:8000/health
# Smoke-test inference
curl -s http://127.0.0.1:8000/debug-inference
# Simple transcription
curl -X POST http://127.0.0.1:8000/transcribe -F "audio=@audio.m4a"
# Full pipeline (audio)
curl -X POST http://127.0.0.1:8000/transcribe-diarize -F "audio=@audio.m4a"
# Full pipeline (video — also returns face_emotion per segment)
curl -X POST http://127.0.0.1:8000/transcribe-diarize -F "audio=@video.mov"
```