# API Layer (Python FastAPI — port 8000)

Local Voxtral inference pipeline. Loads `mistralai/Voxtral-Mini-3B-2507` + `YongkangZOU/evoxtral-lora` (PEFT adapter) locally, runs VAD sentence segmentation, per-segment emotion tagging, and facial emotion recognition (FER) for video inputs.

**Requirements**: Python 3.11+, system **ffmpeg** (`brew install ffmpeg` / `apt install ffmpeg`). GPU with ~8 GB VRAM recommended; CPU fallback supported (expect ~50 s per audio second).

---

## Startup

```bash
cd api
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

Default port: **8000**. On first start the Voxtral model (~8 GB total) is downloaded from HuggingFace. Set `HF_HUB_DISABLE_XET=1` if download stalls behind a local proxy.

---

## API

### POST /transcribe

Simple transcription. Audio is converted to WAV and passed to the local Voxtral model.

| | |
|--|--|
| **Content-Type** | `multipart/form-data` |
| **Body** | `audio` — audio or video file (required) |
| **Formats** | wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv |
| **Max size** | `MAX_UPLOAD_MB` (default 100 MB) |

**Response (200)**

```json
{
  "text": "Hello! [laughs] How are you?",
  "words": [],
  "languageCode": "en"
}
```

---

### POST /transcribe-diarize

Full pipeline: local Voxtral STT + VAD sentence segmentation + per-segment emotion tagging. For video inputs, also runs per-frame FER via MobileViT-XXS ONNX and adds `face_emotion` per segment.

| | |
|--|--|
| **Content-Type** | `multipart/form-data` |
| **Body** | `audio` — audio or video file (required) |
| **Formats** | wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv |
| **Max size** | `MAX_UPLOAD_MB` (default 100 MB) |

Segmentation: silence gaps ≥ 0.3 s create a new segment; gaps < 0.3 s are merged.
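The gap rule above can be sketched as a small merge pass over VAD speech spans. This is an illustrative sketch, not the service's actual code; the function name and the `(start, end)` tuple representation are assumptions.

```python
# Illustrative sketch of the 0.3 s gap rule (not the actual implementation).
# Given VAD speech spans as (start, end) tuples in seconds, spans separated
# by less than the threshold are merged into one segment; larger silence
# gaps start a new segment.

MIN_GAP_S = 0.3

def merge_spans(spans: list[tuple[float, float]]) -> list[tuple[float, float]]:
    segments: list[tuple[float, float]] = []
    for start, end in sorted(spans):
        if segments and start - segments[-1][1] < MIN_GAP_S:
            # Gap below threshold: extend the previous segment.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments
```

For example, spans `(0.0, 1.0)` and `(1.1, 2.0)` (a 0.1 s gap) collapse into one segment, while a span starting at 2.5 s after a segment ending at 2.0 s (a 0.5 s gap) stays separate.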

**Response (200) — audio input**

```json
{
  "segments": [
    {
      "id": 1,
      "speaker": "SPEAKER_00",
      "start": 0.96,
      "end": 3.23,
      "text": "Hello! [laughs] How are you?",
      "emotion": "Happy",
      "valence": 0.7,
      "arousal": 0.6
    }
  ],
  "duration": 5.65,
  "text": "Hello! [laughs] How are you?",
  "filename": "audio.m4a",
  "diarization_method": "vad",
  "has_video": false
}
```

**Response (200) — video input** (adds `face_emotion` per segment)

```json
{
  "segments": [
    {
      "id": 1,
      "speaker": "SPEAKER_00",
      "start": 0.96,
      "end": 3.23,
      "text": "Hello!",
      "emotion": "Happy",
      "valence": 0.7,
      "arousal": 0.6,
      "face_emotion": "Happy"
    }
  ],
  "duration": 5.65,
  "text": "Hello!",
  "filename": "video.mov",
  "diarization_method": "vad",
  "has_video": true
}
```

`face_emotion` values: `Anger | Contempt | Disgust | Fear | Happy | Neutral | Sad | Surprise`
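Downstream, the FER output reduces to an argmax over these eight classes. A minimal sketch, assuming the ONNX model emits one score per class in the label order listed above (the output ordering is an assumption, not taken from the model card):

```python
# Hypothetical post-processing of the FER model output: pick the label with
# the highest score. Label order matches the values documented above, but
# whether the ONNX model uses this exact order is an assumption.

FER_LABELS = ["Anger", "Contempt", "Disgust", "Fear",
              "Happy", "Neutral", "Sad", "Surprise"]

def face_emotion_from_scores(scores: list[float]) -> str:
    return FER_LABELS[max(range(len(FER_LABELS)), key=lambda i: scores[i])]
```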

**Errors**

| Status | Meaning |
|--------|---------|
| 400 | Missing or invalid file, empty upload, or unsupported format |
| 413 | File exceeds `MAX_UPLOAD_MB` |
| 500 | Transcription or inference error |

---

### GET /health

**Response (200)**

```json
{
  "status": "ok",
  "model": "mistralai/Voxtral-Mini-3B-2507 + YongkangZOU/evoxtral-lora (local)",
  "model_loaded": true,
  "ffmpeg": true,
  "fer_enabled": true,
  "device": "cpu",
  "max_upload_mb": 100
}
```

---

### GET /debug-inference

Smoke-test endpoint: synthesizes 0.5 s of silence and runs a minimal `generate()` call. Useful for verifying the model is loaded and functional without uploading a real file.

**Response (200)**

```json
{
  "ok": true,
  "text": "",
  "dtype": "torch.bfloat16",
  "device": "cpu"
}
```
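The synthetic input described above can be pictured as a zero-filled buffer. A sketch, assuming Voxtral's usual 16 kHz input rate (the sample rate and buffer representation here are assumptions, not the endpoint's actual code):

```python
# Sketch of the kind of synthetic input /debug-inference describes:
# 0.5 s of silence at an assumed 16 kHz sample rate.
SAMPLE_RATE = 16_000

def make_silence(seconds: float = 0.5) -> list[float]:
    return [0.0] * int(seconds * SAMPLE_RATE)
```

Transcribing silence is expected to yield an empty `text`, which is why the endpoint's sample response above shows `"text": ""`.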

---

## Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_ID` | `mistralai/Voxtral-Mini-3B-2507` | Base Voxtral model on HF Hub |
| `ADAPTER_ID` | `YongkangZOU/evoxtral-lora` | PEFT LoRA adapter on HF Hub |
| `FER_MODEL_PATH` | (auto-detected) | Path to `emotion_model_web.onnx`; auto-detects `/app/models/` (Docker) and `../models/` (local) |
| `MAX_UPLOAD_MB` | `100` | Max upload size in MB |
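For example, overrides can be exported before launch (the `FER_MODEL_PATH` value below is illustrative; the auto-detected default usually suffices):

```shell
# Raise the upload cap and pin the FER model path explicitly.
export MAX_UPLOAD_MB=200
export FER_MODEL_PATH=../models/emotion_model_web.onnx
uvicorn main:app --host 0.0.0.0 --port 8000
```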

---

## Usage examples

```bash
# Health
curl -s http://127.0.0.1:8000/health

# Smoke-test inference
curl -s http://127.0.0.1:8000/debug-inference

# Simple transcription
curl -X POST http://127.0.0.1:8000/transcribe -F "audio=@audio.m4a"

# Full pipeline (audio)
curl -X POST http://127.0.0.1:8000/transcribe-diarize -F "audio=@audio.m4a"

# Full pipeline (video — also returns face_emotion per segment)
curl -X POST http://127.0.0.1:8000/transcribe-diarize -F "audio=@video.mov"
```
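For consumers of `/transcribe-diarize`, a small helper can flatten the documented response into readable lines. This is a sketch against the response schema shown above; the function name and output format are illustrative:

```python
# Render a /transcribe-diarize response (schema documented above) as one
# line per segment: time range, speaker, emotion(s), text. For video
# inputs the optional face_emotion is appended after a slash.

def format_segments(resp: dict) -> list[str]:
    lines = []
    for seg in resp.get("segments", []):
        emotion = seg["emotion"]
        if "face_emotion" in seg:
            emotion += f"/{seg['face_emotion']}"
        lines.append(f"[{seg['start']:.2f}-{seg['end']:.2f}] "
                     f"{seg['speaker']} ({emotion}): {seg['text']}")
    return lines
```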