# API Layer (Python FastAPI — port 8000)

Local Voxtral inference pipeline. Loads `mistralai/Voxtral-Mini-3B-2507` + `YongkangZOU/evoxtral-lora` (PEFT adapter) locally, runs VAD sentence segmentation, per-segment emotion tagging, and facial emotion recognition (FER) for video inputs.

**Requirements**: Python 3.11+, system **ffmpeg** (`brew install ffmpeg` / `apt install ffmpeg`). GPU with ~8 GB VRAM recommended; CPU fallback supported (expect roughly 50 s of processing per second of audio).

---

## Startup

```bash
cd api
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

Default port: **8000**. On first start the Voxtral model (~8 GB total) is downloaded from Hugging Face. Set `HF_HUB_DISABLE_XET=1` if the download stalls behind a local proxy.

---

## API

### POST /transcribe

Simple transcription. The upload is converted to WAV and passed to the local Voxtral model.

| | |
|--|--|
| **Content-Type** | `multipart/form-data` |
| **Body** | `audio` — audio or video file (required) |
| **Formats** | wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv |
| **Max size** | `MAX_UPLOAD_MB` (default 100 MB) |

**Response (200)**

```json
{
  "text": "Hello! [laughs] How are you?",
  "words": [],
  "languageCode": "en"
}
```

---

### POST /transcribe-diarize

Full pipeline: local Voxtral STT + VAD sentence segmentation + per-segment emotion tagging. For video inputs, also runs per-frame FER via MobileViT-XXS ONNX and adds `face_emotion` to each segment.

| | |
|--|--|
| **Content-Type** | `multipart/form-data` |
| **Body** | `audio` — audio or video file (required) |
| **Formats** | wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv |
| **Max size** | `MAX_UPLOAD_MB` (default 100 MB) |

Segmentation: silence gaps ≥ 0.3 s start a new segment; gaps < 0.3 s are merged into the current one.

**Response (200) — audio input**

```json
{
  "segments": [
    {
      "id": 1,
      "speaker": "SPEAKER_00",
      "start": 0.96,
      "end": 3.23,
      "text": "Hello! [laughs] How are you?",
      "emotion": "Happy",
      "valence": 0.7,
      "arousal": 0.6
    }
  ],
  "duration": 5.65,
  "text": "Hello! [laughs] How are you?",
  "filename": "audio.m4a",
  "diarization_method": "vad",
  "has_video": false
}
```

**Response (200) — video input** (adds `face_emotion` per segment)

```json
{
  "segments": [
    {
      "id": 1,
      "speaker": "SPEAKER_00",
      "start": 0.96,
      "end": 3.23,
      "text": "Hello!",
      "emotion": "Happy",
      "valence": 0.7,
      "arousal": 0.6,
      "face_emotion": "Happy"
    }
  ],
  "duration": 5.65,
  "text": "Hello!",
  "filename": "video.mov",
  "diarization_method": "vad",
  "has_video": true
}
```

`face_emotion` values: `Anger | Contempt | Disgust | Fear | Happy | Neutral | Sad | Surprise`

**Errors**

| Status | Meaning |
|--------|---------|
| 400 | No/invalid file, empty file, or unsupported format |
| 413 | File exceeds `MAX_UPLOAD_MB` |
| 500 | Transcription or inference error |

---

### GET /health

**Response (200)**

```json
{
  "status": "ok",
  "model": "mistralai/Voxtral-Mini-3B-2507 + YongkangZOU/evoxtral-lora (local)",
  "model_loaded": true,
  "ffmpeg": true,
  "fer_enabled": true,
  "device": "cpu",
  "max_upload_mb": 100
}
```

---

### GET /debug-inference

Smoke-test endpoint: synthesizes 0.5 s of silence and runs a minimal `generate()` call. Useful for verifying that the model is loaded and functional without uploading a real file.

**Response (200)**

```json
{
  "ok": true,
  "text": "",
  "dtype": "torch.bfloat16",
  "device": "cpu"
}
```
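Together, `/health` and `/debug-inference` can serve as a readiness gate before sending real traffic. A minimal Python sketch, assuming the `requests` package; the base URL and the `wait_until_ready` helper are illustrative, the endpoints and fields are as documented above:

```python
import requests

BASE = "http://127.0.0.1:8000"  # hypothetical: adjust to your deployment

def wait_until_ready() -> bool:
    """Check /health, then smoke-test generation via /debug-inference."""
    health = requests.get(f"{BASE}/health", timeout=10).json()
    if not (health.get("model_loaded") and health.get("ffmpeg")):
        return False
    # /debug-inference runs a real generate() call, so allow a generous timeout
    smoke = requests.get(f"{BASE}/debug-inference", timeout=120).json()
    return bool(smoke.get("ok"))

if __name__ == "__main__":
    print("ready" if wait_until_ready() else "not ready")
```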
---

## Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_ID` | `mistralai/Voxtral-Mini-3B-2507` | Base Voxtral model on the HF Hub |
| `ADAPTER_ID` | `YongkangZOU/evoxtral-lora` | PEFT LoRA adapter on the HF Hub |
| `FER_MODEL_PATH` | (auto-detected) | Path to `emotion_model_web.onnx`; auto-detects `/app/models/` (Docker) and `../models/` (local) |
| `MAX_UPLOAD_MB` | `100` | Max upload size in MB |

---

## Usage examples

```bash
# Health
curl -s http://127.0.0.1:8000/health

# Smoke-test inference
curl -s http://127.0.0.1:8000/debug-inference

# Simple transcription
curl -X POST http://127.0.0.1:8000/transcribe -F "audio=@audio.m4a"

# Full pipeline (audio)
curl -X POST http://127.0.0.1:8000/transcribe-diarize -F "audio=@audio.m4a"

# Full pipeline (video — also returns face_emotion per segment)
curl -X POST http://127.0.0.1:8000/transcribe-diarize -F "audio=@video.mov"
```
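A Python equivalent of the full-pipeline call above may be useful from application code. A sketch assuming the `requests` package; `transcribe_diarize` is an illustrative helper, and the response fields follow the schemas documented earlier:

```python
import requests

BASE = "http://127.0.0.1:8000"  # hypothetical: adjust to your deployment

def transcribe_diarize(path: str) -> dict:
    """Upload a media file to /transcribe-diarize and return the parsed JSON."""
    with open(path, "rb") as f:
        resp = requests.post(f"{BASE}/transcribe-diarize",
                             files={"audio": f}, timeout=600)
    resp.raise_for_status()  # surfaces the documented 400/413/500 errors
    return resp.json()

if __name__ == "__main__":
    result = transcribe_diarize("audio.m4a")
    for seg in result["segments"]:
        face = seg.get("face_emotion", "-")  # present for video inputs only
        print(f'[{seg["start"]:.2f}-{seg["end"]:.2f}] '
              f'{seg["emotion"]}/{face}: {seg["text"]}')
```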