# API Layer (Python FastAPI — port 8000)

Local Voxtral inference pipeline. Loads `mistralai/Voxtral-Mini-3B-2507` + `YongkangZOU/evoxtral-lora` (PEFT adapter) locally, runs VAD sentence segmentation, per-segment emotion tagging, and facial emotion recognition (FER) for video inputs.

**Requirements**: Python 3.11+, system **ffmpeg** (`brew install ffmpeg` / `apt install ffmpeg`). GPU with ~8 GB VRAM recommended; CPU fallback supported (expect ~50 s per audio second).

---

## Startup

```bash
cd api
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

Default port: **8000**. On first start the Voxtral model (~8 GB total) is downloaded from HuggingFace. Set `HF_HUB_DISABLE_XET=1` if download stalls behind a local proxy.

---

## API

### POST /transcribe

Simple transcription. Audio is converted to WAV and passed to the local Voxtral model.

| | |
|--|--|
| **Content-Type** | `multipart/form-data` |
| **Body** | `audio` — audio or video file (required) |
| **Formats** | wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv |
| **Max size** | `MAX_UPLOAD_MB` (default 100 MB) |

**Response (200)**

```json
{
  "text": "Hello! [laughs] How are you?",
  "words": [],
  "languageCode": "en"
}
```

---

### POST /transcribe-diarize

Full pipeline: local Voxtral STT + VAD sentence segmentation + per-segment emotion tagging. For video inputs, also runs per-frame FER via MobileViT-XXS ONNX and adds `face_emotion` per segment.

| | |
|--|--|
| **Content-Type** | `multipart/form-data` |
| **Body** | `audio` — audio or video file (required) |
| **Formats** | wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv |
| **Max size** | `MAX_UPLOAD_MB` (default 100 MB) |

Segmentation: silence gaps ≥ 0.3 s create a new segment; gaps < 0.3 s are merged.
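The gap rule above can be sketched as a small merge pass over VAD speech spans. This is an illustrative sketch, not the service's actual code; the function name and the `(start, end)` tuple representation are assumptions.

```python
# Illustrative sketch of the 0.3 s gap rule (not the actual implementation).
# Given VAD speech spans as (start, end) tuples in seconds, spans separated
# by less than the threshold are merged into one segment; larger silence
# gaps start a new segment.

MIN_GAP_S = 0.3

def merge_spans(spans: list[tuple[float, float]]) -> list[tuple[float, float]]:
    segments: list[tuple[float, float]] = []
    for start, end in sorted(spans):
        if segments and start - segments[-1][1] < MIN_GAP_S:
            # Gap below threshold: extend the previous segment.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments
```

For example, spans `(0.0, 1.0)` and `(1.1, 2.0)` (a 0.1 s gap) collapse into one segment, while a span starting at 2.5 s after a segment ending at 2.0 s (a 0.5 s gap) stays separate.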

**Response (200) — audio input**

```json
{
  "segments": [
    {
      "id": 1,
      "speaker": "SPEAKER_00",
      "start": 0.96,
      "end": 3.23,
      "text": "Hello! [laughs] How are you?",
      "emotion": "Happy",
      "valence": 0.7,
      "arousal": 0.6
    }
  ],
  "duration": 5.65,
  "text": "Hello! [laughs] How are you?",
  "filename": "audio.m4a",
  "diarization_method": "vad",
  "has_video": false
}
```

**Response (200) — video input** (adds `face_emotion` per segment)

```json
{
  "segments": [
    {
      "id": 1,
      "speaker": "SPEAKER_00",
      "start": 0.96,
      "end": 3.23,
      "text": "Hello!",
      "emotion": "Happy",
      "valence": 0.7,
      "arousal": 0.6,
      "face_emotion": "Happy"
    }
  ],
  "duration": 5.65,
  "text": "Hello!",
  "filename": "video.mov",
  "diarization_method": "vad",
  "has_video": true
}
```

`face_emotion` values: `Anger | Contempt | Disgust | Fear | Happy | Neutral | Sad | Surprise`
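Downstream, the FER output reduces to an argmax over these eight classes. A minimal sketch, assuming the ONNX model emits one score per class in the label order listed above (the output ordering is an assumption, not taken from the model card):

```python
# Hypothetical post-processing of the FER model output: pick the label with
# the highest score. Label order matches the values documented above, but
# whether the ONNX model uses this exact order is an assumption.

FER_LABELS = ["Anger", "Contempt", "Disgust", "Fear",
              "Happy", "Neutral", "Sad", "Surprise"]

def face_emotion_from_scores(scores: list[float]) -> str:
    return FER_LABELS[max(range(len(FER_LABELS)), key=lambda i: scores[i])]
```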

**Errors**

| Status | Meaning |
|--------|---------|
| 400 | Missing or invalid file, empty upload, or unsupported format |
| 413 | File exceeds `MAX_UPLOAD_MB` |
| 500 | Transcription or inference error |

---

### GET /health

**Response (200)**

```json
{
  "status": "ok",
  "model": "mistralai/Voxtral-Mini-3B-2507 + YongkangZOU/evoxtral-lora (local)",
  "model_loaded": true,
  "ffmpeg": true,
  "fer_enabled": true,
  "device": "cpu",
  "max_upload_mb": 100
}
```

---

### GET /debug-inference

Smoke-test endpoint: synthesizes 0.5 s of silence and runs a minimal `generate()` call. Useful for verifying the model is loaded and functional without uploading a real file.

**Response (200)**

```json
{
  "ok": true,
  "text": "",
  "dtype": "torch.bfloat16",
  "device": "cpu"
}
```
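The synthetic input described above can be pictured as a zero-filled buffer. A sketch, assuming Voxtral's usual 16 kHz input rate (the sample rate and buffer representation here are assumptions, not the endpoint's actual code):

```python
# Sketch of the kind of synthetic input /debug-inference describes:
# 0.5 s of silence at an assumed 16 kHz sample rate.
SAMPLE_RATE = 16_000

def make_silence(seconds: float = 0.5) -> list[float]:
    return [0.0] * int(seconds * SAMPLE_RATE)
```

Transcribing silence is expected to yield an empty `text`, which is why the endpoint's sample response above shows `"text": ""`.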

---

## Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_ID` | `mistralai/Voxtral-Mini-3B-2507` | Base Voxtral model on HF Hub |
| `ADAPTER_ID` | `YongkangZOU/evoxtral-lora` | PEFT LoRA adapter on HF Hub |
| `FER_MODEL_PATH` | (auto-detected) | Path to `emotion_model_web.onnx`; auto-detects `/app/models/` (Docker) and `../models/` (local) |
| `MAX_UPLOAD_MB` | `100` | Max upload size in MB |
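For example, overrides can be exported before launch (the `FER_MODEL_PATH` value below is illustrative; the auto-detected default usually suffices):

```shell
# Raise the upload cap and pin the FER model path explicitly.
export MAX_UPLOAD_MB=200
export FER_MODEL_PATH=../models/emotion_model_web.onnx
uvicorn main:app --host 0.0.0.0 --port 8000
```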

---

## Usage examples

```bash
# Health
curl -s http://127.0.0.1:8000/health

# Smoke-test inference
curl -s http://127.0.0.1:8000/debug-inference

# Simple transcription
curl -X POST http://127.0.0.1:8000/transcribe -F "audio=@audio.m4a"

# Full pipeline (audio)
curl -X POST http://127.0.0.1:8000/transcribe-diarize -F "audio=@audio.m4a"

# Full pipeline (video — also returns face_emotion per segment)
curl -X POST http://127.0.0.1:8000/transcribe-diarize -F "audio=@video.mov"
```
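For consumers of `/transcribe-diarize`, a small helper can flatten the documented response into readable lines. This is a sketch against the response schema shown above; the function name and output format are illustrative:

```python
# Render a /transcribe-diarize response (schema documented above) as one
# line per segment: time range, speaker, emotion(s), text. For video
# inputs the optional face_emotion is appended after a slash.

def format_segments(resp: dict) -> list[str]:
    lines = []
    for seg in resp.get("segments", []):
        emotion = seg["emotion"]
        if "face_emotion" in seg:
            emotion += f"/{seg['face_emotion']}"
        lines.append(f"[{seg['start']:.2f}-{seg['end']:.2f}] "
                     f"{seg['speaker']} ({emotion}): {seg['text']}")
    return lines
```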