# API Layer (Python FastAPI, port 8000)
Local Voxtral inference pipeline. Loads `mistralai/Voxtral-Mini-3B-2507` + `YongkangZOU/evoxtral-lora` (PEFT adapter) locally, runs VAD sentence segmentation, per-segment emotion tagging, and facial emotion recognition (FER) for video inputs.
**Requirements**: Python 3.11+, system **ffmpeg** (`brew install ffmpeg` / `apt install ffmpeg`). A GPU with ~8 GB VRAM is recommended; CPU fallback is supported (expect roughly 50 s of processing per second of audio).
---
## Startup
```bash
cd api
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```
Default port: **8000**. On first start the Voxtral model (~8 GB total) is downloaded from HuggingFace. Set `HF_HUB_DISABLE_XET=1` if download stalls behind a local proxy.
---
## API
### POST /transcribe
Simple transcription. Audio is converted to WAV and passed to the local Voxtral model.
| | |
|--|--|
| **Content-Type** | `multipart/form-data` |
| **Body** | `audio`: audio or video file (required) |
| **Formats** | wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv |
| **Max size** | `MAX_UPLOAD_MB` (default 100 MB) |
**Response (200)**
```json
{
  "text": "Hello! [laughs] How are you?",
  "words": [],
  "languageCode": "en"
}
```
---
### POST /transcribe-diarize
Full pipeline: local Voxtral STT + VAD sentence segmentation + per-segment emotion tagging. For video inputs, also runs per-frame FER via MobileViT-XXS ONNX and adds `face_emotion` per segment.
| | |
|--|--|
| **Content-Type** | `multipart/form-data` |
| **Body** | `audio`: audio or video file (required) |
| **Formats** | wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv |
| **Max size** | `MAX_UPLOAD_MB` (default 100 MB) |
Segmentation: silence gaps ≥ 0.3 s start a new segment; gaps < 0.3 s are merged into the previous segment.
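The merge rule can be sketched in a few lines. This is illustrative only; `merge_segments` and the `(start, end)` tuple format are assumptions, not the pipeline's actual code:

```python
MIN_GAP_S = 0.3  # silence gaps >= 0.3 s start a new segment

def merge_segments(spans, min_gap=MIN_GAP_S):
    """Merge adjacent speech spans whose silence gap is below min_gap.

    spans: iterable of (start, end) tuples in seconds.
    """
    merged = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] < min_gap:
            # gap < 0.3 s: extend the previous segment
            merged[-1] = (merged[-1][0], end)
        else:
            # gap >= 0.3 s (or first span): open a new segment
            merged.append((start, end))
    return merged
```

For example, spans `[(0.0, 1.0), (1.1, 2.0), (2.5, 3.0)]` collapse to two segments: the 0.1 s gap is merged, the 0.5 s gap is kept as a boundary.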
**Response (200) for audio input**
```json
{
  "segments": [
    {
      "id": 1,
      "speaker": "SPEAKER_00",
      "start": 0.96,
      "end": 3.23,
      "text": "Hello! [laughs] How are you?",
      "emotion": "Happy",
      "valence": 0.7,
      "arousal": 0.6
    }
  ],
  "duration": 5.65,
  "text": "Hello! [laughs] How are you?",
  "filename": "audio.m4a",
  "diarization_method": "vad",
  "has_video": false
}
```
**Response (200) for video input** (adds `face_emotion` per segment)
```json
{
  "segments": [
    {
      "id": 1,
      "speaker": "SPEAKER_00",
      "start": 0.96,
      "end": 3.23,
      "text": "Hello!",
      "emotion": "Happy",
      "valence": 0.7,
      "arousal": 0.6,
      "face_emotion": "Happy"
    }
  ],
  "duration": 5.65,
  "text": "Hello!",
  "filename": "video.mov",
  "diarization_method": "vad",
  "has_video": true
}
```
`face_emotion` values: `Anger | Contempt | Disgust | Fear | Happy | Neutral | Sad | Surprise`
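One plausible way to reduce per-frame FER predictions to a single `face_emotion` per segment is a majority vote over the frames that fall inside the segment's time span. This is a hedged sketch; the reduction the pipeline actually uses is not documented here, and `segment_face_emotion` is a hypothetical helper:

```python
from collections import Counter

def segment_face_emotion(frames, start, end):
    """Hypothetical reduction: per-frame FER labels -> one label per segment.

    frames: list of (timestamp_s, label) pairs from per-frame FER.
    Returns the most frequent label within [start, end], or None if no
    face was detected in that span.
    """
    labels = [label for t, label in frames if start <= t <= end]
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]
```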
**Errors**
| Status | Meaning |
|--------|---------|
| 400 | No/invalid file, empty, or unsupported format |
| 413 | File exceeds `MAX_UPLOAD_MB` |
| 500 | Transcription or inference error |
---
### GET /health
**Response (200)**
```json
{
  "status": "ok",
  "model": "mistralai/Voxtral-Mini-3B-2507 + YongkangZOU/evoxtral-lora (local)",
  "model_loaded": true,
  "ffmpeg": true,
  "fer_enabled": true,
  "device": "cpu",
  "max_upload_mb": 100
}
```
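Because the first start downloads ~8 GB of weights, a deployment script can poll `/health` until `model_loaded` is true before routing traffic. A minimal stdlib sketch (the `wait_until_ready` helper is an assumption, not part of this repo):

```python
import json
import time
import urllib.request

HEALTH_URL = "http://127.0.0.1:8000/health"  # default port from this README

def is_ready(health: dict) -> bool:
    """True once the service reports ok, the model is loaded, and ffmpeg is present."""
    return bool(
        health.get("status") == "ok"
        and health.get("model_loaded")
        and health.get("ffmpeg")
    )

def wait_until_ready(url=HEALTH_URL, timeout_s=600, poll_s=5):
    """Poll /health until is_ready() passes or timeout_s elapses."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if is_ready(json.load(resp)):
                    return True
        except OSError:
            pass  # server not up yet; keep polling
        time.sleep(poll_s)
    return False
```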
---
### GET /debug-inference
Smoke-test endpoint: synthesizes 0.5 s of silence and runs a minimal `generate()` call. Useful for verifying the model is loaded and functional without uploading a real file.
**Response (200)**
```json
{
  "ok": true,
  "text": "",
  "dtype": "torch.bfloat16",
  "device": "cpu"
}
```
---
## Environment variables
| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_ID` | `mistralai/Voxtral-Mini-3B-2507` | Base Voxtral model on HF Hub |
| `ADAPTER_ID` | `YongkangZOU/evoxtral-lora` | PEFT LoRA adapter on HF Hub |
| `FER_MODEL_PATH` | (auto-detected) | Path to `emotion_model_web.onnx`; auto-detects `/app/models/` (Docker) and `../models/` (local) |
| `MAX_UPLOAD_MB` | `100` | Max upload size in MB |
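These variables are typically read once at startup. A sketch of equivalent defaults (illustrative; the variable names mirror the table above, but the actual code in `main.py` may structure this differently):

```python
import os

# Defaults mirror the environment-variable table above (illustrative).
MODEL_ID = os.getenv("MODEL_ID", "mistralai/Voxtral-Mini-3B-2507")
ADAPTER_ID = os.getenv("ADAPTER_ID", "YongkangZOU/evoxtral-lora")
FER_MODEL_PATH = os.getenv("FER_MODEL_PATH")  # None -> auto-detect model dir
MAX_UPLOAD_MB = int(os.getenv("MAX_UPLOAD_MB", "100"))
MAX_UPLOAD_BYTES = MAX_UPLOAD_MB * 1024 * 1024  # limit enforced per upload
```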
---
## Usage examples
```bash
# Health
curl -s http://127.0.0.1:8000/health

# Smoke-test inference
curl -s http://127.0.0.1:8000/debug-inference

# Simple transcription
curl -X POST http://127.0.0.1:8000/transcribe -F "audio=@audio.m4a"

# Full pipeline (audio)
curl -X POST http://127.0.0.1:8000/transcribe-diarize -F "audio=@audio.m4a"

# Full pipeline (video; also returns face_emotion per segment)
curl -X POST http://127.0.0.1:8000/transcribe-diarize -F "audio=@video.mov"
```
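The JSON returned by the pipeline calls above can be rendered into a readable transcript with a few lines of Python. A sketch (field names follow the response examples in this README; `as_transcript` is a hypothetical helper, not part of the API):

```python
def as_transcript(result: dict) -> str:
    """Render /transcribe-diarize output as 'speaker [start-end]: text' lines,
    appending the FER label when the input was a video."""
    lines = []
    for seg in result["segments"]:
        line = f"{seg['speaker']} [{seg['start']:.2f}-{seg['end']:.2f}]: {seg['text']}"
        if "face_emotion" in seg:
            line += f" (face: {seg['face_emotion']})"
        lines.append(line)
    return "\n".join(lines)
```

For the sample video response above this yields `SPEAKER_00 [0.96-3.23]: Hello! (face: Happy)`.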