
# API Layer (Python FastAPI — port 8000)

Local Voxtral inference pipeline. Loads `mistralai/Voxtral-Mini-3B-2507` + `YongkangZOU/evoxtral-lora` (PEFT adapter) locally, runs VAD sentence segmentation, per-segment emotion tagging, and facial emotion recognition (FER) for video inputs.

Requirements: Python 3.11+, system ffmpeg (`brew install ffmpeg` / `apt install ffmpeg`). GPU with ~8 GB VRAM recommended; CPU fallback supported (expect ~50 s per audio second).


## Startup

```bash
cd api
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

Default port: 8000. On first start, the Voxtral model (~8 GB total) is downloaded from the Hugging Face Hub. Set `HF_HUB_DISABLE_XET=1` if the download stalls behind a local proxy.


## API

### POST /transcribe

Simple transcription. Audio is converted to WAV and passed to the local Voxtral model.

| Field        | Value |
|--------------|-------|
| Content-Type | `multipart/form-data` |
| Body         | `audio` — audio or video file (required) |
| Formats      | wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv |
| Max size     | `MAX_UPLOAD_MB` (default 100 MB) |

Response (200)

```json
{
  "text": "Hello! [laughs] How are you?",
  "words": [],
  "languageCode": "en"
}
```
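For programmatic use, the upload can be built with the standard library alone. The sketch below is a hypothetical client (helper names are ours, not from this repo); only the endpoint path and the `audio` field name come from the README.

```python
# Hypothetical client sketch: builds a multipart/form-data request for
# POST /transcribe using only the Python standard library.
import json
import mimetypes
import urllib.request
import uuid

def build_multipart(field: str, filename: str, payload: bytes) -> tuple[bytes, str]:
    """Return (body, content_type) for a single-file multipart upload."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode() + payload + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"

def transcribe(path: str, base_url: str = "http://127.0.0.1:8000") -> dict:
    """Upload an audio/video file to /transcribe and return the parsed JSON."""
    with open(path, "rb") as f:
        body, ctype = build_multipart("audio", path, f.read())
    req = urllib.request.Request(
        f"{base_url}/transcribe", data=body, headers={"Content-Type": ctype}
    )
    # CPU inference can take minutes; keep the default (unbounded) timeout.
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```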

### POST /transcribe-diarize

Full pipeline: local Voxtral STT + VAD sentence segmentation + per-segment emotion tagging. For video inputs, also runs per-frame FER via MobileViT-XXS ONNX and adds face_emotion per segment.

| Field        | Value |
|--------------|-------|
| Content-Type | `multipart/form-data` |
| Body         | `audio` — audio or video file (required) |
| Formats      | wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv |
| Max size     | `MAX_UPLOAD_MB` (default 100 MB) |

Segmentation: silence gaps ≥ 0.3 s create a new segment; gaps < 0.3 s are merged.
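The merge rule can be sketched as a small pure function (illustrative only, not the repo's actual VAD code):

```python
# Sketch of the gap-based merge rule: VAD speech spans separated by
# less than 0.3 s of silence are fused into one segment.
GAP_S = 0.3

def merge_spans(spans: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Merge (start, end) spans whose silence gap is shorter than GAP_S."""
    merged: list[tuple[float, float]] = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] < GAP_S:
            # Gap too short: extend the previous segment instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```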

Response (200) — audio input

```json
{
  "segments": [
    {
      "id": 1,
      "speaker": "SPEAKER_00",
      "start": 0.96,
      "end": 3.23,
      "text": "Hello! [laughs] How are you?",
      "emotion": "Happy",
      "valence": 0.7,
      "arousal": 0.6
    }
  ],
  "duration": 5.65,
  "text": "Hello! [laughs] How are you?",
  "filename": "audio.m4a",
  "diarization_method": "vad",
  "has_video": false
}
```

Response (200) — video input (adds `face_emotion` per segment)

```json
{
  "segments": [
    {
      "id": 1,
      "speaker": "SPEAKER_00",
      "start": 0.96,
      "end": 3.23,
      "text": "Hello!",
      "emotion": "Happy",
      "valence": 0.7,
      "arousal": 0.6,
      "face_emotion": "Happy"
    }
  ],
  "duration": 5.65,
  "text": "Hello!",
  "filename": "video.mov",
  "diarization_method": "vad",
  "has_video": true
}
```

`face_emotion` values: `Anger` | `Contempt` | `Disgust` | `Fear` | `Happy` | `Neutral` | `Sad` | `Surprise`

### Errors

| Status | Meaning |
|--------|---------|
| 400    | No/invalid file, empty, or unsupported format |
| 413    | File exceeds `MAX_UPLOAD_MB` |
| 500    | Transcription or inference error |
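To avoid uploading a file only to receive a 413, a client can check the size up front. This pre-flight check is a suggestion, not part of the API; the 100 MB default comes from the table above.

```python
# Client-side pre-flight check against the server's upload limit.
import os

MAX_UPLOAD_MB = 100  # assumption: mirrors the server's MAX_UPLOAD_MB default

def exceeds_limit(path: str, max_mb: int = MAX_UPLOAD_MB) -> bool:
    """True if uploading this file would trigger a 413 at the given limit."""
    return os.path.getsize(path) > max_mb * 1024 * 1024
```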

### GET /health

Response (200)

```json
{
  "status": "ok",
  "model": "mistralai/Voxtral-Mini-3B-2507 + YongkangZOU/evoxtral-lora (local)",
  "model_loaded": true,
  "ffmpeg": true,
  "fer_enabled": true,
  "device": "cpu",
  "max_upload_mb": 100
}
```
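A deployment probe should not treat HTTP 200 alone as "ready": the payload reports the model and ffmpeg state separately. One way to interpret it (a sketch, not repo code):

```python
# Interpret a /health payload: ready only when status, model, and ffmpeg all check out.
def is_ready(health: dict) -> bool:
    """Return True once the service can actually transcribe."""
    return bool(
        health.get("status") == "ok"
        and health.get("model_loaded")
        and health.get("ffmpeg")
    )
```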

### GET /debug-inference

Smoke-test endpoint: synthesizes 0.5 s of silence and runs a minimal generate() call. Useful for verifying the model is loaded and functional without uploading a real file.

Response (200)

```json
{
  "ok": true,
  "text": "",
  "dtype": "torch.bfloat16",
  "device": "cpu"
}
```

## Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_ID` | `mistralai/Voxtral-Mini-3B-2507` | Base Voxtral model on the HF Hub |
| `ADAPTER_ID` | `YongkangZOU/evoxtral-lora` | PEFT LoRA adapter on the HF Hub |
| `FER_MODEL_PATH` | (auto-detected) | Path to `emotion_model_web.onnx`; auto-detects `/app/models/` (Docker) and `../models/` (local) |
| `MAX_UPLOAD_MB` | `100` | Max upload size in MB |
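One conventional way a service consumes such variables is `os.environ.get` with the documented defaults. This is an assumption about the implementation, shown only to make the override semantics concrete:

```python
# Sketch of env-var handling with the defaults from the table above.
import os

MODEL_ID = os.environ.get("MODEL_ID", "mistralai/Voxtral-Mini-3B-2507")
ADAPTER_ID = os.environ.get("ADAPTER_ID", "YongkangZOU/evoxtral-lora")
FER_MODEL_PATH = os.environ.get("FER_MODEL_PATH")  # None -> auto-detect models dir
MAX_UPLOAD_MB = int(os.environ.get("MAX_UPLOAD_MB", "100"))
```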

## Usage examples

```bash
# Health
curl -s http://127.0.0.1:8000/health

# Smoke-test inference
curl -s http://127.0.0.1:8000/debug-inference

# Simple transcription
curl -X POST http://127.0.0.1:8000/transcribe -F "audio=@audio.m4a"

# Full pipeline (audio)
curl -X POST http://127.0.0.1:8000/transcribe-diarize -F "audio=@audio.m4a"

# Full pipeline (video — also returns face_emotion per segment)
curl -X POST http://127.0.0.1:8000/transcribe-diarize -F "audio=@video.mov"
```