API Layer (Python FastAPI — port 8000)

Local Voxtral inference pipeline. Loads mistralai/Voxtral-Mini-3B-2507 + YongkangZOU/evoxtral-lora (PEFT adapter) locally, runs VAD sentence segmentation, per-segment emotion tagging, and facial emotion recognition (FER) for video inputs.

Requirements: Python 3.11+ and a system ffmpeg (brew install ffmpeg / apt install ffmpeg). A GPU with ~8 GB VRAM is recommended; CPU fallback is supported, but expect roughly 50 s of processing per second of audio.

Startup

cd api
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload

Default port: 8000. On first start, the Voxtral model (~8 GB total) is downloaded from Hugging Face. Set HF_HUB_DISABLE_XET=1 if the download stalls behind a local proxy.

API

POST /transcribe

Simple transcription. Audio is converted to WAV and passed to the local Voxtral model.


| | |
|---|---|
| Content-Type | `multipart/form-data` |
| Body | `audio`: audio or video file (required) |
| Formats | wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv |
| Max size | `MAX_UPLOAD_MB` (default 100 MB) |
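
The constraints in the table above can be checked on the client before uploading. The following is a minimal sketch, not the server's actual validation: `check_upload` is a hypothetical helper, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
from pathlib import Path
from typing import Optional

# Formats listed in the table above.
SUPPORTED_EXTENSIONS = {"wav", "mp3", "flac", "ogg", "m4a", "webm", "mp4", "mov", "mkv"}


def check_upload(filename: str, size_bytes: int, max_upload_mb: int = 100) -> Optional[str]:
    """Return a reason the upload would likely be rejected, or None if it looks fine."""
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext not in SUPPORTED_EXTENSIONS:
        return f"unsupported format: {ext or '(none)'}"
    if size_bytes == 0:
        return "empty file"
    # Assumption: 1 MB = 1024 * 1024 bytes; the server may count differently.
    if size_bytes > max_upload_mb * 1024 * 1024:
        return f"file exceeds {max_upload_mb} MB"
    return None
```

Running this check first avoids sending a large file only to receive a 400 or 413 response.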

Response (200)

{
  "text": "Hello! [laughs] How are you?",
  "words": [],
  "languageCode": "en"
}
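
The example response embeds an expressive tag (`[laughs]`) inline in `text`. A client that needs the plain transcript and the tags separately could split them as sketched below. This assumes tags always appear in square brackets as in the example; `split_tags` is a hypothetical helper, not part of the API.

```python
import re

# Matches a bracketed tag such as "[laughs]" and captures its content.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tags(text: str) -> tuple[str, list[str]]:
    """Return (text without bracketed tags, list of tag names in order)."""
    tags = TAG_PATTERN.findall(text)
    plain = TAG_PATTERN.sub("", text)
    # Collapse the double space left behind where a tag was removed.
    plain = re.sub(r"\s+", " ", plain).strip()
    return plain, tags
```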

POST /transcribe-diarize

Full pipeline: local Voxtral STT + VAD sentence segmentation + per-segment emotion tagging. For video inputs, also runs per-frame FER via MobileViT-XXS ONNX and adds face_emotion per segment.


| | |
|---|---|
| Content-Type | `multipart/form-data` |
| Body | `audio`: audio or video file (required) |
| Formats | wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv |
| Max size | `MAX_UPLOAD_MB` (default 100 MB) |

Segmentation: a silence gap of ≥ 0.3 s starts a new segment; speech separated by a gap shorter than 0.3 s is merged into the same segment.
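
To illustrate the gap rule, here is a minimal sketch that merges speech intervals by silence gap. It assumes the VAD yields `(start, end)` pairs in seconds; `merge_by_silence` is a hypothetical name, and the actual pipeline may be implemented differently.

```python
def merge_by_silence(
    intervals: list[tuple[float, float]], min_gap: float = 0.3
) -> list[tuple[float, float]]:
    """Merge speech intervals whose separating silence is shorter than min_gap."""
    merged: list[tuple[float, float]] = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] < min_gap:
            # Gap is below the threshold: extend the current segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # Gap is at or above the threshold (or first interval): new segment.
            merged.append((start, end))
    return merged


segments = merge_by_silence([(0.0, 1.0), (1.2, 2.0), (2.5, 3.0)])
```

Here the 0.2 s gap between the first two intervals is merged away, while the 0.5 s gap before the third starts a new segment.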

Response (200) — audio input

{
  "segments": [
    {
      "id": 1,
      "speaker": "SPEAKER_00",
      "start": 0.96,
      "end": 3.23,
      "text": "Hello! [laughs] How are you?",
      "emotion": "Happy",
      "valence": 0.7,
      "arousal": 0.6
    }
  ],
  "duration": 5.65,
  "text": "Hello! [laughs] How are you?",
  "filename": "audio.m4a",
  "diarization_method": "vad",
  "has_video": false
}

Response (200) — video input (adds face_emotion per segment)

{
  "segments": [
    {
      "id": 1,
      "speaker": "SPEAKER_00",
      "start": 0.96,
      "end": 3.23,
      "text": "Hello!",
      "emotion": "Happy",
      "valence": 0.7,
      "arousal": 0.6,
      "face_emotion": "Happy"
    }
  ],
  "duration": 5.65,
  "text": "Hello!",
  "filename": "video.mov",
  "diarization_method": "vad",
  "has_video": true
}
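
A client can map either response shape onto one typed structure by treating `face_emotion` as optional. This is a sketch based only on the fields shown in the examples above; the `Segment` class and `parse_segments` are hypothetical names.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Segment:
    id: int
    speaker: str
    start: float
    end: float
    text: str
    emotion: str
    valence: float
    arousal: float
    face_emotion: Optional[str] = None  # only present for video inputs


_KNOWN_FIELDS = {f.name for f in fields(Segment)}


def parse_segments(payload: dict) -> list[Segment]:
    """Build Segment objects from a /transcribe-diarize response body."""
    return [
        # Ignore keys this sketch does not know about, so extra fields do not break parsing.
        Segment(**{k: v for k, v in seg.items() if k in _KNOWN_FIELDS})
        for seg in payload["segments"]
    ]
```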

Errors

| Status | Meaning |
|---|---|
| 400 | No/invalid file, empty, or unsupported format |
| 413 | File exceeds `MAX_UPLOAD_MB` |
| 500 | Transcription or inference error |
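
On the client side, the status codes above can be turned into readable messages. A minimal sketch follows; `explain_status` is a hypothetical helper, and the shape of the error response body is not documented here, so only the status code is used.

```python
# Meanings copied from the error table above.
ERROR_MEANINGS = {
    400: "No/invalid file, empty, or unsupported format",
    413: "File exceeds MAX_UPLOAD_MB",
    500: "Transcription or inference error",
}


def explain_status(status: int) -> str:
    """Map an HTTP status code from the API to a short human-readable message."""
    if status == 200:
        return "ok"
    return ERROR_MEANINGS.get(status, f"unexpected status {status}")
```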

GET /health

Response (200)

{
  "status": "ok",
  "model": "mistralai/Voxtral-Mini-3B-2507 + YongkangZOU/evoxtral-lora (local)",
  "model_loaded": true,
  "ffmpeg": true,
  "fer_enabled": true,
  "device": "cpu",
  "max_upload_mb": 100
}
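
A deployment script might gate traffic on the health payload. The sketch below is one possible reading of the fields shown above: `is_ready` is a hypothetical helper, and which fields count as required is an assumption.

```python
def is_ready(health: dict, require_fer: bool = False) -> bool:
    """Return True if the /health payload indicates the API can serve requests."""
    ready = (
        health.get("status") == "ok"
        and health.get("model_loaded") is True
        and health.get("ffmpeg") is True
    )
    if require_fer:
        # FER only matters to callers that send video.
        ready = ready and health.get("fer_enabled") is True
    return ready
```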

GET /debug-inference

Smoke-test endpoint: synthesizes 0.5 s of silence and runs a minimal generate() call. Useful for verifying the model is loaded and functional without uploading a real file.

Response (200)

{
  "ok": true,
  "text": "",
  "dtype": "torch.bfloat16",
  "device": "cpu"
}
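
For reference, 0.5 s of silence can be produced with the standard library alone. This sketch is not the endpoint's implementation; the 16 kHz, mono, 16-bit PCM format is an assumption.

```python
import io
import wave


def silent_wav(seconds: float = 0.5, sample_rate: int = 16000) -> bytes:
    """Return a mono 16-bit PCM WAV file containing only silence."""
    n_frames = int(seconds * sample_rate)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * n_frames)  # all-zero samples = silence
    return buffer.getvalue()
```

The resulting bytes can also be saved to a file and posted to `/transcribe` as a lightweight end-to-end check.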

Environment variables

| Variable | Default | Description |
|---|---|---|
| `MODEL_ID` | `mistralai/Voxtral-Mini-3B-2507` | Base Voxtral model on HF Hub |
| `ADAPTER_ID` | `YongkangZOU/evoxtral-lora` | PEFT LoRA adapter on HF Hub |
| `FER_MODEL_PATH` | (auto-detected) | Path to `emotion_model_web.onnx`; auto-detects `/app/models/` (Docker) and `../models/` (local) |
| `MAX_UPLOAD_MB` | `100` | Max upload size in MB |
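
The defaults and the `FER_MODEL_PATH` auto-detection described above could be resolved roughly as follows. This is a sketch under stated assumptions, not the code in `main.py`: `load_settings` and `resolve_fer_model_path` are hypothetical names, and the search order is assumed.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_fer_model_path(explicit: Optional[str]) -> Optional[str]:
    """Use the explicit path if given, else look in the documented directories."""
    if explicit:
        return explicit
    # Assumed search order: Docker location first, then the local checkout.
    for directory in ("/app/models", "../models"):
        candidate = Path(directory) / "emotion_model_web.onnx"
        if candidate.is_file():
            return str(candidate)
    return None


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "model_id": env.get("MODEL_ID", "mistralai/Voxtral-Mini-3B-2507"),
        "adapter_id": env.get("ADAPTER_ID", "YongkangZOU/evoxtral-lora"),
        "fer_model_path": resolve_fer_model_path(env.get("FER_MODEL_PATH")),
        "max_upload_mb": int(env.get("MAX_UPLOAD_MB", "100")),
    }
```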

Usage examples

# Health
curl -s http://127.0.0.1:8000/health

# Smoke-test inference
curl -s http://127.0.0.1:8000/debug-inference

# Simple transcription
curl -X POST http://127.0.0.1:8000/transcribe -F "audio=@audio.m4a"

# Full pipeline (audio)
curl -X POST http://127.0.0.1:8000/transcribe-diarize -F "audio=@audio.m4a"

# Full pipeline (video — also returns face_emotion per segment)
curl -X POST http://127.0.0.1:8000/transcribe-diarize -F "audio=@video.mov"
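
The same upload can be made from Python without third-party packages. The sketch below hand-builds the `multipart/form-data` body with the standard library; `build_multipart` and `transcribe_diarize` are hypothetical helper names, and the base URL assumes a local server on the default port.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def transcribe_diarize(path: str, base_url: str = "http://127.0.0.1:8000") -> dict:
    """POST a local file to /transcribe-diarize and return the decoded JSON."""
    file = Path(path)
    body, content_type = build_multipart("audio", file.name, file.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/transcribe-diarize",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `transcribe_diarize("audio.m4a")` against a running server should return the response structure documented for `POST /transcribe-diarize`.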