
	# API Layer (Python FastAPI — port 8000)

	Local Voxtral inference pipeline. Loads `mistralai/Voxtral-Mini-3B-2507` + `YongkangZOU/evoxtral-lora` (PEFT adapter) locally, runs VAD sentence segmentation, per-segment emotion tagging, and facial emotion recognition (FER) for video inputs.

	Requirements: Python 3.11+ and a system ffmpeg (`brew install ffmpeg` / `apt install ffmpeg`). A GPU with ~8 GB VRAM is recommended; CPU fallback is supported but slow (expect ~50 s of processing per second of audio).

	---

	## Startup

	```bash
	cd api
	python -m venv .venv
	source .venv/bin/activate # Windows: .venv\Scripts\activate
	pip install -r requirements.txt
	uvicorn main:app --host 0.0.0.0 --port 8000 --reload
	```

	Default port: 8000. On first start, the Voxtral model (~8 GB total) is downloaded from the Hugging Face Hub. Set `HF_HUB_DISABLE_XET=1` if the download stalls behind a local proxy.

	---

	## API

	### POST /transcribe

	Simple transcription. Audio is converted to WAV and passed to the local Voxtral model.

	| Property | Value |
	|--|--|
	| Content-Type | `multipart/form-data` |
	| Body | `audio`: audio or video file (required) |
	| Formats | wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv |
	| Max size | `MAX_UPLOAD_MB` (default 100 MB) |

	Response (200)

	```json
	{
	  "text": "Hello! [laughs] How are you?",
	  "words": [],
	  "languageCode": "en"
	}
	```

	---

	### POST /transcribe-diarize

	Full pipeline: local Voxtral STT + VAD sentence segmentation + per-segment emotion tagging. For video inputs, also runs per-frame FER via MobileViT-XXS ONNX and adds `face_emotion` per segment.

	| Property | Value |
	|--|--|
	| Content-Type | `multipart/form-data` |
	| Body | `audio`: audio or video file (required) |
	| Formats | wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv |
	| Max size | `MAX_UPLOAD_MB` (default 100 MB) |

	Segmentation: silence gaps ≥ 0.3 s create a new segment; gaps < 0.3 s are merged.
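The merge rule above can be illustrated with a minimal sketch. This is not the pipeline's actual code: `merge_speech_spans` and the `(start, end)` tuple input are hypothetical names, and the sketch assumes the VAD step yields speech spans in seconds.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def merge_speech_spans(spans: List[Span], min_gap: float = 0.3) -> List[Span]:
    """Merge speech spans separated by less than `min_gap` seconds of silence.

    Hypothetical helper: a gap >= min_gap starts a new segment,
    a shorter gap is absorbed into the previous segment.
    """
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] < min_gap:
            prev_start, prev_end = merged[-1]
            # Short pause: extend the previous segment.
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # First span, or a pause long enough to split on.
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    # 0.1 s gap is merged; 0.5 s gap creates a new segment.
    print(merge_speech_spans([(0.0, 1.0), (1.1, 2.0), (2.5, 3.0)]))
```

The threshold is a parameter here only for illustration; the README does not state whether the server exposes it.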

	Response (200) — audio input

	```json
	{
	  "segments": [
	    {
	      "id": 1,
	      "speaker": "SPEAKER_00",
	      "start": 0.96,
	      "end": 3.23,
	      "text": "Hello! [laughs] How are you?",
	      "emotion": "Happy",
	      "valence": 0.7,
	      "arousal": 0.6
	    }
	  ],
	  "duration": 5.65,
	  "text": "Hello! [laughs] How are you?",
	  "filename": "audio.m4a",
	  "diarization_method": "vad",
	  "has_video": false
	}
	```

	Response (200) — video input (adds `face_emotion` per segment)

	```json
	{
	  "segments": [
	    {
	      "id": 1,
	      "speaker": "SPEAKER_00",
	      "start": 0.96,
	      "end": 3.23,
	      "text": "Hello!",
	      "emotion": "Happy",
	      "valence": 0.7,
	      "arousal": 0.6,
	      "face_emotion": "Happy"
	    }
	  ],
	  "duration": 5.65,
	  "text": "Hello!",
	  "filename": "video.mov",
	  "diarization_method": "vad",
	  "has_video": true
	}
	```

	`face_emotion` values: `Anger | Contempt | Disgust | Fear | Happy | Neutral | Sad | Surprise`
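A client may want to flag segments where the voice-based `emotion` and the video-based `face_emotion` disagree. The sketch below works on a parsed `/transcribe-diarize` response. `emotion_mismatches` is a hypothetical helper, and it assumes both fields use the same label spelling (as in the `"Happy"` example); the README only lists the `face_emotion` label set.

```python
from typing import Any, Dict, List


def emotion_mismatches(response: Dict[str, Any]) -> List[int]:
    """Return ids of segments whose face_emotion differs from emotion.

    Segments without a face_emotion field (audio-only input) are skipped.
    """
    mismatched: List[int] = []
    for seg in response.get("segments", []):
        face = seg.get("face_emotion")
        if face is not None and face != seg.get("emotion"):
            mismatched.append(seg["id"])
    return mismatched


if __name__ == "__main__":
    sample = {
        "segments": [
            {"id": 1, "emotion": "Happy", "face_emotion": "Happy"},
            {"id": 2, "emotion": "Happy", "face_emotion": "Neutral"},
            {"id": 3, "emotion": "Sad"},  # no face_emotion: skipped
        ],
        "has_video": True,
    }
    print(emotion_mismatches(sample))
```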

	Errors

	| Status | Meaning |
	|--------|---------|
	| 400 | No/invalid file, empty, or unsupported format |
	| 413 | File exceeds `MAX_UPLOAD_MB` |
	| 500 | Transcription or inference error |
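To avoid a round trip that ends in a 400 or 413, a client can run a rough check before uploading. The sketch below is an approximation, not the server's validation logic: `precheck_upload` is a hypothetical helper, and it assumes that the server checks the format by file extension and that 1 MB means 1024 × 1024 bytes. Neither assumption is confirmed by this README.

```python
from pathlib import Path
from typing import Optional

# Formats listed for POST /transcribe and POST /transcribe-diarize.
SUPPORTED_FORMATS = {"wav", "mp3", "flac", "ogg", "m4a", "webm", "mp4", "mov", "mkv"}


def precheck_upload(filename: str, size_bytes: int, max_upload_mb: int = 100) -> Optional[int]:
    """Return the HTTP status the upload would likely trigger, or None if it looks fine."""
    ext = Path(filename).suffix.lower().lstrip(".")
    if size_bytes <= 0 or ext not in SUPPORTED_FORMATS:
        return 400  # empty file or unsupported format
    if size_bytes > max_upload_mb * 1024 * 1024:
        return 413  # larger than MAX_UPLOAD_MB (assumed binary megabytes)
    return None


if __name__ == "__main__":
    print(precheck_upload("audio.m4a", 2_000_000))
    print(precheck_upload("notes.txt", 2_000_000))
```

A 500 cannot be predicted client-side, so the check deliberately covers only the two input-related statuses.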

	---

	### GET /health

	Response (200)

	```json
	{
	  "status": "ok",
	  "model": "mistralai/Voxtral-Mini-3B-2507 + YongkangZOU/evoxtral-lora (local)",
	  "model_loaded": true,
	  "ffmpeg": true,
	  "fer_enabled": true,
	  "device": "cpu",
	  "max_upload_mb": 100
	}
	```
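A deployment script can use this payload as a readiness gate. The sketch below uses only the standard library. `fetch_health` and `is_ready` are hypothetical helper names, and treating `status`, `model_loaded`, and `ffmpeg` as the readiness criteria is an assumption about how a client might interpret the fields.

```python
import json
from typing import Any, Dict
from urllib.request import urlopen


def fetch_health(base_url: str = "http://127.0.0.1:8000", timeout: float = 5.0) -> Dict[str, Any]:
    """GET /health and return the parsed JSON body."""
    with urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.load(resp)


def is_ready(health: Dict[str, Any], require_fer: bool = False) -> bool:
    """Decide whether the API can serve requests, based on a /health payload."""
    ok = (
        health.get("status") == "ok"
        and bool(health.get("model_loaded"))
        and bool(health.get("ffmpeg"))
    )
    if require_fer:
        # Only needed when video inputs must return face_emotion.
        ok = ok and bool(health.get("fer_enabled"))
    return ok
```

With the server running, `is_ready(fetch_health())` combines the two; pass `require_fer=True` when the caller depends on video emotion output.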

	---

	### GET /debug-inference

	Smoke-test endpoint: synthesizes 0.5 s of silence and runs a minimal `generate()` call. Useful for verifying the model is loaded and functional without uploading a real file.

	Response (200)

	```json
	{
	  "ok": true,
	  "text": "",
	  "dtype": "torch.bfloat16",
	  "device": "cpu"
	}
	```

	---

	## Environment variables

	| Variable | Default | Description |
	|----------|---------|-------------|
	| `MODEL_ID` | `mistralai/Voxtral-Mini-3B-2507` | Base Voxtral model on HF Hub |
	| `ADAPTER_ID` | `YongkangZOU/evoxtral-lora` | PEFT LoRA adapter on HF Hub |
	| `FER_MODEL_PATH` | (auto-detected) | Path to `emotion_model_web.onnx`; auto-detects `/app/models/` (Docker) and `../models/` (local) |
	| `MAX_UPLOAD_MB` | `100` | Max upload size in MB |
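The table can be read as the following resolution logic. This is a sketch under assumptions, not the code in `main.py`: the `Settings` dataclass and `load_settings` function are hypothetical, and the sketch assumes `../models/` is resolved relative to the current working directory.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional

FER_FILENAME = "emotion_model_web.onnx"


@dataclass(frozen=True)
class Settings:
    model_id: str
    adapter_id: str
    fer_model_path: Optional[Path]  # None means no FER model was found
    max_upload_mb: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Resolve settings from environment variables, using the documented defaults."""
    explicit = env.get("FER_MODEL_PATH")
    if explicit:
        fer_path: Optional[Path] = Path(explicit)
    else:
        # Auto-detect: Docker location first, then the local checkout.
        candidates = [Path("/app/models") / FER_FILENAME, Path("../models") / FER_FILENAME]
        fer_path = next((p for p in candidates if p.is_file()), None)
    return Settings(
        model_id=env.get("MODEL_ID", "mistralai/Voxtral-Mini-3B-2507"),
        adapter_id=env.get("ADAPTER_ID", "YongkangZOU/evoxtral-lora"),
        fer_model_path=fer_path,
        max_upload_mb=int(env.get("MAX_UPLOAD_MB", "100")),
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing the environment as a mapping keeps the function easy to test; in normal use it reads `os.environ`.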

	---

	## Usage examples

	```bash
	# Health
	curl -s http://127.0.0.1:8000/health

	# Smoke-test inference
	curl -s http://127.0.0.1:8000/debug-inference

	# Simple transcription
	curl -X POST http://127.0.0.1:8000/transcribe -F "audio=@audio.m4a"

	# Full pipeline (audio)
	curl -X POST http://127.0.0.1:8000/transcribe-diarize -F "audio=@audio.m4a"

	# Full pipeline (video — also returns face_emotion per segment)
	curl -X POST http://127.0.0.1:8000/transcribe-diarize -F "audio=@video.mov"
	```