---
title: Ethos Studio
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---
# Ethos Studio – Emotional Speech Recognition

Built for the Mistral AI Online Hackathon 2026 (W&B Fine-Tuning Track).

Ethos Studio is a full-stack emotional speech recognition platform combining real-time transcription, facial emotion recognition, and expressive audio tagging. It turns raw speech into richly annotated transcripts with emotions, non-verbal sounds, and delivery cues.
## Key Components

### Evoxtral – Expressive Tagged Transcription

A LoRA fine-tune of Voxtral-Mini-3B-2507 that produces transcriptions with inline ElevenLabs v3 audio tags. Two-stage pipeline: SFT (3 epochs) → RL via RAFT (rejection sampling, 1 epoch).
```
Standard ASR: So I was thinking maybe we could try that new restaurant downtown.
Evoxtral:     [nervous] So... [stammers] I was thinking maybe we could... [clears throat] try that new restaurant downtown? [laughs nervously]
```
Two model variants:
- Evoxtral SFT – best transcription accuracy (lowest WER)
- Evoxtral RL – best expressive tag accuracy (highest Tag F1)
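Downstream consumers usually need to separate the inline audio tags from the spoken words. A minimal sketch of that post-processing (the `split_tags` helper is illustrative, not part of the released tooling):

```python
import re

# Expressive tags appear inline in square brackets, e.g. [nervous], [clears throat].
TAG_RE = re.compile(r"\[([^\[\]]+)\]")

def split_tags(transcript: str) -> tuple[str, list[str]]:
    """Return (plain text, list of expressive tags) from a tagged transcript."""
    tags = TAG_RE.findall(transcript)
    text = TAG_RE.sub("", transcript)
    return " ".join(text.split()), tags  # collapse leftover whitespace

text, tags = split_tags(
    "[nervous] So... [stammers] I was thinking maybe we could... "
    "[clears throat] try that new restaurant downtown? [laughs nervously]"
)
# tags -> ['nervous', 'stammers', 'clears throat', 'laughs nervously']
```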
| Metric | Base Voxtral | Evoxtral SFT | Evoxtral RL | Best |
|---|---|---|---|---|
| WER ↓ | 6.64% | 4.47% | 5.12% | SFT |
| CER ↓ | 2.72% | 1.23% | 1.48% | SFT |
| Tag F1 ↑ | 22.0% | 67.2% | 69.4% | RL |
| Tag Recall ↑ | 22.0% | 69.4% | 72.7% | RL |
| Emphasis F1 ↑ | 42.0% | 84.0% | 86.0% | RL |
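WER in the table is the standard word-level edit distance divided by the reference length; a self-contained sketch of the metric (not the project's eval script):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] = edit distance between the first i ref words and first j hyp words
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (r != h))  # substitution (or match)
            prev = cur
    return dp[-1] / len(ref)

print(wer("try that new restaurant downtown", "try the new restaurant"))  # 0.4
```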
- SFT Model | RL Model
- Live Demo (HF Space)
- API (Swagger UI)
- W&B Dashboard
- Technical Report (PDF) | LaTeX source
### FER – Facial Emotion Recognition

MobileViT-XXS model trained on 8 emotion classes, exported to ONNX for real-time browser inference.

Classes: Anger, Contempt, Disgust, Fear, Happy, Neutral, Sad, Surprise
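In the browser the ONNX model emits one logit per class; the post-processing is just softmax + argmax over the 8 labels. A hedged Python sketch of that step (the logits below are made up for illustration):

```python
import math

FER_CLASSES = ["Anger", "Contempt", "Disgust", "Fear",
               "Happy", "Neutral", "Sad", "Surprise"]

def predict_label(logits: list[float]) -> tuple[str, float]:
    """Softmax over the 8 class logits; return (label, probability)."""
    m = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return FER_CLASSES[best], probs[best]

label, p = predict_label([0.1, -1.2, -0.8, 0.0, 2.3, 1.1, -0.5, 0.4])
# label -> 'Happy'
```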
### Voxtral Server – Speech-to-Text + Emotion

Speech-to-text service with VAD sentence segmentation and per-segment emotion analysis, powered by Voxtral Mini 4B.
## Architecture

```
Browser (port 3030) → Server layer (Node, :3000)   → Model layer (Python, :8000)

  - Studio UI          POST /api/speech-to-text      POST /transcribe
  - Upload dialog      POST /api/transcribe-diarize  POST /transcribe-diarize
                       GET  /health                  GET  /health
```
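The proxying in the diagram amounts to a small route table. A sketch of that mapping (the actual Node proxy lives in demo/server; the `upstream_url` helper is illustrative):

```python
# How the Server layer (:3000) maps its routes onto the Model layer (:8000).
MODEL_BASE = "http://localhost:8000"
ROUTES = {
    "/api/speech-to-text": "/transcribe",
    "/api/transcribe-diarize": "/transcribe-diarize",
    "/health": "/health",
}

def upstream_url(server_path: str) -> str:
    """Resolve a Server-layer path to the Model-layer URL it proxies to."""
    return MODEL_BASE + ROUTES[server_path]

print(upstream_url("/api/speech-to-text"))  # http://localhost:8000/transcribe
```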
| Layer | Path | Role |
|---|---|---|
| Model | `model/voxtral-server` | Voxtral inference, VAD segmentation, emotion analysis |
| Server | `demo/server` | API entrypoint; proxies to Model |
| Frontend | `demo` | Next.js UI (upload, Studio editor, waveform, timeline) |
| Evoxtral | `training/scripts/` | Training, eval, RL, serving for expressive transcription |
| FER | `models/` | Facial emotion recognition ONNX model |
See `demo/README.md` for the full API and usage; `model/voxtral-server/README.md` for the Model API.
## Project Structure

```
├── api/          # Python FastAPI – local Voxtral inference + FER
├── proxy/        # Node.js/Express – API gateway for frontend
├── web/          # Next.js – Studio editor UI
├── training/     # Fine-tuning code (SFT + RL), data prep, eval
│   └── scripts/  # Modal scripts: train, RL (RAFT), eval, serve
├── space/        # HuggingFace Space (Gradio demo)
├── models/       # FER ONNX model (MobileViT-XXS)
├── docs/         # Technical report, design docs, research refs
├── data/         # Training data scripts (audio files gitignored)
└── Dockerfile    # Single-container HF Spaces build
```
## How to Run

Requirements: Python 3.10+, Node.js 20+, ffmpeg; GPU recommended.

### Model layer (port 8000)

```shell
cd model/voxtral-server
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

### Server layer (port 3000)

```shell
cd demo/server && npm install && npm run dev
```

### Frontend (port 3030)

```shell
cd demo && npm install && npm run dev
```

Open http://localhost:3030.

### Evoxtral API (Modal)

```shell
modal deploy training/scripts/serve_modal.py
```
## Tech Stack
- Models: Voxtral-Mini-3B + LoRA, Voxtral-Mini-4B, MobileViT-XXS
- Training: PyTorch, PEFT, Weights & Biases
- Inference: Modal (serverless GPU), HuggingFace ZeroGPU, ONNX Runtime
- Backend: FastAPI, Node.js
- Frontend: Next.js, Gradio