---
title: Ethos Studio
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---

# Ethos Studio — Emotional Speech Recognition

Built for the Mistral AI Online Hackathon 2026 (W&B Fine-Tuning Track).

Ethos Studio is a full-stack emotional speech recognition platform combining real-time transcription, facial emotion recognition, and expressive audio tagging. It turns raw speech into richly annotated transcripts with emotions, non-verbal sounds, and delivery cues.

## Key Components

### Evoxtral — Expressive Tagged Transcription

A LoRA fine-tune of Voxtral-Mini-3B-2507 that produces transcriptions with inline ElevenLabs v3 audio tags. Training is a two-stage pipeline: SFT (3 epochs) → RL via RAFT (rejection sampling, 1 epoch).

**Standard ASR:** So I was thinking maybe we could try that new restaurant downtown.

**Evoxtral:** [nervous] So... [stammers] I was thinking maybe we could... [clears throat] try that new restaurant downtown? [laughs nervously]
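Downstream consumers may want the plain text and the tags separately. The exact tag grammar isn't specified in this README, so this is a minimal sketch assuming tags are lowercase bracketed spans like `[clears throat]`:

```python
import re

# Assumes tags are lowercase words/phrases in square brackets, e.g. [laughs nervously]
TAG_RE = re.compile(r"\[([a-z][a-z ]*)\]\s*")

def split_tags(transcript: str) -> tuple[str, list[str]]:
    """Separate an Evoxtral-style transcript into plain text and its inline tags."""
    tags = [m.group(1) for m in TAG_RE.finditer(transcript)]
    plain = TAG_RE.sub("", transcript).strip()
    return plain, tags

plain, tags = split_tags(
    "[nervous] So... [stammers] I was thinking maybe we could... "
    "[clears throat] try that new restaurant downtown? [laughs nervously]"
)
# tags -> ['nervous', 'stammers', 'clears throat', 'laughs nervously']
```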

Two model variants:

- **Evoxtral SFT** — best transcription accuracy (lowest WER)
- **Evoxtral RL** — best expressive tag accuracy (highest Tag F1)
| Metric | Base Voxtral | Evoxtral SFT | Evoxtral RL | Best |
|---|---|---|---|---|
| WER ↓ | 6.64% | 4.47% | 5.12% | SFT |
| CER ↓ | 2.72% | 1.23% | 1.48% | SFT |
| Tag F1 ↑ | 22.0% | 67.2% | 69.4% | RL |
| Tag Recall ↑ | 22.0% | 69.4% | 72.7% | RL |
| Emphasis F1 ↑ | 42.0% | 84.0% | 86.0% | RL |
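The metric definitions aren't spelled out in this README; as an illustration, a simple multiset-overlap version of tag precision/recall/F1 (a sketch, not the project's actual eval code) could look like:

```python
from collections import Counter

def tag_prf(pred_tags: list[str], ref_tags: list[str]) -> tuple[float, float, float]:
    """Multiset precision/recall/F1 over inline tags.

    Illustrative definition only; the project's real Tag F1 may differ
    (e.g. position-sensitive matching against the reference transcript).
    """
    pred, ref = Counter(pred_tags), Counter(ref_tags)
    tp = sum((pred & ref).values())  # tags present in both, counted with multiplicity
    if tp == 0:
        return 0.0, 0.0, 0.0
    precision = tp / sum(pred.values())
    recall = tp / sum(ref.values())
    return precision, recall, 2 * precision * recall / (precision + recall)

p, r, f1 = tag_prf(["laughs", "sighs"], ["laughs", "pause"])
# -> (0.5, 0.5, 0.5)
```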

### FER — Facial Emotion Recognition

MobileViT-XXS model trained on 8 emotion classes, exported to ONNX for real-time browser inference.

Classes: Anger, Contempt, Disgust, Fear, Happy, Neutral, Sad, Surprise
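The exported ONNX model can also be exercised outside the browser. A minimal Python sketch — the model filename, NCHW layout, 256×256 input size, and class order are assumptions; check the exported model's actual input signature and label map:

```python
import numpy as np

# Class order is an assumption; verify against the model's training label map.
EMOTIONS = ["Anger", "Contempt", "Disgust", "Fear",
            "Happy", "Neutral", "Sad", "Surprise"]

def softmax(logits: np.ndarray) -> np.ndarray:
    z = np.exp(logits - logits.max())
    return z / z.sum()

def top_emotion(logits: np.ndarray) -> tuple[str, float]:
    """Map one 8-way logit vector to (label, probability)."""
    probs = softmax(logits)
    i = int(probs.argmax())
    return EMOTIONS[i], float(probs[i])

if __name__ == "__main__":
    # Hypothetical path and input shape; adapt to the exported model.
    import onnxruntime as ort
    sess = ort.InferenceSession("models/fer.onnx")
    face = np.zeros((1, 3, 256, 256), dtype=np.float32)  # preprocessed face crop
    logits = sess.run(None, {sess.get_inputs()[0].name: face})[0][0]
    print(top_emotion(logits))
```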

### Voxtral Server — Speech-to-Text + Emotion

Speech-to-text service with VAD sentence segmentation and per-segment emotion analysis, powered by Voxtral Mini 4B.
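The `/transcribe` endpoint and port appear in the architecture diagram; the request format is not documented in this README, so this sketch assumes a raw-WAV POST body (the Content-Type and body shape are guesses — check the Model API docs):

```python
import urllib.request

def build_transcribe_request(audio_bytes: bytes,
                             url: str = "http://localhost:8000/transcribe"):
    """Build a POST to the Model layer's /transcribe endpoint.

    The endpoint and port come from the architecture diagram; the raw audio
    body and Content-Type header are assumptions about the request format.
    """
    return urllib.request.Request(
        url,
        data=audio_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )

# Send it (requires the Model layer to be running):
# with urllib.request.urlopen(build_transcribe_request(open("clip.wav", "rb").read())) as r:
#     print(r.read().decode())
```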

## Architecture

```
Browser (port 3030)  →  Server layer (Node, :3000)  →  Model layer (Python, :8000)
      ↑ Studio UI            POST /api/speech-to-text          POST /transcribe
      ↑ Upload dialog        POST /api/transcribe-diarize      POST /transcribe-diarize
                             GET  /health                      GET  /health
```
| Layer | Path | Role |
|---|---|---|
| Model | `model/voxtral-server` | Voxtral inference, VAD segmentation, emotion analysis |
| Server | `demo/server` | API entrypoint; proxies to Model |
| Frontend | `demo` | Next.js UI (upload, Studio editor, waveform, timeline) |
| Evoxtral | `training/scripts/` | Training, eval, RL, serving for expressive transcription |
| FER | `models/` | Facial emotion recognition ONNX model |

See `demo/README.md` for the full API and usage, and `model/voxtral-server/README.md` for the Model API.

## Project Structure

```
├── api/                    # Python FastAPI — local Voxtral inference + FER
├── proxy/                  # Node.js/Express — API gateway for frontend
├── web/                    # Next.js — Studio editor UI
├── training/               # Fine-tuning code (SFT + RL), data prep, eval
│   └── scripts/            # Modal scripts: train, RL (RAFT), eval, serve
├── space/                  # HuggingFace Space (Gradio demo)
├── models/                 # FER ONNX model (MobileViT-XXS)
├── docs/                   # Technical report, design docs, research refs
├── data/                   # Training data scripts (audio files gitignored)
└── Dockerfile              # Single-container HF Spaces build
```

## How to Run

Requirements: Python 3.10+, Node.js 20+, ffmpeg; GPU recommended.

### Model layer (port 8000)

```shell
cd model/voxtral-server
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

### Server layer (port 3000)

```shell
cd demo/server && npm install && npm run dev
```

### Frontend (port 3030)

```shell
cd demo && npm install && npm run dev
```

Open http://localhost:3030.

### Evoxtral API (Modal)

```shell
modal deploy training/scripts/serve_modal.py
```

## Tech Stack

- **Models:** Voxtral-Mini-3B + LoRA, Voxtral-Mini-4B, MobileViT-XXS
- **Training:** PyTorch, PEFT, Weights & Biases
- **Inference:** Modal (serverless GPU), HuggingFace ZeroGPU, ONNX Runtime
- **Backend:** FastAPI, Node.js
- **Frontend:** Next.js, Gradio

## Links