---
title: Ethos Studio
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---
# Ethos Studio – Emotional Speech Recognition

Built for the Mistral AI Online Hackathon 2026 (W&B Fine-Tuning Track).

Ethos Studio is a full-stack emotional speech recognition platform combining real-time transcription, facial emotion recognition, and expressive audio tagging. It turns raw speech into richly annotated transcripts with emotions, non-verbal sounds, and delivery cues.
## Key Components

### Evoxtral – Expressive Tagged Transcription

A LoRA fine-tune of Voxtral-Mini-3B-2507 that produces transcriptions with inline ElevenLabs v3 audio tags. Two-stage pipeline: SFT (3 epochs) → RL via RAFT (rejection sampling, 1 epoch).
```
Standard ASR: So I was thinking maybe we could try that new restaurant downtown.
Evoxtral:     [nervous] So... [stammers] I was thinking maybe we could... [clears throat] try that new restaurant downtown? [laughs nervously]
```
Two model variants:
- Evoxtral SFT – best transcription accuracy (lowest WER)
- Evoxtral RL – best expressive tag accuracy (highest Tag F1)
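Downstream consumers usually need to separate the inline audio tags from the spoken words. A minimal sketch of that post-processing (the `split_tags` helper is illustrative, not part of the released tooling):

```python
import re

# Expressive tags appear inline in square brackets, e.g. [nervous], [clears throat].
TAG_RE = re.compile(r"\[([^\[\]]+)\]")

def split_tags(transcript: str) -> tuple[str, list[str]]:
    """Return (plain text, list of expressive tags) from a tagged transcript."""
    tags = TAG_RE.findall(transcript)
    text = TAG_RE.sub("", transcript)
    return " ".join(text.split()), tags  # collapse leftover whitespace

text, tags = split_tags(
    "[nervous] So... [stammers] I was thinking maybe we could... "
    "[clears throat] try that new restaurant downtown? [laughs nervously]"
)
# tags -> ['nervous', 'stammers', 'clears throat', 'laughs nervously']
```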
| Metric | Base Voxtral | Evoxtral SFT | Evoxtral RL | Best |
|---|---|---|---|---|
| WER ↓ | 6.64% | 4.47% | 5.12% | SFT |
| CER ↓ | 2.72% | 1.23% | 1.48% | SFT |
| Tag F1 ↑ | 22.0% | 67.2% | 69.4% | RL |
| Tag Recall ↑ | 22.0% | 69.4% | 72.7% | RL |
| Emphasis F1 ↑ | 42.0% | 84.0% | 86.0% | RL |
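WER in the table is the standard word-level edit distance divided by the reference length; a self-contained sketch of the metric (not the project's eval script):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] = edit distance between the first i ref words and first j hyp words
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (r != h))  # substitution (or match)
            prev = cur
    return dp[-1] / len(ref)

print(wer("try that new restaurant downtown", "try the new restaurant"))  # 0.4
```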
- SFT Model | RL Model
- Live Demo (HF Space)
- API (Swagger UI)
- W&B Dashboard
- Technical Report (PDF) | LaTeX source
### FER – Facial Emotion Recognition

MobileViT-XXS model trained on 8 emotion classes, exported to ONNX for real-time browser inference.

Classes: Anger, Contempt, Disgust, Fear, Happy, Neutral, Sad, Surprise
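In the browser the ONNX model emits one logit per class; the post-processing is just softmax + argmax over the 8 labels. A hedged Python sketch of that step (the logits below are made up for illustration):

```python
import math

FER_CLASSES = ["Anger", "Contempt", "Disgust", "Fear",
               "Happy", "Neutral", "Sad", "Surprise"]

def predict_label(logits: list[float]) -> tuple[str, float]:
    """Softmax over the 8 class logits; return (label, probability)."""
    m = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return FER_CLASSES[best], probs[best]

label, p = predict_label([0.1, -1.2, -0.8, 0.0, 2.3, 1.1, -0.5, 0.4])
# label -> 'Happy'
```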
### Voxtral Server – Speech-to-Text + Emotion

Speech-to-text service with VAD sentence segmentation and per-segment emotion analysis, powered by Voxtral Mini 4B.
## Architecture

```
Browser (port 3030) → Server layer (Node, :3000)   → Model layer (Python, :8000)

  - Studio UI          POST /api/speech-to-text      POST /transcribe
  - Upload dialog      POST /api/transcribe-diarize  POST /transcribe-diarize
                       GET  /health                  GET  /health
```
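The proxying in the diagram amounts to a small route table. A sketch of that mapping (the actual Node proxy lives in demo/server; the `upstream_url` helper is illustrative):

```python
# How the Server layer (:3000) maps its routes onto the Model layer (:8000).
MODEL_BASE = "http://localhost:8000"
ROUTES = {
    "/api/speech-to-text": "/transcribe",
    "/api/transcribe-diarize": "/transcribe-diarize",
    "/health": "/health",
}

def upstream_url(server_path: str) -> str:
    """Resolve a Server-layer path to the Model-layer URL it proxies to."""
    return MODEL_BASE + ROUTES[server_path]

print(upstream_url("/api/speech-to-text"))  # http://localhost:8000/transcribe
```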
| Layer | Path | Role |
|---|---|---|
| Model | `model/voxtral-server` | Voxtral inference, VAD segmentation, emotion analysis |
| Server | `demo/server` | API entrypoint; proxies to Model |
| Frontend | `demo` | Next.js UI (upload, Studio editor, waveform, timeline) |
| Evoxtral | `training/scripts/` | Training, eval, RL, serving for expressive transcription |
| FER | `models/` | Facial emotion recognition ONNX model |
See `demo/README.md` for the full API and usage; `model/voxtral-server/README.md` for the Model API.
## Project Structure

```
├── api/          # Python FastAPI – local Voxtral inference + FER
├── proxy/        # Node.js/Express – API gateway for frontend
├── web/          # Next.js – Studio editor UI
├── training/     # Fine-tuning code (SFT + RL), data prep, eval
│   └── scripts/  # Modal scripts: train, RL (RAFT), eval, serve
├── space/        # HuggingFace Space (Gradio demo)
├── models/       # FER ONNX model (MobileViT-XXS)
├── docs/         # Technical report, design docs, research refs
├── data/         # Training data scripts (audio files gitignored)
└── Dockerfile    # Single-container HF Spaces build
```
## How to Run

Requirements: Python 3.10+, Node.js 20+, ffmpeg; GPU recommended.

### Model layer (port 8000)

```shell
cd model/voxtral-server
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

### Server layer (port 3000)

```shell
cd demo/server && npm install && npm run dev
```

### Frontend (port 3030)

```shell
cd demo && npm install && npm run dev
```

Open http://localhost:3030.

### Evoxtral API (Modal)

```shell
modal deploy training/scripts/serve_modal.py
```
## Tech Stack
- Models: Voxtral-Mini-3B + LoRA, Voxtral-Mini-4B, MobileViT-XXS
- Training: PyTorch, PEFT, Weights & Biases
- Inference: Modal (serverless GPU), HuggingFace ZeroGPU, ONNX Runtime
- Backend: FastAPI, Node.js
- Frontend: Next.js, Gradio