---
title: Ethos Studio
emoji: 🎀
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---
# Ethos Studio: Emotional Speech Recognition
Built for the **Mistral AI Online Hackathon 2026** (W&B Fine-Tuning Track).
Ethos Studio is a full-stack emotional speech recognition platform combining real-time transcription, facial emotion recognition, and expressive audio tagging. It turns raw speech into richly annotated transcripts with emotions, non-verbal sounds, and delivery cues.
## Key Components
### Evoxtral: Expressive Tagged Transcription
A LoRA fine-tune of [Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507) that produces transcriptions with inline [ElevenLabs v3](https://elevenlabs.io/docs/api-reference/text-to-speech) audio tags. Training is a two-stage pipeline: **SFT** (3 epochs) → **RL via RAFT** (rejection sampling, 1 epoch).
**Standard ASR:** `So I was thinking maybe we could try that new restaurant downtown.`
**Evoxtral:** `[nervous] So... [stammers] I was thinking maybe we could... [clears throat] try that new restaurant downtown? [laughs nervously]`
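Downstream code usually needs the tags and the spoken words separately. The snippet below is a minimal sketch of that split, assuming tags are always written as non-nested square brackets as in the example above; `split_tagged_transcript` is a hypothetical helper, not part of this repository.

```python
import re

# Matches one bracketed tag such as "[laughs nervously]".
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tagged_transcript(text):
    """Return (plain_text, tags) for a transcript with inline [tags]."""
    tags = TAG_PATTERN.findall(text)
    # Replace each tag with a space, then collapse runs of whitespace.
    plain = TAG_PATTERN.sub(" ", text)
    plain = re.sub(r"\s+", " ", plain).strip()
    return plain, tags


example = (
    "[nervous] So... [stammers] I was thinking maybe we could... "
    "[clears throat] try that new restaurant downtown? [laughs nervously]"
)
plain, tags = split_tagged_transcript(example)
print(tags)   # ['nervous', 'stammers', 'clears throat', 'laughs nervously']
print(plain)  # So... I was thinking maybe we could... try that new restaurant downtown?
```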
**Two model variants:**
- **[Evoxtral SFT](https://huggingface.co/YongkangZOU/evoxtral-lora)**: Best transcription accuracy (lowest WER)
- **[Evoxtral RL](https://huggingface.co/YongkangZOU/evoxtral-rl)**: Best expressive tag accuracy (highest Tag F1)
| Metric | Base Voxtral | Evoxtral SFT | Evoxtral RL | Best |
|--------|-------------|-------------|------------|------|
| **WER** ↓ | 6.64% | **4.47%** | 5.12% | SFT |
| **CER** ↓ | 2.72% | **1.23%** | 1.48% | SFT |
| **Tag F1** ↑ | 22.0% | 67.2% | **69.4%** | RL |
| **Tag Recall** ↑ | 22.0% | 69.4% | **72.7%** | RL |
| **Emphasis F1** ↑ | 42.0% | 84.0% | **86.0%** | RL |
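This README does not spell out how Tag F1 is computed. As a rough illustration only, the sketch below treats the bracketed tags of a reference and a hypothesis as case-insensitive multisets and scores their overlap; the project's evaluation scripts may match tags differently (for example by position), and `tag_prf` is a hypothetical name.

```python
import re
from collections import Counter

TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def tag_prf(reference, hypothesis):
    """Precision, recall and F1 over bracketed tags, compared as multisets."""
    ref = Counter(TAG_PATTERN.findall(reference.lower()))
    hyp = Counter(TAG_PATTERN.findall(hypothesis.lower()))
    # Multiset intersection: a tag counts once per matched occurrence.
    overlap = sum((ref & hyp).values())
    precision = overlap / sum(hyp.values()) if hyp else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


ref_text = "[nervous] So... [stammers] I was thinking [laughs]"
hyp_text = "[nervous] So I was thinking [laughs] [sighs]"
p, r, f = tag_prf(ref_text, hyp_text)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Here two of the three predicted tags match two of the three reference tags, so precision, recall, and F1 all come out at 2/3.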
- [SFT Model](https://huggingface.co/YongkangZOU/evoxtral-lora) | [RL Model](https://huggingface.co/YongkangZOU/evoxtral-rl)
- [Live Demo (HF Space)](https://huggingface.co/spaces/YongkangZOU/evoxtral)
- [API (Swagger UI)](https://yongkang-zou1999--evoxtral-api-evoxtralmodel-web.modal.run/docs)
- [W&B Dashboard](https://wandb.ai/yongkang-zou-ai/evoxtral)
- [Technical Report (PDF)](Evoxtral%20Technical%20Report.pdf) | [LaTeX source](docs/technical_report.tex)
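The RL stage is described above only as RAFT with rejection sampling. The general idea can be sketched as: sample several candidate transcriptions per input, score each with a reward, keep the best one, and fine-tune on the kept pairs. In the sketch below, the `generate` and `reward` callables, the `k` value, and the `min_reward` threshold are all hypothetical stand-ins; the actual sampler, reward, and settings live in `training/scripts/` and are not reproduced here.

```python
def raft_select(prompts, generate, reward, k=4, min_reward=0.0):
    """Rejection sampling step: keep the best of k candidates per prompt.

    generate(prompt, i) -> candidate string   (hypothetical sampler)
    reward(prompt, candidate) -> float        (hypothetical scorer)
    """
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt, i) for i in range(k)]
        best = max(candidates, key=lambda c: reward(prompt, c))
        # Drop the prompt entirely if even its best sample scores too low.
        if reward(prompt, best) >= min_reward:
            kept.append((prompt, best))
    return kept


# Toy stand-ins so the sketch runs without a model.
def toy_generate(prompt, i):
    return prompt + " [laughs]" * i


def toy_reward(prompt, candidate):
    # Pretend that exactly one tag is the ideal output.
    return -abs(candidate.count("[") - 1)


selected = raft_select(["hello there"], toy_generate, toy_reward, k=3, min_reward=-0.5)
print(selected)  # [('hello there', 'hello there [laughs]')]
```

The selected pairs would then serve as supervised targets for the next fine-tuning pass.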
### FER: Facial Emotion Recognition
MobileViT-XXS model trained on 8 emotion classes, exported to ONNX for real-time browser inference.
**Classes:** Anger, Contempt, Disgust, Fear, Happy, Neutral, Sad, Surprise
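A classifier like this typically emits one raw score (logit) per class, which the caller turns into a label. The sketch below shows that post-processing step in plain Python. It assumes the model's output order matches the class list above, which should be verified against the exported ONNX model; `top_emotion` is a hypothetical helper.

```python
import math

# Assumed label order: the same order as the class list in this README.
FER_CLASSES = ["Anger", "Contempt", "Disgust", "Fear",
               "Happy", "Neutral", "Sad", "Surprise"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_emotion(logits):
    """Return (label, probability) for the highest-scoring class."""
    if len(logits) != len(FER_CLASSES):
        raise ValueError(f"expected {len(FER_CLASSES)} logits, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return FER_CLASSES[best], probs[best]


label, confidence = top_emotion([0.1, -1.2, 0.0, 0.3, 2.5, 1.0, -0.4, 0.2])
print(label, round(confidence, 3))
```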
### Voxtral Server: Speech-to-Text + Emotion
Speech-to-text service with VAD sentence segmentation and per-segment emotion analysis, powered by [Voxtral Mini 4B](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602).
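The README does not say which VAD the server uses. Independent of that choice, the segmentation step usually amounts to merging per-frame speech decisions into time ranges. The following is a minimal sketch of that grouping only; the 30 ms frame size and the two-frame gap tolerance are illustrative assumptions, and `frames_to_segments` is a hypothetical name.

```python
def frames_to_segments(speech_flags, frame_ms=30, max_gap_frames=2):
    """Merge per-frame speech flags into (start_ms, end_ms) segments.

    Silence runs of up to max_gap_frames frames are bridged so that short
    pauses do not split a sentence.
    """
    segments = []
    start = None        # index of the first frame of the open segment
    last_speech = None  # index of the most recent speech frame
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech > max_gap_frames:
            # Silence has lasted too long: close the open segment.
            segments.append((start * frame_ms, (last_speech + 1) * frame_ms))
            start = None
    if start is not None:
        segments.append((start * frame_ms, (last_speech + 1) * frame_ms))
    return segments


flags = [0, 1, 1, 0, 1, 0, 0, 0, 1, 1]
print(frames_to_segments(flags))  # [(30, 150), (240, 300)]
```

Each resulting range could then be transcribed and scored for emotion on its own.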
## Architecture
```
Browser (port 3030)  →  Server layer (Node, :3000)    →  Model layer (Python, :8000)
  ↑ Studio UI           POST /api/speech-to-text         POST /transcribe
  ↑ Upload dialog       POST /api/transcribe-diarize     POST /transcribe-diarize
                        GET /health                      GET /health
```
| Layer | Path | Role |
|-------|------|------|
| **Model** | `model/voxtral-server` | Voxtral inference, VAD segmentation, emotion analysis |
| **Server** | `demo/server` | API entrypoint; proxies to Model |
| **Frontend** | `demo` | Next.js UI (upload, Studio editor, waveform, timeline) |
| **Evoxtral** | `training/scripts/` | Training, eval, RL, serving for expressive transcription |
| **FER** | `models/` | Facial emotion recognition ONNX model |
See [demo/README.md](demo/README.md) for the full API and usage, and [model/voxtral-server/README.md](model/voxtral-server/README.md) for the Model API.
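The route pairing in the architecture diagram can be written down as a small lookup table. The sketch below is illustrative only: the real Server layer is a Node service, the base URLs assume the local ports listed above, and treating the server's `/health` as a pass-through to the model's `/health` is an assumption.

```python
MODEL_BASE = "http://localhost:8000"  # assumed local Model layer address

# Server route -> (HTTP method, Model route), as listed in the diagram.
ROUTES = {
    "/api/speech-to-text": ("POST", "/transcribe"),
    "/api/transcribe-diarize": ("POST", "/transcribe-diarize"),
    "/health": ("GET", "/health"),  # assumption: forwarded rather than answered locally
}


def upstream_for(server_path):
    """Return (method, model_url) for a request arriving at the Server layer."""
    method, model_path = ROUTES[server_path]
    return method, MODEL_BASE + model_path


print(upstream_for("/api/speech-to-text"))
# ('POST', 'http://localhost:8000/transcribe')
```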
## Project Structure
```
├── api/          # Python FastAPI: local Voxtral inference + FER
├── proxy/        # Node.js/Express: API gateway for frontend
├── web/          # Next.js: Studio editor UI
├── training/     # Fine-tuning code (SFT + RL), data prep, eval
│   └── scripts/  # Modal scripts: train, RL (RAFT), eval, serve
├── space/        # HuggingFace Space (Gradio demo)
├── models/       # FER ONNX model (MobileViT-XXS)
├── docs/         # Technical report, design docs, research refs
├── data/         # Training data scripts (audio files gitignored)
└── Dockerfile    # Single-container HF Spaces build
```
## How to Run
**Requirements**: Python 3.10+, Node.js 20+, ffmpeg; GPU recommended.
### Model layer (port 8000)
```bash
cd model/voxtral-server
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```
### Server layer (port 3000)
```bash
cd demo/server && npm install && npm run dev
```
### Frontend (port 3030)
```bash
cd demo && npm install && npm run dev
```
Open [http://localhost:3030](http://localhost:3030).
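With three processes to start, it can help to confirm that each port is actually listening before opening the UI. The script below is a small sketch using only the standard library; the port numbers come from the steps above, while the `127.0.0.1` host and the 0.5 s timeout are assumptions. It checks TCP reachability only, not whether the services respond correctly.

```python
import socket

# Ports taken from the run instructions above.
SERVICES = {"model": 8000, "server": 3000, "frontend": 3030}


def port_is_open(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(services=SERVICES):
    """Map each service name to whether its port is reachable."""
    return {name: port_is_open(port) for name, port in services.items()}


if __name__ == "__main__":
    for name, up in report().items():
        print(f"{name:9s} {'up' if up else 'DOWN'}")
```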
### Evoxtral API (Modal)
```bash
modal deploy training/scripts/serve_modal.py
```
## Tech Stack
- **Models**: Voxtral-Mini-3B + LoRA, Voxtral-Mini-4B, MobileViT-XXS
- **Training**: PyTorch, PEFT, Weights & Biases
- **Inference**: Modal (serverless GPU), HuggingFace ZeroGPU, ONNX Runtime
- **Backend**: FastAPI, Node.js
- **Frontend**: Next.js, Gradio
## Links
- [W&B Project](https://wandb.ai/yongkang-zou-ai/evoxtral) | [W&B Eval Report](https://wandb.ai/yongkang-zou-ai/evoxtral/reports/Evoxtral-%E2%80%94-Evaluation-Results:-Base-vs-SFT-vs-RL--VmlldzoxNjA3MzI3Nw==)
- [Evoxtral SFT Model](https://huggingface.co/YongkangZOU/evoxtral-lora) | [Evoxtral RL Model](https://huggingface.co/YongkangZOU/evoxtral-rl)
- [Evoxtral Demo](https://huggingface.co/spaces/YongkangZOU/evoxtral)
- [Evoxtral API (Swagger)](https://yongkang-zou1999--evoxtral-api-evoxtralmodel-web.modal.run/docs)
- [Technical Report (PDF)](Evoxtral%20Technical%20Report.pdf) | [LaTeX](docs/technical_report.tex)