---
title: Ethos Studio
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---

# Ethos Studio — Emotional Speech Recognition

Built for the **Mistral AI Online Hackathon 2026** (W&B Fine-Tuning Track).

Ethos Studio is a full-stack emotional speech recognition platform combining real-time transcription, facial emotion recognition, and expressive audio tagging. It turns raw speech into richly annotated transcripts with emotions, non-verbal sounds, and delivery cues.

## Key Components

### Evoxtral — Expressive Tagged Transcription

LoRA finetune of [Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507) that produces transcriptions with inline [ElevenLabs v3](https://elevenlabs.io/docs/api-reference/text-to-speech) audio tags. Two-stage pipeline: **SFT** (3 epochs) → **RL via RAFT** (rejection sampling, 1 epoch).

**Standard ASR:** `So I was thinking maybe we could try that new restaurant downtown.`

**Evoxtral:** `[nervous] So... [stammers] I was thinking maybe we could... [clears throat] try that new restaurant downtown? [laughs nervously]`

**Two model variants:**

- **[Evoxtral SFT](https://huggingface.co/YongkangZOU/evoxtral-lora)** — Best transcription accuracy (lowest WER)
- **[Evoxtral RL](https://huggingface.co/YongkangZOU/evoxtral-rl)** — Best expressive tag accuracy (highest Tag F1)

| Metric | Base Voxtral | Evoxtral SFT | Evoxtral RL | Best |
|--------|--------------|--------------|-------------|------|
| **WER** ↓ | 6.64% | **4.47%** | 5.12% | SFT |
| **CER** ↓ | 2.72% | **1.23%** | 1.48% | SFT |
| **Tag F1** ↑ | 22.0% | 67.2% | **69.4%** | RL |
| **Tag Recall** ↑ | 22.0% | 69.4% | **72.7%** | RL |
| **Emphasis F1** ↑ | 42.0% | 84.0% | **86.0%** | RL |

- [SFT Model](https://huggingface.co/YongkangZOU/evoxtral-lora) | [RL Model](https://huggingface.co/YongkangZOU/evoxtral-rl)
- [Live Demo (HF Space)](https://huggingface.co/spaces/YongkangZOU/evoxtral)
- [API (Swagger UI)](https://yongkang-zou1999--evoxtral-api-evoxtralmodel-web.modal.run/docs)
- [W&B Dashboard](https://wandb.ai/yongkang-zou-ai/evoxtral)
- [Technical Report (PDF)](Evoxtral%20Technical%20Report.pdf) | [LaTeX source](docs/technical_report.tex)

### FER — Facial Emotion Recognition

MobileViT-XXS model trained on 8 emotion classes, exported to ONNX for real-time browser inference.

**Classes:** Anger, Contempt, Disgust, Fear, Happy, Neutral, Sad, Surprise

### Voxtral Server — Speech-to-Text + Emotion

Speech-to-text service with VAD sentence segmentation and per-segment emotion analysis, powered by [Voxtral Mini 4B](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602).

## Architecture

```
Browser (port 3030)  →  Server layer (Node, :3000)      →  Model layer (Python, :8000)
  ↑ Studio UI           POST /api/speech-to-text           POST /transcribe
  ↑ Upload dialog       POST /api/transcribe-diarize       POST /transcribe-diarize
                        GET  /health                       GET  /health
```

| Layer | Path | Role |
|-------|------|------|
| **Model** | `model/voxtral-server` | Voxtral inference, VAD segmentation, emotion analysis |
| **Server** | `demo/server` | API entrypoint; proxies to Model |
| **Frontend** | `demo` | Next.js UI (upload, Studio editor, waveform, timeline) |
| **Evoxtral** | `training/scripts/` | Training, eval, RL, serving for expressive transcription |
| **FER** | `models/` | Facial emotion recognition ONNX model |

See [demo/README.md](demo/README.md) for full API and usage; [model/voxtral-server/README.md](model/voxtral-server/README.md) for the Model API.
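To sanity-check the flow above end to end, you can POST an audio file directly to the Model layer. A minimal sketch, assuming `/transcribe` accepts a multipart upload under a `file` field (the exact request and response schema are documented in [model/voxtral-server/README.md](model/voxtral-server/README.md)):

```python
# Smoke test against the Model layer (port 8000). The multipart field
# name "file" is an assumption; check model/voxtral-server/README.md
# for the actual request schema.
import requests

with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/transcribe",
        files={"file": ("sample.wav", f, "audio/wav")},
        timeout=120,  # first call may be slow while model weights load
    )

resp.raise_for_status()
print(resp.json())  # transcript segments with per-segment emotion labels
```

The same request shape should also work against the Server layer at `http://localhost:3000/api/speech-to-text`, since it proxies to the Model layer.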
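The FER model from Key Components can likewise be exercised outside the browser with ONNX Runtime. A hedged sketch: the file name under `models/`, the NCHW input layout, the 256×256 fallback resolution, and the softmax-over-logits output are all assumptions to verify against the exported model:

```python
# Exercise the FER ONNX model (models/) outside the browser.
# Assumptions to verify against the exported model: file name, NCHW
# float32 input, 256x256 fallback resolution, single logits output.
import numpy as np
import onnxruntime as ort

CLASSES = ["Anger", "Contempt", "Disgust", "Fear",
           "Happy", "Neutral", "Sad", "Surprise"]

sess = ort.InferenceSession("models/fer_mobilevit_xxs.onnx")  # hypothetical file name
inp = sess.get_inputs()[0]

# Use the model's declared spatial dims, falling back to 256 if dynamic.
h, w = (d if isinstance(d, int) else 256 for d in inp.shape[2:])
dummy_face = np.random.rand(1, 3, h, w).astype(np.float32)  # stand-in for a real crop

(logits,) = sess.run(None, {inp.name: dummy_face})
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()
print(CLASSES[int(probs.argmax())], float(probs.max()))
```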
## Project Structure

```
├── api/          # Python FastAPI — local Voxtral inference + FER
├── proxy/        # Node.js/Express — API gateway for frontend
├── web/          # Next.js — Studio editor UI
├── training/     # Fine-tuning code (SFT + RL), data prep, eval
│   └── scripts/  # Modal scripts: train, RL (RAFT), eval, serve
├── space/        # HuggingFace Space (Gradio demo)
├── models/       # FER ONNX model (MobileViT-XXS)
├── docs/         # Technical report, design docs, research refs
├── data/         # Training data scripts (audio files gitignored)
└── Dockerfile    # Single-container HF Spaces build
```

## How to Run

**Requirements**: Python 3.10+, Node.js 20+, ffmpeg; GPU recommended.

### Model layer (port 8000)

```bash
cd model/voxtral-server
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

### Server layer (port 3000)

```bash
cd demo/server && npm install && npm run dev
```

### Frontend (port 3030)

```bash
cd demo && npm install && npm run dev
```

Open [http://localhost:3030](http://localhost:3030).

### Evoxtral API (Modal)

```bash
modal deploy training/scripts/serve_modal.py
```

## Tech Stack

- **Models**: Voxtral-Mini-3B + LoRA, Voxtral-Mini-4B, MobileViT-XXS
- **Training**: PyTorch, PEFT, Weights & Biases
- **Inference**: Modal (serverless GPU), HuggingFace ZeroGPU, ONNX Runtime
- **Backend**: FastAPI, Node.js
- **Frontend**: Next.js, Gradio

## Links

- [W&B Project](https://wandb.ai/yongkang-zou-ai/evoxtral) | [W&B Eval Report](https://wandb.ai/yongkang-zou-ai/evoxtral/reports/Evoxtral-—-Evaluation-Results:-Base-vs-SFT-vs-RL--VmlldzoxNjA3MzI3Nw==)
- [Evoxtral SFT Model](https://huggingface.co/YongkangZOU/evoxtral-lora) | [Evoxtral RL Model](https://huggingface.co/YongkangZOU/evoxtral-rl)
- [Evoxtral Demo](https://huggingface.co/spaces/YongkangZOU/evoxtral)
- [Evoxtral API (Swagger)](https://yongkang-zou1999--evoxtral-api-evoxtralmodel-web.modal.run/docs)
- [Technical Report (PDF)](Evoxtral%20Technical%20Report.pdf) | [LaTeX](docs/technical_report.tex)
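Complementing the `modal deploy` step under How to Run: a hedged sketch of calling the hosted Evoxtral API. The `/transcribe` route and the multipart field name are assumptions; the [Swagger UI](https://yongkang-zou1999--evoxtral-api-evoxtralmodel-web.modal.run/docs) linked above documents the real contract.

```python
# Call the deployed Evoxtral API on Modal. The route and field name are
# assumptions; confirm both in the Swagger UI before relying on this.
import requests

BASE = "https://yongkang-zou1999--evoxtral-api-evoxtralmodel-web.modal.run"

with open("sample.wav", "rb") as f:
    resp = requests.post(
        f"{BASE}/transcribe",
        files={"file": ("sample.wav", f, "audio/wav")},
        timeout=300,  # serverless GPU cold starts can take a while
    )

resp.raise_for_status()
print(resp.json())  # expressive transcript with inline [audio tags]
```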