---
title: Ethos Studio
emoji: 🤖
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---
# Ethos Studio – Emotional Speech Recognition

Built for the **Mistral AI Online Hackathon 2026** (W&B Fine-Tuning Track).

Ethos Studio is a full-stack emotional speech recognition platform combining real-time transcription, facial emotion recognition, and expressive audio tagging. It turns raw speech into richly annotated transcripts with emotions, non-verbal sounds, and delivery cues.
## Key Components

### Evoxtral – Expressive Tagged Transcription

LoRA fine-tune of [Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507) that produces transcriptions with inline [ElevenLabs v3](https://elevenlabs.io/docs/api-reference/text-to-speech) audio tags. Two-stage pipeline: **SFT** (3 epochs) → **RL via RAFT** (rejection sampling, 1 epoch).
**Standard ASR:** `So I was thinking maybe we could try that new restaurant downtown.`

**Evoxtral:** `[nervous] So... [stammers] I was thinking maybe we could... [clears throat] try that new restaurant downtown? [laughs nervously]`
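For downstream use, the inline tags can be separated from the plain transcript with a simple regex. This helper is illustrative only (it is not part of the repo), assuming tags always take the `[...]` form shown above:

```python
import re

def split_tags(transcript: str) -> tuple[str, list[str]]:
    """Split an Evoxtral-style transcript into plain text and its inline tags."""
    tags = re.findall(r"\[([^\]]+)\]", transcript)       # collect tag contents
    text = re.sub(r"\s*\[[^\]]+\]\s*", " ", transcript)  # drop [tag] spans
    return re.sub(r"\s+", " ", text).strip(), tags

text, tags = split_tags(
    "[nervous] So... [stammers] I was thinking maybe we could... "
    "[clears throat] try that new restaurant downtown? [laughs nervously]"
)
# text -> "So... I was thinking maybe we could... try that new restaurant downtown?"
# tags -> ["nervous", "stammers", "clears throat", "laughs nervously"]
```

This also makes it easy to score plain-text WER on Evoxtral output by stripping the tags first.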
**Two model variants:**

- **[Evoxtral SFT](https://huggingface.co/YongkangZOU/evoxtral-lora)** – Best transcription accuracy (lowest WER)
- **[Evoxtral RL](https://huggingface.co/YongkangZOU/evoxtral-rl)** – Best expressive tag accuracy (highest Tag F1)
| Metric | Base Voxtral | Evoxtral SFT | Evoxtral RL | Best |
|--------|--------------|--------------|-------------|------|
| **WER** ↓ | 6.64% | **4.47%** | 5.12% | SFT |
| **CER** ↓ | 2.72% | **1.23%** | 1.48% | SFT |
| **Tag F1** ↑ | 22.0% | 67.2% | **69.4%** | RL |
| **Tag Recall** ↑ | 22.0% | 69.4% | **72.7%** | RL |
| **Emphasis F1** ↑ | 42.0% | 84.0% | **86.0%** | RL |
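For reference, WER is the word-level Levenshtein distance (substitutions + deletions + insertions) divided by the reference word count; CER is the same at character level. The eval scripts likely use a library for this (an assumption), but a minimal implementation is short:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("try that new restaurant downtown",
          "try the new restaurant downtown"))  # 0.2 (1 substitution / 5 words)
```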
- [SFT Model](https://huggingface.co/YongkangZOU/evoxtral-lora) | [RL Model](https://huggingface.co/YongkangZOU/evoxtral-rl)
- [Live Demo (HF Space)](https://huggingface.co/spaces/YongkangZOU/evoxtral)
- [API (Swagger UI)](https://yongkang-zou1999--evoxtral-api-evoxtralmodel-web.modal.run/docs)
- [W&B Dashboard](https://wandb.ai/yongkang-zou-ai/evoxtral)
- [Technical Report (PDF)](Evoxtral%20Technical%20Report.pdf) | [LaTeX source](docs/technical_report.tex)
### FER – Facial Emotion Recognition

MobileViT-XXS model trained on 8 emotion classes, exported to ONNX for real-time browser inference.

**Classes:** Anger, Contempt, Disgust, Fear, Happy, Neutral, Sad, Surprise
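A typical inference path for an 8-class model like this maps the output logits to a label via softmax + argmax. In the repo the model runs in the browser via ONNX Runtime; the sketch below shows only the post-processing step and assumes the output order matches the class list above:

```python
import math

FER_CLASSES = ["Anger", "Contempt", "Disgust", "Fear",
               "Happy", "Neutral", "Sad", "Surprise"]

def predict_emotion(logits: list[float]) -> tuple[str, float]:
    """Return the top emotion label and its softmax probability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return FER_CLASSES[best], probs[best]

label, prob = predict_emotion([0.1, -1.2, 0.0, -0.5, 3.4, 1.0, -0.3, 0.2])
# label -> "Happy"
```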
### Voxtral Server – Speech-to-Text + Emotion

Speech-to-text service with VAD sentence segmentation and per-segment emotion analysis, powered by [Voxtral Mini 4B](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602).
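The VAD step can be pictured as splitting the waveform wherever frame energy stays below a threshold. This energy-based sketch is an assumption for illustration — the actual service may well use a dedicated VAD model rather than raw energy:

```python
def segment_by_energy(samples, frame_size=160, threshold=0.01):
    """Return (start, end) sample ranges covering contiguous voiced frames."""
    segments, start = [], None
    for i in range(0, len(samples), frame_size):
        frame = samples[i:i + frame_size]
        energy = sum(s * s for s in frame) / len(frame)  # mean power of the frame
        if energy >= threshold and start is None:
            start = i                                    # voiced: open a segment
        elif energy < threshold and start is not None:
            segments.append((start, i))                  # silence: close segment
            start = None
    if start is not None:
        segments.append((start, len(samples)))           # audio ended mid-segment
    return segments

# silence, a burst of "speech", silence (16 kHz samples -> 10 ms frames)
audio = [0.0] * 320 + [0.5] * 320 + [0.0] * 320
print(segment_by_energy(audio))  # [(320, 640)]
```

Each returned range would then be transcribed and scored for emotion independently.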
## Architecture

```
Browser (port 3030)  →  Server layer (Node, :3000)    →  Model layer (Python, :8000)
• Studio UI             POST /api/speech-to-text         POST /transcribe
• Upload dialog         POST /api/transcribe-diarize     POST /transcribe-diarize
                        GET /health                      GET /health
```
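The server layer is a thin proxy, so its behavior reduces to a route table onto the model layer. The endpoint names below come from the diagram above; the mapping of each server route to its model route, and the helper itself, are illustrative assumptions:

```python
# Server-layer route -> model-layer route, per the architecture diagram.
ROUTE_MAP = {
    "/api/speech-to-text": "/transcribe",
    "/api/transcribe-diarize": "/transcribe-diarize",
    "/health": "/health",
}

MODEL_BASE = "http://localhost:8000"  # model layer (Python)

def upstream_url(server_path: str) -> str:
    """Resolve a server-layer path to the model-layer URL it proxies to."""
    try:
        return MODEL_BASE + ROUTE_MAP[server_path]
    except KeyError:
        raise ValueError(f"unknown route: {server_path}") from None

print(upstream_url("/api/speech-to-text"))  # http://localhost:8000/transcribe
```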
| Layer | Path | Role |
|-------|------|------|
| **Model** | `model/voxtral-server` | Voxtral inference, VAD segmentation, emotion analysis |
| **Server** | `demo/server` | API entrypoint; proxies to Model |
| **Frontend** | `demo` | Next.js UI (upload, Studio editor, waveform, timeline) |
| **Evoxtral** | `training/scripts/` | Training, eval, RL, serving for expressive transcription |
| **FER** | `models/` | Facial emotion recognition ONNX model |

See [demo/README.md](demo/README.md) for the full API and usage; [model/voxtral-server/README.md](model/voxtral-server/README.md) for the Model API.
## Project Structure

```
├── api/          # Python FastAPI – local Voxtral inference + FER
├── proxy/        # Node.js/Express – API gateway for frontend
├── web/          # Next.js – Studio editor UI
├── training/     # Fine-tuning code (SFT + RL), data prep, eval
│   └── scripts/  # Modal scripts: train, RL (RAFT), eval, serve
├── space/        # HuggingFace Space (Gradio demo)
├── models/       # FER ONNX model (MobileViT-XXS)
├── docs/         # Technical report, design docs, research refs
├── data/         # Training data scripts (audio files gitignored)
└── Dockerfile    # Single-container HF Spaces build
```
## How to Run

**Requirements**: Python 3.10+, Node.js 20+, ffmpeg; GPU recommended.
### Model layer (port 8000)

```bash
cd model/voxtral-server
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```
### Server layer (port 3000)

```bash
cd demo/server && npm install && npm run dev
```
### Frontend (port 3030)

```bash
cd demo && npm install && npm run dev
```

Open [http://localhost:3030](http://localhost:3030).
### Evoxtral API (Modal)

```bash
modal deploy training/scripts/serve_modal.py
```
## Tech Stack

- **Models**: Voxtral-Mini-3B + LoRA, Voxtral-Mini-4B, MobileViT-XXS
- **Training**: PyTorch, PEFT, Weights & Biases
- **Inference**: Modal (serverless GPU), HuggingFace ZeroGPU, ONNX Runtime
- **Backend**: FastAPI, Node.js
- **Frontend**: Next.js, Gradio
## Links

- [W&B Project](https://wandb.ai/yongkang-zou-ai/evoxtral) | [W&B Eval Report](https://wandb.ai/yongkang-zou-ai/evoxtral/reports/Evoxtral-β-Evaluation-Results:-Base-vs-SFT-vs-RL--VmlldzoxNjA3MzI3Nw==)
- [Evoxtral SFT Model](https://huggingface.co/YongkangZOU/evoxtral-lora) | [Evoxtral RL Model](https://huggingface.co/YongkangZOU/evoxtral-rl)
- [Evoxtral Demo](https://huggingface.co/spaces/YongkangZOU/evoxtral)
- [Evoxtral API (Swagger)](https://yongkang-zou1999--evoxtral-api-evoxtralmodel-web.modal.run/docs)
- [Technical Report (PDF)](Evoxtral%20Technical%20Report.pdf) | [LaTeX](docs/technical_report.tex)