---
title: Ethos Studio
emoji: 🤖
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---
# Ethos Studio – Emotional Speech Recognition

Built for the **Mistral AI Online Hackathon 2026** (W&B Fine-Tuning Track).

Ethos Studio is a full-stack emotional speech recognition platform combining real-time transcription, facial emotion recognition, and expressive audio tagging. It turns raw speech into richly annotated transcripts with emotions, non-verbal sounds, and delivery cues.
## Key Components

### Evoxtral – Expressive Tagged Transcription

LoRA fine-tune of [Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507) that produces transcriptions with inline [ElevenLabs v3](https://elevenlabs.io/docs/api-reference/text-to-speech) audio tags. Two-stage pipeline: **SFT** (3 epochs) → **RL via RAFT** (rejection sampling, 1 epoch).
**Standard ASR:** `So I was thinking maybe we could try that new restaurant downtown.`

**Evoxtral:** `[nervous] So... [stammers] I was thinking maybe we could... [clears throat] try that new restaurant downtown? [laughs nervously]`
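For downstream use, the inline tags can be separated from the plain transcript with a simple regex. This helper is illustrative only (it is not part of the repo), assuming tags always take the `[...]` form shown above:

```python
import re

def split_tags(transcript: str) -> tuple[str, list[str]]:
    """Split an Evoxtral-style transcript into plain text and its inline tags."""
    tags = re.findall(r"\[([^\]]+)\]", transcript)       # collect tag contents
    text = re.sub(r"\s*\[[^\]]+\]\s*", " ", transcript)  # drop [tag] spans
    return re.sub(r"\s+", " ", text).strip(), tags

text, tags = split_tags(
    "[nervous] So... [stammers] I was thinking maybe we could... "
    "[clears throat] try that new restaurant downtown? [laughs nervously]"
)
# text -> "So... I was thinking maybe we could... try that new restaurant downtown?"
# tags -> ["nervous", "stammers", "clears throat", "laughs nervously"]
```

This also makes it easy to score plain-text WER on Evoxtral output by stripping the tags first.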
**Two model variants:**

- **[Evoxtral SFT](https://huggingface.co/YongkangZOU/evoxtral-lora)** – Best transcription accuracy (lowest WER)
- **[Evoxtral RL](https://huggingface.co/YongkangZOU/evoxtral-rl)** – Best expressive tag accuracy (highest Tag F1)
| Metric | Base Voxtral | Evoxtral SFT | Evoxtral RL | Best |
|--------|--------------|--------------|-------------|------|
| **WER** ↓ | 6.64% | **4.47%** | 5.12% | SFT |
| **CER** ↓ | 2.72% | **1.23%** | 1.48% | SFT |
| **Tag F1** ↑ | 22.0% | 67.2% | **69.4%** | RL |
| **Tag Recall** ↑ | 22.0% | 69.4% | **72.7%** | RL |
| **Emphasis F1** ↑ | 42.0% | 84.0% | **86.0%** | RL |
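For reference, WER is the word-level Levenshtein distance (substitutions + deletions + insertions) divided by the reference word count; CER is the same at character level. The eval scripts likely use a library for this (an assumption), but a minimal implementation is short:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("try that new restaurant downtown",
          "try the new restaurant downtown"))  # 0.2 (1 substitution / 5 words)
```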
- [SFT Model](https://huggingface.co/YongkangZOU/evoxtral-lora) | [RL Model](https://huggingface.co/YongkangZOU/evoxtral-rl)
- [Live Demo (HF Space)](https://huggingface.co/spaces/YongkangZOU/evoxtral)
- [API (Swagger UI)](https://yongkang-zou1999--evoxtral-api-evoxtralmodel-web.modal.run/docs)
- [W&B Dashboard](https://wandb.ai/yongkang-zou-ai/evoxtral)
- [Technical Report (PDF)](Evoxtral%20Technical%20Report.pdf) | [LaTeX source](docs/technical_report.tex)
### FER – Facial Emotion Recognition

MobileViT-XXS model trained on 8 emotion classes, exported to ONNX for real-time browser inference.

**Classes:** Anger, Contempt, Disgust, Fear, Happy, Neutral, Sad, Surprise
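A typical inference path for an 8-class model like this maps the output logits to a label via softmax + argmax. In the repo the model runs in the browser via ONNX Runtime; the sketch below shows only the post-processing step and assumes the output order matches the class list above:

```python
import math

FER_CLASSES = ["Anger", "Contempt", "Disgust", "Fear",
               "Happy", "Neutral", "Sad", "Surprise"]

def predict_emotion(logits: list[float]) -> tuple[str, float]:
    """Return the top emotion label and its softmax probability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return FER_CLASSES[best], probs[best]

label, prob = predict_emotion([0.1, -1.2, 0.0, -0.5, 3.4, 1.0, -0.3, 0.2])
# label -> "Happy"
```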
### Voxtral Server – Speech-to-Text + Emotion

Speech-to-text service with VAD sentence segmentation and per-segment emotion analysis, powered by [Voxtral Mini 4B](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602).
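The VAD step can be pictured as splitting the waveform wherever frame energy stays below a threshold. This energy-based sketch is an assumption for illustration — the actual service may well use a dedicated VAD model rather than raw energy:

```python
def segment_by_energy(samples, frame_size=160, threshold=0.01):
    """Return (start, end) sample ranges covering contiguous voiced frames."""
    segments, start = [], None
    for i in range(0, len(samples), frame_size):
        frame = samples[i:i + frame_size]
        energy = sum(s * s for s in frame) / len(frame)  # mean power of the frame
        if energy >= threshold and start is None:
            start = i                                    # voiced: open a segment
        elif energy < threshold and start is not None:
            segments.append((start, i))                  # silence: close segment
            start = None
    if start is not None:
        segments.append((start, len(samples)))           # audio ended mid-segment
    return segments

# silence, a burst of "speech", silence (16 kHz samples -> 10 ms frames)
audio = [0.0] * 320 + [0.5] * 320 + [0.0] * 320
print(segment_by_energy(audio))  # [(320, 640)]
```

Each returned range would then be transcribed and scored for emotion independently.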
## Architecture

```
Browser (port 3030)  →  Server layer (Node, :3000)    →  Model layer (Python, :8000)
• Studio UI             POST /api/speech-to-text         POST /transcribe
• Upload dialog         POST /api/transcribe-diarize     POST /transcribe-diarize
                        GET /health                      GET /health
```
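The server layer is a thin proxy, so its behavior reduces to a route table onto the model layer. The endpoint names below come from the diagram above; the mapping of each server route to its model route, and the helper itself, are illustrative assumptions:

```python
# Server-layer route -> model-layer route, per the architecture diagram.
ROUTE_MAP = {
    "/api/speech-to-text": "/transcribe",
    "/api/transcribe-diarize": "/transcribe-diarize",
    "/health": "/health",
}

MODEL_BASE = "http://localhost:8000"  # model layer (Python)

def upstream_url(server_path: str) -> str:
    """Resolve a server-layer path to the model-layer URL it proxies to."""
    try:
        return MODEL_BASE + ROUTE_MAP[server_path]
    except KeyError:
        raise ValueError(f"unknown route: {server_path}") from None

print(upstream_url("/api/speech-to-text"))  # http://localhost:8000/transcribe
```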
| Layer | Path | Role |
|-------|------|------|
| **Model** | `model/voxtral-server` | Voxtral inference, VAD segmentation, emotion analysis |
| **Server** | `demo/server` | API entrypoint; proxies to Model |
| **Frontend** | `demo` | Next.js UI (upload, Studio editor, waveform, timeline) |
| **Evoxtral** | `training/scripts/` | Training, eval, RL, serving for expressive transcription |
| **FER** | `models/` | Facial emotion recognition ONNX model |

See [demo/README.md](demo/README.md) for the full API and usage; [model/voxtral-server/README.md](model/voxtral-server/README.md) for the Model API.
## Project Structure

```
├── api/          # Python FastAPI – local Voxtral inference + FER
├── proxy/        # Node.js/Express – API gateway for frontend
├── web/          # Next.js – Studio editor UI
├── training/     # Fine-tuning code (SFT + RL), data prep, eval
│   └── scripts/  # Modal scripts: train, RL (RAFT), eval, serve
├── space/        # HuggingFace Space (Gradio demo)
├── models/       # FER ONNX model (MobileViT-XXS)
├── docs/         # Technical report, design docs, research refs
├── data/         # Training data scripts (audio files gitignored)
└── Dockerfile    # Single-container HF Spaces build
```
## How to Run

**Requirements**: Python 3.10+, Node.js 20+, ffmpeg; GPU recommended.
### Model layer (port 8000)

```bash
cd model/voxtral-server
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```
### Server layer (port 3000)

```bash
cd demo/server && npm install && npm run dev
```
### Frontend (port 3030)

```bash
cd demo && npm install && npm run dev
```

Open [http://localhost:3030](http://localhost:3030).
### Evoxtral API (Modal)

```bash
modal deploy training/scripts/serve_modal.py
```
## Tech Stack

- **Models**: Voxtral-Mini-3B + LoRA, Voxtral-Mini-4B, MobileViT-XXS
- **Training**: PyTorch, PEFT, Weights & Biases
- **Inference**: Modal (serverless GPU), HuggingFace ZeroGPU, ONNX Runtime
- **Backend**: FastAPI, Node.js
- **Frontend**: Next.js, Gradio
## Links

- [W&B Project](https://wandb.ai/yongkang-zou-ai/evoxtral) | [W&B Eval Report](https://wandb.ai/yongkang-zou-ai/evoxtral/reports/Evoxtral-β-Evaluation-Results:-Base-vs-SFT-vs-RL--VmlldzoxNjA3MzI3Nw==)
- [Evoxtral SFT Model](https://huggingface.co/YongkangZOU/evoxtral-lora) | [Evoxtral RL Model](https://huggingface.co/YongkangZOU/evoxtral-rl)
- [Evoxtral Demo](https://huggingface.co/spaces/YongkangZOU/evoxtral)
- [Evoxtral API (Swagger)](https://yongkang-zou1999--evoxtral-api-evoxtralmodel-web.modal.run/docs)
- [Technical Report (PDF)](Evoxtral%20Technical%20Report.pdf) | [LaTeX](docs/technical_report.tex)