docs: update README with Studio features and API format

README.md CHANGED
@@ -12,6 +12,31 @@ pinned: false
 
 Speech-to-text with VAD sentence segmentation, per-segment emotion tagging, and facial emotion recognition (FER), powered by a fine-tuned [Voxtral Mini 3B](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507) + [evoxtral-lora](https://huggingface.co/YongkangZOU/evoxtral-lora) running locally, and a MobileViT-XXS ONNX model for FER.
 
 ## Repository layout
 
 ```
@@ -31,11 +56,11 @@ ser/
 
 ```
 Browser (:3030)
-  │ Next.js UI (upload, Studio editor,
+  │ Next.js UI (upload, Studio editor, timeline, live emotion panel)
 Node proxy (:3000)
   │ Express – streams multipart upload, manages session state
 Python API (:8000)
-  ├─ POST /transcribe-diarize → VAD + Voxtral STT + emotion tags
+  ├─ POST /transcribe-diarize → VAD + Voxtral STT + emotion tags + FER timeline
   └─ POST /fer → per-frame FER via MobileViT-XXS ONNX
 ```
 
@@ -45,10 +70,60 @@ nginx on port 7860 (HF Spaces public port) routes:
 
 | Directory | Port | Role |
 |-----------|------|------|
-| `api/` | 8000 | Voxtral local inference; VAD segmentation; per-segment emotion; FER |
+| `api/` | 8000 | Voxtral local inference; VAD segmentation; per-segment emotion; FER timeline |
 | `proxy/` | 3000 | API entrypoint; proxies to `api/` |
 | `web/` | 3030 | Next.js Studio UI |
 
 ## How to run locally
 
 **Requirements**: Python 3.11+, Node.js 22+, ffmpeg, a GPU with ~8 GB VRAM (or CPU fallback).
 
@@ -65,6 +140,8 @@ uvicorn main:app --host 0.0.0.0 --port 8000 --reload
 
 On first start the Voxtral model (~6 GB) is downloaded from HuggingFace. Set `MODEL_ID` / `ADAPTER_ID` env vars to override.
 
 ### 2. Node proxy (port 3000)
 
 ```bash
@@ -91,7 +168,7 @@ curl -X POST http://localhost:3000/api/transcribe-diarize -F "audio=@/path/to/au
 curl -X POST http://localhost:3000/api/transcribe-diarize -F "audio=@/path/to/video.mov"
 ```
 
-Upload a video to also get per-segment facial emotion
+Upload a video to also get per-segment facial emotion and the `face_emotion_timeline`.
 
 ## Models
 
@@ -131,3 +208,11 @@ docker run -p 7860:7860 ethos-studio
 HF Spaces auto-builds on every push to the `main` branch of the linked Space.
 
 > **Note**: `models/emotion_model_web.onnx` is stored via Git LFS / Xet on HF Spaces. When pushing a new commit, use the `commit-tree` graft technique to reuse the existing LFS tree rather than re-uploading the binary.
Speech-to-text with VAD sentence segmentation, per-segment emotion tagging, and facial emotion recognition (FER), powered by a fine-tuned [Voxtral Mini 3B](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507) + [evoxtral-lora](https://huggingface.co/YongkangZOU/evoxtral-lora) running locally, and a MobileViT-XXS ONNX model for FER.

## Studio features

### Transcript editor
- **Character-level text highlighting** – transcript text sweeps dark→gray in sync with playback position, character by character
- **Click-to-seek** – click any character in the transcript to jump the timeline to that exact moment; uses `caretRangeFromPoint` for precision
- **Inline `[bracket]` badges** – paralinguistic tags produced by Voxtral (e.g. `[laughs]`, `[sighs]`) render as pill badges at their exact inline position, not appended at the end; clicking a badge seeks to the moment just before it
- **Bidirectional timeline ↔ transcript sync** – scrolling/clicking the timeline highlights the active segment in the transcript and auto-scrolls it into view; clicking a segment row seeks the timeline
- **Per-segment state** (`past` / `active` / `future`) with opacity transitions

### Live emotion panel (right sidebar)
- **Streaming speech emotion** – the Speech emotion badge updates sub-segment as playback passes each `[bracket]` tag; timing is estimated from the tag's character position proportional to segment duration
- **Streaming valence & arousal bars** – both bars transition to the bracket tag's valence/arousal values at the same moment, creating a continuous emotional arc within each segment
- **Per-second face emotion** (video only) – the Face badge updates every second from the `face_emotion_timeline` returned by the FER pipeline, more granular than the per-segment majority vote
- **Live indicator** – animated green dot appears during playback
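The sub-segment timing estimate described above (tag character position, proportional to segment duration) can be sketched as follows. This is an illustrative sketch, not the app's actual code: `estimate_tag_times` is a hypothetical helper, and the segment shape follows the API response format.

```python
import re

def estimate_tag_times(segment):
    """Estimate a playback timestamp for each [bracket] tag in a segment.

    The tag's character offset within the segment text is mapped
    proportionally onto the segment's [start, end) time span.
    """
    text = segment["text"]
    start, end = segment["start"], segment["end"]
    duration = end - start
    times = []
    for match in re.finditer(r"\[[^\]]+\]", text):
        fraction = match.start() / max(len(text), 1)
        times.append((match.group(0), round(start + fraction * duration, 2)))
    return times

seg = {"start": 0.0, "end": 5.2, "text": "Welcome to the show. [laughs]"}
print(estimate_tag_times(seg))  # [('[laughs]', 3.77)]
```

The estimate is only as good as the proportionality assumption, which is why the UI treats it as a streaming hint rather than a ground-truth timestamp.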
### Timeline
- **Click-to-seek** on the track area
- **Active segment highlight** with ring indicator
- **Played-region overlay** – subtle tint left of the playhead
- **Dot + line playhead** design

### Video support
- Video files (`.mp4`, `.mkv`, `.avi`, `.mov`, `.m4v`, `.webm`) display inline in the right panel preview area
- FER runs on video frames and produces both per-segment majority-vote emotion and a per-second `face_emotion_timeline`
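The per-second timeline can be derived from per-frame FER labels roughly like this. A minimal sketch under stated assumptions: `build_face_timeline` and the `(timestamp, label)` frame tuples are hypothetical, not the pipeline's actual interface.

```python
from collections import Counter

def build_face_timeline(frame_labels):
    """Collapse per-frame FER labels into a per-second majority vote.

    frame_labels: iterable of (timestamp_seconds, label) pairs.
    Returns {"0": label, "1": label, ...} keyed by whole seconds as
    strings, matching the `face_emotion_timeline` field in the API
    response.
    """
    buckets = {}
    for ts, label in frame_labels:
        buckets.setdefault(int(ts), []).append(label)
    return {str(sec): Counter(labels).most_common(1)[0][0]
            for sec, labels in sorted(buckets.items())}

frames = [(0.1, "Neutral"), (0.5, "Neutral"), (0.9, "Happy"),
          (1.2, "Happy"), (1.8, "Happy"), (2.4, "Happy")]
print(build_face_timeline(frames))
# {'0': 'Neutral', '1': 'Happy', '2': 'Happy'}
```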
## Repository layout

```
Browser (:3030)
  │ Next.js UI (upload, Studio editor, timeline, live emotion panel)
Node proxy (:3000)
  │ Express – streams multipart upload, manages session state
Python API (:8000)
  ├─ POST /transcribe-diarize → VAD + Voxtral STT + emotion tags + FER timeline
  └─ POST /fer → per-frame FER via MobileViT-XXS ONNX
```

| Directory | Port | Role |
|-----------|------|------|
| `api/` | 8000 | Voxtral local inference; VAD segmentation; per-segment emotion; FER timeline |
| `proxy/` | 3000 | API entrypoint; proxies to `api/` |
| `web/` | 3030 | Next.js Studio UI |
## API response format

`POST /api/transcribe-diarize` returns:

```json
{
  "filename": "interview.mp4",
  "duration": 42.5,
  "text": "Full transcript...",
  "segments": [
    {
      "id": 1,
      "speaker": "SPEAKER_00",
      "start": 0.0,
      "end": 5.2,
      "text": "Welcome to the show. [laughs]",
      "emotion": "Happy",
      "valence": 0.7,
      "arousal": 0.6,
      "face_emotion": "Happy"
    }
  ],
  "has_video": true,
  "face_emotion_timeline": {
    "0": "Neutral",
    "1": "Happy",
    "2": "Happy"
  }
}
```

`face_emotion_timeline` maps each second (as a string key) to the majority FER label for that second. Only present for video inputs.
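Consuming the response then reduces to a dictionary lookup. A minimal sketch (field names come from the JSON above; the helper name is illustrative):

```python
def face_emotion_at(response, t):
    """Return the face emotion at playback time t (seconds), or None.

    `face_emotion_timeline` keys are whole seconds as strings, so the
    lookup truncates t and converts it to a string key. Audio-only
    responses have no timeline, so the lookup falls through to None.
    """
    timeline = response.get("face_emotion_timeline") or {}
    return timeline.get(str(int(t)))

response = {
    "has_video": True,
    "face_emotion_timeline": {"0": "Neutral", "1": "Happy", "2": "Happy"},
}
print(face_emotion_at(response, 1.4))   # Happy
print(face_emotion_at(response, 99.0))  # None
```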
## Bracket tag emotions

Voxtral produces paralinguistic `[bracket]` tags in transcriptions. The frontend and API both recognise these tags and map them to `(emotion, valence, arousal)` triples:

| Tag | Emotion | Valence | Arousal |
|-----|---------|---------|---------|
| `[laughs]` / `[laughing]` | Happy | +0.70 | +0.60 |
| `[sighs]` / `[sighing]` | Sad | −0.30 | −0.30 |
| `[whispers]` / `[whispering]` | Calm | +0.10 | −0.50 |
| `[shouts]` / `[shouting]` | Angry | −0.50 | +0.80 |
| `[exclaims]` | Excited | +0.50 | +0.70 |
| `[gasps]` | Surprised | +0.20 | +0.70 |
| `[hesitates]` / `[stutters]` / `[stammers]` | Anxious | −0.20 | +0.35 |
| `[cries]` / `[crying]` | Sad | −0.70 | +0.40 |
| `[claps]` / `[applause]` | Happy | +0.60 | +0.50 |
| `[clears throat]` / `[pause]` | Neutral | 0.00 | ±0.10 |
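Tag recognition amounts to a regex scan plus a table lookup; a sketch with only a subset of the rows above (the dict and function names are illustrative, not the codebase's):

```python
import re

# Subset of the tag table above, for illustration only; the real
# mapping covers every row in the table.
TAG_TRIPLES = {
    "laughs": ("Happy", 0.70, 0.60),
    "sighs": ("Sad", -0.30, -0.30),
    "gasps": ("Surprised", 0.20, 0.70),
}

def emotion_triples(text):
    """Extract [bracket] tags and map known ones to (emotion, valence, arousal)."""
    tags = re.findall(r"\[([^\]]+)\]", text)
    return [TAG_TRIPLES[t] for t in tags if t in TAG_TRIPLES]

print(emotion_triples("Well... [sighs] okay, fine. [laughs]"))
# [('Sad', -0.3, -0.3), ('Happy', 0.7, 0.6)]
```

Unknown tags are simply skipped, which matches the "both recognise these tags" framing: anything outside the table renders as a plain badge with no valence/arousal effect.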
## How to run locally

**Requirements**: Python 3.11+, Node.js 22+, ffmpeg, a GPU with ~8 GB VRAM (or CPU fallback).

On first start the Voxtral model (~6 GB) is downloaded from HuggingFace. Set `MODEL_ID` / `ADAPTER_ID` env vars to override.

Inference is optimised with `merge_and_unload()` (removes PEFT per-forward overhead), `torch.set_num_threads(cpu_count)`, and `torch.inference_mode()`.

### 2. Node proxy (port 3000)
```bash
curl -X POST http://localhost:3000/api/transcribe-diarize -F "audio=@/path/to/video.mov"
```

Upload a video to also get per-segment facial emotion and the `face_emotion_timeline`.

## Models
HF Spaces auto-builds on every push to the `main` branch of the linked Space.

> **Note**: `models/emotion_model_web.onnx` is stored via Git LFS / Xet on HF Spaces. When pushing a new commit, use the `commit-tree` graft technique to reuse the existing LFS tree rather than re-uploading the binary.
>
> ```bash
> MODELS_SHA=$(git ls-tree space/main | grep $'\tmodels$' | awk '{print $3}')
> TREE_SHA=$((git ls-tree HEAD | grep -v $'\tmodels$'; echo -e "040000 tree $MODELS_SHA\tmodels") | git mktree)
> PARENT=$(git rev-parse space/main)
> COMMIT_SHA=$(git commit-tree "$TREE_SHA" -p "$PARENT" -m "your message")
> git push space "${COMMIT_SHA}:refs/heads/main"
> ```