Lior-0618 committed
Commit a01a078 · 1 Parent(s): fb8075d

docs: update README with Studio features and API format

Files changed (1): README.md (+89 −4)
README.md CHANGED
@@ -12,6 +12,31 @@ pinned: false
 
 Speech-to-text with VAD sentence segmentation, per-segment emotion tagging, and facial emotion recognition (FER), powered by a fine-tuned [Voxtral Mini 3B](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507) + [evoxtral-lora](https://huggingface.co/YongkangZOU/evoxtral-lora) running locally, and a MobileViT-XXS ONNX model for FER.
 
+## Studio features
+
+### Transcript editor
+- **Character-level text highlighting** — transcript text sweeps dark→gray in sync with playback position, character by character
+- **Click-to-seek** — click any character in the transcript to jump the timeline to that exact moment; uses `caretRangeFromPoint` for precision
+- **Inline `[bracket]` badges** — paralinguistic tags produced by Voxtral (e.g. `[laughs]`, `[sighs]`) render as pill badges at their exact inline position, not appended at the end; clicking a badge seeks to the moment just before it
+- **Bidirectional timeline ↔ transcript sync** — scrolling/clicking the timeline highlights the active segment in the transcript and auto-scrolls it into view; clicking a segment row seeks the timeline
+- **Per-segment state** (`past` / `active` / `future`) with opacity transitions
+
+### Live emotion panel (right sidebar)
+- **Streaming speech emotion** — the Speech emotion badge updates at sub-segment granularity as playback passes each `[bracket]` tag; timing is estimated from the tag's character position, proportional to segment duration
+- **Streaming valence & arousal bars** — both bars transition to the bracket tag's valence/arousal values at the same moment, creating a continuous emotional arc within each segment
+- **Per-second face emotion** (video only) — the Face badge updates every second from the `face_emotion_timeline` returned by the FER pipeline, more granular than the per-segment majority vote
+- **Live indicator** — animated green dot appears during playback
+
+### Timeline
+- **Click-to-seek** on the track area
+- **Active segment highlight** with ring indicator
+- **Played-region overlay** — subtle tint left of the playhead
+- **Dot + line playhead** design
+
+### Video support
+- Video files (`.mp4`, `.mkv`, `.avi`, `.mov`, `.m4v`, `.webm`) display inline in the right panel preview area
+- FER runs on video frames and produces both per-segment majority-vote emotion and a per-second `face_emotion_timeline`
+
 ## Repository layout
 
 ```
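The proportional timing used by the streaming speech emotion above (a tag's timestamp estimated from its character position within the segment text) can be sketched as a pure helper. The name and signature here are illustrative, not the Studio's actual code:

```python
def estimate_tag_time(seg_start: float, seg_end: float, text: str, tag_index: int) -> float:
    """Estimate when a [bracket] tag is spoken by mapping its character
    offset within the segment text onto the segment's time span."""
    if not text:
        return seg_start
    frac = min(max(tag_index / len(text), 0.0), 1.0)  # clamp to [0, 1]
    return seg_start + frac * (seg_end - seg_start)
```

For a 0.0–5.2 s segment whose text is `"Welcome to the show. [laughs]"`, the `[laughs]` tag at character offset 21 lands a bit under three quarters of the way through the segment.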
 
@@ -31,11 +56,11 @@ ser/
 
 ```
 Browser (:3030)
-  ↕ Next.js UI (upload, Studio editor, waveform, timeline, FER badges)
+  ↕ Next.js UI (upload, Studio editor, timeline, live emotion panel)
 Node proxy (:3000)
   ↕ Express — streams multipart upload, manages session state
 Python API (:8000)
- ├─ POST /transcribe-diarize — VAD + Voxtral STT + emotion tags
+ ├─ POST /transcribe-diarize — VAD + Voxtral STT + emotion tags + FER timeline
 └─ POST /fer — per-frame FER via MobileViT-XXS ONNX
 ```
 
 
@@ -45,10 +70,60 @@ nginx on port 7860 (HF Spaces public port) routes:
 
 | Directory | Port | Role |
 |-----------|------|------|
-| `api/` | 8000 | Voxtral local inference; VAD segmentation; per-segment emotion; FER |
+| `api/` | 8000 | Voxtral local inference; VAD segmentation; per-segment emotion; FER timeline |
 | `proxy/` | 3000 | API entrypoint; proxies to `api/` |
 | `web/` | 3030 | Next.js Studio UI |
 
+## API response format
+
+`POST /api/transcribe-diarize` returns:
+
+```json
+{
+  "filename": "interview.mp4",
+  "duration": 42.5,
+  "text": "Full transcript...",
+  "segments": [
+    {
+      "id": 1,
+      "speaker": "SPEAKER_00",
+      "start": 0.0,
+      "end": 5.2,
+      "text": "Welcome to the show. [laughs]",
+      "emotion": "Happy",
+      "valence": 0.7,
+      "arousal": 0.6,
+      "face_emotion": "Happy"
+    }
+  ],
+  "has_video": true,
+  "face_emotion_timeline": {
+    "0": "Neutral",
+    "1": "Happy",
+    "2": "Happy"
+  }
+}
+```
+
+`face_emotion_timeline` maps each second (as a string key) to the majority FER label for that second. Only present for video inputs.
+
+## Bracket tag emotions
+
+Voxtral produces paralinguistic `[bracket]` tags in transcriptions. The frontend and API both recognise these tags and map them to `(emotion, valence, arousal)` triples:
+
+| Tag | Emotion | Valence | Arousal |
+|-----|---------|---------|---------|
+| `[laughs]` / `[laughing]` | Happy | +0.70 | +0.60 |
+| `[sighs]` / `[sighing]` | Sad | −0.30 | −0.30 |
+| `[whispers]` / `[whispering]` | Calm | +0.10 | −0.50 |
+| `[shouts]` / `[shouting]` | Angry | −0.50 | +0.80 |
+| `[exclaims]` | Excited | +0.50 | +0.70 |
+| `[gasps]` | Surprised | +0.20 | +0.70 |
+| `[hesitates]` / `[stutters]` / `[stammers]` | Anxious | −0.20 | +0.35 |
+| `[cries]` / `[crying]` | Sad | −0.70 | +0.40 |
+| `[claps]` / `[applause]` | Happy | +0.60 | +0.50 |
+| `[clears throat]` / `[pause]` | Neutral | 0.00 | ±0.10 |
+
 ## How to run locally
 
 **Requirements**: Python 3.11+, Node.js 22+, ffmpeg, a GPU with ~8 GB VRAM (or CPU fallback).
 
@@ -65,6 +140,8 @@ uvicorn main:app --host 0.0.0.0 --port 8000 --reload
 
 On first start the Voxtral model (~6 GB) is downloaded from HuggingFace. Set `MODEL_ID` / `ADAPTER_ID` env vars to override.
 
+Inference is optimised with `merge_and_unload()` (removes PEFT per-forward overhead), `torch.set_num_threads(cpu_count)`, and `torch.inference_mode()`.
+
 ### 2. Node proxy (port 3000)
 
 ```bash

@@ -91,7 +168,7 @@ curl -X POST http://localhost:3000/api/transcribe-diarize -F "audio=@/path/to/au
 curl -X POST http://localhost:3000/api/transcribe-diarize -F "audio=@/path/to/video.mov"
 ```
 
-Upload a video to also get per-segment facial emotion (`face_emotion` field).
+Upload a video to also get per-segment facial emotion and the `face_emotion_timeline`.
 
 ## Models
 
 
@@ -131,3 +208,11 @@ docker run -p 7860:7860 ethos-studio
 HF Spaces auto-builds on every push to the `main` branch of the linked Space.
 
 > **Note**: `models/emotion_model_web.onnx` is stored via Git LFS / Xet on HF Spaces. When pushing a new commit, use the `commit-tree` graft technique to reuse the existing LFS tree rather than re-uploading the binary.
+>
+> ```bash
+> MODELS_SHA=$(git ls-tree space/main | grep $'\tmodels$' | awk '{print $3}')
+> # note the space after "$(": "$((" would be parsed as arithmetic expansion,
+> # and printf (not plain echo) is needed to emit a literal tab for git mktree
+> TREE_SHA=$( (git ls-tree HEAD | grep -v $'\tmodels$'; printf '040000 tree %s\tmodels\n' "$MODELS_SHA") | git mktree)
+> PARENT=$(git rev-parse space/main)
+> COMMIT_SHA=$(git commit-tree "$TREE_SHA" -p "$PARENT" -m "your message")
+> git push space "${COMMIT_SHA}:refs/heads/main"
+> ```