# Frontend (Next.js — port 3030)

Ethos Studio UI. Upload audio or video files, view transcription results with per-segment emotion badges and facial emotion (FER) badges, and explore the waveform timeline in the Studio editor.

---

## Architecture

```
Browser (port 3030)
  → Proxy layer (Node, port 3000)   POST /api/speech-to-text, POST /api/transcribe-diarize, GET /health
      → API layer (Python, port 8000)   POST /transcribe, POST /transcribe-diarize, GET /health
```

- **Frontend** (`web/`): Upload page + Studio editor. Calls the Proxy layer.
- **Proxy layer** (`proxy/`): Forwards browser requests to the API layer. See [proxy/README.md](../proxy/README.md) for API details.
- **API layer** (`api/`): Local Voxtral inference + VAD segmentation + emotion + FER. See [api/README.md](../api/README.md) for API details.

---

## Startup

### 1. API layer (Python, port 8000)

Requires **Python 3.11+** and **ffmpeg**.

```bash
cd api
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

On first run, the Voxtral model (~8 GB) is downloaded from Hugging Face.

### 2. Proxy layer (Node, port 3000)

```bash
cd proxy
npm install
npm run dev
```

### 3. Frontend (Next.js, port 3030)

```bash
cd web
npm install
npm run dev
```

Open [http://localhost:3030](http://localhost:3030).

- **Home page**: Click **Transcribe files**, drag and drop an audio or video file, then click **Upload**. The file is sent to `/api/transcribe-diarize` and the results open in the Studio.
- **Studio page** (`/studio`): Three-column layout — transcript segments (speaker + emotion badges + FER badges for video) on the left, waveform in the center, audio player on the right.
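
The upload flow above can be sketched in TypeScript. Note that `apiBase` and `transcribeFile` are illustrative names, not functions from this codebase; the endpoint, the `audio` form field, and the `NEXT_PUBLIC_API_URL` default are taken from this README.

```typescript
// Resolve the Proxy layer base URL from the public env var,
// falling back to the default documented below.
function apiBase(): string {
  return process.env.NEXT_PUBLIC_API_URL ?? "http://localhost:3000";
}

// Hypothetical upload helper: posts the file as the `audio` multipart
// field to the full-pipeline endpoint and returns the parsed JSON.
async function transcribeFile(file: Blob, filename = "upload.m4a"): Promise<unknown> {
  const form = new FormData();
  form.append("audio", file, filename);
  const res = await fetch(`${apiBase()}/api/transcribe-diarize`, {
    method: "POST",
    body: form, // the browser sets the multipart boundary automatically
  });
  if (!res.ok) throw new Error(`Upload failed: HTTP ${res.status}`);
  return res.json();
}
```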

### 4. Quick check (API only)

```bash
curl -s http://localhost:3000/health
curl -X POST http://localhost:3000/api/speech-to-text -F "audio=@audio.m4a"
curl -X POST http://localhost:3000/api/transcribe-diarize -F "audio=@audio.m4a"
curl -X POST http://localhost:3000/api/transcribe-diarize -F "audio=@video.mov"
```

---

## API (Proxy layer)

Clients should call the **Proxy layer** only. The API layer is internal.

### POST /api/speech-to-text

Simple transcription without diarization.

| | |
|--|--|
| **Content-Type** | `multipart/form-data` |
| **Body** | `audio` — audio file (wav, mp3, flac, ogg, m4a, webm) |
| **Limits** | ≤ 100 MB; timeout 30 min |

**Response (200)**

```json
{
  "text": "transcribed text",
  "words": [],
  "languageCode": "en"
}
```
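
As a sketch, the response above could be modeled with a type and a runtime check (the names here are illustrative, not from the codebase):

```typescript
// Shape of the /api/speech-to-text response shown above.
interface SpeechToTextResponse {
  text: string;
  words: unknown[]; // word-level detail; empty in the sample above
  languageCode: string;
}

// Narrowing guard so an untyped fetch result can be validated at runtime.
function isSpeechToTextResponse(v: unknown): v is SpeechToTextResponse {
  const o = v as Record<string, unknown>;
  return (
    typeof v === "object" && v !== null &&
    typeof o.text === "string" &&
    Array.isArray(o.words) &&
    typeof o.languageCode === "string"
  );
}
```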

---

### POST /api/transcribe-diarize

Full pipeline: transcription + VAD sentence segmentation + per-segment emotion analysis. For video inputs, also returns `face_emotion` per segment.

| | |
|--|--|
| **Content-Type** | `multipart/form-data` |
| **Body** | `audio` — audio or video file (wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv) |
| **Limits** | ≤ 100 MB; timeout 60 min |

**Response (200)**

```json
{
  "segments": [
    {
      "id": 1,
      "speaker": "SPEAKER_00",
      "start": 0.0,
      "end": 4.2,
      "text": "Hello, how are you?",
      "emotion": "Happy",
      "valence": 0.7,
      "arousal": 0.6,
      "face_emotion": "Happy"
    }
  ],
  "duration": 42.3,
  "text": "full transcript text",
  "filename": "recording.mov",
  "diarization_method": "vad",
  "has_video": true
}
```

`face_emotion` appears only on video uploads when FER is enabled. `diarization_method` is always `"vad"`.
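
The response above might be typed as follows in the frontend. The interfaces mirror the sample JSON; `speechSeconds` is a hypothetical helper, not part of the actual codebase:

```typescript
// Per-segment shape from the /api/transcribe-diarize response above.
// `face_emotion` is optional: it appears only for video uploads with FER enabled.
interface Segment {
  id: number;
  speaker: string;
  start: number;   // seconds
  end: number;     // seconds
  text: string;
  emotion: string;
  valence: number;
  arousal: number;
  face_emotion?: string;
}

interface DiarizeResponse {
  segments: Segment[];
  duration: number;          // seconds
  text: string;
  filename: string;
  diarization_method: "vad"; // always "vad" per this README
  has_video: boolean;
}

// Illustrative helper: total speech time covered by the segments, in seconds.
function speechSeconds(r: DiarizeResponse): number {
  return r.segments.reduce((sum, s) => sum + (s.end - s.start), 0);
}
```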

---

### GET /health

```json
{
  "ok": true,
  "server": "ser-server",
  "model": {
    "status": "ok",
    "model": "mistralai/Voxtral-Mini-3B-2507 + YongkangZOU/evoxtral-lora (local)",
    "model_loaded": true,
    "ffmpeg": true,
    "fer_enabled": true,
    "device": "cpu",
    "max_upload_mb": 100
  }
}
```
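
A client can use this payload as a readiness probe before allowing uploads. A minimal sketch, assuming only the fields shown above (`isReady` is an illustrative name):

```typescript
// Subset of the /health payload relevant for readiness.
interface Health {
  ok: boolean;
  server: string;
  model: { status: string; model_loaded: boolean; ffmpeg: boolean };
}

// Ready only when the proxy responds OK and the API layer reports the
// Voxtral model loaded with ffmpeg available.
function isReady(h: Health): boolean {
  return h.ok && h.model.status === "ok" && h.model.model_loaded && h.model.ffmpeg;
}
```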

---

## Environment variables

Create `web/.env.local`:

| Variable | Default | Description |
|----------|---------|-------------|
| `NEXT_PUBLIC_API_URL` | `http://localhost:3000` | Proxy layer URL used by the browser |

Create `proxy/.env` or export:

| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | `3000` | Proxy layer port |
| `MODEL_URL` | `http://127.0.0.1:8000` | API layer URL |
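
A sketch of how the proxy might resolve these variables with the defaults above (`getConfig` is an illustrative helper, not the actual proxy code):

```typescript
interface ProxyConfig {
  port: number;
  modelUrl: string;
}

// Read proxy settings from an environment map, falling back to the
// defaults documented in the table above.
function getConfig(env: Record<string, string | undefined>): ProxyConfig {
  return {
    port: Number(env.PORT ?? "3000"),
    modelUrl: env.MODEL_URL ?? "http://127.0.0.1:8000",
  };
}
```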