# Proxy Layer (Node/Express — port 3000)

API gateway. Accepts multipart file uploads from the browser, forwards them to the **API layer** (Python FastAPI on port 8000), and returns JSON responses.

- **Port**: `3000` (override with `PORT`)
- **API layer URL**: `http://127.0.0.1:8000` (override with `MODEL_URL`)
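
The two settings above can be read at startup roughly like this (a minimal sketch; the variable names are illustrative, not taken from the proxy source):

```javascript
// Read proxy configuration from the environment, falling back to the
// documented defaults. Names here are illustrative only.
const PORT = Number(process.env.PORT ?? 3000);
const MODEL_URL = process.env.MODEL_URL ?? "http://127.0.0.1:8000";
```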

---

## Startup

```bash
cd proxy
npm install
npm run dev    # dev with --watch
# or
npm start
```

Requires **Node.js 22+**.

---

## API

### POST /api/speech-to-text

Simple transcription. Forwarded to API layer `POST /transcribe`. Timeout: **30 min** (CPU inference is slow).

| | |
|--|--|
| **Content-Type** | `multipart/form-data` |
| **Body** | `audio` — audio file (wav, mp3, flac, ogg, m4a, webm) |
| **Limits** | ≤ 100 MB |
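
From a browser (or Node 18+) client, the request body can be assembled with the standard `FormData` API. A sketch — the endpoint and the `audio` field name come from the table above; the file contents are placeholder bytes:

```javascript
// Build the multipart body for POST /api/speech-to-text.
// The form field must be named "audio"; four placeholder bytes stand in for a real file.
const form = new FormData();
form.append("audio", new Blob([new Uint8Array(4)], { type: "audio/wav" }), "clip.wav");

// To send it (not executed here):
// await fetch("http://localhost:3000/api/speech-to-text", { method: "POST", body: form });
```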

**Response (200)**

```json
{
  "text": "transcribed text",
  "words": [],
  "languageCode": "en"
}
```

**Errors**

| Status | Body |
|--------|------|
| 400 | `{"error": "Upload an audio file (form field: audio)"}` |
| 502 | API layer error or unreachable |
| 504 | `{"error": "Request timeout (>30 min); try shorter audio"}` |
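
Since CPU inference can run close to the proxy's 30-minute limit, a client may want its own abort guard. A sketch using the standard `AbortSignal.timeout`; the duration mirrors the documented proxy timeout and is not taken from the proxy source:

```javascript
// Abort the client request slightly after the proxy's own 30-minute timeout,
// so the proxy's 504 body can arrive first when possible.
const PROXY_TIMEOUT_MS = 30 * 60 * 1000;
const signal = AbortSignal.timeout(PROXY_TIMEOUT_MS + 5_000);
// fetch("http://localhost:3000/api/speech-to-text", { method: "POST", body: form, signal });
```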

---

### POST /api/transcribe-diarize

Full pipeline: transcription + VAD sentence segmentation + emotion analysis. For video inputs, also returns `face_emotion` per segment. Forwarded to API layer `POST /transcribe-diarize`. Timeout: **60 min**.

| | |
|--|--|
| **Content-Type** | `multipart/form-data` |
| **Body** | `audio` — audio or video file (wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv) |
| **Limits** | ≤ 100 MB |

**Response (200)**

```json
{
  "segments": [
    {
      "id": 1,
      "speaker": "SPEAKER_00",
      "start": 0.0,
      "end": 4.2,
      "text": "Hello, how are you?",
      "emotion": "Happy",
      "valence": 0.7,
      "arousal": 0.6,
      "face_emotion": "Happy"
    }
  ],
  "duration": 42.3,
  "text": "full transcript",
  "filename": "recording.mov",
  "diarization_method": "vad",
  "has_video": true
}
```

`face_emotion` is present only when a video file is uploaded and facial emotion recognition (FER) is enabled; `has_video` indicates whether FER actually ran.
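
Client code can post-process the segment list directly — for example, tallying segments per emotion label. A sketch using a sample object that mirrors the documented response shape (the values are illustrative):

```javascript
// Sample /api/transcribe-diarize response in the documented shape.
const result = {
  segments: [
    { id: 1, speaker: "SPEAKER_00", start: 0.0, end: 4.2, text: "Hello, how are you?", emotion: "Happy", valence: 0.7, arousal: 0.6 },
    { id: 2, speaker: "SPEAKER_00", start: 4.2, end: 6.0, text: "Fine, thanks.", emotion: "Neutral", valence: 0.1, arousal: 0.2 },
  ],
  duration: 6.0,
  has_video: false,
};

// Count segments per emotion label.
const counts = {};
for (const seg of result.segments) {
  counts[seg.emotion] = (counts[seg.emotion] ?? 0) + 1;
}
```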

**Errors**

| Status | Body |
|--------|------|
| 400 | `{"error": "Upload an audio file (form field: audio)"}` |
| 502 | API layer error or unreachable |
| 504 | `{"error": "Request timeout (>60 min); try shorter audio"}` |

---

### GET /health

Proxies `GET {MODEL_URL}/health` and wraps it.

**Response (200)**

```json
{
  "ok": true,
  "server": "ser-server",
  "model": {
    "status": "ok",
    "model": "mistralai/Voxtral-Mini-3B-2507 + YongkangZOU/evoxtral-lora (local)",
    "model_loaded": true,
    "ffmpeg": true,
    "fer_enabled": true,
    "device": "cpu",
    "max_upload_mb": 100
  }
}
```

**Response (502)** — when API layer is unreachable:

```json
{"ok": false, "error": "Cannot reach Model layer; start model/voxtral-server first", "url": "http://127.0.0.1:8000"}
```
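
A client can treat the service as healthy only when both the proxy and the model report ready. A sketch; the field names follow the two response bodies above:

```javascript
// Returns true only when the proxy reached the API layer and the model is loaded.
function isHealthy(body) {
  return body.ok === true && body.model?.model_loaded === true;
}
```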

---

### GET /api/debug-inference

Proxies `GET {MODEL_URL}/debug-inference` — smoke-tests the local Voxtral model with a short silence clip.

---

## Usage examples

```bash
# Health
curl -s http://localhost:3000/health

# Model smoke test (short silence clip)
curl -s http://localhost:3000/api/debug-inference

# Transcribe (audio)
curl -X POST http://localhost:3000/api/speech-to-text -F "audio=@./recording.m4a"

# Transcribe + segment + emotion (audio or video)
curl -X POST http://localhost:3000/api/transcribe-diarize -F "audio=@./recording.m4a"
curl -X POST http://localhost:3000/api/transcribe-diarize -F "audio=@./video.mov"
```