Proxy Layer (Node/Express — port 3000)
API gateway. Accepts multipart file uploads from the browser, forwards them to the API layer (Python FastAPI on port 8000), and returns JSON responses.
- Port: 3000 (override with PORT)
- API layer URL: http://127.0.0.1:8000 (override with MODEL_URL)
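The override rule above can be sketched as follows. This is a minimal Python illustration of the lookup logic only, not the proxy's actual Node code; the function name resolve_config and the trailing-slash stripping are assumptions made for the example.

```python
import os

DEFAULT_PORT = 3000
DEFAULT_MODEL_URL = "http://127.0.0.1:8000"


def resolve_config(env=None):
    """Return (port, model_url), applying PORT / MODEL_URL overrides if set."""
    env = os.environ if env is None else env
    port = int(env.get("PORT") or DEFAULT_PORT)
    # Assumption: strip a trailing slash so paths like /health append cleanly.
    model_url = (env.get("MODEL_URL") or DEFAULT_MODEL_URL).rstrip("/")
    return port, model_url


print(resolve_config({}))
# (3000, 'http://127.0.0.1:8000')
print(resolve_config({"PORT": "8080", "MODEL_URL": "http://model:8000/"}))
# (8080, 'http://model:8000')
```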
Startup
cd proxy
npm install
npm run dev # dev with --watch
# or
npm start
Requires Node.js 22+.
API
POST /api/speech-to-text
Simple transcription. Forwarded to API layer POST /transcribe. Timeout: 30 min (CPU inference is slow).
| Field | Value |
|---|---|
| Content-Type | multipart/form-data |
| Body | Form field audio: an audio file (wav, mp3, flac, ogg, m4a, webm) |
| Limits | ≤ 100 MB |
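A client can check the constraints above before uploading, which avoids sending a large file only to have it rejected. The sketch below is a hypothetical client-side helper (check_upload is not part of the proxy), and it assumes that "100 MB" means 100 × 1024 × 1024 bytes; the server may count differently.

```python
from pathlib import Path

# Assumption: the 100 MB limit is interpreted as 100 MiB.
MAX_UPLOAD_BYTES = 100 * 1024 * 1024
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".m4a", ".webm"}


def check_upload(filename, size_bytes, allowed=AUDIO_EXTENSIONS):
    """Return None if the file looks acceptable, otherwise a reason string."""
    ext = Path(filename).suffix.lower()
    if ext not in allowed:
        return f"unsupported extension: {ext or '(none)'}"
    if size_bytes > MAX_UPLOAD_BYTES:
        return f"file is {size_bytes} bytes; limit is {MAX_UPLOAD_BYTES}"
    return None


print(check_upload("recording.M4A", 5_000_000))   # None
print(check_upload("clip.mp4", 5_000_000))        # unsupported extension: .mp4
```

For POST /api/transcribe-diarize, the same helper could be called with a wider allowed set that also includes .mp4, .mov and .mkv.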
Response (200)
{
"text": "transcribed text",
"words": [],
"languageCode": "en"
}
Errors
| Status | Body |
|---|---|
| 400 | {"error": "Upload an audio file (form field: audio)"} |
| 502 | API layer error or unreachable |
| 504 | {"error": "Request timeout (>30 min); try shorter audio"} |
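One way a client might turn these statuses into messages is sketched below. The helper name describe_failure and the fallback wording are assumptions for illustration; only the status codes and the error field come from the table above.

```python
def describe_failure(status, body):
    """Map a failed proxy response to a short message.

    body is the parsed JSON payload (a dict) when there is one, else None.
    """
    # Prefer the server's own "error" text when the body carries it.
    detail = body.get("error") if isinstance(body, dict) else None
    if status == 400:
        return detail or "bad request: attach the file as form field 'audio'"
    if status == 502:
        return detail or "API layer returned an error or is unreachable"
    if status == 504:
        return detail or "request timed out; try shorter audio"
    return f"unexpected status {status}"


print(describe_failure(400, {"error": "Upload an audio file (form field: audio)"}))
print(describe_failure(502, None))
```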
POST /api/transcribe-diarize
Full pipeline: transcription + VAD sentence segmentation + emotion analysis. For video inputs, also returns face_emotion per segment. Forwarded to API layer POST /transcribe-diarize. Timeout: 60 min.
| Field | Value |
|---|---|
| Content-Type | multipart/form-data |
| Body | Form field audio: an audio or video file (wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv) |
| Limits | ≤ 100 MB |
Response (200)
{
"segments": [
{
"id": 1,
"speaker": "SPEAKER_00",
"start": 0.0,
"end": 4.2,
"text": "Hello, how are you?",
"emotion": "Happy",
"valence": 0.7,
"arousal": 0.6,
"face_emotion": "Happy"
}
],
"duration": 42.3,
"text": "full transcript",
"filename": "recording.mov",
"diarization_method": "vad",
"has_video": true
}
face_emotion is present only when a video file is uploaded and FER is enabled. has_video indicates whether facial emotion recognition ran.
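Because face_emotion is optional, client code should read it defensively. The following sketch formats each segment of a response shaped like the example above; the function name summarize is hypothetical, and the sample dict simply repeats the documented fields.

```python
def summarize(result):
    """Yield one printable line per segment of a transcribe-diarize response."""
    for seg in result.get("segments", []):
        line = (f'[{seg["start"]:.1f}-{seg["end"]:.1f}s] {seg["speaker"]}: '
                f'{seg["text"]} (voice: {seg["emotion"]}')
        # face_emotion is absent for audio-only uploads or when FER is disabled.
        face = seg.get("face_emotion")
        if face is not None:
            line += f", face: {face}"
        yield line + ")"


sample = {
    "segments": [{
        "id": 1, "speaker": "SPEAKER_00", "start": 0.0, "end": 4.2,
        "text": "Hello, how are you?", "emotion": "Happy",
        "valence": 0.7, "arousal": 0.6, "face_emotion": "Happy",
    }],
    "has_video": True,
}
for line in summarize(sample):
    print(line)
# [0.0-4.2s] SPEAKER_00: Hello, how are you? (voice: Happy, face: Happy)
```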
Errors
| Status | Body |
|---|---|
| 400 | {"error": "Upload an audio file (form field: audio)"} |
| 502 | API layer error or unreachable |
| 504 | {"error": "Request timeout (>60 min); try shorter audio"} |
GET /health
Proxies GET {MODEL_URL}/health and wraps the API layer's response in the model field, alongside the proxy's own ok and server fields.
Response (200)
{
"ok": true,
"server": "ser-server",
"model": {
"status": "ok",
"model": "mistralai/Voxtral-Mini-3B-2507 + YongkangZOU/evoxtral-lora (local)",
"model_loaded": true,
"ffmpeg": true,
"fer_enabled": true,
"device": "cpu",
"max_upload_mb": 100
}
}
Response (502) — when API layer is unreachable:
{"ok": false, "error": "Cannot reach Model layer; start model/voxtral-server first", "url": "http://127.0.0.1:8000"}
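A deployment script might use the two response shapes above to decide whether the stack is ready for uploads. This is a minimal sketch under the assumption that "ready" means ok is true, model.status is "ok", and model_loaded is true; is_ready is a hypothetical helper, not something the proxy exposes.

```python
def is_ready(health):
    """True when the proxy reached the API layer and the model is loaded."""
    if not health.get("ok"):
        return False
    # "model" holds the API layer's own health payload.
    model = health.get("model") or {}
    return model.get("status") == "ok" and bool(model.get("model_loaded"))


up = {"ok": True, "server": "ser-server",
      "model": {"status": "ok", "model_loaded": True, "device": "cpu"}}
down = {"ok": False,
        "error": "Cannot reach Model layer; start model/voxtral-server first",
        "url": "http://127.0.0.1:8000"}
print(is_ready(up), is_ready(down))
# True False
```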
GET /api/debug-inference
Proxies GET {MODEL_URL}/debug-inference — smoke-tests the local Voxtral model with a short silence clip.
Usage examples
# Health
curl -s http://localhost:3000/health
# Transcribe (audio)
curl -X POST http://localhost:3000/api/speech-to-text -F "audio=@./recording.m4a"
# Transcribe + segment + emotion (audio or video)
curl -X POST http://localhost:3000/api/transcribe-diarize -F "audio=@./recording.m4a"
curl -X POST http://localhost:3000/api/transcribe-diarize -F "audio=@./video.mov"
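The same upload can be made without curl. The sketch below uses only the Python standard library and builds the multipart body by hand; encode_multipart and post_file are illustrative names, and the code assumes filenames that contain no double quotes or line breaks.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def encode_multipart(field, filename, data, boundary=None):
    """Build a multipart/form-data body holding a single file part."""
    boundary = boundary or uuid.uuid4().hex
    ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {ctype}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_file(url, path, timeout):
    """POST a file as form field 'audio' and return the parsed JSON response."""
    path = Path(path)
    body, content_type = encode_multipart("audio", path.name, path.read_bytes())
    req = urllib.request.Request(url, data=body, method="POST",
                                 headers={"Content-Type": content_type})
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Example call (requires the proxy to be running):
# result = post_file("http://localhost:3000/api/transcribe-diarize",
#                    "recording.m4a", timeout=60 * 60)
```

The client timeout is set to match the proxy's 60 min limit for this endpoint, so the proxy's own 504 arrives before the client gives up.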