
ethos / proxy /README.md

Lior-0618

feat: replace frontend with redesigned Ethos Studio UI + update READMEs

c4c4f17 6 days ago


Proxy Layer (Node/Express — port 3000)

API gateway. Accepts multipart file uploads from the browser, forwards them to the API layer (Python FastAPI on port 8000), and returns JSON responses.

Port: 3000 (override with PORT)
API layer URL: http://127.0.0.1:8000 (override with MODEL_URL)
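As a minimal sketch of how these two settings might be resolved, assuming PORT and MODEL_URL are read from environment variables (the README does not say so explicitly) and using a hypothetical helper name `resolveConfig`:

```javascript
// Hypothetical helper, not taken from the proxy source: resolve the two
// settings from an environment-like object, using the documented defaults.
function resolveConfig(env = process.env) {
  const port = Number(env.PORT ?? 3000);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${env.PORT}`);
  }
  // Strip trailing slashes so route paths can be appended with a single "/".
  const modelUrl = (env.MODEL_URL ?? "http://127.0.0.1:8000").replace(/\/+$/, "");
  return { port, modelUrl };
}
```

Passing the environment as an argument keeps the function easy to exercise with a plain object instead of the real process environment.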

Startup

cd proxy
npm install
npm run dev    # dev with --watch
# or
npm start

Requires Node.js 22+.

API

POST /api/speech-to-text

Simple transcription. Forwarded to API layer POST /transcribe. Timeout: 30 min (CPU inference is slow).


| Field | Value |
| --- | --- |
| Content-Type | `multipart/form-data` |
| Body | `audio`: audio file (wav, mp3, flac, ogg, m4a, webm) |
| Limits | ≤ 100 MB |

Response (200)

{
  "text": "transcribed text",
  "words": [],
  "languageCode": "en"
}

Errors

| Status | Body |
| --- | --- |
| 400 | `{"error": "Upload an audio file (form field: audio)"}` |
| 502 | API layer error or unreachable |
| 504 | `{"error": "Request timeout (>30 min); try shorter audio"}` |
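A client for this endpoint could look roughly like the sketch below. It checks the documented extensions and size limit before uploading, then posts the file under the form field `audio`. The names `checkUpload` and `speechToText` are hypothetical, and treating 100 MB as 100 × 1024 × 1024 bytes is an assumption. It relies only on the `fetch`, `FormData`, and `Blob` globals available in Node 22 and in browsers.

```javascript
// Limits copied from the request description above; the byte interpretation
// of "100 MB" is an assumption.
const MAX_UPLOAD_BYTES = 100 * 1024 * 1024;
const AUDIO_EXTENSIONS = ["wav", "mp3", "flac", "ogg", "m4a", "webm"];

// Returns an error message, or null if the file looks acceptable.
function checkUpload(filename, sizeBytes, allowed = AUDIO_EXTENSIONS) {
  const ext = filename.includes(".") ? filename.split(".").pop().toLowerCase() : "";
  if (!allowed.includes(ext)) return `Unsupported file type: .${ext}`;
  if (sizeBytes > MAX_UPLOAD_BYTES) return "File exceeds 100 MB";
  return null;
}

// Posts a Blob under the form field "audio" and returns the parsed JSON body.
// fetchImpl is injectable so the function can be exercised without a server.
async function speechToText(blob, filename, baseUrl = "http://localhost:3000", fetchImpl = fetch) {
  const problem = checkUpload(filename, blob.size);
  if (problem) throw new Error(problem);

  const form = new FormData();
  form.append("audio", blob, filename);

  const res = await fetchImpl(`${baseUrl}/api/speech-to-text`, { method: "POST", body: form });
  if (!res.ok) {
    // 400 and 504 carry a JSON {"error": ...} body; the 502 body format is
    // not specified, so fall back to the status code if parsing fails.
    const detail = await res.json().then((b) => b.error, () => undefined);
    throw new Error(detail ?? `Proxy returned HTTP ${res.status}`);
  }
  return res.json();
}
```

In a browser, the `File` object from an `<input type="file">` element can be passed directly as `blob`, with `file.name` as the filename.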

POST /api/transcribe-diarize

Full pipeline: transcription, VAD (voice activity detection) sentence segmentation, and emotion analysis. For video inputs, also returns face_emotion per segment. Forwarded to API layer POST /transcribe-diarize. Timeout: 60 min.


| Field | Value |
| --- | --- |
| Content-Type | `multipart/form-data` |
| Body | `audio`: audio or video file (wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv) |
| Limits | ≤ 100 MB |

Response (200)

{
  "segments": [
    {
      "id": 1,
      "speaker": "SPEAKER_00",
      "start": 0.0,
      "end": 4.2,
      "text": "Hello, how are you?",
      "emotion": "Happy",
      "valence": 0.7,
      "arousal": 0.6,
      "face_emotion": "Happy"
    }
  ],
  "duration": 42.3,
  "text": "full transcript",
  "filename": "recording.mov",
  "diarization_method": "vad",
  "has_video": true
}

face_emotion is present only when a video file is uploaded and facial emotion recognition (FER) is enabled. has_video indicates whether FER ran.
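Because face_emotion is optional, code that consumes the segments should not assume it exists. The sketch below shows two hypothetical helpers (`secondsByEmotion` and `mismatchedSegments`, neither part of the project) that work on the `segments` array of the response shown above.

```javascript
// Total speaking time in seconds for each voice emotion label.
function secondsByEmotion(segments) {
  const totals = {};
  for (const seg of segments) {
    totals[seg.emotion] = (totals[seg.emotion] ?? 0) + (seg.end - seg.start);
  }
  return totals;
}

// Segments where the voice and face labels disagree. Segments without
// face_emotion (audio-only uploads, or FER disabled) are skipped.
function mismatchedSegments(segments) {
  return segments.filter(
    (seg) => seg.face_emotion !== undefined && seg.face_emotion !== seg.emotion
  );
}
```

Given a parsed response `result`, `secondsByEmotion(result.segments)` returns an object keyed by emotion label, and `mismatchedSegments(result.segments)` returns an empty array for audio-only uploads.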

Errors

| Status | Body |
| --- | --- |
| 400 | `{"error": "Upload an audio file (form field: audio)"}` |
| 502 | API layer error or unreachable |
| 504 | `{"error": "Request timeout (>60 min); try shorter audio"}` |
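The 502 and 504 rows for both upload endpoints suggest a forwarding step that aborts the upstream call after a per-route timeout. The sketch below is one way such a step could be written. It is not the proxy's actual code: `forward`, `TIMEOUTS_MS`, and the shape of the 502 body are assumptions made for illustration.

```javascript
// Per-route timeouts as documented: 30 min and 60 min.
const TIMEOUTS_MS = {
  "/api/speech-to-text": 30 * 60 * 1000,
  "/api/transcribe-diarize": 60 * 60 * 1000,
};

// Runs upstreamCall(signal) and maps the outcome to a status and body.
// upstreamCall is expected to reject when the signal aborts, as fetch does.
async function forward(route, upstreamCall, timeoutMs = TIMEOUTS_MS[route]) {
  if (typeof timeoutMs !== "number") throw new Error(`No timeout for ${route}`);
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const body = await upstreamCall(controller.signal);
    return { status: 200, body };
  } catch (err) {
    if (controller.signal.aborted) {
      const minutes = Math.round(timeoutMs / 60000);
      return {
        status: 504,
        body: { error: `Request timeout (>${minutes} min); try shorter audio` },
      };
    }
    // Placeholder body: the README does not specify the 502 payload.
    return { status: 502, body: { error: String(err?.message ?? err) } };
  } finally {
    clearTimeout(timer);
  }
}
```

With `fetch` as the upstream call, this would be used as `forward(route, (signal) => fetch(url, { method: "POST", body, signal }).then((r) => r.json()))`, where `url` and `body` are supplied by the caller.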

GET /health

Proxies GET {MODEL_URL}/health and wraps the API layer's response in the model field, alongside the proxy's own ok and server fields.

Response (200)

{
  "ok": true,
  "server": "ser-server",
  "model": {
    "status": "ok",
    "model": "mistralai/Voxtral-Mini-3B-2507 + YongkangZOU/evoxtral-lora (local)",
    "model_loaded": true,
    "ffmpeg": true,
    "fer_enabled": true,
    "device": "cpu",
    "max_upload_mb": 100
  }
}

Response (502) — when API layer is unreachable:

{"ok": false, "error": "Cannot reach Model layer; start model/voxtral-server first", "url": "http://127.0.0.1:8000"}
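A caller can use the two response shapes above to decide whether uploads are worth attempting. The helper below is a hypothetical sketch. Reading `model_loaded: false` as "not ready yet" is an assumption, since the README only shows the loaded case.

```javascript
// Reduces a parsed /health body (200 or 502) to a ready flag and a reason.
function summarizeHealth(body) {
  if (!body.ok) {
    return { ready: false, reason: body.error ?? "proxy reported ok: false" };
  }
  if (!body.model?.model_loaded) {
    return { ready: false, reason: "model not loaded" };
  }
  return { ready: true, reason: null };
}
```

The body would typically come from `fetch("http://localhost:3000/health").then((r) => r.json())`, which yields JSON for both the 200 and the 502 case.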

GET /api/debug-inference

Proxies GET {MODEL_URL}/debug-inference — smoke-tests the local Voxtral model with a short silence clip.

Usage examples

# Health
curl -s http://localhost:3000/health

# Transcribe (audio)
curl -X POST http://localhost:3000/api/speech-to-text -F "audio=@./recording.m4a"

# Transcribe + segment + emotion (audio or video)
curl -X POST http://localhost:3000/api/transcribe-diarize -F "audio=@./recording.m4a"
curl -X POST http://localhost:3000/api/transcribe-diarize -F "audio=@./video.mov"