Spaces:

mistral-hackaton-2026
/

ethos

Running

App Files Files Community

ethos / proxy /README.md

Lior-0618

feat: replace frontend with redesigned Ethos Studio UI + update READMEs

c4c4f17 6 days ago

preview code

raw

history blame contribute delete

3.43 kB

	# Proxy Layer (Node/Express — port 3000)

	API gateway. Accepts multipart file uploads from the browser, forwards them to the API layer (Python FastAPI on port 8000), and returns JSON responses.

	- Port: `3000` (override with `PORT`)
	- API layer URL: `http://127.0.0.1:8000` (override with `MODEL_URL`)

	---

	## Startup

	```bash
	cd proxy
	npm install
	npm run dev # dev with --watch
	# or
	npm start
	```

	Requires Node.js 22+.

	---

	## API

	### POST /api/speech-to-text

	Simple transcription. Forwarded to API layer `POST /transcribe`. Timeout: 30 min (CPU inference is slow).

	\| \| \|
	\|--\|--\|
	\| Content-Type \| `multipart/form-data` \|
	\| Body \| `audio` — audio file (wav, mp3, flac, ogg, m4a, webm) \|
	\| Limits \| ≤ 100 MB \|

	Response (200)

	```json
	{
	"text": "transcribed text",
	"words": [],
	"languageCode": "en"
	}
	```

	Errors

	\| Status \| Body \|
	\|--------\|------\|
	\| 400 \| `{"error": "Upload an audio file (form field: audio)"}` \|
	\| 502 \| API layer error or unreachable \|
	\| 504 \| `{"error": "Request timeout (>30 min); try shorter audio"}` \|

	---

	### POST /api/transcribe-diarize

	Full pipeline: transcription + VAD sentence segmentation + emotion analysis. For video inputs, also returns `face_emotion` per segment. Forwarded to API layer `POST /transcribe-diarize`. Timeout: 60 min.

	\| \| \|
	\|--\|--\|
	\| Content-Type \| `multipart/form-data` \|
	\| Body \| `audio` — audio or video file (wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv) \|
	\| Limits \| ≤ 100 MB \|

	Response (200)

	```json
	{
	"segments": [
	{
	"id": 1,
	"speaker": "SPEAKER_00",
	"start": 0.0,
	"end": 4.2,
	"text": "Hello, how are you?",
	"emotion": "Happy",
	"valence": 0.7,
	"arousal": 0.6,
	"face_emotion": "Happy"
	}
	],
	"duration": 42.3,
	"text": "full transcript",
	"filename": "recording.mov",
	"diarization_method": "vad",
	"has_video": true
	}
	```

	`face_emotion` is present only when a video file is uploaded and FER is enabled. `has_video` indicates whether facial emotion recognition ran.

	Errors

	\| Status \| Body \|
	\|--------\|------\|
	\| 400 \| `{"error": "Upload an audio file (form field: audio)"}` \|
	\| 502 \| API layer error or unreachable \|
	\| 504 \| `{"error": "Request timeout (>60 min); try shorter audio"}` \|

	---

	### GET /health

	Proxies `GET {MODEL_URL}/health` and wraps it.

	Response (200)

	```json
	{
	"ok": true,
	"server": "ser-server",
	"model": {
	"status": "ok",
	"model": "mistralai/Voxtral-Mini-3B-2507 + YongkangZOU/evoxtral-lora (local)",
	"model_loaded": true,
	"ffmpeg": true,
	"fer_enabled": true,
	"device": "cpu",
	"max_upload_mb": 100
	}
	}
	```

	Response (502) — when API layer is unreachable:

	```json
	{"ok": false, "error": "Cannot reach Model layer; start model/voxtral-server first", "url": "http://127.0.0.1:8000"}
	```

	---

	### GET /api/debug-inference

	Proxies `GET {MODEL_URL}/debug-inference` — smoke-tests the local Voxtral model with a short silence clip.

	---

	## Usage examples

	```bash
	# Health
	curl -s http://localhost:3000/health

	# Transcribe (audio)
	curl -X POST http://localhost:3000/api/speech-to-text -F "audio=@./recording.m4a"

	# Transcribe + segment + emotion (audio or video)
	curl -X POST http://localhost:3000/api/transcribe-diarize -F "audio=@./recording.m4a"
	curl -X POST http://localhost:3000/api/transcribe-diarize -F "audio=@./video.mov"
	```