# Frontend (Next.js — port 3030)
Ethos Studio UI. Upload audio or video files, view transcription results with per-segment emotion badges and facial emotion (FER) badges, and explore the waveform timeline in the Studio editor.
## Architecture
```
Browser (port 3030)
  → Proxy layer (Node, port 3000): POST /api/speech-to-text, POST /api/transcribe-diarize, GET /health
    → API layer (Python, port 8000): POST /transcribe, POST /transcribe-diarize, GET /health
```
- Frontend (`web/`): Upload page + Studio editor. Calls the Proxy layer.
- Proxy layer (`proxy/`): Forwards browser requests to the API layer (sketched below). See `proxy/README.md` for API details.
- API layer (`api/`): Local Voxtral inference + VAD segmentation + emotion + FER. See `api/README.md` for API details.
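To make the forwarding idea concrete, here is a minimal TypeScript sketch of what the Proxy layer does, assuming an Express-style server and Node 18+ (global `fetch`). This is an illustration, not the actual `proxy/` implementation; only the `/health` route is shown.

```ts
// Conceptual sketch of the proxy layer: accept a browser request and
// forward it to the Python API layer. MODEL_URL and PORT match the
// environment variables documented at the end of this README.
import express from "express";

const app = express();
const MODEL_URL = process.env.MODEL_URL ?? "http://127.0.0.1:8000";

app.get("/health", async (_req, res) => {
  // Relay the health check to the API layer and return its JSON reply.
  const upstream = await fetch(`${MODEL_URL}/health`);
  res.status(upstream.status).json(await upstream.json());
});

app.listen(Number(process.env.PORT ?? 3000));
```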
## Startup
### 1. API layer (Python, port 8000)
Requires Python 3.11+ and ffmpeg.
```bash
cd api
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```
On first run the Voxtral model (~8 GB) is downloaded from HuggingFace.
### 2. Proxy layer (Node, port 3000)
```bash
cd proxy
npm install
npm run dev
```
### 3. Frontend (Next.js, port 3030)
```bash
cd web
npm install
npm run dev
```
Open http://localhost:3030.
- Home page: Click Transcribe files, drag-drop an audio or video file, then Upload. The file is sent to `/api/transcribe-diarize` and results open in the Studio (a minimal upload sketch follows this list).
- Studio page (`/studio`): Three-column layout — transcript segments (speaker + emotion badges + FER badges for video) on the left, waveform in the center, audio player on the right.
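The upload itself is a plain multipart POST. Here is a hedged sketch of what the Home page sends, written as a standalone TypeScript helper (the function name is illustrative, not the actual `web/` code):

```ts
// Hypothetical upload helper: POST the chosen file to the proxy's
// /api/transcribe-diarize endpoint as multipart form data. The form
// field must be named "audio", per the API docs below.
async function uploadForTranscription(file: File): Promise<unknown> {
  const form = new FormData();
  form.append("audio", file);

  const res = await fetch("http://localhost:3000/api/transcribe-diarize", {
    method: "POST",
    body: form, // the browser sets the multipart boundary automatically
  });
  if (!res.ok) throw new Error(`Upload failed: ${res.status}`);
  return res.json();
}
```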
### 4. Quick check (API only)
```bash
curl -s http://localhost:3000/health
curl -X POST http://localhost:3000/api/speech-to-text -F "audio=@audio.m4a"
curl -X POST http://localhost:3000/api/transcribe-diarize -F "audio=@audio.m4a"
curl -X POST http://localhost:3000/api/transcribe-diarize -F "audio=@video.mov"
```
## API (Proxy layer)
Clients should call the Proxy layer only. The API layer is internal.
### POST /api/speech-to-text
Simple transcription without diarization.
| Field | Value |
|---|---|
| Content-Type | `multipart/form-data` |
| Body | `audio` — audio file (wav, mp3, flac, ogg, m4a, webm) |
| Limits | ≤ 100 MB; timeout 30 min |
Response (200):

```json
{
  "text": "transcribed text",
  "words": [],
  "languageCode": "en"
}
```
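For a typed client, the response can be modeled directly from the example above. A minimal TypeScript sketch follows; the `SpeechToTextResponse` interface and `speechToText` helper are assumptions for illustration, not part of this repo:

```ts
// Response shape taken from the example above.
interface SpeechToTextResponse {
  text: string;
  words: unknown[];
  languageCode: string;
}

// Send a file to the proxy and parse the JSON reply.
async function speechToText(file: Blob, filename: string): Promise<SpeechToTextResponse> {
  const form = new FormData();
  form.append("audio", file, filename);
  const res = await fetch("http://localhost:3000/api/speech-to-text", {
    method: "POST",
    body: form,
  });
  if (!res.ok) throw new Error(`speech-to-text failed: ${res.status}`);
  return (await res.json()) as SpeechToTextResponse;
}
```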
### POST /api/transcribe-diarize
Full pipeline: transcription + VAD sentence segmentation + per-segment emotion analysis. For video inputs, also returns `face_emotion` per segment.
| Field | Value |
|---|---|
| Content-Type | `multipart/form-data` |
| Body | `audio` — audio or video file (wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv) |
| Limits | ≤ 100 MB; timeout 60 min |
Response (200):

```json
{
  "segments": [
    {
      "id": 1,
      "speaker": "SPEAKER_00",
      "start": 0.0,
      "end": 4.2,
      "text": "Hello, how are you?",
      "emotion": "Happy",
      "valence": 0.7,
      "arousal": 0.6,
      "face_emotion": "Happy"
    }
  ],
  "duration": 42.3,
  "text": "full transcript text",
  "filename": "recording.mov",
  "diarization_method": "vad",
  "has_video": true
}
```
`face_emotion` appears only on video uploads when FER is enabled. `diarization_method` is always `"vad"`.
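Consumers can therefore model segments with an optional `face_emotion` field. A TypeScript sketch inferred from the response example above (the `Segment` interface and `printTranscript` helper are illustrative, not part of this repo):

```ts
// Segment shape inferred from the response example; face_emotion is
// optional because it appears only for video uploads with FER enabled.
interface Segment {
  id: number;
  speaker: string;
  start: number;
  end: number;
  text: string;
  emotion: string;
  valence: number;
  arousal: number;
  face_emotion?: string;
}

// Print one line per segment, e.g.
// "[0.0-4.2] SPEAKER_00 (Happy / face: Happy): Hello, how are you?"
function printTranscript(segments: Segment[]): void {
  for (const s of segments) {
    const face = s.face_emotion ? ` / face: ${s.face_emotion}` : "";
    console.log(`[${s.start.toFixed(1)}-${s.end.toFixed(1)}] ${s.speaker} (${s.emotion}${face}): ${s.text}`);
  }
}
```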
### GET /health

```json
{
  "ok": true,
  "server": "ser-server",
  "model": {
    "status": "ok",
    "model": "mistralai/Voxtral-Mini-3B-2507 + YongkangZOU/evoxtral-lora (local)",
    "model_loaded": true,
    "ffmpeg": true,
    "fer_enabled": true,
    "device": "cpu",
    "max_upload_mb": 100
  }
}
```
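Because the first run downloads the ~8 GB Voxtral model, it can be useful to wait for `model.model_loaded` before sending uploads. A hypothetical polling helper (the name and timeouts are illustrative):

```ts
// Poll GET /health until the model reports loaded, or give up after
// timeoutMs. Useful on cold starts while the model downloads.
async function waitForModel(baseUrl = "http://localhost:3000", timeoutMs = 600_000) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const res = await fetch(`${baseUrl}/health`);
      const body = await res.json();
      if (body?.model?.model_loaded) return body;
    } catch {
      // Proxy or API not up yet; keep polling.
    }
    await new Promise((resolve) => setTimeout(resolve, 5_000));
  }
  throw new Error("Model did not become ready in time");
}
```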
## Environment variables
Create `web/.env.local`:

| Variable | Default | Description |
|---|---|---|
| `NEXT_PUBLIC_API_URL` | `http://localhost:3000` | Proxy layer URL used by the browser |
Create `proxy/.env` or export:

| Variable | Default | Description |
|---|---|---|
| `PORT` | `3000` | Proxy layer port |
| `MODEL_URL` | `http://127.0.0.1:8000` | API layer URL |
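In the proxy, these would typically be read once at startup. A minimal sketch (assumed shape; see `proxy/` for the real configuration code):

```ts
// Read proxy settings from the environment, falling back to the
// defaults in the table above.
const config = {
  port: Number(process.env.PORT ?? "3000"),
  modelUrl: process.env.MODEL_URL ?? "http://127.0.0.1:8000",
};
```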