# Frontend (Next.js, port 3030)

Ethos Studio UI. Upload audio or video files, view transcription results with per-segment emotion badges and facial emotion (FER) badges, and explore the waveform timeline in the Studio editor.

---

## Architecture

```
Browser (port 3030)
  → Proxy layer (Node, port 3000)
      POST /api/speech-to-text, POST /api/transcribe-diarize, GET /health
  → API layer (Python, port 8000)
      POST /transcribe, POST /transcribe-diarize, GET /health
```

- **Frontend** (`web/`): Upload page + Studio editor. Calls the Proxy layer.
- **Proxy layer** (`proxy/`): Forwards browser requests to the API layer. See [proxy/README.md](../proxy/README.md) for API details.
- **API layer** (`api/`): Local Voxtral inference + VAD segmentation + emotion + FER. See [api/README.md](../api/README.md) for API details.

---

## Startup

### 1. API layer (Python, port 8000)

Requires **Python 3.11+** and **ffmpeg**.

```bash
cd api
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

On first run the Voxtral model (~8 GB) is downloaded from HuggingFace.

### 2. Proxy layer (Node, port 3000)

```bash
cd proxy
npm install
npm run dev
```

### 3. Frontend (Next.js, port 3030)

```bash
cd web
npm install
npm run dev
```

Open [http://localhost:3030](http://localhost:3030).

- **Home page**: Click **Transcribe files**, drag-drop an audio or video file, then **Upload**. The file is sent to `/api/transcribe-diarize` and results open in the Studio.
- **Studio page** (`/studio`): Three-column layout with transcript segments (speaker + emotion badges + FER badges for video) on the left, waveform in the center, and audio player on the right.

### 4. Quick check (API only)

```bash
curl -s http://localhost:3000/health
curl -X POST http://localhost:3000/api/speech-to-text -F "audio=@audio.m4a"
curl -X POST http://localhost:3000/api/transcribe-diarize -F "audio=@audio.m4a"
curl -X POST http://localhost:3000/api/transcribe-diarize -F "audio=@video.mov"
```

---

## API (Proxy layer)

Clients should call the **Proxy layer** only. The API layer is internal.

### POST /api/speech-to-text

Simple transcription without diarization.

| | |
|--|--|
| **Content-Type** | `multipart/form-data` |
| **Body** | `audio`: audio file (wav, mp3, flac, ogg, m4a, webm) |
| **Limits** | ≤ 100 MB; timeout 30 min |

**Response (200)**

```json
{
  "text": "transcribed text",
  "words": [],
  "languageCode": "en"
}
```

---

### POST /api/transcribe-diarize

Full pipeline: transcription + VAD sentence segmentation + per-segment emotion analysis. For video inputs, also returns `face_emotion` per segment.

| | |
|--|--|
| **Content-Type** | `multipart/form-data` |
| **Body** | `audio`: audio or video file (wav, mp3, flac, ogg, m4a, webm, mp4, mov, mkv) |
| **Limits** | ≤ 100 MB; timeout 60 min |

**Response (200)**

```json
{
  "segments": [
    {
      "id": 1,
      "speaker": "SPEAKER_00",
      "start": 0.0,
      "end": 4.2,
      "text": "Hello, how are you?",
      "emotion": "Happy",
      "valence": 0.7,
      "arousal": 0.6,
      "face_emotion": "Happy"
    }
  ],
  "duration": 42.3,
  "text": "full transcript text",
  "filename": "recording.mov",
  "diarization_method": "vad",
  "has_video": true
}
```

`face_emotion` appears only on video uploads when FER is enabled. `diarization_method` is always `"vad"`.

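
For clients that consume the `/api/transcribe-diarize` response, the shape above can be captured in TypeScript types. The following is a minimal sketch, not the project's actual client code: the field types are inferred from the sample response, `start` and `end` are assumed to be seconds, and `formatSegment` and `countEmotions` are hypothetical helpers introduced only for illustration.

```typescript
// Field names come from the sample response; the types are inferred
// from that sample and are assumptions, not a published schema.
interface Segment {
  id: number;
  speaker: string;
  start: number; // assumed to be seconds
  end: number; // assumed to be seconds
  text: string;
  emotion: string;
  valence: number;
  arousal: number;
  face_emotion?: string; // only on video uploads when FER is enabled
}

interface DiarizeResponse {
  segments: Segment[];
  duration: number;
  text: string;
  filename: string;
  diarization_method: string;
  has_video: boolean;
}

// Hypothetical helper: render one segment as a single transcript line.
function formatSegment(seg: Segment): string {
  const span = `${seg.start.toFixed(1)}s-${seg.end.toFixed(1)}s`;
  const face =
    seg.face_emotion !== undefined ? `, face: ${seg.face_emotion}` : "";
  return `[${span}] ${seg.speaker} (${seg.emotion}${face}): ${seg.text}`;
}

// Hypothetical helper: count how many segments carry each speech emotion label.
function countEmotions(res: DiarizeResponse): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const seg of res.segments) {
    counts[seg.emotion] = (counts[seg.emotion] ?? 0) + 1;
  }
  return counts;
}
```

Marking `face_emotion` as optional lets the same types cover both audio and video uploads, so callers must handle its absence explicitly.
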
---

### GET /health

```json
{
  "ok": true,
  "server": "ser-server",
  "model": {
    "status": "ok",
    "model": "mistralai/Voxtral-Mini-3B-2507 + YongkangZOU/evoxtral-lora (local)",
    "model_loaded": true,
    "ffmpeg": true,
    "fer_enabled": true,
    "device": "cpu",
    "max_upload_mb": 100
  }
}
```

---

## Environment variables

Create `web/.env.local`:

| Variable | Default | Description |
|----------|---------|-------------|
| `NEXT_PUBLIC_API_URL` | `http://localhost:3000` | Proxy layer URL used by the browser |

Create `proxy/.env` or export:

| Variable | Default | Description |
|----------|---------|-------------|
| `PORT` | `3000` | Proxy layer port |
| `MODEL_URL` | `http://127.0.0.1:8000` | API layer URL |
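
The `NEXT_PUBLIC_API_URL` default can be applied in frontend code roughly as follows. This is a sketch under stated assumptions rather than the code in `web/`: `resolveProxyUrl` and `endpoint` are hypothetical helper names, and the environment is passed in as a plain object so the logic stays easy to test.

```typescript
// Default copied from the NEXT_PUBLIC_API_URL row of the environment table.
const DEFAULT_PROXY_URL = "http://localhost:3000";

type Env = Record<string, string | undefined>;

// Hypothetical helper: choose the Proxy layer base URL, falling back to
// the documented default when the variable is unset or blank.
function resolveProxyUrl(env: Env): string {
  const raw = env["NEXT_PUBLIC_API_URL"]?.trim();
  const base = raw ? raw : DEFAULT_PROXY_URL;
  return base.replace(/\/+$/, ""); // drop trailing slashes
}

// Hypothetical helper: join the base URL and an endpoint path with exactly one slash.
function endpoint(env: Env, path: string): string {
  const suffix = path.charAt(0) === "/" ? path : `/${path}`;
  return `${resolveProxyUrl(env)}${suffix}`;
}
```

In browser code, Next.js inlines `NEXT_PUBLIC_*` variables at build time only when they are referenced by their full name, so a call such as `endpoint({ NEXT_PUBLIC_API_URL: process.env.NEXT_PUBLIC_API_URL }, "/health")` is safer than passing `process.env` as a whole.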