## overview
- **model:** Ethostral – a fine-tuned Mistral Voxtral for joint ASR and emotion classification.
- **framework:** Python (FastAPI) for the API layer.
- **inference runtime:** Hugging Face `transformers` + `peft` for adapter-based fine-tuned inference (a loading sketch follows this list).
- **real-time transport:** WebSockets for streaming audio and transcription events.
- **hosting:** Hugging Face Inference Endpoints for model serving.

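A minimal loading sketch for the adapter-based runtime above; it assumes a `transformers` build with Voxtral support, and both repo names are illustrative, not confirmed identifiers:

```python
# Load the base Voxtral checkpoint and attach the Ethostral PEFT adapter.
# Both repo names below are assumptions, not confirmed identifiers.
import torch
from peft import PeftModel
from transformers import AutoProcessor, VoxtralForConditionalGeneration

BASE = "mistralai/Voxtral-Mini-3B-2507"  # assumed base checkpoint
ADAPTER = "your-org/ethostral-adapter"   # hypothetical adapter repo

processor = AutoProcessor.from_pretrained(BASE)
model = VoxtralForConditionalGeneration.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER)  # attach fine-tuned weights
model.eval()
```
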
## api endpoints

### `POST /transcribe`
- **purpose:** accepts an uploaded audio/video file, runs the Ethostral pipeline, and returns a structured transcript (a route sketch follows this list).
- **input:** multipart form data with the fields `file`, `language` (optional; default: auto-detect), `diarize` (bool), and `emotion` (bool).
- **output:** a JSON object with diarized segments, each containing transcript text, speaker id, timestamps, and emotional metadata.

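A hedged sketch of this route in FastAPI; `run_pipeline` is a hypothetical helper standing in for the processing pipeline described later, not part of the spec:

```python
# Hypothetical POST /transcribe handler; run_pipeline() stands in for the
# preprocessing -> diarization -> inference pipeline described below.
from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()

def run_pipeline(audio: bytes, *, language: str | None,
                 diarize: bool, emotion: bool) -> dict:
    """Hypothetical stand-in for the full Ethostral pipeline."""
    raise NotImplementedError

@app.post("/transcribe")
async def transcribe(
    file: UploadFile = File(...),
    language: str | None = Form(None),  # None -> auto-detect
    diarize: bool = Form(True),
    emotion: bool = Form(True),
):
    audio_bytes = await file.read()
    return run_pipeline(audio_bytes, language=language,
                        diarize=diarize, emotion=emotion)
```
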
### `WS /transcribe/stream`
- **purpose:** accepts a live audio byte stream and emits partial transcription and emotion events in real time (an endpoint sketch follows the example message).
- **message format (server → client):**
```json
{
  "segment_id": "uuid",
  "speaker": "s0",
  "text": "Hello, I'm here.",
  "start_ms": 8100,
  "end_ms": 9040,
  "emotion": {
    "label": "Calm",
    "valence": 0.3,
    "arousal": -0.1,
    "dominance": 0.2
  }
}
```

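Under the same assumptions, the server side of the stream might look like this; `stream_segments` is a hypothetical async generator yielding events shaped like the message above:

```python
# Hypothetical WS /transcribe/stream handler: receive raw audio chunks and
# push partial segment events shaped like the JSON message above.
from typing import AsyncIterator

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

async def stream_segments(chunk: bytes) -> AsyncIterator[dict]:
    """Hypothetical stand-in for streaming Ethostral inference."""
    raise NotImplementedError
    yield {}  # unreachable; marks this as an async generator

@app.websocket("/transcribe/stream")
async def transcribe_stream(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            chunk = await ws.receive_bytes()       # raw audio bytes
            async for event in stream_segments(chunk):
                await ws.send_json(event)          # matches the schema above
    except WebSocketDisconnect:
        pass
```
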
### `GET /sessions/{session_id}`
- **purpose:** retrieves a previously processed session by ID.
- **output:** the full structured transcript with emotional metadata.

### `DELETE /sessions/{session_id}`
- **purpose:** deletes a stored session (a sketch of both session routes follows).

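A minimal sketch of both session routes over an in-memory store; a real deployment would back this with persistent storage:

```python
# Session retrieval and deletion over a simple in-memory store; the dict
# here is illustrative, not the spec's storage layer.
from fastapi import FastAPI, HTTPException

app = FastAPI()
SESSIONS: dict[str, dict] = {}  # session_id -> stored session JSON

@app.get("/sessions/{session_id}")
async def get_session(session_id: str):
    if session_id not in SESSIONS:
        raise HTTPException(status_code=404, detail="session not found")
    return SESSIONS[session_id]

@app.delete("/sessions/{session_id}")
async def delete_session(session_id: str):
    SESSIONS.pop(session_id, None)
    return {"deleted": session_id}
```
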
## processing pipeline

1. **ingest:** audio is received via REST upload or WebSocket stream.
2. **preprocessing:** audio is resampled to 16 kHz mono; silent spans are stripped via voice activity detection (VAD).
3. **diarization:** speaker diarization with `pyannote.audio` splits the audio into per-speaker segments (a sketch of steps 2-3 follows this list).
4. **inference:** each segment is passed to the Ethostral endpoint for:
   - automatic speech recognition (ASR).
   - emotion classification (categorical + dimensional: valence / arousal / dominance).
5. **post-processing:** results are merged, timestamps are aligned, and the output is structured per segment.
6. **storage:** sessions are persisted under a generated UUID.
7. **telemetry:** each pipeline run is traced via Weights & Biases Weave.

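A minimal sketch of steps 2-3 using `torchaudio` and `pyannote.audio`; the diarization checkpoint name is the commonly used public one and an assumption here:

```python
# Step 2: resample to 16 kHz mono. Step 3: diarize into speaker turns.
# "pyannote/speaker-diarization-3.1" is assumed and may need an HF token.
import torch
import torchaudio
from pyannote.audio import Pipeline

def preprocess(path: str, target_sr: int = 16_000) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)           # (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    return waveform

diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
annotation = diarizer("input.wav")
for turn, _, speaker in annotation.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.2f}s -> {turn.end:.2f}s")
```
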
## output schema

```typescript
type Segment = {
  id: string
  speaker: string       // "s0", "s1", ...
  start_ms: number
  end_ms: number
  text: string
  emotion: {
    label: string       // "Happy", "Neutral", "Anxious", etc.
    valence: number     // -1.0 to 1.0
    arousal: number     // -1.0 to 1.0
    dominance: number   // -1.0 to 1.0
    confidence: number  // 0.0 to 1.0
  }
}

type Session = {
  id: string
  filename: string
  language: string
  duration_ms: number
  created_at: string    // ISO 8601
  segments: Segment[]
}
```

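The backend would presumably mirror this schema with `pydantic` models (listed under dependencies); a hedged sketch, with validation ranges taken from the comments above:

```python
# Pydantic mirror of the TypeScript schema; field names follow the spec,
# and the ge/le bounds encode the commented value ranges.
from pydantic import BaseModel, Field

class Emotion(BaseModel):
    label: str                          # "Happy", "Neutral", "Anxious", ...
    valence: float = Field(ge=-1.0, le=1.0)
    arousal: float = Field(ge=-1.0, le=1.0)
    dominance: float = Field(ge=-1.0, le=1.0)
    confidence: float = Field(ge=0.0, le=1.0)

class Segment(BaseModel):
    id: str
    speaker: str                        # "s0", "s1", ...
    start_ms: int
    end_ms: int
    text: str
    emotion: Emotion

class Session(BaseModel):
    id: str
    filename: str
    language: str
    duration_ms: int
    created_at: str                     # ISO 8601
    segments: list[Segment]
```
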
## dependencies
- **`fastapi`** – async HTTP and WebSocket server.
- **`pydantic`** – request/response schema validation.
- **`pyannote.audio`** – speaker diarization.
- **`transformers` + `peft`** – Ethostral model loading and adapter inference.
- **`torchaudio`** – audio preprocessing and resampling.
- **`wandb`** – Weights & Biases Weave integration for pipeline tracing (tracing sketch after this list).
- **`huggingface_hub`** – programmatic access to model weights and datasets.

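For the telemetry step, tracing could look like the following; this assumes the standalone `weave` package (installed alongside `wandb`) and a hypothetical project name:

```python
# Trace pipeline runs as Weave ops; "ethostral-backend" is a hypothetical
# project name.
import weave

weave.init("ethostral-backend")

@weave.op()
def transcribe_file(path: str) -> dict:
    """Each call is recorded as a traced op with inputs and outputs."""
    # ... run the pipeline described above ...
    return {"segments": []}
```
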
## performance targets
- **transcription latency (batch):** < 2× real-time (e.g., a 60 s file processed in < 120 s).
- **streaming latency:** < 500 ms from audio chunk to partial transcript event.
- **emotion classification latency:** < 100 ms per segment (excluding ASR).
- **word error rate:** target < 10% on clean English audio.
- **emotion F1 score:** target > 0.70 on the IEMOCAP benchmark.