ethos / docs / context / backend_spec.md

Lior-0618

refactor: restructure repo into api/ proxy/ web/ training/ docs/

a265585 6 days ago

overview

model: Ethostral — fine-tuned Mistral Voxtral for joint ASR and emotion classification.
framework: Python (FastAPI) for the API layer.
inference runtime: Hugging Face transformers + peft for adapter-based fine-tuned inference.
real-time transport: WebSockets for streaming audio and transcription events.
hosting: Hugging Face Inference Endpoints for model serving.

api endpoints

`POST /transcribe`

purpose: accepts an uploaded audio or video file, runs the Ethostral pipeline, and returns a structured transcript.
input: multipart/form-data with fields: file, language (optional, default: auto-detect), diarize (bool), emotion (bool).
output: JSON object with diarized segments, each containing transcript text, speaker id, timestamps, and emotional metadata.
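A minimal sketch of how the form fields of `POST /transcribe` might be normalized before the pipeline runs. It uses only the standard library; `TranscribeOptions`, `parse_bool`, and `parse_options` are hypothetical names, and the `True` defaults for diarize and emotion are assumptions, since the spec only states a default for language.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def parse_bool(raw: Optional[str], default: bool) -> bool:
    """Interpret a form-data string as a boolean, falling back to a default."""
    if raw is None or raw.strip() == "":
        return default
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    raise ValueError(f"not a boolean: {raw!r}")


@dataclass(frozen=True)
class TranscribeOptions:
    language: Optional[str]  # None stands for auto-detect
    diarize: bool
    emotion: bool


def parse_options(form: Mapping[str, str]) -> TranscribeOptions:
    """Build options from the text fields of the multipart form."""
    language = form.get("language", "").strip().lower()
    return TranscribeOptions(
        language=language or None,
        # Assumed defaults: the spec does not say what happens when omitted.
        diarize=parse_bool(form.get("diarize"), default=True),
        emotion=parse_bool(form.get("emotion"), default=True),
    )
```

For example, `parse_options({"language": "EN", "diarize": "false"})` yields language `"en"`, diarization off, and emotion left at the assumed default.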

`WS /transcribe/stream`

purpose: accepts a live audio byte stream and emits partial transcription and emotion events in real time.

message format (server → client):

{
  "segment_id": "uuid",
  "speaker": "s0",
  "text": "Hello, I'm here.",
  "start_ms": 8100,
  "end_ms": 9040,
  "emotion": {
    "label": "Calm",
    "valence": 0.3,
    "arousal": -0.1,
    "dominance": 0.2
  }
}
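A client consuming this stream could decode each message into a typed object. The sketch below is one way to do that with the standard library; `StreamEvent` and `parse_stream_event` are hypothetical names, and the ordering check on the timestamps is an added assumption rather than a stated rule.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamEvent:
    segment_id: str
    speaker: str
    text: str
    start_ms: int
    end_ms: int
    label: str
    valence: float
    arousal: float
    dominance: float


def parse_stream_event(raw: str) -> StreamEvent:
    """Decode one server → client JSON message into a typed event."""
    data = json.loads(raw)
    emotion = data["emotion"]
    event = StreamEvent(
        segment_id=str(data["segment_id"]),
        speaker=str(data["speaker"]),
        text=str(data["text"]),
        start_ms=int(data["start_ms"]),
        end_ms=int(data["end_ms"]),
        label=str(emotion["label"]),
        valence=float(emotion["valence"]),
        arousal=float(emotion["arousal"]),
        dominance=float(emotion["dominance"]),
    )
    # Assumed sanity check: a segment cannot end before it starts.
    if event.end_ms < event.start_ms:
        raise ValueError("end_ms precedes start_ms")
    return event
```

The example message carries no confidence value, so the sketch does not require one.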

`GET /sessions/{session_id}`

purpose: retrieves a previously processed session by ID.
output: full structured transcript with emotional metadata.

`DELETE /sessions/{session_id}`

purpose: deletes a stored session.
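The two session endpoints map onto a small lookup-and-delete interface. The sketch below uses an in-memory dictionary as a stand-in, since the spec does not name a storage backend; `InMemorySessionStore` and its methods are hypothetical.

```python
import uuid
from typing import Any, Dict, Optional


class InMemorySessionStore:
    """Stand-in for the real persistence layer (assumption: not specified)."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Dict[str, Any]] = {}

    def create(self, session: Dict[str, Any]) -> str:
        """Store a session under a freshly generated UUID and return the ID."""
        session_id = str(uuid.uuid4())
        self._sessions[session_id] = {**session, "id": session_id}
        return session_id

    def get(self, session_id: str) -> Optional[Dict[str, Any]]:
        """Return the session, or None if the ID is unknown."""
        return self._sessions.get(session_id)

    def delete(self, session_id: str) -> bool:
        """Remove the session; report whether anything was deleted."""
        return self._sessions.pop(session_id, None) is not None
```

A handler could translate a `None` from `get` or a `False` from `delete` into a 404 response, though the spec does not define error codes.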

processing pipeline

ingest: audio is received via REST upload or WebSocket stream.
preprocessing: audio is resampled to 16 kHz mono. silence segments are stripped via VAD (voice activity detection).
diarization: speaker diarization using pyannote.audio to split audio into per-speaker segments.
inference: each segment is passed to the Ethostral endpoint for:
- automatic speech recognition (ASR).
- emotion classification (categorical + dimensional: valence / arousal / dominance).
post-processing: results are merged, timestamps are aligned, and output is structured per-segment.
storage: sessions are persisted with a generated UUID.
telemetry: each pipeline run is traced via Weights & Biases Weave.
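One detail in the post-processing step is worth illustrating: if silence is removed before inference, timestamps measured on the stripped audio must be mapped back to the original timeline. The sketch below assumes the VAD step concatenates the kept speech regions in order and that the model reports times relative to that concatenation; both are assumptions, and `build_offsets` and `to_original_ms` are hypothetical helpers.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple

# (start_ms, end_ms) of a speech region kept by VAD, on the original timeline.
Region = Tuple[int, int]


def build_offsets(regions: List[Region]) -> List[int]:
    """Start position of each kept region inside the silence-stripped audio."""
    durations = [end - start for start, end in regions]
    return [0] + list(accumulate(durations))[:-1]


def to_original_ms(stripped_ms: int, regions: List[Region]) -> int:
    """Map a time in the stripped audio back to the original recording."""
    if not regions:
        raise ValueError("at least one speech region is required")
    if stripped_ms < 0:
        raise ValueError("stripped_ms must be non-negative")
    offsets = build_offsets(regions)
    # Find the last region whose stripped-audio start is <= stripped_ms.
    idx = bisect_right(offsets, stripped_ms) - 1
    start, end = regions[idx]
    # Clamp so times past the final region do not overshoot its end.
    return min(start + (stripped_ms - offsets[idx]), end)
```

With speech kept at 1000 to 3000 ms and 5000 to 6000 ms, a stripped-audio time of 2500 ms falls 500 ms into the second region. A time that lands exactly on a region boundary is attributed to the start of the next region, which suits segment start times; end times may need the opposite convention.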

output schema

type Segment = {
  id: string
  speaker: string           // "s0", "s1", ...
  start_ms: number
  end_ms: number
  text: string
  emotion: {
    label: string           // "Happy", "Neutral", "Anxious", etc.
    valence: number         // -1.0 to 1.0
    arousal: number         // -1.0 to 1.0
    dominance: number       // -1.0 to 1.0
    confidence: number      // 0.0 to 1.0
  }
}

type Session = {
  id: string
  filename: string
  language: string
  duration_ms: number
  created_at: string        // ISO 8601
  segments: Segment[]
}
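On the Python side the same schema could be mirrored with validation of the documented ranges. The dependency list names pydantic for this; the sketch below uses standard-library dataclasses only to stay self-contained, and the check that `end_ms` is not before `start_ms` is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


def _check_range(name: str, value: float, low: float, high: float) -> None:
    if not low <= value <= high:
        raise ValueError(f"{name}={value} outside [{low}, {high}]")


@dataclass(frozen=True)
class Emotion:
    label: str
    valence: float
    arousal: float
    dominance: float
    confidence: float

    def __post_init__(self) -> None:
        # Ranges taken from the comments in the schema above.
        for name in ("valence", "arousal", "dominance"):
            _check_range(name, getattr(self, name), -1.0, 1.0)
        _check_range("confidence", self.confidence, 0.0, 1.0)


@dataclass(frozen=True)
class Segment:
    id: str
    speaker: str
    start_ms: int
    end_ms: int
    text: str
    emotion: Emotion

    def __post_init__(self) -> None:
        if self.end_ms < self.start_ms:  # assumed invariant
            raise ValueError("end_ms precedes start_ms")


@dataclass
class Session:
    id: str
    filename: str
    language: str
    duration_ms: int
    created_at: str  # ISO 8601
    segments: List[Segment] = field(default_factory=list)
```

Constructing `Emotion("Calm", 1.5, 0.0, 0.0, 0.5)` raises `ValueError` because valence falls outside the documented range.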

dependencies

fastapi — async HTTP and WebSocket server.
pydantic — request/response schema validation.
pyannote.audio — speaker diarization.
transformers + peft — Ethostral model loading and adapter inference.
torchaudio — audio preprocessing and resampling.
wandb — Weights & Biases Weave integration for pipeline tracing.
huggingface_hub — programmatic access to model weights and datasets.

performance targets

transcription latency (batch): real-time factor < 2, i.e. processing time under twice the audio duration (e.g., a 60s file processed in < 120s).
streaming latency: < 500ms from audio chunk to partial transcript event.
emotion classification latency: < 100ms per segment (excluding ASR).
word error rate: target < 10% on clean English audio.
emotion F1 score: target > 0.70 on the IEMOCAP benchmark.
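Two of these targets can be checked with a few lines of code. The sketch below computes word error rate as word-level edit distance divided by reference length, and the real-time factor as processing time over audio duration. The text normalization (lowercasing and whitespace splitting only) is an assumption; a real evaluation would likely also handle punctuation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic programming over one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time divided by audio duration; the batch target is < 2."""
    if audio_s <= 0:
        raise ValueError("audio_s must be positive")
    return processing_s / audio_s
```

One substituted word in a four-word reference gives a WER of 0.25, which would miss the 10% target, while a 60s file processed in 90s has a real-time factor of 1.5 and meets the batch target.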