
overview

  • model: Ethostral, a fine-tuned Mistral Voxtral for joint ASR and emotion classification.
  • framework: Python (FastAPI) for the API layer.
  • inference runtime: Hugging Face transformers + peft for adapter-based fine-tuned inference.
  • real-time transport: WebSockets for streaming audio and transcription events.
  • hosting: Hugging Face Inference Endpoints for model serving.

api endpoints

POST /transcribe

  • purpose: accepts an uploaded audio/video file, runs the Ethostral pipeline, returns a structured transcript.
  • input: multipart form-data with fields: file, language (optional, default: auto-detect), diarize (bool), emotion (bool).
  • output: JSON object with diarized segments, each containing transcript text, speaker id, timestamps, and emotional metadata.
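A minimal sketch of how the text fields of the multipart request could be turned into typed options, using only the standard library. The names `TranscribeOptions` and `parse_transcribe_form`, the `"auto"` sentinel for auto-detect, and the accepted boolean spellings are assumptions for illustration, not part of the spec. Since only `language` is marked optional above, the sketch treats `diarize` and `emotion` as required.

```python
from dataclasses import dataclass

# Accepted spellings for boolean form fields (an assumption; the spec only says "bool").
_TRUE = {"true", "1", "yes", "on"}
_FALSE = {"false", "0", "no", "off"}


@dataclass(frozen=True)
class TranscribeOptions:
    language: str  # "auto" is a hypothetical sentinel meaning auto-detect
    diarize: bool
    emotion: bool


def _parse_bool(name: str, raw: str) -> bool:
    """Map a form string to a bool, rejecting anything ambiguous."""
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    raise ValueError(f"field {name!r} must be a boolean, got {raw!r}")


def parse_transcribe_form(fields: dict[str, str]) -> TranscribeOptions:
    """Turn raw multipart text fields into typed options."""
    for required in ("diarize", "emotion"):
        if required not in fields:
            raise ValueError(f"missing required field {required!r}")
    # An absent or empty language falls back to auto-detect.
    language = fields.get("language", "").strip() or "auto"
    return TranscribeOptions(
        language=language,
        diarize=_parse_bool("diarize", fields["diarize"]),
        emotion=_parse_bool("emotion", fields["emotion"]),
    )
```

In a FastAPI handler the same checks would more likely be expressed through form parameters and pydantic validation; the plain function form just makes the rules explicit.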

WS /transcribe/stream

  • purpose: accepts a live audio byte stream, emits partial transcription and emotion events in real time.
  • message format (server → client):
    {
      "segment_id": "uuid",
      "speaker": "s0",
      "text": "Hello, I'm here.",
      "start_ms": 8100,
      "end_ms": 9040,
      "emotion": {
        "label": "Calm",
        "valence": 0.3,
        "arousal": -0.1,
        "dominance": 0.2
      }
    }
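As an illustration of how a client might consume these events, the sketch below decodes one message and checks that it has the keys shown in the example. The function names are hypothetical. The `confidence` field from the output schema is not required here because the example message does not include it; whether the stream actually carries it is not stated in this spec.

```python
import json

# Keys present in the example message above.
REQUIRED_KEYS = {"segment_id", "speaker", "text", "start_ms", "end_ms", "emotion"}
EMOTION_KEYS = {"label", "valence", "arousal", "dominance"}


def parse_stream_event(raw: str) -> dict:
    """Decode one server-to-client message and check its shape."""
    event = json.loads(raw)
    if not isinstance(event, dict):
        raise ValueError("event must be a JSON object")
    missing = REQUIRED_KEYS - event.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    emotion = event["emotion"]
    if not isinstance(emotion, dict):
        raise ValueError("emotion must be a JSON object")
    missing_emotion = EMOTION_KEYS - emotion.keys()
    if missing_emotion:
        raise ValueError(f"missing emotion keys: {sorted(missing_emotion)}")
    # Assumption: a segment never ends before it starts.
    if event["end_ms"] < event["start_ms"]:
        raise ValueError("end_ms precedes start_ms")
    return event


def duration_ms(event: dict) -> int:
    """Length of the segment covered by the event, in milliseconds."""
    return event["end_ms"] - event["start_ms"]
```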
    

GET /sessions/{session_id}

  • purpose: retrieves a previously processed session by ID.
  • output: full structured transcript with emotional metadata.

DELETE /sessions/{session_id}

  • purpose: deletes a stored session.

processing pipeline

  1. ingest: audio is received via REST upload or WebSocket stream.
  2. preprocessing: audio is resampled to 16 kHz mono, and silent segments are stripped via voice activity detection (VAD).
  3. diarization: speaker diarization using pyannote.audio to split audio into per-speaker segments.
  4. inference: each segment is passed to the Ethostral endpoint for:
    • automatic speech recognition (ASR).
    • emotion classification (categorical + dimensional: valence / arousal / dominance).
  5. post-processing: results are merged, timestamps are aligned, and output is structured per-segment.
  6. storage: sessions are persisted with a generated UUID.
  7. telemetry: each pipeline run is traced via Weights & Biases Weave.
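The steps above can be sketched as a small orchestration function. This is a simplified outline under stated assumptions, not the actual implementation: it covers steps 3 to 6 only, expects audio that is already preprocessed, and takes the diarization and inference stages as injected callables (`diarize` and `infer` are hypothetical stand-ins for pyannote.audio and the Ethostral endpoint). The return shape of `infer` is also an assumption.

```python
import uuid
from typing import Callable

# (speaker, start_ms, end_ms), as a diarization step might produce it.
Turn = tuple[str, int, int]


def run_pipeline(
    samples: list[float],
    sample_rate: int,
    diarize: Callable[[list[float], int], list[Turn]],
    infer: Callable[[list[float]], dict],
) -> dict:
    """Diarize, run inference per segment, merge results, assign a session UUID."""
    segments = []
    # Process turns in chronological order so the output is time-aligned.
    for speaker, start_ms, end_ms in sorted(diarize(samples, sample_rate), key=lambda t: t[1]):
        # Convert millisecond bounds to sample indices for slicing.
        lo = start_ms * sample_rate // 1000
        hi = end_ms * sample_rate // 1000
        # Assumed to return {"text": ..., "emotion": {...}}.
        result = infer(samples[lo:hi])
        segments.append({
            "id": str(uuid.uuid4()),
            "speaker": speaker,
            "start_ms": start_ms,
            "end_ms": end_ms,
            "text": result["text"],
            "emotion": result["emotion"],
        })
    return {"id": str(uuid.uuid4()), "segments": segments}
```

Passing the stages in as callables keeps the orchestration testable without loading a model; in a real deployment each call would also be a natural place to attach tracing.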

output schema

type Segment = {
  id: string
  speaker: string           // "s0", "s1", ...
  start_ms: number
  end_ms: number
  text: string
  emotion: {
    label: string           // "Happy", "Neutral", "Anxious", etc.
    valence: number         // -1.0 to 1.0
    arousal: number         // -1.0 to 1.0
    dominance: number       // -1.0 to 1.0
    confidence: number      // 0.0 to 1.0
  }
}

type Session = {
  id: string
  filename: string
  language: string
  duration_ms: number
  created_at: string        // ISO 8601
  segments: Segment[]
}
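On the Python side, the same schema could be mirrored with dataclasses that enforce the documented ranges. This is a stdlib-only sketch; since pydantic is listed as a dependency, the real models are more likely pydantic classes. The check that `end_ms` is not before `start_ms` and the UTC default for `created_at` are assumptions added for illustration.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _check_range(name: str, value: float, lo: float, hi: float) -> None:
    if not lo <= value <= hi:
        raise ValueError(f"{name}={value} is outside [{lo}, {hi}]")


@dataclass
class Emotion:
    label: str
    valence: float
    arousal: float
    dominance: float
    confidence: float

    def __post_init__(self) -> None:
        # Ranges taken from the comments in the schema above.
        for name in ("valence", "arousal", "dominance"):
            _check_range(name, getattr(self, name), -1.0, 1.0)
        _check_range("confidence", self.confidence, 0.0, 1.0)


@dataclass
class Segment:
    id: str
    speaker: str
    start_ms: int
    end_ms: int
    text: str
    emotion: Emotion

    def __post_init__(self) -> None:
        # Assumption: a segment never ends before it starts.
        if self.end_ms < self.start_ms:
            raise ValueError("end_ms precedes start_ms")


@dataclass
class Session:
    id: str
    filename: str
    language: str
    duration_ms: int
    segments: list[Segment] = field(default_factory=list)
    # ISO 8601 timestamp; UTC is an assumption, the schema does not name a zone.
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
```

`asdict(session)` then yields a nested dict that can be passed to `json.dumps` for the REST response.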

dependencies

  • fastapi: async HTTP and WebSocket server.
  • pydantic: request/response schema validation.
  • pyannote.audio: speaker diarization.
  • transformers + peft: Ethostral model loading and adapter inference.
  • torchaudio: audio preprocessing and resampling.
  • wandb: Weights & Biases Weave integration for pipeline tracing.
  • huggingface_hub: programmatic access to model weights and datasets.

performance targets

  • transcription latency (batch): < 2× real-time (e.g., a 60 s file processed in < 120 s).
  • streaming latency: < 500ms from audio chunk to partial transcript event.
  • emotion classification latency: < 100ms per segment (excluding ASR).
  • word error rate: target < 10% on clean English audio.
  • emotion F1 score: target > 0.70 on the IEMOCAP benchmark.
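Two of these targets reduce to simple formulas, sketched below. The real-time factor is processing time divided by audio duration, so the batch target corresponds to a factor below 2. Word error rate is computed here as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, with plain whitespace tokenization; the text normalization an actual evaluation would apply is not specified in this document, so treat that part as an assumption.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time relative to audio length; < 2.0 meets the batch target."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, a 60 s file processed in 90 s has a real-time factor of 1.5, which is within the batch target.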