overview
- model: Ethostral, a fine-tuned Mistral Voxtral for joint ASR and emotion classification.
- framework: Python (FastAPI) for the API layer.
- inference runtime: Hugging Face `transformers` + `peft` for adapter-based fine-tuned inference.
- real-time transport: WebSockets for streaming audio and transcription events.
- hosting: Hugging Face Inference Endpoints for model serving.
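A minimal loading sketch for the adapter-based setup, assuming a PEFT adapter published on top of a base Voxtral checkpoint; the repo IDs and the Auto class below are placeholders, not the project's actual weights:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

# Placeholder IDs: the actual Ethostral base checkpoint and adapter repo are not named here.
BASE_MODEL_ID = "base-org/voxtral-base-checkpoint"
ADAPTER_ID = "your-org/ethostral-adapter"

# The exact Auto* class depends on how the base checkpoint is registered in transformers.
processor = AutoProcessor.from_pretrained(BASE_MODEL_ID)
base_model = AutoModelForSpeechSeq2Seq.from_pretrained(BASE_MODEL_ID, torch_dtype=torch.float16)

# Attach the fine-tuned adapter weights (LoRA or similar) on top of the frozen base model.
model = PeftModel.from_pretrained(base_model, ADAPTER_ID)
model.eval()
```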
api endpoints
POST /transcribe
- purpose: accepts an uploaded audio/video file, runs the Ethostral pipeline, returns a structured transcript.
- input: multipart form-data with fields: `file`, `language` (optional, default: auto-detect), `diarize` (bool), `emotion` (bool).
- output: JSON object with diarized segments, each containing transcript text, speaker id, timestamps, and emotional metadata.
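For reference, a minimal client call might look like the sketch below; the base URL, filename, and field values are illustrative, not part of the spec:

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed local deployment

# Upload a file and request diarization plus emotion metadata.
with open("meeting.wav", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/transcribe",
        files={"file": ("meeting.wav", f, "audio/wav")},
        data={"language": "en", "diarize": "true", "emotion": "true"},
    )
resp.raise_for_status()

for seg in resp.json()["segments"]:
    print(f'[{seg["speaker"]}] {seg["text"]} ({seg["emotion"]["label"]})')
```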
WS /transcribe/stream
- purpose: accepts a live audio byte stream, emits partial transcription and emotion events in real time.
- message format (server → client):
{ "segment_id": "uuid", "speaker": "s0", "text": "Hello, I'm here.", "start_ms": 8100, "end_ms": 9040, "emotion": { "label": "Calm", "valence": 0.3, "arousal": -0.1, "dominance": 0.2 } }
GET /sessions/{session_id}
- purpose: retrieves a previously processed session by ID.
- output: full structured transcript with emotional metadata.
DELETE /sessions/{session_id}
- purpose: deletes a stored session.
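A server-side sketch of the two session routes, assuming an in-memory dict keyed by session UUID; the service's actual storage backend is not specified above:

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()
SESSIONS: dict[str, dict] = {}  # placeholder store; populated by the transcription pipeline

@app.get("/sessions/{session_id}")
def get_session(session_id: str) -> dict:
    session = SESSIONS.get(session_id)
    if session is None:
        raise HTTPException(status_code=404, detail="session not found")
    return session

@app.delete("/sessions/{session_id}", status_code=204)
def delete_session(session_id: str) -> None:
    if SESSIONS.pop(session_id, None) is None:
        raise HTTPException(status_code=404, detail="session not found")
```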
processing pipeline
- ingest: audio is received via REST upload or WebSocket stream.
- preprocessing: audio is resampled to 16 kHz mono, and silence segments are stripped via voice activity detection (VAD); see the preprocessing sketch after this list.
- diarization: speaker diarization using `pyannote.audio` to split audio into per-speaker segments.
- inference: each segment is passed to the Ethostral endpoint for:
- automatic speech recognition (ASR).
- emotion classification (categorical + dimensional: valence / arousal / dominance).
- post-processing: results are merged, timestamps are aligned, and output is structured per-segment.
- storage: sessions are persisted with a generated UUID.
- telemetry: each pipeline run is traced via Weights & Biases Weave.
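As referenced in the preprocessing step above, a resampling sketch with `torchaudio`; the VAD step is only marked as a placeholder, since the spec does not name a specific VAD implementation:

```python
import torch
import torchaudio

TARGET_SR = 16_000  # target sample rate per the pipeline spec

def preprocess(path: str) -> torch.Tensor:
    """Load an audio file, downmix to mono, and resample to 16 kHz."""
    waveform, sample_rate = torchaudio.load(path)        # shape: (channels, samples)
    if waveform.size(0) > 1:
        waveform = waveform.mean(dim=0, keepdim=True)     # downmix to mono
    if sample_rate != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sample_rate, TARGET_SR)
    # Silence stripping via VAD would happen here (implementation not specified above).
    return waveform
```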
output schema
type Segment = {
  id: string
  speaker: string       // "s0", "s1", ...
  start_ms: number
  end_ms: number
  text: string
  emotion: {
    label: string       // "Happy", "Neutral", "Anxious", etc.
    valence: number     // -1.0 to 1.0
    arousal: number     // -1.0 to 1.0
    dominance: number   // -1.0 to 1.0
    confidence: number  // 0.0 to 1.0
  }
}

type Session = {
  id: string
  filename: string
  language: string
  duration_ms: number
  created_at: string    // ISO 8601
  segments: Segment[]
}
dependencies
- `fastapi`: async HTTP and WebSocket server.
- `pydantic`: request/response schema validation.
- `pyannote.audio`: speaker diarization.
- `transformers` + `peft`: Ethostral model loading and adapter inference.
- `torchaudio`: audio preprocessing and resampling.
- `wandb`: Weights & Biases Weave integration for pipeline tracing.
- `huggingface_hub`: programmatic access to model weights and datasets.
performance targets
- transcription latency (batch): < 2× real-time (e.g., a 60s file processed in < 120s).
- streaming latency: < 500ms from audio chunk to partial transcript event.
- emotion classification latency: < 100ms per segment (excluding ASR).
- word error rate: target < 10% on clean English audio.
- emotion F1 score: target > 0.70 on the IEMOCAP benchmark.