## overview
- **model:** Ethostral – a fine-tuned Mistral Voxtral for joint ASR and emotion classification.
- **framework:** Python (FastAPI) for the API layer.
- **inference runtime:** Hugging Face `transformers` + `peft` for adapter-based fine-tuned inference.
- **real-time transport:** WebSockets for streaming audio and transcription events.
- **hosting:** Hugging Face Inference Endpoints for model serving.
## api endpoints
### `POST /transcribe`
- **purpose:** accepts an uploaded audio/video file, runs the Ethostral pipeline, returns a structured transcript.
- **input:** multipart form-data with fields: `file`, `language` (optional, default: auto-detect), `diarize` (bool), `emotion` (bool).
- **output:** JSON object with diarized segments, each containing transcript text, speaker id, timestamps, and emotional metadata.
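A client call can be sketched as follows. The base URL and the use of the third-party `requests` library are assumptions; only the form field names (`file`, `language`, `diarize`, `emotion`) come from the spec above.

```python
def build_transcribe_form(language=None, diarize=True, emotion=True):
    """Build the non-file form fields for POST /transcribe.

    Booleans are sent as lowercase strings, a common multipart convention
    (an assumption; the server may accept other encodings).
    """
    data = {"diarize": str(diarize).lower(), "emotion": str(emotion).lower()}
    if language is not None:  # omitted -> server auto-detects
        data["language"] = language
    return data

# import requests
# with open("meeting.wav", "rb") as f:  # hypothetical file
#     resp = requests.post("http://localhost:8000/transcribe",
#                          files={"file": f}, data=build_transcribe_form())
```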
### `WS /transcribe/stream`
- **purpose:** accepts a live audio byte stream, emits partial transcription and emotion events in real time.
- **message format (server → client):**
```json
{
  "segment_id": "uuid",
  "speaker": "s0",
  "text": "Hello, I'm here.",
  "start_ms": 8100,
  "end_ms": 9040,
  "emotion": {
    "label": "Calm",
    "valence": 0.3,
    "arousal": -0.1,
    "dominance": 0.2
  }
}
```
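A stream consumer can validate each event against the message format above. This parser is a sketch; the dataclass and function names are illustrative, not part of the API:

```python
import json
from dataclasses import dataclass

@dataclass
class Emotion:
    label: str
    valence: float
    arousal: float
    dominance: float

@dataclass
class SegmentEvent:
    segment_id: str
    speaker: str
    text: str
    start_ms: int
    end_ms: int
    emotion: Emotion

def parse_segment_event(raw: str) -> SegmentEvent:
    """Parse one server -> client WebSocket message into a typed event."""
    d = json.loads(raw)
    return SegmentEvent(
        segment_id=d["segment_id"],
        speaker=d["speaker"],
        text=d["text"],
        start_ms=d["start_ms"],
        end_ms=d["end_ms"],
        emotion=Emotion(**d["emotion"]),
    )
```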
### `GET /sessions/{session_id}`
- **purpose:** retrieves a previously processed session by ID.
- **output:** full structured transcript with emotional metadata.
### `DELETE /sessions/{session_id}`
- **purpose:** deletes a stored session.
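The two session routes imply a store keyed by the generated UUID (see the pipeline's storage step). A minimal in-memory sketch, with illustrative class and method names:

```python
import uuid

class SessionStore:
    """In-memory backing for GET/DELETE /sessions/{session_id}.

    A sketch only; a production deployment would use a durable store.
    """
    def __init__(self):
        self._sessions = {}

    def save(self, session: dict) -> str:
        session_id = str(uuid.uuid4())  # matches the spec's generated UUID
        self._sessions[session_id] = session
        return session_id

    def get(self, session_id: str):
        """Return the session dict, or None if unknown (-> HTTP 404)."""
        return self._sessions.get(session_id)

    def delete(self, session_id: str) -> bool:
        """Remove a session; returns False if it did not exist."""
        return self._sessions.pop(session_id, None) is not None
```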
## processing pipeline
1. **ingest:** audio is received via REST upload or WebSocket stream.
2. **preprocessing:** audio is resampled to 16 kHz mono. silence segments are stripped via VAD (voice activity detection).
3. **diarization:** speaker diarization using `pyannote.audio` to split audio into per-speaker segments.
4. **inference:** each segment is passed to the Ethostral endpoint for:
- automatic speech recognition (ASR).
- emotion classification (categorical + dimensional: valence / arousal / dominance).
5. **post-processing:** results are merged, timestamps are aligned, and output is structured per-segment.
6. **storage:** sessions are persisted with a generated UUID.
7. **telemetry:** each pipeline run is traced via Weights & Biases Weave.
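Step 5 (merging ASR and emotion results into time-aligned, per-segment output) can be sketched as a pure function. The input shapes are assumptions modeled on the output schema; the real pipeline may differ:

```python
def merge_results(asr_segments, emotion_results):
    """Combine ASR segments with per-segment emotion predictions.

    asr_segments: list of dicts with "id", "speaker", "start_ms",
        "end_ms", "text" (assumed shape).
    emotion_results: dict mapping segment id -> emotion dict.
    Returns segments sorted by start time with emotion attached.
    """
    merged = []
    for seg in sorted(asr_segments, key=lambda s: s["start_ms"]):
        merged.append({**seg, "emotion": emotion_results.get(seg["id"])})
    return merged
```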
## output schema
```typescript
type Segment = {
  id: string
  speaker: string // "s0", "s1", ...
  start_ms: number
  end_ms: number
  text: string
  emotion: {
    label: string // "Happy", "Neutral", "Anxious", etc.
    valence: number // -1.0 to 1.0
    arousal: number // -1.0 to 1.0
    dominance: number // -1.0 to 1.0
    confidence: number // 0.0 to 1.0
  }
}

type Session = {
  id: string
  filename: string
  language: string
  duration_ms: number
  created_at: string // ISO 8601
  segments: Segment[]
}
```
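The numeric ranges in the schema can be enforced server-side before a segment is stored. A sketch of such a range check (field names come from the schema; the function itself is mine, and in practice this belongs in a `pydantic` validator):

```python
def validate_emotion(emotion: dict) -> list:
    """Return a list of range violations for an emotion payload."""
    errors = []
    for dim in ("valence", "arousal", "dominance"):
        if not -1.0 <= emotion[dim] <= 1.0:
            errors.append(f"{dim} out of range [-1, 1]")
    if not 0.0 <= emotion["confidence"] <= 1.0:
        errors.append("confidence out of range [0, 1]")
    return errors
```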
## dependencies
- **`fastapi`** – async HTTP and WebSocket server.
- **`pydantic`** – request/response schema validation.
- **`pyannote.audio`** – speaker diarization.
- **`transformers` + `peft`** – Ethostral model loading and adapter inference.
- **`torchaudio`** – audio preprocessing and resampling.
- **`wandb`** – Weights & Biases Weave integration for pipeline tracing.
- **`huggingface_hub`** – programmatic access to model weights and datasets.
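The list above maps to an install line along these lines (versions unpinned; `uvicorn` is my addition as the usual ASGI server for FastAPI, and `pyannote.audio` may additionally require accepting gated model terms on Hugging Face):

```shell
pip install fastapi "uvicorn[standard]" pydantic pyannote.audio \
    transformers peft torchaudio wandb huggingface_hub
```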
## performance targets
- **transcription latency (batch):** < 2× real-time (e.g., a 60s file processed in < 120s).
- **streaming latency:** < 500ms from audio chunk to partial transcript event.
- **emotion classification latency:** < 100ms per segment (excluding ASR).
- **word error rate:** target < 10% on clean English audio.
- **emotion F1 score:** target > 0.70 on the IEMOCAP benchmark.
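The batch target is equivalent to keeping the real-time factor (processing time divided by audio duration) below 2.0:

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration; the batch target is < 2.0."""
    return processing_s / audio_s

# A 60 s file processed in 110 s meets the < 2x real-time target:
# real_time_factor(110, 60) -> ~1.83
```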