## overview
- **model:** Ethostral – a fine-tuned Mistral Voxtral for joint ASR and emotion classification.
- **framework:** Python (FastAPI) for the API layer.
- **inference runtime:** Hugging Face `transformers` + `peft` for adapter-based fine-tuned inference (a loading sketch follows this list).
- **real-time transport:** WebSockets for streaming audio and transcription events.
- **hosting:** Hugging Face Inference Endpoints for model serving.

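A minimal loading sketch for the adapter-based runtime above; it assumes a `transformers` build with Voxtral support, and both repo names are illustrative, not confirmed identifiers:

```python
# Load the base Voxtral checkpoint and attach the Ethostral PEFT adapter.
# Both repo names below are assumptions, not confirmed identifiers.
import torch
from peft import PeftModel
from transformers import AutoProcessor, VoxtralForConditionalGeneration

BASE = "mistralai/Voxtral-Mini-3B-2507"  # assumed base checkpoint
ADAPTER = "your-org/ethostral-adapter"   # hypothetical adapter repo

processor = AutoProcessor.from_pretrained(BASE)
model = VoxtralForConditionalGeneration.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER)  # attach fine-tuned weights
model.eval()
```
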
## api endpoints

### `POST /transcribe`
- **purpose:** accepts an uploaded audio/video file, runs the Ethostral pipeline, and returns a structured transcript (a route sketch follows this list).
- **input:** multipart form data with the fields `file`, `language` (optional; default: auto-detect), `diarize` (bool), and `emotion` (bool).
- **output:** a JSON object with diarized segments, each containing transcript text, speaker id, timestamps, and emotional metadata.

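A hedged sketch of this route in FastAPI; `run_pipeline` is a hypothetical helper standing in for the processing pipeline described later, not part of the spec:

```python
# Hypothetical POST /transcribe handler; run_pipeline() stands in for the
# preprocessing -> diarization -> inference pipeline described below.
from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()

def run_pipeline(audio: bytes, *, language: str | None,
                 diarize: bool, emotion: bool) -> dict:
    """Hypothetical stand-in for the full Ethostral pipeline."""
    raise NotImplementedError

@app.post("/transcribe")
async def transcribe(
    file: UploadFile = File(...),
    language: str | None = Form(None),  # None -> auto-detect
    diarize: bool = Form(True),
    emotion: bool = Form(True),
):
    audio_bytes = await file.read()
    return run_pipeline(audio_bytes, language=language,
                        diarize=diarize, emotion=emotion)
```
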
### `WS /transcribe/stream`
- **purpose:** accepts a live audio byte stream and emits partial transcription and emotion events in real time (an endpoint sketch follows the example message).
- **message format (server → client):**
```json
{
  "segment_id": "uuid",
  "speaker": "s0",
  "text": "Hello, I'm here.",
  "start_ms": 8100,
  "end_ms": 9040,
  "emotion": {
    "label": "Calm",
    "valence": 0.3,
    "arousal": -0.1,
    "dominance": 0.2
  }
}
```

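Under the same assumptions, the server side of the stream might look like this; `stream_segments` is a hypothetical async generator yielding events shaped like the message above:

```python
# Hypothetical WS /transcribe/stream handler: receive raw audio chunks and
# push partial segment events shaped like the JSON message above.
from typing import AsyncIterator

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

async def stream_segments(chunk: bytes) -> AsyncIterator[dict]:
    """Hypothetical stand-in for streaming Ethostral inference."""
    raise NotImplementedError
    yield {}  # unreachable; marks this as an async generator

@app.websocket("/transcribe/stream")
async def transcribe_stream(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            chunk = await ws.receive_bytes()       # raw audio bytes
            async for event in stream_segments(chunk):
                await ws.send_json(event)          # matches the schema above
    except WebSocketDisconnect:
        pass
```
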
### `GET /sessions/{session_id}`
- **purpose:** retrieves a previously processed session by ID.
- **output:** the full structured transcript with emotional metadata.

### `DELETE /sessions/{session_id}`
- **purpose:** deletes a stored session (a sketch of both session routes follows).

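A minimal sketch of both session routes over an in-memory store; a real deployment would back this with persistent storage:

```python
# Session retrieval and deletion over a simple in-memory store; the dict
# here is illustrative, not the spec's storage layer.
from fastapi import FastAPI, HTTPException

app = FastAPI()
SESSIONS: dict[str, dict] = {}  # session_id -> stored session JSON

@app.get("/sessions/{session_id}")
async def get_session(session_id: str):
    if session_id not in SESSIONS:
        raise HTTPException(status_code=404, detail="session not found")
    return SESSIONS[session_id]

@app.delete("/sessions/{session_id}")
async def delete_session(session_id: str):
    SESSIONS.pop(session_id, None)
    return {"deleted": session_id}
```
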
## processing pipeline

1. **ingest:** audio is received via REST upload or WebSocket stream.
2. **preprocessing:** audio is resampled to 16 kHz mono; silent spans are stripped via voice activity detection (VAD).
3. **diarization:** speaker diarization with `pyannote.audio` splits the audio into per-speaker segments (a sketch of steps 2-3 follows this list).
4. **inference:** each segment is passed to the Ethostral endpoint for:
   - automatic speech recognition (ASR).
   - emotion classification (categorical + dimensional: valence / arousal / dominance).
5. **post-processing:** results are merged, timestamps are aligned, and the output is structured per segment.
6. **storage:** sessions are persisted under a generated UUID.
7. **telemetry:** each pipeline run is traced via Weights & Biases Weave.

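A minimal sketch of steps 2-3 using `torchaudio` and `pyannote.audio`; the diarization checkpoint name is the commonly used public one and an assumption here:

```python
# Step 2: resample to 16 kHz mono. Step 3: diarize into speaker turns.
# "pyannote/speaker-diarization-3.1" is assumed and may need an HF token.
import torch
import torchaudio
from pyannote.audio import Pipeline

def preprocess(path: str, target_sr: int = 16_000) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)           # (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    return waveform

diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
annotation = diarizer("input.wav")
for turn, _, speaker in annotation.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.2f}s -> {turn.end:.2f}s")
```
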
## output schema

```typescript
type Segment = {
  id: string
  speaker: string       // "s0", "s1", ...
  start_ms: number
  end_ms: number
  text: string
  emotion: {
    label: string       // "Happy", "Neutral", "Anxious", etc.
    valence: number     // -1.0 to 1.0
    arousal: number     // -1.0 to 1.0
    dominance: number   // -1.0 to 1.0
    confidence: number  // 0.0 to 1.0
  }
}

type Session = {
  id: string
  filename: string
  language: string
  duration_ms: number
  created_at: string    // ISO 8601
  segments: Segment[]
}
```

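The backend would presumably mirror this schema with `pydantic` models (listed under dependencies); a hedged sketch, with validation ranges taken from the comments above:

```python
# Pydantic mirror of the TypeScript schema; field names follow the spec,
# and the ge/le bounds encode the commented value ranges.
from pydantic import BaseModel, Field

class Emotion(BaseModel):
    label: str                          # "Happy", "Neutral", "Anxious", ...
    valence: float = Field(ge=-1.0, le=1.0)
    arousal: float = Field(ge=-1.0, le=1.0)
    dominance: float = Field(ge=-1.0, le=1.0)
    confidence: float = Field(ge=0.0, le=1.0)

class Segment(BaseModel):
    id: str
    speaker: str                        # "s0", "s1", ...
    start_ms: int
    end_ms: int
    text: str
    emotion: Emotion

class Session(BaseModel):
    id: str
    filename: str
    language: str
    duration_ms: int
    created_at: str                     # ISO 8601
    segments: list[Segment]
```
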
## dependencies
- **`fastapi`** – async HTTP and WebSocket server.
- **`pydantic`** – request/response schema validation.
- **`pyannote.audio`** – speaker diarization.
- **`transformers` + `peft`** – Ethostral model loading and adapter inference.
- **`torchaudio`** – audio preprocessing and resampling.
- **`wandb`** – Weights & Biases Weave integration for pipeline tracing (tracing sketch after this list).
- **`huggingface_hub`** – programmatic access to model weights and datasets.

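For the telemetry step, tracing could look like the following; this assumes the standalone `weave` package (installed alongside `wandb`) and a hypothetical project name:

```python
# Trace pipeline runs as Weave ops; "ethostral-backend" is a hypothetical
# project name.
import weave

weave.init("ethostral-backend")

@weave.op()
def transcribe_file(path: str) -> dict:
    """Each call is recorded as a traced op with inputs and outputs."""
    # ... run the pipeline described above ...
    return {"segments": []}
```
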
## performance targets
- **transcription latency (batch):** < 2× real-time (e.g., a 60 s file processed in < 120 s).
- **streaming latency:** < 500 ms from audio chunk to partial transcript event.
- **emotion classification latency:** < 100 ms per segment (excluding ASR).
- **word error rate:** target < 10% on clean English audio.
- **emotion F1 score:** target > 0.70 on the IEMOCAP benchmark.