aytoasty committed
Commit 515632d · Parent(s): 1fdd623

docs: add backend_spec.md

Files changed (1): .context/backend_spec.md (added, +95 lines)

## overview
- **model:** Ethostral, a fine-tuned Mistral Voxtral for joint ASR and emotion classification.
- **framework:** Python (FastAPI) for the API layer.
- **inference runtime:** Hugging Face `transformers` + `peft` for adapter-based fine-tuned inference.
- **real-time transport:** WebSockets for streaming audio and transcription events.
- **hosting:** Hugging Face Inference Endpoints for model serving.

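The spec does not pin down loading code; a minimal sketch of adapter-based loading with `transformers` + `peft` follows. The base checkpoint and adapter repo names are assumptions, not confirmed by this spec:

```python
# sketch: loading the Ethostral adapter onto a Voxtral base with transformers + peft.
# BASE and ADAPTER are assumed/hypothetical names, not confirmed by this spec.
from peft import PeftModel
from transformers import AutoProcessor, VoxtralForConditionalGeneration

BASE = "mistralai/Voxtral-Mini-3B-2507"  # assumed base checkpoint
ADAPTER = "aytoasty/ethostral"           # hypothetical adapter repo

processor = AutoProcessor.from_pretrained(BASE)
base_model = VoxtralForConditionalGeneration.from_pretrained(BASE)
model = PeftModel.from_pretrained(base_model, ADAPTER)  # attach the fine-tuned adapter
```
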
## api endpoints

### `POST /transcribe`
- **purpose:** accepts an uploaded audio/video file, runs the Ethostral pipeline, and returns a structured transcript.
- **input:** multipart form-data with fields: `file`, `language` (optional, default: auto-detect), `diarize` (bool), `emotion` (bool).
- **output:** JSON object with diarized segments, each containing transcript text, speaker id, timestamps, and emotional metadata.

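A minimal FastAPI sketch of this route; `run_pipeline` is a hypothetical stand-in for the processing pipeline described below, and the `diarize`/`emotion` defaults are assumptions:

```python
# sketch: POST /transcribe as a FastAPI route
from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()

async def run_pipeline(audio: bytes, **opts) -> dict:
    """Hypothetical stand-in for the processing pipeline below."""
    raise NotImplementedError

@app.post("/transcribe")
async def transcribe(
    file: UploadFile = File(...),
    language: str | None = Form(None),  # None -> auto-detect
    diarize: bool = Form(True),         # defaults here are assumptions
    emotion: bool = Form(True),
):
    audio = await file.read()
    # returns a Session-shaped dict per the output schema below
    return await run_pipeline(audio, filename=file.filename,
                              language=language, diarize=diarize, emotion=emotion)
```
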
### `WS /transcribe/stream`
- **purpose:** accepts a live audio byte stream and emits partial transcription and emotion events in real time.
- **message format (server → client):**
```json
{
  "segment_id": "uuid",
  "speaker": "s0",
  "text": "Hello, I'm here.",
  "start_ms": 8100,
  "end_ms": 9040,
  "emotion": {
    "label": "Calm",
    "valence": 0.3,
    "arousal": -0.1,
    "dominance": 0.2
  }
}
```

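A sketch of the WebSocket side; `stream_segments` is a hypothetical async generator yielding event dicts in the format shown above:

```python
# sketch: WS /transcribe/stream handler
from typing import AsyncIterator

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

async def stream_segments(chunk: bytes) -> AsyncIterator[dict]:
    """Hypothetical incremental ASR + emotion inference over one audio chunk."""
    raise NotImplementedError
    yield {}  # unreachable; marks this as an async generator

@app.websocket("/transcribe/stream")
async def transcribe_stream(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            chunk = await ws.receive_bytes()           # raw audio from the client
            async for event in stream_segments(chunk):
                await ws.send_json(event)              # schema shown above
    except WebSocketDisconnect:
        pass
```

A real implementation would buffer audio and revise partial segments as more context arrives; the sketch simply emits one batch of events per received chunk.
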
### `GET /sessions/{session_id}`
- **purpose:** retrieves a previously processed session by ID.
- **output:** full structured transcript with emotional metadata.

### `DELETE /sessions/{session_id}`
- **purpose:** deletes a stored session.

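The session routes could be as simple as the sketch below; the in-memory `SESSIONS` dict is a placeholder, since the spec only says sessions are persisted under a UUID:

```python
# sketch: session retrieval and deletion; SESSIONS is a placeholder store
from fastapi import FastAPI, HTTPException

app = FastAPI()
SESSIONS: dict[str, dict] = {}  # id -> Session-shaped dict

@app.get("/sessions/{session_id}")
async def get_session(session_id: str):
    if session_id not in SESSIONS:
        raise HTTPException(status_code=404, detail="session not found")
    return SESSIONS[session_id]

@app.delete("/sessions/{session_id}")
async def delete_session(session_id: str):
    SESSIONS.pop(session_id, None)
    return {"deleted": session_id}
```
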
## processing pipeline

1. **ingest:** audio is received via REST upload or WebSocket stream.
2. **preprocessing:** audio is resampled to 16 kHz mono; silence is stripped via voice activity detection (VAD). (see the first sketch after this list)
3. **diarization:** speaker diarization with `pyannote.audio` splits the audio into per-speaker segments. (see the second sketch below)
4. **inference:** each segment is passed to the Ethostral endpoint for:
   - automatic speech recognition (ASR).
   - emotion classification (categorical + dimensional: valence / arousal / dominance).
5. **post-processing:** results are merged, timestamps are aligned, and output is structured per segment.
6. **storage:** sessions are persisted under a generated UUID.
7. **telemetry:** each pipeline run is traced via Weights & Biases Weave.

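A sketch of step 2 with `torchaudio`; the VAD pass is omitted because the spec does not name a VAD library:

```python
# sketch: step 2, resample to 16 kHz mono with torchaudio (VAD omitted)
import torch
import torchaudio

TARGET_SR = 16_000

def preprocess(path: str) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)  # (channels, samples)
    if waveform.size(0) > 1:
        waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    return waveform
```

And a sketch of step 3 with `pyannote.audio`'s pretrained pipeline; the checkpoint name and the remapping to `s0`-style labels are assumptions:

```python
# sketch: step 3, speaker diarization with pyannote.audio
from pyannote.audio import Pipeline

# checkpoint name is an assumption; gated models also need an HF auth token
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")

def diarize(path: str) -> list[dict]:
    segments = []
    for turn, _, speaker in diarizer(path).itertracks(yield_label=True):
        segments.append({
            "speaker": speaker,  # e.g. "SPEAKER_00"; remap to "s0" downstream
            "start_ms": int(turn.start * 1000),
            "end_ms": int(turn.end * 1000),
        })
    return segments
```
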
## output schema

```typescript
type Segment = {
  id: string
  speaker: string      // "s0", "s1", ...
  start_ms: number
  end_ms: number
  text: string
  emotion: {
    label: string      // "Happy", "Neutral", "Anxious", etc.
    valence: number    // -1.0 to 1.0
    arousal: number    // -1.0 to 1.0
    dominance: number  // -1.0 to 1.0
    confidence: number // 0.0 to 1.0
  }
}

type Session = {
  id: string
  filename: string
  language: string
  duration_ms: number
  created_at: string   // ISO 8601
  segments: Segment[]
}
```

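Since `pydantic` is already a dependency, the same schema could double as the FastAPI response models; a direct translation:

```python
# sketch: the output schema above as pydantic response models
from pydantic import BaseModel

class Emotion(BaseModel):
    label: str        # "Happy", "Neutral", "Anxious", etc.
    valence: float    # -1.0 to 1.0
    arousal: float    # -1.0 to 1.0
    dominance: float  # -1.0 to 1.0
    confidence: float # 0.0 to 1.0

class Segment(BaseModel):
    id: str
    speaker: str      # "s0", "s1", ...
    start_ms: int
    end_ms: int
    text: str
    emotion: Emotion

class Session(BaseModel):
    id: str
    filename: str
    language: str
    duration_ms: int
    created_at: str   # ISO 8601
    segments: list[Segment]
```
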
## dependencies
- **`fastapi`**: async HTTP and WebSocket server.
- **`pydantic`**: request/response schema validation.
- **`pyannote.audio`**: speaker diarization.
- **`transformers`** + **`peft`**: Ethostral model loading and adapter inference.
- **`torchaudio`**: audio preprocessing and resampling.
- **`wandb`**: Weights & Biases Weave integration for pipeline tracing. (see the sketch below)
- **`huggingface_hub`**: programmatic access to model weights and datasets.

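A sketch of the telemetry step's tracing, assuming the standalone `weave` package (installed alongside `wandb`); the project name is hypothetical:

```python
# sketch: tracing pipeline runs with W&B Weave
import weave

weave.init("ethostral-backend")  # hypothetical project name

@weave.op()
def run_pipeline(audio_path: str) -> dict:
    # decorated calls are traced automatically: inputs, outputs, and latency
    ...
```
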
90
+ ## performance targets
91
+ - **transcription latency (batch):** < 2Γ— real-time (e.g., a 60s file processed in < 120s).
92
+ - **streaming latency:** < 500ms from audio chunk to partial transcript event.
93
+ - **emotion classification latency:** < 100ms per segment (excluding ASR).
94
+ - **word error rate:** target < 10% on clean English audio.
95
+ - **emotion F1 score:** target > 0.70 across the IEMOCAP benchmark.