## overview
- **model:** Ethostral, a fine-tuned Mistral Voxtral model for joint ASR and emotion classification.
- **framework:** Python (FastAPI) for the API layer.
- **inference runtime:** Hugging Face `transformers` + `peft` for adapter-based fine-tuned inference.
- **real-time transport:** WebSockets for streaming audio and transcription events.
- **hosting:** Hugging Face Inference Endpoints for model serving.

## api endpoints

### `POST /transcribe`
- **purpose:** accepts an uploaded audio/video file, runs the Ethostral pipeline, returns a structured transcript.
- **input:** multipart form-data with fields: `file`, `language` (optional, default: auto-detect), `diarize` (bool), `emotion` (bool).
- **output:** JSON object with diarized segments, each containing transcript text, speaker id, timestamps, and emotional metadata.
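
A minimal sketch of how the optional form fields might be normalized before the pipeline runs. `TranscribeOptions`, `parse_bool`, and `parse_transcribe_form` are hypothetical names. The `"auto"` sentinel for auto-detect and the `True` defaults for `diarize` and `emotion` are assumptions, since the spec only states a default for `language`.

```python
from dataclasses import dataclass
from typing import Optional

TRUE_VALUES = {"1", "true", "yes", "on"}
FALSE_VALUES = {"0", "false", "no", "off"}


@dataclass(frozen=True)
class TranscribeOptions:
    language: str  # "auto" stands in for auto-detect (assumed sentinel)
    diarize: bool
    emotion: bool


def parse_bool(value: Optional[str], default: bool) -> bool:
    """Interpret a form string as a bool; use `default` when it is absent."""
    if value is None or value.strip() == "":
        return default
    lowered = value.strip().lower()
    if lowered in TRUE_VALUES:
        return True
    if lowered in FALSE_VALUES:
        return False
    raise ValueError(f"not a boolean: {value!r}")


def parse_transcribe_form(fields: dict) -> TranscribeOptions:
    """Build options from the non-file form fields of POST /transcribe."""
    language = (fields.get("language") or "").strip().lower()
    return TranscribeOptions(
        language=language or "auto",
        # Defaults below are assumptions; the spec does not state them.
        diarize=parse_bool(fields.get("diarize"), default=True),
        emotion=parse_bool(fields.get("emotion"), default=True),
    )
```

Rejecting unrecognized boolean strings with `ValueError`, rather than silently treating them as false, lets the API layer surface a client error.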

### `WS /transcribe/stream`
- **purpose:** accepts a live audio byte stream, emits partial transcription and emotion events in real time.
- **message format (server → client):**
  ```json
  {
    "segment_id": "uuid",
    "speaker": "s0",
    "text": "Hello, I'm here.",
    "start_ms": 8100,
    "end_ms": 9040,
    "emotion": {
      "label": "Calm",
      "valence": 0.3,
      "arousal": -0.1,
      "dominance": 0.2
    }
  }
  ```
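
A client consuming this stream might decode and sanity-check each message along these lines. This is a sketch using only the standard library; `parse_stream_event` and `format_event` are hypothetical helpers, and the check that `end_ms` does not precede `start_ms` is an assumption rather than a documented guarantee.

```python
import json

REQUIRED_KEYS = {"segment_id", "speaker", "text", "start_ms", "end_ms", "emotion"}
EMOTION_KEYS = {"label", "valence", "arousal", "dominance"}


def parse_stream_event(raw: str) -> dict:
    """Decode one server-to-client message and check the fields shown above."""
    event = json.loads(raw)
    missing = REQUIRED_KEYS - set(event)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    missing_emotion = EMOTION_KEYS - set(event["emotion"])
    if missing_emotion:
        raise ValueError(f"missing emotion keys: {sorted(missing_emotion)}")
    if event["end_ms"] < event["start_ms"]:
        raise ValueError("end_ms precedes start_ms")
    return event


def format_event(event: dict) -> str:
    """Render a one-line summary, for example for a console client."""
    start_s = event["start_ms"] / 1000
    end_s = event["end_ms"] / 1000
    label = event["emotion"]["label"]
    return f"[{start_s:.2f}-{end_s:.2f}s] {event['speaker']} ({label}): {event['text']}"
```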

### `GET /sessions/{session_id}`
- **purpose:** retrieves a previously processed session by ID.
- **output:** full structured transcript with emotional metadata.

### `DELETE /sessions/{session_id}`
- **purpose:** deletes a stored session.
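
The spec does not name a storage backend, so the following is only an in-memory stand-in showing the lookup and delete semantics the two session endpoints imply. `SessionStore` and its methods are hypothetical.

```python
import uuid
from typing import Optional


class SessionStore:
    """In-memory stand-in for whatever storage backs the session endpoints."""

    def __init__(self) -> None:
        self._sessions: dict = {}

    def create(self, session: dict) -> str:
        """Persist a session under a freshly generated UUID and return the ID."""
        session_id = str(uuid.uuid4())
        self._sessions[session_id] = {**session, "id": session_id}
        return session_id

    def get(self, session_id: str) -> Optional[dict]:
        """Return the stored session, or None if the ID is unknown."""
        return self._sessions.get(session_id)

    def delete(self, session_id: str) -> bool:
        """Remove a session; the return value says whether it existed."""
        return self._sessions.pop(session_id, None) is not None
```

An HTTP layer would presumably map a `None` from `get` or a `False` from `delete` to a not-found response, though the spec does not define error responses.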

## processing pipeline

1. **ingest:** audio is received via REST upload or WebSocket stream.
2. **preprocessing:** audio is resampled to 16 kHz mono. silence segments are stripped via VAD (voice activity detection).
3. **diarization:** speaker diarization using `pyannote.audio` to split audio into per-speaker segments.
4. **inference:** each segment is passed to the Ethostral endpoint for:
   - automatic speech recognition (ASR).
   - emotion classification (categorical + dimensional: valence / arousal / dominance).
5. **post-processing:** results are merged, timestamps are aligned, and output is structured per-segment.
6. **storage:** sessions are persisted with a generated UUID.
7. **telemetry:** each pipeline run is traced via Weights & Biases Weave.
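
Steps 3 to 5 can be sketched as a small orchestration function. The stage signatures below are hypothetical and chosen for illustration; the real stages would wrap `pyannote.audio` and the Ethostral endpoint. Preprocessing, storage, and telemetry are left out.

```python
import uuid
from typing import Callable, Iterable, List, Tuple

# Hypothetical stage signatures, chosen for this sketch only.
Turn = Tuple[str, int, int]                   # (speaker, start_ms, end_ms)
DiarizeFn = Callable[[bytes], Iterable[Turn]]
InferFn = Callable[[bytes, int, int], dict]   # returns {"text": ..., "emotion": {...}}


def run_pipeline(audio: bytes, diarize: DiarizeFn, infer: InferFn) -> List[dict]:
    """Diarize, run inference per speaker turn, and merge into ordered segments."""
    segments = []
    for speaker, start_ms, end_ms in diarize(audio):
        result = infer(audio, start_ms, end_ms)
        segments.append({
            "id": str(uuid.uuid4()),
            "speaker": speaker,
            "start_ms": start_ms,
            "end_ms": end_ms,
            "text": result["text"],
            "emotion": result["emotion"],
        })
    # Post-processing: order segments on the shared timeline.
    segments.sort(key=lambda seg: (seg["start_ms"], seg["end_ms"]))
    return segments
```

Passing the stages in as callables keeps the sketch runnable with stubs and makes each stage replaceable in tests.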

## output schema

```typescript
type Segment = {
  id: string
  speaker: string           // "s0", "s1", ...
  start_ms: number
  end_ms: number
  text: string
  emotion: {
    label: string           // "Happy", "Neutral", "Anxious", etc.
    valence: number         // -1.0 to 1.0
    arousal: number         // -1.0 to 1.0
    dominance: number       // -1.0 to 1.0
    confidence: number      // 0.0 to 1.0
  }
}

type Session = {
  id: string
  filename: string
  language: string
  duration_ms: number
  created_at: string        // ISO 8601
  segments: Segment[]
}
```
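
On the Python side, the same shapes could be mirrored with range checks taken from the comments above. The dependency list names `pydantic` for schema validation; standard-library dataclasses are used here only so the sketch runs without dependencies. The `end_ms >= start_ms` check is an assumption.

```python
from dataclasses import asdict, dataclass, field
from typing import List


def _check_range(name: str, value: float, low: float, high: float) -> None:
    if not low <= value <= high:
        raise ValueError(f"{name}={value} outside [{low}, {high}]")


@dataclass
class Emotion:
    label: str
    valence: float
    arousal: float
    dominance: float
    confidence: float

    def __post_init__(self) -> None:
        for name in ("valence", "arousal", "dominance"):
            _check_range(name, getattr(self, name), -1.0, 1.0)
        _check_range("confidence", self.confidence, 0.0, 1.0)


@dataclass
class Segment:
    id: str
    speaker: str
    start_ms: int
    end_ms: int
    text: str
    emotion: Emotion

    def __post_init__(self) -> None:
        # Assumed invariant; the schema itself does not state it.
        if self.end_ms < self.start_ms:
            raise ValueError("end_ms precedes start_ms")


@dataclass
class Session:
    id: str
    filename: str
    language: str
    duration_ms: int
    created_at: str  # ISO 8601
    segments: List[Segment] = field(default_factory=list)
```

`asdict(session)` yields nested dictionaries that match the JSON layout of the TypeScript types.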

## dependencies
- **`fastapi`:** async HTTP and WebSocket server.
- **`pydantic`:** request/response schema validation.
- **`pyannote.audio`:** speaker diarization.
- **`transformers` + `peft`:** Ethostral model loading and adapter inference.
- **`torchaudio`:** audio preprocessing and resampling.
- **`wandb`:** Weights & Biases Weave integration for pipeline tracing.
- **`huggingface_hub`:** programmatic access to model weights and datasets.

## performance targets
- **transcription latency (batch):** < 2× real-time (e.g., a 60 s file processed in < 120 s).
- **streaming latency:** < 500ms from audio chunk to partial transcript event.
- **emotion classification latency:** < 100ms per segment (excluding ASR).
- **word error rate:** target < 10% on clean English audio.
- **emotion F1 score:** target > 0.70 on the IEMOCAP benchmark.
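
Two of these targets can be checked with a few lines of standard-library code. The sketch below computes the real-time factor (processing time divided by audio duration) and word error rate via word-level edit distance. Both function names are hypothetical, and the WER here does no text normalization (casing, punctuation), which a real evaluation would need to define.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time divided by audio duration; the batch target is below 2."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, a 60 s file processed in 90 s has a real-time factor of 1.5, which meets the batch target.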