## overview
- **model:** Ethostral – a fine-tuned Mistral Voxtral for joint ASR and emotion classification.
- **framework:** Python (FastAPI) for the API layer.
- **inference runtime:** Hugging Face `transformers` + `peft` for adapter-based fine-tuned inference.
- **real-time transport:** WebSockets for streaming audio and transcription events.
- **hosting:** Hugging Face Inference Endpoints for model serving.
## api endpoints
### `POST /transcribe`
- **purpose:** accepts an uploaded audio/video file, runs the Ethostral pipeline, and returns a structured transcript.
- **input:** multipart form-data with fields: `file`, `language` (optional, default: auto-detect), `diarize` (bool), `emotion` (bool).
- **output:** JSON object with diarized segments, each containing transcript text, speaker id, timestamps, and emotional metadata.
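As a sketch of how a client might call this endpoint with only the standard library, the helper below assembles a `multipart/form-data` body by hand. The endpoint URL in the usage comment is a placeholder; the field names follow the spec above.

```python
# Hedged sketch: calling POST /transcribe without external dependencies.
import json
import urllib.request
import uuid


def build_multipart(fields: dict, file_field: str, filename: str, data: bytes):
    """Assemble a multipart/form-data body and its Content-Type header."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'.encode()
        + data + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


# Usage against a hypothetical deployment:
# body, ctype = build_multipart({"diarize": "true", "emotion": "true"},
#                               "file", "meeting.wav", audio_bytes)
# req = urllib.request.Request("https://example.com/transcribe", data=body,
#                              headers={"Content-Type": ctype}, method="POST")
# segments = json.load(urllib.request.urlopen(req))["segments"]
```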
### `WS /transcribe/stream`
- **purpose:** accepts a live audio byte stream, emits partial transcription and emotion events in real time.
- **message format (server → client):**
```json
{
  "segment_id": "uuid",
  "speaker": "s0",
  "text": "Hello, I'm here.",
  "start_ms": 8100,
  "end_ms": 9040,
  "emotion": {
    "label": "Calm",
    "valence": 0.3,
    "arousal": -0.1,
    "dominance": 0.2,
    "confidence": 0.82
  }
}
```
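On the client side, a minimal event handler might look like the sketch below. It assumes (as is common for streaming ASR) that partial results reuse the same `segment_id`, so a later event replaces an earlier one; the function names are illustrative, not part of the API.

```python
# Hedged sketch of a client-side handler for server -> client stream events.
import json


def apply_event(transcript: dict, raw: str) -> dict:
    """Merge one streamed JSON event into a transcript keyed by segment_id."""
    event = json.loads(raw)
    transcript[event["segment_id"]] = event  # later partials overwrite earlier ones
    return transcript


def render(transcript: dict) -> str:
    """Order segments by start time and join their text into one string."""
    ordered = sorted(transcript.values(), key=lambda e: e["start_ms"])
    return " ".join(e["text"] for e in ordered)
```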
### `GET /sessions/{session_id}`
- **purpose:** retrieves a previously processed session by ID.
- **output:** full structured transcript with emotional metadata.
### `DELETE /sessions/{session_id}`
- **purpose:** deletes a stored session.
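The storage contract behind these two endpoints can be sketched as an in-memory store: `GET` returns the session or nothing, `DELETE` is idempotent. A real deployment would back this with a database; the class and method names here are illustrative.

```python
# Minimal sketch of the session store behind GET/DELETE /sessions/{session_id}.
import uuid


class SessionStore:
    def __init__(self):
        self._sessions: dict = {}

    def create(self, session: dict) -> str:
        """Persist a session under a generated UUID (as in pipeline step 6)."""
        session_id = str(uuid.uuid4())
        self._sessions[session_id] = {"id": session_id, **session}
        return session_id

    def get(self, session_id: str):
        """Return the stored session, or None if unknown."""
        return self._sessions.get(session_id)

    def delete(self, session_id: str) -> bool:
        """Delete a session; returns False if it was already gone."""
        return self._sessions.pop(session_id, None) is not None
```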
## processing pipeline
1. **ingest:** audio is received via REST upload or WebSocket stream.
2. **preprocessing:** audio is resampled to 16 kHz mono; silence is stripped via voice activity detection (VAD).
3. **diarization:** `pyannote.audio` splits the audio into per-speaker segments.
4. **inference:** each segment is passed to the Ethostral endpoint for:
   - automatic speech recognition (ASR).
   - emotion classification (categorical + dimensional: valence / arousal / dominance).
5. **post-processing:** results are merged, timestamps are aligned, and output is structured per segment.
6. **storage:** sessions are persisted with a generated UUID.
7. **telemetry:** each pipeline run is traced via Weights & Biases Weave.
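The merge step (5) can be sketched as a pure function: diarization spans are zipped with the per-segment ASR text and emotion scores from step 4 to produce the output schema. Inference itself is stubbed here; only the merge logic is shown, and the `seg-{i}` id format is an assumption.

```python
# Hedged sketch of pipeline step 5: merging per-segment results.
def merge_segments(diarized, asr_results, emotion_results):
    """Zip diarization spans with ASR text and emotion scores, per segment."""
    merged = []
    for i, (span, text, emo) in enumerate(
        zip(diarized, asr_results, emotion_results)
    ):
        merged.append({
            "id": f"seg-{i}",          # illustrative id scheme
            "speaker": span["speaker"],
            "start_ms": span["start_ms"],
            "end_ms": span["end_ms"],
            "text": text,
            "emotion": emo,
        })
    return merged
```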
## output schema
```typescript
type Segment = {
  id: string
  speaker: string       // "s0", "s1", ...
  start_ms: number
  end_ms: number
  text: string
  emotion: {
    label: string       // "Happy", "Neutral", "Anxious", etc.
    valence: number     // -1.0 to 1.0
    arousal: number     // -1.0 to 1.0
    dominance: number   // -1.0 to 1.0
    confidence: number  // 0.0 to 1.0
  }
}

type Session = {
  id: string
  filename: string
  language: string
  duration_ms: number
  created_at: string    // ISO 8601
  segments: Segment[]
}
```
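On the Python side, the same schema might be mirrored with dataclasses that enforce the documented value ranges. This is a validation sketch, not the service's actual models (the spec lists `pydantic` for that role).

```python
# Hedged Python mirror of the TypeScript schema, with range checks.
from dataclasses import dataclass


@dataclass
class Emotion:
    label: str
    valence: float     # -1.0 to 1.0
    arousal: float     # -1.0 to 1.0
    dominance: float   # -1.0 to 1.0
    confidence: float  # 0.0 to 1.0

    def __post_init__(self):
        for name in ("valence", "arousal", "dominance"):
            if not -1.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} out of range [-1.0, 1.0]")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence out of range [0.0, 1.0]")


@dataclass
class Segment:
    id: str
    speaker: str       # "s0", "s1", ...
    start_ms: int
    end_ms: int
    text: str
    emotion: Emotion
```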
## dependencies
- **`fastapi`** – async HTTP and WebSocket server.
- **`pydantic`** – request/response schema validation.
- **`pyannote.audio`** – speaker diarization.
- **`transformers` + `peft`** – Ethostral model loading and adapter inference.
- **`torchaudio`** – audio preprocessing and resampling.
- **`wandb`** – Weights & Biases Weave integration for pipeline tracing.
- **`huggingface_hub`** – programmatic access to model weights and datasets.
## performance targets
- **transcription latency (batch):** < 2× real-time (e.g., a 60 s file processed in < 120 s).
- **streaming latency:** < 500 ms from audio chunk to partial transcript event.
- **emotion classification latency:** < 100 ms per segment (excluding ASR).
- **word error rate:** target < 10% on clean English audio.
- **emotion F1 score:** target > 0.70 on the IEMOCAP benchmark.
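The batch target above is a real-time factor (RTF) bound: processing time divided by audio duration must stay below 2. A small helper makes the check explicit; the function names are illustrative.

```python
# Sketch: checking the batch transcription target as a real-time factor.
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration (lower is faster)."""
    return processing_s / audio_s


def meets_batch_target(processing_s: float, audio_s: float) -> bool:
    """True when RTF is under the documented < 2x real-time bound."""
    return real_time_factor(processing_s, audio_s) < 2.0
```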