emotional speech recognition and synthesis - product specification
objective
build a pipeline that ingests human speech, classifies emotion across multiple frameworks, and transcribes the text.
inputs and outputs
1. input: audio ingestion
- source: raw speech audio (16 kHz, mono) recorded via microphone.
- process: audio is streamed to the fine-tuned Mistral Voxtral endpoint for processing.
2. output: emotional transcription
- content: transcript interleaved with tags from multiple emotional frameworks.
- supported frameworks & reference datasets:
- Russell's circumplex model: valence and arousal coordinates (Ref: IEMOCAP, EMOVOME).
- PAD emotion space: pleasure, arousal, and dominance dimensions (Ref: IEMOCAP).
- Plutchik's wheel: categorical tags (Ref: RAVDESS, EmoDB).
- prosodic analysis: pitch, jitter, and speech rate metadata (Ref: RAVDESS, EmoDB).
- stress & deception markers: specialized detection (Ref: SUSAS, MDPE, CRISIS).
- example output:
[arousal: high, valence: negative] [stress: probable] [pitch: 210 Hz] i can't figure this out!
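a short sketch of how a downstream consumer might parse this interleaved tag format. the bracketed `[key: value]` grammar is inferred from the example above, not a normative definition, so the regex here is an assumption:

```python
import re

# hypothetical parser for the interleaved tag format shown above.
# assumes tags are [key: value] blocks (comma-separated pairs allowed)
# followed by the plain transcript text.
TAG_BLOCK = re.compile(r"\[([^\]]+)\]")

def parse_tagged_transcript(line: str) -> dict:
    """split one output line into its emotion tags and transcript text."""
    tags = {}
    for block in TAG_BLOCK.findall(line):
        for pair in block.split(","):
            key, _, value = pair.partition(":")
            tags[key.strip()] = value.strip()
    text = TAG_BLOCK.sub("", line).strip()
    return {"tags": tags, "text": text}

line = "[arousal: high, valence: negative] [stress: probable] [pitch: 210 Hz] i can't figure this out!"
print(parse_tagged_transcript(line))
# {'tags': {'arousal': 'high', 'valence': 'negative', 'stress': 'probable',
#           'pitch': '210 Hz'}, 'text': "i can't figure this out!"}
```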
input/output payload definitions
1. speech-to-text request
- endpoint: /v1/recognize
- payload: binary audio/wav
- constraints: max duration 30 seconds.
2. speech-to-text response
- format: json
- example body:
{ "text": "i can't figure this out.", "emotion": "frustrated", "confidence": 0.89 }
error handling
- unrecognized emotion: default to the "neutral" tag.
- api timeouts: retry once, then fall back to standard text output.
- audio format errors: reject the request with 400 Bad Request and specify the required format (16 kHz, mono).
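one way the retry-and-fallback policy could look on the client side, reusing the recognize() sketch above. transcribe_plain() is a hypothetical stand-in for a non-emotional fallback transcriber:

```python
import requests

def transcribe_plain(wav_path: str) -> str:
    """hypothetical fallback: a plain STT pass with no emotion tagging."""
    raise NotImplementedError("wire up a standard speech-to-text backend here")

def recognize_with_fallback(wav_path: str) -> dict:
    """apply the policy above: retry once on timeout, then fall back to plain text."""
    for _attempt in range(2):  # initial attempt + one retry
        try:
            result = recognize(wav_path)  # client sketch from the previous section
            # unrecognized emotion: default to the "neutral" tag.
            if result.get("emotion") is None:
                result["emotion"] = "neutral"
            return result
        except requests.Timeout:
            continue
    # both attempts timed out: fall back to standard (untagged) text output.
    return {"text": transcribe_plain(wav_path), "emotion": None}
```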