update specs: simplify data flow, expand emotional frameworks with reference datasets, and initialize frontend spec

- .context/frontend_spec.md +1 -0
- .context/product_spec.md +46 -0
- .context/technical_spec.md +8 -9
.context/frontend_spec.md
ADDED
@@ -0,0 +1 @@
+# front end specification
.context/product_spec.md
ADDED
@@ -0,0 +1,46 @@
+# emotional speech recognition and synthesis - product specification
+
+## objective
+build a pipeline that ingests human speech, classifies emotion across multiple frameworks, and transcribes text.
+
+## inputs and outputs
+
+### 1. input: audio ingestion
+- **source:** raw speech material (16khz, mono) recorded via microphone.
+- **process:** audio is streamed to the fine-tuned mistral voxtral endpoint for processing.
+
+### 2. output: emotional transcription
+- **content:** transcript interleaved with tags from multiple emotional frameworks.
+- **supported frameworks & reference datasets:**
+  - **russell’s circumplex model:** valence and arousal coordinates (ref: IEMOCAP, EMOVOME).
+  - **pad emotion space:** pleasure, arousal, and dominance dimensions (ref: IEMOCAP).
+  - **plutchik’s wheel:** categorical tags (ref: RAVDESS, EmoDB).
+  - **prosodic analysis:** pitch, jitter, and speech rate metadata (ref: RAVDESS, EmoDB).
+  - **stress & deception markers:** specialized detection (ref: SUSAS, MDPE, CRISIS).
+- **example output:** `[arousal: high, valence: negative] [stress: probable] [pitch: 210hz] i can't figure this out!`
+
+## input/output payload definitions
+
+### 1. speech-to-text request
+- **endpoint:** `/v1/recognize`
+- **payload:** binary audio/wav
+- **constraints:** max duration 30 seconds.
+
+### 2. speech-to-text response
+- **format:** json
+- **schema:**
+```json
+{
+  "text": "i can't figure this out.",
+  "emotion": "frustrated",
+  "confidence": 0.89
+}
+```
+
+## error handling
+- **unrecognized emotion:** default to a "neutral" tag.
+- **api timeouts:** retry once, then fall back to standard text output.
+- **audio format errors:** reject the request with 400 bad request and specify the required format (16khz, mono).
+
+## resources
+- [front end specification](./frontend_spec.md)
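The interleaved tag format in the product spec's example output can be consumed with a small parser. The sketch below is illustrative only (the function name and parsing rules are assumptions, not part of the spec): it peels leading `[key: value, ...]` groups off the line and returns them alongside the plain transcript text.

```python
import re

# matches one leading bracket group, e.g. "[arousal: high, valence: negative]"
TAG_RE = re.compile(r"^\s*\[([^\[\]]+)\]")

def parse_tagged_transcript(line):
    """split a spec-style line "[k: v] [k: v] text" into (tags, text).
    hypothetical helper; the tag grammar is inferred from the example output."""
    tags = {}
    rest = line
    while True:
        m = TAG_RE.match(rest)
        if not m:
            break
        # a single bracket group may hold comma-separated "key: value" pairs
        for pair in m.group(1).split(","):
            key, _, value = pair.partition(":")
            tags[key.strip()] = value.strip()
        rest = rest[m.end():]
    return tags, rest.strip()

tags, text = parse_tagged_transcript(
    "[arousal: high, valence: negative] [stress: probable] [pitch: 210hz] i can't figure this out!"
)
# tags -> {'arousal': 'high', 'valence': 'negative', 'stress': 'probable', 'pitch': '210hz'}
```

A parser like this keeps downstream consumers (the frontend, eval scripts) decoupled from the model's raw string output.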
.context/technical_spec.md
CHANGED
@@ -1,24 +1,23 @@
 ## core components
 - **base model:** mistral voxtral.
-- **synthesis engine:** elevenlabs v3 with emotional tag support.
 - **tracking and evaluation:** weights and biases.
 - **platform:** hugging face for model and adapter hosting.

 ## architecture
-1. **
-2. **
-3. **output format:** transcription output uses interleaved text and emotional metadata tags.
-4. **synthesis phase:** text and emotion tags are sent to elevenlabs v3 api to generate expressive high-fidelity audio.
+1. **process:** audio is streamed to the fine-tuned mistral voxtral endpoint for simultaneous automatic speech recognition and emotion classification.
+2. **output format:** transcription output uses interleaved text and emotional metadata tags.

 ## integration points
-- **
-- **weights and biases weave:** used for tracing the end-to-end pipeline from recognition to synthesis.
+- **weights and biases weave:** used for tracing the recognition pipeline.
 - **hugging face hub:** serves as the repository for fine-tuned weights and dataset storage.

 ## performance metrics
 - word error rate for transcription quality.
 - f1 score for emotion detection accuracy.
-- mean opinion score for synthesis naturalness and emotional alignment.

 ## benchmarking and evals
-
+- **IEMOCAP:** evaluation of categorical and dimensional (valence/arousal/dominance) accuracy.
+- **RAVDESS:** benchmarking of prosodic feature mapping and speech rate accuracy.
+- **SUSAS:** evaluation of stress detection reliability under varied acoustic conditions.
+- **MDPE:** assessment of deception-related emotional leakage detection.
+- **Weights & Biases Weave:** used for tracking eval traces and scoring pipeline performance.
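The word error rate metric named in the technical spec is conventionally computed as word-level edit distance over reference length. A minimal reference sketch (not the project's actual eval code, which presumably lives in the W&B eval traces):

```python
def word_error_rate(reference, hypothesis):
    """standard WER: (substitutions + deletions + insertions) / reference words,
    via word-level levenshtein distance. minimal sketch for illustration."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# one substituted word over a five-word reference -> 0.2
print(word_error_rate("i can't figure this out", "i cant figure this out"))
```

In practice a library such as `jiwer` would likely be used instead, but the definition above is what the metric row in the spec refers to.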