aytoasty committed · Commit 8319599 · 1 Parent(s): ebdda02

update specs: simplify data flow, expand emotional frameworks with reference datasets, and initialize frontend spec

.context/frontend_spec.md ADDED
@@ -0,0 +1 @@
+ # front end specification
.context/product_spec.md ADDED
@@ -0,0 +1,46 @@
+ # emotional speech recognition and synthesis - product specification
+
+ ## objective
+ build a pipeline that ingests human speech, classifies emotion across multiple frameworks, and transcribes the text.
+
+ ## inputs and outputs
+
+ ### 1. input: audio ingestion
+ - **source:** raw speech (16khz, mono) recorded via microphone.
+ - **process:** audio is streamed to the fine-tuned mistral voxtral endpoint for processing.
+
+ ### 2. output: emotional transcription
+ - **content:** transcript interleaved with tags from multiple emotional frameworks.
+ - **supported frameworks & reference datasets:**
+   - **russell’s circumplex model:** valence and arousal coordinates (Ref: IEMOCAP, EMOVOME).
+   - **pad emotion space:** pleasure, arousal, and dominance dimensions (Ref: IEMOCAP).
+   - **plutchik’s wheel:** categorical tags (Ref: RAVDESS, EmoDB).
+   - **prosodic analysis:** pitch, jitter, and speech-rate metadata (Ref: RAVDESS, EmoDB).
+   - **stress & deception markers:** specialized stress and deception detection (Ref: SUSAS, MDPE, CRISIS).
+ - **example output:** `[arousal: high, valence: negative] [stress: probable] [pitch: 210hz] i can't figure this out!`
+
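Purely as an illustration of how a consumer could split the interleaved tag format shown above (the `parse_tagged_transcript` helper and its regex are assumptions, not part of the spec):

```python
import re

def parse_tagged_transcript(line: str) -> tuple[dict, str]:
    """Split an interleaved transcript line into tag metadata and plain text.

    Hypothetical helper: assumes every tag group is a leading
    "[key: value, key: value]" block, as in the example output above.
    """
    tags = {}
    # collect key/value pairs from every [...] group
    for group in re.findall(r"\[([^\]]+)\]", line):
        for pair in group.split(","):
            key, _, value = pair.partition(":")
            tags[key.strip()] = value.strip()
    # the transcript text is whatever remains after stripping the tag groups
    text = re.sub(r"\[[^\]]+\]\s*", "", line).strip()
    return tags, text

tags, text = parse_tagged_transcript(
    "[arousal: high, valence: negative] [stress: probable] [pitch: 210hz] i can't figure this out!"
)
```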
+ ## input/output payload definitions
+
+ ### 1. speech-to-text request
+ - **endpoint:** `/v1/recognize`
+ - **payload:** binary audio/wav
+ - **constraints:** max duration 30 seconds.
+
+ ### 2. speech-to-text response
+ - **format:** json
+ - **schema:**
+ ```json
+ {
+   "text": "i can't figure this out.",
+   "emotion": "frustrated",
+   "confidence": 0.89
+ }
+ ```
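A minimal client-side sketch of the request/response contract above, using only the standard library. The `API_BASE` host is hypothetical (the spec only defines the `/v1/recognize` path), and `parse_recognize_response` is an illustrative validator, not a documented API:

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # hypothetical host; the spec only fixes the path

def build_recognize_request(wav_bytes: bytes) -> urllib.request.Request:
    """Build a POST to /v1/recognize carrying binary audio/wav, per the payload spec."""
    return urllib.request.Request(
        f"{API_BASE}/v1/recognize",
        data=wav_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )

def parse_recognize_response(body: str) -> dict:
    """Parse a response body and sanity-check it against the documented schema."""
    payload = json.loads(body)
    assert set(payload) >= {"text", "emotion", "confidence"}, "missing schema keys"
    assert 0.0 <= payload["confidence"] <= 1.0, "confidence out of range"
    return payload

sample = json.dumps(
    {"text": "i can't figure this out.", "emotion": "frustrated", "confidence": 0.89}
)
resp = parse_recognize_response(sample)
```

Sending the built request (e.g. via `urllib.request.urlopen`) is left out so the sketch stays runnable without a live endpoint.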
+
+ ## error handling
+ - **unrecognized emotion:** default to a "neutral" tag.
+ - **api timeouts:** retry once, then fall back to standard text output.
+ - **audio format errors:** reject the request with 400 bad request and state the required format (16khz, mono).
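The timeout policy above (retry once, then degrade to plain text) can be sketched as follows; `recognize` and `transcribe_plain` are hypothetical callables standing in for the emotion-aware endpoint and a standard ASR fallback:

```python
def recognize_with_fallback(audio: bytes, recognize, transcribe_plain) -> dict:
    """Call the emotion-aware endpoint, retrying once on timeout,
    then falling back to standard text output with a neutral tag."""
    for _attempt in range(2):  # initial call + one retry, per the spec
        try:
            return recognize(audio)
        except TimeoutError:
            continue
    # both attempts timed out: standard text output, neutral emotion
    return {"text": transcribe_plain(audio), "emotion": "neutral", "confidence": 0.0}
```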
+
+ ## resources
+ - [front end specification](./frontend_spec.md)
.context/technical_spec.md CHANGED
@@ -1,24 +1,23 @@
  ## core components
  - **base model:** mistral voxtral.
- - **synthesis engine:** elevenlabs v3 with emotional tag support.
  - **tracking and evaluation:** weights and biases.
  - **platform:** hugging face for model and adapter hosting.

  ## architecture
- 1. **ingestion:** raw audio signals are processed for input to the fine-tuned voxtral model.
- 2. **inference:** model performs simultaneous automatic speech recognition and emotion classification.
- 3. **output format:** transcription output uses interleaved text and emotional metadata tags.
- 4. **synthesis phase:** text and emotion tags are sent to elevenlabs v3 api to generate expressive high-fidelity audio.
+ 1. **process:** audio is streamed to the fine-tuned mistral voxtral endpoint for simultaneous automatic speech recognition and emotion classification.
+ 2. **output format:** transcription output uses interleaved text and emotional metadata tags.

  ## integration points
- - **elevenlabs v3 api:** handles the conversion of tagged text into emotional audio output.
- - **weights and biases weave:** used for tracing the end-to-end pipeline from recognition to synthesis.
+ - **weights and biases weave:** used for tracing the recognition pipeline.
  - **hugging face hub:** serves as the repository for fine-tuned weights and dataset storage.

  ## performance metrics
  - word error rate for transcription quality.
  - f1 score for emotion detection accuracy.
- - mean opinion score for synthesis naturalness and emotional alignment.

  ## benchmarking and evals
- @yongkang fill this up.
+ - **IEMOCAP:** evaluation of categorical and dimensional (valence/arousal/dominance) accuracy.
+ - **RAVDESS:** benchmarking of prosodic feature mapping and speech-rate accuracy.
+ - **SUSAS:** evaluation of stress detection reliability under varied acoustic conditions.
+ - **MDPE:** assessment of deception-related emotional leakage detection.
+ - **Weights & Biases Weave:** used for tracking eval traces and scoring pipeline performance.
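For concreteness, the word error rate named in the performance metrics can be computed with a standard token-level edit-distance sketch (assumption: whitespace tokenization; real evals often normalize punctuation and case first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance between token sequences,
    divided by the number of reference tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # one-row dynamic-programming edit distance (Levenshtein over tokens)
    dist = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dist[0] = dist[0], i
        for j, h in enumerate(hyp, 1):
            prev, dist[j] = dist[j], min(
                dist[j] + 1,        # deletion of reference token r
                dist[j - 1] + 1,    # insertion of hypothesis token h
                prev + (r != h),    # substitution (free if tokens match)
            )
    return dist[len(hyp)] / max(len(ref), 1)
```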