update specs: simplify data flow, expand emotional frameworks with reference datasets, and initialize frontend spec

- .context/frontend_spec.md +1 -0
- .context/product_spec.md +46 -0
- .context/technical_spec.md +8 -9
.context/frontend_spec.md
ADDED
@@ -0,0 +1 @@
+# front end specification
.context/product_spec.md
ADDED
@@ -0,0 +1,46 @@
+# emotional speech recognition and synthesis - product specification
+
+## objective
+build a pipeline that ingests human speech, classifies emotion across multiple frameworks, and transcribes text.
+
+## inputs and outputs
+
+### 1. input: audio ingestion
+- **source:** raw speech material (16khz, mono) recorded via microphone.
+- **process:** audio is streamed to the fine-tuned mistral voxtral endpoint for processing.
+
+### 2. output: emotional transcription
+- **content:** transcript interleaved with tags from multiple emotional frameworks.
+- **supported frameworks & reference datasets:**
+  - **russell’s circumplex model:** valence and arousal coordinates (ref: IEMOCAP, EMOVOME).
+  - **pad emotion space:** pleasure, arousal, and dominance dimensions (ref: IEMOCAP).
+  - **plutchik’s wheel:** categorical tags (ref: RAVDESS, EmoDB).
+  - **prosodic analysis:** pitch, jitter, and speech rate metadata (ref: RAVDESS, EmoDB).
+  - **stress & deception markers:** specialized detection (ref: SUSAS, MDPE, CRISIS).
+- **example output:** `[arousal: high, valence: negative] [stress: probable] [pitch: 210hz] i can't figure this out!`
+
+## input/output payload definitions
+
+### 1. speech-to-text request
+- **endpoint:** `/v1/recognize`
+- **payload:** binary audio/wav
+- **constraints:** max duration 30 seconds.
+
+### 2. speech-to-text response
+- **format:** json
+- **schema:**
+```json
+{
+  "text": "i can't figure this out.",
+  "emotion": "frustrated",
+  "confidence": 0.89
+}
+```
+
+## error handling
+- **unrecognized emotion:** default to a "neutral" tag.
+- **api timeouts:** retry once, then fall back to standard text output.
+- **audio format errors:** reject the request with 400 bad request and specify the required format (16khz, mono).
+
+## resources
+- [front end specification](./frontend_spec.md)
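The interleaved tag format in the product spec's example output can be consumed with a small parser. The sketch below is illustrative only (the function name and parsing rules are assumptions, not part of the spec): it peels leading `[key: value, ...]` groups off the line and returns them alongside the plain transcript text.

```python
import re

# matches one leading bracket group, e.g. "[arousal: high, valence: negative]"
TAG_RE = re.compile(r"^\s*\[([^\[\]]+)\]")

def parse_tagged_transcript(line):
    """split a spec-style line "[k: v] [k: v] text" into (tags, text).
    hypothetical helper; the tag grammar is inferred from the example output."""
    tags = {}
    rest = line
    while True:
        m = TAG_RE.match(rest)
        if not m:
            break
        # a single bracket group may hold comma-separated "key: value" pairs
        for pair in m.group(1).split(","):
            key, _, value = pair.partition(":")
            tags[key.strip()] = value.strip()
        rest = rest[m.end():]
    return tags, rest.strip()

tags, text = parse_tagged_transcript(
    "[arousal: high, valence: negative] [stress: probable] [pitch: 210hz] i can't figure this out!"
)
# tags -> {'arousal': 'high', 'valence': 'negative', 'stress': 'probable', 'pitch': '210hz'}
```

A parser like this keeps downstream consumers (the frontend, eval scripts) decoupled from the model's raw string output.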
.context/technical_spec.md
CHANGED
@@ -1,24 +1,23 @@
 ## core components
 - **base model:** mistral voxtral.
-- **synthesis engine:** elevenlabs v3 with emotional tag support.
 - **tracking and evaluation:** weights and biases.
 - **platform:** hugging face for model and adapter hosting.

 ## architecture
-1. **
-2. **
-3. **output format:** transcription output uses interleaved text and emotional metadata tags.
-4. **synthesis phase:** text and emotion tags are sent to elevenlabs v3 api to generate expressive high-fidelity audio.
+1. **process:** audio is streamed to the fine-tuned mistral voxtral endpoint for simultaneous automatic speech recognition and emotion classification.
+2. **output format:** transcription output uses interleaved text and emotional metadata tags.

 ## integration points
-- **
-- **weights and biases weave:** used for tracing the end-to-end pipeline from recognition to synthesis.
+- **weights and biases weave:** used for tracing the recognition pipeline.
 - **hugging face hub:** serves as the repository for fine-tuned weights and dataset storage.

 ## performance metrics
 - word error rate for transcription quality.
 - f1 score for emotion detection accuracy.
-- mean opinion score for synthesis naturalness and emotional alignment.

 ## benchmarking and evals
-
+- **IEMOCAP:** evaluation of categorical and dimensional (valence/arousal/dominance) accuracy.
+- **RAVDESS:** benchmarking of prosodic feature mapping and speech rate accuracy.
+- **SUSAS:** evaluation of stress detection reliability under varied acoustic conditions.
+- **MDPE:** assessment of deception-related emotional leakage detection.
+- **Weights & Biases Weave:** used for tracking eval traces and scoring pipeline performance.
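The word error rate metric named in the technical spec is conventionally computed as word-level edit distance over reference length. A minimal reference sketch (not the project's actual eval code, which presumably lives in the W&B eval traces):

```python
def word_error_rate(reference, hypothesis):
    """standard WER: (substitutions + deletions + insertions) / reference words,
    via word-level levenshtein distance. minimal sketch for illustration."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# one substituted word over a five-word reference -> 0.2
print(word_error_rate("i can't figure this out", "i cant figure this out"))
```

In practice a library such as `jiwer` would likely be used instead, but the definition above is what the metric row in the spec refers to.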