# emotional speech recognition and synthesis - product specification

## objective

build a pipeline that ingests human speech, classifies emotion across multiple frameworks, and transcribes text.

## inputs and outputs

### 1. input: audio ingestion

- **source:** raw speech audio (16 kHz, mono) recorded via microphone; a client-side format check is sketched below.
- **process:** audio is streamed to the fine-tuned Mistral Voxtral endpoint.
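
a minimal sketch of that client-side format check, assuming Python and using only the standard-library `wave` module (the 30-second cap comes from the payload constraints below; the function name is illustrative):

```python
import wave

def validate_wav(path: str) -> None:
    """Reject clips that don't match the pipeline's input constraints:
    16 kHz sample rate, mono, and at most 30 seconds long."""
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16_000:
            raise ValueError(f"expected 16 kHz, got {wav.getframerate()} Hz")
        if wav.getnchannels() != 1:
            raise ValueError(f"expected mono, got {wav.getnchannels()} channels")
        duration = wav.getnframes() / wav.getframerate()
        if duration > 30.0:
            raise ValueError(f"clip is {duration:.1f} s; the maximum is 30 s")
```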

### 2. output: emotional transcription

- **content:** transcript interleaved with tags from multiple emotional frameworks.
- **supported frameworks & reference datasets:**
  - **Russell's circumplex model:** valence and arousal coordinates (Ref: IEMOCAP, EMOVOME).
  - **PAD emotion space:** pleasure, arousal, and dominance dimensions (Ref: IEMOCAP).
  - **Plutchik's wheel:** categorical emotion tags (Ref: RAVDESS, EmoDB).
  - **prosodic analysis:** pitch, jitter, and speech-rate metadata (Ref: RAVDESS, EmoDB).
  - **stress & deception markers:** dedicated stress and deception detection (Ref: SUSAS, MDPE, CRISIS).
- **example output:** `[arousal: high, valence: negative] [stress: probable] [pitch: 210hz] i can't figure this out!`
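
the bracketed tags are plain `key: value` pairs, so a client can separate them from the transcript text with a short regex. this parser is an illustrative sketch, not part of the api:

```python
import re

TAG_RE = re.compile(r"\[([^\]]+)\]")

def parse_tagged_transcript(line: str) -> tuple[dict[str, str], str]:
    """Split a tagged transcript line into a tag dict and the plain text.
    Tag blocks look like [arousal: high, valence: negative]; keys and
    values are taken verbatim from the bracketed "key: value" pairs."""
    tags: dict[str, str] = {}
    for block in TAG_RE.findall(line):
        for pair in block.split(","):
            key, _, value = pair.partition(":")
            tags[key.strip()] = value.strip()
    text = TAG_RE.sub("", line).strip()
    return tags, text

tags, text = parse_tagged_transcript(
    "[arousal: high, valence: negative] [stress: probable] [pitch: 210hz] i can't figure this out!"
)
# tags == {'arousal': 'high', 'valence': 'negative', 'stress': 'probable', 'pitch': '210hz'}
# text == "i can't figure this out!"
```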

## input/output payload definitions

### 1. speech-to-text request

- **endpoint:** `/v1/recognize`
- **payload:** binary audio/wav
- **constraints:** max duration 30 seconds.
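
a hedged client sketch for this request using the `requests` library; the base URL is a placeholder, and sending the body with an `audio/wav` content type is an assumption that matches the payload definition above:

```python
import requests

BASE_URL = "https://api.example.com"  # placeholder; substitute your deployment's host

def recognize(wav_path: str) -> dict:
    """POST a binary WAV clip (16 kHz, mono, <= 30 s) to /v1/recognize
    and return the parsed JSON response body."""
    with open(wav_path, "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/v1/recognize",
            data=f.read(),
            headers={"Content-Type": "audio/wav"},
            timeout=30,  # HTTP timeout in seconds; illustrative value
        )
    resp.raise_for_status()  # surfaces the 400 described under error handling
    return resp.json()
```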

### 2. speech-to-text response

- **format:** JSON
- **schema:**

```json
{
  "text": "i can't figure this out.",
  "emotion": "frustrated",
  "confidence": 0.89
}
```
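
one way a client might map this schema onto a typed structure; the dataclass is illustrative, and the `"neutral"` default mirrors the unrecognized-emotion rule under error handling below:

```python
from dataclasses import dataclass

@dataclass
class RecognitionResult:
    text: str
    emotion: str
    confidence: float

    @classmethod
    def from_json(cls, body: dict) -> "RecognitionResult":
        # "neutral" mirrors the unrecognized-emotion default under error handling
        return cls(
            text=body["text"],
            emotion=body.get("emotion", "neutral"),
            confidence=float(body.get("confidence", 0.0)),
        )
```

a caller would then write `RecognitionResult.from_json(recognize(path))`, combining this with the request sketch above.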

## error handling

- **unrecognized emotion:** default to the `"neutral"` tag.
- **api timeouts:** retry once, then fall back to standard (untagged) text output; see the sketch after this list.
- **audio format errors:** reject the request with `400 Bad Request` and state the required format (16 kHz, mono).
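
the timeout policy could look like the following sketch. `recognize()` is the client sketch from the request section, and `fallback_stt` stands in for whatever standard speech-to-text path the deployment uses; this spec does not define that fallback endpoint:

```python
import requests

def recognize_with_retry(wav_path: str, fallback_stt) -> dict:
    """Retry the emotional endpoint once on timeout, then fall back
    to plain text output with no emotion tags."""
    for _attempt in range(2):  # initial call plus one retry
        try:
            return recognize(wav_path)  # client sketch from the request section
        except requests.Timeout:
            continue
    return {"text": fallback_stt(wav_path)}  # standard text output, untagged
```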

## resources

- [front end specification](./frontend_spec.md)