---
title: Delivery Analyzer
emoji: 🎀
colorFrom: indigo
colorTo: pink
sdk: gradio
python_version: "3.10"
sdk_version: "5.23.0"
app_file: app.py
pinned: false
---

# Space 2: Delivery Analyzer

*Part of Prea Callahan's AI + Research Level 2 portfolio. See the full [research journal](./research-journal.md) for context.*

## What this Space does

You upload a short speech clip (10 seconds to about 4 minutes) and this Space returns two things:

1. A transcript produced by Whisper-small, and
2. Four prosodic features computed from Whisper's word-level timestamps:
   - **Speaking rate** (words per minute over the whole clip)
   - **Pause count** (number of silences longer than 400 ms)
   - **Pause-duration variance** (how uneven those pauses are)
   - **Speaking-rate variance across thirds** (whether the speaker changes pace between the opening, middle, and closing of the speech)

Those four numbers are the "delivery" half of the two-factor scoring pipeline in Space 3.
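
As a rough illustration of how the four features can fall out of timestamps alone, here is a minimal sketch. It is not the code in `app.py`. It assumes each word is a dict with `start` and `end` in seconds, measures clip length from the first word's start to the last word's end, splits the clip into thirds by time rather than by word count, and uses population variance. The function name `delivery_features` and the dictionary keys are hypothetical.

```python
from statistics import pvariance

PAUSE_THRESHOLD_S = 0.4  # "silences longer than 400 ms"


def delivery_features(words):
    """Compute four delivery features from word-level timestamps.

    `words` is a list of dicts with "start" and "end" in seconds,
    in spoken order (an assumed format, not a documented one).
    """
    if len(words) < 2:
        raise ValueError("need at least two timestamped words")

    clip_start = words[0]["start"]
    clip_end = words[-1]["end"]
    duration_s = clip_end - clip_start
    if duration_s <= 0:
        raise ValueError("timestamps must span a positive duration")

    # Speaking rate: words per minute over the spoken span.
    speaking_rate_wpm = len(words) / (duration_s / 60.0)

    # A pause is a gap between one word's end and the next word's start.
    gaps = [nxt["start"] - cur["end"] for cur, nxt in zip(words, words[1:])]
    pauses = [g for g in gaps if g > PAUSE_THRESHOLD_S]
    # Variance needs at least two pauses; report 0.0 otherwise.
    pause_variance = pvariance(pauses) if len(pauses) >= 2 else 0.0

    # Assign each word to the opening, middle, or closing third by start time.
    third_s = duration_s / 3.0
    counts = [0, 0, 0]
    for w in words:
        idx = min(int((w["start"] - clip_start) / third_s), 2)
        counts[idx] += 1
    third_rates = [c / (third_s / 60.0) for c in counts]

    return {
        "speaking_rate_wpm": speaking_rate_wpm,
        "pause_count": len(pauses),
        "pause_duration_variance": pause_variance,  # seconds squared
        "rate_variance_across_thirds": pvariance(third_rates),  # wpm squared
    }
```

Splitting thirds by time means a long pause lowers the rate of the third it falls in. Splitting by word count would hide that effect, so the choice changes what the last feature measures.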

## The architecture, and why it looks this way

This is where my project made its biggest turn. Originally I planned to load an audio emotion-recognition model directly into this Space. That failed on free-tier CPU (see research-journal.md, Week 6, for the compute wall write-up). Around the same time my instructor shared Mistral's "Designing a speech-to-speech assistant" blog post, which describes their Voxtral pipeline: transcription with timestamps, then LLM reasoning over the transcript, cleanly separated. The architectural idea is the right one. The components they named aren't free.

So this Space is a **free-tools translation of the Mistral pattern**:

| Mistral pipeline     | My free-tools pipeline                                       |
|----------------------|--------------------------------------------------------------|
| Voxtral-small (STT)  | `openai/whisper-small` via Hugging Face Inference API        |
| Mistral LLM          | `HuggingFaceTB/SmolLM2-1.7B-Instruct` via Inference API (Space 3) |
| Mistral TTS          | Not needed for my project; I only need feedback, not a voice reply |

The Space itself holds no model weights. It boots in under five seconds on free-tier CPU and the heavy lifting happens on Hugging Face's servers. This is what "thin client over API" means in practice.
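
On the client side, most of the work is turning the API's JSON into a flat word list. The sketch below assumes the response has the shape the `transformers` speech-recognition pipeline uses when timestamps are requested: a dict with `text` and a `chunks` list, where each chunk has `text` and a `[start, end]` `timestamp` pair. Whether the hosted endpoint returns exactly this shape is an assumption, and the helper name `words_from_response` is hypothetical.

```python
def words_from_response(payload):
    """Flatten an ASR response into a list of timestamped words.

    Assumed input shape (not confirmed for the hosted API):
    {"text": "...", "chunks": [{"text": " hi", "timestamp": [0.0, 0.4]}, ...]}
    """
    words = []
    for chunk in payload.get("chunks", []):
        start, end = chunk["timestamp"]
        # Skip entries with a missing boundary rather than guess a time.
        if start is None or end is None:
            continue
        text = chunk["text"].strip()
        if text:
            words.append({"word": text, "start": float(start), "end": float(end)})
    return words
```

Everything downstream only needs this list, which is why the Space can stay a thin client with no audio processing of its own.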

## Running this Space

The Space needs a Hugging Face access token to call the Inference API. Add one as a Space secret:

1. Go to **Settings → Variables and secrets → New secret**
2. Name: `HF_TOKEN`
3. Value: a read-level token from your [Hugging Face settings page](https://huggingface.co/settings/tokens)

The free Inference API tier is rate-limited but more than sufficient for demo use. A typical 90-second clip round-trips in about 8 seconds.
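
Space secrets are exposed to the app as environment variables, so the token can be read at request time. The following is a minimal sketch of that wiring, not the code in `app.py`. The endpoint URL is an assumption and may differ from what Hugging Face currently serves. The function names `auth_headers` and `transcribe` are hypothetical, and the call does not show how word-level timestamps are requested.

```python
import os

# Assumed endpoint; confirm the URL against the current Hugging Face docs.
API_URL = "https://api-inference.huggingface.co/models/openai/whisper-small"


def auth_headers():
    """Build the Authorization header from the HF_TOKEN Space secret."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it as a Space secret.")
    return {"Authorization": f"Bearer {token}"}


def transcribe(audio_bytes, timeout_s=60):
    """POST raw audio bytes to the assumed endpoint and return parsed JSON."""
    import requests  # imported lazily so auth_headers() works without it

    response = requests.post(
        API_URL, headers=auth_headers(), data=audio_bytes, timeout=timeout_s
    )
    # Surface rate-limit and auth failures as exceptions instead of bad data.
    response.raise_for_status()
    return response.json()
```

Failing early on a missing token gives a clearer error in the Space logs than a 401 from the API.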

## Interpreting the features

The features are deliberately simple. I did not try to compute pitch or energy contours: both are noisy on phone recordings and neither is cleanly derivable from the Whisper API response. The four features I chose are all computable from word-level start/end timestamps alone, which makes the whole pipeline robust to bad mic conditions (and also makes it cheap).

Rough reference ranges from my own Week 8 data (n=5; these are **not** generalizable, they are just reference values for a single student team's recordings):

| Feature                                   | Low        | Mid        | High        |
|-------------------------------------------|------------|------------|-------------|
| Speaking rate (wpm)                       | ~150       | ~170       | ~200        |
| Pauses > 400 ms                           | 2          | 5–8        | 12+         |
| Pause-duration variance                   | 0.05       | 0.20       | 0.40+       |
| Speaking-rate variance (across thirds)    | 3          | 10–20      | 25+         |

Higher pause-duration variance and higher speaking-rate variance both track with my intuitive judgment that a speech "landed." See research-journal.md, Week 10, for the correlation analysis.

## Known limitations

- **ASR bias.** ASR systems, including Whisper, have documented performance disparities across speaker groups ([Koenecke et al. 2020](https://www.pnas.org/doi/10.1073/pnas.1915768117); [Li et al. 2024](https://aclanthology.org/2024.naacl-long.246/)). Two of the five student sources in my Week 8 data are non-native English speakers. I did not correct for this, and the prosodic features are only as reliable as the word-boundary timestamps Whisper returns.
- **Single-speaker assumption.** If there are two speakers on the clip, the word-level timestamps will span both and the features will be garbage. I did not add diarization.
- **Short clips.** Clips with fewer than 20 transcribed words get a warning instead of scores. Variance features are unstable below that threshold.
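
The short-clip rule can be written as a small guard in front of the feature code. This is a sketch under assumptions: the name `score_or_warn`, the return shape, and the warning text are hypothetical, and `compute_features` stands in for whatever function produces the four features.

```python
MIN_WORDS = 20  # below this, variance features are unstable


def score_or_warn(words, compute_features):
    """Return features for long-enough clips, or a warning for short ones."""
    if len(words) < MIN_WORDS:
        return {
            "warning": (
                f"Only {len(words)} words transcribed; "
                f"at least {MIN_WORDS} are needed for stable scores."
            ),
            "features": None,
        }
    return {"warning": None, "features": compute_features(words)}
```

Returning the warning in the same structure as the scores lets the interface branch on one field instead of catching an exception.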

## Files

- `app.py`: Gradio interface and feature extraction.
- `requirements.txt`: Dependencies (just `gradio` and `requests`; no model weights).

## Course

Built for AI + Research Level 2, Youth Horizons Learning, Spring 2026.