metadata
title: Delivery Analyzer
emoji: 🎀
colorFrom: indigo
colorTo: pink
sdk: gradio
python_version: '3.10'
sdk_version: 5.23.0
app_file: app.py
pinned: false

Space 2: Delivery Analyzer

Part of Prea Callahan's AI + Research Level 2 portfolio. See the full research journal for context.

What this Space does

You upload a short speech clip (10 seconds to about 4 minutes) and this Space returns two things:

  1. A transcript produced by Whisper-small, and
  2. Four prosodic features computed from Whisper's word-level timestamps:
    • Speaking rate (words per minute over the whole clip)
    • Pause count (number of silences longer than 400 ms)
    • Pause-duration variance (how uneven those pauses are)
    • Speaking-rate variance across thirds (whether the speaker changes pace between the opening, middle, and closing of the speech)

Those four numbers are the "delivery" half of the two-factor scoring pipeline in Space 3.
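As a rough illustration of how the four features could be derived from word timestamps, here is a minimal sketch. It is not the code in app.py. It assumes that a pause is the gap between one word's end and the next word's start, that clip length runs from the first word's start to the last word's end, that variance means population variance, and that "thirds" are three equal time spans with each word assigned by its start time. The names `Word` and `delivery_features` are hypothetical.

```python
from dataclasses import dataclass
from statistics import pvariance
from typing import Dict, List

PAUSE_THRESHOLD_S = 0.4  # 400 ms, as stated in the feature list


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the clip
    end: float    # seconds from the start of the clip


def delivery_features(words: List[Word]) -> Dict[str, float]:
    """Compute four delivery features from time-ordered word timestamps."""
    if len(words) < 2:
        raise ValueError("need at least two timestamped words")

    clip_start, clip_end = words[0].start, words[-1].end
    duration_s = clip_end - clip_start
    if duration_s <= 0:
        raise ValueError("timestamps must span a positive duration")

    # Speaking rate: words per minute over the whole clip.
    wpm = len(words) / (duration_s / 60.0)

    # Pauses: gaps between consecutive words longer than the threshold.
    gaps = [nxt.start - cur.end for cur, nxt in zip(words, words[1:])]
    pauses = [g for g in gaps if g > PAUSE_THRESHOLD_S]
    # Assumption: population variance in seconds squared; 0.0 if no pauses.
    pause_var = pvariance(pauses) if pauses else 0.0

    # Speaking rate per third: count the words that start in each third.
    third_s = duration_s / 3.0
    counts = [0, 0, 0]
    for w in words:
        idx = min(int((w.start - clip_start) / third_s), 2)
        counts[idx] += 1
    rates = [c / (third_s / 60.0) for c in counts]

    return {
        "speaking_rate_wpm": wpm,
        "pause_count": len(pauses),
        "pause_duration_variance": pause_var,
        "rate_variance_across_thirds": pvariance(rates),
    }
```

Under these assumptions, a different choice (sample variance, or thirds split by word count rather than by time) would shift the variance values, so any reference numbers are only comparable when the definitions match.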

The architecture, and why it looks this way

This is where my project made its biggest turn. Originally I planned to load an audio emotion-recognition model directly into this Space. That failed on free-tier CPU (see research-journal.md, Week 6, for the compute wall write-up). Around the same time my instructor shared Mistral's "Designing a speech-to-speech assistant" blog post, which describes their Voxtral pipeline: transcription with timestamps, then LLM reasoning over the transcript, cleanly separated. The architectural idea is the right one. The components they named aren't free.

So this Space is a free-tools translation of the Mistral pattern:

| Mistral pipeline | My free-tools pipeline |
| --- | --- |
| Voxtral-small (STT) | openai/whisper-small via Hugging Face Inference API |
| Mistral LLM | HuggingFaceTB/SmolLM2-1.7B-Instruct via Inference API (Space 3) |
| Mistral TTS | Not needed for my project; I only need feedback, not a voice reply |

The Space itself holds no model weights. It boots in under five seconds on free-tier CPU and the heavy lifting happens on Hugging Face's servers. This is what "thin client over API" means in practice.
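To make the thin-client idea concrete, here is a hedged sketch of what the API round trip might look like. The endpoint URL pattern and the response shape (a `chunks` list whose items carry `text` and a `[start, end]` `timestamp`) are assumptions and should be checked against the current Inference API documentation. How app.py actually requests word-level timestamps is not shown here. The helper names are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: serverless Inference API URL pattern; verify against current docs.
API_URL = "https://api-inference.huggingface.co/models/openai/whisper-small"


def build_headers(token: str, content_type: str = "audio/wav") -> Dict[str, str]:
    """Bearer-token headers for a raw audio upload."""
    return {"Authorization": f"Bearer {token}", "Content-Type": content_type}


def parse_words(payload: dict) -> List[Tuple[str, float, float]]:
    """Extract (word, start_s, end_s) tuples from a transcription response.

    Assumed shape: {"text": "...", "chunks": [{"text": " hi", "timestamp": [0.0, 0.4]}]}
    Chunks with a missing start or end time are skipped.
    """
    words = []
    for chunk in payload.get("chunks", []):
        start, end = chunk["timestamp"]
        if start is None or end is None:
            continue
        words.append((chunk["text"].strip(), float(start), float(end)))
    return words


def transcribe(audio_bytes: bytes, token: str, timeout_s: float = 60.0) -> dict:
    """Send the audio to the hosted model and return the decoded JSON."""
    import requests  # imported lazily so the helpers above work without it

    resp = requests.post(
        API_URL,
        headers=build_headers(token),
        data=audio_bytes,
        timeout=timeout_s,
    )
    resp.raise_for_status()  # surfaces rate-limit and auth errors as exceptions
    return resp.json()
```

Keeping the parsing step separate from the network call means the feature extraction can be exercised with a saved response, without spending any of the rate-limited quota.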

Running this Space

The Space needs a Hugging Face access token to call the Inference API. Add one as a Space secret:

  1. Go to Settings → Variables and secrets → New secret
  2. Name: HF_TOKEN
  3. Value: a read-level token from your Hugging Face settings page
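Space secrets are exposed to the running app as environment variables, so the token can be read at startup. The sketch below shows one way to do that with a clear error when the secret is missing. The function name `get_hf_token` and the error wording are hypothetical, not taken from app.py.

```python
import os
from typing import Mapping, Optional


def get_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the HF_TOKEN secret, or raise with a hint on how to set it."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Add it as a Space secret under "
            "Settings > Variables and secrets."
        )
    return token
```

Failing early with a readable message is friendlier than letting the first API call come back with an authorization error.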

The free Inference API tier is rate-limited but more than sufficient for demo use. A typical 90-second clip round-trips in about 8 seconds.

Interpreting the features

The features are deliberately simple. I did not try to compute pitch or energy contours: both are noisy on phone recordings, and neither is cleanly derivable from the Whisper API response. The four features I chose are all computable from word-level start/end timestamps alone, which makes the whole pipeline robust to bad mic conditions (and also makes it cheap).

Rough reference ranges from my own Week 8 data (n=5; these are not generalizable, only reference values for a single student team's recordings):

| Feature | Low | Mid | High |
| --- | --- | --- | --- |
| Speaking rate (wpm) | ~150 | ~170 | ~200 |
| Pauses > 400 ms | 2 | 5–8 | 12+ |
| Pause-duration variance | 0.05 | 0.20 | 0.40+ |
| Speaking-rate variance (across thirds) | 3 | 10–20 | 25+ |

Higher pause-duration variance and higher speaking-rate variance both track with my intuitive judgment that a speech "landed." See research-journal.md, Week 10, for the correlation analysis.

Known limitations

  • ASR bias. Automatic speech recognition systems, Whisper included, have documented performance disparities across speaker groups (Koenecke et al. 2020; Li et al. 2024). Two of the five student sources in my Week 8 data are non-native English speakers. I did not correct for this, and the prosodic features are only as reliable as the word-boundary timestamps Whisper returns.
  • Single-speaker assumption. If there are two speakers on the clip, the word-level timestamps will span both and the features will be garbage. I did not add diarization.
  • Short clips. Clips with fewer than 20 transcribed words get a warning instead of scores. Variance features are unstable below that threshold.
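The short-clip rule in the last item can be sketched as a small guard that runs before any scoring. The 20-word threshold comes from the text above. The function name and the warning wording are hypothetical.

```python
from typing import List, Optional

MIN_WORDS = 20  # threshold stated in the limitation above


def short_clip_warning(
    transcribed_words: List[str], min_words: int = MIN_WORDS
) -> Optional[str]:
    """Return a warning string for clips that are too short, else None."""
    n = len(transcribed_words)
    if n < min_words:
        return (
            f"Only {n} words were transcribed; at least {min_words} are "
            "needed for stable variance features."
        )
    return None
```

A caller would show the warning in place of the scores when the return value is not `None`, and compute the features otherwise.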

Files

  • app.py: Gradio interface and feature extraction.
  • requirements.txt: Dependencies (just gradio and requests; no model weights).

Course

Built for AI + Research Level 2, Youth Horizons Learning, Spring 2026.