---
title: "Voice Agent: Speech → Intent → Tools"
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
license: apache-2.0
tags:
  - speech-recognition
  - whisper
  - intent-detection
  - ai-agent
  - gradio
---

# 🎤 Voice Agent

Speak or upload audio → transcript via Whisper → zero-shot intent → tool execution.
Live demo: Open the app ↗

*(UI screenshot)*

## 🔍 Abstract

Voice Agent turns short speech snippets into actions. It:

  1. transcribes audio with Whisper,
  2. infers the intent from the text (zero-shot),
  3. optionally executes a tool (e.g., “turn_on_lights”, “set_timer”).

This showcases an AI agent loop: Perceive → Understand → Act.
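The Perceive → Understand → Act loop can be sketched as three injectable stages. This is a hypothetical sketch, not the actual `app.py` wiring; the function and parameter names (`agent_loop`, `perceive`, `understand`, `act`) are illustrative:

```python
from typing import Any, Callable, Dict

def agent_loop(
    perceive: Callable[[Any], str],        # audio -> transcript
    understand: Callable[[str], str],      # transcript -> intent label
    act: Callable[[str], Dict[str, Any]],  # intent -> execution result
    audio: Any,
) -> Dict[str, Any]:
    """Run one Perceive -> Understand -> Act cycle and return a combined log."""
    transcript = perceive(audio)
    intent = understand(transcript)
    result = act(intent)
    return {"transcript": transcript, "intent": intent, "result": result}

# Stub stages for demonstration; the real stages would call Whisper and a
# zero-shot classifier instead of these lambdas.
log = agent_loop(
    perceive=lambda _: "turn the lights on please",
    understand=lambda text: "turn_on_lights" if "lights" in text else "unknown",
    act=lambda intent: {"executed": intent, "status": "ok"},
    audio=None,
)
```

Keeping the stages injectable makes each one swappable and unit-testable in isolation.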


## 🧱 Pipeline

*(Pipeline diagram)*

  1. Audio capture (Gradio mic/upload, 16 kHz)
  2. ASR: openai/whisper-small
  3. Intent detection: zero-shot text classification over a user-editable list of intents
    (e.g., turn_on_lights, start_music, set_timer, create_note, open_calendar)
  4. Tool layer (mock functions in this Space) → returns a JSON “execution log”.
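The tool layer (step 4) can be sketched as a dispatch table of mock functions that returns a JSON execution log. A minimal sketch, assuming intent labels matching the list above; the tool outputs here are invented for illustration:

```python
import json
from typing import Any, Callable, Dict

# Mock tools keyed by intent label (hypothetical payloads; the Space's real
# mocks may return different fields).
TOOLS: Dict[str, Callable[[], Dict[str, Any]]] = {
    "turn_on_lights": lambda: {"device": "lights", "state": "on"},
    "set_timer":      lambda: {"timer": "started"},
    "open_calendar":  lambda: {"app": "calendar", "opened": True},
}

def execute(intent: str) -> str:
    """Dispatch an intent to its mock tool and return a JSON execution log."""
    tool = TOOLS.get(intent)
    if tool is None:
        return json.dumps({"intent": intent, "status": "no_tool"})
    return json.dumps({"intent": intent, "status": "ok", "output": tool()})
```

Returning JSON keeps the log easy to display verbatim in the Gradio UI.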

## 🧪 Try it

  1. Click Record and say something like:
    • “turn the lights on please”
    • “open my calendar next Tuesday”
    • “set a timer for five minutes”
  2. Or upload a short .wav/.mp3.
  3. See Top-k intents, Chosen intent, and Action result.

## 🧩 Models & Libraries

  • ASR: openai/whisper-small
  • Zero-shot intent: transformers pipeline (facebook/bart-large-mnli by default)
  • UI: Gradio on Hugging Face Spaces
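Wiring the two models together with `transformers` pipelines looks roughly like the sketch below. Model names come from this README; `build_pipelines` and `top_k_intents` are hypothetical helper names, and loading is deferred because the weights total roughly 2 GB:

```python
from typing import Dict, List, Tuple

def top_k_intents(zs_output: Dict, k: int = 3) -> List[Tuple[str, float]]:
    """Pair labels with scores from a zero-shot-classification result.
    The pipeline returns labels already sorted by score, highest first."""
    return list(zip(zs_output["labels"], zs_output["scores"]))[:k]

def build_pipelines():
    """Lazily construct the ASR and zero-shot pipelines so importing this
    module stays cheap; both tasks are standard transformers pipeline tasks."""
    from transformers import pipeline
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
    zs = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    return asr, zs

# Usage (commented out to avoid downloading the model weights here):
# asr, zs = build_pipelines()
# text = asr("clip.wav")["text"]
# result = zs(text, candidate_labels=["turn_on_lights", "set_timer", "open_calendar"])
# print(top_k_intents(result))
```

The "Top-k intents" shown in the UI correspond to the label/score pairs that `top_k_intents` extracts from the classifier output.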

## ⚙️ Requirements

This Space installs its dependencies from `requirements.txt`:

```
transformers>=4.41.0
torch
torchaudio
gradio>=4.0.0
librosa
soundfile
```