---
title: Voice Agent – Speech → Intent → Tools
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
license: apache-2.0
tags:
  - speech-recognition
  - whisper
  - intent-detection
  - ai-agent
  - gradio
---

# 🎤 Voice Agent

**Speak or upload audio → transcript via Whisper → zero-shot intent → tool execution.**

Live demo: **[Open the app ↗](https://huggingface.co/spaces/hudaakram/Voice_Agent)**

![UI](assets/ui.png)

---

## 🔍 Abstract

Voice Agent turns short speech snippets into **actions**. It:

1) transcribes audio with Whisper,
2) infers the **intent** from the text (zero-shot),
3) optionally **executes a tool** (e.g., `turn_on_lights`, `set_timer`).

This showcases an **AI agent** loop: *Perceive → Understand → Act*.

---

## 🧱 Pipeline

![diagram](assets/diagram.png)

1. **Audio capture** (Gradio mic/upload, 16 kHz)
2. **ASR**: `openai/whisper-small`
3. **Intent detection**: zero-shot text classification over a user-editable list of intents
   *(e.g., `turn_on_lights, start_music, set_timer, create_note, open_calendar`)*
4. **Tool layer** (mock functions in this Space) → returns a JSON "execution log".

---

## 🧪 Try it

1) Click **Record** and say something like:
   - "turn the lights on please"
   - "open my calendar next Tuesday"
   - "set a timer for five minutes"
2) Or upload a short `.wav`/`.mp3`.
3) See **Top-k intents**, **Chosen intent**, and **Action result**.

---

## 🧩 Models & Libraries

- ASR: [`openai/whisper-small`](https://huggingface.co/openai/whisper-small)
- Zero-shot intent: `transformers` pipeline (`facebook/bart-large-mnli` by default)
- UI: [Gradio](https://www.gradio.app/) on Hugging Face Spaces

---

## ⚙️ Requirements

This Space uses `requirements.txt`:

```txt
transformers>=4.41.0
torch
torchaudio
gradio>=4.0.0
librosa
soundfile
```
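---

## 🧰 Tool layer sketch

The intent → tool step described in the Pipeline section can be sketched as below. This is a minimal illustrative mock, not the Space's actual `app.py`: the function names (`turn_on_lights`, `set_timer`, `execute_intent`) mirror the example intents above but are assumptions, and the real tool bodies may differ.

```python
import json

# Mock tools: each intent maps to a function returning a result dict.
# Bodies are illustrative stubs, matching the example intents above.
def turn_on_lights() -> dict:
    return {"device": "lights", "state": "on"}

def set_timer(minutes: int = 5) -> dict:
    return {"timer_minutes": minutes}

TOOLS = {
    "turn_on_lights": turn_on_lights,
    "set_timer": set_timer,
}

def execute_intent(intent: str, **kwargs) -> str:
    """Run the tool registered for `intent` (if any) and
    return a JSON "execution log" string, as the UI displays."""
    tool = TOOLS.get(intent)
    if tool is None:
        log = {"intent": intent, "status": "no_tool"}
    else:
        log = {"intent": intent, "status": "ok", "result": tool(**kwargs)}
    return json.dumps(log)
```

For example, `execute_intent("set_timer", minutes=5)` produces `{"intent": "set_timer", "status": "ok", "result": {"timer_minutes": 5}}`, while an intent with no registered tool yields a `"no_tool"` log instead of raising.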