---
title: Voice Agent Speech Intent Tools
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
license: apache-2.0
tags:
- speech-recognition
- whisper
- intent-detection
- ai-agent
- gradio
---
# 🎤 Voice Agent
**Speak or upload audio → transcript via Whisper → zero-shot intent → tool execution.**
Live demo: **[Open the app ↗](https://huggingface.co/spaces/hudaakram/Voice_Agent)**
![UI](assets/ui.png)
---
## 🔍 Abstract
Voice Agent turns short speech snippets into **actions**. It:
1) transcribes audio with Whisper,
2) infers the **intent** from the text (zero-shot),
3) optionally **executes a tool** (e.g., “turn_on_lights”, “set_timer”).
This showcases an **AI agent** loop: *Perceive → Understand → Act*.
---
## 🧱 Pipeline
![diagram](assets/diagram.png)
1. **Audio capture** (Gradio mic/upload, 16 kHz)
2. **ASR**: `openai/whisper-small`
3. **Intent detection**: zero-shot text classification over a user-editable list of intents
*(e.g., `turn_on_lights, start_music, set_timer, create_note, open_calendar`)*
4. **Tool layer** (mock functions in this Space) → returns a JSON “execution log”.
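A minimal sketch of this loop using the `transformers` pipelines named above; the intent list and `run_tool` are illustrative stand-ins for the mock tool layer in `app.py`, not the Space's actual code:
```python
import json
from transformers import pipeline

# 2. ASR: transcribe audio with Whisper
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# 3. Zero-shot intent detection over a user-editable intent list
intent_clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
INTENTS = ["turn_on_lights", "start_music", "set_timer", "create_note", "open_calendar"]

# 4. Mock tool layer (placeholder; the Space defines its own mock functions)
def run_tool(intent: str) -> dict:
    return {"tool": intent, "status": "ok"}

def voice_agent(audio_path: str) -> str:
    text = asr(audio_path)["text"]
    scores = intent_clf(text, candidate_labels=INTENTS)
    chosen = scores["labels"][0]
    log = {
        "transcript": text,
        "top_k_intents": list(zip(scores["labels"][:3], [round(s, 3) for s in scores["scores"][:3]])),
        "chosen_intent": chosen,
        "action_result": run_tool(chosen),
    }
    return json.dumps(log, indent=2)

print(voice_agent("sample.wav"))
```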
---
## 🧪 Try it
1) Click **Record** and say something like:
- “turn the lights on please”
- “open my calendar next Tuesday”
- “set a timer for five minutes”
2) Or upload a short `.wav/.mp3`.
3) See **Top-k intents**, **Chosen intent**, and **Action result**.
---
## 🧩 Models & Libraries
- ASR: [`openai/whisper-small`](https://huggingface.co/openai/whisper-small)
- Zero-shot intent: `transformers` pipeline (`facebook/bart-large-mnli` by default)
- UI: [Gradio](https://www.gradio.app/) on Hugging Face Spaces
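The UI wiring in `app.py` is not reproduced here; a minimal Gradio 4 sketch of how mic/upload audio could feed such a function (the `voice_agent` body is a placeholder):
```python
import gradio as gr

def voice_agent(audio_path: str) -> dict:
    # Placeholder: call the ASR + intent + tool pipeline sketched above.
    return {"transcript": "...", "chosen_intent": "...", "action_result": {}}

demo = gr.Interface(
    fn=voice_agent,
    inputs=gr.Audio(sources=["microphone", "upload"], type="filepath", label="Speak or upload audio"),
    outputs=gr.JSON(label="Execution log"),
    title="Voice Agent",
)

if __name__ == "__main__":
    demo.launch()
```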
---
## ⚙️ Requirements
This Space uses `requirements.txt`:
```txt
transformers>=4.41.0
torch
torchaudio
gradio>=4.0.0
librosa
soundfile
```
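`librosa` and `soundfile` handle decoding and resampling uploaded audio; a small sketch of that step (assuming a local file path, not the Space's actual loading code):
```python
import librosa
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Decode a .wav/.mp3 file and resample to the 16 kHz rate Whisper expects.
waveform, sr = librosa.load("clip.mp3", sr=16000)

# The ASR pipeline also accepts a raw array plus its sampling rate.
result = asr({"raw": waveform, "sampling_rate": sr})
print(result["text"])
```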