---
title: Voice Agent – Speech → Intent → Tools
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
license: apache-2.0
tags:
  - speech-recognition
  - whisper
  - intent-detection
  - ai-agent
  - gradio
---

# 🎤 Voice Agent

**Speak or upload audio → transcript via Whisper → zero-shot intent → tool execution.**

Live demo: **[Open the app ↗](https://huggingface.co/spaces/hudaakram/Voice_Agent)**

![UI](assets/ui.png)

---

## 🔍 Abstract

Voice Agent turns short speech snippets into **actions**. It:

1) transcribes audio with Whisper,
2) infers the **intent** from the text (zero-shot),
3) optionally **executes a tool** (e.g., `turn_on_lights`, `set_timer`).

This showcases an **AI agent** loop: *Perceive → Understand → Act*.

---

## 🧱 Pipeline

![diagram](assets/diagram.png)

1. **Audio capture** (Gradio mic/upload, 16 kHz)
2. **ASR**: `openai/whisper-small`
3. **Intent detection**: zero-shot text classification over a user-editable list of intents
   *(e.g., `turn_on_lights, start_music, set_timer, create_note, open_calendar`)*
4. **Tool layer** (mock functions in this Space) → returns a JSON "execution log".

---

## 🧪 Try it

1) Click **Record** and say something like:
   - "turn the lights on please"
   - "open my calendar next Tuesday"
   - "set a timer for five minutes"
2) Or upload a short `.wav`/`.mp3`.
3) See **Top-k intents**, **Chosen intent**, and **Action result**.

---

## 🧩 Models & Libraries

- ASR: [`openai/whisper-small`](https://huggingface.co/openai/whisper-small)
- Zero-shot intent: `transformers` pipeline (`facebook/bart-large-mnli` by default)
- UI: [Gradio](https://www.gradio.app/) on Hugging Face Spaces

---

## ⚙️ Requirements

This Space uses `requirements.txt`:

```txt
transformers>=4.41.0
torch
torchaudio
gradio>=4.0.0
librosa
soundfile
```
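---

## 🧰 Tool layer sketch

The intent → tool step described in the Pipeline section can be sketched as below. This is a minimal illustrative mock, not the Space's actual `app.py`: the function names (`turn_on_lights`, `set_timer`, `execute_intent`) mirror the example intents above but are assumptions, and the real tool bodies may differ.

```python
import json

# Mock tools: each intent maps to a function returning a result dict.
# Bodies are illustrative stubs, matching the example intents above.
def turn_on_lights() -> dict:
    return {"device": "lights", "state": "on"}

def set_timer(minutes: int = 5) -> dict:
    return {"timer_minutes": minutes}

TOOLS = {
    "turn_on_lights": turn_on_lights,
    "set_timer": set_timer,
}

def execute_intent(intent: str, **kwargs) -> str:
    """Run the tool registered for `intent` (if any) and
    return a JSON "execution log" string, as the UI displays."""
    tool = TOOLS.get(intent)
    if tool is None:
        log = {"intent": intent, "status": "no_tool"}
    else:
        log = {"intent": intent, "status": "ok", "result": tool(**kwargs)}
    return json.dumps(log)
```

For example, `execute_intent("set_timer", minutes=5)` produces `{"intent": "set_timer", "status": "ok", "result": {"timer_minutes": 5}}`, while an intent with no registered tool yields a `"no_tool"` log instead of raising.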