---
title: Voice Agent Speech Intent Tools
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
license: apache-2.0
tags:
- speech-recognition
- whisper
- intent-detection
- ai-agent
- gradio
---
# 🎤 Voice Agent
**Speak or upload audio → transcript via Whisper → zero-shot intent → tool execution.**
Live demo: **[Open the app ↗](https://huggingface.co/spaces/hudaakram/Voice_Agent)**
![UI](assets/ui.png)
---
## 🔍 Abstract
Voice Agent turns short speech snippets into **actions**. It:
1) transcribes audio with Whisper,
2) infers the **intent** from the text (zero-shot),
3) optionally **executes a tool** (e.g., “turn_on_lights”, “set_timer”).
This showcases an **AI agent** loop: *Perceive → Understand → Act*.
---
## 🧱 Pipeline
![diagram](assets/diagram.png)
1. **Audio capture** (Gradio mic/upload, 16 kHz)
2. **ASR**: `openai/whisper-small`
3. **Intent detection**: zero-shot text classification over a user-editable list of intents
*(e.g., `turn_on_lights, start_music, set_timer, create_note, open_calendar`)*
4. **Tool layer** (mock functions in this Space) → returns a JSON “execution log”.
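A minimal sketch of this loop using the `transformers` pipelines named above; the intent list and `run_tool` are illustrative stand-ins for the mock tool layer in `app.py`, not the Space's actual code:
```python
import json
from transformers import pipeline

# 2. ASR: transcribe audio with Whisper
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# 3. Zero-shot intent detection over a user-editable intent list
intent_clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
INTENTS = ["turn_on_lights", "start_music", "set_timer", "create_note", "open_calendar"]

# 4. Mock tool layer (placeholder; the Space defines its own mock functions)
def run_tool(intent: str) -> dict:
    return {"tool": intent, "status": "ok"}

def voice_agent(audio_path: str) -> str:
    text = asr(audio_path)["text"]
    scores = intent_clf(text, candidate_labels=INTENTS)
    chosen = scores["labels"][0]
    log = {
        "transcript": text,
        "top_k_intents": list(zip(scores["labels"][:3], [round(s, 3) for s in scores["scores"][:3]])),
        "chosen_intent": chosen,
        "action_result": run_tool(chosen),
    }
    return json.dumps(log, indent=2)

print(voice_agent("sample.wav"))
```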
---
## 🧪 Try it
1) Click **Record** and say something like:
- “turn the lights on please”
- “open my calendar next Tuesday”
- “set a timer for five minutes”
2) Or upload a short `.wav/.mp3`.
3) See **Top-k intents**, **Chosen intent**, and **Action result**.
---
## 🧩 Models & Libraries
- ASR: [`openai/whisper-small`](https://huggingface.co/openai/whisper-small)
- Zero-shot intent: `transformers` pipeline (`facebook/bart-large-mnli` by default)
- UI: [Gradio](https://www.gradio.app/) on Hugging Face Spaces
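The UI wiring in `app.py` is not reproduced here; a minimal Gradio 4 sketch of how mic/upload audio could feed such a function (the `voice_agent` body is a placeholder):
```python
import gradio as gr

def voice_agent(audio_path: str) -> dict:
    # Placeholder: call the ASR + intent + tool pipeline sketched above.
    return {"transcript": "...", "chosen_intent": "...", "action_result": {}}

demo = gr.Interface(
    fn=voice_agent,
    inputs=gr.Audio(sources=["microphone", "upload"], type="filepath", label="Speak or upload audio"),
    outputs=gr.JSON(label="Execution log"),
    title="Voice Agent",
)

if __name__ == "__main__":
    demo.launch()
```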
---
## ⚙️ Requirements
This Space uses `requirements.txt`:
```txt
transformers>=4.41.0
torch
torchaudio
gradio>=4.0.0
librosa
soundfile
```
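`librosa` and `soundfile` handle decoding and resampling uploaded audio; a small sketch of that step (assuming a local file path, not the Space's actual loading code):
```python
import librosa
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Decode a .wav/.mp3 file and resample to the 16 kHz rate Whisper expects.
waveform, sr = librosa.load("clip.mp3", sr=16000)

# The ASR pipeline also accepts a raw array plus its sampling rate.
result = asr({"raw": waveform, "sampling_rate": sr})
print(result["text"])
```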