Space: hudaakram/Voice_OCR_Agent

hudaakram committed on Sep 12, 2025

Commit ff3dea4 (verified) · 1 parent: 7d839a0

Update README.md


Files changed (1)

README.md +87 -3


@@ -1,12 +1,96 @@
 ---
 title: Voice OCR Agent
 emoji: 👁
-colorFrom: green
-colorTo: gray
 sdk: gradio
 sdk_version: 5.45.0
 app_file: app.py
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
 title: Voice OCR Agent
 emoji: 👁
+colorFrom: purple
+colorTo: indigo
 sdk: gradio
 sdk_version: 5.45.0
 app_file: app.py
 pinned: false
+license: mit
+tags:
+  - speech-recognition
+  - whisper
+  - zero-shot
+  - ocr
+  - summarization
+  - question-answering
+  - ai-agent
+  - gradio
 ---
+# 🎤🧾 Multimodal Voice & OCR Agent
+**Voice commands → intents → tools** and **images → text → summary → QA** — using only **pre-trained models** on Hugging Face.
+> **Live demo:** open the **App** tab above.
+> Works on CPU (tiny models) and GPU (faster, larger models).
+---
+## 🔎 Overview
+This Space demonstrates a simple **agent loop** across two modalities:
+- **Voice Agent tab:** microphone/upload → ASR → **zero-shot** intent detection → run a mapped tool → append to an execution log.
+- **OCR tab:** image/PDF page → OCR → summarization → optional **question answering** over the extracted text.
+---
+## 🧩 Models Used (pre-trained)
+- **ASR (speech→text):** `openai/whisper-tiny` *(use `openai/whisper-small` on GPU)*
+- **Zero-shot intent:** `facebook/bart-large-mnli` *(multilingual alt: `MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7`)*
+- **OCR:** `microsoft/trocr-small-printed` *(for handwriting: `microsoft/trocr-small-handwritten`)*
+- **Summarization:** `sshleifer/distilbart-cnn-12-6`
+- **Question Answering:** `deepset/roberta-base-squad2`
+All models are loaded with `transformers.pipeline` — no training required.
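A minimal sketch of how the five models listed above might be loaded with `transformers.pipeline`. The task strings are the pipeline task names from the transformers 4.x releases; the `PIPELINE_SPECS` table and the `get_pipeline` helper are hypothetical names for illustration, not taken from `app.py`.

```python
from functools import lru_cache

# Hypothetical table: component name -> (pipeline task, default model ID).
# The model IDs come from the README; pairing them with these task names
# is an assumption about how app.py wires things up.
PIPELINE_SPECS = {
    "asr": ("automatic-speech-recognition", "openai/whisper-tiny"),
    "intent": ("zero-shot-classification", "facebook/bart-large-mnli"),
    "ocr": ("image-to-text", "microsoft/trocr-small-printed"),
    "summarize": ("summarization", "sshleifer/distilbart-cnn-12-6"),
    "qa": ("question-answering", "deepset/roberta-base-squad2"),
}


@lru_cache(maxsize=None)
def get_pipeline(name: str):
    """Build the pipeline for one component on first use, then reuse it."""
    # Imported lazily so that merely importing this module stays cheap.
    from transformers import pipeline

    task, model_id = PIPELINE_SPECS[name]
    return pipeline(task, model=model_id)
```

Loading lazily and caching means a CPU Space only pays the download and start-up cost for the components a user actually exercises.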
+---
+## 🧭 How It Works
+1. **Capture:** Gradio handles microphone/file input (audio or image).
+2. **Perceive:**
+   - Audio → Whisper ASR → transcript
+   - Image → TrOCR → extracted text
+3. **Understand:**
+   - Transcript → zero-shot classifier over user-editable intents
+   - OCR text → optional summarizer + QA
+4. **Act:** Chosen intent maps to a simple tool (e.g., `turn_on_lights`, `set_timer`) and logs the result.
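The "Understand" and "Act" steps for the voice path can be sketched as follows. This assumes the classifier follows the zero-shot pipeline's calling convention (`classifier(text, candidate_labels=...)` returning `labels` and `scores` sorted by descending score). The tool functions, the `TOOLS` mapping, `run_voice_command`, and the confidence threshold are all hypothetical; the README does not say how `app.py` handles low-confidence intents.

```python
from typing import Callable, Dict, List


def turn_on_lights() -> str:
    return "Lights turned on."


def set_timer() -> str:
    return "Timer set."


def pause_music() -> str:
    return "Music paused."


# Hypothetical mapping from intent label to tool; the labels double as the
# candidate labels handed to the zero-shot classifier.
TOOLS: Dict[str, Callable[[], str]] = {
    "turn on the lights": turn_on_lights,
    "set a timer": set_timer,
    "pause the music": pause_music,
}


def run_voice_command(
    transcript: str,
    classify: Callable[..., dict],
    log: List[str],
    threshold: float = 0.5,  # assumed cut-off, not from the README
) -> str:
    """Classify a transcript, run the matching tool, and log the outcome."""
    result = classify(transcript, candidate_labels=list(TOOLS))
    intent, score = result["labels"][0], result["scores"][0]
    if score < threshold:
        outcome = "No confident intent; nothing executed."
    else:
        outcome = TOOLS[intent]()
    log.append(f"{transcript!r} -> {intent} ({score:.2f}): {outcome}")
    return outcome
```

Passing the classifier in as an argument keeps the dispatch logic testable with a stub, without downloading a model.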
+---
+## 🧪 Try It
+- **Voice:** say “turn on the lights / set a timer / pause the music”.
+- **OCR:** upload a screenshot/document → see extracted text + summary, then ask “What’s the due date?” etc.
+---
+## 🔧 Configuration — Swap Pre-Trained Models
+Change model IDs in `app.py` **or** set Space **Variables** (Settings → Variables) without code changes:
+| Component | Env var | Default | Common alternatives |
+|---|---|---|---|
+| ASR | `ASR_MODEL` | `openai/whisper-tiny` | `openai/whisper-small` (GPU) |
+| Zero-shot Intent | `ZSC_MODEL` | `facebook/bart-large-mnli` | `MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7` |
+| OCR | `OCR_MODEL` | `microsoft/trocr-small-printed` | `microsoft/trocr-small-handwritten` |
+| Summarizer | `SUM_MODEL` | `sshleifer/distilbart-cnn-12-6` | `facebook/bart-large-cnn` (GPU) |
+| QA | `QA_MODEL` | `deepset/roberta-base-squad2` | any SQuAD2-style model |
+**Example (env var):** set `ASR_MODEL=openai/whisper-small` for more accurate transcription when a GPU is available (the larger model is slower than `whisper-tiny` on CPU).
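One way the environment-variable override in the table above could be implemented is shown below. The variable names and defaults are the ones from the table; `DEFAULT_MODELS` and `resolve_models` are hypothetical names, and treating a blank value as "unset" is an assumption.

```python
import os
from typing import Dict, Mapping, Optional

# Defaults copied from the configuration table in the README.
DEFAULT_MODELS: Dict[str, str] = {
    "ASR_MODEL": "openai/whisper-tiny",
    "ZSC_MODEL": "facebook/bart-large-mnli",
    "OCR_MODEL": "microsoft/trocr-small-printed",
    "SUM_MODEL": "sshleifer/distilbart-cnn-12-6",
    "QA_MODEL": "deepset/roberta-base-squad2",
}


def resolve_models(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the model ID for each component, preferring env overrides."""
    env = os.environ if env is None else env
    resolved = {}
    for name, default in DEFAULT_MODELS.items():
        value = (env.get(name) or "").strip()
        # A missing or blank variable falls back to the default.
        resolved[name] = value or default
    return resolved
```

Calling `resolve_models()` with no arguments reads the Space's variables from `os.environ`; passing a plain dict makes the function easy to exercise in isolation.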
+---
+## ⚙️ Requirements
+This Space installs from `requirements.txt`:
+```txt
+transformers>=4.41.0
+torch
+torchaudio
+gradio>=4.0.0
+librosa
+soundfile
+Pillow
+```
+System packages in `apt.txt`: `ffmpeg`