hudaakram committed on
Commit ff3dea4 · verified · 1 Parent(s): 7d839a0

Update README.md

Files changed (1)
  1. README.md +87 -3
README.md CHANGED
@@ -1,12 +1,96 @@
  ---
  title: Voice OCR Agent
  emoji: πŸ‘
- colorFrom: green
- colorTo: gray
  sdk: gradio
  sdk_version: 5.45.0
  app_file: app.py
  pinned: false
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
  title: Voice OCR Agent
  emoji: πŸ‘
+ colorFrom: purple
+ colorTo: indigo
  sdk: gradio
  sdk_version: 5.45.0
  app_file: app.py
  pinned: false
+ license: mit
+ tags:
+ - speech-recognition
+ - whisper
+ - zero-shot
+ - ocr
+ - summarization
+ - question-answering
+ - ai-agent
+ - gradio
  ---
 
+ # 🎤🧾 Multimodal Voice & OCR Agent
+ **Voice commands → intents → tools** and **images → text → summary → QA**, using only **pre-trained models** on Hugging Face.
+
+ > **Live demo:** open the **App** tab above.
+ > Works on CPU (tiny models) and GPU (faster, larger models).
+
+ ---
+
+ ## 🔎 Overview
+ This Space demonstrates a simple **agent loop** across two modalities:
+
+ - **Voice Agent tab:** microphone/upload → ASR → **zero-shot** intent detection → run a mapped tool → append to an execution log.
+ - **OCR tab:** image/PDF page → OCR → summarization → optional **question answering** over the extracted text.
+
+ ---
+
+ ## 🧩 Models Used (pre-trained)
+ - **ASR (speech→text):** `openai/whisper-tiny` *(use `openai/whisper-small` on GPU)*
+ - **Zero-shot intent:** `facebook/bart-large-mnli` *(multilingual alt: `MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7`)*
+ - **OCR:** `microsoft/trocr-small-printed` *(for handwriting: `microsoft/trocr-small-handwritten`)*
+ - **Summarization:** `sshleifer/distilbart-cnn-12-6`
+ - **Question Answering:** `deepset/roberta-base-squad2`
+
+ All models are loaded with `transformers.pipeline`; no training required.
+
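A minimal sketch of how the five models listed above could be loaded through `transformers.pipeline`. The task strings are the pipeline task names used by the transformers 4.x releases this Space pins; the `MODELS` table and the `load` helper are illustrative names and are not taken from `app.py`. The import is deferred so the table can be inspected without `transformers` installed.

```python
from functools import lru_cache

# Component -> (pipeline task, default model ID). The model IDs come from the
# list above; this dictionary layout is a hypothetical one, not the app's code.
MODELS = {
    "asr": ("automatic-speech-recognition", "openai/whisper-tiny"),
    "intent": ("zero-shot-classification", "facebook/bart-large-mnli"),
    "ocr": ("image-to-text", "microsoft/trocr-small-printed"),
    "summarizer": ("summarization", "sshleifer/distilbart-cnn-12-6"),
    "qa": ("question-answering", "deepset/roberta-base-squad2"),
}


@lru_cache(maxsize=None)
def load(component: str):
    """Build the pipeline for one component on first use, then reuse it."""
    # Imported lazily so that importing this module stays cheap.
    from transformers import pipeline

    task, model_id = MODELS[component]
    return pipeline(task, model=model_id)
```

Because of the cache, calling `load("asr")` twice returns the same pipeline object, so each model is initialised at most once per process.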
+ ---
+
+ ## 🧭 How It Works
+ 1. **Capture:** Gradio handles microphone/file input (audio or image).
+ 2. **Perceive:**
+    - Audio → Whisper ASR → transcript
+    - Image → TrOCR → extracted text
+ 3. **Understand:**
+    - Transcript → zero-shot classifier over user-editable intents
+    - OCR text → optional summarizer + QA
+ 4. **Act:** Chosen intent maps to a simple tool (e.g., `turn_on_lights`, `set_timer`) and logs the result.
+
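The Act step above can be sketched as a dictionary dispatch. This assumes the classifier result has the shape returned by a `zero-shot-classification` pipeline (`labels` and `scores`, sorted best first). The tool bodies, the `pause_music` tool, the label strings, the `act` helper, and the 0.5 confidence threshold are all hypothetical choices for illustration, not the Space's actual code.

```python
def turn_on_lights() -> str:
    return "Lights turned on."


def set_timer() -> str:
    return "Timer set."


def pause_music() -> str:  # hypothetical third tool
    return "Music paused."


# Intent label -> tool. The labels are assumed to match the candidate
# labels passed to the zero-shot classifier.
TOOLS = {
    "turn on the lights": turn_on_lights,
    "set a timer": set_timer,
    "pause the music": pause_music,
}


def act(zsc_result: dict, log: list, threshold: float = 0.5) -> str:
    """Run the tool for the top-scoring intent and append a line to the log."""
    label = zsc_result["labels"][0]
    score = zsc_result["scores"][0]
    tool = TOOLS.get(label)
    if tool is None or score < threshold:
        # Unknown intent or low confidence: do nothing, but still log it.
        outcome = "no tool run"
    else:
        outcome = tool()
    log.append(f"{label} ({score:.2f}) -> {outcome}")
    return outcome
```

Keeping the mapping in a plain dictionary means an intent added in the UI only needs a matching entry in `TOOLS` to become actionable.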
+ ---
+
+ ## 🧪 Try It
+ - **Voice:** say “turn on the lights”, “set a timer”, or “pause the music”.
+ - **OCR:** upload a screenshot or document to see the extracted text and a summary, then ask a question such as “What’s the due date?”.
+
+ ---
+
+ ## 🔧 Configuration: Swap Pre-Trained Models
+ Change the model IDs in `app.py`, **or** set Space **Variables** (Settings → Variables) to switch models without code changes:
+
+ | Component | Env var | Default | Common alternatives |
+ |---|---|---|---|
+ | ASR | `ASR_MODEL` | `openai/whisper-tiny` | `openai/whisper-small` (GPU) |
+ | Zero-shot Intent | `ZSC_MODEL` | `facebook/bart-large-mnli` | `MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7` |
+ | OCR | `OCR_MODEL` | `microsoft/trocr-small-printed` | `microsoft/trocr-small-handwritten` |
+ | Summarizer | `SUM_MODEL` | `sshleifer/distilbart-cnn-12-6` | `facebook/bart-large-cnn` (GPU) |
+ | QA | `QA_MODEL` | `deepset/roberta-base-squad2` | any SQuAD2-style model |
+
+ **Example (env var):** set `ASR_MODEL=openai/whisper-small` for better transcription accuracy when a GPU is available.
+
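Space Variables are exposed to the app as environment variables, so the table above can be read with a small lookup. This is a sketch under that assumption; the `DEFAULTS` table mirrors the README, while the `resolve_models` helper and its `env` parameter are hypothetical names, not taken from `app.py`.

```python
import os

# Env var name -> default model ID, mirroring the configuration table.
DEFAULTS = {
    "ASR_MODEL": "openai/whisper-tiny",
    "ZSC_MODEL": "facebook/bart-large-mnli",
    "OCR_MODEL": "microsoft/trocr-small-printed",
    "SUM_MODEL": "sshleifer/distilbart-cnn-12-6",
    "QA_MODEL": "deepset/roberta-base-squad2",
}


def resolve_models(env=None) -> dict:
    """Return the model ID per component, preferring values set in `env`."""
    env = os.environ if env is None else env
    # `or default` also covers a variable that is set but left empty.
    return {name: env.get(name) or default for name, default in DEFAULTS.items()}
```

For example, `resolve_models({"ASR_MODEL": "openai/whisper-small"})` overrides only the ASR entry and leaves the other four at their defaults.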
+ ---
+
+ ## ⚙️ Requirements
+ This Space installs from `requirements.txt`:
+ ```txt
+ transformers>=4.41.0
+ torch
+ torchaudio
+ gradio>=4.0.0
+ librosa
+ soundfile
+ Pillow
+ ```
+
+ ---
+
+ System packages in `apt.txt`:
+
+ ffmpeg