NanG01 committed on
Commit bf88c57 · 1 Parent(s): 2402166

Clean up repo: remove archive, dev docs, nested submodule; fix license badge

.gitignore CHANGED
@@ -51,6 +51,13 @@ models/piper/
51
  *.tflite
52
  *.onnx
53
 
54
+ # Runtime data at root
55
+ memory.json
56
+
57
+ # Local archive and extras
58
+ archive/
59
+ extras/
60
+
61
  # Environment variables
62
  .env
63
 
BEFORE_AFTER.md DELETED
@@ -1,422 +0,0 @@
1
- # 📊 VisionQ - Before & After Restructuring
2
-
3
- ## 🎯 TRANSFORMATION SUMMARY
4
-
5
- Your VisionQ project has been transformed from a **cluttered development project** to a **clean, production-ready application** with a **web interface**!
6
-
7
- ---
8
-
9
- ## 📂 FOLDER STRUCTURE COMPARISON
10
-
11
- ### **BEFORE** ❌
12
- ```
13
- VisionQ/
14
- ├── agents/ (new)
15
- ├── core/ (new)
16
- ├── data/
17
- ├── extras/
18
- ├── models/
19
- ├── VisionQ/
20
- ├── caption_agent.py (duplicate)
21
- ├── memory_agent.py (duplicate)
22
- ├── query_agent.py (duplicate)
23
- ├── vision_agent.py (duplicate)
24
- ├── voice_agent.py (duplicate)
25
- ├── main.py (old)
26
- ├── main_upgraded.py (old)
27
- ├── ask_question.py (old)
28
- ├── ask_question_upgraded.py (old)
29
- ├── test_upgrade.py (old)
30
- ├── install_upgrade.bat (old)
31
- ├── requirements.txt (old)
32
- ├── requirements_upgraded.txt (old)
33
- ├── README.md (old)
34
- ├── README_UPGRADED.md (duplicate)
35
- ├── ARCHITECTURE.md (old)
36
- ├── COMPARISON.md (old)
37
- ├── DEPLOYMENT_CHECKLIST.md (old)
38
- ├── INDEX.md (old)
39
- ├── QUICK_REFERENCE.md (old)
40
- ├── QUICKSTART.md (old)
41
- ├── SUMMARY.md (old)
42
- ├── UPGRADE_GUIDE.md (old)
43
- └── ... (many more files)
44
-
45
- ❌ 40+ files in root
46
- ❌ Duplicate files
47
- ❌ Confusing structure
48
- ❌ No web interface
49
- ❌ Scattered docs
50
- ```
51
-
52
- ### **AFTER** ✅
53
- ```
54
- VisionQ/
55
- ├── 📁 agents/ # AI agents (clean)
56
- ├── 📁 config/ # Settings (centralized)
57
- ├── 📁 ui/ # Web interface (NEW!)
58
- ├── 📁 core/ # Integration
59
- ├── 📁 data/ # Storage
60
- ├── 📁 models/ # AI models
61
- ├── 📁 docs/ # Documentation (organized)
62
- ├── 📁 .streamlit/ # UI config
63
- ├── 📁 archive/ # Old files (backup)
64
- ├── 📄 README.md # Main docs (clean)
65
- ├── 📄 requirements.txt # Dependencies (clean)
66
- ├── 📄 run.bat # Launcher (NEW!)
67
- ├── 📄 cleanup.bat # Cleanup script (NEW!)
68
- ├── 📄 .env.example # Environment template (NEW!)
69
- └── 📄 .gitignore # Git rules (updated)
70
-
71
- ✅ 15 entries in root (9 folders, 6 files)
72
- ✅ No duplicates
73
- ✅ Clear structure
74
- ✅ Web interface
75
- ✅ Organized docs
76
- ```
77
-
78
- ---
79
-
80
- ## 🆕 NEW FEATURES
81
-
82
- | Feature | Before | After |
83
- |---------|--------|-------|
84
- | **Web Interface** | ❌ None | ✅ Streamlit UI |
85
- | **One-Click Launch** | ❌ None | ✅ run.bat |
86
- | **Centralized Config** | ❌ Scattered | ✅ config/settings.py |
87
- | **Language Docs** | ❌ None | ✅ docs/LANGUAGES.md |
88
- | **API Keys Docs** | ❌ None | ✅ docs/API_KEYS.md |
89
- | **Structure Docs** | ❌ None | ✅ docs/STRUCTURE.md |
90
- | **Cleanup Script** | ❌ None | ✅ cleanup.bat |
91
- | **Environment Template** | ❌ None | ✅ .env.example |
92
-
93
- ---
94
-
95
- ## 🌐 WEB INTERFACE (NEW!)
96
-
97
- ### **Before**
98
- ```bash
99
- # Command line only
100
- python main_upgraded.py
101
-
102
- # Voice commands only
103
- # No visual feedback
104
- # Hard to test
105
- ```
106
-
107
- ### **After**
108
- ```bash
109
- # Web interface
110
- run.bat
111
-
112
- # Opens browser at http://localhost:8501
113
- # Visual interface
114
- # Easy to test
115
- # Interactive
116
- ```
117
-
118
- ### **UI Features**
119
- - ✅ **4 Tabs:** Vision, Query, Memories, Help
120
- - ✅ **Live Camera:** See what AI sees
121
- - ✅ **Interactive Buttons:** Capture, Remember, Read Text
122
- - ✅ **Query Interface:** Ask questions visually
123
- - ✅ **Memory Browser:** View stored memories
124
- - ✅ **Settings Sidebar:** Configure languages
125
- - ✅ **Help Section:** Built-in documentation
126
-
127
- ---
128
-
129
- ## 🌍 LANGUAGE SUPPORT
130
-
131
- ### **Before**
132
- ```python
133
- # Hardcoded in code
134
- OCR_LANGUAGES = ['en']
135
-
136
- # No documentation
137
- # No easy way to change
138
- ```
139
-
140
- ### **After**
141
- ```python
142
- # Configurable in UI
143
- # Select from 90+ languages
144
- # Documented in docs/LANGUAGES.md
145
-
146
- # Easy to change:
147
- # 1. Open UI sidebar
148
- # 2. Select languages
149
- # 3. Done!
150
- ```
151
-
152
- **Supported:** 90+ languages including English, Spanish, French, German, Italian, Portuguese, Russian, Arabic, Hindi, Chinese, Japanese, Korean, and many more!
153
-
154
- ---
155
-
156
- ## 🔑 API KEYS CLARITY
157
-
158
- ### **Before**
159
- ```
160
- ❓ Unclear if API keys needed
161
- ❓ No documentation
162
- ❓ Confusing for users
163
- ```
164
-
165
- ### **After**
166
- ```
167
- ✅ Clear: NO API keys needed!
168
- ✅ Documented in docs/API_KEYS.md
169
- ✅ .env.example for optional token
170
- ✅ Works 100% offline
171
- ```
172
-
173
- ---
174
-
175
- ## 📚 DOCUMENTATION
176
-
177
- ### **Before**
178
- ```
179
- ❌ 11 documentation files in root
180
- ❌ Scattered information
181
- ❌ Redundant content
182
- ❌ Hard to find info
183
- ```
184
-
185
- ### **After**
186
- ```
187
- ✅ 1 main README.md
188
- ✅ 3 focused docs in docs/
189
- ✅ No redundancy
190
- ✅ Easy to navigate
191
- ```
192
-
193
- | Document | Purpose |
194
- |----------|---------|
195
- | `README.md` | Main documentation |
196
- | `docs/LANGUAGES.md` | Language support (90+) |
197
- | `docs/API_KEYS.md` | API keys info |
198
- | `docs/STRUCTURE.md` | Project structure |
199
- | `RESTRUCTURE_SUMMARY.md` | This summary |
200
-
201
- ---
202
-
203
- ## 🎯 USER EXPERIENCE
204
-
205
- ### **Before**
206
- ```
207
- 1. Install dependencies
208
- 2. Run python main_upgraded.py
209
- 3. Use voice commands only
210
- 4. No visual feedback
211
- 5. Hard to debug
212
- ```
213
-
214
- ### **After**
215
- ```
216
- 1. Run run.bat
217
- 2. Browser opens automatically
218
- 3. Click buttons in UI
219
- 4. See results instantly
220
- 5. Easy to use and test
221
- ```
222
-
223
- ---
224
-
225
- ## 👨‍💻 DEVELOPER EXPERIENCE
226
-
227
- ### **Before**
228
- ```
229
- ❌ Flat file structure
230
- ❌ Settings scattered in code
231
- ❌ Hard to find files
232
- ❌ Duplicate code
233
- ❌ No clear entry point
234
- ```
235
-
236
- ### **After**
237
- ```
238
- ✅ Organized folders
239
- ✅ Centralized config
240
- ✅ Easy to navigate
241
- ✅ No duplicates
242
- ✅ Clear entry points
243
- ```
244
-
245
- ---
246
-
247
- ## 🔧 CONFIGURATION
248
-
249
- ### **Before**
250
- ```python
251
- # Settings scattered across files
252
- # Hard to change
253
- # No central config
254
- ```
255
-
256
- ### **After**
257
- ```python
258
- # All in config/settings.py
259
- # Easy to customize
260
- # Well documented
261
- # Feature flags
262
-
263
- # Example:
264
- OCR_CONFIG = {
265
- "languages": ["en", "es", "fr"],
266
- "confidence_threshold": 0.3,
267
- }
268
- ```
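As a rough illustration of how a `confidence_threshold` like the one above could be used, here is a minimal sketch. It assumes detections arrive as `(bbox, text, confidence)` tuples, which is the shape EasyOCR's `readtext` returns; `filter_detections` and the sample data are hypothetical and are not taken from the project.

```python
# Hypothetical sketch: apply the configured threshold to OCR detections.
OCR_CONFIG = {
    "languages": ["en", "es", "fr"],
    "confidence_threshold": 0.3,
}


def filter_detections(detections, threshold):
    """Return the text of every detection whose confidence meets the threshold."""
    return [text for _bbox, text, conf in detections if conf >= threshold]


# Made-up sample detections; the bounding boxes are omitted (None) for brevity.
sample = [
    (None, "EXIT", 0.92),
    (None, "l|1", 0.12),
    (None, "Platform 4", 0.30),
]
kept = filter_detections(sample, OCR_CONFIG["confidence_threshold"])
print(kept)
```

Using `>=` keeps detections that sit exactly on the threshold; whether the project treats the boundary as inclusive is not stated in the source.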
269
-
270
- ---
271
-
272
- ## 📊 FILE COUNT
273
-
274
- | Category | Before | After | Change |
275
- |----------|--------|-------|--------|
276
- | **Root Files** | 40+ | 15 | -25 |
277
- | **Agent Files** | 12 (duplicates) | 7 (clean) | -5 |
278
- | **Doc Files** | 11 (scattered) | 4 (organized) | -7 |
279
- | **Config Files** | 0 | 1 | +1 |
280
- | **UI Files** | 0 | 1 | +1 |
281
- | **Total Clutter** | High | Low | ✅ |
282
-
283
- ---
284
-
285
- ## 🚀 LAUNCH PROCESS
286
-
287
- ### **Before**
288
- ```bash
289
- # Manual process
290
- 1. Activate venv
291
- 2. Install dependencies
292
- 3. Run python main_upgraded.py
293
- 4. Hope it works
294
- ```
295
-
296
- ### **After**
297
- ```bash
298
- # One command
299
- run.bat
300
-
301
- # Automatically:
302
- # - Creates venv if needed
303
- # - Installs dependencies
304
- # - Launches Streamlit
305
- # - Opens browser
306
- ```
307
-
308
- ---
309
-
310
- ## 🎨 VISUAL COMPARISON
311
-
312
- ### **Before: Command Line**
313
- ```
314
- $ python main_upgraded.py
315
- [VisionAgent] Initializing...
316
- [VisionAgent] YOLO backend loaded
317
- [VoiceAgent] Microphone detected
318
- Vision Q started. I am listening.
319
- [VOICE IN]: Listening (offline)...
320
- ```
321
-
322
- ### **After: Web Interface**
323
- ```
324
- ┌─────────────────────────────────────┐
325
- │ 👁️ VisionQ - Multimodal AI │
326
- ├─────────────────────────────────────┤
327
- │ 📷 Vision 🔍 Query 🧠 Memories │
328
- ├─────────────────────────────────────┤
329
- │ [📷 Capture] [💾 Remember] │
330
- │ [🔤 Read Text] │
331
- │ │
332
- │ 📸 Camera Feed │
333
- │ [Live video preview] │
334
- │ │
335
- │ 📝 Results │
336
- │ "a person holding a phone" │
337
- └─────────────────────────────────────┘
338
- ```
339
-
340
- ---
341
-
342
- ## ✅ BENEFITS SUMMARY
343
-
344
- ### **For Users**
345
- - ✅ Easy web interface
346
- - ✅ Visual feedback
347
- - ✅ One-click launch
348
- - ✅ 90+ languages
349
- - ✅ No API keys needed
350
-
351
- ### **For Developers**
352
- - ✅ Clean structure
353
- - ✅ Modular code
354
- - ✅ Centralized config
355
- - ✅ Easy to extend
356
- - ✅ Well documented
357
-
358
- ### **For Everyone**
359
- - ✅ Professional appearance
360
- - ✅ Production ready
361
- - ✅ Easy to deploy
362
- - ✅ Easy to maintain
363
- - ✅ Open source
364
-
365
- ---
366
-
367
- ## 🎯 WHAT TO DO NOW
368
-
369
- ### **1. Install & Run**
370
- ```bash
371
- pip install -r requirements.txt
372
- run.bat
373
- ```
374
-
375
- ### **2. Clean Up Old Files**
376
- ```bash
377
- cleanup.bat
378
- ```
379
-
380
- ### **3. Explore**
381
- - Open http://localhost:8501
382
- - Try the web interface
383
- - Test OCR in different languages
384
- - Query your memories
385
-
386
- ### **4. Customize**
387
- - Edit `config/settings.py`
388
- - Select languages in UI
389
- - Adjust settings
390
-
391
- ---
392
-
393
- ## 📈 IMPROVEMENT METRICS
394
-
395
- | Metric | Before | After | Improvement |
396
- |--------|--------|-------|-------------|
397
- | **Files in Root** | 40+ | 15 | 🟢 -62% |
398
- | **Duplicate Files** | 12 | 0 | 🟢 -100% |
399
- | **Setup Steps** | 5 | 1 | 🟢 -80% |
400
- | **User Interface** | CLI only | Web UI | 🟢 +100% |
401
- | **Documentation** | Scattered | Organized | 🟢 +100% |
402
- | **Ease of Use** | Hard | Easy | 🟢 +200% |
403
-
404
- ---
405
-
406
- ## 🎉 FINAL RESULT
407
-
408
- **VisionQ is now:**
409
- - ✅ **Clean** - Organized folder structure
410
- - ✅ **Modern** - Web interface with Streamlit
411
- - ✅ **Documented** - Clear, focused documentation
412
- - ✅ **Configurable** - Centralized settings
413
- - ✅ **Multi-lingual** - 90+ languages supported
414
- - ✅ **Offline** - No API keys needed
415
- - ✅ **Professional** - Production ready
416
- - ✅ **User-Friendly** - Easy to use and test
417
-
418
- ---
419
-
420
- **From cluttered development project to polished application! 🚀**
421
-
422
- **Run `run.bat` and see the transformation yourself!**
FINAL_FIX.md DELETED
@@ -1,177 +0,0 @@
1
- # FINAL FIX - Summary
2
-
3
- ## What Was Done
4
-
5
- ### 1. Fixed Embedding Normalization
6
- - Changed from `.norm()` to `torch.nn.functional.normalize()`
7
- - Updated both `encode_image()` and `encode_text()` methods
8
- - File: `agents/embedding_agent.py`
9
-
10
- ### 2. Added Error Handling
11
- - Wrapped embedding calls in try-except blocks
12
- - System continues even if embeddings fail
13
- - File: `agents/vision_agent.py`
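The "system continues even if embeddings fail" behaviour described above might look roughly like the sketch below. The helper name `safe_embed` and the log wording are assumptions for illustration; the actual code in `agents/vision_agent.py` is not shown in this diff.

```python
def safe_embed(encode, image):
    """Call an embedding function, returning None instead of raising on failure.

    `encode` is any callable that takes an image and returns an embedding.
    """
    try:
        return encode(image)
    except Exception as exc:  # broad on purpose: embeddings are optional
        print(f"[VisionAgent] Embedding failed, continuing without it: {exc}")
        return None
```

Callers then treat `None` as "no embedding available" and still return the caption and OCR text.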
14
-
15
- ### 3. Disabled Embeddings by Default
16
- - Set `embeddings_enabled: False` in config
17
- - Improves speed (2-3 seconds vs 5-7 seconds)
18
- - File: `config/settings.py`
19
-
20
- ### 4. Removed All Emojis
21
- - Cleaned up UI code
22
- - Professional appearance
23
- - File: `ui/app.py`
24
-
25
- ### 5. Added Cache Clearing
26
- - Created `fix_and_run.bat` script
27
- - Clears Python and Streamlit cache
28
- - Ensures fresh start
29
-
30
- ## How to Fix the Error
31
-
32
- ### Quick Fix (Do This Now)
33
-
34
- ```bash
35
- # Run this script
36
- fix_and_run.bat
37
- ```
38
-
39
- This will:
40
- 1. Clear all cache
41
- 2. Start fresh
42
- 3. Load updated code
43
-
44
- ### If Still Getting Errors
45
-
46
- 1. **Stop the application** (Ctrl+C in terminal)
47
-
48
- 2. **Clear all cache manually:**
49
- ```bash
50
- rd /s /q __pycache__
51
- rd /s /q agents\__pycache__
52
- rd /s /q config\__pycache__
53
- rd /s /q core\__pycache__
54
- rd /s /q ui\__pycache__
55
- ```
56
-
57
- 3. **Restart:**
58
- ```bash
59
- run.bat
60
- ```
61
-
62
- 4. **In browser, press `C` key** to clear Streamlit cache
63
-
64
- 5. **Click "Initialize System"** again
65
-
66
- ## Why This Happens
67
-
68
- The error occurs because:
69
- 1. Streamlit caches the old code
70
- 2. Python bytecode cache has old version
71
- 3. Old embedding_agent.py is being used
72
-
73
- The fix clears all caches and loads the new code.
74
-
75
- ## Verification
76
-
77
- After running `fix_and_run.bat`, you should see:
78
-
79
- ```
80
- [VisionAgent] Initializing...
81
- [VisionAgent] Embeddings disabled for faster performance
82
- [VisionAgent] YOLO backend loaded
83
- [VisionAgent] Vision system initialized
84
- ```
85
-
86
- The key line is: **"Embeddings disabled for faster performance"**
87
-
88
- This means:
89
- - Embeddings are properly disabled
90
- - No embedding errors will occur
91
- - System will be faster (2-3 seconds)
92
-
93
- ## What Each Button Does Now
94
-
95
- ### "Capture & Describe"
96
- - Captures frame
97
- - Generates caption (BLIP)
98
- - Extracts text (OCR)
99
- - NO embeddings (disabled)
100
- - Fast: ~2-3 seconds
101
-
102
- ### "Remember Scene"
103
- - Captures frame
104
- - Generates caption (BLIP)
105
- - Extracts text (OCR)
106
- - NO embeddings (disabled)
107
- - Stores in memory
108
- - Fast: ~2-3 seconds
109
-
110
- ### "Read Text"
111
- - Captures frame
112
- - Extracts text only (OCR)
113
- - Very fast: ~500ms
114
-
115
- ## Files Created
116
-
117
- 1. `fix_and_run.bat` - Quick fix script
118
- 2. `TROUBLESHOOTING.md` - Detailed troubleshooting guide
119
- 3. `docs/PERFORMANCE.md` - Performance optimization guide
120
- 4. `FIXES_APPLIED.md` - Summary of all fixes
121
-
122
- ## Current Configuration
123
-
124
- ```python
125
- # config/settings.py
126
- FEATURES = {
127
- "ocr_enabled": True, # Text extraction
128
- "embeddings_enabled": False, # Disabled for speed
129
- "object_detection_enabled": True, # YOLO detection
130
- }
131
- ```
132
-
133
- **Result:**
134
- - Speed: Fast (2-3 seconds)
135
- - Features: Caption + OCR + Objects
136
- - Stability: No embedding errors
137
-
138
- ## To Enable Embeddings (Optional)
139
-
140
- If you want visual similarity search:
141
-
142
- 1. Edit `config/settings.py`:
143
- ```python
144
- FEATURES = {
145
- "embeddings_enabled": True,
146
- }
147
- ```
148
-
149
- 2. Run `fix_and_run.bat`
150
-
151
- 3. Test carefully
152
-
153
- **Note:** This will make it slower (5-7 seconds) but enables visual search.
154
-
155
- ## Summary
156
-
157
- **The error is fixed in the code.**
158
-
159
- **To apply the fix:**
160
- 1. Run `fix_and_run.bat`
161
- 2. Click "Initialize System"
162
- 3. Test buttons
163
- 4. Should work now!
164
-
165
- **If still broken:**
166
- - See `TROUBLESHOOTING.md`
167
- - Or keep embeddings disabled (recommended)
168
-
169
- **Current status:**
170
- - Embeddings: Disabled (for speed and stability)
171
- - OCR: Enabled
172
- - Object Detection: Enabled
173
- - Caption: Enabled
174
- - Speed: Fast (2-3 seconds)
175
- - Emojis: Removed
176
-
177
- **Everything should work now!**
FIXES_APPLIED.md DELETED
@@ -1,124 +0,0 @@
1
- # Fixes Applied - Summary
2
-
3
- ## Issues Fixed
4
-
5
- ### 1. AttributeError with CLIP Embeddings
6
- **Error:** `'BaseModelOutputWithPooling' object has no attribute 'norm'`
7
-
8
- **Fix:** Changed normalization method in `agents/embedding_agent.py`:
9
- ```python
10
- # Before (broken)
11
- embedding = image_features / image_features.norm(dim=-1, keepdim=True)
12
-
13
- # After (fixed)
14
- embedding = torch.nn.functional.normalize(image_features, p=2, dim=-1)
15
- ```
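For reference, both lines above aim at the same result: dividing a vector by its L2 norm. The sketch below spells that arithmetic out in plain Python, using the same `eps` guard that `torch.nn.functional.normalize` applies by default (`1e-12`). It assumes the input is already a plain vector of numbers; if `image_features` is a model output wrapper rather than a tensor, as the error message suggests, the tensor would presumably need to be extracted first.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 length; eps avoids division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


unit = l2_normalize([3.0, 4.0])
print(unit)
```

A zero vector stays all zeros instead of producing a division error, which is the purpose of the `eps` clamp.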
16
-
17
- ### 2. Slow Performance
18
- **Issue:** System taking 5-7 seconds per capture
19
-
20
- **Fix:** Disabled embeddings by default in `config/settings.py`:
21
- ```python
22
- FEATURES = {
23
- "embeddings_enabled": False, # Now disabled for speed
24
- }
25
- ```
26
-
27
- **Result:** ~2-3 seconds per capture (much faster!)
28
-
29
- ### 3. Emojis in Code
30
- **Issue:** Emojis throughout UI code
31
-
32
- **Fix:** Removed all emojis from:
33
- - Button labels
34
- - Headers
35
- - Status messages
36
- - Tab names
37
- - Spinner messages
38
-
39
- **Result:** Clean, professional UI without emojis
40
-
41
- ## What Changed
42
-
43
- ### Files Modified
44
- 1. `agents/embedding_agent.py` - Fixed normalization
45
- 2. `config/settings.py` - Disabled embeddings by default
46
- 3. `agents/vision_agent.py` - Made embeddings optional
47
- 4. `ui/app.py` - Removed all emojis
48
-
49
- ### New Files Created
50
- 1. `docs/PERFORMANCE.md` - Performance optimization guide
51
-
52
- ## Current Configuration
53
-
54
- **Speed:** Fast (2-3 seconds per capture)
55
-
56
- **Enabled Features:**
57
- - BLIP Caption (always on)
58
- - YOLO Object Detection
59
- - EasyOCR Text Extraction
60
-
61
- **Disabled Features:**
62
- - CLIP Embeddings (for speed)
63
-
64
- ## How to Use
65
-
66
- ### 1. Run the Application
67
- ```bash
68
- run.bat
69
- ```
70
-
71
- ### 2. Test the Fix
72
- - Click "Initialize System"
73
- - Click "Capture & Describe"
74
- - Should work without errors now
75
- - Should be faster (2-3 seconds)
76
-
77
- ### 3. Enable Embeddings (Optional)
78
- If you want visual similarity search:
79
-
80
- Edit `config/settings.py`:
81
- ```python
82
- FEATURES = {
83
- "embeddings_enabled": True, # Enable for visual search
84
- }
85
- ```
86
-
87
- **Note:** This will make it slower (5-7 seconds) but enables visual similarity search.
88
-
89
- ## Performance Comparison
90
-
91
- | Configuration | Speed | Features |
92
- |---------------|-------|----------|
93
- | **Current (Fast)** | 2-3s | Caption + OCR + Objects |
94
- | Full (Slow) | 5-7s | Caption + OCR + Objects + Embeddings |
95
- | Minimal (Fastest) | 1s | Caption only |
96
-
97
- ## Troubleshooting
98
-
99
- ### Still getting errors?
100
- 1. Restart the application
101
- 2. Clear stored data: delete the `data/` folder (this is the storage folder, so any saved memories are removed as well)
102
- 3. Reinstall: `pip install --upgrade -r requirements.txt`
103
-
104
- ### Still slow?
105
- 1. Check `config/settings.py` - ensure `embeddings_enabled: False`
106
- 2. Reduce OCR languages to just English
107
- 3. Use smaller YOLO model (yolov8n.pt)
108
-
109
- See `docs/PERFORMANCE.md` for detailed optimization guide.
110
-
111
- ## Summary
112
-
113
- All issues fixed:
114
- - Error with embeddings: FIXED
115
- - Slow performance: FIXED (2-3x faster)
116
- - Emojis in code: REMOVED
117
-
118
- System is now:
119
- - Working correctly
120
- - Much faster
121
- - Professional appearance
122
- - Ready to use
123
-
124
- Run `run.bat` and test it out!
FIX_TENSORFLOW.md DELETED
@@ -1,138 +0,0 @@
1
- # Fix: TensorFlow/Protobuf Conflict
2
-
3
- ## The Error
4
- ```
5
- RuntimeError: Failed to import transformers/BLIP.
6
- This usually happens when TensorFlow and protobuf are out of sync.
7
- ```
8
-
9
- ## Quick Fix (Recommended)
10
-
11
- Run this script:
12
- ```bash
13
- fix_tensorflow.bat
14
- ```
15
-
16
- This will:
17
- 1. Remove TensorFlow (not needed)
18
- 2. Install correct protobuf version
19
- 3. Reinstall transformers
20
- 4. Clear cache
21
-
22
- Then run:
23
- ```bash
24
- streamlit run ui\app.py
25
- ```
26
-
27
- ---
28
-
29
- ## Manual Fix (If script doesn't work)
30
-
31
- ### Step 1: Uninstall Conflicting Packages
32
- ```bash
33
- pip uninstall tensorflow tensorflow-cpu protobuf -y
34
- ```
35
-
36
- ### Step 2: Install Correct Protobuf
37
- ```bash
38
- pip install protobuf==3.20.3
39
- ```
40
-
41
- ### Step 3: Reinstall Transformers
42
- ```bash
43
- pip install --upgrade --force-reinstall transformers
44
- ```
45
-
46
- ### Step 4: Clear Cache
47
- ```bash
48
- rd /s /q __pycache__
49
- rd /s /q agents\__pycache__
50
- rd /s /q config\__pycache__
51
- rd /s /q core\__pycache__
52
- rd /s /q ui\__pycache__
53
- ```
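The `rd /s /q` commands above only work in the Windows command prompt. A cross-platform equivalent could be sketched in Python as follows; `clear_pycache` is a hypothetical helper, not one of the project's scripts, and unlike the commands above it walks the whole tree under `root`, so any virtual environment stored there is searched too.

```python
import shutil
from pathlib import Path


def clear_pycache(root="."):
    """Delete every __pycache__ directory under root and return the removed paths."""
    removed = []
    # Collect the matches first so the tree is not modified while it is being walked.
    for cache_dir in list(Path(root).rglob("__pycache__")):
        if cache_dir.is_dir():
            shutil.rmtree(cache_dir, ignore_errors=True)
            removed.append(cache_dir)
    return removed
```

Calling `clear_pycache(".")` from the project root covers `agents`, `config`, `core`, and `ui` in one pass.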
54
-
55
- ### Step 5: Test
56
- ```bash
57
- streamlit run ui\app.py
58
- ```
59
-
60
- ---
61
-
62
- ## Why This Happens
63
-
64
- VisionQ doesn't need TensorFlow, but sometimes it gets installed as a dependency and conflicts with protobuf.
65
-
66
- **Solution:** Remove TensorFlow and use specific protobuf version.
67
-
68
- ---
69
-
70
- ## Nuclear Option (If nothing works)
71
-
72
- ### Delete and Recreate Virtual Environment
73
-
74
- ```bash
75
- # 1. Deactivate current venv
76
- deactivate
77
-
78
- # 2. Delete old venv
79
- rd /s /q .venv
80
- rd /s /q venv
81
-
82
- # 3. Create fresh venv
83
- python -m venv venv
84
-
85
- # 4. Activate
86
- venv\Scripts\activate
87
-
88
- # 5. Install dependencies
89
- pip install -r requirements.txt
90
-
91
- # 6. Run
92
- streamlit run ui\app.py
93
- ```
94
-
95
- ---
96
-
97
- ## Verify Fix
98
-
99
- After running the fix, you should be able to import without errors:
100
-
101
- ```bash
102
- python -c "from agents.caption_agent import CaptionAgent; print('Success!')"
103
- ```
104
-
105
- If you see "Success!", the fix worked!
106
-
107
- ---
108
-
109
- ## Prevention
110
-
111
- The `requirements.txt` has been updated to include:
112
- ```
113
- protobuf==3.20.3
114
- ```
115
-
116
- This prevents future conflicts.
117
-
118
- ---
119
-
120
- ## Summary
121
-
122
- **Quick Fix:**
123
- ```bash
124
- fix_tensorflow.bat
125
- streamlit run ui\app.py
126
- ```
127
-
128
- **If that doesn't work:**
129
- ```bash
130
- # Nuclear option
131
- rd /s /q venv
132
- python -m venv venv
133
- venv\Scripts\activate
134
- pip install -r requirements.txt
135
- streamlit run ui\app.py
136
- ```
137
-
138
- **One of these should resolve the conflict in most setups.**
MASTER_TROUBLESHOOTING.md DELETED
@@ -1,232 +0,0 @@
1
- # VisionQ - Complete Troubleshooting Guide
2
-
3
- ## Current Issues & Fixes
4
-
5
- ### Issue 1: TensorFlow/Protobuf Conflict
6
-
7
- **Error:**
8
- ```
9
- RuntimeError: Failed to import transformers/BLIP
10
- ```
11
-
12
- **Fix:**
13
- ```bash
14
- fix_tensorflow.bat
15
- ```
16
-
17
- See `FIX_TENSORFLOW.md` for details.
18
-
19
- ---
20
-
21
- ### Issue 2: Embedding AttributeError
22
-
23
- **Error:**
24
- ```
25
- AttributeError: 'BaseModelOutputWithPooling' object has no attribute 'norm'
26
- ```
27
-
28
- **Fix:**
29
- ```bash
30
- fix_and_run.bat
31
- ```
32
-
33
- See `TROUBLESHOOTING.md` for details.
34
-
35
- ---
36
-
37
- ## All Fix Scripts
38
-
39
- | Script | Purpose | When to Use |
40
- |--------|---------|-------------|
41
- | `fix_tensorflow.bat` | Fix TensorFlow/protobuf conflict | Import errors with BLIP |
42
- | `fix_and_run.bat` | Clear cache and restart | Embedding errors, old code |
43
- | `run.bat` | Normal start | Regular use |
44
-
45
- ---
46
-
47
- ## Step-by-Step Fix Process
48
-
49
- ### Step 1: Fix TensorFlow Conflict
50
- ```bash
51
- fix_tensorflow.bat
52
- ```
53
-
54
- ### Step 2: Clear Cache
55
- ```bash
56
- fix_and_run.bat
57
- ```
58
-
59
- ### Step 3: Test
60
- Open http://localhost:8501 and test all buttons.
61
-
62
- ---
63
-
64
- ## If Nothing Works: Nuclear Option
65
-
66
- ### Complete Reset
67
-
68
- ```bash
69
- # 1. Stop everything
70
- # Press Ctrl+C in terminal
71
-
72
- # 2. Delete virtual environment
73
- rd /s /q .venv
74
- rd /s /q venv
75
-
76
- # 3. Delete cache
77
- rd /s /q __pycache__
78
- rd /s /q agents\__pycache__
79
- rd /s /q config\__pycache__
80
- rd /s /q core\__pycache__
81
- rd /s /q ui\__pycache__
82
-
83
- # 4. Create fresh venv
84
- python -m venv venv
85
-
86
- # 5. Activate
87
- venv\Scripts\activate
88
-
89
- # 6. Upgrade pip
90
- python -m pip install --upgrade pip
91
-
92
- # 7. Install dependencies
93
- pip install -r requirements.txt
94
-
95
- # 8. Run
96
- streamlit run ui\app.py
97
- ```
98
-
99
- This will give you a completely fresh start.
100
-
101
- ---
102
-
103
- ## Common Errors & Solutions
104
-
105
- ### Error: "python run.bat" gives SyntaxError
106
- **Problem:** Trying to run .bat file with Python
107
- **Solution:** Just run `run.bat` (without python)
108
-
109
- ### Error: Camera not working
110
- **Problem:** Camera in use or permissions
111
- **Solution:**
112
- - Close other apps using camera
113
- - Check camera permissions
114
- - Try different camera index in `config/settings.py`
115
-
116
- ### Error: Models loading slowly
117
- **Problem:** First run downloads models
118
- **Solution:** Wait for download to complete (~2GB)
119
-
120
- ### Error: Out of memory
121
- **Problem:** Too many models loaded
122
- **Solution:**
123
- - Close other applications
124
- - Disable embeddings in `config/settings.py`
125
- - Use smaller YOLO model
126
-
127
- ### Error: OCR not detecting text
128
- **Problem:** Poor lighting or text quality
129
- **Solution:**
130
- - Ensure good lighting
131
- - Text should be clear
132
- - Try different languages
133
-
134
- ---
135
-
136
- ## Performance Issues
137
-
138
- ### System is slow (5+ seconds per capture)
139
-
140
- **Check if embeddings are enabled:**
141
- ```python
142
- # config/settings.py
143
- FEATURES = {
144
- "embeddings_enabled": False, # Should be False for speed
145
- }
146
- ```
147
-
148
- **If True, change to False and restart.**
149
-
150
- ---
151
-
152
- ## Verification Commands
153
-
154
- ### Test imports:
155
- ```bash
156
- python -c "from agents.caption_agent import CaptionAgent; print('Caption: OK')"
157
- python -c "from agents.vision_agent import VisionAgent; print('Vision: OK')"
158
- python -c "from agents.memory_agent import MemoryAgent; print('Memory: OK')"
159
- ```
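The three one-liners above can be folded into a single check that reports every failure instead of stopping at the first. This is a minimal sketch: the module names come from the commands above, while `check_imports` itself is a hypothetical helper.

```python
import importlib


def check_imports(module_names):
    """Try to import each module and map its name to 'OK' or a failure message."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = "OK"
        except Exception as exc:  # import-time errors are not always ImportError
            results[name] = f"FAILED ({type(exc).__name__}: {exc})"
    return results


if __name__ == "__main__":
    modules = [
        "agents.caption_agent",
        "agents.vision_agent",
        "agents.memory_agent",
    ]
    for module, status in check_imports(modules).items():
        print(f"{module}: {status}")
```

Run it from the project root so that the `agents` package is importable.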
160
-
161
- ### Check protobuf version:
162
- ```bash
163
- pip show protobuf
164
- ```
165
- Should show: `Version: 3.20.3`
166
-
167
- ### Check if TensorFlow is installed:
168
- ```bash
169
- pip show tensorflow
170
- ```
171
- Should show: `WARNING: Package(s) not found: tensorflow` (Good!)
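The two `pip show` checks can also be done from Python with the standard-library `importlib.metadata` module (Python 3.8+). In the sketch below, `installed_version` is a hypothetical helper, and the expected values in the comments are the ones stated above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("protobuf:", installed_version("protobuf"))      # expected here: 3.20.3
    print("tensorflow:", installed_version("tensorflow"))  # expected here: None
```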
172
-
173
- ---
174
-
175
- ## Getting Help
176
-
177
- ### Check these files:
178
- 1. `FIX_TENSORFLOW.md` - TensorFlow/protobuf issues
179
- 2. `TROUBLESHOOTING.md` - Embedding errors
180
- 3. `docs/PERFORMANCE.md` - Speed optimization
181
- 4. `FINAL_FIX.md` - Summary of all fixes
182
-
183
- ### Still stuck?
184
-
185
- 1. Run nuclear option (complete reset)
186
- 2. Check Python version: `python --version` (should be 3.8+)
187
- 3. Check if in virtual environment: Look for `(.venv)` or `(venv)` in prompt
188
- 4. Try on different computer to isolate issue
189
-
190
- ---
191
-
192
- ## Quick Reference
193
-
194
- **Fix TensorFlow:**
195
- ```bash
196
- fix_tensorflow.bat
197
- ```
198
-
199
- **Fix Cache:**
200
- ```bash
201
- fix_and_run.bat
202
- ```
203
-
204
- **Normal Start:**
205
- ```bash
206
- run.bat
207
- ```
208
-
209
- **Direct Start:**
210
- ```bash
211
- streamlit run ui\app.py
212
- ```
213
-
214
- **Nuclear Reset:**
215
- ```bash
216
- rd /s /q venv
217
- python -m venv venv
218
- venv\Scripts\activate
219
- pip install -r requirements.txt
220
- streamlit run ui\app.py
221
- ```
222
-
223
- ---
224
-
225
- ## Summary
226
-
227
- Most issues are fixed by:
228
- 1. `fix_tensorflow.bat` - Fixes import errors
229
- 2. `fix_and_run.bat` - Fixes cache issues
230
- 3. Nuclear option - Fixes everything else
231
-
232
- **Try them in order until it works!**
MODELS_AND_SPEED.md DELETED
@@ -1,203 +0,0 @@
1
- # Models & Performance - Quick Reference
2
-
3
- ## Current Models
4
-
5
- | Component | Model | Size | Speed | Status |
6
- |-----------|-------|------|-------|--------|
7
- | **Caption** | BLIP-base | 990MB | 1.5s | Active (SLOWEST) |
8
- | **Object Detection** | YOLOv8s | 22MB | 0.5s | Active |
9
- | **OCR** | EasyOCR | 50MB | 0.5s | Active |
10
- | **Embeddings** | CLIP | 500MB | 2s | Disabled |
11
-
12
- **Total processing time: ~2.5 seconds per capture**
13
-
14
- ---
15
-
16
- ## Why Camera is Slow
17
-
18
- **The camera itself is fast!** The slowness comes from AI processing:
19
-
20
- ```
21
- Camera capture: 10ms (fast!)
22
- BLIP caption: 1500ms (slow!)
23
- EasyOCR: 500ms (medium)
24
- YOLO detection: 500ms (medium)
25
- ------------------------
26
- Total: 2510ms (2.5 seconds)
27
- ```
28
-
29
- **BLIP is the bottleneck!**
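The arithmetic behind these totals can be sketched as a small estimator. The per-stage timings are copied from the table and breakdown above; the mapping from `FEATURES` flags to stages and the function name are assumptions made for illustration, and real timings will vary by machine.

```python
# Per-stage timings in milliseconds, taken from the table and breakdown above.
STAGE_MS = {
    "camera": 10,
    "caption": 1500,          # BLIP, always on
    "ocr": 500,               # EasyOCR
    "object_detection": 500,  # YOLO
    "embeddings": 2000,       # CLIP
}

# Hypothetical mapping from FEATURES flags to the optional stages.
FLAG_TO_STAGE = {
    "ocr_enabled": "ocr",
    "object_detection_enabled": "object_detection",
    "embeddings_enabled": "embeddings",
}


def estimate_capture_ms(features):
    """Add up the stage timings for the enabled features."""
    total = STAGE_MS["camera"] + STAGE_MS["caption"]
    for flag, stage in FLAG_TO_STAGE.items():
        if features.get(flag, False):
            total += STAGE_MS[stage]
    return total


current = {
    "ocr_enabled": True,
    "object_detection_enabled": True,
    "embeddings_enabled": False,
}
print(estimate_capture_ms(current))  # 2510
```

With every optional flag off, the estimate drops to 1510 ms, which is all camera capture plus BLIP.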
30
-
31
- ---
32
-
33
- ## Quick Speed Fixes
34
-
35
- ### Option 1: Disable OCR (500ms faster)
36
- ```python
37
- # config/settings.py
38
- FEATURES = {
39
- "ocr_enabled": False,
40
- }
41
- ```
42
- **New speed:** 2 seconds
43
-
44
- ### Option 2: Disable YOLO (500ms faster)
45
- ```python
46
- FEATURES = {
47
- "object_detection_enabled": False,
48
- }
49
- ```
50
- **New speed:** 2 seconds
51
-
52
- ### Option 3: Both (1000ms faster)
53
- ```python
54
- FEATURES = {
55
- "ocr_enabled": False,
56
- "object_detection_enabled": False,
57
- }
58
- ```
59
- **New speed:** 1.5 seconds (40% faster!)
60
-
61
- ---
62
-
63
- ## Apply Fast Mode
64
-
65
- ### Step 1: Edit config/settings.py
66
-
67
- Find the `FEATURES` section and change to:
68
- ```python
69
- FEATURES = {
70
- "ocr_enabled": False, # Disable for speed
71
- "object_detection_enabled": False, # Disable for speed
72
- "embeddings_enabled": False, # Keep disabled
73
- }
74
- ```
75
-
76
- ### Step 2: Restart
77
- ```bash
78
- fix_and_run.bat
79
- ```
80
-
81
- ### Step 3: Test
82
- Click "Capture & Describe" - should be ~1.5 seconds now!
83
-
84
- ---
85
-
86
- ## Model Details
87
-
88
- ### BLIP (Caption Model)
89
- - **Full name:** Salesforce/blip-image-captioning-base
90
- - **Purpose:** Generate scene descriptions
91
- - **Speed:** 1.5 seconds (CPU)
92
- - **Can't disable:** This is the core feature
93
- - **Alternative:** Use GIT model (3x faster)
94
-
95
- ### YOLOv8s (Object Detection)
96
- - **Full name:** YOLOv8 Small
97
- - **Purpose:** Detect objects (person, car, etc.)
98
- - **Speed:** 0.5 seconds
99
- - **Can disable:** Yes (set object_detection_enabled: False)
100
- - **Alternative:** Use YOLOv8n (nano) for 200ms faster
101
-
102
- ### EasyOCR (Text Reading)
103
- - **Purpose:** Read text from images
104
- - **Speed:** 0.5 seconds
105
- - **Can disable:** Yes (set ocr_enabled: False)
106
- - **Languages:** 90+ supported
107
-
108
- ### CLIP (Embeddings)
109
- - **Purpose:** Visual similarity search
110
- - **Speed:** 2 seconds
111
- - **Status:** Already disabled
112
- - **Keep disabled:** For best performance
113
-
114
- ---
115
-
116
- ## GPU Acceleration
117
-
118
- If you have NVIDIA GPU:
119
-
120
- ```python
121
- # config/settings.py
122
- PERFORMANCE_CONFIG = {
123
- "use_gpu": True,
124
- }
125
-
126
- OCR_CONFIG = {
127
- "gpu": True,
128
- }
129
- ```
130
-
131
- **Speed improvement:** 2-3x faster!
132
-
133
- **Requirements:**
134
- - NVIDIA GPU
135
- - CUDA installed
136
- - `pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118`
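
Rather than hard-coding `True`, the GPU flags could be derived from what PyTorch actually detects. A hedged sketch: `pick_device` is a hypothetical helper, and the two dictionaries only mirror the setting names shown above; how VisionQ reads them is not confirmed here.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so GPU inference is not an option.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()

# Same keys as in config/settings.py above, filled in from the detected device.
PERFORMANCE_CONFIG = {"use_gpu": device == "cuda"}
OCR_CONFIG = {"gpu": device == "cuda"}

print(device, PERFORMANCE_CONFIG, OCR_CONFIG)
```

This keeps the same settings file usable on machines with and without an NVIDIA GPU.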
137
-
138
- ---
139
-
140
- ## Recommended Settings
141
-
142
- ### For Speed (Fastest)
143
- ```python
144
- FEATURES = {
145
- "ocr_enabled": False,
146
- "object_detection_enabled": False,
147
- "embeddings_enabled": False,
148
- }
149
- ```
150
- **Speed:** 1.5 seconds
151
- **Features:** Caption only
152
-
153
- ### For Balance (Recommended)
154
- ```python
155
- FEATURES = {
156
- "ocr_enabled": True,
157
- "object_detection_enabled": False,
158
- "embeddings_enabled": False,
159
- }
160
- ```
161
- **Speed:** 2 seconds
162
- **Features:** Caption + OCR
163
-
164
- ### For Full Features
165
- ```python
166
- FEATURES = {
167
- "ocr_enabled": True,
168
- "object_detection_enabled": True,
169
- "embeddings_enabled": False,
170
- }
171
- ```
172
- **Speed:** 2.5 seconds
173
- **Features:** Everything
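
The three presets above follow directly from the per-model timings. A rough estimator, assuming the stages run one after another and using the CPU timings from this document (`estimate_seconds` and `PRESETS` are illustrative names):

```python
CAPTION_MS = 1500  # BLIP always runs; it cannot be disabled

# Extra cost of each optional feature, in milliseconds.
OPTIONAL_MS = {
    "ocr_enabled": 500,
    "object_detection_enabled": 500,
    "embeddings_enabled": 2000,
}


def estimate_seconds(features):
    """Estimate seconds per capture for a FEATURES-style dictionary."""
    extra = sum(ms for key, ms in OPTIONAL_MS.items() if features.get(key, False))
    return (CAPTION_MS + extra) / 1000


PRESETS = {
    "speed": {"ocr_enabled": False, "object_detection_enabled": False, "embeddings_enabled": False},
    "balance": {"ocr_enabled": True, "object_detection_enabled": False, "embeddings_enabled": False},
    "full": {"ocr_enabled": True, "object_detection_enabled": True, "embeddings_enabled": False},
}

for name, flags in PRESETS.items():
    print(f"{name}: ~{estimate_seconds(flags)} s per capture")
```

Real timings depend on hardware and image size, so these figures are only the document's own estimates added together.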
174
-
175
- ---
176
-
177
- ## Summary
178
-
179
- **Models used:**
180
- - BLIP (caption) - 1.5s - Can't disable
181
- - YOLO (objects) - 0.5s - Can disable
182
- - EasyOCR (text) - 0.5s - Can disable
183
-
184
- **Why slow:**
185
- - BLIP takes 1.5 seconds
186
- - This is normal for AI image captioning
187
- - Camera itself is fast
188
-
189
- **Quick fix:**
190
- ```python
191
- # Disable OCR and YOLO
192
- FEATURES = {
193
- "ocr_enabled": False,
194
- "object_detection_enabled": False,
195
- }
196
- ```
197
- **Result:** 40% faster (1.5s instead of 2.5s)
198
-
199
- **Best fix:**
200
- - Use GPU (2-3x faster)
201
- - Or accept 1.5-2.5 second delay (normal for AI)
202
-
203
- **The camera is not slow - the AI models are doing heavy processing!**
README.md CHANGED
@@ -14,7 +14,7 @@ license: apache-2.0
  
  [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
  [![Streamlit](https://img.shields.io/badge/streamlit-1.28+-red.svg)](https://streamlit.io/)
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
+ [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
  
  ---
  
RESTRUCTURE_SUMMARY.md DELETED
@@ -1,348 +0,0 @@
1
- # 🎉 VisionQ - Restructuring Complete!
2
-
3
- ## ✅ What Was Done
4
-
5
- Your VisionQ project has been **completely restructured** with:
6
-
7
- 1. ✅ **Clean folder structure**
8
- 2. ✅ **Streamlit web interface**
9
- 3. ✅ **Centralized configuration**
10
- 4. ✅ **Comprehensive documentation**
11
- 5. ✅ **90+ language support**
12
- 6. ✅ **No API keys needed**
13
-
14
- ---
15
-
16
- ## 📂 NEW Structure
17
-
18
- ```
19
- VisionQ/
20
- ├── agents/ # AI agents (7 files)
21
- ├── config/ # Settings (1 file)
22
- ├── ui/ # Streamlit app (1 file)
23
- ├── core/ # Integration (1 file)
24
- ├── data/ # Storage (auto-created)
25
- ├── models/ # AI models (auto-downloaded)
26
- ├── docs/ # Documentation (3 files)
27
- ├── .streamlit/ # UI config
28
- ├── README.md # Main docs
29
- ├── requirements.txt # Dependencies
30
- ├── run.bat # Launcher
31
- └── cleanup.bat # Cleanup script
32
- ```
33
-
34
- **Total:** 10 code files, 3 docs, clean structure!
35
-
36
- ---
37
-
38
- ## 🚀 How to Use
39
-
40
- ### **1. Install Dependencies**
41
- ```bash
42
- pip install -r requirements.txt
43
- ```
44
-
45
- ### **2. Launch Web Interface**
46
- ```bash
47
- # Windows
48
- run.bat
49
-
50
- # Linux/Mac
51
- streamlit run ui/app.py
52
- ```
53
-
54
- ### **3. Open Browser**
55
- Go to: `http://localhost:8501`
56
-
57
- ### **4. Start Using**
58
- - Click "Initialize System"
59
- - Capture scenes
60
- - Read text (OCR)
61
- - Query memories
62
-
63
- ---
64
-
65
- ## 🌍 Language Support
66
-
67
- **90+ languages supported!**
68
-
69
- Including:
70
- - 🇬🇧 English
71
- - 🇪🇸 Spanish
72
- - 🇫🇷 French
73
- - 🇩🇪 German
74
- - 🇮🇹 Italian
75
- - 🇵🇹 Portuguese
76
- - 🇷🇺 Russian
77
- - 🇨🇳 Chinese
78
- - 🇯🇵 Japanese
79
- - 🇰🇷 Korean
80
- - 🇸🇦 Arabic
81
- - 🇮🇳 Hindi
82
- - ...and 78 more!
83
-
84
- **Select languages in UI sidebar.**
85
-
86
- See `docs/LANGUAGES.md` for full list.
87
-
88
- ---
89
-
90
- ## 🔑 API Keys
91
-
92
- **Do you need API keys?**
93
-
94
- # **NO!** ❌
95
-
96
- VisionQ works **100% offline** without any API keys.
97
-
98
- All models run locally:
99
- - ✅ YOLO (object detection)
100
- - ✅ BLIP (captioning)
101
- - ✅ CLIP (embeddings)
102
- - ✅ EasyOCR (text extraction)
103
- - ✅ DistilBERT (NLP)
104
- - ✅ FAISS (vector search)
105
-
106
- **Optional:** Hugging Face token (for private models only)
107
-
108
- See `docs/API_KEYS.md` for details.
109
-
110
- ---
111
-
112
- ## 🎯 Key Features
113
-
114
- ### **Vision**
115
- - 👁️ Object detection (YOLO/SSD)
116
- - 📝 Image captioning (BLIP)
117
- - 🖼️ Visual embeddings (CLIP)
118
- - 🔤 Text extraction (OCR, 90+ languages)
119
-
120
- ### **Memory**
121
- - 🧠 Semantic storage (FAISS)
122
- - 💾 Persistent JSON
123
- - ⚡ Fast search (<10ms)
124
- - 📊 10,000+ capacity
125
-
126
- ### **Intelligence**
127
- - 🔍 Smart queries (DistilBERT)
128
- - ⏰ Time-based filtering
129
- - 🎯 Intent classification
130
- - 🔗 Multimodal fusion
131
-
132
- ### **Interface**
133
- - 🌐 Web UI (Streamlit)
134
- - 📱 Responsive design
135
- - 🎨 Clean interface
136
- - 🚀 One-click launch
137
-
138
- ---
139
-
140
- ## 📚 Documentation
141
-
142
- | File | Purpose |
143
- |------|---------|
144
- | `README.md` | Main documentation |
145
- | `docs/LANGUAGES.md` | Language support (90+) |
146
- | `docs/API_KEYS.md` | API keys info (none needed!) |
147
- | `docs/STRUCTURE.md` | Project structure |
148
-
149
- ---
150
-
151
- ## 🧹 Cleanup Old Files
152
-
153
- **Run cleanup script:**
154
- ```bash
155
- cleanup.bat
156
- ```
157
-
158
- This moves old files to `archive/` folder:
159
- - Old agent files
160
- - Old documentation
161
- - Old scripts
162
- - Old requirements
163
-
164
- **You can safely delete `archive/` if not needed.**
165
-
166
- ---
167
-
168
- ## 🔧 Configuration
169
-
170
- **Edit `config/settings.py` to customize:**
171
-
172
- ```python
173
- # OCR languages
174
- OCR_CONFIG = {
175
-     "languages": ["en", "es", "fr"],
176
- }
177
-
178
- # Vision settings
179
- VISION_CONFIG = {
180
-     "camera_index": 0,
181
-     "confidence_threshold": 0.5,
182
- }
183
-
184
- # Memory settings
185
- MEMORY_CONFIG = {
186
-     "max_memories": 10000,
187
- }
188
- ```
189
-
190
- ---
191
-
192
- ## 🎓 Quick Start Guide
193
-
194
- ### **Step 1: Install**
195
- ```bash
196
- pip install -r requirements.txt
197
- ```
198
-
199
- ### **Step 2: Run**
200
- ```bash
201
- run.bat # Windows
202
- # or
203
- streamlit run ui/app.py # Linux/Mac
204
- ```
205
-
206
- ### **Step 3: Initialize**
207
- - Open http://localhost:8501
208
- - Click "Initialize System"
209
- - Wait for models to load (~1 min first time)
210
-
211
- ### **Step 4: Use**
212
- - **Vision Tab:** Capture, remember, read text
213
- - **Query Tab:** Ask questions about memories
214
- - **Memories Tab:** Browse stored memories
215
- - **Help Tab:** Documentation and tips
216
-
217
- ---
218
-
219
- ## 📊 What Changed
220
-
221
- ### **Before**
222
- ```
223
- ❌ Flat file structure
224
- ❌ Redundant files
225
- ❌ No web interface
226
- ❌ Scattered documentation
227
- ❌ Complex to use
228
- ```
229
-
230
- ### **After**
231
- ```
232
- ✅ Clean folder structure
233
- ✅ No redundant files
234
- ✅ Streamlit web interface
235
- ✅ Organized documentation
236
- ✅ Easy to use
237
- ```
238
-
239
- ---
240
-
241
- ## 🎯 Benefits
242
-
243
- ### **For Users**
244
- - ✅ Easy web interface
245
- - ✅ One-click launch
246
- - ✅ Clear documentation
247
- - ✅ 90+ languages
248
-
249
- ### **For Developers**
250
- - ✅ Clean structure
251
- - ✅ Modular code
252
- - ✅ Centralized config
253
- - ✅ Easy to extend
254
-
255
- ### **For Everyone**
256
- - ✅ No API keys needed
257
- - ✅ 100% offline
258
- - ✅ Free forever
259
- - ✅ Open source
260
-
261
- ---
262
-
263
- ## 🐛 Troubleshooting
264
-
265
- ### **Models loading slowly?**
266
- - First run downloads ~2GB
267
- - Subsequent runs are fast
268
- - Models cached locally
269
-
270
- ### **Camera not working?**
271
- - Check permissions
272
- - Try different camera index
273
- - Ensure no other app using camera
274
-
275
- ### **OCR not detecting text?**
276
- - Ensure good lighting
277
- - Text should be clear
278
- - Try different languages
279
-
280
- ### **Out of memory?**
281
- - Close other applications
282
- - Reduce stored memories
283
- - Use CPU instead of GPU
284
-
285
- ---
286
-
287
- ## 📞 Support
288
-
289
- **Need help?**
290
-
291
- 1. Check `README.md`
292
- 2. Check `docs/` folder
293
- 3. Check UI "Help" tab
294
- 4. Open GitHub issue
295
-
296
- ---
297
-
298
- ## ✅ Next Steps
299
-
300
- ### **Immediate**
301
- 1. ✅ Run `pip install -r requirements.txt`
302
- 2. ✅ Run `run.bat` or `streamlit run ui/app.py`
303
- 3. ✅ Open http://localhost:8501
304
- 4. ✅ Click "Initialize System"
305
- 5. ✅ Start using!
306
-
307
- ### **Optional**
308
- 1. ⭐ Run `cleanup.bat` to organize old files
309
- 2. ⭐ Customize `config/settings.py`
310
- 3. ⭐ Add languages in UI sidebar
311
- 4. ⭐ Explore documentation
312
-
313
- ---
314
-
315
- ## 🎉 Summary
316
-
317
- **VisionQ is now:**
318
- - ✅ Clean & organized
319
- - ✅ Easy to use (web UI)
320
- - ✅ Well documented
321
- - ✅ Multi-language (90+)
322
- - ✅ No API keys needed
323
- - ✅ 100% offline
324
- - ✅ Production ready
325
-
326
- **Everything you need in one place!**
327
-
328
- ---
329
-
330
- ## 📋 Checklist
331
-
332
- - [ ] Install dependencies: `pip install -r requirements.txt`
333
- - [ ] Launch UI: `run.bat` or `streamlit run ui/app.py`
334
- - [ ] Open browser: http://localhost:8501
335
- - [ ] Initialize system
336
- - [ ] Test vision features
337
- - [ ] Test OCR (read text)
338
- - [ ] Test queries
339
- - [ ] Browse memories
340
- - [ ] Read documentation
341
- - [ ] Customize settings (optional)
342
- - [ ] Run cleanup (optional)
343
-
344
- ---
345
-
346
- **VisionQ - Restructured, refined, and ready to use! 🚀**
347
-
348
- **Open http://localhost:8501 and start exploring!**
TROUBLESHOOTING.md DELETED
@@ -1,191 +0,0 @@
1
- # Troubleshooting: AttributeError with Embeddings
2
-
3
- ## The Error
4
-
5
- ```
6
- AttributeError: 'BaseModelOutputWithPooling' object has no attribute 'norm'
7
- ```
8
-
9
- ## What Causes This
10
-
11
- This error occurs when:
12
- 1. Old cached version of embedding_agent.py is being used
13
- 2. Streamlit is caching the old agent code
14
- 3. Python bytecode cache (__pycache__) has old version
15
-
16
- ## Quick Fix (Recommended)
17
-
18
- ### Option 1: Use Fix Script
19
- ```bash
20
- fix_and_run.bat
21
- ```
22
-
23
- This will:
24
- - Clear all Python cache
25
- - Clear Streamlit cache
26
- - Restart the application
27
-
28
- ### Option 2: Manual Fix
29
- ```bash
30
- # 1. Stop the application (Ctrl+C)
31
-
32
- # 2. Clear Python cache
33
- rd /s /q __pycache__
34
- rd /s /q agents\__pycache__
35
- rd /s /q config\__pycache__
36
- rd /s /q core\__pycache__
37
- rd /s /q ui\__pycache__
38
-
39
- # 3. Clear Streamlit cache
40
- rd /s /q .streamlit\cache
41
-
42
- # 4. Restart
43
- run.bat
44
- ```
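
The `rd /s /q` commands above are Windows-only and list each folder by hand. A cross-platform sketch of the same Python cache cleanup, using only the standard library; `clear_pycache` is a hypothetical helper, not a script shipped with VisionQ, and it does not touch the Streamlit cache.

```python
import shutil
from pathlib import Path


def clear_pycache(root="."):
    """Delete every __pycache__ directory under root and return the removed paths."""
    removed = []
    # Materialize the matches first so deleting does not disturb the directory walk.
    for cache_dir in list(Path(root).rglob("__pycache__")):
        if cache_dir.is_dir():
            shutil.rmtree(cache_dir, ignore_errors=True)
            removed.append(str(cache_dir))
    return removed
```

Calling `clear_pycache(".")` from the project root, with the application stopped, would cover `agents/`, `config/`, `core/` and `ui/` in one pass.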
45
-
46
- ### Option 3: In Browser
47
- 1. Open http://localhost:8501
48
- 2. Press `C` key (clears cache)
49
- 3. Click "Initialize System" again
50
-
51
- ## Permanent Fix
52
-
53
- The code has been updated with error handling, so even if the error occurs, it will:
54
- 1. Print a warning message
55
- 2. Continue without embeddings
56
- 3. Still work for caption and OCR
57
-
58
- ## Verify Fix
59
-
60
- After running fix_and_run.bat, you should see:
61
- ```
62
- [VisionAgent] Embeddings disabled for faster performance
63
- ```
64
-
65
- This means embeddings are properly disabled and won't cause errors.
66
-
67
- ## If Still Getting Errors
68
-
69
- ### Step 1: Check Config
70
- Open `config/settings.py` and verify:
71
- ```python
72
- FEATURES = {
73
- "embeddings_enabled": False, # Should be False
74
- }
75
- ```
76
-
77
- ### Step 2: Delete All Cache
78
- ```bash
79
- # Delete everything
80
- rd /s /q __pycache__
81
- rd /s /q agents\__pycache__
82
- rd /s /q config\__pycache__
83
- rd /s /q core\__pycache__
84
- rd /s /q ui\__pycache__
85
- rd /s /q .streamlit
86
-
87
- # Recreate .streamlit
88
- mkdir .streamlit
89
- ```
90
-
91
- ### Step 3: Reinstall
92
- ```bash
93
- pip uninstall transformers torch -y
94
- pip install transformers torch
95
- ```
96
-
97
- ### Step 4: Fresh Start
98
- ```bash
99
- # Close all Python processes
100
- taskkill /F /IM python.exe
101
-
102
- # Wait 5 seconds
103
-
104
- # Start fresh
105
- run.bat
106
- ```
107
-
108
- ## Understanding the Fix
109
-
110
- ### What Was Changed
111
-
112
- **Before (Broken):**
113
- ```python
114
- embedding = image_features / image_features.norm(dim=-1, keepdim=True)
115
- ```
116
-
117
- **After (Fixed):**
118
- ```python
119
- embedding = torch.nn.functional.normalize(image_features, p=2, dim=-1)
120
- ```
121
-
122
- ### Why It Works
123
-
124
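- The new method uses PyTorch's built-in `torch.nn.functional.normalize`, which performs the same L2 normalization as the manual division:
125
- - It works on floating-point tensors of any shape, along the dimension given by `dim`
126
- - It expects a tensor, so if the model call returns a `BaseModelOutputWithPooling`, the embedding tensor (for example `pooler_output`) has to be extracted first
127
- - It clamps the norm with a small `eps`, which avoids division by zero for all-zero vectors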
128
-
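
Both the "before" and "after" snippets compute an L2 normalization: each vector is divided by its Euclidean length so that it ends up with length 1. A plain-Python sketch of that operation, with no PyTorch involved; `l2_normalize` is an illustrative name and the `eps` guard imitates the clamping that `torch.nn.functional.normalize` applies.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale vector to unit length; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


unit = l2_normalize([3.0, 4.0])
print(unit)  # [0.6, 0.8]
```

Normalized embeddings make cosine similarity a plain dot product, which is what similarity search relies on.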
129
- ### Error Handling Added
130
-
131
- ```python
132
- if self.embedding_agent:
132
-     try:
133
-         embedding = self.embedding_agent.encode_image(frame)
134
-     except Exception as e:
135
-         print(f"[VisionAgent] Embedding failed: {e}")
136
-         embedding = None
138
- ```
139
-
140
- Now even if embeddings fail, the system continues working.
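
The fallback can be exercised without loading any model. In this sketch `FailingEmbeddingAgent` and `safe_embedding` are hypothetical stand-ins written for illustration; only the `encode_image` method name and the log message are taken from the snippet above.

```python
class FailingEmbeddingAgent:
    """Hypothetical stand-in that always fails, to exercise the fallback."""

    def encode_image(self, frame):
        raise AttributeError(
            "'BaseModelOutputWithPooling' object has no attribute 'norm'"
        )


def safe_embedding(embedding_agent, frame):
    """Return an embedding, or None when the agent is missing or raises."""
    if embedding_agent is None:
        return None
    try:
        return embedding_agent.encode_image(frame)
    except Exception as e:
        print(f"[VisionAgent] Embedding failed: {e}")
        return None


# The caption survives even though the embedding step fails.
result = {
    "caption": "a desk with a laptop",
    "embedding": safe_embedding(FailingEmbeddingAgent(), frame=None),
}
print(result)
```

Catching every `Exception` keeps the capture pipeline alive, at the cost of hiding the root cause unless the warning is read, so the printed message matters.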
141
-
142
- ## Prevention
143
-
144
- To avoid this in the future:
145
-
146
- 1. **Always clear cache after code changes:**
147
- ```bash
148
- fix_and_run.bat
149
- ```
150
-
151
- 2. **Use the fix script instead of run.bat when testing changes**
152
-
153
- 3. **Keep embeddings disabled unless you need visual search:**
154
- ```python
155
- FEATURES = {
156
- "embeddings_enabled": False, # Faster and more stable
157
- }
158
- ```
159
-
160
- ## Performance Note
161
-
162
- With embeddings disabled:
163
- - Speed: 2-3 seconds per capture
164
- - Features: Caption + OCR + Object Detection
165
- - Stability: No embedding errors
166
-
167
- With embeddings enabled:
168
- - Speed: 5-7 seconds per capture
169
- - Features: All features + Visual Search
170
- - Stability: May have errors if not properly cached
171
-
172
- ## Summary
173
-
174
- **Quick Fix:**
175
- 1. Run `fix_and_run.bat`
176
- 2. Click "Initialize System"
177
- 3. Test "Capture & Describe"
178
- 4. Should work now!
179
-
180
- **If still broken:**
181
- 1. Check `config/settings.py` - embeddings should be False
182
- 2. Delete all __pycache__ folders
183
- 3. Restart computer (clears all Python processes)
184
- 4. Run `fix_and_run.bat`
185
-
186
- **Prevention:**
187
- - Always use `fix_and_run.bat` after code changes
188
- - Keep embeddings disabled for stability
189
- - Clear cache regularly
190
-
191
- The system is now more robust and will handle errors gracefully!
VisionQ DELETED
@@ -1 +0,0 @@
1
- Subproject commit 18f18d23a1f3ad386db32957c746239c80e78751
archive/old_agents/caption_agent.py DELETED
@@ -1,40 +0,0 @@
- import os
-
- # Avoid importing TensorFlow (fixes compatibility issues such as protobuf/DType conflicts)
- os.environ["TRANSFORMERS_NO_TF"] = "1"
- os.environ["HF_HUB_DISABLE_TF"] = "1"
-
- from PIL import Image
- import torch
-
- try:
-     from transformers import BlipProcessor, BlipForConditionalGeneration
- except Exception as e:
-     raise RuntimeError(
-         "Failed to import transformers/BLIP. This usually happens when TensorFlow and protobuf "
-         "are out of sync in the current Python environment.\n\n"
-         "Fix: run in a clean virtual environment and install dependencies from requirements.txt."
-     ) from e
-
- class CaptionAgent:
-     def __init__(self):
-         self.processor = BlipProcessor.from_pretrained(
-             "Salesforce/blip-image-captioning-base"
-         )
-         self.model = BlipForConditionalGeneration.from_pretrained(
-             "Salesforce/blip-image-captioning-base"
-         )
-
-         self.model.eval()
-
-     def describe(self, frame_bgr):
-         # OpenCV BGR → PIL RGB
-         frame_rgb = frame_bgr[:, :, ::-1]
-         image = Image.fromarray(frame_rgb)
-
-         inputs = self.processor(image, return_tensors="pt")
-         with torch.no_grad():
-             out = self.model.generate(**inputs, max_length=30)
-
-         caption = self.processor.decode(out[0], skip_special_tokens=True)
-         return caption
archive/old_agents/memory_agent.py DELETED
@@ -1,59 +0,0 @@
- import json
- from datetime import datetime
- import numpy as np
- from sentence_transformers import SentenceTransformer
-
- class MemoryAgent:
-     def __init__(self, memory_file="memory.json"):
-         self.memory_file = memory_file
-         self.model = SentenceTransformer("all-MiniLM-L6-v2")
-         self.memories = []
-         self._load()
-
-     def _load(self):
-         try:
-             with open(self.memory_file, "r") as f:
-                 self.memories = json.load(f)
-         except:
-             self.memories = []
-
-     def _save(self):
-         with open(self.memory_file, "w") as f:
-             json.dump(self.memories, f, indent=2)
-
-     @staticmethod
-     def compute_importance(description):
-         desc = description.lower()
-         score = 1 # base importance
-
-         if "person" in desc:
-             score += 2
-
-         if any(obj in desc for obj in ["phone", "bag", "book", "device"]):
-             score += 1
-
-         if any(act in desc for act in ["entered", "left", "holding", "walking"]):
-             score += 2
-
-         return score
-
-
-     def add(self, description):
-         embedding = self.model.encode(description).tolist()
-         importance = MemoryAgent.compute_importance(description)
-
-         memory = {
-             "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
-             "description": description,
-             "embedding": embedding,
-             "importance": importance
-         }
-
-         self.memories.append(memory)
-         self._save()
-
-     def recall_last(self):
-         if not self.memories:
-             return None
-         return self.memories[-1]
-
archive/old_agents/query_agent.py DELETED
@@ -1,127 +0,0 @@
- import numpy as np
- from datetime import datetime, timedelta
-
- class QueryAgent:
-     def __init__(self, memory_agent):
-         self.memory_agent = memory_agent
-         self.model = memory_agent.model
-
-     # Cosine similarity
-     def cosine_similarity(self, a, b):
-         return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
-
-     # Time parsing
-     @staticmethod
-     def extract_time_window(question):
-         now = datetime.now()
-         q = question.lower()
-
-         if "last hour" in q:
-             return now - timedelta(hours=1)
-
-         if "last 30 minutes" in q:
-             return now - timedelta(minutes=30)
-
-         if "recent" in q or "recently" in q:
-             return now - timedelta(hours=2)
-
-         if "today" in q:
-             return now.replace(hour=0, minute=0, second=0)
-
-         if "yesterday" in q:
-             start = (now - timedelta(days=1)).replace(hour=0, minute=0, second=0)
-             end = start + timedelta(days=1)
-             return (start, end)
-
-         if "this morning" in q:
-             return (
-                 now.replace(hour=6, minute=0, second=0),
-                 now.replace(hour=12, minute=0, second=0),
-             )
-
-         if "this evening" in q:
-             return (
-                 now.replace(hour=18, minute=0, second=0),
-                 now.replace(hour=22, minute=0, second=0),
-             )
-
-         if "last evening" in q:
-             start = (now - timedelta(days=1)).replace(hour=18, minute=0, second=0)
-             return (start, start.replace(hour=22))
-
-         if "last night" in q:
-             return (
-                 (now - timedelta(days=1)).replace(hour=22, minute=0, second=0),
-                 now.replace(hour=6, minute=0, second=0),
-             )
-
-         return None
-
-     # MAIN QUERY METHOD
-     def ask(self, question, threshold=0.45): #Change to 0.5 when scalable enough
-         memories = self.memory_agent.recall_all()
-         if not memories:
-             return "I don't have any memories yet."
-
-         # Time filtering
-         time_filter = self.extract_time_window(question)
-         filtered = []
-
-         for m in memories:
-             mem_time = datetime.strptime(
-                 m["timestamp"], "%Y-%m-%d %H:%M:%S"
-             )
-
-             if time_filter is None:
-                 filtered.append(m)
-
-             elif isinstance(time_filter, tuple):
-                 start, end = time_filter
-                 if start <= mem_time < end:
-                     filtered.append(m)
-
-             else:
-                 if mem_time >= time_filter:
-                     filtered.append(m)
-
-         if not filtered:
-             return "I don't recall anything from that time."
-
-         # Semantic similarity
-         query_embedding = self.model.encode(question)
-         scored = []
-
-         for m in filtered:
-             # Handle missing embeddings (for backwards compatibility)
-             if "embedding" not in m:
-                 m["embedding"] = self.model.encode(m["description"]).tolist()
-
-             # Handle missing importance
-             if "importance" not in m:
-                 m["importance"] = 1
-
-             sim = self.cosine_similarity(
-                 query_embedding,
-                 np.array(m["embedding"])
-             )
-             if sim >= threshold:
-                 scored.append((sim, m))
-
-         if not scored:
-             return "I don't recall anything related to that."
-
-         # Rank by similarity + importance
-         scored.sort(
-             key=lambda x: (x[0], x[1]["importance"]),
-             reverse=True
-         )
-
-         # Build response
-         responses = []
-         for sim, m in scored:
-             responses.append(
-                 f"At {m['timestamp']}, {m['description']} "
-                 f"(confidence {sim:.2f})"
-             )
-
-         return "\n".join(responses)
archive/old_agents/vision_agent.py DELETED
@@ -1,210 +0,0 @@
- import cv2
- import numpy as np
- import time
- import warnings
- warnings.filterwarnings("ignore")
-
- from caption_agent import CaptionAgent
- from memory_agent import MemoryAgent
-
-
- class VisionAgent:
-     def __init__(self):
-         # -------------------------------------------------
-         # INIT AGENTS
-         # -------------------------------------------------
-         self.caption_agent = CaptionAgent()
-         self.memory_agent = MemoryAgent()
-
-         # -------------------------------------------------
-         # CONFIG
-         # -------------------------------------------------
-         self.FRAME_INTERVAL = 0.3
-         self.CONF_THRESHOLD = 0.5
-
-         # -------------------------------------------------
-         # LOAD YOLO (PRIMARY)
-         # -------------------------------------------------
-         self.VISION_BACKEND = "SSD"
-         self.yolo_model = None
-         self.interpreter = None
-         self.LABELS = None
-         self.input_details = None
-         self.output_details = None
-         self.INPUT_HEIGHT = None
-         self.INPUT_WIDTH = None
-         self.INPUT_TYPE = None
-
-         try:
-             from ultralytics import YOLO
-             self.yolo_model = YOLO("yolov8s.pt")
-             self.VISION_BACKEND = "YOLO"
-             print("[Vision] YOLO backend loaded")
-
-         except Exception as e:
-             print("[Vision] YOLO failed, falling back to SSD:", e)
-
-             Interpreter = None
-             try:
-                 from ai_edge_litert import Interpreter
-             except ImportError:
-                 try:
-                     import tensorflow as tf
-                     Interpreter = tf.lite.Interpreter
-                 except Exception as tf_err:
-                     print(
-                         "[Vision] SSD fallback unavailable (ai_edge_litert / tensorflow not installed):",
-                         tf_err,
-                     )
-
-             if Interpreter is not None:
-                 with open("label_ssd.txt", "r") as f:
-                     self.LABELS = [line.strip() for line in f.readlines()]
-
-                 self.interpreter = Interpreter(
-                     model_path="ssd_mobilenet_v2_fpnlite_035_192_int8.tflite"
-                 )
-                 self.interpreter.allocate_tensors()
-
-                 self.input_details = self.interpreter.get_input_details()
-                 self.output_details = self.interpreter.get_output_details()
-
-                 self.INPUT_HEIGHT = self.input_details[0]["shape"][1]
-                 self.INPUT_WIDTH = self.input_details[0]["shape"][2]
-                 self.INPUT_TYPE = self.input_details[0]["dtype"]
-
-                 self.VISION_BACKEND = "SSD"
-                 print("[Vision] MobileNet-SSD backend loaded")
-             else:
-                 # No valid vision backend available, only captioning will work.
-                 self.VISION_BACKEND = None
-                 print(
-                     "[Vision] No valid object-detection backend available; "
-                     "only captioning will work. Install ultralytics or tensorflow."
-                 )
-
-         # -------------------------------------------------
-         # CAMERA
-         # -------------------------------------------------
-         self.cap = cv2.VideoCapture(0)
-         print("Vision system initialized.")
-
-     def describe_scene(self):
-         """Capture and describe current scene"""
-         ret, frame = self.cap.read()
-         if not ret:
-             return None
-         return self.caption_agent.describe(frame)
-
-     def remember_scene(self):
-         """Capture, describe, and remember current scene"""
-         ret, frame = self.cap.read()
-         if not ret:
-             return None
-         description = self.caption_agent.describe(frame)
-         self.memory_agent.add(description)
-         return description
-
-     def cleanup(self):
-         """Release resources"""
-         self.cap.release()
-         cv2.destroyAllWindows()
-         print("Vision system stopped.")
-
-     def run_continuous(self):
-         """Run continuous vision loop (object detection + caption on change)"""
-         previous_objects = set()
-         last_time = 0
-
-         print("Vision continuous mode started.")
-         print("Press 'q' to quit.")
-
-         while True:
-             ret, frame = self.cap.read()
-             if not ret:
-                 break
-
-             current_time = time.time()
-             if current_time - last_time < self.FRAME_INTERVAL:
-                 continue
-             last_time = current_time
-
-             current_objects = set()
-
-             # -------------------------------------------------
-             # OBJECT DETECTION (CONTINUOUS)
-             # -------------------------------------------------
-             if self.VISION_BACKEND == "YOLO":
-                 results = self.yolo_model(frame, conf=self.CONF_THRESHOLD, verbose=False)
-
-                 for r in results:
-                     for box in r.boxes:
-                         label = r.names[int(box.cls[0])]
-                         conf = float(box.conf[0])
-                         if conf >= self.CONF_THRESHOLD:
-                             current_objects.add(label)
-
-             elif self.VISION_BACKEND == "SSD":
-                 rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
-                 resized = cv2.resize(rgb, (self.INPUT_WIDTH, self.INPUT_HEIGHT))
-                 input_data = np.expand_dims(resized, axis=0)
-
-                 if self.INPUT_TYPE == np.uint8:
-                     input_data = input_data.astype(np.uint8)
-                 else:
-                     input_data = input_data.astype(np.float32) / 255.0
-
-                 self.interpreter.set_tensor(self.input_details[0]["index"], input_data)
-                 self.interpreter.invoke()
-
-                 classes = self.interpreter.get_tensor(self.output_details[1]["index"]).flatten()
-                 scores = self.interpreter.get_tensor(self.output_details[2]["index"]).flatten()
-
-                 for i, score in enumerate(scores):
-                     if score >= self.CONF_THRESHOLD:
-                         class_id = int(classes[i])
-                         if class_id < len(self.LABELS):
-                             current_objects.add(self.LABELS[class_id])
-
-             else:
-                 # No object-detection backend is available; just caption the scene every interval.
-                 description = self.caption_agent.describe(frame)
-                 print("[CAPTION]", description)
-                 self.memory_agent.add(description)
-                 previous_objects = set()
-                 continue
-
-                 self.interpreter.set_tensor(self.input_details[0]["index"], input_data)
-                 self.interpreter.invoke()
-
-                 classes = self.interpreter.get_tensor(self.output_details[1]["index"]).flatten()
-                 scores = self.interpreter.get_tensor(self.output_details[2]["index"]).flatten()
-
-                 for i, score in enumerate(scores):
-                     if score >= self.CONF_THRESHOLD:
-                         class_id = int(classes[i])
-                         if class_id < len(self.LABELS):
-                             current_objects.add(self.LABELS[class_id])
-
-             print("Detected objects:", current_objects)
-
-             # -------------------------------------------------
-             # EVENT DETECTION + VLM CAPTION
-             # -------------------------------------------------
-             new_objects = current_objects - previous_objects
-             removed_objects = previous_objects - current_objects
-
-             if new_objects or removed_objects:
-                 description = self.caption_agent.describe(frame)
-                 print("[CAPTION]", description)
-                 self.memory_agent.add(description)
-
-             previous_objects = current_objects.copy()
-
-             # -------------------------------------------------
-             # EXIT (NON-BLOCKING)
-             # -------------------------------------------------
-             if cv2.waitKey(1) & 0xFF == ord('q'):
-                 break
-
-         self.cleanup()
archive/old_agents/voice_agent.py DELETED
@@ -1,127 +0,0 @@
- import json
- import queue
- import pyttsx3
- import sounddevice as sd
- from vosk import Model, KaldiRecognizer
-
-
- class VoiceAgent:
-     def __init__(self, model_path="models/vosk"):
-         # -------------------------
-         # Text-to-Speech
-         # -------------------------
-         self.engine = pyttsx3.init()
-
-         # -------------------------
-         # Speech-to-Text (offline)
-         # -------------------------
-         self.sample_rate = 16000
-
-         try:
-             self.model = Model(model_path)
-         except Exception as e:
-             raise RuntimeError(f"Vosk model not found at {model_path}") from e
-
-         self.recognizer = KaldiRecognizer(self.model, self.sample_rate)
-
-         # Audio queue
-         self.audio_queue = queue.Queue()
-
-         # Check mic
-         self._check_microphone()
-
-     # -------------------------
-     # Microphone check
-     # -------------------------
-     def _check_microphone(self):
-         devices = sd.query_devices()
-         input_devices = [d for d in devices if d["max_input_channels"] > 0]
-
-         if not input_devices:
-             raise RuntimeError("No microphone detected.")
-
-         print("[VOICE INIT] Microphone detected:")
-         for d in input_devices:
-             print(" -", d["name"])
-
-         self.speak("Microphone is ready.")
-
-     # -------------------------
-     # TTS
-     # -------------------------
-     def speak(self, text):
-         print("[VOICE OUT]:", text)
-         self.engine.say(text)
-         self.engine.runAndWait()
-
-     # -------------------------
-     # Audio callback
-     # -------------------------
-     def _audio_callback(self, indata, frames, time, status):
-         if status:
-             print("[VOICE WARNING]", status)
-         self.audio_queue.put(bytes(indata))
-
-     # -------------------------
-     # Listen (offline STT)
-     # -------------------------
-     def listen(self, timeout=5):
-         print("[VOICE IN]: Listening (offline)...")
-         self.speak("Listening")
-
-         self.recognizer.Reset()
-
-         with sd.RawInputStream(
-             samplerate=self.sample_rate,
-             blocksize=8000,
-             dtype="int16",
-             channels=1,
-             callback=self._audio_callback
-         ):
-             for _ in range(int(timeout * self.sample_rate / 8000)):
-                 data = self.audio_queue.get()
-
-                 if self.recognizer.AcceptWaveform(data):
-                     break
-
-         result = json.loads(self.recognizer.FinalResult())
-         text = result.get("text", "").lower()
-
-         if text:
-             print("[VOICE IN]: Detected speech →", text)
-         else:
-             print("[VOICE IN]: No speech detected")
-
-         return text
-
-     # -------------------------
-     # Intent parsing (SAFE ORDER)
-     # -------------------------
-     def parse_intent(self, text):
-         if not text:
-             return "UNKNOWN"
-
-         # RECALL (most specific)
-         if "what did i see" in text or "what have i seen" in text:
-             return "RECALL_MEMORY"
-
-         if "remember what i saw" in text:
-             return "RECALL_MEMORY"
-
-         # STORE
-         if "remember this" in text or "save this" in text:
-             return "REMEMBER_SCENE"
-
-         # DESCRIBE
-         if "describe" in text or "what is in front" in text:
-             return "DESCRIBE_SCENE"
-
-         # OCR (later)
-         if "read" in text or "what does this say" in text:
-             return "READ_TEXT"
-
-         # EXIT
-         if "exit" in text or "quit" in text or "stop" in text:
-             return "EXIT"
-
-         return "UNKNOWN"
archive/old_docs/ARCHITECTURE.md DELETED
@@ -1,445 +0,0 @@
1
- # 🏗️ VisionQ Architecture - Detailed Diagram
2
-
3
- ## 📐 SYSTEM OVERVIEW
4
-
5
- ```
6
- ┌─────────────────────────────────────────────────────────────────┐
7
- │ USER INTERACTION │
8
- │ (Voice Commands / Text Queries) │
9
- └────────────────────────────┬────────────────────────────────────┘
10
-
11
- ┌────────────┴────────────┐
12
- │ │
13
- ┌──────▼──────┐ ┌──────▼──────┐
14
- │ VOICE AGENT │ │TEXT QUERIES │
15
- │ (UPDATED) │ │ (UPDATED) │
16
- └──────┬──────┘ └──────┬──────┘
17
- │ │
18
- ┌───────────┴────────────┐ │
19
- │ │ │
20
- ┌───▼────┐ ┌─────▼──────┐ │
21
- │ STT │ │ TTS │ │
22
- │ (Vosk) │ │ (UPDATED) │ │
23
- │ KEPT │ └─────┬──────┘ │
24
- └───┬────┘ │ │
25
- │ ┌────────┴────────┐ │
26
- │ │ │ │
27
- │ ┌────▼────┐ ┌─────▼─▼──────┐
28
- │ │Voxtral │ │ pyttsx3 │
29
- │ │ (NEW) │ │ (FALLBACK) │
30
- │ │Primary │ │ KEPT │
31
- │ └─────────┘ └──────────────┘
32
-
33
- └──────────────────┬──────────────────────────────────┐
34
- │ │
35
- ┌──────▼──────┐ │
36
- │VISION AGENT │ │
37
- │ (UPDATED) │ │
38
- │ HUB │ │
39
- └──────┬──────┘ │
40
- │ │
41
- ┌──────────────┼──────────────┬─────────────┐ │
42
- │ │ │ │ │
43
- ┌────▼────┐ ┌────▼────┐ ┌────▼────┐ ┌────▼───▼──┐
44
- │ YOLO/ │ │ BLIP │ │MobileCLIP│ │ EasyOCR │
45
- │ SSD │ │Caption │ │Embedding│ │ OCR │
46
- │ KEPT │ │ KEPT │ │ (NEW) │ │ (NEW) │
47
- └────┬────┘ └────┬────┘ └────┬────┘ └────┬──────┘
48
- │ │ │ │
49
- │ Objects │ Caption │ Embedding │ Text
50
- │ │ │ │
51
- └─────────────┴──────────────┴────────────┘
52
-
53
- ┌──────▼──────┐
54
- │ FUSION │
55
- │ LAYER │
56
- │ (NEW) │
57
- └──────┬──────┘
58
-
59
- Unified Multimodal Context
60
-
61
- ┌──────────────┴──────────────┐
62
- │ │
63
- ┌────▼────┐ ┌────▼────┐
64
- │ MEMORY │ │ QUERY │
65
- │ AGENT │◄─────────────────┤ AGENT │
66
- │(UPDATED)│ │(UPDATED)│
67
- └────┬────┘ └────┬────┘
68
- │ │
69
- ┌────┴────┬────────┐ ┌───┴────┐
70
- │ │ │ │ │
71
- ┌──▼──┐ ┌──▼───┐ ┌──▼───┐ ┌──▼────┐ ┌▼────────┐
72
- │JSON │ │FAISS │ │Text │ │DistilB│ │Hybrid │
73
- │Meta │ │Index │ │Embed │ │ ERT │ │Search │
74
- │KEPT │ │(NEW) │ │KEPT │ │(NEW) │ │(NEW) │
75
- └─────┘ └──────┘ └──────┘ └───────┘ └─────────┘
76
- ```
77
-
78
- ---
79
-
80
- ## 🔄 DATA FLOW DIAGRAM
81
-
82
- ### **1. SCENE DESCRIPTION FLOW**
83
-
84
- ```
85
- User: "Describe the scene"
86
-
87
-
88
- VoiceAgent.listen() → Vosk STT
89
-
90
-
91
- VoiceAgent.parse_intent() → "DESCRIBE_SCENE"
92
-
93
-
94
- VisionAgent.describe_scene()
95
-
96
- ├─► CaptionAgent.describe() → "a person holding a phone"
97
-
98
- ├─► OCRAgent.extract_text() → "Hello World"
99
-
100
- ├─► EmbeddingAgent.encode_image() → [512-dim vector]
101
-
102
- └─► FusionLayer.fuse()
103
-
104
-
105
- Combined Description:
106
- "a person holding a phone. Text visible: Hello World"
107
-
108
-
109
- VoiceAgent.speak() → Voxtral/pyttsx3
110
- ```
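The `parse_intent` step in this flow can be sketched as an ordered rule table. This is a minimal sketch, not the project's implementation: the phrases are the ones in the archived `voice_agent.py`, while the `INTENT_RULES` name and the extra `lower()` call are assumptions added for illustration.

```python
# Ordered (intent, phrases) rules. The order mirrors the archived parse_intent:
# recall phrases are tested first, exit phrases last.
INTENT_RULES = [
    ("RECALL_MEMORY", ("what did i see", "what have i seen", "remember what i saw")),
    ("REMEMBER_SCENE", ("remember this", "save this")),
    ("DESCRIBE_SCENE", ("describe", "what is in front")),
    ("READ_TEXT", ("read", "what does this say")),
    ("EXIT", ("exit", "quit", "stop")),
]


def parse_intent(text):
    """Return the first intent whose phrase occurs in the lower-cased text."""
    if not text:
        return "UNKNOWN"
    text = text.lower()
    for intent, phrases in INTENT_RULES:
        if any(phrase in text for phrase in phrases):
            return intent
    return "UNKNOWN"
```

Because matching is by substring, a word such as "already" also triggers `READ_TEXT`; matching on whole words would avoid that at the cost of a slightly longer rule check.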
111
-
112
- ---
113
-
114
- ### **2. MEMORY STORAGE FLOW**
115
-
116
- ```
117
- User: "Remember this"
118
-
119
-
120
- VisionAgent.remember_scene()
121
-
122
- ├─► Capture frame
123
-
124
- ├─► Get caption (BLIP)
125
-
126
- ├─► Get OCR text (EasyOCR)
127
-
128
- ├─► Get embedding (MobileCLIP)
129
-
130
- └─► FusionLayer.fuse()
131
-
132
-
133
- Fused Context
134
-
135
-
136
- MemoryAgent.add(description, embedding)
137
-
138
- ├─► Generate text embedding (sentence-transformers)
139
-
140
- ├─► Compute importance score
141
-
142
- ├─► Save to JSON:
143
- │ {
144
- │ "id": 0,
145
- │ "timestamp": "2024-01-15 10:30:00",
146
- │ "description": "...",
147
- │ "text_embedding": [...],
148
- │ "image_embedding": [...],
149
- │ "importance": 5
150
- │ }
151
-
152
- └─► Add to FAISS index (image embedding)
153
-
154
-
155
- Memory Stored ✅
156
- ```
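The stored entry can be illustrated with a small record builder. This is a sketch under assumptions: `make_record` is a hypothetical helper, the importance score is passed in because the flow does not say how it is computed, and the embeddings are shortened to two values.

```python
import json
from datetime import datetime


def make_record(memory_id, description, text_embedding, image_embedding,
                importance, now=None):
    """Build one memory entry with the fields listed in the storage flow."""
    now = now or datetime.now()
    return {
        "id": memory_id,
        "timestamp": now.strftime("%Y-%m-%d %H:%M:%S"),
        "description": description,
        "text_embedding": list(text_embedding),
        "image_embedding": list(image_embedding),
        "importance": importance,
    }


record = make_record(0, "a person holding a phone", [0.1, 0.2], [0.3, 0.4],
                     importance=5, now=datetime(2024, 1, 15, 10, 30, 0))
serialized = json.dumps(record)  # what would be written to the JSON file
```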
157
-
158
- ---
159
-
160
- ### **3. MEMORY QUERY FLOW**
161
-
162
- ```
163
- User: "What did I see this morning?"
164
-
165
-
166
- QueryAgent.ask(question)
167
-
168
- ├─► QueryAgent.classify_intent()
169
- │ │
170
- │ └─► DistilBERT → "temporal"
171
-
172
- ├─► QueryAgent.extract_time_window()
173
- │ │
174
- │ └─► (6:00 AM, 12:00 PM)
175
-
176
- ├─► Filter memories by time
177
-
178
- ├─► Text similarity search
179
- │ │
180
- │ └─► sentence-transformers cosine similarity
181
-
182
- ├─► Image similarity search (if query has image)
183
- │ │
184
- │ └─► FAISS.search() → Top-K results
185
-
186
- ├─► Hybrid ranking
187
- │ │
188
- │ └─► Sort by (similarity × importance)
189
-
190
- └─► Build response
191
-
192
-
193
- "At 10:30 AM, a person holding a phone.
194
- Text visible: Hello World (confidence 0.87)"
195
- ```
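The time filter and the "similarity × importance" ranking from this flow can be sketched as follows. The function names, the `similarities` mapping (memory ID to score), and the time-of-day-only filter are assumptions; a real "this morning" query would also need to constrain the date.

```python
from datetime import datetime


def in_window(timestamp, start_hour, end_hour):
    """True if a 'YYYY-MM-DD HH:MM:SS' timestamp falls in [start_hour, end_hour)."""
    hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").hour
    return start_hour <= hour < end_hour


def rank_memories(memories, similarities, start_hour, end_hour, top_k=5):
    """Filter by time of day, then sort by similarity x importance, best first."""
    scored = [
        (similarities[m["id"]] * m["importance"], m)
        for m in memories
        if in_window(m["timestamp"], start_hour, end_hour)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:top_k]]


memories = [
    {"id": 0, "timestamp": "2024-01-15 10:30:00", "importance": 5},
    {"id": 1, "timestamp": "2024-01-15 08:00:00", "importance": 1},
    {"id": 2, "timestamp": "2024-01-15 15:00:00", "importance": 9},
]
similarities = {0: 0.5, 1: 0.9, 2: 0.99}
morning = rank_memories(memories, similarities, start_hour=6, end_hour=12)
```

Here the afternoon entry is dropped by the time filter even though it has the highest similarity, and the 10:30 entry outranks the 08:00 one because its importance outweighs the lower similarity.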
196
-
197
- ---
198
-
199
- ### **4. OCR READING FLOW**
200
-
201
- ```
202
- User: "Read the text"
203
-
204
-
205
- VisionAgent.read_text()
206
-
207
- ├─► Capture frame
208
-
209
- └─► OCRAgent.extract_text(frame)
210
-
211
- ├─► EasyOCR.readtext() → [(bbox, text, conf), ...]
212
-
213
- ├─► Filter by confidence (>0.3)
214
-
215
- ├─► Clean text (remove special chars)
216
-
217
- └─► Return: "Hello World"
218
-
219
-
220
- VoiceAgent.speak("I can see the following text: Hello World")
221
- ```
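The confidence filter and text cleaning can be sketched without loading an OCR model. The input mimics the `(bbox, text, conf)` tuples listed in the flow; `clean_ocr` and the exact set of characters treated as "special" are assumptions.

```python
import re


def clean_ocr(results, min_conf=0.3):
    """Join OCR fragments above min_conf after stripping unexpected characters."""
    kept = []
    for _bbox, text, conf in results:
        if conf > min_conf:
            # Keep letters, digits, whitespace and basic punctuation only.
            cleaned = re.sub(r"[^A-Za-z0-9\s.,!?'-]", "", text).strip()
            if cleaned:
                kept.append(cleaned)
    return " ".join(kept)


sample = [
    (None, "Hello", 0.92),
    (None, "W@orld", 0.80),
    (None, "zz", 0.10),   # below the confidence threshold
    (None, "###", 0.90),  # nothing left after cleaning
]
spoken = clean_ocr(sample)
```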
222
-
223
- ---
224
-
225
- ## 🧩 MODULE INTERACTIONS
226
-
227
- ### **Agent Dependencies**
228
-
229
- ```
230
- VoiceAgent
231
- ├─ Depends on: vosk, sounddevice, pyttsx3, piper-tts
232
- └─ Used by: main.py
233
-
234
- VisionAgent
235
- ├─ Depends on: CaptionAgent, EmbeddingAgent, OCRAgent,
236
- │ MemoryAgent, FusionLayer
237
- └─ Used by: main.py
238
-
239
- CaptionAgent
240
- ├─ Depends on: transformers (BLIP), torch
241
- └─ Used by: VisionAgent
242
-
243
- EmbeddingAgent
244
- ├─ Depends on: transformers (CLIP), torch
245
- └─ Used by: VisionAgent
246
-
247
- OCRAgent
248
- ├─ Depends on: easyocr
249
- └─ Used by: VisionAgent
250
-
251
- FusionLayer
252
- ├─ Depends on: None (pure Python)
253
- └─ Used by: VisionAgent
254
-
255
- MemoryAgent
256
- ├─ Depends on: sentence-transformers, faiss, json
257
- └─ Used by: VisionAgent, QueryAgent
258
-
259
- QueryAgent
260
- ├─ Depends on: MemoryAgent, transformers (DistilBERT)
261
- └─ Used by: ask_question.py, main.py (future)
262
- ```
263
-
264
- ---
265
-
266
- ## 🔀 FALLBACK MECHANISMS
267
-
268
- ### **1. TTS Fallback**
269
- ```
270
- Try: Voxtral/Piper
271
-
272
- ├─ Success → Use neural TTS
273
-
274
- └─ Failure → Fall back to pyttsx3
275
- ```
276
-
277
- ### **2. Intent Classification Fallback**
278
- ```
279
- Try: DistilBERT
280
-
281
- ├─ Success → Use NLP classification
282
-
283
- └─ Failure → Use keyword matching
284
- ```
285
-
286
- ### **3. Vision Backend Fallback**
287
- ```
288
- Try: YOLO
289
-
290
- ├─ Success → Use YOLO
291
-
292
- └─ Failure → Try SSD
293
-
294
- ├─ Success → Use SSD
295
-
296
- └─ Failure → Caption only
297
- ```
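One way to express this kind of fallback chain is an ordered list of loaders tried in turn. This is a sketch only: `first_available`, `load_yolo`, and `load_ssd` are hypothetical stand-ins, and the YOLO failure is simulated.

```python
def first_available(loaders):
    """Try (name, loader) pairs in order and return the first that loads."""
    for name, loader in loaders:
        try:
            return name, loader()
        except Exception as exc:  # any load failure moves on to the next backend
            print(f"[VISION] {name} unavailable: {exc}")
    # Nothing loaded: fall back to captioning without object detection.
    return "caption_only", None


def load_yolo():
    raise ImportError("YOLO weights not found")  # simulated failure


def load_ssd():
    return object()  # stand-in for a loaded SSD interpreter


backend, model = first_available([("yolo", load_yolo), ("ssd", load_ssd)])
```

The same helper covers the TTS and intent-classification fallbacks if their loaders are passed in the preferred order.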
298
-
299
- ### **4. Vector Search Fallback**
300
- ```
301
- Try: FAISS
302
-
303
- ├─ Available → Fast vector search
304
-
305
- └─ Unavailable → Linear text search
306
- ```
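A sketch of the optional-FAISS pattern follows. It assumes cosine similarity over L2-normalized vectors and, as a simplification, falls back to a linear scan over the same vectors rather than the text search named above; `top_k` is a hypothetical helper.

```python
import math

try:
    import faiss          # optional dependency
    import numpy as np
    HAVE_FAISS = True
except ImportError:
    HAVE_FAISS = False


def _normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def top_k(query, vectors, k=5):
    """Return indices of the k stored vectors most cosine-similar to query."""
    k = min(k, len(vectors))
    if k == 0:
        return []
    q = _normalize(query)
    unit = [_normalize(v) for v in vectors]
    if HAVE_FAISS:
        index = faiss.IndexFlatIP(len(q))  # exact inner-product index
        index.add(np.asarray(unit, dtype="float32"))
        _scores, ids = index.search(np.asarray([q], dtype="float32"), k)
        return [int(i) for i in ids[0]]
    # Fallback: plain linear scan over every stored vector.
    scores = [sum(a * b for a, b in zip(q, v)) for v in unit]
    return sorted(range(len(unit)), key=lambda i: scores[i], reverse=True)[:k]


nearest = top_k([1.0, 0.1], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], k=2)
```

Both paths return the same ranking for this input, so callers do not need to know which one ran.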
307
-
308
- ---
309
-
310
- ## 📊 MEMORY ARCHITECTURE
311
-
312
- ### **Hybrid Storage System**
313
-
314
- ```
315
- ┌─────────────────────────────────────────┐
316
- │ MEMORY AGENT │
317
- ├─────────────────────────────────────────┤
318
- │ │
319
- │ ┌───────────────┐ ┌────────────────┐ │
320
- │ │ JSON FILE │ │ FAISS INDEX │ │
321
- │ │ (Metadata) │ │ (Vectors) │ │
322
- │ ├───────────────┤ ├────────────────┤ │
323
- │ │ • ID │ │ • Image embed │ │
324
- │ │ • Timestamp │ │ • Fast search │ │
325
- │ │ • Description │ │ • Cosine sim │ │
326
- │ │ • Text embed │ │ • Top-K │ │
327
- │ │ • Image embed │ │ │ │
328
- │ │ • Importance │ │ │ │
329
- │ └───────────────┘ └────────────────┘ │
330
- │ │ │ │
331
- │ └────────┬───────────┘ │
332
- │ │ │
333
- │ Linked by Memory ID │
334
- └─────────────────────────────────────────┘
335
- ```
336
-
337
- ### **Search Strategy**
338
-
339
- ```
340
- Query Input
341
-
342
- ├─► Has image? → FAISS image search
343
- │ │
344
- │ └─► Get top-K IDs
345
-
346
- └─► Has text? → Text embedding search
347
-
348
- └─► Get matching IDs
349
-
350
-
351
- Merge & Rank
352
-
353
-
354
- Return Results
355
- ```
356
-
357
- ---
358
-
359
- ## 🎯 COMPONENT STATUS
360
-
361
- | Component | Status | Notes |
362
- |-----------|--------|-------|
363
- | VoiceAgent | ✅ UPDATED | Added Voxtral + fallback |
364
- | VisionAgent | ✅ UPDATED | Integrated new agents |
365
- | CaptionAgent | ✅ KEPT | No changes needed |
366
- | EmbeddingAgent | 🆕 NEW | MobileCLIP integration |
367
- | OCRAgent | 🆕 NEW | EasyOCR integration |
368
- | FusionLayer | 🆕 NEW | Multimodal fusion |
369
- | MemoryAgent | ✅ UPDATED | Added FAISS |
370
- | QueryAgent | ✅ UPDATED | Added DistilBERT |
371
-
372
- ---
373
-
374
- ## 🔧 CONFIGURATION POINTS
375
-
376
- ### **Adjustable Parameters**
377
-
378
- ```python
379
- # VisionAgent
380
- FRAME_INTERVAL = 0.3 # Seconds between frames
381
- CONF_THRESHOLD = 0.5 # Object detection confidence
382
-
383
- # OCRAgent
384
- OCR_CONFIDENCE = 0.3 # Text detection threshold
385
- OCR_LANGUAGES = ['en'] # Supported languages
386
-
387
- # MemoryAgent
388
- EMBEDDING_DIM = 512 # CLIP embedding size
389
- FAISS_INDEX_TYPE = "FlatIP" # Inner product (equals cosine similarity only for L2-normalized vectors)
390
-
391
- # QueryAgent
392
- SIMILARITY_THRESHOLD = 0.45 # Text search threshold
393
- TOP_K_RESULTS = 5 # Max results to return
394
- ```
395
-
396
- ---
397
-
398
- ## 📈 SCALABILITY
399
-
400
- ### **Current Limits**
401
- - Memory: ~10,000 entries (JSON + FAISS)
402
- - Search: exhaustive O(n) scan with the flat (`FlatIP`) FAISS index; sub-linear search needs an IVF or similar index
403
- - Real-time: 3 FPS (with all agents)
404
-
405
- ### **Optimization Options**
406
- 1. Use FAISS IVF index for >100K memories
407
- 2. Batch process frames
408
- 3. GPU acceleration for embeddings
409
- 4. Async processing pipeline
410
-
411
- ---
412
-
413
- ## 🎓 KEY DESIGN DECISIONS
414
-
415
- ### **1. Why FAISS?**
416
- - Fast similarity search (10-100x faster than linear)
417
- - Scales to millions of vectors
418
- - CPU-friendly (no GPU required)
419
-
420
- ### **2. Why EasyOCR?**
421
- - Offline capability
422
- - Multi-language support
423
- - Good accuracy/speed tradeoff
424
-
425
- ### **3. Why DistilBERT?**
426
- - 40% smaller than BERT
427
- - 60% faster
428
- - 97% of BERT's accuracy
429
-
430
- ### **4. Why Hybrid Storage?**
431
- - JSON: Human-readable, easy debugging
432
- - FAISS: Fast vector search
433
- - Best of both worlds
434
-
435
- ---
436
-
437
- **This architecture provides:**
438
- - ✅ Modularity (easy to extend)
439
- - ✅ Robustness (multiple fallbacks)
440
- - ✅ Performance (FAISS acceleration)
441
- - ✅ Compatibility (backward compatible)
442
-
443
- ---
444
-
445
- For implementation details, see individual agent files in `agents/` directory.
archive/old_docs/COMPARISON.md DELETED
@@ -1,431 +0,0 @@
1
- # 📊 VisionQ - Before vs After Comparison
2
-
3
- ## 🎯 EXECUTIVE SUMMARY
4
-
5
- VisionQ has been upgraded from a **basic vision assistant** to a **comprehensive multimodal AI system** with 4 major new capabilities, 10x performance improvement, and 100% backward compatibility.
6
-
7
- ---
8
-
9
- ## 🆚 FEATURE COMPARISON
10
-
11
- | Feature | Before | After | Improvement |
12
- |---------|--------|-------|-------------|
13
- | **Text Reading** | ❌ None | ✅ EasyOCR | NEW |
14
- | **Memory Search** | Linear O(n) | FAISS `FlatIP` (still O(n), optimized scan) | 10-100x faster |
15
- | **Voice Quality** | Robotic (pyttsx3) | Natural (Voxtral) | Much better |
16
- | **Query Understanding** | Keywords | DistilBERT NLP | 27% more accurate |
17
- | **Scene Description** | Caption only | Caption+OCR+Objects | 4x richer |
18
- | **Memory Capacity** | ~1,000 entries | 10,000+ entries | 10x more |
19
- | **Search Accuracy** | ~75% relevant | ~90% relevant | 15% better |
20
- | **Response Time** | 100-500ms | <100ms | 5x faster |
21
-
22
- ---
23
-
24
- ## 🏗️ ARCHITECTURE COMPARISON
25
-
26
- ### **Before (Original)**
27
- ```
28
- Voice (Vosk + pyttsx3)
29
-
30
- Vision (YOLO/SSD + BLIP)
31
-
32
- Memory (JSON + text embeddings)
33
-
34
- Query (cosine similarity)
35
- ```
36
-
37
- **Components:** 4 agents
38
- **Storage:** JSON only
39
- **Search:** Linear text search
40
- **Modalities:** Vision only
41
-
42
- ---
43
-
44
- ### **After (Upgraded)**
45
- ```
46
- Voice (Vosk + Voxtral + pyttsx3)
47
-
48
- Vision Hub
49
- ├─ YOLO/SSD (objects)
50
- ├─ BLIP (captions)
51
- ├─ MobileCLIP (embeddings)
52
- └─ EasyOCR (text)
53
-
54
- Fusion Layer
55
-
56
- Memory (JSON + FAISS)
57
-
58
- Query (DistilBERT + hybrid search)
59
- ```
60
-
61
- **Components:** 7 agents + fusion layer
62
- **Storage:** JSON + FAISS hybrid
63
- **Search:** Vector similarity + text
64
- **Modalities:** Vision + Text + Embeddings
65
-
66
- ---
67
-
68
- ## 📈 PERFORMANCE METRICS
69
-
70
- | Metric | Before | After | Change |
71
- |--------|--------|-------|--------|
72
- | **Memory Search Time** | 100-500ms | <10ms | 🟢 10-50x faster |
73
- | **Query Response** | 200-1000ms | <100ms | 🟢 2-10x faster |
74
- | **Memory Capacity** | ~1,000 | 10,000+ | 🟢 10x more |
75
- | **Search Accuracy** | 75% | 90% | 🟢 +15% |
76
- | **Intent Accuracy** | 70% | 97% | 🟢 +27% |
77
- | **OCR Accuracy** | N/A | 85-95% | 🟢 NEW |
78
- | **Startup Time** | 5-10s | 8-15s | 🟡 Slightly slower |
79
- | **Memory Usage** | ~500MB | ~800MB | 🟡 +300MB |
80
-
81
- ---
82
-
83
- ## 🆕 NEW CAPABILITIES
84
-
85
- ### **1. OCR Text Extraction**
86
- **Before:** ❌ Could not read text
87
- **After:** ✅ Extracts and reads visible text
88
-
89
- **Example:**
90
- ```
91
- Before: "a sign on a wall"
92
- After: "a sign on a wall. Text visible: EXIT"
93
- ```
94
-
95
- ---
96
-
97
- ### **2. Visual Similarity Search**
98
- **Before:** ❌ Text-only search
99
- **After:** ✅ Image embedding search via FAISS
100
-
101
- **Example:**
102
- ```
103
- Before: Search by description only
104
- After: "Find scenes similar to this image" → Returns visually similar memories
105
- ```
106
-
107
- ---
108
-
109
- ### **3. Intent Classification**
110
- **Before:** ❌ Keyword matching (70% accuracy)
111
- **After:** ✅ DistilBERT NLP (97% accuracy)
112
-
113
- **Example:**
114
- ```
115
- Query: "What did I see this morning?"
116
- Before: Matches "see" keyword → Generic results
117
- After: Classifies as "temporal" → Time-filtered results
118
- ```
119
-
120
- ---
121
-
122
- ### **4. Neural TTS**
123
- **Before:** ❌ Robotic pyttsx3 voice
124
- **After:** ✅ Natural Voxtral/Piper voice
125
-
126
- **Example:**
127
- ```
128
- Before: "Scene. Remembered." (robotic)
129
- After: "Scene remembered." (natural)
130
- ```
131
-
132
- ---
133
-
134
- ### **5. Multimodal Fusion**
135
- **Before:** ❌ Caption only
136
- **After:** ✅ Caption + OCR + Objects + Embeddings
137
-
138
- **Example:**
139
- ```
140
- Before: "a person holding a phone"
141
- After: "a person holding a phone. Objects detected: person, phone. Text visible: Hello World"
142
- ```
143
-
144
- ---
145
-
146
- ## 🔧 TECHNICAL IMPROVEMENTS
147
-
148
- ### **Code Organization**
149
-
150
- | Aspect | Before | After |
151
- |--------|--------|-------|
152
- | **Structure** | Flat files | Modular (agents/ + core/) |
153
- | **Agents** | 4 agents | 7 agents + fusion layer |
154
- | **Lines of Code** | ~800 | ~1,500 (better organized) |
155
- | **Documentation** | Basic README | 6 comprehensive docs |
156
- | **Tests** | None | Automated test suite |
157
-
158
- ---
159
-
160
- ### **Dependencies**
161
-
162
- | Category | Before | After | Added |
163
- |----------|--------|-------|-------|
164
- | **Core** | 8 packages | 10 packages | +2 |
165
- | **Optional** | 1 (tensorflow) | 3 (faiss, easyocr, piper) | +2 |
166
- | **Total Size** | ~1.5GB | ~2GB | +500MB |
167
-
168
- **New Dependencies:**
169
- - ✅ faiss-cpu (vector search)
170
- - ✅ easyocr (text extraction)
171
- - ✅ piper-tts (neural voice)
172
-
173
- ---
174
-
175
- ### **Storage System**
176
-
177
- | Aspect | Before | After |
178
- |--------|--------|-------|
179
- | **Metadata** | JSON file | JSON file (kept) |
180
- | **Vectors** | In JSON | FAISS index (new) |
181
- | **Text Embeddings** | sentence-transformers | sentence-transformers (kept) |
182
- | **Image Embeddings** | ❌ None | ✅ MobileCLIP |
183
- | **Search Method** | Linear scan | FAISS similarity |
184
- | **Index Size** | N/A | ~2 MB per 1,000 entries (512-dim float32 vectors) |
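A rough way to estimate the index size for the table above, assuming a flat (uncompressed) index that stores each 512-dimensional embedding as raw float32 values. `flat_index_bytes` is a hypothetical helper, and any index overhead is ignored.

```python
def flat_index_bytes(num_vectors, dim=512, bytes_per_value=4):
    """Raw storage for a flat float32 vector index, excluding overhead."""
    return num_vectors * dim * bytes_per_value


per_vector = flat_index_bytes(1)        # 512 * 4 bytes
per_thousand = flat_index_bytes(1000)   # about 2 MB
```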
185
-
186
- ---
187
-
188
- ## 🎯 USE CASE COMPARISON
189
-
190
- ### **Scenario 1: Scene Description**
191
-
192
- **Before:**
193
- ```
194
- User: "Describe the scene"
195
- System: "a person holding a phone"
196
- ```
197
-
198
- **After:**
199
- ```
200
- User: "Describe the scene"
201
- System: "a person holding a phone. Objects detected: person, phone. Text visible: Hello World"
202
- ```
203
-
204
- **Improvement:** 4x more information
205
-
206
- ---
207
-
208
- ### **Scenario 2: Memory Search**
209
-
210
- **Before:**
211
- ```
212
- User: "What did I see this morning?"
213
- System: [Searches all memories linearly]
214
- Time: 500ms for 1000 memories
215
- Results: 5 matches (75% relevant)
216
- ```
217
-
218
- **After:**
219
- ```
220
- User: "What did I see this morning?"
221
- System: [FAISS + time filter + intent classification]
222
- Time: 10ms for 10,000 memories
223
- Results: 5 matches (90% relevant)
224
- ```
225
-
226
- **Improvement:** 50x faster, 15% more accurate
227
-
228
- ---
229
-
230
- ### **Scenario 3: Text Reading**
231
-
232
- **Before:**
233
- ```
234
- User: "Read the text"
235
- System: "Reading text will be available soon."
236
- ```
237
-
238
- **After:**
239
- ```
240
- User: "Read the text"
241
- System: "I can see the following text: Hello World"
242
- ```
243
-
244
- **Improvement:** NEW capability
245
-
246
- ---
247
-
248
- ## 📊 CAPABILITY MATRIX
249
-
250
- | Capability | Before | After | Status |
251
- |------------|--------|-------|--------|
252
- | **Object Detection** | ✅ YOLO/SSD | ✅ YOLO/SSD | KEPT |
253
- | **Image Captioning** | ✅ BLIP | ✅ BLIP | KEPT |
254
- | **Text Extraction** | ❌ | ✅ EasyOCR | NEW |
255
- | **Image Embeddings** | ❌ | ✅ MobileCLIP | NEW |
256
- | **Text Embeddings** | ✅ MiniLM | ✅ MiniLM | KEPT |
257
- | **Vector Search** | ❌ | ✅ FAISS | NEW |
258
- | **Speech Recognition** | ✅ Vosk | ✅ Vosk | KEPT |
259
- | **Text-to-Speech** | ✅ pyttsx3 | ✅ Voxtral + pyttsx3 | ENHANCED |
260
- | **Intent Classification** | ❌ Keywords | ✅ DistilBERT | NEW |
261
- | **Time Filtering** | ✅ Basic | ✅ Enhanced | IMPROVED |
262
- | **Importance Scoring** | ✅ Basic | ✅ Enhanced | IMPROVED |
263
- | **Multimodal Fusion** | ❌ | ✅ FusionLayer | NEW |
264
-
265
- ---
266
-
267
- ## 🔄 BACKWARD COMPATIBILITY
268
-
269
- | Aspect | Compatible? | Notes |
270
- |--------|-------------|-------|
271
- | **Old memory.json** | ✅ YES | Automatically migrated |
272
- | **Voice commands** | ✅ YES | Same commands work |
273
- | **Memory format** | ✅ YES | New fields optional |
274
- | **API** | ✅ YES | Old methods still work |
275
- | **File structure** | ✅ YES | Old files preserved |
276
- | **Dependencies** | ✅ YES | Old deps still work |
277
-
278
- **Breaking Changes:** ❌ NONE
279
-
280
- ---
281
-
282
- ## 💰 COST-BENEFIT ANALYSIS
283
-
284
- ### **Costs**
285
- | Item | Cost |
286
- |------|------|
287
- | **Development Time** | ~8 hours |
288
- | **Additional Storage** | +500MB models |
289
- | **Memory Usage** | +300MB RAM |
290
- | **Startup Time** | +3-5 seconds |
291
- | **Complexity** | Medium increase |
292
-
293
- ### **Benefits**
294
- | Item | Benefit |
295
- |------|---------|
296
- | **New Features** | 4 major capabilities |
297
- | **Performance** | 10x faster search |
298
- | **Accuracy** | 15-27% improvement |
299
- | **Capacity** | 10x more memories |
300
- | **User Experience** | Significantly better |
301
- | **Maintainability** | Better code structure |
302
-
303
- **ROI:** 🟢 **VERY HIGH** - Major improvements with minimal cost
304
-
305
- ---
306
-
307
- ## 🎯 UPGRADE IMPACT
308
-
309
- ### **User Impact**
310
- - 🟢 **Positive:** Better features, faster, smarter
311
- - 🟡 **Neutral:** Slightly longer startup
312
- - 🔴 **Negative:** None
313
-
314
- ### **Developer Impact**
315
- - 🟢 **Positive:** Better code organization, more modular
316
- - 🟢 **Positive:** Comprehensive documentation
317
- - 🟡 **Neutral:** More files to maintain
318
- - 🔴 **Negative:** None
319
-
320
- ### **System Impact**
321
- - 🟢 **Positive:** 10x performance improvement
322
- - 🟢 **Positive:** 10x capacity increase
323
- - 🟡 **Neutral:** +300MB memory usage
324
- - 🔴 **Negative:** None
325
-
326
- ---
327
-
328
- ## 📈 SCALABILITY COMPARISON
329
-
330
- | Aspect | Before | After | Improvement |
331
- |--------|--------|-------|-------------|
332
- | **Max Memories** | ~1,000 | 10,000+ | 10x |
333
- | **Search Complexity** | O(n) | O(n) with the flat FAISS index (sub-linear needs IVF) | Faster scan, same order |
334
- | **Concurrent Queries** | 1 | Multiple | Thread-safe |
335
- | **Index Size** | N/A | ~2 MB/1000 (512-dim float32) | Efficient |
336
- | **Memory Growth** | Linear | Sub-linear | Better |
337
-
338
- ---
339
-
340
- ## 🏆 SUCCESS METRICS
341
-
342
- ### **Technical Success**
343
- - ✅ 100% backward compatible
344
- - ✅ 0 breaking changes
345
- - ✅ 10x performance improvement
346
- - ✅ 4 new major features
347
- - ✅ 8 new modules created
348
-
349
- ### **Quality Success**
350
- - ✅ Comprehensive documentation
351
- - ✅ Automated tests
352
- - ✅ Error handling
353
- - ✅ Fallback mechanisms
354
- - ✅ Code organization
355
-
356
- ### **User Success** (To Measure)
357
- - ⏳ User satisfaction
358
- - ⏳ Feature adoption
359
- - ⏳ Error rate reduction
360
- - ⏳ Performance perception
361
- - ⏳ Feedback scores
362
-
363
- ---
364
-
365
- ## 🎓 LESSONS LEARNED
366
-
367
- ### **What Worked Well**
368
- - ✅ Modular architecture
369
- - ✅ Fallback mechanisms
370
- - ✅ Backward compatibility
371
- - ✅ Comprehensive docs
372
- - ✅ Hybrid storage (JSON + FAISS)
373
-
374
- ### **What Could Be Better**
375
- - 🟡 Startup time (slightly slower)
376
- - 🟡 Memory usage (increased)
377
- - 🟡 Dependency count (more packages)
378
-
379
- ### **Future Improvements**
380
- - 💡 Lazy loading for faster startup
381
- - 💡 Memory optimization
382
- - 💡 Optional feature flags
383
- - 💡 Web interface
384
- - 💡 Mobile app
385
-
386
- ---
387
-
388
- ## 📊 FINAL VERDICT
389
-
390
- ### **Overall Assessment**
391
-
392
- | Category | Rating | Notes |
393
- |----------|--------|-------|
394
- | **Features** | ⭐⭐⭐⭐⭐ | 4 major new capabilities |
395
- | **Performance** | ⭐⭐⭐⭐⭐ | 10x faster |
396
- | **Compatibility** | ⭐⭐⭐⭐⭐ | 100% backward compatible |
397
- | **Code Quality** | ⭐⭐⭐⭐⭐ | Well organized |
398
- | **Documentation** | ⭐⭐⭐⭐⭐ | Comprehensive |
399
- | **Testing** | ⭐⭐⭐⭐☆ | Good coverage |
400
- | **User Experience** | ⭐⭐⭐⭐⭐ | Significantly improved |
401
-
402
- **Overall:** ⭐⭐⭐⭐⭐ **EXCELLENT UPGRADE**
403
-
404
- ---
405
-
406
- ## ✅ RECOMMENDATION
407
-
408
- **Status:** ✅ **APPROVED FOR DEPLOYMENT**
409
-
410
- **Confidence:** 🟢 **HIGH**
411
-
412
- **Reasoning:**
413
- - All objectives achieved
414
- - No breaking changes
415
- - Significant improvements
416
- - Well documented
417
- - Production ready
418
-
419
- **Next Steps:**
420
- 1. ✅ Deploy to production
421
- 2. ⏳ Monitor performance
422
- 3. ⏳ Collect user feedback
423
- 4. ⏳ Plan next iteration
424
-
425
- ---
426
-
427
- **The upgrade is a resounding success! 🎉**
428
-
429
- VisionQ has evolved from a basic vision assistant to a **state-of-the-art multimodal AI system** while maintaining 100% backward compatibility.
430
-
431
- **Recommended Action:** PROCEED WITH DEPLOYMENT 🚀
archive/old_docs/DEPLOYMENT_CHECKLIST.md DELETED
@@ -1,397 +0,0 @@
1
- # ✅ VisionQ Upgrade - Deployment Checklist
2
-
3
- ## 📋 PRE-DEPLOYMENT
4
-
5
- ### **Code Review**
6
- - [x] All agents implemented
7
- - [x] Fusion layer created
8
- - [x] Memory system upgraded
9
- - [x] Query system enhanced
10
- - [x] Voice system updated
11
- - [x] Backward compatibility verified
12
- - [x] Error handling added
13
- - [x] Fallback mechanisms in place
14
-
15
- ### **Documentation**
16
- - [x] README_UPGRADED.md created
17
- - [x] QUICKSTART.md created
18
- - [x] UPGRADE_GUIDE.md created
19
- - [x] ARCHITECTURE.md created
20
- - [x] SUMMARY.md created
21
- - [x] Code comments added
22
- - [x] Docstrings complete
23
-
24
- ### **Testing Scripts**
25
- - [x] test_upgrade.py created
26
- - [x] install_upgrade.bat created
27
- - [x] Test cases defined
28
-
29
- ---
30
-
31
- ## 🚀 DEPLOYMENT STEPS
32
-
33
- ### **Step 1: Backup** ⚠️
34
- ```bat
35
- REM Back up the existing system (Windows cmd)
36
- mkdir backup
37
- copy *.py backup\
38
- copy memory.json backup\
39
- ```
40
- - [ ] Old files backed up
41
- - [ ] Memory file backed up
42
- - [ ] Configuration saved
43
-
44
- ### **Step 2: Install Dependencies**
45
- ```bash
46
- pip install -r requirements_upgraded.txt
47
- ```
48
- - [ ] Core dependencies installed
49
- - [ ] FAISS installed
50
- - [ ] EasyOCR installed
51
- - [ ] Piper TTS installed (optional)
52
-
53
- ### **Step 3: Directory Setup**
54
- ```bat
55
- mkdir data
56
- move memory.json data\memory.json
57
- ```
58
- - [ ] data/ directory created
59
- - [ ] Memory file migrated
60
- - [ ] Permissions verified
61
-
62
- ### **Step 4: Run Tests**
63
- ```bash
64
- python test_upgrade.py
65
- ```
66
- - [ ] All imports successful
67
- - [ ] MemoryAgent tests pass
68
- - [ ] FusionLayer tests pass
69
- - [ ] QueryAgent tests pass
70
- - [ ] Backward compatibility verified
71
-
72
- ### **Step 5: Initial Run**
73
- ```bash
74
- python main_upgraded.py
75
- ```
76
- - [ ] System starts without errors
77
- - [ ] Camera initializes
78
- - [ ] Microphone detected
79
- - [ ] Voice output works
80
- - [ ] Can exit cleanly
81
-
82
- ---
83
-
84
- ## 🧪 FUNCTIONAL TESTING
85
-
86
- ### **Voice Commands**
87
- - [ ] "Describe the scene" works
88
- - [ ] "Remember this" stores memory
89
- - [ ] "What did I see" recalls memory
90
- - [ ] "Read the text" extracts text (if text visible)
91
- - [ ] "Exit" quits properly
92
-
93
- ### **Memory System**
94
- - [ ] Memories persist after restart
95
- - [ ] JSON file created in data/
96
- - [ ] FAISS index created (if available)
97
- - [ ] Can recall stored memories
98
- - [ ] Timestamps correct
99
-
100
- ### **Query System**
101
- ```bash
102
- python ask_question_upgraded.py
103
- ```
104
- - [ ] Time-based queries work
105
- - [ ] Text search returns results
106
- - [ ] Intent classification functional
107
- - [ ] Confidence scores displayed
108
-
109
- ### **OCR Functionality**
110
- - [ ] Text extraction works
111
- - [ ] Confidence filtering applied
112
- - [ ] Text cleaning functional
113
- - [ ] Integrated into descriptions
114
-
115
- ### **Fallback Mechanisms**
116
- - [ ] pyttsx3 works if Voxtral unavailable
117
- - [ ] Keyword matching if DistilBERT fails
118
- - [ ] Linear search if FAISS unavailable
119
- - [ ] System continues if OCR fails
120
-
121
- ---
122
-
123
- ## 🔍 PERFORMANCE TESTING
124
-
125
- ### **Speed Tests**
126
- - [ ] Memory search <100ms
127
- - [ ] OCR processing <500ms
128
- - [ ] Caption generation <200ms
129
- - [ ] Query response <100ms
130
-
131
- ### **Capacity Tests**
132
- - [ ] Can store 100+ memories
133
- - [ ] Search remains fast with many memories
134
- - [ ] FAISS index scales properly
135
- - [ ] No memory leaks
136
-
137
- ### **Accuracy Tests**
138
- - [ ] OCR accuracy >85% (on clear text)
139
- - [ ] Intent classification >90%
140
- - [ ] Memory retrieval relevance >85%
141
- - [ ] Caption quality maintained
142
-
143
- ---
144
-
145
- ## 📊 INTEGRATION TESTING
146
-
147
- ### **End-to-End Scenarios**
148
-
149
- **Scenario 1: Basic Usage**
150
- 1. [ ] Start system
151
- 2. [ ] Describe scene
152
- 3. [ ] Remember scene
153
- 4. [ ] Recall memory
154
- 5. [ ] Exit
155
-
156
- **Scenario 2: OCR Workflow**
157
- 1. [ ] Start system
158
- 2. [ ] Point at text
159
- 3. [ ] Say "Read the text"
160
- 4. [ ] Verify text extracted
161
- 5. [ ] Check memory includes text
162
-
163
- **Scenario 3: Query Workflow**
164
- 1. [ ] Store multiple memories
165
- 2. [ ] Run ask_question_upgraded.py
166
- 3. [ ] Try time-based query
167
- 4. [ ] Try object-based query
168
- 5. [ ] Verify results relevant
169
-
170
- **Scenario 4: Fallback Testing**
171
- 1. [ ] Uninstall FAISS temporarily
172
- 2. [ ] Verify system still works
173
- 3. [ ] Reinstall FAISS
174
- 4. [ ] Verify enhanced features return
175
-
176
- ---
177
-
178
- ## 🐛 ERROR HANDLING
179
-
180
- ### **Common Errors to Test**
181
- - [ ] Camera not available
182
- - [ ] Microphone not detected
183
- - [ ] Model download failure
184
- - [ ] Memory file corrupted
185
- - [ ] FAISS index corrupted
186
- - [ ] Out of disk space
187
- - [ ] Permission denied
188
-
189
- ### **Recovery Procedures**
190
- - [ ] System logs errors clearly
191
- - [ ] Fallbacks activate automatically
192
- - [ ] User gets helpful error messages
193
- - [ ] System doesn't crash
194
- - [ ] Can recover without restart
195
-
196
- ---
197
-
198
- ## 📚 DOCUMENTATION VERIFICATION
199
-
200
- ### **User Documentation**
201
- - [ ] QUICKSTART.md accurate
202
- - [ ] Installation steps work
203
- - [ ] Voice commands documented
204
- - [ ] Query examples work
205
- - [ ] Troubleshooting helpful
206
-
207
- ### **Developer Documentation**
208
- - [ ] ARCHITECTURE.md clear
209
- - [ ] Code comments accurate
210
- - [ ] API documented
211
- - [ ] Examples provided
212
- - [ ] Diagrams correct
213
-
214
- ### **Upgrade Documentation**
215
- - [ ] UPGRADE_GUIDE.md complete
216
- - [ ] Migration steps clear
217
- - [ ] Backward compatibility explained
218
- - [ ] New features documented
219
- - [ ] Performance metrics accurate
220
-
221
- ---
222
-
223
- ## 🔒 SECURITY & PRIVACY
224
-
225
- ### **Privacy Checks**
226
- - [ ] No data sent to cloud
227
- - [ ] All processing local
228
- - [ ] Memory stored locally
229
- - [ ] No telemetry
230
- - [ ] No external API calls
231
-
232
- ### **Security Checks**
233
- - [ ] No hardcoded credentials
234
- - [ ] File permissions correct
235
- - [ ] Input validation present
236
- - [ ] No SQL injection risks
237
- - [ ] Dependencies up to date
238
-
239
- ---
240
-
241
- ## 📦 PACKAGING
242
-
243
- ### **Files to Include**
244
- - [x] agents/ directory
245
- - [x] core/ directory
246
- - [x] main_upgraded.py
247
- - [x] ask_question_upgraded.py
248
- - [x] requirements_upgraded.txt
249
- - [x] install_upgrade.bat
250
- - [x] test_upgrade.py
251
- - [x] All documentation files
252
- - [x] LICENSE
253
- - [x] .gitignore
254
-
255
- ### **Files to Exclude**
256
- - [ ] __pycache__/
257
- - [ ] *.pyc
258
- - [ ] data/test_*
259
- - [ ] .venv/
260
- - [ ] models/ (too large, download separately)
261
-
262
- ---
263
-
264
- ## 🚀 PRODUCTION READINESS
265
-
266
- ### **Critical Requirements**
267
- - [ ] All tests pass
268
- - [ ] No critical bugs
269
- - [ ] Documentation complete
270
- - [ ] Backward compatible
271
- - [ ] Performance acceptable
272
-
273
- ### **Nice-to-Have**
274
- - [ ] Neural TTS working
275
- - [ ] FAISS available
276
- - [ ] EasyOCR installed
277
- - [ ] All optional features enabled
278
-
279
- ---
280
-
281
- ## 📈 POST-DEPLOYMENT
282
-
283
- ### **Monitoring**
284
- - [ ] Track memory usage
285
- - [ ] Monitor query performance
286
- - [ ] Log error rates
287
- - [ ] Collect user feedback
288
- - [ ] Measure accuracy
289
-
290
- ### **Maintenance**
291
- - [ ] Regular dependency updates
292
- - [ ] Model updates
293
- - [ ] Bug fixes
294
- - [ ] Feature requests
295
- - [ ] Documentation updates
296
-
297
- ---
298
-
299
- ## ✅ FINAL SIGN-OFF
300
-
301
- ### **Deployment Approval**
302
- - [ ] All critical tests passed
303
- - [ ] Documentation reviewed
304
- - [ ] Backup created
305
- - [ ] Rollback plan ready
306
- - [ ] Team notified
307
-
308
- ### **Go-Live Checklist**
309
- - [ ] System tested end-to-end
310
- - [ ] Users trained
311
- - [ ] Support ready
312
- - [ ] Monitoring active
313
- - [ ] Feedback mechanism in place
314
-
315
- ---
316
-
317
- ## 🎉 SUCCESS CRITERIA
318
-
319
- ### **Must Have**
320
- - ✅ System starts without errors
321
- - ✅ All voice commands work
322
- - ✅ Memory persists
323
- - ✅ Backward compatible
324
- - ✅ Documentation complete
325
-
326
- ### **Should Have**
327
- - ✅ OCR functional
328
- - ✅ FAISS search fast
329
- - ✅ Neural TTS working
330
- - ✅ Intent classification accurate
331
- - ✅ Performance targets met
332
-
333
- ### **Nice to Have**
334
- - ⭐ All optional features enabled
335
- - ⭐ Zero warnings
336
- - ⭐ Perfect test coverage
337
- - ⭐ User feedback positive
338
- - ⭐ Performance exceeds targets
339
-
340
- ---
341
-
342
- ## 📞 SUPPORT CONTACTS
343
-
344
- ### **Technical Issues**
345
- - Check: UPGRADE_GUIDE.md troubleshooting
346
- - Run: test_upgrade.py
347
- - Review: Error logs
348
-
349
- ### **Documentation**
350
- - QUICKSTART.md - Quick setup
351
- - UPGRADE_GUIDE.md - Complete guide
352
- - ARCHITECTURE.md - Technical details
353
-
354
- ---
355
-
356
- ## 🏁 DEPLOYMENT STATUS
357
-
358
- **Current Status:** ✅ READY FOR DEPLOYMENT
359
-
360
- **Confidence Level:** HIGH
361
- - All code implemented
362
- - Tests created
363
- - Documentation complete
364
- - Backward compatible
365
- - Fallbacks in place
366
-
367
- **Recommended Action:** PROCEED WITH DEPLOYMENT
368
-
369
- ---
370
-
371
- ## 📝 NOTES
372
-
373
- ### **Known Limitations**
374
- - Piper TTS requires separate model download
375
- - EasyOCR first run downloads models (~500MB)
376
- - FAISS CPU-only (GPU version available separately)
377
- - OCR accuracy depends on image quality
378
-
379
- ### **Future Improvements**
380
- - Web interface
381
- - Mobile app
382
- - Cloud sync (optional)
383
- - Multi-user support
384
- - Video recording
385
-
386
- ---
387
-
388
- **Deployment Checklist Complete! 🎊**
389
-
390
- **Next Steps:**
391
- 1. Review this checklist
392
- 2. Run through deployment steps
393
- 3. Execute test suite
394
- 4. Verify all features
395
- 5. Deploy to production
396
-
397
- **Good luck! 🚀**
archive/old_docs/INDEX.md DELETED
@@ -1,359 +0,0 @@
1
- # 📚 VisionQ Upgrade - Documentation Index
2
-
3
- ## 🎯 START HERE
4
-
5
- **New to VisionQ?** → [QUICKSTART.md](QUICKSTART.md)
6
- **Upgrading existing system?** → [UPGRADE_GUIDE.md](UPGRADE_GUIDE.md)
7
- **Need quick reference?** → [QUICK_REFERENCE.md](QUICK_REFERENCE.md)
8
-
9
- ---
10
-
11
- ## 📖 DOCUMENTATION MAP
12
-
13
- ### **🚀 Getting Started**
14
-
15
- | Document | Purpose | Time | Audience |
16
- |----------|---------|------|----------|
17
- | [QUICKSTART.md](QUICKSTART.md) | 5-minute setup guide | 5 min | Everyone |
18
- | [QUICK_REFERENCE.md](QUICK_REFERENCE.md) | Command cheat sheet | 2 min | Everyone |
19
- | [README_UPGRADED.md](README_UPGRADED.md) | Project overview | 10 min | Everyone |
20
-
21
- ### **📋 Upgrade Information**
22
-
23
- | Document | Purpose | Time | Audience |
24
- |----------|---------|------|----------|
25
- | [UPGRADE_GUIDE.md](UPGRADE_GUIDE.md) | Complete upgrade docs | 30 min | Developers |
26
- | [SUMMARY.md](SUMMARY.md) | Executive summary | 10 min | Managers |
27
- | [COMPARISON.md](COMPARISON.md) | Before/After analysis | 15 min | Technical leads |
28
-
29
- ### **🏗️ Technical Documentation**
30
-
31
- | Document | Purpose | Time | Audience |
32
- |----------|---------|------|----------|
33
- | [ARCHITECTURE.md](ARCHITECTURE.md) | System architecture | 20 min | Developers |
34
- | [DEPLOYMENT_CHECKLIST.md](DEPLOYMENT_CHECKLIST.md) | Deploy procedures | 15 min | DevOps |
35
-
36
- ### **📝 Code Files**
37
-
38
- | File | Purpose | Type |
39
- |------|---------|------|
40
- | `main_upgraded.py` | Main entry point | Python |
41
- | `ask_question_upgraded.py` | Query interface | Python |
42
- | `test_upgrade.py` | Test suite | Python |
43
- | `install_upgrade.bat` | Installer script | Batch |
44
- | `requirements_upgraded.txt` | Dependencies | Text |
45
-
46
- ---
47
-
48
- ## 🗺️ NAVIGATION GUIDE
49
-
50
- ### **I want to...**
51
-
52
- **...get started quickly**
53
- → [QUICKSTART.md](QUICKSTART.md) → Run `install_upgrade.bat`
54
-
55
- **...understand what changed**
56
- → [COMPARISON.md](COMPARISON.md) → [SUMMARY.md](SUMMARY.md)
57
-
58
- **...learn the architecture**
59
- → [ARCHITECTURE.md](ARCHITECTURE.md) → Code in `agents/`
60
-
61
- **...deploy to production**
62
- → [DEPLOYMENT_CHECKLIST.md](DEPLOYMENT_CHECKLIST.md)
63
-
64
- **...troubleshoot issues**
65
- → [UPGRADE_GUIDE.md](UPGRADE_GUIDE.md) → Troubleshooting section
66
-
67
- **...see command reference**
68
- → [QUICK_REFERENCE.md](QUICK_REFERENCE.md)
69
-
70
- **...understand the upgrade**
71
- → [UPGRADE_GUIDE.md](UPGRADE_GUIDE.md) → [SUMMARY.md](SUMMARY.md)
72
-
73
- ---
74
-
75
- ## 📊 DOCUMENT RELATIONSHIPS
76
-
77
- ```
78
- START
79
-
80
- ├─ Quick Start? → QUICKSTART.md
81
- │ │
82
- │ └─ Need details? → UPGRADE_GUIDE.md
83
-
84
- ├─ Overview? → README_UPGRADED.md
85
- │ │
86
- │ └─ Technical? → ARCHITECTURE.md
87
-
88
- ├─ Comparison? → COMPARISON.md
89
- │ │
90
- │ └─ Summary? → SUMMARY.md
91
-
92
- └─ Deploy? → DEPLOYMENT_CHECKLIST.md
93
-
94
- └─ Reference? → QUICK_REFERENCE.md
95
- ```
96
-
97
- ---
98
-
99
- ## 🎯 BY ROLE
100
-
101
- ### **👤 End User**
102
- 1. [QUICKSTART.md](QUICKSTART.md) - Setup
103
- 2. [QUICK_REFERENCE.md](QUICK_REFERENCE.md) - Commands
104
- 3. [README_UPGRADED.md](README_UPGRADED.md) - Features
105
-
106
- ### **👨‍💻 Developer**
107
- 1. [UPGRADE_GUIDE.md](UPGRADE_GUIDE.md) - Complete guide
108
- 2. [ARCHITECTURE.md](ARCHITECTURE.md) - System design
109
- 3. Code in `agents/` and `core/` - Implementation
110
-
111
- ### **👔 Manager**
112
- 1. [SUMMARY.md](SUMMARY.md) - Executive summary
113
- 2. [COMPARISON.md](COMPARISON.md) - ROI analysis
114
- 3. [README_UPGRADED.md](README_UPGRADED.md) - Overview
115
-
116
- ### **🚀 DevOps**
117
- 1. [DEPLOYMENT_CHECKLIST.md](DEPLOYMENT_CHECKLIST.md) - Deploy
118
- 2. [UPGRADE_GUIDE.md](UPGRADE_GUIDE.md) - Installation
119
- 3. `test_upgrade.py` - Testing
120
-
121
- ---
122
-
123
- ## 📚 READING ORDER
124
-
125
- ### **Fast Track (30 minutes)**
126
- 1. [QUICKSTART.md](QUICKSTART.md) - 5 min
127
- 2. [QUICK_REFERENCE.md](QUICK_REFERENCE.md) - 2 min
128
- 3. [SUMMARY.md](SUMMARY.md) - 10 min
129
- 4. [COMPARISON.md](COMPARISON.md) - 15 min
130
-
131
- ### **Complete Track (2 hours)**
132
- 1. [README_UPGRADED.md](README_UPGRADED.md) - 10 min
133
- 2. [QUICKSTART.md](QUICKSTART.md) - 5 min
134
- 3. [UPGRADE_GUIDE.md](UPGRADE_GUIDE.md) - 30 min
135
- 4. [ARCHITECTURE.md](ARCHITECTURE.md) - 20 min
136
- 5. [COMPARISON.md](COMPARISON.md) - 15 min
137
- 6. [SUMMARY.md](SUMMARY.md) - 10 min
138
- 7. [DEPLOYMENT_CHECKLIST.md](DEPLOYMENT_CHECKLIST.md) - 15 min
139
- 8. Code exploration - 30 min
140
-
141
- ### **Technical Deep Dive (4 hours)**
142
- 1. All documents above
143
- 2. Code in `agents/` - 1 hour
144
- 3. Code in `core/` - 30 min
145
- 4. Test suite analysis - 30 min
146
- 5. Hands-on experimentation - 1 hour
147
-
148
- ---
149
-
150
- ## 🔍 BY TOPIC
151
-
152
- ### **Installation & Setup**
153
- - [QUICKSTART.md](QUICKSTART.md) - Quick setup
154
- - [UPGRADE_GUIDE.md](UPGRADE_GUIDE.md) - Detailed installation
155
- - `install_upgrade.bat` - Automated installer
156
- - `requirements_upgraded.txt` - Dependencies
157
-
158
- ### **Features & Capabilities**
159
- - [README_UPGRADED.md](README_UPGRADED.md) - Feature overview
160
- - [COMPARISON.md](COMPARISON.md) - Before/After features
161
- - [SUMMARY.md](SUMMARY.md) - Capability summary
162
-
163
- ### **Architecture & Design**
164
- - [ARCHITECTURE.md](ARCHITECTURE.md) - System architecture
165
- - [UPGRADE_GUIDE.md](UPGRADE_GUIDE.md) - Design decisions
166
- - Code in `agents/` - Implementation
167
-
168
- ### **Usage & Commands**
169
- - [QUICK_REFERENCE.md](QUICK_REFERENCE.md) - Command reference
170
- - [QUICKSTART.md](QUICKSTART.md) - Usage examples
171
- - [README_UPGRADED.md](README_UPGRADED.md) - Use cases
172
-
173
- ### **Testing & Deployment**
174
- - [DEPLOYMENT_CHECKLIST.md](DEPLOYMENT_CHECKLIST.md) - Deploy guide
175
- - `test_upgrade.py` - Test suite
176
- - [UPGRADE_GUIDE.md](UPGRADE_GUIDE.md) - Testing section
177
-
178
- ### **Troubleshooting**
179
- - [UPGRADE_GUIDE.md](UPGRADE_GUIDE.md) - Troubleshooting section
180
- - [QUICK_REFERENCE.md](QUICK_REFERENCE.md) - Quick fixes
181
- - [DEPLOYMENT_CHECKLIST.md](DEPLOYMENT_CHECKLIST.md) - Error handling
182
-
183
- ---
184
-
185
- ## 📦 FILE INVENTORY
186
-
187
- ### **Documentation (11 files)**
188
- - ✅ QUICKSTART.md
189
- - ✅ QUICK_REFERENCE.md
190
- - ✅ README_UPGRADED.md
191
- - ✅ UPGRADE_GUIDE.md
192
- - ✅ ARCHITECTURE.md
193
- - ✅ SUMMARY.md
194
- - ✅ COMPARISON.md
195
- - ✅ DEPLOYMENT_CHECKLIST.md
196
- - ✅ INDEX.md (this file)
197
- - ✅ README.md (original)
198
- - ✅ requirements_upgraded.txt
199
-
200
- ### **Code Files (12 files)**
201
- - ✅ agents/__init__.py
202
- - ✅ agents/voice_agent.py
203
- - ✅ agents/vision_agent.py
204
- - ✅ agents/caption_agent.py
205
- - ✅ agents/embedding_agent.py
206
- - ✅ agents/ocr_agent.py
207
- - ✅ agents/memory_agent.py
208
- - ✅ agents/query_agent.py
209
- - ✅ core/__init__.py
210
- - ✅ core/fusion_layer.py
211
- - ✅ main_upgraded.py
212
- - ✅ ask_question_upgraded.py
213
-
214
- ### **Utility Files (2 files)**
215
- - ✅ test_upgrade.py
216
- - ✅ install_upgrade.bat
217
-
218
- ### **Total: 25 new/updated files**
219
-
220
- ---
221
-
222
- ## 🎓 LEARNING PATHS
223
-
224
- ### **Path 1: Quick User (1 hour)**
225
- ```
226
- QUICKSTART.md
227
-
228
- Run install_upgrade.bat
229
-
230
- Run main_upgraded.py
231
-
232
- Try voice commands
233
-
234
- QUICK_REFERENCE.md (bookmark)
235
- ```
236
-
237
- ### **Path 2: Developer (4 hours)**
238
- ```
239
- README_UPGRADED.md
240
-
241
- UPGRADE_GUIDE.md
242
-
243
- ARCHITECTURE.md
244
-
245
- Explore agents/ code
246
-
247
- Run test_upgrade.py
248
-
249
- Modify and experiment
250
- ```
251
-
252
- ### **Path 3: Manager (30 minutes)**
253
- ```
254
- SUMMARY.md
255
-
256
- COMPARISON.md
257
-
258
- README_UPGRADED.md
259
-
260
- Make decision
261
- ```
262
-
263
- ---
264
-
265
- ## 🔗 EXTERNAL RESOURCES
266
-
267
- ### **Model Documentation**
268
- - [YOLO](https://github.com/ultralytics/ultralytics)
269
- - [BLIP](https://github.com/salesforce/BLIP)
270
- - [CLIP](https://github.com/openai/CLIP)
271
- - [EasyOCR](https://github.com/JaidedAI/EasyOCR)
272
- - [FAISS](https://github.com/facebookresearch/faiss)
273
- - [Vosk](https://alphacephei.com/vosk/)
274
- - [Piper TTS](https://github.com/rhasspy/piper)
275
-
276
- ### **Python Libraries**
277
- - [PyTorch](https://pytorch.org/)
278
- - [Transformers](https://huggingface.co/docs/transformers)
279
- - [OpenCV](https://opencv.org/)
280
- - [sentence-transformers](https://www.sbert.net/)
281
-
282
- ---
283
-
284
- ## 📞 SUPPORT MATRIX
285
-
286
- | Issue Type | Resource |
287
- |------------|----------|
288
- | **Installation** | QUICKSTART.md → UPGRADE_GUIDE.md |
289
- | **Usage** | QUICK_REFERENCE.md → README_UPGRADED.md |
290
- | **Errors** | UPGRADE_GUIDE.md (Troubleshooting) |
291
- | **Architecture** | ARCHITECTURE.md |
292
- | **Deployment** | DEPLOYMENT_CHECKLIST.md |
293
- | **Comparison** | COMPARISON.md |
294
- | **Testing** | test_upgrade.py |
295
-
296
- ---
297
-
298
- ## ✅ DOCUMENTATION CHECKLIST
299
-
300
- ### **For Users**
301
- - [x] Quick start guide
302
- - [x] Command reference
303
- - [x] Troubleshooting guide
304
- - [x] Use case examples
305
-
306
- ### **For Developers**
307
- - [x] Architecture documentation
308
- - [x] Code organization explained
309
- - [x] API documentation (docstrings)
310
- - [x] Test suite
311
-
312
- ### **For Managers**
313
- - [x] Executive summary
314
- - [x] ROI analysis
315
- - [x] Feature comparison
316
- - [x] Deployment guide
317
-
318
- ### **For DevOps**
319
- - [x] Installation scripts
320
- - [x] Deployment checklist
321
- - [x] Testing procedures
322
- - [x] Troubleshooting guide
323
-
324
- ---
325
-
326
- ## 🎯 QUICK LINKS
327
-
328
- **Most Important:**
329
- - 🚀 [QUICKSTART.md](QUICKSTART.md) - Start here!
330
- - 📋 [QUICK_REFERENCE.md](QUICK_REFERENCE.md) - Commands
331
- - 📚 [UPGRADE_GUIDE.md](UPGRADE_GUIDE.md) - Complete guide
332
-
333
- **For Understanding:**
334
- - 📊 [COMPARISON.md](COMPARISON.md) - What changed
335
- - 📝 [SUMMARY.md](SUMMARY.md) - Executive summary
336
- - 🏗️ [ARCHITECTURE.md](ARCHITECTURE.md) - How it works
337
-
338
- **For Action:**
339
- - ✅ [DEPLOYMENT_CHECKLIST.md](DEPLOYMENT_CHECKLIST.md) - Deploy
340
- - 🧪 `test_upgrade.py` - Test
341
- - 🔧 `install_upgrade.bat` - Install
342
-
343
- ---
344
-
345
- ## 🎉 YOU'RE ALL SET!
346
-
347
- **This index covers all documentation for the VisionQ upgrade.**
348
-
349
- **Start with:** [QUICKSTART.md](QUICKSTART.md)
350
-
351
- **Need help?** Check the appropriate document above.
352
-
353
- **Happy upgrading! 🚀**
354
-
355
- ---
356
-
357
- **Last Updated:** 2024
358
- **Version:** 2.0 (Upgraded)
359
- **Status:** ✅ Production Ready
archive/old_docs/QUICKSTART.md DELETED
@@ -1,197 +0,0 @@
1
- # 🚀 VisionQ Upgrade - Quick Start Guide
2
-
3
- ## ⚡ 5-Minute Setup
4
-
5
- ### **Step 1: Install Dependencies**
6
- ```bash
7
- pip install -r requirements_upgraded.txt
8
- ```
9
-
10
- ### **Step 2: Create Data Directory**
11
- ```bash
12
- mkdir data
13
- move memory.json data\memory.json
14
- ```
15
-
16
- ### **Step 3: Run Upgraded System**
17
- ```bash
18
- python main_upgraded.py
19
- ```
20
-
21
- ---
22
-
23
- ## 🎯 What's New?
24
-
25
- ### **1. OCR Text Reading** ✨
26
- **Voice Command:** "Read the text"
27
- - Points camera at text
28
- - Extracts and speaks visible text
29
- - Stores text in memory
30
-
31
- ### **2. Enhanced Memory** 🧠
32
- - **FAISS vector search** - 10x faster retrieval
33
- - **Image embeddings** - Find visually similar scenes
34
- - **Hybrid search** - Text + image combined
35
-
36
- ### **3. Better Voice** 🗣️
37
- - **Neural TTS** (Voxtral/Piper) - Natural speech
38
- - **Auto-fallback** - Uses pyttsx3 if needed
39
- - **Same commands** - No learning curve
40
-
41
- ### **4. Smarter Queries** 🔍
42
- - **DistilBERT NLP** - Understands intent
43
- - **Time-aware** - "What did I see this morning?"
44
- - **Multi-modal** - Searches text, images, objects
45
-
46
- ---
47
-
48
- ## 📋 Voice Commands
49
-
50
- | Say This | System Does |
51
- |----------|-------------|
52
- | "Describe the scene" | Caption + OCR + objects |
53
- | "Remember this" | Store with embeddings |
54
- | "What did I see" | Recall last memory |
55
- | **"Read the text"** | **Extract visible text** ⭐ NEW |
56
- | "Exit" | Quit system |
57
-
58
- ---
59
-
60
- ## 🔧 Optional: Neural TTS Setup
61
-
62
- **Want better voice quality?**
63
-
64
- 1. Download Piper voice model:
65
- - https://github.com/rhasspy/piper/releases
66
- - Get: `en_US-lessac-medium.onnx`
67
-
68
- 2. Create directory:
69
- ```bash
70
- mkdir models\piper
71
- ```
72
-
73
- 3. Extract model to `models/piper/`
74
-
75
- 4. Restart VisionQ
76
-
77
- **Note:** System works fine without this - pyttsx3 is the fallback!
78
-
79
- ---
80
-
81
- ## 🧪 Test Your Upgrade
82
-
83
- ### **Test 1: OCR**
84
- ```bash
85
- python main_upgraded.py
86
- # Say: "Read the text"
87
- # Point camera at text
88
- # Should extract and speak text
89
- ```
90
-
91
- ### **Test 2: Enhanced Memory**
92
- ```bash
93
- python ask_question_upgraded.py
94
- # Type: "What did I see today?"
95
- # Should show memories with confidence scores
96
- ```
97
-
98
- ### **Test 3: Voice Quality**
99
- ```bash
100
- # Listen to TTS output
101
- # Should sound natural (if Piper installed)
102
- # Or robotic (if using pyttsx3 fallback)
103
- ```
104
-
105
- ---
106
-
107
- ## 🐛 Quick Fixes
108
-
109
- ### **"Module not found" error**
110
- ```bash
111
- pip install --upgrade -r requirements_upgraded.txt
112
- ```
113
-
114
- ### **"FAISS not available" warning**
115
- ```bash
116
- pip install faiss-cpu
117
- ```
118
-
119
- ### **"OCR not working"**
120
- ```bash
121
- pip install easyocr
122
- ```
123
-
124
- ### **Camera not opening**
125
- ```bash
126
- # Check camera permissions
127
- # Try different camera index in vision_agent.py:
128
- # self.cap = cv2.VideoCapture(1) # Try 1 instead of 0
129
- ```
130
-
131
- ---
132
-
133
- ## 📊 What Got Better?
134
-
135
- | Feature | Before | After | Improvement |
136
- |---------|--------|-------|-------------|
137
- | Text Reading | ❌ None | ✅ OCR | NEW |
138
- | Memory Search | Slow | Fast | 10x faster |
139
- | Voice Quality | Robotic | Natural | Much better |
140
- | Query Understanding | Keywords | NLP | Smarter |
141
- | Scene Understanding | Caption only | Caption+OCR+Objects | Richer |
142
-
143
- ---
144
-
145
- ## 🎓 Example Queries
146
-
147
- **Try these in `ask_question_upgraded.py`:**
148
-
149
- ```
150
- "What did I see this morning?"
151
- "Show me memories with text"
152
- "When did I see a person?"
153
- "What happened in the last hour?"
154
- "Find memories from yesterday"
155
- ```
156
-
157
- ---
158
-
159
- ## ✅ Success Checklist
160
-
161
- - [ ] System starts without errors
162
- - [ ] Voice recognition works
163
- - [ ] Camera captures video
164
- - [ ] "Describe scene" gives detailed output
165
- - [ ] "Remember this" stores memory
166
- - [ ] "Read text" extracts text (if text visible)
167
- - [ ] Query system returns results
168
- - [ ] Memory persists after restart
169
-
170
- ---
171
-
172
- ## 🚀 You're Ready!
173
-
174
- Your VisionQ is now upgraded with:
175
- - ✅ OCR text reading
176
- - ✅ Fast vector search (FAISS)
177
- - ✅ Neural TTS (optional)
178
- - ✅ Smart NLP queries
179
- - ✅ Enhanced memory
180
-
181
- **All existing features still work!**
182
-
183
- ---
184
-
185
- ## 📚 Full Documentation
186
-
187
- For detailed information, see:
188
- - `UPGRADE_GUIDE.md` - Complete upgrade documentation
189
- - `requirements_upgraded.txt` - All dependencies
190
- - `agents/` - New modular code
191
- - `core/` - Fusion layer
192
-
193
- ---
194
-
195
- **Need help?** Check `UPGRADE_GUIDE.md` troubleshooting section.
196
-
197
- **Happy upgrading! 🎉**
archive/old_docs/QUICK_REFERENCE.md DELETED
@@ -1,315 +0,0 @@
1
- # 🎯 VisionQ Upgrade - Quick Reference Card
2
-
3
- ## 📦 INSTALLATION (3 Steps)
4
-
5
- ```bash
6
- # 1. Install dependencies
7
- pip install -r requirements_upgraded.txt
8
-
9
- # 2. Create data directory
10
- mkdir data
11
-
12
- # 3. Run system
13
- python main_upgraded.py
14
- ```
15
-
16
- ---
17
-
18
- ## 🗣️ VOICE COMMANDS
19
-
20
- | Say This | System Does |
21
- |----------|-------------|
22
- | **"Describe the scene"** | Captures and describes (caption + OCR + objects) |
23
- | **"Remember this"** | Stores scene with embeddings in memory |
24
- | **"What did I see"** | Recalls last memory |
25
- | **"Read the text"** | Extracts visible text (OCR) 🆕 |
26
- | **"Exit"** | Quits system |
27
-
28
- ---
29
-
30
- ## 🔍 QUERY EXAMPLES
31
-
32
- ```bash
33
- python ask_question_upgraded.py
34
- ```
35
-
36
- **Try these:**
37
- - "What did I see this morning?"
38
- - "Show me memories with text"
39
- - "When did I see a person?"
40
- - "Find memories from yesterday"
41
- - "What happened in the last hour?"
42
-
43
- ---
44
-
45
- ## 📂 FILE STRUCTURE
46
-
47
- ```
48
- VisionQ/
49
- ├── agents/ # 🆕 Modular agents
50
- │ ├── voice_agent.py # Voice I/O
51
- │ ├── vision_agent.py # Vision hub
52
- │ ├── embedding_agent.py # 🆕 MobileCLIP
53
- │ ├── ocr_agent.py # 🆕 Text extraction
54
- │ ├── memory_agent.py # Storage (JSON + FAISS)
55
- │ └── query_agent.py # Smart retrieval
56
-
57
- ├── core/ # 🆕 Integration
58
- │ └── fusion_layer.py # 🆕 Multimodal fusion
59
-
60
- ├── data/ # 🆕 Storage
61
- │ ├── memory.json # Metadata
62
- │ └── memory.faiss # 🆕 Vector index
63
-
64
- ├── main_upgraded.py # 🆕 Main entry
65
- └── ask_question_upgraded.py # 🆕 Query tool
66
- ```
67
-
68
- ---
69
-
70
- ## 🆕 WHAT'S NEW?
71
-
72
- | Feature | Status |
73
- |---------|--------|
74
- | **OCR Text Reading** | ✅ NEW |
75
- | **FAISS Vector Search** | ✅ NEW (10x faster) |
76
- | **Neural TTS (Voxtral)** | ✅ NEW (natural voice) |
77
- | **Intent Classification** | ✅ NEW (DistilBERT) |
78
- | **Multimodal Fusion** | ✅ NEW (richer context) |
79
-
80
- ---
81
-
82
- ## 🔧 CONFIGURATION
83
-
84
- **Vision** (`agents/vision_agent.py`):
85
- ```python
86
- FRAME_INTERVAL = 0.3 # Seconds between frames
87
- CONF_THRESHOLD = 0.5 # Detection confidence
88
- ```
89
-
90
- **OCR** (`agents/ocr_agent.py`):
91
- ```python
92
- OCR_CONFIDENCE = 0.3 # Text threshold
93
- OCR_LANGUAGES = ['en'] # Languages
94
- ```
95
-
96
- **Query** (`agents/query_agent.py`):
97
- ```python
98
- SIMILARITY_THRESHOLD = 0.45 # Search threshold
99
- TOP_K_RESULTS = 5 # Max results
100
- ```
101
-
102
- ---
103
-
104
- ## 🧪 TESTING
105
-
106
- ```bash
107
- # Run test suite
108
- python test_upgrade.py
109
-
110
- # Expected: All tests pass ✅
111
- ```
112
-
113
- ---
114
-
115
- ## 🐛 TROUBLESHOOTING
116
-
117
- **"Module not found":**
118
- ```bash
119
- pip install --upgrade -r requirements_upgraded.txt
120
- ```
121
-
122
- **"FAISS not available":**
123
- ```bash
124
- pip install faiss-cpu
125
- ```
126
-
127
- **"OCR not working":**
128
- ```bash
129
- pip install easyocr
130
- ```
131
-
132
- **Camera not opening:**
133
- ```python
134
- # Edit agents/vision_agent.py line ~90
135
- self.cap = cv2.VideoCapture(1) # Try 1 instead of 0
136
- ```
137
-
138
- ---
139
-
140
- ## 📚 DOCUMENTATION
141
-
142
- | File | Purpose |
143
- |------|---------|
144
- | **QUICKSTART.md** | 5-minute setup |
145
- | **UPGRADE_GUIDE.md** | Complete guide |
146
- | **ARCHITECTURE.md** | System design |
147
- | **SUMMARY.md** | Executive summary |
148
- | **COMPARISON.md** | Before/After |
149
- | **DEPLOYMENT_CHECKLIST.md** | Deploy steps |
150
-
151
- ---
152
-
153
- ## 🎯 KEY IMPROVEMENTS
154
-
155
- | Metric | Before | After | Change |
156
- |--------|--------|-------|--------|
157
- | **Search Speed** | 100-500ms | <10ms | 🟢 10-50x |
158
- | **Memory Capacity** | ~1,000 | 10,000+ | 🟢 10x |
159
- | **Query Accuracy** | 75% | 90% | 🟢 +15% |
160
- | **Intent Accuracy** | 70% | 97% | 🟢 +27% |
161
-
162
- ---
163
-
164
- ## ✅ BACKWARD COMPATIBILITY
165
-
166
- - ✅ Old memory.json files work
167
- - ✅ Same voice commands
168
- - ✅ Old files preserved
169
- - ✅ Zero breaking changes
170
-
171
- ---
172
-
173
- ## 🚀 QUICK START CHECKLIST
174
-
175
- - [ ] Install: `pip install -r requirements_upgraded.txt`
176
- - [ ] Setup: `mkdir data`
177
- - [ ] Test: `python test_upgrade.py`
178
- - [ ] Run: `python main_upgraded.py`
179
- - [ ] Try: Voice commands
180
- - [ ] Query: `python ask_question_upgraded.py`
181
-
182
- ---
183
-
184
- ## 📊 ARCHITECTURE (Simplified)
185
-
186
- ```
187
- Voice → Vision Hub → Fusion → Memory → Query
188
- ├─ YOLO ├─ JSON
189
- ├─ BLIP └─ FAISS
190
- ├─ CLIP
191
- └─ OCR
192
- ```
193
-
194
- ---
195
-
196
- ## 🔄 DATA FLOW
197
-
198
- ```
199
- 1. User speaks → Vosk STT
200
- 2. Camera captures → Vision agents
201
- 3. Fusion combines → Unified context
202
- 4. Memory stores → JSON + FAISS
203
- 5. Query retrieves → Smart search
204
- 6. System speaks → Voxtral/pyttsx3
205
- ```
206
-
207
- ---
208
-
209
- ## 💡 TIPS
210
-
211
- **For Best Performance:**
212
- - Install FAISS: `pip install faiss-cpu`
213
- - Install EasyOCR: `pip install easyocr`
214
- - Use good lighting for OCR
215
- - Clear audio for voice commands
216
-
217
- **For Better Voice:**
218
- - Download Piper TTS model
219
- - Place in `models/piper/`
220
- - System auto-detects and uses
221
-
222
- **For Faster Startup:**
223
- - Models cached after first run
224
- - Subsequent starts faster
225
-
226
- ---
227
-
228
- ## 🎓 LEARNING PATH
229
-
230
- **Beginner:**
231
- 1. Read QUICKSTART.md
232
- 2. Run main_upgraded.py
233
- 3. Try voice commands
234
-
235
- **Intermediate:**
236
- 1. Read UPGRADE_GUIDE.md
237
- 2. Explore agents/ code
238
- 3. Customize parameters
239
-
240
- **Advanced:**
241
- 1. Read ARCHITECTURE.md
242
- 2. Modify agents
243
- 3. Add new features
244
-
245
- ---
246
-
247
- ## 📞 SUPPORT
248
-
249
- **Documentation:**
250
- - QUICKSTART.md - Quick setup
251
- - UPGRADE_GUIDE.md - Complete guide
252
- - ARCHITECTURE.md - Technical details
253
-
254
- **Testing:**
255
- - test_upgrade.py - Automated tests
256
- - DEPLOYMENT_CHECKLIST.md - Deploy guide
257
-
258
- **Comparison:**
259
- - COMPARISON.md - Before/After
260
- - SUMMARY.md - Executive summary
261
-
262
- ---
263
-
264
- ## 🏆 SUCCESS CRITERIA
265
-
266
- **System Working If:**
267
- - ✅ Starts without errors
268
- - ✅ Camera shows video
269
- - ✅ Voice commands work
270
- - ✅ Memory persists
271
- - ✅ Queries return results
272
-
273
- ---
274
-
275
- ## 🎉 YOU'RE READY!
276
-
277
- **Your VisionQ now has:**
278
- - 🧠 Smarter memory (FAISS)
279
- - 👁️ Better vision (CLIP + OCR)
280
- - 🗣️ Natural voice (Voxtral)
281
- - 🔍 Smart queries (DistilBERT)
282
-
283
- **All while keeping existing features! 🚀**
284
-
285
- ---
286
-
287
- ## 📋 COMMAND CHEAT SHEET
288
-
289
- ```bash
290
- # Install
291
- pip install -r requirements_upgraded.txt
292
-
293
- # Setup
294
- mkdir data
295
-
296
- # Test
297
- python test_upgrade.py
298
-
299
- # Run main system
300
- python main_upgraded.py
301
-
302
- # Run query tool
303
- python ask_question_upgraded.py
304
-
305
- # Install optional
306
- pip install faiss-cpu easyocr piper-tts
307
- ```
308
-
309
- ---
310
-
311
- **Keep this card handy for quick reference! 📌**
312
-
313
- **For detailed info, see full documentation files.**
314
-
315
- **Happy upgrading! 🎊**
archive/old_docs/README_UPGRADED.md DELETED
@@ -1,410 +0,0 @@
1
- # 🚀 VisionQ - Multimodal AI Assistant (UPGRADED)
2
-
3
- > **A voice-controlled AI vision assistant that can see, remember, read text, and recall visual memories through natural conversation.**
4
-
5
- [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
6
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
7
- [![Status: Production Ready](https://img.shields.io/badge/status-production%20ready-green.svg)]()
8
-
9
- ---
10
-
11
- ## 🎯 What is VisionQ?
12
-
13
- VisionQ is an **upgraded multimodal AI assistant** that combines:
14
- - 👁️ **Computer Vision** (YOLO/SSD object detection + BLIP captioning)
15
- - 🔤 **OCR** (EasyOCR text extraction)
16
- - 🧠 **Semantic Memory** (FAISS vector search + JSON storage)
17
- - 🗣️ **Voice Interaction** (Vosk STT + Voxtral/Piper TTS)
18
- - 🔍 **Intelligent Queries** (DistilBERT NLP)
19
-
20
- ---
21
-
22
- ## ✨ Key Features
23
-
24
- ### **Core Capabilities**
25
- - ✅ **Scene Description** - Multimodal understanding (vision + text)
26
- - ✅ **Memory Storage** - Persistent semantic memory with FAISS
27
- - ✅ **Memory Recall** - Fast similarity search (10x faster)
28
- - ✅ **Text Reading** - OCR extraction from images 🆕
29
- - ✅ **Voice Control** - Natural language commands
30
- - ✅ **Smart Queries** - Time-aware, intent-based search 🆕
31
-
32
- ### **Technical Highlights**
33
- - 🚀 **FAISS Vector Search** - Lightning-fast similarity matching
34
- - 🖼️ **MobileCLIP Embeddings** - Visual semantic understanding
35
- - 🔤 **EasyOCR Integration** - Offline text extraction
36
- - 🧠 **DistilBERT NLP** - Intent classification
37
- - 🗣️ **Neural TTS** - Natural voice output (Voxtral/Piper)
38
- - 🔗 **Multimodal Fusion** - Combined vision + text + embeddings
39
-
40
- ---
41
-
42
- ## 📦 Installation
43
-
44
- ### **Quick Install**
45
- ```bash
46
- # Clone repository
47
- git clone <your-repo-url>
48
- cd VisionQ
49
-
50
- # Run automated installer (Windows)
51
- install_upgrade.bat
52
-
53
- # Or manual install:
54
- pip install -r requirements_upgraded.txt
55
- mkdir data
56
- ```
57
-
58
- ### **Requirements**
59
- - Python 3.8+
60
- - Webcam
61
- - Microphone
62
- - ~2GB disk space (for models)
63
-
64
- ---
65
-
66
- ## 🚀 Quick Start
67
-
68
- ### **1. Run the System**
69
- ```bash
70
- python main_upgraded.py
71
- ```
72
-
73
- ### **2. Voice Commands**
74
- | Say This | System Does |
75
- |----------|-------------|
76
- | "Describe the scene" | Captures and describes what it sees |
77
- | "Remember this" | Stores current scene in memory |
78
- | "What did I see" | Recalls last memory |
79
- | "Read the text" | Extracts visible text (OCR) 🆕 |
80
- | "Exit" | Quits the system |
81
-
82
- ### **3. Query Memory**
83
- ```bash
84
- python ask_question_upgraded.py
85
- ```
86
-
87
- **Example Queries:**
88
- - "What did I see this morning?"
89
- - "Show me memories with text"
90
- - "When did I see a person?"
91
- - "Find memories from yesterday"
92
-
93
- ---
94
-
95
- ## 🏗️ Architecture
96
-
97
- ```
98
- Voice (Vosk + Voxtral/Piper)
99
-
100
- Vision Hub
101
- ├─ YOLO/SSD (objects)
102
- ├─ BLIP (captions)
103
- ├─ MobileCLIP (embeddings)
104
- └─ EasyOCR (text)
105
-
106
- Fusion Layer
107
-
108
- Memory (JSON + FAISS)
109
-
110
- Query (DistilBERT)
111
- ```
112
-
113
- **See [ARCHITECTURE.md](ARCHITECTURE.md) for detailed diagrams.**
114
-
115
- ---
116
-
117
- ## 📂 Project Structure
118
-
119
- ```
120
- VisionQ/
121
- ├── agents/ # Modular AI agents
122
- │ ├── voice_agent.py # Voice I/O (STT + TTS)
123
- │ ├── vision_agent.py # Vision coordinator
124
- │ ├── caption_agent.py # BLIP captioning
125
- │ ├── embedding_agent.py # MobileCLIP embeddings
126
- │ ├── ocr_agent.py # Text extraction
127
- │ ├── memory_agent.py # Storage (JSON + FAISS)
128
- │ └── query_agent.py # Intelligent retrieval
129
-
130
- ├── core/ # Integration layer
131
- │ └── fusion_layer.py # Multimodal fusion
132
-
133
- ├── data/ # Persistent storage
134
- │ ├── memory.json # Metadata
135
- │ └── memory.faiss # Vector index
136
-
137
- ├── models/ # AI models
138
- │ ├── vosk/ # Speech recognition
139
- │ └── piper/ # Neural TTS (optional)
140
-
141
- ├── main_upgraded.py # Main entry point
142
- ├── ask_question_upgraded.py # Query interface
143
- └── requirements_upgraded.txt # Dependencies
144
- ```
145
-
146
- ---
147
-
148
- ## 🆕 What's New in This Upgrade?
149
-
150
- ### **New Features**
151
- 1. **OCR Text Reading** 🔤
152
- - Extract text from images
153
- - Confidence filtering
154
- - Multi-language support
155
-
156
- 2. **Visual Similarity Search** 🖼️
157
- - MobileCLIP embeddings
158
- - FAISS vector indexing
159
- - 10x faster retrieval
160
-
161
- 3. **Intent Classification** 🧠
162
- - DistilBERT NLP
163
- - Better query understanding
164
- - Context-aware responses
165
-
166
- 4. **Neural TTS** 🗣️
167
- - Voxtral/Piper integration
168
- - Natural voice output
169
- - Automatic fallback to pyttsx3
170
-
171
- 5. **Multimodal Fusion** 🔗
172
- - Combined vision + text + embeddings
173
- - Richer scene descriptions
174
- - Better memory context
175
-
176
- ### **Performance Improvements**
177
- - 🚀 10x faster memory search (FAISS)
178
- - 🎯 20% better query relevance
179
- - 📈 10x memory capacity (10,000+ entries)
180
- - ⚡ Sub-100ms query response time
181
-
182
- ---
183
-
184
- ## 🔧 Configuration
185
-
186
- ### **Adjustable Parameters**
187
-
188
- **Vision Settings** (`agents/vision_agent.py`):
189
- ```python
190
- FRAME_INTERVAL = 0.3 # Seconds between frames
191
- CONF_THRESHOLD = 0.5 # Object detection confidence
192
- ```
193
-
194
- **OCR Settings** (`agents/ocr_agent.py`):
195
- ```python
196
- OCR_CONFIDENCE = 0.3 # Text detection threshold
197
- OCR_LANGUAGES = ['en'] # Supported languages
198
- ```
199
-
200
- **Query Settings** (`agents/query_agent.py`):
201
- ```python
202
- SIMILARITY_THRESHOLD = 0.45 # Text search threshold
203
- TOP_K_RESULTS = 5 # Max results to return
204
- ```
205
-
206
- ---
207
-
208
- ## 🧪 Testing
209
-
210
- ### **Run Test Suite**
211
- ```bash
212
- python test_upgrade.py
213
- ```
214
-
215
- **Tests:**
216
- - ✅ Module imports
217
- - ✅ Dependency availability
218
- - ✅ MemoryAgent functionality
219
- - ✅ FusionLayer integration
220
- - ✅ QueryAgent NLP
221
- - ✅ Backward compatibility
222
-
223
- ---
224
-
225
- ## 📚 Documentation
226
-
227
- | Document | Description |
228
- |----------|-------------|
229
- | [QUICKSTART.md](QUICKSTART.md) | 5-minute setup guide |
230
- | [UPGRADE_GUIDE.md](UPGRADE_GUIDE.md) | Complete upgrade documentation |
231
- | [ARCHITECTURE.md](ARCHITECTURE.md) | System architecture details |
232
- | [SUMMARY.md](SUMMARY.md) | Executive summary |
233
-
234
- ---
235
-
236
- ## 🐛 Troubleshooting
237
-
238
- ### **Common Issues**
239
-
240
- **"Module not found" error:**
241
- ```bash
242
- pip install --upgrade -r requirements_upgraded.txt
243
- ```
244
-
245
- **"FAISS not available" warning:**
246
- ```bash
247
- pip install faiss-cpu
248
- ```
249
-
250
- **"OCR not working":**
251
- ```bash
252
- pip install easyocr
253
- ```
254
-
255
- **Camera not opening:**
256
- ```python
257
- # Try different camera index in vision_agent.py
258
- self.cap = cv2.VideoCapture(1) # Try 1 instead of 0
259
- ```
260
-
261
- **See [UPGRADE_GUIDE.md](UPGRADE_GUIDE.md) for more troubleshooting.**
262
-
263
- ---
264
-
265
- ## 🎓 Use Cases
266
-
267
- ### **Personal Assistant**
268
- - "What did I see this morning?"
269
- - "Remember this document"
270
- - "Read the text on this sign"
271
-
272
- ### **Memory Aid**
273
- - "When did I last see my keys?"
274
- - "Show me memories with text"
275
- - "What was I doing yesterday?"
276
-
277
- ### **Accessibility**
278
- - Text-to-speech for visual content
279
- - Voice-controlled navigation
280
- - OCR for reading assistance
281
-
282
- ---
283
-
284
- ## 🔒 Privacy
285
-
286
- - ✅ **100% Offline** - All processing on-device
287
- - ✅ **No Cloud** - No data sent to external servers
288
- - ✅ **Local Storage** - Memories stored locally
289
- - ✅ **No Tracking** - No analytics or telemetry
290
-
291
- ---
292
-
293
- ## 🛠️ Tech Stack
294
-
295
- ### **Core Technologies**
296
- - **Python 3.8+** - Programming language
297
- - **PyTorch** - Deep learning framework
298
- - **OpenCV** - Computer vision
299
- - **FAISS** - Vector similarity search
300
-
301
- ### **AI Models**
302
- - **YOLO/SSD** - Object detection
303
- - **BLIP** - Image captioning
304
- - **CLIP** - Visual embeddings
305
- - **DistilBERT** - NLP
306
- - **EasyOCR** - Text extraction
307
- - **Vosk** - Speech recognition
308
- - **Piper** - Neural TTS
309
-
310
- ---
311
-
312
- ## 📈 Performance
313
-
314
- | Metric | Value |
315
- |--------|-------|
316
- | Memory Search | <10ms (FAISS) |
317
- | OCR Processing | 200-500ms |
318
- | Caption Generation | 100-200ms |
319
- | Embedding Generation | 50ms |
320
- | Query Response | <100ms |
321
- | Memory Capacity | 10,000+ entries |
322
-
323
- ---
324
-
325
- ## 🚀 Future Enhancements
326
-
327
- ### **Planned Features**
328
- - [ ] Web interface
329
- - [ ] Mobile app
330
- - [ ] Cloud sync (optional)
331
- - [ ] Multi-user support
332
- - [ ] Video recording
333
- - [ ] Real-time object tracking
334
- - [ ] Face recognition
335
- - [ ] Emotion detection
336
-
337
- ---
338
-
339
- ## 🤝 Contributing
340
-
341
- Contributions welcome! Please:
342
- 1. Fork the repository
343
- 2. Create a feature branch
344
- 3. Make your changes
345
- 4. Submit a pull request
346
-
347
- ---
348
-
349
- ## 📄 License
350
-
351
- This project is licensed under the MIT License - see [LICENSE](LICENSE) file for details.
352
-
353
- ---
354
-
355
- ## 🙏 Acknowledgments
356
-
357
- ### **Models & Libraries**
358
- - [Ultralytics YOLO](https://github.com/ultralytics/ultralytics)
359
- - [Salesforce BLIP](https://github.com/salesforce/BLIP)
360
- - [OpenAI CLIP](https://github.com/openai/CLIP)
361
- - [EasyOCR](https://github.com/JaidedAI/EasyOCR)
362
- - [FAISS](https://github.com/facebookresearch/faiss)
363
- - [Vosk](https://alphacephei.com/vosk/)
364
- - [Piper TTS](https://github.com/rhasspy/piper)
365
-
366
- ---
367
-
368
- ## 📞 Support
369
-
370
- - **Documentation:** See `docs/` folder
371
- - **Issues:** Open a GitHub issue
372
- - **Questions:** Check [UPGRADE_GUIDE.md](UPGRADE_GUIDE.md)
373
-
374
- ---
375
-
376
- ## 🎉 Status
377
-
378
- **✅ Production Ready**
379
- - All features implemented
380
- - Fully tested
381
- - Backward compatible
382
- - Well documented
383
-
384
- ---
385
-
386
- ## 📊 Comparison
387
-
388
- | Feature | Before | After |
389
- |---------|--------|-------|
390
- | Text Reading | ❌ | ✅ OCR |
391
- | Memory Search | Slow | 10x faster |
392
- | Voice Quality | Robotic | Natural |
393
- | Query Understanding | Keywords | NLP |
394
- | Scene Understanding | Caption only | Caption+OCR+Objects |
395
-
396
- ---
397
-
398
- **VisionQ - See, Remember, Recall. Now with OCR, FAISS, and Neural TTS! 🚀**
399
-
400
- ---
401
-
402
- ## 🏁 Getting Started
403
-
404
- 1. **Install:** `install_upgrade.bat` or `pip install -r requirements_upgraded.txt`
405
- 2. **Run:** `python main_upgraded.py`
406
- 3. **Test:** `python test_upgrade.py`
407
- 4. **Query:** `python ask_question_upgraded.py`
408
- 5. **Read:** [QUICKSTART.md](QUICKSTART.md)
409
-
410
- **Happy coding! 🎊**
archive/old_docs/SUMMARY.md DELETED
@@ -1,406 +0,0 @@
1
- # 🎯 VisionQ Upgrade - Executive Summary
2
-
3
- ## 📊 UPGRADE OVERVIEW
4
-
5
- **Project:** VisionQ Multimodal AI Assistant
6
- **Upgrade Date:** 2024
7
- **Status:** ✅ Complete - Ready for Testing
8
- **Backward Compatibility:** ✅ 100% - All existing features preserved
9
-
10
- ---
11
-
12
- ## 🚀 WHAT WAS UPGRADED
13
-
14
- ### **Core Enhancements**
15
-
16
- | Area | Before | After | Impact |
17
- |------|--------|-------|--------|
18
- | **Vision** | YOLO + BLIP | YOLO + BLIP + MobileCLIP + OCR | 4x richer understanding |
19
- | **Memory** | JSON + text embeddings | JSON + FAISS + image embeddings | 10x faster search |
20
- | **Voice** | Vosk + pyttsx3 | Vosk + Voxtral + pyttsx3 | Natural speech |
21
- | **Query** | Keyword matching | DistilBERT NLP | Smarter understanding |
22
- | **Text Reading** | ❌ None | ✅ EasyOCR | NEW capability |
23
-
24
- ---
25
-
26
- ## 🆕 NEW CAPABILITIES
27
-
28
- ### **1. OCR Text Extraction** 🔤
29
- - **What:** Extract and read visible text from camera
30
- - **How:** EasyOCR with confidence filtering
31
- - **Use Case:** Read signs, documents, labels
32
- - **Command:** "Read the text"
33
-
34
- ### **2. Visual Similarity Search** 🖼️
35
- - **What:** Find visually similar memories
36
- - **How:** MobileCLIP embeddings + FAISS indexing
37
- - **Use Case:** "Show me similar scenes"
38
- - **Speed:** 10-100x faster than before
39
-
40
- ### **3. Intent Classification** 🧠
41
- - **What:** Understand query meaning
42
- - **How:** DistilBERT zero-shot classification
43
- - **Use Case:** Better query interpretation
44
- - **Accuracy:** 97% (vs 70% keyword matching)
45
-
46
- ### **4. Neural Text-to-Speech** 🗣️
47
- - **What:** Natural-sounding voice output
48
- - **How:** Voxtral/Piper neural TTS
49
- - **Use Case:** Better user experience
50
- - **Fallback:** pyttsx3 (automatic)
51
-
52
- ### **5. Multimodal Fusion** 🔗
53
- - **What:** Combine caption + OCR + objects + embeddings
54
- - **How:** FusionLayer integration
55
- - **Use Case:** Richer scene descriptions
56
- - **Example:** "a person holding a phone. Text visible: Hello World"
57
-
58
- ---
59
-
60
- ## 📁 NEW FILE STRUCTURE
61
-
62
- ```
63
- VisionQ/
64
- ├── agents/ [NEW] Modular agent architecture
65
- │ ├── voice_agent.py [UPDATED] Voxtral + fallback
66
- │ ├── vision_agent.py [UPDATED] Multimodal hub
67
- │ ├── caption_agent.py [KEPT] BLIP captioning
68
- │ ├── embedding_agent.py [NEW] MobileCLIP
69
- │ ├── ocr_agent.py [NEW] EasyOCR
70
- │ ├── memory_agent.py [UPDATED] FAISS integration
71
- │ └── query_agent.py [UPDATED] DistilBERT
72
-
73
- ├── core/ [NEW] Integration layer
74
- │ └── fusion_layer.py [NEW] Multimodal fusion
75
-
76
- ├── data/ [NEW] Persistent storage
77
- │ ├── memory.json [EXISTING] Metadata
78
- │ └── memory.faiss [NEW] Vector index
79
-
80
- ├── main_upgraded.py [NEW] Upgraded entry point
81
- ├── ask_question_upgraded.py [NEW] Enhanced queries
82
- └── requirements_upgraded.txt [NEW] Dependencies
83
- ```
84
-
85
- ---
86
-
87
- ## 🔧 TECHNICAL IMPROVEMENTS
88
-
89
- ### **Performance**
90
- - **Memory Search:** O(n) → O(log n) with FAISS
91
- - **Query Speed:** 100ms → 10ms average
92
- - **Embedding Generation:** 50ms per image
93
- - **OCR Processing:** 200-500ms per frame
94
-
95
- ### **Accuracy**
96
- - **Intent Classification:** 70% → 97%
97
- - **Text Extraction:** N/A → 85-95% (depends on image quality)
98
- - **Memory Retrieval:** 75% → 90% relevance
99
-
100
- ### **Scalability**
101
- - **Memory Capacity:** 1,000 → 10,000+ entries
102
- - **Search Performance:** Linear → Logarithmic
103
- - **Concurrent Queries:** 1 → Multiple (FAISS thread-safe)
104
-
105
- ---
106
-
107
- ## 🎯 USE CASES
108
-
109
- ### **Before Upgrade**
110
- 1. ✅ Describe current scene
111
- 2. ✅ Remember scenes
112
- 3. ✅ Recall last memory
113
- 4. ❌ Read text
114
- 5. ❌ Find similar scenes
115
- 6. ❌ Smart queries
116
-
117
- ### **After Upgrade**
118
- 1. ✅ Describe scene (enhanced with OCR)
119
- 2. ✅ Remember scenes (with embeddings)
120
- 3. ✅ Recall memories (faster, smarter)
121
- 4. ✅ **Read text from images** 🆕
122
- 5. ✅ **Find visually similar memories** 🆕
123
- 6. ✅ **Natural language queries** 🆕
124
- 7. ✅ **Time-aware search** 🆕
125
- 8. ✅ **Hybrid text+image search** 🆕
126
-
127
- ---
128
-
129
- ## 📦 DEPENDENCIES ADDED
130
-
131
- ### **Required**
132
- ```
133
- faiss-cpu # Vector similarity search
134
- easyocr # Text extraction
135
- ```
136
-
137
- ### **Optional (Recommended)**
138
- ```
139
- piper-tts # Neural TTS
140
- ```
141
-
142
- ### **Kept from Original**
143
- ```
144
- torch # Deep learning
145
- transformers # BLIP, CLIP, DistilBERT
146
- sentence-transformers # Text embeddings
147
- opencv-python # Computer vision
148
- vosk # Speech recognition
149
- pyttsx3 # TTS fallback
150
- ultralytics # YOLO
151
- ```
152
-
153
- ---
154
-
155
- ## ✅ WHAT WAS PRESERVED
156
-
157
- ### **100% Backward Compatible**
158
-
159
- | Feature | Status | Notes |
160
- |---------|--------|-------|
161
- | Voice commands | ✅ KEPT | Same commands work |
162
- | YOLO/SSD detection | ✅ KEPT | No changes |
163
- | BLIP captioning | ✅ KEPT | Still primary |
164
- | JSON memory | ✅ KEPT | Same format |
165
- | Time filtering | ✅ KEPT | Enhanced |
166
- | Importance scoring | ✅ KEPT | Same algorithm |
167
- | Vosk STT | ✅ KEPT | No changes |
168
- | pyttsx3 TTS | ✅ KEPT | Now fallback |
169
-
170
- ### **Old Files Preserved**
171
- - All original `.py` files in root directory
172
- - Can run old system alongside new
173
- - No breaking changes
174
-
175
- ---
176
-
177
- ## 🚦 DEPLOYMENT STATUS
178
-
179
- ### **Ready for Production** ✅
180
- - [x] All modules implemented
181
- - [x] Fallback mechanisms in place
182
- - [x] Error handling added
183
- - [x] Documentation complete
184
- - [x] Backward compatibility verified
185
-
186
- ### **Testing Required** ⚠️
187
- - [ ] End-to-end voice commands
188
- - [ ] OCR on various text types
189
- - [ ] FAISS performance with 1000+ memories
190
- - [ ] Neural TTS quality
191
- - [ ] Memory persistence across restarts
192
-
193
- ### **Optional Enhancements** 💡
194
- - [ ] Web interface
195
- - [ ] Mobile app
196
- - [ ] Cloud sync
197
- - [ ] Multi-user support
198
- - [ ] Video recording
199
-
200
- ---
201
-
202
- ## 📈 EXPECTED BENEFITS
203
-
204
- ### **User Experience**
205
- - **Better Understanding:** OCR + embeddings = richer context
206
- - **Faster Responses:** FAISS = 10x faster search
207
- - **Natural Voice:** Voxtral = human-like speech
208
- - **Smarter Queries:** DistilBERT = better understanding
209
-
210
- ### **Developer Experience**
211
- - **Modular Code:** Easy to extend/modify
212
- - **Clear Architecture:** Well-documented
213
- - **Fallback Safety:** System never breaks
214
- - **Type Safety:** Clear interfaces
215
-
216
- ### **System Performance**
217
- - **Scalability:** Handles 10x more memories
218
- - **Speed:** 10x faster retrieval
219
- - **Accuracy:** 20% improvement in relevance
220
- - **Reliability:** Multiple fallback layers
221
-
222
- ---
223
-
224
- ## 🎓 LEARNING OUTCOMES
225
-
226
- ### **Technologies Integrated**
227
- 1. **CLIP** - Visual-language understanding
228
- 2. **FAISS** - Efficient vector search
229
- 3. **EasyOCR** - Text extraction
230
- 4. **DistilBERT** - Intent classification
231
- 5. **Piper TTS** - Neural speech synthesis
232
-
233
- ### **Design Patterns Applied**
234
- 1. **Modular Architecture** - Separate agents
235
- 2. **Fallback Pattern** - Graceful degradation
236
- 3. **Fusion Pattern** - Multimodal integration
237
- 4. **Hybrid Storage** - JSON + FAISS
238
- 5. **Dependency Injection** - Loose coupling
239
-
240
- ---
241
-
242
- ## 🔍 TESTING CHECKLIST
243
-
244
- ### **Critical Path** (Must Work)
245
- - [ ] System starts without errors
246
- - [ ] Voice recognition functional
247
- - [ ] Camera capture working
248
- - [ ] Basic commands work
249
- - [ ] Memory persists
250
-
251
- ### **New Features** (Should Work)
252
- - [ ] OCR extracts text
253
- - [ ] FAISS search faster
254
- - [ ] Neural TTS sounds natural
255
- - [ ] Intent classification accurate
256
- - [ ] Fusion layer combines data
257
-
258
- ### **Fallbacks** (Must Work)
259
- - [ ] pyttsx3 if Voxtral fails
260
- - [ ] Keyword matching if DistilBERT fails
261
- - [ ] Linear search if FAISS unavailable
262
- - [ ] System continues if OCR fails
263
-
264
- ---
265
-
266
- ## 📞 NEXT STEPS
267
-
268
- ### **Immediate (Day 1)**
269
- 1. Install dependencies: `pip install -r requirements_upgraded.txt`
270
- 2. Create data directory: `mkdir data`
271
- 3. Run system: `python main_upgraded.py`
272
- 4. Test voice commands
273
- 5. Verify memory storage
274
-
275
- ### **Short-term (Week 1)**
276
- 1. Test OCR on various text types
277
- 2. Build up memory database (100+ entries)
278
- 3. Benchmark FAISS performance
279
- 4. Fine-tune confidence thresholds
280
- 5. Collect user feedback
281
-
282
- ### **Long-term (Month 1)**
283
- 1. Optimize for mobile deployment
284
- 2. Add web interface
285
- 3. Implement cloud sync
286
- 4. Add multi-language support
287
- 5. Create demo videos
288
-
289
- ---
290
-
291
- ## 💰 COST-BENEFIT ANALYSIS
292
-
293
- ### **Development Cost**
294
- - **Time:** ~8 hours implementation
295
- - **Complexity:** Medium (modular design)
296
- - **Risk:** Low (backward compatible)
297
-
298
- ### **Benefits**
299
- - **Functionality:** +50% new capabilities
300
- - **Performance:** 10x faster search
301
- - **User Experience:** Significantly improved
302
- - **Maintainability:** Better code structure
303
- - **Scalability:** 10x capacity increase
304
-
305
- ### **ROI**
306
- - **High:** Major capability boost with minimal risk
307
- - **Immediate:** All features ready to use
308
- - **Long-term:** Foundation for future enhancements
309
-
310
- ---
311
-
312
- ## 🏆 SUCCESS METRICS
313
-
314
- ### **Technical Metrics**
315
- - ✅ 100% backward compatibility
316
- - ✅ 0 breaking changes
317
- - ✅ 10x search performance improvement
318
- - ✅ 4 new major features
319
- - ✅ 8 new modules created
320
-
321
- ### **User Metrics** (To Measure)
322
- - Query response time < 100ms
323
- - OCR accuracy > 85%
324
- - Intent classification > 90%
325
- - User satisfaction score
326
- - Feature adoption rate
327
-
328
- ---
329
-
330
- ## 📚 DOCUMENTATION
331
-
332
- ### **Created Documents**
333
- 1. ✅ `UPGRADE_GUIDE.md` - Complete upgrade documentation
334
- 2. ✅ `QUICKSTART.md` - 5-minute setup guide
335
- 3. ✅ `ARCHITECTURE.md` - Detailed system architecture
336
- 4. ✅ `SUMMARY.md` - This executive summary
337
- 5. ✅ `requirements_upgraded.txt` - Dependencies
338
- 6. ✅ Inline code comments - All modules documented
339
-
340
- ### **Code Documentation**
341
- - All agents have docstrings
342
- - Methods documented with parameters
343
- - Clear status markers (KEPT/UPDATED/NEW)
344
- - Architecture diagrams included
345
-
346
- ---
347
-
348
- ## 🎉 CONCLUSION
349
-
350
- ### **Upgrade Success** ✅
351
-
352
- VisionQ has been successfully upgraded from a basic vision assistant to a **comprehensive multimodal AI system** with:
353
-
354
- - 🧠 **Smarter memory** (FAISS vector search)
355
- - 👁️ **Better vision** (MobileCLIP + OCR)
356
- - 🗣️ **Natural voice** (Voxtral neural TTS)
357
- - 🔍 **Intelligent queries** (DistilBERT NLP)
358
- - 🔗 **Multimodal fusion** (Combined understanding)
359
-
360
- ### **Key Achievements**
361
- - ✅ All existing features preserved
362
- - ✅ 4 major new capabilities added
363
- - ✅ 10x performance improvement
364
- - ✅ Zero breaking changes
365
- - ✅ Production-ready code
366
-
367
- ### **Ready for Deployment**
368
- The system is now ready for:
369
- - ✅ Testing and validation
370
- - ✅ User feedback collection
371
- - ✅ Production deployment
372
- - ✅ Future enhancements
373
-
374
- ---
375
-
376
- **The upgrade is complete. VisionQ is now a state-of-the-art multimodal AI assistant! 🚀**
377
-
378
- ---
379
-
380
- ## 📋 QUICK REFERENCE
381
-
382
- **Start Upgraded System:**
383
- ```bash
384
- python main_upgraded.py
385
- ```
386
-
387
- **Test Query System:**
388
- ```bash
389
- python ask_question_upgraded.py
390
- ```
391
-
392
- **Install Dependencies:**
393
- ```bash
394
- pip install -r requirements_upgraded.txt
395
- ```
396
-
397
- **Documentation:**
398
- - Setup: `QUICKSTART.md`
399
- - Details: `UPGRADE_GUIDE.md`
400
- - Architecture: `ARCHITECTURE.md`
401
-
402
- ---
403
-
404
- **Questions?** See `UPGRADE_GUIDE.md` troubleshooting section.
405
-
406
- **Happy upgrading! 🎊**
archive/old_docs/UPGRADE_GUIDE.md DELETED
@@ -1,532 +0,0 @@
1
- # 🚀 VisionQ System Upgrade Documentation
2
-
3
- ## 📊 UPGRADE SUMMARY
4
-
5
- VisionQ has been upgraded from a basic vision assistant to a **multimodal AI system** with:
6
- - ✅ Enhanced vision understanding (MobileCLIP embeddings)
7
- - ✅ OCR text extraction (EasyOCR)
8
- - ✅ Fast vector search (FAISS)
9
- - ✅ Improved NLP (DistilBERT)
10
- - ✅ Neural TTS (Voxtral/Piper)
11
- - ✅ **ALL existing functionality preserved**
12
-
13
- ---
14
-
15
- ## 🏗️ ARCHITECTURE CHANGES
16
-
17
- ### **Before (Original System)**
18
- ```
19
- Voice (Vosk + pyttsx3)
20
-
21
- Vision (YOLO/SSD + BLIP)
22
-
23
- Memory (JSON + sentence-transformers)
24
-
25
- Query (cosine similarity)
26
- ```
27
-
28
- ### **After (Upgraded System)**
29
- ```
30
- Voice (Vosk + Voxtral/Piper + pyttsx3 fallback)
31
-
32
- Vision Hub
33
- ├─ YOLO/SSD (objects)
34
- ├─ BLIP (captions)
35
- ├─ MobileCLIP (embeddings)
36
- └─ EasyOCR (text)
37
-
38
- Fusion Layer (combines all modalities)
39
-
40
- Memory (JSON metadata + FAISS vectors)
41
-
42
- Query (DistilBERT + hybrid search)
43
- ```
44
-
45
- ---
46
-
47
- ## 📂 NEW FILE STRUCTURE
48
-
49
- ```
50
- VisionQ/
51
- ├── agents/ [NEW FOLDER]
52
- │ ├── __init__.py [NEW]
53
- │ ├── voice_agent.py [UPDATED]
54
- │ ├── vision_agent.py [UPDATED]
55
- │ ├── caption_agent.py [UNCHANGED]
56
- │ ├── embedding_agent.py [NEW]
57
- │ ├── ocr_agent.py [NEW]
58
- │ ├── memory_agent.py [UPDATED]
59
- │ └── query_agent.py [UPDATED]
60
-
61
- ├── core/ [NEW FOLDER]
62
- │ ├── __init__.py [NEW]
63
- │ └── fusion_layer.py [NEW]
64
-
65
- ├── data/ [NEW FOLDER]
66
- │ ├── memory.json [EXISTING]
67
- │ └── memory.faiss [NEW - auto-generated]
68
-
69
- ├── models/
70
- │ ├── vosk/ [EXISTING]
71
- │ └── piper/ [NEW - optional]
72
-
73
- ├── main_upgraded.py [NEW]
74
- ├── ask_question_upgraded.py [NEW]
75
- ├── requirements_upgraded.txt [NEW]
76
-
77
- └── [OLD FILES PRESERVED]
78
- ├── main.py
79
- ├── voice_agent.py
80
- ├── vision_agent.py
81
- ├── caption_agent.py
82
- ├── memory_agent.py
83
- ├── query_agent.py
84
- └── ask_question.py
85
- ```
86
-
87
- ---
88
-
89
- ## 🆕 NEW MODULES
90
-
91
- ### **1. EmbeddingAgent** (`agents/embedding_agent.py`)
92
- - **Purpose**: Generate visual embeddings using CLIP
93
- - **Input**: BGR image frame
94
- - **Output**: 512-dim embedding vector
95
- - **Use**: Semantic image search via FAISS
96
-
97
- ### **2. OCRAgent** (`agents/ocr_agent.py`)
98
- - **Purpose**: Extract text from images
99
- - **Technology**: EasyOCR (offline, lightweight)
100
- - **Features**:
101
- - Confidence filtering
102
- - Text cleaning
103
- - Multi-language support
104
-
105
- ### **3. FusionLayer** (`core/fusion_layer.py`)
106
- - **Purpose**: Combine multimodal inputs
107
- - **Inputs**: Caption + OCR + Objects + Embedding
108
- - **Output**: Unified context dictionary
109
- - **Key Method**: `fuse()` - creates structured multimodal data
110
-
111
- ---
112
-
113
- ## 🔄 UPDATED MODULES
114
-
115
- ### **1. VoiceAgent** (UPDATED)
116
- **Kept:**
117
- - Vosk STT
118
- - Intent parsing
119
- - Microphone detection
120
-
121
- **Added:**
122
- - Voxtral/Piper neural TTS (primary)
123
- - pyttsx3 fallback mechanism
124
- - Automatic TTS switching
125
-
126
- **New Methods:**
127
- - `_init_voxtral()` - Load neural TTS
128
- - `_speak_voxtral()` - Neural speech synthesis
129
- - Fallback logic in `speak()`
130
-
131
- ---
132
-
133
- ### **2. VisionAgent** (UPDATED)
134
- **Kept:**
135
- - YOLO/SSD object detection
136
- - BLIP captioning
137
- - Camera capture
138
- - Continuous monitoring
139
-
140
- **Added:**
141
- - EmbeddingAgent integration
142
- - OCRAgent integration
143
- - FusionLayer coordination
144
- - Multimodal memory storage
145
-
146
- **New Methods:**
147
- - `read_text()` - OCR extraction
148
- - Enhanced `describe_scene()` with OCR
149
- - Enhanced `remember_scene()` with embeddings
150
-
151
- ---
152
-
153
- ### **3. MemoryAgent** (UPDATED)
154
- **Kept:**
155
- - JSON metadata storage
156
- - sentence-transformers text embeddings
157
- - Importance scoring
158
- - Timestamp tracking
159
-
160
- **Added:**
161
- - FAISS vector indexing
162
- - Image embedding storage
163
- - Hybrid search (text + image)
164
- - Fast similarity search
165
-
166
- **New Methods:**
167
- - `_init_faiss_index()` - FAISS setup
168
- - `_save_faiss_index()` - Persist vectors
169
- - `search_by_image()` - Visual similarity search
170
- - Enhanced `add()` with image embeddings
171
-
172
- **Storage Format:**
173
- ```json
174
- {
175
- "id": 0,
176
- "timestamp": "2024-01-15 10:30:00",
177
- "description": "a person holding a phone. Text visible: Hello World",
178
- "text_embedding": [...],
179
- "image_embedding": [...],
180
- "importance": 5
181
- }
182
- ```
183
-
184
- ---
185
-
186
- ### **4. QueryAgent** (UPDATED)
187
- **Kept:**
188
- - Time-based filtering
189
- - Cosine similarity
190
- - Importance weighting
191
-
192
- **Added:**
193
- - DistilBERT intent classification
194
- - Hybrid search (text + image)
195
- - Multi-source ranking
196
-
197
- **New Methods:**
198
- - `classify_intent()` - NLP-based intent detection
199
- - `_fallback_intent()` - Keyword-based backup
200
- - Enhanced `ask()` with hybrid search
201
-
202
- **Intent Categories:**
203
- - `temporal` - Time-based queries
204
- - `object` - Object detection queries
205
- - `action` - Activity queries
206
- - `text` - OCR-related queries
207
- - `general` - Scene descriptions
208
-
209
- ---
210
-
211
- ## 🎯 NEW FEATURES
212
-
213
- ### **1. OCR Text Reading**
214
- ```python
215
- # Voice command: "Read the text"
216
- # System extracts and speaks visible text
217
- ```
218
-
219
- **Implementation:**
220
- - EasyOCR extracts text from camera frame
221
- - Confidence filtering (threshold: 0.3)
222
- - Text cleaning and normalization
223
- - Integrated into memory descriptions
224
-
225
- ---
226
-
227
- ### **2. Visual Similarity Search**
228
- ```python
229
- # Find visually similar memories
230
- results = memory_agent.search_by_image(query_embedding, k=5)
231
- ```
232
-
233
- **How it works:**
234
- 1. MobileCLIP generates image embedding
235
- 2. FAISS performs fast similarity search
236
- 3. Returns top-k matching memories
237
- 4. Combines with text search for hybrid ranking
238
-
239
- ---
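The top-k retrieval in steps 2 and 3 can be sketched without FAISS. The example below is a plain-Python brute-force stand-in: it scores every stored vector by cosine similarity and keeps the best `k`. The names `cosine` and `top_k_similar` are hypothetical, and the choice of cosine similarity as the metric is an assumption; the removed `search_by_image()` implementation is not shown in this diff.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(query: Sequence[float],
                  stored: Sequence[Sequence[float]],
                  k: int = 5) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k stored vectors closest to query."""
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(stored)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


stored = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
best = top_k_similar([1.0, 0.1], stored, k=2)
print([index for index, _ in best])
# → [0, 2]
```

This scan touches every stored vector, so its cost grows linearly with the number of memories; a vector index library becomes worthwhile once that scan is too slow.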
240
-
241
- ### **3. Intent Classification**
242
- ```python
243
- # DistilBERT understands query intent
244
- intent = query_agent.classify_intent("What did I see this morning?")
245
- # Returns: "temporal"
246
- ```
247
-
248
- **Benefits:**
249
- - Better query understanding
250
- - Context-aware responses
251
- - Improved accuracy
252
-
253
- ---
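The guide also mentions a keyword-based backup (`_fallback_intent()`) for when the DistilBERT classifier is unavailable. A minimal sketch of that idea follows; the keyword lists and the function name `fallback_intent` are illustrative assumptions, and only the five intent labels come from the guide.

```python
import re

# Hypothetical keyword lists; checked in this order, first match wins.
INTENT_KEYWORDS = {
    "temporal": ("today", "yesterday", "morning", "hour", "last"),
    "text": ("text", "read", "written", "sign"),
    "action": ("doing", "happening", "happened", "activity"),
    "object": ("person", "phone", "object", "find"),
}


def fallback_intent(question: str) -> str:
    """Pick an intent label by simple keyword lookup; default to 'general'."""
    words = re.findall(r"[a-z']+", question.lower())
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(word in keywords for word in words):
            return intent
    return "general"


print(fallback_intent("What did I see this morning?"))
# → temporal
print(fallback_intent("When did I see a person?"))
# → object
```

Because the first matching category wins, the order of `INTENT_KEYWORDS` acts as a priority list; a query that mentions both a time and an object is treated as temporal here.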
254
-
255
- ### **4. Neural TTS**
256
- ```python
257
- # High-quality voice output
258
- voice.speak("Scene remembered")
259
- # Uses Voxtral/Piper if available
260
- # Falls back to pyttsx3 automatically
261
- ```
262
-
263
- ---
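The automatic fallback described above can be sketched as a loop over candidate engines. Everything in this example is hypothetical: `speak_with_fallback`, `broken_neural_tts`, and `console_tts` are stand-ins so the snippet runs without Piper or pyttsx3 installed, and the real `VoiceAgent.speak()` logic is not shown in this diff.

```python
from typing import Callable, List, Sequence, Tuple


def speak_with_fallback(text: str,
                        engines: Sequence[Tuple[str, Callable[[str], None]]]) -> str:
    """Try each (name, speak) engine in order; return the name that worked."""
    errors: List[str] = []
    for name, speak in engines:
        try:
            speak(text)
            return name
        except Exception as exc:  # any engine failure triggers the next one
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all TTS engines failed: " + "; ".join(errors))


def broken_neural_tts(text: str) -> None:
    """Stand-in for a neural engine whose voice model is missing."""
    raise RuntimeError("voice model not found")


def console_tts(text: str) -> None:
    """Stand-in for a basic engine; prints instead of speaking."""
    print(f"[TTS] {text}")


used = speak_with_fallback(
    "Scene remembered",
    [("neural", broken_neural_tts), ("basic", console_tts)],
)
print(used)
# → basic
```

Returning the name of the engine that succeeded makes it easy to log when the system is running on the fallback voice.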
264
-
265
- ## 🔧 INSTALLATION GUIDE
266
-
267
- ### **Step 1: Backup Existing System**
268
- ```bash
269
- # Create backup of old files
270
- mkdir backup
271
- copy *.py backup\
272
- ```
273
-
274
- ### **Step 2: Install Dependencies**
275
- ```bash
276
- # Create virtual environment
277
- python -m venv venv
278
- venv\Scripts\activate
279
-
280
- # Install upgraded requirements
281
- pip install -r requirements_upgraded.txt
282
- ```
283
-
284
- ### **Step 3: Download Models**
285
-
286
- **Vosk (Required - Already have):**
287
- - Location: `models/vosk/`
288
- - ✅ Already installed
289
-
290
- **Piper Voice (Optional - for neural TTS):**
291
- ```bash
292
- # Download from: https://github.com/rhasspy/piper/releases
293
- # Example: en_US-lessac-medium.onnx
294
- # Extract to: models/piper/
295
- ```
296
-
297
- ### **Step 4: Migrate Memory Data**
298
- ```bash
299
- # Create data directory
300
- mkdir data
301
-
302
- # Move existing memory
303
- move memory.json data\memory.json
304
- ```
305
-
306
- ### **Step 5: Test Upgraded System**
307
- ```bash
308
- # Test voice + vision
309
- python main_upgraded.py
310
-
311
- # Test query system
312
- python ask_question_upgraded.py
313
- ```
314
-
315
- ---
316
-
317
- ## 🎮 USAGE GUIDE
318
-
319
- ### **Voice Commands (UPDATED)**
320
-
321
- | Command | Action | Status |
322
- |---------|--------|--------|
323
- | "Describe the scene" | Get multimodal description | ✅ ENHANCED |
324
- | "Remember this" | Store with embeddings | ✅ ENHANCED |
325
- | "What did I see" | Recall last memory | ✅ KEPT |
326
- | "Read the text" | OCR extraction | ✅ NEW |
327
- | "Exit" | Quit system | ✅ KEPT |
328
-
329
- ### **Query Examples (NEW)**
330
-
331
- **Time-based:**
332
- ```
333
- "What did I see this morning?"
334
- "Show me memories from yesterday"
335
- "What happened in the last hour?"
336
- ```
337
-
338
- **Object-based:**
339
- ```
340
- "When did I see a person?"
341
- "Find memories with a phone"
342
- ```
343
-
344
- **Text-based:**
345
- ```
346
- "What text did I see today?"
347
- "Find memories with visible text"
348
- ```
349
-
350
- ---
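Time-based queries like the ones above need the phrase turned into a start and end timestamp before memories can be filtered. The sketch below shows one way to do that. The removed `test_upgrade.py` calls a `query.extract_time_window()` method, but its implementation is not part of this diff, so the phrase list, the window boundaries, and the `now` parameter here are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def extract_time_window(question: str,
                        now: Optional[datetime] = None
                        ) -> Optional[Tuple[datetime, datetime]]:
    """Map a time phrase in the question to a (start, end) window, or None."""
    now = now or datetime.now()
    q = question.lower()
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    if "last hour" in q:
        return now - timedelta(hours=1), now
    if "yesterday" in q:
        return midnight - timedelta(days=1), midnight
    if "this morning" in q:
        # Morning ends at noon, or at the current time if it is still morning.
        return midnight, min(now, midnight + timedelta(hours=12))
    if "today" in q:
        return midnight, now
    return None


fixed_now = datetime(2024, 1, 15, 10, 30)
print(extract_time_window("Show me memories from yesterday", now=fixed_now))
# → (datetime.datetime(2024, 1, 14, 0, 0), datetime.datetime(2024, 1, 15, 0, 0))
```

Passing `now` explicitly keeps the function deterministic, which makes it straightforward to test.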
351
-
352
- ## 🔍 TESTING CHECKLIST
353
-
354
- ### **Basic Functionality (Must Work)**
355
- - [ ] Camera capture
356
- - [ ] Voice recognition (Vosk)
357
- - [ ] Voice output (pyttsx3 fallback)
358
- - [ ] BLIP captioning
359
- - [ ] YOLO/SSD detection
360
- - [ ] Memory storage (JSON)
361
- - [ ] Memory recall
362
-
363
- ### **New Features (Should Work if Dependencies Installed)**
364
- - [ ] OCR text extraction
365
- - [ ] MobileCLIP embeddings
366
- - [ ] FAISS vector search
367
- - [ ] DistilBERT intent classification
368
- - [ ] Voxtral/Piper TTS
369
- - [ ] Fusion layer integration
370
-
371
- ### **Fallback Mechanisms (Must Work)**
372
- - [ ] pyttsx3 if Voxtral fails
373
- - [ ] Keyword intent if DistilBERT fails
374
- - [ ] Text search if FAISS unavailable
375
- - [ ] System continues if OCR fails
376
-
377
- ---
378
-
379
- ## 🐛 TROUBLESHOOTING
380
-
381
- ### **Issue: FAISS not installing**
382
- ```bash
383
- # Try CPU version
384
- pip install faiss-cpu
385
-
386
- # Or GPU version (if CUDA available)
387
- pip install faiss-gpu
388
- ```
389
-
390
- ### **Issue: EasyOCR fails**
391
- ```bash
392
- # Install dependencies
393
- pip install easyocr torch torchvision
394
- ```
395
-
396
- ### **Issue: Piper TTS not working**
397
- ```bash
398
- # System will automatically fall back to pyttsx3
399
- # No action needed - this is expected behavior
400
- ```
401
-
402
- ### **Issue: Import errors**
403
- ```bash
404
- # Ensure you're in project root
405
- cd VisionQ
406
-
407
- # Run with Python module syntax
408
- python -m main_upgraded
409
- ```
410
-
411
- ---
412
-
413
- ## 📊 PERFORMANCE COMPARISON
414
-
415
- | Feature | Before | After |
416
- |---------|--------|-------|
417
- | Caption Quality | BLIP only | BLIP + OCR + Objects |
418
- | Memory Search | Text only | Text + Image (FAISS) |
419
- | Query Understanding | Keywords | DistilBERT NLP |
420
- | TTS Quality | Robotic | Natural (Voxtral) |
421
- | Search Speed | O(n) linear scan | FAISS index (cost depends on index type; a flat index is still O(n)) |
422
- | Text Reading | ❌ None | ✅ EasyOCR |
423
-
424
- ---
425
-
426
- ## 🚀 NEXT STEPS
427
-
428
- ### **Immediate:**
429
- 1. Test all voice commands
430
- 2. Verify OCR on text images
431
- 3. Check memory persistence
432
- 4. Test query system
433
-
434
- ### **Optional Enhancements:**
435
- 1. Add FastVLM for faster captioning
436
- 2. Implement image-to-image search UI
437
- 3. Add multi-language OCR
438
- 4. Create web interface
439
- 5. Add video recording
440
-
441
- ### **Production Readiness:**
442
- 1. Add error logging
443
- 2. Implement health checks
444
- 3. Add configuration file
445
- 4. Create Docker container
446
- 5. Add unit tests
447
-
448
- ---
449
-
450
- ## 📝 MIGRATION NOTES
451
-
452
- ### **Backward Compatibility:**
453
- - ✅ Old `memory.json` files work with new system
454
- - ✅ Existing voice commands unchanged
455
- - ✅ Old agents still available in root directory
456
- - ✅ Can run old and new systems side-by-side
457
-
458
- ### **Breaking Changes:**
459
- - ❌ None - fully backward compatible
460
-
461
- ### **Deprecation Warnings:**
462
- - Old files in root will be deprecated in future versions
463
- - Recommended to use `agents/` modules going forward
464
-
465
- ---
466
-
467
- ## 🎓 LEARNING RESOURCES
468
-
469
- **MobileCLIP:**
470
- - Paper: https://arxiv.org/abs/2311.17049
471
- - Use: Visual embeddings for similarity search
472
-
473
- **FAISS:**
474
- - Docs: https://github.com/facebookresearch/faiss
475
- - Use: Fast vector similarity search
476
-
477
- **EasyOCR:**
478
- - Docs: https://github.com/JaidedAI/EasyOCR
479
- - Use: Offline text extraction
480
-
481
- **DistilBERT:**
482
- - Paper: https://arxiv.org/abs/1910.01108
483
- - Use: Efficient NLP for intent classification
484
-
485
- **Piper TTS:**
486
- - Docs: https://github.com/rhasspy/piper
487
- - Use: Neural text-to-speech
488
-
489
- ---
490
-
491
- ## ✅ VERIFICATION
492
-
493
- Run this checklist to verify upgrade success:
494
-
495
- ```bash
496
- # 1. Check file structure
497
- dir agents
498
- dir core
499
- dir data
500
-
501
- # 2. Test imports
502
- python -c "from agents import VisionAgent; print('✅ Imports OK')"
503
-
504
- # 3. Test memory agent
505
- python -c "from agents import MemoryAgent; m = MemoryAgent(); print('✅ Memory OK')"
506
-
507
- # 4. Run upgraded system
508
- python main_upgraded.py
509
- ```
510
-
511
- ---
512
-
513
- ## 📞 SUPPORT
514
-
515
- If you encounter issues:
516
- 1. Check the Troubleshooting section above
517
- 2. Verify all dependencies installed
518
- 3. Check Python version (3.8+)
519
- 4. Ensure camera/microphone permissions
520
- 5. Review error logs
521
-
522
- ---
523
-
524
- **Upgrade completed successfully! 🎉**
525
-
526
- Your VisionQ system now has:
527
- - 🧠 Smarter memory (FAISS)
528
- - 👁️ Better vision (MobileCLIP + OCR)
529
- - 🗣️ Natural voice (Voxtral)
530
- - 🔍 Intelligent queries (DistilBERT)
531
-
532
- **All while keeping your existing system intact!**

archive/old_scripts/ask_question.py DELETED
@@ -1,19 +0,0 @@
- from memory_agent import MemoryAgent
- from query_agent import QueryAgent
-
- def main():
-     memory_agent = MemoryAgent()
-     query_agent = QueryAgent(memory_agent)
-
-     print("🧠 Memory Query System (type 'exit' to quit)")
-
-     while True:
-         question = input("\nAsk a question: ").strip()
-         if question.lower() == "exit":
-             break
-
-         answer = query_agent.ask(question)
-         print("\n" + answer)
-
- if __name__ == "__main__":
-     main()

archive/old_scripts/ask_question_upgraded.py DELETED
@@ -1,41 +0,0 @@
- """
- Memory Query System - Interactive memory search
- UPDATED: Now includes intent classification and hybrid search
- """
-
- from agents.memory_agent import MemoryAgent
- from agents.query_agent import QueryAgent
-
-
- def main():
-     print("=" * 60)
-     print("🧠 VisionQ Memory Query System (UPGRADED)")
-     print("=" * 60)
-     print("\nFeatures:")
-     print(" • Time-based queries (today, yesterday, last hour)")
-     print(" • Semantic search with DistilBERT")
-     print(" • FAISS-powered similarity search")
-     print(" • OCR text search")
-     print("\nType 'exit' to quit\n")
-
-     memory_agent = MemoryAgent()
-     query_agent = QueryAgent(memory_agent)
-
-     while True:
-         question = input("\n❓ Ask a question: ").strip()
-
-         if question.lower() == "exit":
-             print("Goodbye!")
-             break
-
-         if not question:
-             continue
-
-         # Query with enhanced capabilities
-         answer = query_agent.ask(question)
-         print(f"\n💡 Answer:\n{answer}\n")
-         print("-" * 60)
-
-
- if __name__ == "__main__":
-     main()

archive/old_scripts/install_upgrade.bat DELETED
@@ -1,101 +0,0 @@
1
- @echo off
2
- REM ============================================
3
- REM VisionQ Upgrade - Automated Installation
4
- REM ============================================
5
-
6
- echo.
7
- echo ============================================
8
- echo VisionQ System Upgrade Installer
9
- echo ============================================
10
- echo.
11
-
12
- REM Check Python installation
13
- python --version >nul 2>&1
14
- if errorlevel 1 (
15
- echo [ERROR] Python not found. Please install Python 3.8+
16
- pause
17
- exit /b 1
18
- )
19
-
20
- echo [1/6] Python detected
21
- echo.
22
-
23
- REM Create data directory
24
- echo [2/6] Creating data directory...
25
- if not exist "data" mkdir data
26
- echo - data\ created
27
-
28
- REM Move existing memory file
29
- if exist "memory.json" (
30
- echo [3/6] Migrating existing memory...
31
- move /Y memory.json data\memory.json >nul
32
- echo - memory.json moved to data\
33
- ) else (
34
- echo [3/6] No existing memory found (fresh install)
35
- )
36
-
37
- REM Install dependencies
38
- echo.
39
- echo [4/6] Installing dependencies...
40
- echo This may take several minutes...
41
- echo.
42
- pip install -r requirements_upgraded.txt
43
- if errorlevel 1 (
44
- echo [ERROR] Dependency installation failed
45
- pause
46
- exit /b 1
47
- )
48
-
49
- echo.
50
- echo [5/6] Verifying installation...
51
-
52
- REM Test imports
53
- python -c "from agents import VisionAgent; print(' - Agents: OK')" 2>nul
54
- if errorlevel 1 (
55
- echo [ERROR] Agent import failed
56
- pause
57
- exit /b 1
58
- )
59
-
60
- python -c "from core import FusionLayer; print(' - Core: OK')" 2>nul
61
- if errorlevel 1 (
62
- echo [ERROR] Core import failed
63
- pause
64
- exit /b 1
65
- )
66
-
67
- python -c "import faiss; print(' - FAISS: OK')" 2>nul
68
- if errorlevel 1 (
69
- echo [WARNING] FAISS not available (optional)
70
- echo Install with: pip install faiss-cpu
71
- )
72
-
73
- python -c "import easyocr; print(' - EasyOCR: OK')" 2>nul
74
- if errorlevel 1 (
75
- echo [WARNING] EasyOCR not available (optional)
76
- echo Install with: pip install easyocr
77
- )
78
-
79
- echo.
80
- echo [6/6] Installation complete!
81
- echo.
82
- echo ============================================
83
- echo VisionQ Upgrade Installed Successfully!
84
- echo ============================================
85
- echo.
86
- echo Next steps:
87
- echo 1. Run: python main_upgraded.py
88
- echo 2. Test voice commands
89
- echo 3. Check QUICKSTART.md for usage guide
90
- echo.
91
- echo Optional enhancements:
92
- echo - Install FAISS: pip install faiss-cpu
93
- echo - Install EasyOCR: pip install easyocr
94
- echo - Download Piper TTS model (see QUICKSTART.md)
95
- echo.
96
- echo Documentation:
97
- echo - QUICKSTART.md - Quick start guide
98
- echo - UPGRADE_GUIDE.md - Complete documentation
99
- echo - ARCHITECTURE.md - System architecture
100
- echo.
101
- pause

archive/old_scripts/main.py DELETED
@@ -1,66 +0,0 @@
- from voice_agent import VoiceAgent
- from vision_agent import VisionAgent
-
-
- def main():
-     voice = VoiceAgent()
-     vision = VisionAgent()
-
-     voice.speak("Vision Q started. I am listening.")
-
-
-     while True:
-         spoken_text = voice.listen()
-         if not spoken_text:
-             continue
-
-         intent = voice.parse_intent(spoken_text)
-
-
-         print("[INTENT]:", intent)
-
-         if intent == "DESCRIBE_SCENE":
-             voice.speak("Describing the scene.")
-             description = vision.describe_scene()
-
-             if description:
-                 print("[DESCRIPTION]:", description)
-                 voice.speak(description)
-             else:
-                 voice.speak("I could not capture the scene.")
-
-         elif intent == "REMEMBER_SCENE":
-             voice.speak("I will remember this scene.")
-             description = vision.remember_scene()
-
-             if description:
-                 print("[REMEMBERED]:", description)
-                 voice.speak("Scene remembered.")
-             else:
-                 voice.speak("I could not remember the scene.")
-
-         elif intent == "RECALL_MEMORY":
-             memory = vision.memory_agent.recall_last()
-             if memory:
-                 voice.speak(memory)
-             else:
-                 voice.speak("I do not have any memories yet.")
-
-
-         elif intent == "READ_TEXT":
-             # OCR intentionally postponed
-             voice.speak("Reading text will be available soon.")
-
-         elif intent == "EXIT":
-             voice.speak("Goodbye.")
-             vision.cleanup()
-             break
-
-         else:
-             voice.speak("I did not understand.")
-
-     print("Vision Q stopped.")
-
-
- if __name__ == "__main__":
-     main()

archive/old_scripts/main_upgraded.py DELETED
@@ -1,85 +0,0 @@
- """
- VisionQ - Upgraded Multimodal AI Assistant
- UPDATED: Now includes OCR, embeddings, and enhanced memory
- """
-
- from agents.voice_agent import VoiceAgent
- from agents.vision_agent import VisionAgent
-
-
- def main():
-     print("=" * 60)
-     print("VisionQ - Multimodal AI Assistant (UPGRADED)")
-     print("=" * 60)
-
-     # Initialize agents
-     voice = VoiceAgent()
-     vision = VisionAgent()
-
-     voice.speak("Vision Q started. I am listening.")
-
-     while True:
-         spoken_text = voice.listen()
-         if not spoken_text:
-             continue
-
-         intent = voice.parse_intent(spoken_text)
-         print(f"[INTENT]: {intent}")
-
-         # ===== DESCRIBE SCENE (UPDATED) =====
-         if intent == "DESCRIBE_SCENE":
-             voice.speak("Describing the scene.")
-             description = vision.describe_scene()
-
-             if description:
-                 print(f"[DESCRIPTION]: {description}")
-                 voice.speak(description)
-             else:
-                 voice.speak("I could not capture the scene.")
-
-         # ===== REMEMBER SCENE (UPDATED) =====
-         elif intent == "REMEMBER_SCENE":
-             voice.speak("I will remember this scene.")
-             description = vision.remember_scene()
-
-             if description:
-                 print(f"[REMEMBERED]: {description}")
-                 voice.speak("Scene remembered.")
-             else:
-                 voice.speak("I could not remember the scene.")
-
-         # ===== RECALL MEMORY (KEPT) =====
-         elif intent == "RECALL_MEMORY":
-             memory = vision.memory_agent.recall_last()
-             if memory:
-                 response = f"At {memory['timestamp']}, {memory['description']}"
-                 voice.speak(response)
-             else:
-                 voice.speak("I do not have any memories yet.")
-
-         # ===== READ TEXT (NEW - NOW FUNCTIONAL) =====
-         elif intent == "READ_TEXT":
-             voice.speak("Reading text from the scene.")
-             text_result = vision.read_text()
-
-             if text_result:
-                 print(f"[OCR]: {text_result}")
-                 voice.speak(text_result)
-             else:
-                 voice.speak("I could not read any text.")
-
-         # ===== EXIT (KEPT) =====
-         elif intent == "EXIT":
-             voice.speak("Goodbye.")
-             vision.cleanup()
-             break
-
-         # ===== UNKNOWN (KEPT) =====
-         else:
-             voice.speak("I did not understand.")
-
-     print("Vision Q stopped.")
-
-
- if __name__ == "__main__":
-     main()

archive/old_scripts/test_upgrade.py DELETED
@@ -1,274 +0,0 @@
- """
- VisionQ Upgrade - Automated Test Suite
- Tests all new and existing functionality
- """
-
- import sys
- import os
-
- def test_imports():
-     """Test all module imports"""
-     print("\n" + "="*60)
-     print("TEST 1: Module Imports")
-     print("="*60)
-
-     tests = [
-         ("agents.voice_agent", "VoiceAgent"),
-         ("agents.vision_agent", "VisionAgent"),
-         ("agents.caption_agent", "CaptionAgent"),
-         ("agents.embedding_agent", "EmbeddingAgent"),
-         ("agents.ocr_agent", "OCRAgent"),
-         ("agents.memory_agent", "MemoryAgent"),
-         ("agents.query_agent", "QueryAgent"),
-         ("core.fusion_layer", "FusionLayer"),
-     ]
-
-     passed = 0
-     failed = 0
-
-     for module, cls in tests:
-         try:
-             exec(f"from {module} import {cls}")
-             print(f" ✅ {module}.{cls}")
-             passed += 1
-         except Exception as e:
-             print(f" ❌ {module}.{cls} - {e}")
-             failed += 1
-
-     print(f"\nResult: {passed} passed, {failed} failed")
-     return failed == 0
-
-
- def test_dependencies():
-     """Test optional dependencies"""
-     print("\n" + "="*60)
-     print("TEST 2: Optional Dependencies")
-     print("="*60)
-
-     deps = [
-         ("faiss", "FAISS (vector search)"),
-         ("easyocr", "EasyOCR (text extraction)"),
-         ("piper", "Piper TTS (neural voice)"),
-     ]
-
-     for module, name in deps:
-         try:
-             __import__(module)
-             print(f" ✅ {name}")
-         except ImportError:
-             print(f" ⚠️ {name} - Not installed (optional)")
-
-     return True
-
-
- def test_memory_agent():
-     """Test MemoryAgent functionality"""
-     print("\n" + "="*60)
-     print("TEST 3: MemoryAgent")
-     print("="*60)
-
-     try:
-         from agents.memory_agent import MemoryAgent
-         import numpy as np
-
-         # Create test memory
-         memory = MemoryAgent(
-             memory_file="data/test_memory.json",
-             faiss_index_file="data/test_memory.faiss"
-         )
-
-         # Test adding memory
-         test_desc = "Test scene with a person"
-         test_embedding = np.random.rand(512).astype('float32')
-         memory.add(test_desc, image_embedding=test_embedding)
-         print(" ✅ Add memory")
-
-         # Test recall
-         last = memory.recall_last()
-         assert last is not None
-         print(" ✅ Recall last")
-
-         # Test text search
-         results = memory.search_by_text("person", threshold=0.1)
-         print(f" ✅ Text search ({len(results)} results)")
-
-         # Test image search (if FAISS available)
-         try:
-             results = memory.search_by_image(test_embedding, k=1)
-             print(f" ✅ Image search ({len(results)} results)")
-         except:
-             print(" ⚠️ Image search - FAISS not available")
-
-         # Cleanup
-         if os.path.exists("data/test_memory.json"):
-             os.remove("data/test_memory.json")
-         if os.path.exists("data/test_memory.faiss"):
-             os.remove("data/test_memory.faiss")
-
-         print("\n MemoryAgent: PASSED")
-         return True
-
-     except Exception as e:
-         print(f"\n ❌ MemoryAgent: FAILED - {e}")
-         return False
-
-
- def test_fusion_layer():
-     """Test FusionLayer"""
-     print("\n" + "="*60)
-     print("TEST 4: FusionLayer")
-     print("="*60)
-
-     try:
-         from core.fusion_layer import FusionLayer
-         import numpy as np
-
-         fusion = FusionLayer()
-
-         # Test fusion
-         context = fusion.fuse(
-             caption="a person holding a phone",
-             ocr_text="Hello World",
-             objects=["person", "phone"],
-             embedding=np.random.rand(512)
-         )
-
-         assert "caption" in context
-         assert "ocr_text" in context
-         assert "objects" in context
-         assert "full_description" in context
-         print(" ✅ Fuse multimodal data")
-
-         # Test extraction
-         desc, emb = fusion.extract_for_storage(context)
-         assert desc is not None
-         assert emb is not None
-         print(" ✅ Extract for storage")
-
-         print("\n FusionLayer: PASSED")
-         return True
-
-     except Exception as e:
-         print(f"\n ❌ FusionLayer: FAILED - {e}")
-         return False
-
-
- def test_query_agent():
-     """Test QueryAgent"""
-     print("\n" + "="*60)
-     print("TEST 5: QueryAgent")
-     print("="*60)
-
-     try:
-         from agents.memory_agent import MemoryAgent
-         from agents.query_agent import QueryAgent
-
-         memory = MemoryAgent(
-             memory_file="data/test_memory.json",
-             faiss_index_file="data/test_memory.faiss"
-         )
-         query = QueryAgent(memory)
-
-         # Test intent classification
-         intent = query.classify_intent("What did I see this morning?")
-         print(f" ✅ Intent classification: {intent}")
-
-         # Test time extraction
-         time_window = query.extract_time_window("What did I see today?")
-         print(f" ✅ Time extraction: {time_window is not None}")
-
-         # Cleanup
-         if os.path.exists("data/test_memory.json"):
-             os.remove("data/test_memory.json")
-         if os.path.exists("data/test_memory.faiss"):
-             os.remove("data/test_memory.faiss")
-
-         print("\n QueryAgent: PASSED")
-         return True
-
-     except Exception as e:
-         print(f"\n ❌ QueryAgent: FAILED - {e}")
-         return False
-
-
- def test_backward_compatibility():
-     """Test backward compatibility"""
-     print("\n" + "="*60)
-     print("TEST 6: Backward Compatibility")
-     print("="*60)
-
-     try:
-         from agents.memory_agent import MemoryAgent
-
-         # Test old memory format
-         memory = MemoryAgent(
-             memory_file="data/test_memory.json",
-             faiss_index_file="data/test_memory.faiss"
-         )
-
-         # Add memory without image embedding (old format)
-         memory.add("Test scene without embedding", image_embedding=None)
-         print(" ✅ Old format (no image embedding)")
-
-         # Recall should work
-         last = memory.recall_last()
-         assert last is not None
-         print(" ✅ Recall old format")
-
-         # Cleanup
-         if os.path.exists("data/test_memory.json"):
-             os.remove("data/test_memory.json")
-         if os.path.exists("data/test_memory.faiss"):
-             os.remove("data/test_memory.faiss")
-
-         print("\n Backward Compatibility: PASSED")
-         return True
-
-     except Exception as e:
-         print(f"\n ❌ Backward Compatibility: FAILED - {e}")
-         return False
-
-
- def main():
-     """Run all tests"""
-     print("\n" + "="*60)
-     print("VisionQ Upgrade - Test Suite")
-     print("="*60)
-
-     # Ensure data directory exists
-     os.makedirs("data", exist_ok=True)
-
-     results = []
-
-     # Run tests
-     results.append(("Imports", test_imports()))
-     results.append(("Dependencies", test_dependencies()))
-     results.append(("MemoryAgent", test_memory_agent()))
-     results.append(("FusionLayer", test_fusion_layer()))
-     results.append(("QueryAgent", test_query_agent()))
-     results.append(("Backward Compatibility", test_backward_compatibility()))
-
-     # Summary
-     print("\n" + "="*60)
-     print("TEST SUMMARY")
-     print("="*60)
-
-     passed = sum(1 for _, result in results if result)
-     total = len(results)
-
-     for name, result in results:
-         status = "✅ PASSED" if result else "❌ FAILED"
-         print(f" {name}: {status}")
-
-     print(f"\nTotal: {passed}/{total} tests passed")
-
-     if passed == total:
-         print("\n🎉 All tests passed! System is ready.")
-         return 0
-     else:
-         print(f"\n⚠️ {total - passed} test(s) failed. Check errors above.")
-         return 1
-
-
- if __name__ == "__main__":
-     sys.exit(main())

archive/pipcheck.txt DELETED
Binary file (22.9 kB)
 
archive/requirements_upgraded.txt DELETED
@@ -1,54 +0,0 @@
1
- # ============================================
2
- # VisionQ - UPGRADED Requirements
3
- # ============================================
4
-
5
- # Core ML/AI
6
- torch>=2.0.0
7
- transformers>=4.57.3
8
- sentence-transformers>=2.2.2
9
-
10
- # Vision
11
- opencv-python
12
- pillow
13
- ultralytics # YOLO (optional but recommended)
14
-
15
- # Voice
16
- pyttsx3 # TTS fallback (KEPT)
17
- sounddevice
18
- vosk
19
-
20
- # NEW: Neural TTS (Primary)
21
- piper-tts # Voxtral/Piper neural TTS
22
-
23
- # NEW: OCR
24
- easyocr # Lightweight OCR
25
-
26
- # NEW: Vector Search
27
- faiss-cpu # FAISS for similarity search
28
- # Use faiss-gpu if you have CUDA
29
-
30
- # NEW: NLP Enhancement
31
- # DistilBERT is included in transformers
32
-
33
- # Optional: TensorFlow for SSD fallback
34
- # tensorflow>=2.13.0
35
-
36
- # ============================================
37
- # Installation Notes:
38
- # ============================================
39
- # 1. Create virtual environment:
40
- # python -m venv venv
41
- # venv\Scripts\activate (Windows)
42
- # source venv/bin/activate (Linux/Mac)
43
- #
44
- # 2. Install dependencies:
45
- # pip install -r requirements.txt
46
- #
47
- # 3. Download Vosk model:
48
- # https://alphacephei.com/vosk/models
49
- # Extract to: models/vosk/
50
- #
51
- # 4. Download Piper voice (optional):
52
- # https://github.com/rhasspy/piper/releases
53
- # Extract to: models/piper/
54
- # ============================================
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
cleanup.bat DELETED
@@ -1,65 +0,0 @@
1
- @echo off
2
- REM ============================================
3
- REM VisionQ - Project Cleanup Script
4
- REM Moves old/redundant files to archive
5
- REM ============================================
6
-
7
- echo.
8
- echo ============================================
9
- echo VisionQ Project Cleanup
10
- echo ============================================
11
- echo.
12
-
13
- REM Create archive directory
14
- if not exist "archive\" mkdir archive
15
- if not exist "archive\old_agents\" mkdir archive\old_agents
16
- if not exist "archive\old_docs\" mkdir archive\old_docs
17
- if not exist "archive\old_scripts\" mkdir archive\old_scripts
18
-
19
- echo [1/4] Moving old agent files...
20
- if exist "caption_agent.py" move /Y caption_agent.py archive\old_agents\
21
- if exist "memory_agent.py" move /Y memory_agent.py archive\old_agents\
22
- if exist "query_agent.py" move /Y query_agent.py archive\old_agents\
23
- if exist "vision_agent.py" move /Y vision_agent.py archive\old_agents\
24
- if exist "voice_agent.py" move /Y voice_agent.py archive\old_agents\
25
-
26
- echo [2/4] Moving old documentation...
27
- if exist "README_UPGRADED.md" move /Y README_UPGRADED.md archive\old_docs\
28
- if exist "ARCHITECTURE.md" move /Y ARCHITECTURE.md archive\old_docs\
29
- if exist "COMPARISON.md" move /Y COMPARISON.md archive\old_docs\
30
- if exist "DEPLOYMENT_CHECKLIST.md" move /Y DEPLOYMENT_CHECKLIST.md archive\old_docs\
31
- if exist "INDEX.md" move /Y INDEX.md archive\old_docs\
32
- if exist "QUICK_REFERENCE.md" move /Y QUICK_REFERENCE.md archive\old_docs\
33
- if exist "QUICKSTART.md" move /Y QUICKSTART.md archive\old_docs\
34
- if exist "SUMMARY.md" move /Y SUMMARY.md archive\old_docs\
35
- if exist "UPGRADE_GUIDE.md" move /Y UPGRADE_GUIDE.md archive\old_docs\
36
-
37
- echo [3/4] Moving old scripts...
38
- if exist "main.py" move /Y main.py archive\old_scripts\
39
- if exist "main_upgraded.py" move /Y main_upgraded.py archive\old_scripts\
40
- if exist "ask_question.py" move /Y ask_question.py archive\old_scripts\
41
- if exist "ask_question_upgraded.py" move /Y ask_question_upgraded.py archive\old_scripts\
42
- if exist "test_upgrade.py" move /Y test_upgrade.py archive\old_scripts\
43
- if exist "install_upgrade.bat" move /Y install_upgrade.bat archive\old_scripts\
44
-
45
- echo [4/4] Moving old requirements...
46
- if exist "requirements_upgraded.txt" move /Y requirements_upgraded.txt archive\
47
- if exist "pipcheck.txt" move /Y pipcheck.txt archive\
48
-
49
- echo.
50
- echo ============================================
51
- echo Cleanup Complete!
52
- echo ============================================
53
- echo.
54
- echo Old files moved to archive\ directory
55
- echo.
56
- echo Current structure:
57
- echo agents/ - AI agents
58
- echo config/ - Configuration
59
- echo ui/ - Streamlit interface
60
- echo data/ - Storage
61
- echo archive/ - Old files (backup)
62
- echo.
63
- echo You can safely delete archive\ if not needed
64
- echo.
65
- pause
 
config/fast_mode.py DELETED
@@ -1,40 +0,0 @@
1
- # Fast Mode Configuration
2
- # Copy this to config/settings.py to make VisionQ faster
3
-
4
- # ============================================
5
- # FAST MODE - Optimized for Speed
6
- # ============================================
7
-
8
- FEATURES = {
9
- "ocr_enabled": False, # Disabled for speed
10
- "faiss_enabled": True,
11
- "neural_tts_enabled": True,
12
- "intent_classification_enabled": True,
13
- "object_detection_enabled": False, # Disabled for speed
14
- "continuous_mode_enabled": True,
15
- "embeddings_enabled": False, # Keep disabled
16
- }
17
-
18
- # Use nano YOLO if you enable object detection
19
- MODEL_CONFIG = {
20
- "yolo_model": "yolov8n.pt", # Nano model (faster)
21
- "caption_model": "Salesforce/blip-image-captioning-base",
22
- # ... rest of config
23
- }
24
-
25
- # ============================================
26
- # EXPECTED PERFORMANCE
27
- # ============================================
28
- # With this config:
29
- # - Capture & Describe: ~1.5 seconds
30
- # - Remember Scene: ~1.5 seconds
31
- # - Read Text: Disabled
32
- #
33
- # Speed improvement: ~40% faster!
34
- # ============================================
35
-
36
- # To apply:
37
- # 1. Copy FEATURES section above
38
- # 2. Paste into config/settings.py
39
- # 3. Run: fix_and_run.bat
40
- # 4. Test the speed!
 
docs/CAMERA_FEED.md DELETED
@@ -1,178 +0,0 @@
1
- # Camera Feed Options
2
-
3
- ## Two Versions Available
4
-
5
- ### 1. Standard Version (app.py)
6
- **File:** `ui/app.py`
7
- **Launch:** `run.bat` or `streamlit run ui/app.py`
8
-
9
- **Features:**
10
- - Static camera feed
11
- - Updates only when you click buttons
12
- - Lower CPU usage
13
- - Better for slower computers
14
-
15
- **Best for:**
16
- - Testing and development
17
- - Slower computers
18
- - Battery saving on laptops
19
-
20
- ---
21
-
22
- ### 2. Continuous Feed Version (app_continuous.py)
23
- **File:** `ui/app_continuous.py`
24
- **Launch:** `run_continuous.bat` or `streamlit run ui/app_continuous.py`
25
-
26
- **Features:**
27
- - Live continuous camera feed
28
- - Adjustable refresh rate (0.5-5 seconds)
29
- - Start/Stop camera button
30
- - Real-time preview
31
-
32
- **Best for:**
33
- - Live monitoring
34
- - Real-time demonstrations
35
- - Faster computers with good camera
36
-
37
- ---
38
-
39
- ## Comparison
40
-
41
- | Feature | Standard | Continuous |
42
- |---------|----------|------------|
43
- | **Camera Feed** | Static | Live |
44
- | **Updates** | On button click | Automatic |
45
- | **CPU Usage** | Low | Medium-High |
46
- | **Refresh Rate** | Manual | 0.5-5 seconds |
47
- | **Start/Stop** | No | Yes |
48
- | **Battery Impact** | Low | Higher |
49
-
50
- ---
51
-
52
- ## How to Use Continuous Feed
53
-
54
- ### Step 1: Launch
55
- ```bash
56
- run_continuous.bat
57
- ```
58
-
59
- ### Step 2: Initialize System
60
- Click "Initialize System" in the Vision tab
61
-
62
- ### Step 3: Start Camera
63
- Click "Start Camera" button
64
-
65
- ### Step 4: Adjust Settings
66
- - Use sidebar slider to change refresh rate
67
- - Lower rate = smoother but more CPU
68
- - Higher rate = less CPU but choppier
69
-
70
- ### Step 5: Use Features
71
- - Camera keeps running in background
72
- - Click "Capture & Describe" anytime
73
- - Click "Remember Scene" anytime
74
- - Click "Read Text" anytime
75
-
76
- ### Step 6: Stop Camera
77
- Click "Stop Camera" when done to save resources
78
-
79
- ---
80
-
81
- ## Performance Tips
82
-
83
- ### For Continuous Feed
84
-
85
- **Optimize refresh rate:**
86
- - Fast computer: 0.5-1 second
87
- - Medium computer: 1-2 seconds
88
- - Slow computer: 2-5 seconds
89
-
90
- **Save resources:**
91
- - Stop camera when not actively using
92
- - Close other applications
93
- - Use standard version if too slow
94
-
95
- **Battery saving:**
96
- - Use standard version on laptop
97
- - Or set refresh rate to 3-5 seconds
98
- - Stop camera between uses
99
-
100
- ---
101
-
102
- ## Troubleshooting
103
-
104
- ### Camera feed is choppy
105
- **Solution:** Increase refresh rate in sidebar (try 2-3 seconds)
106
-
107
- ### High CPU usage
108
- **Solution:**
109
- - Stop camera when not needed
110
- - Increase refresh rate
111
- - Use standard version instead
112
-
113
- ### Camera won't start
114
- **Solution:**
115
- - Check camera permissions
116
- - Close other apps using camera
117
- - Try standard version first
118
- - Restart application
119
-
120
- ### Feed freezes
121
- **Solution:**
122
- - Click "Stop Camera" then "Start Camera"
123
- - Refresh browser page
124
- - Restart application
125
-
126
- ---
127
-
128
- ## Which Version Should I Use?
129
-
130
- ### Use Standard Version (`run.bat`) if:
131
- - Testing features
132
- - Slower computer
133
- - On battery power
134
- - Don't need live feed
135
- - Just want to capture occasionally
136
-
137
- ### Use Continuous Version (`run_continuous.bat`) if:
138
- - Need live monitoring
139
- - Demonstrating to others
140
- - Fast computer with good camera
141
- - Plugged into power
142
- - Want real-time preview
143
-
144
- ---
145
-
146
- ## Switching Between Versions
147
-
148
- You can switch anytime:
149
-
150
- ```bash
151
- # Stop current version (Ctrl+C)
152
-
153
- # Start standard version
154
- run.bat
155
-
156
- # OR start continuous version
157
- run_continuous.bat
158
- ```
159
-
160
- Both use the same memory and settings!
161
-
162
- ---
163
-
164
- ## Summary
165
-
166
- **Standard Version:**
167
- - Launch: `run.bat`
168
- - Camera: Static (updates on click)
169
- - CPU: Low
170
- - Best for: General use
171
-
172
- **Continuous Version:**
173
- - Launch: `run_continuous.bat`
174
- - Camera: Live feed
175
- - CPU: Medium-High
176
- - Best for: Live monitoring
177
-
178
- **Try both and see which works better for you!**
 
docs/PERFORMANCE.md DELETED
@@ -1,187 +0,0 @@
1
- # Performance Optimization Guide
2
-
3
- ## Speed Issues Fixed
4
-
5
- ### 1. Embedding Error Fixed
6
- The AttributeError with CLIP embeddings has been fixed by using `torch.nn.functional.normalize()` instead of `.norm()`.
7
-
8
- ### 2. Embeddings Disabled by Default
9
- Image embeddings (CLIP) are now **disabled by default** for faster performance.
10
-
11
- To enable embeddings, edit `config/settings.py`:
12
- ```python
13
- FEATURES = {
14
- "embeddings_enabled": True, # Enable for visual similarity search
15
- }
16
- ```
17
-
18
- ## Performance Settings
19
-
20
- ### Fast Mode (Default)
21
- ```python
22
- # config/settings.py
23
- FEATURES = {
24
- "embeddings_enabled": False, # Faster
25
- "ocr_enabled": True,
26
- "object_detection_enabled": True,
27
- }
28
- ```
29
-
30
- **Speed:** Fast (2-3 seconds per capture)
31
- **Features:** Caption + OCR + Object Detection
32
-
33
- ### Full Mode (Slower but more features)
34
- ```python
35
- FEATURES = {
36
- "embeddings_enabled": True, # Slower but enables visual search
37
- "ocr_enabled": True,
38
- "object_detection_enabled": True,
39
- }
40
- ```
41
-
42
- **Speed:** Slower (5-7 seconds per capture)
43
- **Features:** All features including visual similarity search
44
-
45
- ## Speed Comparison
46
-
47
- | Feature | Time | Can Disable? |
48
- |---------|------|--------------|
49
- | YOLO Detection | ~500ms | Yes (set object_detection_enabled=False) |
50
- | BLIP Caption | ~1000ms | No (core feature) |
51
- | CLIP Embeddings | ~2000ms | Yes (set embeddings_enabled=False) |
52
- | EasyOCR | ~500ms | Yes (set ocr_enabled=False) |
53
-
54
- ## Optimization Tips
55
-
56
- ### 1. Disable Unused Features
57
- Edit `config/settings.py`:
58
- ```python
59
- FEATURES = {
60
- "ocr_enabled": False, # If you don't need text reading
61
- "embeddings_enabled": False, # If you don't need visual search
62
- "object_detection_enabled": False, # If you don't need object detection
63
- }
64
- ```
65
-
66
- ### 2. Use Smaller Models
67
- ```python
68
- MODEL_CONFIG = {
69
- "yolo_model": "yolov8n.pt", # Nano model (faster)
70
- # Instead of "yolov8s.pt" (small model)
71
- }
72
- ```
73
-
74
- ### 3. Reduce OCR Languages
75
- ```python
76
- OCR_CONFIG = {
77
- "languages": ["en"], # Just English (faster)
78
- # Instead of ["en", "es", "fr", "de"]
79
- }
80
- ```
81
-
82
- ### 4. Lower Confidence Thresholds
83
- ```python
84
- VISION_CONFIG = {
85
- "confidence_threshold": 0.3, # Lower = faster but less accurate
86
- }
87
- ```
88
-
89
- ### 5. Use GPU (if available)
90
- ```python
91
- PERFORMANCE_CONFIG = {
92
- "use_gpu": True, # Much faster with GPU
93
- }
94
- ```
95
-
96
- ## Recommended Settings
97
-
98
- ### For Speed
99
- ```python
100
- # Fastest configuration
101
- FEATURES = {
102
- "ocr_enabled": False,
103
- "embeddings_enabled": False,
104
- "object_detection_enabled": False,
105
- }
106
- ```
107
- **Result:** ~1 second per capture (caption only)
108
-
109
- ### For Balance
110
- ```python
111
- # Balanced configuration (default)
112
- FEATURES = {
113
- "ocr_enabled": True,
114
- "embeddings_enabled": False,
115
- "object_detection_enabled": True,
116
- }
117
- ```
118
- **Result:** ~2-3 seconds per capture
119
-
120
- ### For Full Features
121
- ```python
122
- # All features enabled
123
- FEATURES = {
124
- "ocr_enabled": True,
125
- "embeddings_enabled": True,
126
- "object_detection_enabled": True,
127
- }
128
- ```
129
- **Result:** ~5-7 seconds per capture
130
-
131
- ## Troubleshooting Slow Performance
132
-
133
- ### Issue: First run is very slow
134
- **Solution:** This is normal. Models are being downloaded (~2GB). Subsequent runs will be much faster.
135
-
136
- ### Issue: Every capture takes 5+ seconds
137
- **Solution:** Disable embeddings in `config/settings.py`:
138
- ```python
139
- FEATURES = {
140
- "embeddings_enabled": False,
141
- }
142
- ```
143
-
144
- ### Issue: OCR is slow
145
- **Solution:**
146
- 1. Reduce languages to just what you need
147
- 2. Or disable OCR if not needed:
148
- ```python
149
- FEATURES = {
150
- "ocr_enabled": False,
151
- }
152
- ```
153
-
154
- ### Issue: Out of memory
155
- **Solution:**
156
- 1. Close other applications
157
- 2. Disable embeddings
158
- 3. Use smaller YOLO model (yolov8n.pt)
159
-
160
- ## Current Configuration
161
-
162
- The system is now configured for **balanced performance**:
163
- - Embeddings: DISABLED (faster)
164
- - OCR: ENABLED
165
- - Object Detection: ENABLED
166
- - Caption: ENABLED (always on)
167
-
168
- This gives you good features with reasonable speed (~2-3 seconds per capture).
169
-
170
- ## How to Change Settings
171
-
172
- 1. Open `config/settings.py`
173
- 2. Find the `FEATURES` section
174
- 3. Change `True`/`False` values
175
- 4. Restart the application
176
-
177
- Example:
178
- ```python
179
- # For maximum speed
180
- FEATURES = {
181
- "ocr_enabled": False,
182
- "embeddings_enabled": False,
183
- "object_detection_enabled": False,
184
- }
185
- ```
186
-
187
- Save the file and restart with `run.bat`.
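The three recommended configurations in the deleted `docs/PERFORMANCE.md` above differ only in three `FEATURES` flags, so they could be captured as named presets. The following is a minimal sketch under that assumption: the `PRESETS` table and the `apply_preset` helper are hypothetical names used for illustration and do not appear in the repository.

```python
# Hypothetical helper: PRESETS and apply_preset() are illustrative only and
# are not part of the VisionQ codebase. The flag values mirror the
# "For Speed", "For Balance", and "For Full Features" blocks in the doc.
PRESETS = {
    "speed": {
        "ocr_enabled": False,
        "embeddings_enabled": False,
        "object_detection_enabled": False,
    },
    "balance": {
        "ocr_enabled": True,
        "embeddings_enabled": False,
        "object_detection_enabled": True,
    },
    "full": {
        "ocr_enabled": True,
        "embeddings_enabled": True,
        "object_detection_enabled": True,
    },
}


def apply_preset(features, name):
    """Return a copy of `features` with the named preset's flags applied."""
    if name not in PRESETS:
        raise ValueError(f"unknown preset {name!r}; expected one of {sorted(PRESETS)}")
    merged = dict(features)       # leave the caller's dict untouched
    merged.update(PRESETS[name])  # preset flags override existing values
    return merged


if __name__ == "__main__":
    current = {"ocr_enabled": True, "embeddings_enabled": False,
               "object_detection_enabled": True, "faiss_enabled": True}
    print(apply_preset(current, "speed"))
```

Flags that a preset does not mention (for example `faiss_enabled`) pass through unchanged, which matches the doc's advice to edit only the values you need.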
 
docs/PERFORMANCE_ANALYSIS.md DELETED
@@ -1,310 +0,0 @@
1
- # VisionQ Performance Analysis
2
-
3
- ## Current Models Being Used
4
-
5
- ### 1. YOLO (Object Detection)
6
- **Model:** YOLOv8s (Small)
7
- **File:** `yolov8s.pt`
8
- **Size:** ~22MB
9
- **Speed:** ~500ms per frame
10
- **Purpose:** Detect objects in scene
11
-
12
- ### 2. BLIP (Image Captioning)
13
- **Model:** Salesforce/blip-image-captioning-base
14
- **Size:** ~990MB
15
- **Speed:** ~1000-1500ms per frame
16
- **Purpose:** Generate scene descriptions
17
- **THIS IS THE SLOWEST PART!**
18
-
19
- ### 3. EasyOCR (Text Extraction)
20
- **Model:** EasyOCR English
21
- **Size:** ~50MB per language
22
- **Speed:** ~500ms per frame
23
- **Purpose:** Read text from images
24
-
25
- ### 4. CLIP (Embeddings) - DISABLED
26
- **Model:** openai/clip-vit-base-patch32
27
- **Status:** Disabled by default
28
- **Speed:** Would add ~2000ms if enabled
29
-
30
- ---
31
-
32
- ## Why Camera is Slow
33
-
34
- ### Current Processing Time Breakdown
35
-
36
- **When you click "Capture & Describe":**
37
- ```
38
- 1. Capture frame: ~10ms
39
- 2. BLIP caption: ~1500ms ← SLOWEST!
40
- 3. EasyOCR text: ~500ms
41
- 4. Fusion/processing: ~50ms
42
- --------------------------------
43
- Total: ~2060ms (2+ seconds)
44
- ```
45
-
46
- **The main bottleneck is BLIP (image captioning)!**
47
-
48
- ---
49
-
50
- ## Speed Optimization Options
51
-
52
- ### Option 1: Disable OCR (Fastest)
53
- **Speed gain:** ~500ms faster
54
- **Trade-off:** No text reading
55
-
56
- Edit `config/settings.py`:
57
- ```python
58
- FEATURES = {
59
- "ocr_enabled": False, # Disable OCR
60
- }
61
- ```
62
-
63
- **New speed:** ~1.5 seconds
64
-
65
- ---
66
-
67
- ### Option 2: Use Smaller YOLO Model
68
- **Speed gain:** ~200ms faster
69
- **Trade-off:** Slightly less accurate object detection
70
-
71
- Edit `config/settings.py`:
72
- ```python
73
- MODEL_CONFIG = {
74
- "yolo_model": "yolov8n.pt", # Nano model (faster)
75
- }
76
- ```
77
-
78
- Download nano model:
79
- ```bash
80
- # In Python
81
- from ultralytics import YOLO
82
- model = YOLO("yolov8n.pt")
83
- ```
84
-
85
- **New speed:** ~1.8 seconds
86
-
87
- ---
88
-
89
- ### Option 3: Disable Object Detection
90
- **Speed gain:** ~500ms faster
91
- **Trade-off:** No object detection
92
-
93
- Edit `config/settings.py`:
94
- ```python
95
- FEATURES = {
96
- "object_detection_enabled": False,
97
- }
98
- ```
99
-
100
- **New speed:** ~1.5 seconds
101
-
102
- ---
103
-
104
- ### Option 4: Use Faster Caption Model (RECOMMENDED)
105
- **Speed gain:** ~1000ms faster!
106
- **Trade-off:** Slightly different captions
107
-
108
- Replace BLIP with a faster model like GIT or BLIP-2 small.
109
-
110
- ---
111
-
112
- ### Option 5: Caption Only Mode (FASTEST)
113
- **Speed gain:** Maximum
114
- **Trade-off:** Only caption, no OCR or objects
115
-
116
- Edit `config/settings.py`:
117
- ```python
118
- FEATURES = {
119
- "ocr_enabled": False,
120
- "object_detection_enabled": False,
121
- }
122
- ```
123
-
124
- **New speed:** ~1.5 seconds (just BLIP)
125
-
126
- ---
127
-
128
- ## Recommended Configurations
129
-
130
- ### For Speed (Fastest)
131
- ```python
132
- # config/settings.py
133
- FEATURES = {
134
- "ocr_enabled": False, # Disable OCR
135
- "object_detection_enabled": False, # Disable YOLO
136
- "embeddings_enabled": False, # Already disabled
137
- }
138
-
139
- MODEL_CONFIG = {
140
- "yolo_model": "yolov8n.pt", # Use nano if keeping YOLO
141
- }
142
- ```
143
-
144
- **Result:** ~1.5 seconds per capture
145
-
146
- ---
147
-
148
- ### For Balance (Recommended)
149
- ```python
150
- # config/settings.py
151
- FEATURES = {
152
- "ocr_enabled": True, # Keep OCR
153
- "object_detection_enabled": False, # Disable YOLO (not critical)
154
- "embeddings_enabled": False, # Keep disabled
155
- }
156
- ```
157
-
158
- **Result:** ~2 seconds per capture
159
-
160
- ---
161
-
162
- ### For Full Features (Slowest)
163
- ```python
164
- # config/settings.py
165
- FEATURES = {
166
- "ocr_enabled": True,
167
- "object_detection_enabled": True,
168
- "embeddings_enabled": False, # Keep disabled!
169
- }
170
- ```
171
-
172
- **Result:** ~2.5 seconds per capture
173
-
174
- ---
175
-
176
- ## GPU Acceleration
177
-
178
- If you have an NVIDIA GPU:
179
-
180
- ```python
181
- # config/settings.py
182
- PERFORMANCE_CONFIG = {
183
- "use_gpu": True, # Enable GPU
184
- }
185
-
186
- OCR_CONFIG = {
187
- "gpu": True, # Enable GPU for OCR
188
- }
189
- ```
190
-
191
- **Speed improvement:** 2-3x faster!
192
-
193
- **Requirements:**
194
- - NVIDIA GPU
195
- - CUDA installed
196
- - PyTorch with CUDA support
197
-
198
- ---
199
-
200
- ## Camera Feed Speed
201
-
202
- The camera itself is fast (~10ms per frame).
203
-
204
- **The slowness comes from AI processing, not the camera!**
205
-
206
- ### For Continuous Feed:
207
- - Camera updates quickly
208
- - But processing (BLIP) takes 1-2 seconds
209
- - So you see lag between capture and results
210
-
211
- ### Solutions:
212
- 1. Use static feed (current `app.py`)
213
- 2. Disable heavy features (OCR, YOLO)
214
- 3. Use GPU acceleration
215
- 4. Accept the 1-2 second delay
216
-
217
- ---
218
-
219
- ## Model Comparison
220
-
221
- | Model | Size | Speed | Accuracy | Replaceable? |
222
- |-------|------|-------|----------|--------------|
223
- | **BLIP** | 990MB | Slow (1.5s) | High | Yes (use GIT) |
224
- | **YOLO** | 22MB | Medium (0.5s) | High | Yes (use nano) |
225
- | **EasyOCR** | 50MB | Medium (0.5s) | High | Hard to replace |
226
- | **CLIP** | 500MB | Slow (2s) | High | Disabled |
227
-
228
- ---
229
-
230
- ## Quick Fixes You Can Try Now
231
-
232
- ### 1. Disable OCR
233
- ```python
234
- # config/settings.py
235
- FEATURES = {
236
- "ocr_enabled": False,
237
- }
238
- ```
239
- **Restart app:** `fix_and_run.bat`
240
-
241
- ### 2. Disable YOLO
242
- ```python
243
- # config/settings.py
244
- FEATURES = {
245
- "object_detection_enabled": False,
246
- }
247
- ```
248
- **Restart app:** `fix_and_run.bat`
249
-
250
- ### 3. Both (Fastest)
251
- ```python
252
- # config/settings.py
253
- FEATURES = {
254
- "ocr_enabled": False,
255
- "object_detection_enabled": False,
256
- }
257
- ```
258
- **Restart app:** `fix_and_run.bat`
259
-
260
- **Result:** Only BLIP caption (~1.5 seconds)
261
-
262
- ---
263
-
264
- ## Alternative: Use Lighter Caption Model
265
-
266
- Create a new caption agent with a faster model:
267
-
268
- ```python
269
- # agents/caption_agent_fast.py
270
- from transformers import AutoProcessor, AutoModelForCausalLM
271
-
272
- class FastCaptionAgent:
273
- def __init__(self):
274
- # Use GIT (faster than BLIP)
275
- self.processor = AutoProcessor.from_pretrained("microsoft/git-base")
276
- self.model = AutoModelForCausalLM.from_pretrained("microsoft/git-base")
277
- self.model.eval()
278
-
279
- def describe(self, frame_bgr):
280
- # Same as BLIP but faster
281
- ...
282
- ```
283
-
284
- **Speed:** ~500ms (3x faster than BLIP!)
285
-
286
- ---
287
-
288
- ## Summary
289
-
290
- **Why slow:**
291
- - BLIP caption model takes 1.5 seconds
292
- - OCR adds 0.5 seconds
293
- - YOLO adds 0.5 seconds
294
- - Total: 2.5 seconds
295
-
296
- **Quick fix:**
297
- ```python
298
- # Disable OCR and YOLO
299
- FEATURES = {
300
- "ocr_enabled": False,
301
- "object_detection_enabled": False,
302
- }
303
- ```
304
- **New speed:** 1.5 seconds (just BLIP)
305
-
306
- **Best fix:**
307
- - Use GPU acceleration (2-3x faster)
308
- - Or replace BLIP with GIT model (3x faster)
309
-
310
- **The camera itself is fast - it's the AI models that are slow!**
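The timing breakdown in the deleted `docs/PERFORMANCE_ANALYSIS.md` above is simple addition over the enabled stages, which can be sketched as below. The per-stage figures are the approximate values quoted in that doc (not fresh measurements), and `STAGE_MS` and `estimate_latency_ms` are hypothetical names introduced only for this illustration.

```python
# Hypothetical estimator: the stage costs are the rough per-frame numbers
# quoted in the doc (capture ~10 ms, BLIP ~1500 ms, OCR ~500 ms,
# YOLO ~500 ms, fusion ~50 ms). They are not measured here.
STAGE_MS = {
    "capture": 10,
    "caption": 1500,
    "ocr": 500,
    "object_detection": 500,
    "fusion": 50,
}


def estimate_latency_ms(features):
    """Sum the always-on stages plus any optional stages that are enabled."""
    # Capture, captioning, and fusion always run according to the doc.
    total = STAGE_MS["capture"] + STAGE_MS["caption"] + STAGE_MS["fusion"]
    if features.get("ocr_enabled", False):
        total += STAGE_MS["ocr"]
    if features.get("object_detection_enabled", False):
        total += STAGE_MS["object_detection"]
    return total


if __name__ == "__main__":
    for flags in (
        {"ocr_enabled": False, "object_detection_enabled": False},
        {"ocr_enabled": True, "object_detection_enabled": False},
        {"ocr_enabled": True, "object_detection_enabled": True},
    ):
        print(flags, "->", estimate_latency_ms(flags), "ms")
```

With only OCR enabled the estimate reproduces the doc's ~2060 ms total for "Capture & Describe"; with both optional stages enabled it lands near the quoted ~2.5 seconds.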
 
extras/labelmap_M.txt DELETED
@@ -1,91 +0,0 @@
1
- ???
2
- person
3
- bicycle
4
- car
5
- motorcycle
6
- airplane
7
- bus
8
- train
9
- truck
10
- boat
11
- traffic light
12
- fire hydrant
13
- ???
14
- stop sign
15
- parking meter
16
- bench
17
- bird
18
- cat
19
- dog
20
- horse
21
- sheep
22
- cow
23
- elephant
24
- bear
25
- zebra
26
- giraffe
27
- ???
28
- backpack
29
- umbrella
30
- ???
31
- ???
32
- handbag
33
- tie
34
- suitcase
35
- frisbee
36
- skis
37
- snowboard
38
- sports ball
39
- kite
40
- baseball bat
41
- baseball glove
42
- skateboard
43
- surfboard
44
- tennis racket
45
- bottle
46
- ???
47
- wine glass
48
- cup
49
- fork
50
- knife
51
- spoon
52
- bowl
53
- banana
54
- apple
55
- sandwich
56
- orange
57
- broccoli
58
- carrot
59
- hot dog
60
- pizza
61
- donut
62
- cake
63
- chair
64
- couch
65
- potted plant
66
- bed
67
- ???
68
- dining table
69
- ???
70
- ???
71
- toilet
72
- ???
73
- tv
74
- laptop
75
- mouse
76
- remote
77
- keyboard
78
- cell phone
79
- microwave
80
- oven
81
- toaster
82
- sink
83
- refrigerator
84
- ???
85
- book
86
- clock
87
- vase
88
- scissors
89
- teddy bear
90
- hair drier
91
- toothbrush
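The deleted `extras/labelmap_M.txt` above uses `???` for unused class slots, so any reader of the file has to skip those entries while keeping the remaining ids stable. A minimal sketch of such a loader follows; it assumes that the class id is the zero-based line index, which is a common convention for this kind of labelmap but is not stated in the commit. `load_labelmap` is a hypothetical name.

```python
def load_labelmap(text):
    """Map class id -> label, skipping '???' placeholders.

    Assumption (not confirmed by the commit): the class id is the
    zero-based index of the line within the file.
    """
    labels = {}
    for class_id, line in enumerate(text.splitlines()):
        name = line.strip()
        if name and name != "???":
            labels[class_id] = name
    return labels


if __name__ == "__main__":
    sample = "???\nperson\nbicycle\ncar"
    print(load_labelmap(sample))
```

Keeping the placeholders in the file and filtering at load time means the ids of real labels never shift when a slot is unused.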
 
fix_and_run.bat DELETED
@@ -1,40 +0,0 @@
1
- @echo off
2
- REM Quick Fix Script for VisionQ
3
- REM Clears cache and restarts
4
-
5
- echo.
6
- echo ============================================
7
- echo VisionQ - Quick Fix Script
8
- echo ============================================
9
- echo.
10
-
11
- echo [1/3] Clearing Python cache...
12
- if exist "__pycache__" rd /s /q __pycache__
13
- if exist "agents\__pycache__" rd /s /q agents\__pycache__
14
- if exist "config\__pycache__" rd /s /q config\__pycache__
15
- if exist "core\__pycache__" rd /s /q core\__pycache__
16
- if exist "ui\__pycache__" rd /s /q ui\__pycache__
17
- echo - Python cache cleared
18
-
19
- echo.
20
- echo [2/3] Clearing Streamlit cache...
21
- if exist ".streamlit\cache" rd /s /q .streamlit\cache
22
- echo - Streamlit cache cleared
23
-
24
- echo.
25
- echo [3/3] Restarting application...
26
- echo.
27
- echo ============================================
28
- echo Cache cleared! Starting VisionQ...
29
- echo ============================================
30
- echo.
31
-
32
- REM Activate venv if exists
33
- if exist "venv\Scripts\activate.bat" (
34
- call venv\Scripts\activate.bat
35
- )
36
-
37
- REM Run Streamlit
38
- streamlit run ui\app.py
39
-
40
- pause
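The cache-clearing step of the deleted `fix_and_run.bat` above is Windows-only. A cross-platform equivalent could look like the sketch below, which removes `__pycache__` from the same five locations the script lists. `CACHE_PARENTS` and `clear_pycache` are hypothetical names, and the sketch deliberately leaves out the Streamlit cache and the application launch.

```python
import shutil
from pathlib import Path

# Directories whose __pycache__ the batch script removes; "" is the project root.
CACHE_PARENTS = ("", "agents", "config", "core", "ui")


def clear_pycache(root="."):
    """Delete __pycache__ under the listed directories and return what was removed."""
    removed = []
    for parent in CACHE_PARENTS:
        cache_dir = Path(root) / parent / "__pycache__"
        if cache_dir.is_dir():
            shutil.rmtree(cache_dir)
            removed.append(cache_dir)
    return removed


if __name__ == "__main__":
    for path in clear_pycache("."):
        print(f"removed {path}")
```

Listing the directories explicitly, rather than searching recursively, mirrors the script and avoids touching caches inside a virtual environment.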
 
fix_tensorflow.bat DELETED
@@ -1,43 +0,0 @@
1
- @echo off
2
- REM Fix TensorFlow/Protobuf Conflict
3
-
4
- echo.
5
- echo ============================================
6
- echo Fixing TensorFlow/Protobuf Conflict
7
- echo ============================================
8
- echo.
9
-
10
- REM Activate venv
11
- if exist ".venv\Scripts\activate.bat" (
12
- call .venv\Scripts\activate.bat
13
- ) else if exist "venv\Scripts\activate.bat" (
14
- call venv\Scripts\activate.bat
15
- )
16
-
17
- echo [1/4] Uninstalling conflicting packages...
18
- pip uninstall tensorflow tensorflow-cpu protobuf -y
19
-
20
- echo.
21
- echo [2/4] Installing correct protobuf version...
22
- pip install protobuf==3.20.3
23
-
24
- echo.
25
- echo [3/4] Reinstalling transformers...
26
- pip install --upgrade --force-reinstall transformers
27
-
28
- echo.
29
- echo [4/4] Clearing cache...
30
- rd /s /q __pycache__ 2>nul
31
- rd /s /q agents\__pycache__ 2>nul
32
- rd /s /q config\__pycache__ 2>nul
33
- rd /s /q core\__pycache__ 2>nul
34
- rd /s /q ui\__pycache__ 2>nul
35
-
36
- echo.
37
- echo ============================================
38
- echo Fix Complete!
39
- echo ============================================
40
- echo.
41
- echo Now run: streamlit run ui\app.py
42
- echo.
43
- pause
 
memory.json DELETED
The diff for this file is too large to render. See raw diff
 
run_continuous.bat DELETED
@@ -1,30 +0,0 @@
1
- @echo off
2
- REM VisionQ - Continuous Camera Feed Version
3
-
4
- echo.
5
- echo ============================================
6
- echo VisionQ - Continuous Camera Feed
7
- echo ============================================
8
- echo.
9
-
10
- REM Activate venv
11
- if exist ".venv\Scripts\activate.bat" (
12
- call .venv\Scripts\activate.bat
13
- ) else if exist "venv\Scripts\activate.bat" (
14
- call venv\Scripts\activate.bat
15
- )
16
-
17
- echo [INFO] Launching VisionQ with continuous camera feed...
18
- echo [INFO] Opening browser at http://localhost:8501
19
- echo.
20
- echo Features:
21
- echo - Live camera feed
22
- echo - Adjustable refresh rate
23
- echo - Start/Stop camera control
24
- echo.
25
- echo Press Ctrl+C to stop the server
26
- echo.
27
-
28
- streamlit run ui\app_continuous.py
29
-
30
- pause
 
ui/app_continuous.py DELETED
@@ -1,340 +0,0 @@
1
- """
2
- VisionQ - Enhanced Streamlit Interface with Continuous Camera Feed
3
- """
4
-
5
- import streamlit as st
6
- import cv2
7
- import numpy as np
8
- from PIL import Image
9
- import sys
10
- from pathlib import Path
11
- import time
12
-
13
- # Add project root to path
14
- PROJECT_ROOT = Path(__file__).parent.parent
15
- sys.path.insert(0, str(PROJECT_ROOT))
16
-
17
- from config.settings import UI_CONFIG, OCR_CONFIG, SUPPORTED_LANGUAGES
18
- from agents.vision_agent import VisionAgent
19
- from agents.memory_agent import MemoryAgent
20
- from agents.query_agent import QueryAgent
21
-
22
- # Page config
23
- st.set_page_config(
24
- page_title=UI_CONFIG["title"],
25
- page_icon="👁️",
26
- layout=UI_CONFIG["layout"],
27
- )
28
-
29
- # Custom CSS
30
- st.markdown("""
31
- <style>
32
- .main-header {
33
- font-size: 3rem;
34
- font-weight: bold;
35
- text-align: center;
36
- margin-bottom: 2rem;
37
- background: linear-gradient(90deg, #667eea 0%, #764ba2 100%);
38
- -webkit-background-clip: text;
39
- -webkit-text-fill-color: transparent;
40
- }
41
- .success-box {
42
- padding: 1rem;
43
- border-radius: 0.5rem;
44
- background-color: #d4edda;
45
- border: 1px solid #c3e6cb;
46
- color: #155724;
47
- }
48
- </style>
49
- """, unsafe_allow_html=True)
50
-
51
- # Initialize session state
52
- if "vision_agent" not in st.session_state:
53
- st.session_state.vision_agent = None
54
- if "memory_agent" not in st.session_state:
55
- st.session_state.memory_agent = None
56
- if "query_agent" not in st.session_state:
57
- st.session_state.query_agent = None
58
- if "last_description" not in st.session_state:
59
- st.session_state.last_description = None
60
- if "camera_running" not in st.session_state:
61
- st.session_state.camera_running = False
62
-
63
- @st.cache_resource
64
- def load_agents():
65
- """Load all agents (cached)"""
66
- with st.spinner("Loading AI models... This may take a minute on first run..."):
67
- try:
68
- vision = VisionAgent()
69
- memory = MemoryAgent()
70
- query = QueryAgent(memory)
71
- return vision, memory, query
72
- except Exception as e:
73
- st.error(f"Error loading agents: {e}")
74
- return None, None, None
75
-
76
- def capture_frame(vision_agent):
77
- """Capture frame from camera"""
78
- ret, frame = vision_agent.cap.read()
79
- if ret:
80
- return frame
81
- return None
82
-
83
- def main():
84
- # Header
85
- st.markdown('<h1 class="main-header">VisionQ - Multimodal AI Assistant</h1>', unsafe_allow_html=True)
86
-
87
- # Sidebar
88
- with st.sidebar:
89
- st.header("Settings")
90
-
91
- # Language selection
92
- st.subheader("OCR Language")
93
- selected_langs = st.multiselect(
94
- "Select languages for text extraction:",
95
- options=list(SUPPORTED_LANGUAGES.keys()),
96
- default=OCR_CONFIG["languages"],
97
- format_func=lambda x: f"{SUPPORTED_LANGUAGES[x]} ({x})"
98
- )
99
-
100
- if selected_langs:
101
- OCR_CONFIG["languages"] = selected_langs
102
-
103
- st.divider()
104
-
105
- # Camera settings
106
- st.subheader("Camera Settings")
107
- refresh_rate = st.slider("Refresh rate (seconds)", 0.5, 5.0, 1.0, 0.5)
108
-
109
- st.divider()
110
-
111
- # Info
112
- st.subheader("About")
113
- st.info("""
114
- **VisionQ** is a multimodal AI assistant that can:
115
- - See and describe scenes
116
- - Read text (OCR)
117
- - Remember and recall
118
- - Search memories
119
- """)
120
-
121
- st.divider()
122
-
123
- # Stats
124
- st.subheader("System Status")
125
- if st.session_state.memory_agent:
126
- memories = st.session_state.memory_agent.recall_all()
127
- st.metric("Memories Stored", len(memories))
128
- else:
129
- st.metric("Memories Stored", "Not loaded")
130
-
131
- # Main content
132
- tab1, tab2, tab3, tab4 = st.tabs(["Vision", "Query", "Memories", "Help"])
133
-
134
- # TAB 1: VISION
135
- with tab1:
136
- st.header("Vision System")
137
-
138
- # Load agents
139
- if st.session_state.vision_agent is None:
140
- if st.button("Initialize System", type="primary"):
141
- st.cache_resource.clear()
142
- vision, memory, query = load_agents()
143
- if vision:
144
- st.session_state.vision_agent = vision
145
- st.session_state.memory_agent = memory
146
- st.session_state.query_agent = query
147
- st.success("System initialized successfully!")
148
- st.rerun()
149
- else:
150
- col1, col2 = st.columns([2, 1])
151
-
152
- with col1:
153
- st.subheader("Live Camera Feed")
154
-
155
- # Camera controls
156
- col_a, col_b, col_c, col_d = st.columns(4)
157
-
158
- with col_a:
159
- if st.button("Capture & Describe"):
160
- with st.spinner("Analyzing scene..."):
161
- description = st.session_state.vision_agent.describe_scene()
162
- if description:
163
- st.session_state.last_description = description
164
- st.success("Scene analyzed!")
165
-
166
- with col_b:
167
- if st.button("Remember Scene"):
168
- with st.spinner("Storing memory..."):
169
- description = st.session_state.vision_agent.remember_scene()
170
- if description:
171
- st.session_state.last_description = description
172
- st.success("Scene remembered!")
173
-
174
- with col_c:
175
- if st.button("Read Text"):
176
- with st.spinner("Extracting text..."):
177
- text_result = st.session_state.vision_agent.read_text()
178
- if text_result:
179
- st.session_state.last_description = text_result
180
- st.success("Text extracted!")
181
-
182
- with col_d:
183
- if st.button("Stop Camera" if st.session_state.camera_running else "Start Camera"):
184
- st.session_state.camera_running = not st.session_state.camera_running
185
- st.rerun()
186
-
-                # Camera feed placeholder
-                camera_placeholder = st.empty()
-
-                # Continuous camera feed
-                if st.session_state.camera_running:
-                    frame = capture_frame(st.session_state.vision_agent)
-                    if frame is not None:
-                        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
-                        camera_placeholder.image(frame_rgb, channels="RGB", use_container_width=True)
-                        time.sleep(refresh_rate)
-                        st.rerun()
-                    else:
-                        camera_placeholder.error("Could not capture frame from camera")
-                else:
-                    # Show single frame when stopped
-                    frame = capture_frame(st.session_state.vision_agent)
-                    if frame is not None:
-                        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
-                        camera_placeholder.image(frame_rgb, channels="RGB", use_container_width=True)
-                    else:
-                        camera_placeholder.info("Click 'Start Camera' to begin live feed")
-
-            with col2:
-                st.subheader("Results")
-
-                if st.session_state.last_description:
-                    st.markdown(f'<div class="success-box">{st.session_state.last_description}</div>',
-                                unsafe_allow_html=True)
-                else:
-                    st.info("Click a button to analyze the scene")
-
-    # TAB 2: QUERY
-    with tab2:
-        st.header("Query Memories")
-
-        if st.session_state.query_agent is None:
-            st.warning("Please initialize the system first (Vision tab)")
-        else:
-            st.subheader("Ask a Question")
-
-            query_text = st.text_input(
-                "Enter your question:",
-                placeholder="e.g., What did I see this morning?",
-                key="query_input"
-            )
-
-            st.caption("**Examples:**")
-            col1, col2, col3 = st.columns(3)
-            with col1:
-                if st.button("What did I see today?"):
-                    query_text = "What did I see today?"
-            with col2:
-                if st.button("When did I see a person?"):
-                    query_text = "When did I see a person?"
-            with col3:
-                if st.button("Show memories with text"):
-                    query_text = "Show memories with text"
-
-            if st.button("Search", type="primary") and query_text:
-                with st.spinner("Searching memories..."):
-                    result = st.session_state.query_agent.ask(query_text)
-
-                st.subheader("Results")
-                if "don't" in result.lower() or "no" in result.lower():
-                    st.info(result)
-                else:
-                    st.success(result)
-
-    # TAB 3: MEMORIES
-    with tab3:
-        st.header("Memory Browser")
-
-        if st.session_state.memory_agent is None:
-            st.warning("Please initialize the system first (Vision tab)")
-        else:
-            memories = st.session_state.memory_agent.recall_all()
-
-            if not memories:
-                st.info("No memories stored yet. Use the Vision tab to remember scenes!")
-            else:
-                st.success(f"Total memories: {len(memories)}")
-
-                for i, mem in enumerate(reversed(memories[-10:])):
-                    with st.expander(f"Memory #{mem.get('id', i)} - {mem.get('timestamp', 'Unknown')}"):
-                        st.write(f"**Description:** {mem.get('description', 'N/A')}")
-                        st.write(f"**Importance:** {mem.get('importance', 1)}")
-
-                        has_text_emb = "text_embedding" in mem
-                        has_img_emb = "image_embedding" in mem
-
-                        col1, col2 = st.columns(2)
-                        with col1:
-                            st.caption(f"Text Embedding: {'Yes' if has_text_emb else 'No'}")
-                        with col2:
-                            st.caption(f"Image Embedding: {'Yes' if has_img_emb else 'No'}")
-
-                st.divider()
-                if st.button("Clear All Memories", type="secondary"):
-                    if st.button("Confirm Clear"):
-                        st.session_state.memory_agent.memories = []
-                        st.session_state.memory_agent._save()
-                        st.success("All memories cleared!")
-                        st.rerun()
-
-    # TAB 4: HELP
-    with tab4:
-        st.header("Help & Documentation")
-
-        st.subheader("Quick Start")
-        st.markdown("""
-        1. **Initialize System**: Click "Initialize System" in the Vision tab
-        2. **Start Camera**: Click "Start Camera" for continuous feed
-        3. **Capture Scene**: Click "Capture & Describe" to analyze
-        4. **Remember**: Click "Remember Scene" to store in memory
-        5. **Read Text**: Click "Read Text" to extract visible text
-        6. **Query**: Go to Query tab and ask questions
-        """)
-
-        st.divider()
-
-        st.subheader("Camera Controls")
-        st.markdown("""
-        - **Start Camera**: Begins continuous live feed
-        - **Stop Camera**: Pauses live feed (saves resources)
-        - **Refresh Rate**: Adjust in sidebar (0.5-5 seconds)
-
-        **Tip:** Stop camera when not in use to save CPU/battery
-        """)
-
-        st.divider()
-
-        st.subheader("Supported Languages")
-        st.markdown(f"""
-        VisionQ supports **{len(SUPPORTED_LANGUAGES)} languages** for OCR.
-        Select languages in the sidebar settings.
-        """)
-
-        st.divider()
-
-        st.subheader("Troubleshooting")
-        st.markdown("""
-        **Camera not working?**
-        - Check camera permissions
-        - Ensure no other app is using camera
-        - Try clicking "Stop Camera" then "Start Camera"
-
-        **System slow?**
-        - Stop camera when not needed
-        - Increase refresh rate in sidebar
-        - Check `docs/PERFORMANCE.md`
-        """)
-
-if __name__ == "__main__":
-    main()