Claude committed on
Commit 4a741cb · unverified · 1 parent: 7840b0f

refactor: Replace video analyzer with blank Gradio 6 project


- Remove all existing source code, tests, and documentation
- Add minimal pyproject.toml with Gradio 6 dependency
- Add blank app.py with simple Gradio interface
- Add README.md with HuggingFace Spaces YAML frontmatter
- Generate uv.lock for dependency management
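
As an illustration only, a "minimal pyproject.toml with Gradio 6 dependency" of the kind this commit describes might look like the sketch below; the project name, version, and exact version pins are assumptions, not taken from the actual file:

```toml
[project]
name = "video-analyzer"        # assumed name
version = "0.1.0"              # assumed version
requires-python = ">=3.11"
dependencies = [
    "gradio>=6.0,<7",          # the Gradio 6 dependency the commit adds
]
```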

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .cursorrules +0 -71
  2. .env.example +0 -12
  3. .gitignore +22 -21
  4. DEPLOY_TO_HF_SPACES.md +0 -134
  5. PLAN.md +0 -437
  6. README.md +14 -192
  7. VOICE_COMMANDS_PLAN.md +0 -323
  8. app.py +11 -0
  9. data/audio/test_silence.wav +0 -0
  10. data/summaries/sample_real_estate_summary.md +0 -1
  11. data/transcripts/sample_real_estate.txt +0 -98
  12. hf_space/README.md +0 -39
  13. hf_space/app.py +0 -413
  14. hf_space/requirements.txt +0 -5
  15. pyproject.toml +9 -0
  16. pytest.ini +0 -9
  17. requirements.txt +0 -44
  18. src/__init__.py +0 -3
  19. src/__pycache__/__init__.cpython-312.pyc +0 -0
  20. src/__pycache__/config.cpython-312.pyc +0 -0
  21. src/__pycache__/main.cpython-312.pyc +0 -0
  22. src/analyzers/__init__.py +0 -26
  23. src/analyzers/__pycache__/__init__.cpython-312.pyc +0 -0
  24. src/analyzers/__pycache__/chunker.cpython-312.pyc +0 -0
  25. src/analyzers/__pycache__/huggingface.cpython-312.pyc +0 -0
  26. src/analyzers/__pycache__/summarizer.cpython-312.pyc +0 -0
  27. src/analyzers/chunker.py +0 -118
  28. src/analyzers/huggingface.py +0 -407
  29. src/analyzers/summarizer.py +0 -410
  30. src/config.py +0 -50
  31. src/downloaders/__init__.py +0 -6
  32. src/downloaders/files.py +0 -177
  33. src/downloaders/youtube.py +0 -264
  34. src/knowledge/__init__.py +0 -19
  35. src/knowledge/embeddings.py +0 -107
  36. src/knowledge/indexer.py +0 -151
  37. src/knowledge/vectorstore.py +0 -316
  38. src/main.py +0 -6
  39. src/mentor/__init__.py +0 -1
  40. src/processors/__init__.py +0 -18
  41. src/processors/__pycache__/__init__.cpython-312.pyc +0 -0
  42. src/processors/__pycache__/audio.cpython-312.pyc +0 -0
  43. src/processors/__pycache__/transcriber.cpython-312.pyc +0 -0
  44. src/processors/audio.py +0 -83
  45. src/processors/documents.py +0 -278
  46. src/processors/ocr.py +0 -133
  47. src/processors/transcriber.py +0 -243
  48. src/ui/__init__.py +0 -1
  49. src/ui/__pycache__/__init__.cpython-312.pyc +0 -0
  50. src/ui/__pycache__/cli.cpython-312.pyc +0 -0
.cursorrules DELETED
@@ -1,71 +0,0 @@
- # Cursor Engineering Ruleset
-
- ## 1. Context First
- Always request full context and constraints before proposing any decision.
- - Understand the problem completely before suggesting solutions
- - Ask clarifying questions when requirements are ambiguous
- - Consider existing codebase patterns and conventions
-
- ## 2. Tech Stack Principles
- Recommend tech stacks using:
- - Idiomatic, native patterns for the language/framework
- - Simple and maintainable components
- - Minimal unnecessary abstraction
- - Prefer standard library over external dependencies when reasonable
-
- ## 3. Scaffold Before Implementation
- Scaffold the project structure BEFORE implementation:
- - Clear domain boundaries
- - Clean folder organization
- - Conventional naming (language-specific conventions)
- - Consistent imports/exports
- - Document the structure in README
-
- ## 4. Test-Driven Development (TDD)
- Use TDD approach:
- - Tests define behavior before implementation
- - Define what failure looks like explicitly
- - No implementation until tests exist
- - Edge cases explicitly covered
- - Tests should be readable as documentation
-
- ## 5. Idempotent Functions
- All core functions must be idempotent:
- - Deterministic behavior (same input → same output)
- - Safe to re-run multiple times
- - No hidden state or side effects
- - Pure functions where possible
-
- ## 6. Simplicity First
- Optimize for simplicity:
- - Low cognitive load
- - Readable and clean code
- - Avoid cleverness and "magic"
- - Avoid premature optimization
- - YAGNI (You Aren't Gonna Need It)
- - DRY (Don't Repeat Yourself) but not at the cost of clarity
-
- ## 7. Idiomatic Code
- Use idiomatic language patterns at all times:
- - Follow language-specific style guides
- - Use conventional patterns for the ecosystem
- - Leverage language features appropriately
- - Write code that looks familiar to other developers
-
- ---
-
- ## Project-Specific Rules
-
- ### Video Analyzer Project
- - Use 100% free and open-source tools
- - Prefer local processing over cloud APIs
- - Keep user data private (process locally)
- - Support both CLI and future web UI
- - Modular architecture for easy extension
-
- ### Python Conventions
- - Type hints on all function signatures
- - Docstrings for public functions
- - Use pathlib for file paths
- - Rich for CLI output
- - Pydantic for configuration/validation
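
The idempotency rule (rule 5) in the deleted ruleset above is easy to make concrete; a small sketch, with illustrative helper names that are not from the project:

```python
import tempfile
from pathlib import Path

def ensure_dir(path: Path) -> Path:
    # mkdir with exist_ok=True makes repeated calls a no-op: idempotent
    path.mkdir(parents=True, exist_ok=True)
    return path

def slugify(title: str) -> str:
    # Pure and deterministic: same input always yields the same slug
    return "-".join(title.lower().split())

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "transcripts"
    first = ensure_dir(target)
    second = ensure_dir(target)  # safe to re-run
    assert first == second and target.is_dir()
```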
.env.example DELETED
@@ -1,12 +0,0 @@
- # Video Analyzer - Environment Variables
- # Copy this file to .env and fill in your values
-
- # Hugging Face API Key (optional - for faster API-based summarization)
- # Get your free key at: https://huggingface.co/settings/tokens
- HUGGINGFACE_API_KEY=your_token_here
-
- # Whisper Model Size (tiny, base, small, medium, large-v3)
- VIDEO_ANALYZER_WHISPER_MODEL=base
-
- # Default AI Backend (ollama, huggingface, huggingface-api)
- VIDEO_ANALYZER_AI_BACKEND=huggingface
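
For illustration, these variables could be read in Python with the stdlib alone (python-dotenv would first load `.env` into the process environment); the fallback defaults below are assumptions taken from the comments in the file:

```python
import os

# Variable names match .env.example above; fallback values are assumptions.
whisper_model = os.getenv("VIDEO_ANALYZER_WHISPER_MODEL", "base")
ai_backend = os.getenv("VIDEO_ANALYZER_AI_BACKEND", "huggingface")
hf_key = os.getenv("HUGGINGFACE_API_KEY")  # optional: None when unset

print(whisper_model, ai_backend)
```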
.gitignore CHANGED
@@ -1,38 +1,39 @@
- # Environment variables (contains secrets!)
- .env
- .env.local
-
  # Python
  __pycache__/
  *.py[cod]
  *$py.class
  *.so
  .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+
+ # Virtual environments
  .venv/
  venv/
  ENV/

- # Data directories (large files)
- data/downloads/
- data/audio/
- data/chromadb/
-
- # Keep transcripts and summaries (text files are small)
- # data/transcripts/
- # data/summaries/
-
- # Models cache
- models/
-
  # IDE
  .idea/
  .vscode/
  *.swp
  *.swo

- # OS
- .DS_Store
- Thumbs.db
+ # Environment
+ .env
+ .env.local

- # Logs
- *.log
+ # uv
+ .python-version
DEPLOY_TO_HF_SPACES.md DELETED
@@ -1,134 +0,0 @@
- # Deploy to HuggingFace Spaces
-
- This guide will help you deploy your Real Estate Mentor to HuggingFace Spaces for free.
-
- ## What You'll Get
-
- - 🌐 **Public URL** - Access from anywhere
- - 💾 **Persistent Storage** - Your data is saved
- - 🆓 **100% Free** - No cost on free tier
- - 🔒 **Private Option** - Can make it private
-
- ---
-
- ## Step 1: Create a New Space
-
- 1. Go to: https://huggingface.co/new-space
-
- 2. Fill in:
- ```
- Space name: real-estate-mentor
- License: MIT
- SDK: Gradio
- Hardware: CPU Basic (Free)
- Visibility: Public (or Private)
- ```
-
- 3. Click **"Create Space"**
-
- ---
-
- ## Step 2: Upload Files
-
- ### Option A: Upload via Web Interface
-
- 1. In your new Space, click **"Files"** tab
- 2. Click **"+ Add file"** → **"Upload files"**
- 3. Upload these files from the `hf_space/` folder:
- - `app.py`
- - `requirements.txt`
- - `README.md`
-
- ### Option B: Use Git (Recommended)
-
- ```bash
- # Clone your space
- git clone https://huggingface.co/spaces/YOUR_USERNAME/real-estate-mentor
- cd real-estate-mentor
-
- # Copy files from hf_space/
- cp /path/to/video_analyzer/hf_space/* .
-
- # Push to HuggingFace
- git add .
- git commit -m "Initial deployment"
- git push
- ```
-
- ---
-
- ## Step 3: Wait for Build
-
- 1. Go to your Space URL: `https://huggingface.co/spaces/YOUR_USERNAME/real-estate-mentor`
- 2. Watch the **"Building"** status
- 3. First build takes ~3-5 minutes (downloading models)
- 4. When ready, you'll see **"Running"** ✅
-
- ---
-
- ## Step 4: Enable Persistent Storage
-
- **Important:** To keep your data between restarts:
-
- 1. Go to Space **Settings**
- 2. Find **"Persistent Storage"**
- 3. Enable it (free tier: up to 50GB)
-
- This ensures your indexed content survives Space restarts.
-
- ---
-
- ## Step 5: Start Using It!
-
- 1. **Upload Tab** - Add your course transcripts
- 2. **Search Tab** - Find content semantically
- 3. **Ask Tab** - Chat with your AI mentor
- 4. **Status Tab** - See what's indexed
-
- ---
-
- ## Troubleshooting
-
- ### Space is "Sleeping"
-
- Free Spaces sleep after ~15 minutes of inactivity. Just visit the URL and it will wake up (takes ~30 seconds).
-
- ### Build Failed
-
- Check the **Logs** tab for errors. Common issues:
- - Missing dependencies → Check `requirements.txt`
- - Syntax errors → Check `app.py`
-
- ### Data Disappeared
-
- Make sure **Persistent Storage** is enabled in Settings.
-
- ---
-
- ## Upgrading (Optional)
-
- For faster performance, you can upgrade hardware:
-
- | Tier | Cost | Benefits |
- |------|------|----------|
- | CPU Basic | Free | Works fine, sleeps after 15 min |
- | CPU Upgrade | $0.03/hr | Faster, no sleep |
- | GPU | $0.60/hr | Much faster embeddings |
-
- ---
-
- ## Files Reference
-
- ```
- hf_space/
- ├── app.py # Main Gradio application
- ├── requirements.txt # Python dependencies
- └── README.md # Space description (shows on page)
- ```
-
- ---
-
- ## Need Help?
-
- - HuggingFace Docs: https://huggingface.co/docs/hub/spaces
- - Gradio Docs: https://gradio.app/docs/
PLAN.md DELETED
@@ -1,437 +0,0 @@
- # Video Analyzer - Project Plan
-
- ## Overview
- A comprehensive tool to download videos from multiple sources, transcribe them to text, summarize content, and build a searchable knowledge base. The end goal is to create a **Virtual Real Estate Mentor** from course materials.
-
- **🆓 100% Free & Open Source - No API costs!**
-
- ---
-
- ## Tech Stack (All Free & Open Source)
-
- | Component | Technology | License | Notes |
- |-----------|------------|---------|-------|
- | **Language** | Python 3.11+ | PSF | Main language |
- | **Video Download** | yt-dlp | Unlicense | Supports 1000+ sites |
- | **Audio Processing** | ffmpeg | LGPL/GPL | Industry standard |
- | **Transcription** | Whisper.cpp / faster-whisper | MIT | Local, fast, accurate |
- | **Document Parsing** | PyMuPDF, python-docx | AGPL/MIT | PDF, Word support |
- | **OCR** | Tesseract | Apache 2.0 | Image text extraction |
- | **Vector DB** | ChromaDB | Apache 2.0 | Local vector storage |
- | **Embeddings** | sentence-transformers | Apache 2.0 | all-MiniLM-L6-v2 model |
- | **LLM** | Ollama + Llama3/Mistral/Phi | Various OSS | Local AI, no API costs |
- | **Web UI** | Gradio | Apache 2.0 | Simple, beautiful UI |
- | **CLI** | Typer | MIT | Command-line interface |
- | **Database** | SQLite | Public Domain | Metadata storage |
-
- ---
-
- ## Core Features
-
- ### 1. Multi-Source Video Downloader
- - **Supported Platforms:**
- - YouTube, Vimeo, Dailymotion
- - Udemy (with cookies/auth)
- - Teachable, Thinkific, Kajabi
- - Direct video URLs (MP4, WebM, etc.)
- - Google Drive, Dropbox links
- - **Technology:** `yt-dlp` (free, actively maintained)
- - **Features:**
- - Playlist/batch downloading
- - Quality selection
- - Resume interrupted downloads
- - Metadata extraction (title, description, chapters)
- - Cookie-based authentication for paid courses
-
- ### 2. Audio Extraction & Transcription
- - **Audio Extraction:** `ffmpeg` (free)
- - **Speech-to-Text:**
- - **faster-whisper** - CTranslate2 optimized, 4x faster than original
- - Models: tiny, base, small, medium, large-v3
- - Runs entirely local - no internet needed
- - **Features:**
- - Speaker diarization (with pyannote - free for research)
- - Word-level timestamps
- - Multiple language support (99 languages)
- - Auto language detection
-
- ### 3. Document Processing
- - **Supported Formats:**
- - PDF (PyMuPDF - fast, accurate)
- - Word documents (python-docx)
- - PowerPoint slides (python-pptx)
- - Images with text (Tesseract OCR)
- - Markdown, TXT, HTML
- - **All libraries are free and open source**
-
- ### 4. Local LLM for Summarization & Analysis
- - **Ollama** - Run LLMs locally with simple API
- - **Recommended Models (all free):**
- | Model | Size | Speed | Quality | Best For |
- |-------|------|-------|---------|----------|
- | Phi-3 | 3.8B | ⚡⚡⚡ | Good | Fast summaries |
- | Mistral | 7B | ⚡⚡ | Great | Balanced |
- | Llama3 | 8B | ⚡⚡ | Excellent | Best quality |
- | Llama3 | 70B | ⚡ | Outstanding | If you have GPU |
-
- - **Features:**
- - Quick summaries
- - Detailed study notes
- - Key concept extraction
- - Action items and strategies
- - Q&A over content
-
- ### 5. Knowledge Base & Vector Storage
- - **ChromaDB** - Local vector database (free)
- - **Embeddings:** sentence-transformers
- - Model: `all-MiniLM-L6-v2` (fast, 384 dimensions)
- - Alternative: `all-mpnet-base-v2` (better quality, slower)
- - **Features:**
- - Semantic search across all content
- - Source attribution with timestamps
- - Hybrid search (semantic + keyword)
- - No cloud, all local
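
The semantic-search idea in the feature above can be sketched without the real libraries: every chunk gets an embedding vector, and a query is ranked against them by cosine similarity. The toy 3-dimensional vectors below stand in for the 384-dimensional all-MiniLM-L6-v2 embeddings; the data is invented for illustration:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "embeddings" stand in for sentence-transformers output.
index = {
    "cap rate definition": [0.9, 0.1, 0.0],
    "negotiation tactics": [0.1, 0.8, 0.2],
}
query = [0.85, 0.15, 0.05]
best = max(index, key=lambda doc: cosine(query, index[doc]))
print(best)  # → cap rate definition
```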
-
- ### 6. Virtual Mentor Chat Interface
- - **RAG (Retrieval Augmented Generation):**
- - Query → Find relevant chunks → Generate response
- - All runs locally with Ollama
- - **Interfaces:**
- - CLI chat (terminal)
- - Web UI (Gradio - beautiful, easy)
- - **Features:**
- - Context-aware responses
- - Source citations
- - Conversation memory
- - Export chat history
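
The "Query → Find relevant chunks → Generate response" pipeline above can be sketched minimally. Naive keyword overlap stands in for the ChromaDB semantic lookup, and the assembled prompt is what would be sent to the local Ollama model; function names and sample chunks are illustrative, not from the project:

```python
def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    # Stand-in for semantic search: score chunks by word overlap with the question
    words = set(question.lower().split())
    return sorted(chunks, key=lambda c: -len(words & set(c.lower().split())))[:k]

def build_prompt(question: str, context: list[str]) -> str:
    # Assemble the RAG prompt from retrieved context
    joined = "\n".join(f"- {c}" for c in context)
    return f"Use only this course context:\n{joined}\n\nQuestion: {question}"

chunks = [
    "Cap rate is net operating income divided by purchase price.",
    "Always get inspections before closing.",
    "Cash flow equals rent minus all expenses.",
]
top = retrieve("what is cap rate", chunks)
print(build_prompt("what is cap rate", top))
```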
-
- ---
-
- ## Architecture
-
- ```
- ┌─────────────────────────────────────────────────────────────────┐
- │ VIDEO ANALYZER (100% Local) │
- ├─────────────────────────────────────────────────────────────────┤
- │ │
- │ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
- │ │ Ingestion │ │ Processing │ │ Knowledge Base │ │
- │ ├──────────────┤ ├──────────────┤ ├──────────────────────┤ │
- │ │ • yt-dlp │ │ • ffmpeg │ │ • ChromaDB │ │
- │ │ • Cookies │→ │ • Whisper │→ │ • sentence-transform │ │
- │ │ • File input │ │ • Tesseract │ │ • SQLite metadata │ │
- │ └──────────────┘ └──────────────┘ └──────────────────────┘ │
- │ │
- │ ↓ │
- │ │
- │ ┌──────────────────────────────────────────────────────────┐ │
- │ │ Virtual Mentor (Ollama + RAG) │ │
- │ ├──────────────────────────────────────────────────────────┤ │
- │ │ • Llama3 / Mistral / Phi-3 (your choice) │ │
- │ │ • Context retrieval from ChromaDB │ │
- │ │ • Local inference - no API calls │ │
- │ │ • Gradio web interface │ │
- │ └──────────────────────────────────────────────────────────┘ │
- │ │
- └─────────────────────────────────────────────────────────────────┘
- ```
-
- ---
-
- ## System Requirements
-
- ### Minimum (CPU only)
- - **RAM:** 8GB (16GB recommended)
- - **Storage:** 20GB+ for models and data
- - **CPU:** Any modern x64 processor
- - **Whisper:** Use "small" or "base" model
- - **LLM:** Use Phi-3 (3.8B) model
-
- ### Recommended (with GPU)
- - **RAM:** 16GB+
- - **GPU:** NVIDIA with 8GB+ VRAM (RTX 3060+)
- - **Whisper:** Use "medium" or "large-v3" model
- - **LLM:** Use Llama3 8B or Mistral 7B
-
- ### Optimal (power user)
- - **GPU:** RTX 4090 or similar (24GB VRAM)
- - **LLM:** Llama3 70B for best quality
-
- ---
-
- ## Project Structure
-
- ```
- video_analyzer/
- ├── src/
- │ ├── __init__.py
- │ ├── main.py # Entry point
- │ ├── config.py # Configuration
- │ │
- │ ├── downloaders/ # Video/content downloaders
- │ │ ├── __init__.py
- │ │ ├── base.py # Base downloader class
- │ │ ├── ytdlp.py # yt-dlp wrapper
- │ │ └── files.py # Local file handling
- │ │
- │ ├── processors/ # Content processors
- │ │ ├── __init__.py
- │ │ ├── audio.py # Audio extraction (ffmpeg)
- │ │ ├── transcriber.py # Whisper transcription
- │ │ ├── documents.py # PDF, Word, PPT
- │ │ └── ocr.py # Tesseract OCR
- │ │
- │ ├── analyzers/ # AI analysis
- │ │ ├── __init__.py
- │ │ ├── summarizer.py # Ollama summarization
- │ │ ├── extractor.py # Key info extraction
- │ │ └── chunker.py # Text chunking
- │ │
- │ ├── knowledge/ # Knowledge base
- │ │ ├── __init__.py
- │ │ ├── vectorstore.py # ChromaDB
- │ │ ├── embeddings.py # sentence-transformers
- │ │ └── search.py # Semantic search
- │ │
- │ ├── mentor/ # Virtual mentor
- │ │ ├── __init__.py
- │ │ ├── rag.py # RAG pipeline
- │ │ ├── ollama_client.py # Ollama integration
- │ │ └── prompts.py # System prompts
- │ │
- │ └── ui/ # User interfaces
- │ ├── __init__.py
- │ ├── cli.py # Typer CLI
- │ └── web.py # Gradio web app
-
- ├── data/ # Data storage
- │ ├── downloads/ # Downloaded videos
- │ ├── audio/ # Extracted audio
- │ ├── transcripts/ # Transcriptions
- │ ├── summaries/ # Summaries
- │ └── chromadb/ # Vector database
-
- ├── models/ # Local model cache
- │ └── whisper/ # Whisper models
-
- ├── tests/
- ├── requirements.txt
- ├── install.sh # One-click setup script
- ├── .cursorrules
- └── README.md
- ```
-
- ---
-
- ## Dependencies (requirements.txt)
-
- ```
- # Core
- python-dotenv>=1.0.0
- typer[all]>=0.9.0
- rich>=13.0.0
-
- # Video/Audio
- yt-dlp>=2024.1.0
- ffmpeg-python>=0.2.0
-
- # Transcription
- faster-whisper>=1.0.0
- # or: openai-whisper>=20231117
-
- # Document Processing
- PyMuPDF>=1.23.0
- python-docx>=1.0.0
- python-pptx>=0.6.23
- pytesseract>=0.3.10
-
- # AI/ML
- sentence-transformers>=2.2.0
- chromadb>=0.4.0
- ollama>=0.1.0
-
- # Web UI
- gradio>=4.0.0
-
- # Utilities
- tqdm>=4.66.0
- pydantic>=2.0.0
- ```
-
- ---
-
- ## External Dependencies (System)
-
- ```bash
- # Ubuntu/Debian
- sudo apt install ffmpeg tesseract-ocr
-
- # macOS
- brew install ffmpeg tesseract
-
- # Windows
- # Download ffmpeg and tesseract installers
-
- # Ollama (all platforms)
- curl -fsSL https://ollama.com/install.sh | sh
- ollama pull llama3 # or mistral, phi3
- ```
-
- ---
-
- ## Development Phases
-
- ### Phase 1: Foundation (Week 1-2)
- [ ] Project setup & dependencies
- [ ] yt-dlp video downloader
- [ ] ffmpeg audio extraction
- [ ] faster-whisper transcription
- [ ] Basic CLI with Typer
-
- ### Phase 2: Processing Pipeline (Week 3-4)
- [ ] PDF/Word/PPT processing
- [ ] OCR for images
- [ ] Text chunking strategy
- [ ] SQLite metadata storage
- [ ] Batch processing
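
The "text chunking strategy" item above can be sketched as a simple overlapping word-window splitter, a toy version of what a chunker module might do; the window and overlap sizes are assumptions for illustration:

```python
def chunk_text(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    # Split into windows of `size` words, each overlapping the previous by `overlap`
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

sample = " ".join(f"w{i}" for i in range(120))
chunks = chunk_text(sample)
print(len(chunks))  # 3 windows: words 0-49, 40-89, 80-119
```

Overlap keeps a sentence that straddles a boundary retrievable from both neighboring chunks.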
-
- ### Phase 3: Knowledge Base (Week 5-6)
- [ ] sentence-transformers embeddings
- [ ] ChromaDB integration
- [ ] Semantic search
- [ ] Hybrid search (semantic + keyword)
- [ ] Source attribution
-
- ### Phase 4: Virtual Mentor (Week 7-8)
- [ ] Ollama integration
- [ ] RAG implementation
- [ ] Real estate prompts
- [ ] Conversation memory
- [ ] CLI chat interface
-
- ### Phase 5: Polish & UI (Week 9-10)
- [ ] Gradio web interface
- [ ] Progress tracking
- [ ] Export features
- [ ] Error handling
- [ ] Documentation
-
- ---
-
- ## Real Estate Mentor - Special Features
-
- ### Domain-Specific Prompts
- ```python
- REAL_ESTATE_SYSTEM_PROMPT = """
- You are a knowledgeable real estate mentor with expertise from
- the user's course materials. Help them with:
- - Deal analysis (cash flow, ROI, cap rates)
- - Negotiation strategies
- - Market analysis
- - Legal considerations
- - Financing options
-
- Always cite which video/document your advice comes from.
- """
- ```
-
- ### Deal Analysis Helper
- - Input property details
- - Get relevant strategies from course content
- - Calculate key metrics
- - Risk assessment based on learned material
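
The "calculate key metrics" step reduces to one-liners, e.g. cash-on-cash return and cap rate; the function names and sample figures are illustrative:

```python
def cash_on_cash(annual_cash_flow: float, total_cash_invested: float) -> float:
    # CoC Return = (Annual Cash Flow / Total Cash Invested) x 100
    return annual_cash_flow / total_cash_invested * 100

def cap_rate(net_operating_income: float, purchase_price: float) -> float:
    # Cap Rate = (NOI / Purchase Price) x 100
    return net_operating_income / purchase_price * 100

print(cash_on_cash(6_000, 50_000))  # → 12.0
print(cap_rate(12_000, 200_000))    # → 6.0
```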
-
- ### Study Features
- - Auto-generate flashcards
- - Create quizzes from content
- - Build glossary of terms
- - Track learning progress
-
- ---
-
- ## CLI Commands
-
- ```bash
- # Download video(s)
- video-analyzer download "https://youtube.com/watch?v=..."
- video-analyzer download --playlist "https://youtube.com/playlist?..."
- video-analyzer download --cookies cookies.txt "https://udemy.com/course/..."
-
- # Process content
- video-analyzer transcribe ./data/downloads/
- video-analyzer process ./documents/ # PDFs, Word, etc.
-
- # Build knowledge base
- video-analyzer index # Index all processed content
- video-analyzer search "what is cap rate"
-
- # Summarize
- video-analyzer summarize ./data/transcripts/video1.txt
- video-analyzer summarize --all # Summarize everything
-
- # Chat with mentor
- video-analyzer chat # CLI chat
- video-analyzer ui # Launch web UI
-
- # Utilities
- video-analyzer status # Show processing status
- video-analyzer export # Export all notes
- ```
-
- ---
-
- ## Web UI Preview
-
- ```
- ┌─────────────────────────────────────────────────────────────┐
- │ 🎓 Real Estate Mentor [⚙️] │
- ├─────────────────────────────────────────────────────────────┤
- │ │
- │ ┌─────────────────────────────────────────────────────┐ │
- │ │ 📚 Knowledge Base: 47 videos, 12 documents indexed │ │
- │ └─────────────────────────────────────────────────────┘ │
- │ │
- │ ┌─────────────────────────────────────────────────────┐ │
- │ │ You: How do I calculate cash-on-cash return? │ │
- │ │ │ │
- │ │ Mentor: Cash-on-cash return measures the annual │ │
- │ │ pre-tax cash flow relative to the total cash │ │
- │ │ invested. The formula is: │ │
- │ │ │ │
- │ │ CoC Return = (Annual Cash Flow / Total Cash) × 100 │ │
- │ │ │ │
- │ │ 📖 Source: Module 3 - Investment Analysis (12:34) │ │
- │ └─────────────────────────────────────────────────────┘ │
- │ │
- │ [Type your question here... ] [Send] │
- │ │
- │ [📥 Add Content] [📊 Analyze Deal] [📝 Study Mode] │
- │ │
- └─────────────────────────────────────────────────────────────┘
- ```
-
- ---
-
- ## Cost Comparison
-
- | Approach | Monthly Cost | Our Approach |
- |----------|--------------|--------------|
- | OpenAI GPT-4 | $20-100+ | **$0** (Ollama) |
- | OpenAI Whisper API | $0.006/min | **$0** (local Whisper) |
- | Pinecone Vector DB | $70+ | **$0** (ChromaDB) |
- | Cloud transcription | $0.01-0.05/min | **$0** (local) |
- | **Total** | **$100+/month** | **$0** |
-
- **Only costs:** Electricity to run your computer 💡
-
- ---
-
- ## Next Steps
-
- 1. ✅ Plan complete - 100% free & open source
- 2. **Ready to start coding!**
-
- Shall I begin with Phase 1?
- - Set up project structure
- - Install dependencies
- - Build the video downloader
README.md CHANGED
@@ -1,192 +1,14 @@
- # Video Analyzer 🎬
-
- **100% Free & Open Source** - No API costs, runs entirely on your machine.
-
- A powerful tool to download videos from multiple sources, transcribe to text, summarize content, and build a searchable knowledge base with an AI-powered virtual mentor.
-
- ## 🎯 Use Case
-
- Turn online courses (like real estate training) into a personal AI mentor that can:
- - Answer questions about course content
- - Help analyze deals using learned strategies
- - Provide quick access to key concepts and definitions
- - **All running locally - your data stays private!**
-
- ## 🆓 100% Free Stack
-
- | Component | Tool | Cost |
- |-----------|------|------|
- | Video Download | yt-dlp | Free |
- | Transcription | Whisper (local) | Free |
- | Document Processing | PyMuPDF, python-docx | Free |
- | OCR | Tesseract | Free |
- | Summarization | Ollama (Llama3/Mistral) | Free |
- | Vector Database | ChromaDB | Free |
- | Web UI | Gradio | Free |
-
- **Total monthly cost: $0** 💰
-
- ## 📋 Features
-
- ### Phase 1 ✅
- - **YouTube video downloading** with yt-dlp
- - **AI transcription** using local Whisper
- - **Audio extraction** with ffmpeg
-
- ### Phase 2 ✅
- - **Direct file/folder import** - drop files and process
- - **PDF processing** with PyMuPDF
- - **Word/PowerPoint processing**
- - **OCR for images** with Tesseract
- - **AI summarization** with Ollama (local LLM)
- - **Smart text chunking** for long documents
-
- ### Coming Soon
- - **Phase 3:** Vector database + semantic search
- - **Phase 4:** Virtual mentor RAG chat
- - **Phase 5:** Web UI with Gradio
-
- ## 💻 Requirements
-
- **Minimum:**
- - 8GB RAM (16GB recommended)
- - Any modern CPU
- - 20GB storage
-
- **Recommended (for faster processing):**
- - NVIDIA GPU with 8GB+ VRAM
- - 16GB+ RAM
-
- ## 🚀 Quick Start
-
- ### 1. Install Dependencies
-
- ```bash
- # Clone and setup
- git clone <repo>
- cd video_analyzer
-
- # Install Python dependencies
- pip install -r requirements.txt
- ```
-
- ### 2. Install Ollama (for AI summaries)
-
- ```bash
- # Install Ollama
- curl -fsSL https://ollama.com/install.sh | sh
-
- # Pull a model (choose one)
- ollama pull llama3 # Best quality (8B params)
- ollama pull mistral # Good balance
- ollama pull phi3 # Fastest (3.8B params)
-
- # Start Ollama server
- ollama serve
- ```
-
- ### 3. Process Your Content
-
- ```bash
- # Add local files (videos, PDFs, Word docs, etc.)
- ./video-analyzer add /path/to/your/course/files
-
- # Process everything (transcribe videos, extract docs)
- ./video-analyzer process-all
-
- # Generate AI summaries
- ./video-analyzer summarize --all --type study_notes
-
- # Check status
- ./video-analyzer status
- ```
-
- ## 📖 CLI Commands
-
- ### Content Management
- ```bash
- ./video-analyzer add PATH # Add files/folders
- ./video-analyzer status # Show statistics
- ./video-analyzer list-content # List processed content
- ```
-
- ### Processing
- ```bash
- ./video-analyzer transcribe PATH # Transcribe audio/video
- ./video-analyzer process-docs [PATH] # Process PDF/Word/PPT
- ./video-analyzer process-images [PATH] # OCR images
- ./video-analyzer process-all [PATH] # Process everything
- ```
-
- ### YouTube (requires cookies)
- ```bash
- ./video-analyzer download URL --cookies cookies.txt
- ./video-analyzer process URL --cookies cookies.txt
- ```
-
- ### AI Summarization
- ```bash
- ./video-analyzer summarize PATH # Summarize one file
- ./video-analyzer summarize --all # Summarize all transcripts
- ./video-analyzer summarize -t real_estate # Real estate focus
- ./video-analyzer summarize -t study_notes # Study notes format
- ```
-
- ### Summary Types
-
- | Type | Description |
- |------|-------------|
- | `quick` | 2-3 paragraph overview |
- | `detailed` | Comprehensive summary with key points |
- | `study_notes` | Formatted notes with concepts, definitions, action items |
- | `real_estate` | Specialized for real estate content with deal analysis |
-
- ## 🔧 Whisper Models
-
- | Model | Size | Speed | Quality | RAM |
- |-------|------|-------|---------|-----|
- | tiny | 39M | ⚡⚡⚡⚡ | Basic | 1GB |
- | base | 74M | ⚡⚡⚡ | Good | 1GB |
- | small | 244M | ⚡⚡ | Great | 2GB |
- | medium | 769M | ⚡ | Excellent | 5GB |
- | large-v3 | 1550M | 🐢 | Best | 10GB |
-
- ## 📁 Project Structure
-
- ```
- video_analyzer/
- ├── src/
- │ ├── downloaders/ # yt-dlp, file handling
- │ ├── processors/ # Whisper, documents, OCR
- │ ├── analyzers/ # Ollama summarization, chunking
- │ ├── knowledge/ # Vector DB (Phase 3)
- │ ├── mentor/ # RAG chat (Phase 4)
- │ └── ui/ # CLI, web interface
- ├── data/
- │ ├── downloads/ # Source files
- │ ├── audio/ # Extracted audio
- │ ├── transcripts/ # Text content
- │ └── summaries/ # AI summaries
- └── video-analyzer # CLI script
- ```
-
- ## 📁 Supported Formats
-
- | Type | Formats |
- |------|---------|
- | Video | .mp4, .mkv, .avi, .mov, .webm, .flv |
- | Audio | .mp3, .wav, .m4a, .flac, .aac, .ogg |
- | Document | .pdf, .docx, .pptx, .txt, .md |
- | Image (OCR) | .png, .jpg, .jpeg, .gif, .bmp |
-
- ## 🛠️ Development Status
-
- - [x] **Phase 1:** Video downloading + transcription
- - [x] **Phase 2:** Document processing + AI summarization
- - [ ] **Phase 3:** Knowledge base + vector search
- - [ ] **Phase 4:** Virtual mentor + RAG chat
- - [ ] **Phase 5:** Web UI + polish
-
- ## 📜 License
-
- MIT - Free for personal and commercial use

+ ---
+ title: Video Analyzer
+ emoji: "🎬"
+ colorFrom: blue
+ colorTo: purple
+ sdk: gradio
+ sdk_version: "6.2.0"
+ app_file: app.py
+ pinned: false
+ ---
+
+ # Video Analyzer
+
+ A Gradio application.
 
VOICE_COMMANDS_PLAN.md DELETED
@@ -1,323 +0,0 @@
# Voice Commands Plan - 100% Local & Private

## BLUF (Bottom Line Up Front)

**Add voice control to video_analyzer using Whisper (STT) + Piper (TTS) - both run entirely on your machine. No audio leaves your computer. No voice fingerprinting. No cloud APIs.**

---

## ELI5 (Explain Like I'm 5)

| What | How | Privacy |
|------|-----|---------|
| **You speak** | Microphone → Whisper (already in project!) | Audio never leaves your PC |
| **App understands** | Whisper converts speech → text command | All processing is local |
| **App responds** | Piper TTS converts text → speech | No voice profile created |
| **Loop** | Wake word → listen → execute → respond | 100% offline capable |

**Why this is private:**
- Whisper runs locally - OpenAI never sees your voice
- Piper TTS runs locally - no cloud synthesis
- No internet required after initial setup
- Your voice patterns stay on YOUR machine

---

## Tech Stack (All Free & Local)

| Component | Technology | Why This One |
|-----------|------------|--------------|
| **Speech-to-Text** | Whisper (faster-whisper) | Already in project! Fast, accurate, local |
| **Text-to-Speech** | Piper TTS | Fast, natural voices, 100% local, tiny models |
| **Wake Word** | Porcupine (free tier) or OpenWakeWord | Local detection, low CPU |
| **Audio Capture** | sounddevice + numpy | Cross-platform, real-time |
| **Command Parser** | Simple pattern matching → Ollama for complex | Start simple, add AI later |

### Alternative TTS Options

| TTS Engine | Quality | Speed | Size | Notes |
|------------|---------|-------|------|-------|
| **Piper** ⭐ | Great | ⚡⚡⚡ | 20-60MB | Best balance, recommended |
| Coqui TTS | Excellent | ⚡⚡ | 200MB+ | More natural, heavier |
| espeak-ng | Basic | ⚡⚡⚡⚡ | 5MB | Robotic but lightweight |
| Bark | Amazing | ⚡ | 5GB+ | Too heavy for real-time |

---

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                VOICE COMMAND SYSTEM (100% Local)                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐      ┌──────────────┐      ┌──────────────────┐   │
│  │  Microphone  │      │  Wake Word   │      │  Command Parser  │   │
│  │  (sounddev)  │ ──▶  │  (Porcupine) │ ──▶  │ (pattern/Ollama) │   │
│  └──────────────┘      └──────────────┘      └──────────────────┘   │
│         │                                            │              │
│         ▼                                            ▼              │
│  ┌──────────────┐                          ┌──────────────────┐     │
│  │   Whisper    │                          │ Execute Command  │     │
│  │    (STT)     │                          │  (existing CLI)  │     │
│  │    LOCAL     │                          └──────────────────┘     │
│  └──────────────┘                                    │              │
│                                                      ▼              │
│                                            ┌──────────────────┐     │
│  ┌──────────────┐                          │    Piper TTS     │     │
│  │   Speaker    │ ◀────────────────────────│     (local)      │     │
│  └──────────────┘                          └──────────────────┘     │
│                                                                     │
│   🔒 ALL PROCESSING ON LOCAL MACHINE - NOTHING SENT TO CLOUD 🔒     │
└─────────────────────────────────────────────────────────────────────┘
```

---

## Voice Command Flow

```
1. IDLE STATE
   └─▶ Listening for wake word ("Hey Analyzer" / "Computer")

2. WAKE WORD DETECTED
   └─▶ Play acknowledgment sound
   └─▶ Start recording user speech

3. USER SPEAKS COMMAND
   └─▶ "Summarize my latest video"
   └─▶ Silence detection → stop recording

4. SPEECH-TO-TEXT (Whisper)
   └─▶ Audio → "summarize my latest video"

5. COMMAND PARSING
   └─▶ Match to CLI command: `./video-analyzer summarize --latest`
   └─▶ For complex queries → use Ollama to interpret

6. EXECUTE & RESPOND
   └─▶ Run command
   └─▶ Get result text
   └─▶ Piper TTS → Speak result

7. RETURN TO IDLE
```

---

## Supported Voice Commands (Examples)

| Voice Command | Maps To | Category |
|---------------|---------|----------|
| "What's my status" | `./video-analyzer status` | Info |
| "Summarize latest video" | `./video-analyzer summarize --latest` | Processing |
| "Add files from downloads" | `./video-analyzer add ~/Downloads` | Content |
| "Process all videos" | `./video-analyzer process-all` | Processing |
| "Search for cap rate" | `./video-analyzer search "cap rate"` | Knowledge |
| "Start chat mode" | `./video-analyzer chat` | Interactive |
| "What did I learn about negotiation" | RAG query via Ollama | Q&A |

---

## Project Structure (New Files)

```
src/
├── voice/                 # NEW MODULE
│   ├── __init__.py
│   ├── listener.py        # Microphone capture + wake word
│   ├── stt.py             # Whisper wrapper for real-time
│   ├── tts.py             # Piper TTS wrapper
│   ├── commands.py        # Command pattern matching
│   └── assistant.py       # Main voice assistant loop

├── processors/
│   └── transcriber.py     # Already exists - reuse for STT
```

---

## Implementation Phases

### Phase 1: Basic TTS (Speak Responses) — 2-3 hours
- [ ] Install Piper TTS
- [ ] Create `src/voice/tts.py`
- [ ] Add `--speak` flag to CLI commands
- [ ] Test: `./video-analyzer status --speak`

### Phase 2: Real-time STT (Hear Commands) — 3-4 hours
- [ ] Install sounddevice for audio capture
- [ ] Create `src/voice/stt.py` (wrap existing Whisper)
- [ ] Implement silence detection (stop recording)
- [ ] Test: record → transcribe → print

### Phase 3: Command Parsing — 2-3 hours
- [ ] Create `src/voice/commands.py`
- [ ] Pattern matching for simple commands
- [ ] Ollama fallback for complex/natural queries
- [ ] Map voice → CLI commands

### Phase 4: Wake Word Detection — 2-3 hours
- [ ] Choose: Porcupine (easier) or OpenWakeWord (more private)
- [ ] Create `src/voice/listener.py`
- [ ] Continuous low-power listening
- [ ] Wake → record → process cycle

### Phase 5: Voice Assistant Loop — 2-3 hours
- [ ] Create `src/voice/assistant.py`
- [ ] Full loop: wake → listen → parse → execute → speak
- [ ] Add `./video-analyzer voice` command
- [ ] Handle errors gracefully with voice feedback

### Phase 6: Polish — 2-3 hours
- [ ] Acknowledgment sounds (beeps/chimes)
- [ ] Voice feedback for long operations ("Processing, please wait...")
- [ ] Configurable wake word
- [ ] Voice selection for TTS

---

## Dependencies to Add

```txt
# Voice Commands - requirements.txt additions

# Audio capture
sounddevice>=0.4.6
numpy>=1.24.0

# Text-to-Speech (local)
piper-tts>=1.2.0
# Alternative: TTS>=0.22.0  # Coqui TTS

# Wake Word Detection (choose one)
pvporcupine>=3.0.0   # Easier setup, free tier
# openwakeword>=0.5.0  # Fully open source

# Voice Activity Detection
webrtcvad>=2.0.10
```

---

## System Dependencies

```bash
# Ubuntu/Debian
sudo apt install portaudio19-dev python3-pyaudio

# For Piper TTS voices (download once)
mkdir -p ~/.local/share/piper
cd ~/.local/share/piper
wget https://github.com/rhasspy/piper/releases/download/v1.2.0/voice-en_US-lessac-medium.onnx.json
wget https://github.com/rhasspy/piper/releases/download/v1.2.0/voice-en_US-lessac-medium.onnx
```

---

## Privacy Guarantees

### What NEVER Leaves Your Machine
- ❌ Raw audio recordings
- ❌ Voice patterns/fingerprints
- ❌ Transcribed text
- ❌ Commands you speak
- ❌ Any biometric data

### What Stays 100% Local
- ✅ Whisper model runs locally
- ✅ Piper TTS runs locally
- ✅ Wake word detection runs locally
- ✅ All audio processing is local
- ✅ Works completely offline after setup

### Compared to Cloud Alternatives

| Cloud Service | What They Collect | Our Approach |
|---------------|-------------------|--------------|
| Alexa/Siri | Voice recordings, patterns | Nothing - all local |
| Google Assistant | Voice data, usage patterns | Nothing - all local |
| OpenAI Whisper API | Audio sent to cloud | Local Whisper - never sent |
| ElevenLabs | Voice for cloning | Local Piper - no upload |

---

## Configuration Options

```json
// config/voice.json
{
  "wake_word": "hey analyzer",
  "stt_model": "base",            // tiny/base/small/medium
  "tts_voice": "en_US-lessac-medium",
  "tts_speed": 1.0,
  "silence_threshold": 0.5,       // seconds of silence to stop
  "confirmation_sounds": true,
  "speak_responses": true,
  "max_listen_time": 30           // seconds
}
```

---

## Example Usage

```bash
# Start voice assistant mode
./video-analyzer voice

# One-shot voice command
./video-analyzer voice --once

# Status with spoken response
./video-analyzer status --speak

# Process with voice feedback
./video-analyzer process-all --speak
```

### Voice Session Example

```
[System]: Listening for "Hey Analyzer"...
[You]:    "Hey Analyzer"
[System]: *beep* "Yes?"
[You]:    "What's my current status?"
[System]: "You have 12 videos transcribed, 8 documents processed,
           and 47 items in your knowledge base. 3 videos are
           pending transcription."
[System]: Listening for "Hey Analyzer"...
[You]:    "Hey Analyzer"
[System]: *beep* "Yes?"
[You]:    "Summarize the latest video about negotiation"
[System]: "Working on it... The latest video covers 5 key
           negotiation tactics: anchoring, the flinch,
           bracketing, nibbling, and the walk-away..."
```

---

## Why This Approach?

| Requirement | Solution |
|-------------|----------|
| **No voice collection** | All STT via local Whisper |
| **No fingerprinting** | No cloud = no profile building |
| **Works offline** | Everything runs locally |
| **Fast response** | Piper TTS is <100ms latency |
| **Natural voices** | Piper neural voices sound great |
| **Low resources** | Base Whisper + Piper = ~500MB RAM |

---

## Next Steps

1. **Start with Phase 1** - Get TTS working first (instant gratification)
2. **Then Phase 2** - Add STT (reuse existing Whisper code)
3. **Phases 3-5** - Build up the full assistant
4. **Phase 6** - Polish and customize

Ready to start implementing? Just say the word! 🎤
 
app.py ADDED
@@ -0,0 +1,11 @@
import gradio as gr

demo = gr.Interface(
    fn=lambda x: x,
    inputs=gr.Textbox(label="Input"),
    outputs=gr.Textbox(label="Output"),
    title="Video Analyzer",
)

if __name__ == "__main__":
    demo.launch()
data/audio/test_silence.wav DELETED
Binary file (32.1 kB)
 
data/summaries/sample_real_estate_summary.md DELETED
@@ -1 +0,0 @@
Real estate investing success comes from: understanding your numbers, doing thorough due diligence, Negotiating, and avoiding common pitfalls. In the next module, we'll dive into deeper into strategies and how to structure deals for maximum returns. Real Estate Investment Fundamentals - Course Transcript is available in English and Spanish. For more information, visit the Real Estate Investing Course Transcripts website or click here for the English version. For the Spanish version, go to the Real estate Investment Course Transcript website or visit the Dutch version.
 
 
data/transcripts/sample_real_estate.txt DELETED
@@ -1,98 +0,0 @@
Real Estate Investment Fundamentals - Course Transcript

Welcome to Module 1: Understanding Real Estate Investment Basics

Today we're going to cover the fundamental concepts every real estate investor needs to know. Whether you're just starting out or looking to expand your portfolio, these principles will guide your decision-making.

CASH FLOW ANALYSIS

Cash flow is the lifeblood of any real estate investment. Simply put, it's the money left over after you've collected rent and paid all expenses. Here's the basic formula:

Monthly Cash Flow = Gross Rent - Operating Expenses - Mortgage Payment

Let's break this down with an example. Say you have a rental property that brings in $2,000 per month in rent. Your expenses include:
- Property taxes: $200/month
- Insurance: $100/month
- Maintenance reserve: $150/month
- Property management: $160/month (8% of rent)
- Vacancy allowance: $100/month (5%)

Total operating expenses: $710/month
Mortgage payment: $900/month

Cash flow = $2,000 - $710 - $900 = $390/month positive cash flow

This is a healthy cash-flowing property!

CAP RATE (CAPITALIZATION RATE)

Cap rate helps you compare properties and determine value. It's calculated as:

Cap Rate = Net Operating Income (NOI) / Property Value

NOI is your annual income minus operating expenses (not including mortgage). Using our example:
- Annual gross rent: $24,000
- Annual operating expenses: $8,520
- NOI: $15,480

If the property is worth $200,000:
Cap Rate = $15,480 / $200,000 = 7.74%

Generally, higher cap rates mean higher returns but often come with more risk. Markets like New York might have 4% cap rates while smaller cities might offer 8-10%.

CASH-ON-CASH RETURN

This metric tells you how hard your actual invested cash is working:

Cash-on-Cash Return = Annual Cash Flow / Total Cash Invested

If you put $50,000 down on our example property:
- Annual cash flow: $390 x 12 = $4,680
- Cash-on-cash return: $4,680 / $50,000 = 9.36%

That means you're earning 9.36% on your actual cash investment - much better than a savings account!

THE 1% RULE

A quick screening tool: the monthly rent should be at least 1% of the purchase price. For a $200,000 property, you'd want at least $2,000/month in rent.

Our example property meets this rule: $2,000 / $200,000 = 1%

NEGOTIATION STRATEGIES

When making offers:
1. Always start below asking price - leave room to negotiate
2. Use inspection findings as leverage
3. Ask for seller concessions on closing costs
4. Be prepared to walk away - this is your strongest tool
5. Build rapport with the seller when possible

DUE DILIGENCE CHECKLIST

Before closing, verify:
- Rent rolls and actual income
- All operating expenses with documentation
- Property condition (get professional inspection)
- Comparable sales in the area
- Zoning and any restrictions
- Title search for liens or encumbrances

COMMON MISTAKES TO AVOID

1. Overestimating rental income
2. Underestimating repairs and maintenance
3. Not accounting for vacancy
4. Skipping proper inspections
5. Emotional decision-making
6. Over-leveraging (too much debt)

SUMMARY

Real estate investing success comes from:
- Understanding your numbers (cash flow, cap rate, CoC return)
- Doing thorough due diligence
- Negotiating effectively
- Avoiding common pitfalls
- Building for long-term wealth

In the next module, we'll dive deeper into financing strategies and how to structure deals for maximum returns.
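The worked figures in this deleted sample transcript are internally consistent; a short script (not part of the repository, just a sanity check of the arithmetic above) confirms them:

```python
# Sanity-check the transcript's example deal figures.
rent = 2000                                # monthly gross rent ($)
expenses = 200 + 100 + 150 + 160 + 100     # taxes, insurance, maintenance, mgmt, vacancy
mortgage = 900                             # monthly mortgage payment
cash_flow = rent - expenses - mortgage     # monthly cash flow
noi = (rent - expenses) * 12               # annual net operating income (excludes mortgage)
price = 200_000                            # property value
cap_rate = noi / price                     # capitalization rate
coc = (cash_flow * 12) / 50_000            # cash-on-cash return on $50k down

print(f"expenses={expenses}, cash_flow={cash_flow}, noi={noi}")
print(f"cap_rate={cap_rate * 100:.2f}%, coc={coc * 100:.2f}%")
```

These reproduce the transcript's $710 expenses, $390/month cash flow, $15,480 NOI, 7.74% cap rate, and 9.36% cash-on-cash return.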
 
hf_space/README.md DELETED
@@ -1,39 +0,0 @@
---
title: Real Estate Mentor
emoji: 🏠
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
---

# 🏠 Real Estate Mentor

Your AI-powered course assistant for real estate investing education.

## Features

- **🔍 Semantic Search** - Search your course content by meaning, not just keywords
- **💬 Ask Questions** - Get answers based on your indexed materials
- **📤 Easy Upload** - Add transcripts and notes with one click
- **💾 Persistent Storage** - Your data is saved between sessions

## How to Use

1. **Upload Content** - Go to the Upload tab and add your course transcripts
2. **Search** - Use natural language to find relevant information
3. **Ask** - Chat with your AI mentor about the content

## Tech Stack

- **Gradio** - Web interface
- **ChromaDB** - Vector database for semantic search
- **Sentence Transformers** - Text embeddings
- **100% Free** - Runs entirely on HuggingFace Spaces

## Privacy

Your uploaded content is stored in this Space's persistent storage. No data is sent to external services.
 
hf_space/app.py DELETED
@@ -1,413 +0,0 @@
"""
Real Estate Mentor - HuggingFace Spaces App

A semantic search and Q&A system for course content.
Upload transcripts, search by meaning, and get answers.
"""

import os
import sys
from pathlib import Path

# Add src to path for imports
sys.path.insert(0, str(Path(__file__).parent))

import gradio as gr

# Set up persistent storage paths for HF Spaces
DATA_DIR = Path(os.getenv("PERSISTENT_DIR", "/data" if os.path.exists("/data") else "./data"))
CHROMA_DIR = DATA_DIR / "chromadb"
TRANSCRIPTS_DIR = DATA_DIR / "transcripts"

# Ensure directories exist
for d in [CHROMA_DIR, TRANSCRIPTS_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print(f"Data directory: {DATA_DIR}")
print(f"ChromaDB directory: {CHROMA_DIR}")


# ============== KNOWLEDGE BASE ==============

class SimpleKnowledgeBase:
    """Simplified knowledge base for HF Spaces."""

    def __init__(self):
        self._client = None
        self._collection = None
        self._model = None

    def _init(self):
        if self._client is not None:
            return

        import chromadb
        from chromadb.config import Settings
        from sentence_transformers import SentenceTransformer

        # Initialize ChromaDB
        self._client = chromadb.PersistentClient(
            path=str(CHROMA_DIR),
            settings=Settings(anonymized_telemetry=False)
        )
        self._collection = self._client.get_or_create_collection(
            name="real_estate_mentor",
            metadata={"hnsw:space": "cosine"}
        )

        # Initialize embedding model
        self._model = SentenceTransformer("all-MiniLM-L6-v2")

        print(f"Knowledge base initialized: {self._collection.count()} documents")

    def add_text(self, text: str, source: str, chunk_size: int = 500):
        """Add text to the knowledge base in chunks."""
        self._init()

        # Simple chunking by sentences/paragraphs
        chunks = self._chunk_text(text, chunk_size)

        if not chunks:
            return 0

        # Generate embeddings
        embeddings = self._model.encode(chunks).tolist()

        # Generate IDs
        import hashlib
        ids = [
            hashlib.md5(f"{source}:{i}:{c[:50]}".encode()).hexdigest()
            for i, c in enumerate(chunks)
        ]

        # Add to collection
        self._collection.add(
            ids=ids,
            embeddings=embeddings,
            documents=chunks,
            metadatas=[{"source": source, "chunk_idx": i} for i in range(len(chunks))]
        )

        return len(chunks)

    def _chunk_text(self, text: str, chunk_size: int = 500) -> list[str]:
        """Split text into chunks."""
        if len(text) <= chunk_size:
            return [text] if text.strip() else []

        chunks = []
        paragraphs = text.split("\n\n")
        current_chunk = ""

        for para in paragraphs:
            if len(current_chunk) + len(para) <= chunk_size:
                current_chunk += para + "\n\n"
            else:
                if current_chunk.strip():
                    chunks.append(current_chunk.strip())
                current_chunk = para + "\n\n"

        if current_chunk.strip():
            chunks.append(current_chunk.strip())

        return chunks

    def search(self, query: str, n_results: int = 5) -> list[dict]:
        """Search the knowledge base."""
        self._init()

        if self._collection.count() == 0:
            return []

        # Generate query embedding
        query_embedding = self._model.encode(query).tolist()

        # Search
        results = self._collection.query(
            query_embeddings=[query_embedding],
            n_results=min(n_results, self._collection.count()),
            include=["documents", "metadatas", "distances"]
        )

        # Format results
        output = []
        if results["documents"] and results["documents"][0]:
            for i, doc in enumerate(results["documents"][0]):
                meta = results["metadatas"][0][i] if results["metadatas"] else {}
                dist = results["distances"][0][i] if results["distances"] else 0
                output.append({
                    "text": doc,
                    "source": meta.get("source", "unknown"),
                    "score": 1 - dist  # Convert distance to similarity
                })

        return output

    def count(self) -> int:
        """Get document count."""
        self._init()
        return self._collection.count()

    def get_sources(self) -> list[str]:
        """Get all sources."""
        self._init()
        results = self._collection.get(include=["metadatas"])
        sources = set()
        if results["metadatas"]:
            for meta in results["metadatas"]:
                if "source" in meta:
                    sources.add(meta["source"])
        return sorted(sources)

    def clear(self):
        """Clear the knowledge base."""
        self._init()
        self._client.delete_collection("real_estate_mentor")
        self._collection = self._client.create_collection(
            name="real_estate_mentor",
            metadata={"hnsw:space": "cosine"}
        )


# Global instance
kb = SimpleKnowledgeBase()


# ============== UI FUNCTIONS ==============

def search_knowledge(query: str, n_results: int = 5) -> str:
    """Search the knowledge base."""
    if not query.strip():
        return "⚠️ Please enter a search query."

    try:
        results = kb.search(query, n_results=int(n_results))

        if not results:
            return "📭 No results found. Upload some content first!"

        output = ["## 🔍 Search Results\n"]
        for i, r in enumerate(results, 1):
            source = Path(r["source"]).stem if r["source"] != "unknown" else "unknown"
            score = r["score"] * 100
            text = r["text"][:400] + "..." if len(r["text"]) > 400 else r["text"]

            output.append(f"### Result {i} — {score:.0f}% match")
            output.append(f"📄 **Source:** {source}\n")
            output.append(f"```\n{text}\n```\n")

        return "\n".join(output)

    except Exception as e:
        return f"❌ Error: {str(e)}"


def upload_file(file, source_name: str) -> str:
    """Process uploaded file."""
    if file is None:
        return "⚠️ Please select a file."

    try:
        # Read content
        with open(file.name, "r", encoding="utf-8", errors="ignore") as f:
            content = f.read()

        if not content.strip():
            return "⚠️ File is empty."

        # Use custom name or filename
        name = source_name.strip() if source_name.strip() else Path(file.name).stem

        # Save locally
        save_path = TRANSCRIPTS_DIR / f"{name}.txt"
        save_path.write_text(content)

        # Index
        chunks = kb.add_text(content, source=str(save_path))

        return f"""✅ **Successfully indexed!**

- **Source:** {name}
- **Chunks created:** {chunks}
- **Characters:** {len(content):,}
"""
    except Exception as e:
        return f"❌ Error: {str(e)}"


def upload_text(text: str, source_name: str) -> str:
    """Process pasted text."""
    if not text.strip():
        return "⚠️ Please enter some text."
    if not source_name.strip():
        return "⚠️ Please provide a source name."

    try:
        # Save locally
        save_path = TRANSCRIPTS_DIR / f"{source_name.strip()}.txt"
        save_path.write_text(text)

        # Index
        chunks = kb.add_text(text, source=str(save_path))

        return f"""✅ **Successfully indexed!**

- **Source:** {source_name}
- **Chunks created:** {chunks}
- **Characters:** {len(text):,}
"""
    except Exception as e:
        return f"❌ Error: {str(e)}"


def get_status() -> str:
    """Get knowledge base status."""
    try:
        count = kb.count()
        sources = kb.get_sources()

        output = [f"## 📊 Knowledge Base Status\n"]
        output.append(f"**Total chunks:** {count}")
        output.append(f"**Sources:** {len(sources)}\n")

        if sources:
            output.append("### 📁 Indexed Sources:")
            for s in sources[:15]:
                name = Path(s).stem
                output.append(f"- {name}")
            if len(sources) > 15:
                output.append(f"- *...and {len(sources) - 15} more*")
        else:
            output.append("*No content indexed yet. Upload some files to get started!*")

        return "\n".join(output)
    except Exception as e:
        return f"❌ Error: {str(e)}"


def clear_all() -> str:
    """Clear knowledge base."""
    try:
        kb.clear()
        return "✅ Knowledge base cleared!"
    except Exception as e:
        return f"❌ Error: {str(e)}"


def chat_respond(message: str, history: list) -> tuple:
    """Respond to chat message using RAG."""
    if not message.strip():
        return "", history

    try:
        # Search for context
        results = kb.search(message, n_results=3)

        if not results:
            response = "I don't have any relevant information yet. Please upload some course content first! 📚"
        else:
            # Build response from context
            sources = set()
            context_parts = []

            for r in results:
                source = Path(r["source"]).stem if r["source"] != "unknown" else "unknown"
                sources.add(source)
                context_parts.append(r["text"])

            context = "\n\n---\n\n".join(context_parts)

            response = f"""Based on your course materials:

{context[:1500]}{"..." if len(context) > 1500 else ""}

---
📚 *Sources: {", ".join(sources)}*"""

        history.append((message, response))
        return "", history

    except Exception as e:
        history.append((message, f"❌ Error: {str(e)}"))
        return "", history


# ============== BUILD APP ==============

with gr.Blocks(
    title="Real Estate Mentor",
    theme=gr.themes.Soft()
) as demo:

    gr.Markdown("""
    # 🏠 Real Estate Mentor

    Your AI-powered course assistant. Upload transcripts, search semantically, and ask questions.

    ---
    """)

    with gr.Tabs():
        # Search Tab
        with gr.TabItem("🔍 Search"):
            with gr.Row():
                with gr.Column(scale=4):
                    search_input = gr.Textbox(
                        label="Search Query",
                        placeholder="e.g., How do I calculate cash-on-cash return?",
                        lines=2
                    )
                with gr.Column(scale=1):
                    n_results_slider = gr.Slider(1, 10, value=5, step=1, label="Results")
            search_btn = gr.Button("🔍 Search", variant="primary")
            search_output = gr.Markdown()

            search_btn.click(search_knowledge, [search_input, n_results_slider], search_output)
            search_input.submit(search_knowledge, [search_input, n_results_slider], search_output)

        # Chat Tab
        with gr.TabItem("💬 Ask"):
            chatbot = gr.Chatbot(height=400, label="Chat")
            chat_input = gr.Textbox(label="Your Question", placeholder="Ask about your course content...")
            chat_btn = gr.Button("💬 Send", variant="primary")

            chat_btn.click(chat_respond, [chat_input, chatbot], [chat_input, chatbot])
            chat_input.submit(chat_respond, [chat_input, chatbot], [chat_input, chatbot])

        # Upload Tab
        with gr.TabItem("📤 Upload"):
            with gr.Row():
                with gr.Column():
                    gr.Markdown("### Upload File")
                    file_input = gr.File(label="Select .txt or .md file", file_types=[".txt", ".md"])
                    file_name = gr.Textbox(label="Custom Name (optional)", placeholder="e.g., Module 1")
                    file_btn = gr.Button("📤 Upload", variant="primary")
                    file_output = gr.Markdown()

                    file_btn.click(upload_file, [file_input, file_name], file_output)

                with gr.Column():
                    gr.Markdown("### Paste Text")
                    text_input = gr.Textbox(label="Text Content", lines=8, placeholder="Paste transcript here...")
                    text_name = gr.Textbox(label="Source Name", placeholder="e.g., Video 1 Notes")
                    text_btn = gr.Button("📥 Index", variant="primary")
                    text_output = gr.Markdown()

                    text_btn.click(upload_text, [text_input, text_name], text_output)

        # Status Tab
        with gr.TabItem("📊 Status"):
            status_output = gr.Markdown()
            with gr.Row():
                refresh_btn = gr.Button("🔄 Refresh")
                clear_btn = gr.Button("🗑️ Clear All", variant="stop")

            refresh_btn.click(get_status, outputs=status_output)
            clear_btn.click(clear_all, outputs=status_output)
            demo.load(get_status, outputs=status_output)

    gr.Markdown("---\n*Built with Gradio, ChromaDB & Sentence Transformers • 100% Free*")


if __name__ == "__main__":
    demo.launch()
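The deleted `_chunk_text` helper greedily packs whole paragraphs into roughly 500-character chunks. A standalone sketch of the same strategy, lifted out of the class for illustration (paragraphs longer than `chunk_size` still become single oversized chunks, as in the original):

```python
def chunk_text(text: str, chunk_size: int = 500) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    chunks, current = [], ""
    for para in text.split("\n\n"):
        if len(current) + len(para) <= chunk_size:
            current += para + "\n\n"  # paragraph still fits in the current chunk
        else:
            if current.strip():
                chunks.append(current.strip())  # flush the filled chunk
            current = para + "\n\n"             # start a new chunk with this paragraph
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Chunking on paragraph boundaries keeps each embedded chunk semantically coherent, which matters because each chunk is embedded and retrieved as a single unit.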
 
hf_space/requirements.txt DELETED
@@ -1,5 +0,0 @@
- # HuggingFace Spaces Requirements
- gradio>=4.0.0
- chromadb>=0.4.0
- sentence-transformers>=2.2.0
- torch>=2.0.0

pyproject.toml ADDED
@@ -0,0 +1,9 @@
+ [project]
+ name = "video-analyzer"
+ version = "0.1.0"
+ description = "A Gradio application"
+ readme = "README.md"
+ requires-python = ">=3.11"
+ dependencies = [
+ "gradio>=6.0.0",
+ ]

pytest.ini DELETED
@@ -1,9 +0,0 @@
- [pytest]
- testpaths = tests
- python_files = test_*.py
- python_classes = Test*
- python_functions = test_*
- addopts = -v --tb=short
- filterwarnings =
- ignore::DeprecationWarning
- ignore::UserWarning

requirements.txt DELETED
@@ -1,44 +0,0 @@
- # Video Analyzer - Dependencies
- # 100% Free & Open Source
-
- # Core
- python-dotenv>=1.0.0
- typer[all]>=0.9.0
- rich>=13.0.0
- pydantic>=2.0.0
- pydantic-settings>=2.0.0
- tqdm>=4.66.0
-
- # Video/Audio
- yt-dlp>=2024.1.0
-
- # Transcription
- faster-whisper>=1.0.0
-
- # Document Processing
- PyMuPDF>=1.23.0
- python-docx>=1.0.0
- python-pptx>=0.6.23
-
- # OCR (optional - requires system tesseract)
- pytesseract>=0.3.10
- Pillow>=10.0.0
-
- # AI/ML - Multiple options
- transformers>=4.36.0 # Hugging Face models
- torch>=2.0.0 # PyTorch backend
- # ollama>=0.1.0 # Optional: Ollama client
-
- # Testing
- pytest>=7.4.0
- pytest-cov>=4.1.0
-
- # Phase 3: Knowledge Base
- sentence-transformers>=2.2.0
- chromadb>=0.4.0
-
- # Phase 5: Web UI
- gradio>=4.0.0
-
- # Web UI (Phase 5)
- # gradio>=4.0.0

src/__init__.py DELETED
@@ -1,3 +0,0 @@
- """Video Analyzer - Download, transcribe, and learn from video content."""
-
- __version__ = "0.1.0"

src/__pycache__/__init__.cpython-312.pyc DELETED
Binary file (235 Bytes)
 
src/__pycache__/config.cpython-312.pyc DELETED
Binary file (2.08 kB)
 
src/__pycache__/main.cpython-312.pyc DELETED
Binary file (278 Bytes)
 
src/analyzers/__init__.py DELETED
@@ -1,26 +0,0 @@
- """AI analyzers for summarization and extraction."""
-
- from .chunker import chunk_text, chunk_for_summarization, TextChunk
- from .summarizer import Summarizer, summarize_file, Summary, OllamaClient
- from .huggingface import (
- HuggingFaceLocal,
- HuggingFaceAPI,
- HuggingFaceTextGen,
- summarize_with_huggingface,
- list_recommended_models
- )
-
- __all__ = [
- "chunk_text",
- "chunk_for_summarization",
- "TextChunk",
- "Summarizer",
- "summarize_file",
- "Summary",
- "OllamaClient",
- "HuggingFaceLocal",
- "HuggingFaceAPI",
- "HuggingFaceTextGen",
- "summarize_with_huggingface",
- "list_recommended_models"
- ]

src/analyzers/__pycache__/__init__.cpython-312.pyc DELETED
Binary file (685 Bytes)
 
src/analyzers/__pycache__/chunker.cpython-312.pyc DELETED
Binary file (3.85 kB)
 
src/analyzers/__pycache__/huggingface.cpython-312.pyc DELETED
Binary file (14.9 kB)
 
src/analyzers/__pycache__/summarizer.cpython-312.pyc DELETED
Binary file (14.3 kB)
 
src/analyzers/chunker.py DELETED
@@ -1,118 +0,0 @@
- """Text chunking for processing long documents with LLMs."""
-
- from dataclasses import dataclass
- from typing import Optional
-
-
- @dataclass
- class TextChunk:
- """A chunk of text with metadata."""
-
- text: str
- index: int
- start_char: int
- end_char: int
-
- @property
- def word_count(self) -> int:
- return len(self.text.split())
-
-
- def chunk_text(
- text: str,
- chunk_size: int = 4000,
- chunk_overlap: int = 200,
- separator: str = "\n\n"
- ) -> list[TextChunk]:
- """Split text into overlapping chunks.
-
- Args:
- text: Text to split
- chunk_size: Maximum characters per chunk
- chunk_overlap: Characters to overlap between chunks
- separator: Preferred split point (paragraphs, sentences, etc.)
-
- Returns:
- List of TextChunk objects
- """
- if len(text) <= chunk_size:
- return [TextChunk(text=text, index=0, start_char=0, end_char=len(text))]
-
- chunks = []
- start = 0
- index = 0
-
- while start < len(text):
- # Find end of chunk
- end = start + chunk_size
-
- if end >= len(text):
- # Last chunk
- chunk_text = text[start:]
- chunks.append(TextChunk(
- text=chunk_text,
- index=index,
- start_char=start,
- end_char=len(text)
- ))
- break
-
- # Try to find a good break point
- # Look for separator near the end of the chunk
- search_start = max(start + chunk_size - 500, start)
- search_end = min(start + chunk_size + 200, len(text))
- search_text = text[search_start:search_end]
-
- # Find last separator in search range
- sep_pos = search_text.rfind(separator)
- if sep_pos != -1:
- end = search_start + sep_pos + len(separator)
- else:
- # Fall back to sentence end
- for punct in [". ", "! ", "? ", "\n"]:
- punct_pos = search_text.rfind(punct)
- if punct_pos != -1:
- end = search_start + punct_pos + len(punct)
- break
-
- # Create chunk
- chunk_text = text[start:end].strip()
- if chunk_text:
- chunks.append(TextChunk(
- text=chunk_text,
- index=index,
- start_char=start,
- end_char=end
- ))
- index += 1
-
- # Move start with overlap
- start = end - chunk_overlap
-
- return chunks
-
-
- def chunk_for_summarization(
- text: str,
- max_tokens: int = 3000,
- chars_per_token: float = 4.0
- ) -> list[TextChunk]:
- """Chunk text optimized for LLM summarization.
-
- Args:
- text: Text to chunk
- max_tokens: Maximum tokens per chunk (for LLM context)
- chars_per_token: Approximate characters per token
-
- Returns:
- List of TextChunk objects
- """
- chunk_size = int(max_tokens * chars_per_token)
- overlap = int(chunk_size * 0.05) # 5% overlap for context
-
- return chunk_text(text, chunk_size=chunk_size, chunk_overlap=overlap)
-
-
- def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
- """Estimate number of tokens in text."""
- return int(len(text) / chars_per_token)

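The deleted `chunk_text` above walks the text with a fixed window and carries a small overlap into the next chunk. Its core sliding-window logic can be sketched in a few lines; `chunk`, `size`, and `overlap` here are illustrative names for this sketch, not the project's API, and the break-point search is omitted:

```python
def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Minimal sketch of fixed-size chunking with overlap (no separator search)."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Next chunk starts `overlap` characters before this one ended
        start = end - overlap
    return chunks

parts = chunk("a" * 100, size=40, overlap=10)
print(len(parts))  # 3 chunks covering [0:40], [30:70], [60:100]
```

With `size=40` and `overlap=10` each step advances 30 characters, so 100 characters yield three chunks whose edges share 10 characters of context.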
src/analyzers/huggingface.py DELETED
@@ -1,407 +0,0 @@
- """AI summarization using Hugging Face (local models or API)."""
-
- from dataclasses import dataclass
- from pathlib import Path
- from typing import Optional
- import os
-
- from rich.console import Console
- from rich.progress import Progress, SpinnerColumn, TextColumn
-
- from src.config import settings
- from src.analyzers.chunker import chunk_for_summarization, estimate_tokens
-
- console = Console()
-
-
- # Recommended models for different tasks
- RECOMMENDED_MODELS = {
- "summarization": {
- "small": "facebook/bart-large-cnn", # Fast, good for news-style
- "medium": "google/flan-t5-base", # Balanced
- "large": "google/flan-t5-large", # Better quality
- "best": "facebook/bart-large-xsum", # Abstractive summaries
- },
- "text_generation": {
- "small": "microsoft/phi-2", # 2.7B, very fast
- "medium": "mistralai/Mistral-7B-Instruct-v0.2", # 7B, good quality
- "large": "meta-llama/Llama-2-7b-chat-hf", # Requires access
- }
- }
-
-
- class HuggingFaceLocal:
- """Run Hugging Face models locally."""
-
- def __init__(
- self,
- model_name: str = "facebook/bart-large-cnn",
- device: str = "auto"
- ):
- self.model_name = model_name
- self.device = device
- self._model = None
- self._tokenizer = None
-
- def _load_model(self):
- """Lazy load the model."""
- if self._model is None:
- console.print(f"[bold green]Loading model:[/] {self.model_name}")
- console.print("[dim]This may take a few minutes on first run...[/]")
-
- try:
- from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
- import torch
- except ImportError:
- raise ImportError(
- "Transformers not installed. Run:\n"
- " pip install transformers torch"
- )
-
- # Determine device
- if self.device == "auto":
- device = 0 if torch.cuda.is_available() else -1
- else:
- device = 0 if self.device == "cuda" else -1
-
- # Load tokenizer and model
- self._tokenizer = AutoTokenizer.from_pretrained(self.model_name)
-
- # Use pipeline for easier inference
- self._pipeline = pipeline(
- "summarization",
- model=self.model_name,
- tokenizer=self._tokenizer,
- device=device
- )
-
- device_name = "GPU" if device >= 0 else "CPU"
- console.print(f"[green]✓[/] Model loaded on {device_name}")
-
- def summarize(
- self,
- text: str,
- max_length: int = 500,
- min_length: int = 100
- ) -> str:
- """Summarize text using local model.
-
- Args:
- text: Text to summarize
- max_length: Maximum summary length in tokens
- min_length: Minimum summary length in tokens
-
- Returns:
- Summary text
- """
- self._load_model()
-
- # Handle long texts by chunking
- tokens = estimate_tokens(text)
-
- if tokens > 1000: # BART/T5 have ~1024 token limit
- return self._summarize_chunks(text, max_length, min_length)
-
- result = self._pipeline(
- text,
- max_length=max_length,
- min_length=min_length,
- do_sample=False
- )
-
- return result[0]["summary_text"]
-
- def _summarize_chunks(
- self,
- text: str,
- max_length: int,
- min_length: int
- ) -> str:
- """Summarize long text in chunks."""
- chunks = chunk_for_summarization(text, max_tokens=800)
- console.print(f"[bold blue]Processing {len(chunks)} chunks...[/]")
-
- chunk_summaries = []
-
- with Progress(
- SpinnerColumn(),
- TextColumn("[progress.description]{task.description}"),
- console=console
- ) as progress:
- task = progress.add_task("Summarizing...", total=len(chunks))
-
- for i, chunk in enumerate(chunks):
- progress.update(task, description=f"Chunk {i+1}/{len(chunks)}")
-
- result = self._pipeline(
- chunk.text,
- max_length=max_length // len(chunks) + 50,
- min_length=min_length // len(chunks),
- do_sample=False
- )
- chunk_summaries.append(result[0]["summary_text"])
- progress.advance(task)
-
- # Combine summaries
- combined = " ".join(chunk_summaries)
-
- # If combined is still long, summarize again
- if len(combined) > 2000:
- console.print("[bold blue]Creating final summary...[/]")
- result = self._pipeline(
- combined,
- max_length=max_length,
- min_length=min_length,
- do_sample=False
- )
- return result[0]["summary_text"]
-
- return combined
-
-
- class HuggingFaceAPI:
- """Use Hugging Face Inference API (free tier available)."""
-
- def __init__(
- self,
- model_name: str = "facebook/bart-large-cnn",
- api_key: Optional[str] = None
- ):
- self.model_name = model_name
- # Check multiple sources for API key
- self.api_key = (
- api_key or
- os.getenv("HUGGINGFACE_API_KEY") or
- os.getenv("VIDEO_ANALYZER_HUGGINGFACE_API_KEY")
- )
-
- # Also try loading from settings
- if not self.api_key:
- try:
- from src.config import settings
- self.api_key = settings.huggingface_api_key
- except:
- pass
-
- self.api_url = f"https://router.huggingface.co/hf-inference/models/{model_name}"
-
- def _check_api_key(self):
- if not self.api_key:
- raise ValueError(
- "Hugging Face API key not found.\n"
- "Set it via:\n"
- " export HUGGINGFACE_API_KEY=your_key\n"
- "Or get a free key at: https://huggingface.co/settings/tokens"
- )
-
- def summarize(
- self,
- text: str,
- max_length: int = 500,
- min_length: int = 100
- ) -> str:
- """Summarize text using Hugging Face API.
-
- Args:
- text: Text to summarize
- max_length: Maximum summary length
- min_length: Minimum summary length
-
- Returns:
- Summary text
- """
- self._check_api_key()
-
- try:
- import requests
- except ImportError:
- raise ImportError("requests not installed. Run: pip install requests")
-
- headers = {"Authorization": f"Bearer {self.api_key}"}
-
- # Handle long texts
- tokens = estimate_tokens(text)
- if tokens > 1000:
- return self._summarize_chunks_api(text, max_length, min_length, headers)
-
- payload = {
- "inputs": text,
- "parameters": {
- "max_length": max_length,
- "min_length": min_length,
- "do_sample": False
- }
- }
-
- with console.status("[bold green]Calling Hugging Face API..."):
- response = requests.post(self.api_url, headers=headers, json=payload)
-
- if response.status_code != 200:
- error = response.json().get("error", response.text)
- raise Exception(f"API error: {error}")
-
- result = response.json()
-
- if isinstance(result, list) and len(result) > 0:
- return result[0].get("summary_text", str(result))
-
- return str(result)
-
- def _summarize_chunks_api(
- self,
- text: str,
- max_length: int,
- min_length: int,
- headers: dict
- ) -> str:
- """Summarize chunks via API."""
- import requests
-
- chunks = chunk_for_summarization(text, max_tokens=800)
- console.print(f"[bold blue]Processing {len(chunks)} chunks via API...[/]")
-
- chunk_summaries = []
-
- for i, chunk in enumerate(chunks):
- console.print(f"[dim]Chunk {i+1}/{len(chunks)}...[/]")
-
- payload = {
- "inputs": chunk.text,
- "parameters": {
- "max_length": max_length // len(chunks) + 50,
- "min_length": min(30, min_length // len(chunks)),
- "do_sample": False
- }
- }
-
- response = requests.post(self.api_url, headers=headers, json=payload)
-
- if response.status_code == 200:
- result = response.json()
- if isinstance(result, list) and len(result) > 0:
- chunk_summaries.append(result[0].get("summary_text", ""))
-
- return " ".join(chunk_summaries)
-
-
- class HuggingFaceTextGen:
- """Use Hugging Face for text generation (like Ollama alternative)."""
-
- def __init__(
- self,
- model_name: str = "microsoft/phi-2",
- device: str = "auto"
- ):
- self.model_name = model_name
- self.device = device
- self._pipeline = None
-
- def _load_model(self):
- """Lazy load the model."""
- if self._pipeline is None:
- console.print(f"[bold green]Loading model:[/] {self.model_name}")
- console.print("[dim]This may download several GB on first run...[/]")
-
- try:
- from transformers import pipeline
- import torch
- except ImportError:
- raise ImportError(
- "Transformers not installed. Run:\n"
- " pip install transformers torch accelerate"
- )
-
- # Determine device
- if self.device == "auto":
- device = "cuda" if torch.cuda.is_available() else "cpu"
- else:
- device = self.device
-
- self._pipeline = pipeline(
- "text-generation",
- model=self.model_name,
- device_map="auto" if device == "cuda" else None,
- torch_dtype="auto"
- )
-
- console.print(f"[green]✓[/] Model loaded on {device}")
-
- def generate(
- self,
- prompt: str,
- max_new_tokens: int = 500,
- temperature: float = 0.7
- ) -> str:
- """Generate text from prompt.
-
- Args:
- prompt: Input prompt
- max_new_tokens: Maximum tokens to generate
- temperature: Creativity (0-1)
-
- Returns:
- Generated text
- """
- self._load_model()
-
- result = self._pipeline(
- prompt,
- max_new_tokens=max_new_tokens,
- temperature=temperature,
- do_sample=temperature > 0,
- pad_token_id=self._pipeline.tokenizer.eos_token_id
- )
-
- generated = result[0]["generated_text"]
-
- # Remove the prompt from output
- if generated.startswith(prompt):
- generated = generated[len(prompt):].strip()
-
- return generated
-
-
- def summarize_with_huggingface(
- text: str,
- model: str = "facebook/bart-large-cnn",
- use_api: bool = False,
- api_key: Optional[str] = None,
- max_length: int = 500
- ) -> str:
- """Convenience function to summarize with Hugging Face.
-
- Args:
- text: Text to summarize
- model: Model name
- use_api: If True, use API instead of local
- api_key: API key (if using API)
- max_length: Maximum summary length
-
- Returns:
- Summary text
- """
- if use_api:
- client = HuggingFaceAPI(model, api_key)
- else:
- client = HuggingFaceLocal(model)
-
- return client.summarize(text, max_length=max_length)
-
-
- def list_recommended_models():
- """Display recommended Hugging Face models."""
- from rich.table import Table
-
- table = Table(title="Recommended Hugging Face Models")
- table.add_column("Task", style="cyan")
- table.add_column("Size", style="white")
- table.add_column("Model", style="green")
- table.add_column("Notes", style="dim")
-
- table.add_row("Summarization", "Small", "facebook/bart-large-cnn", "Fast, news-style")
- table.add_row("Summarization", "Medium", "google/flan-t5-base", "Balanced")
- table.add_row("Summarization", "Large", "google/flan-t5-large", "Better quality")
- table.add_row("Text Gen", "Small", "microsoft/phi-2", "2.7B, very fast")
- table.add_row("Text Gen", "Medium", "mistralai/Mistral-7B-Instruct-v0.2", "7B, good")
-
- console.print(table)

src/analyzers/summarizer.py DELETED
@@ -1,410 +0,0 @@
- """AI-powered summarization using Ollama (local, free)."""
-
- import json
- import subprocess
- from dataclasses import dataclass
- from pathlib import Path
- from typing import Optional
-
- from rich.console import Console
- from rich.progress import Progress, SpinnerColumn, TextColumn
-
- from src.config import settings
- from src.analyzers.chunker import chunk_for_summarization, estimate_tokens
-
- console = Console()
-
-
- @dataclass
- class Summary:
- """A generated summary."""
-
- text: str
- source_path: Optional[Path]
- model: str
- summary_type: str # quick, detailed, study_notes
- original_length: int
- summary_length: int
-
- @property
- def compression_ratio(self) -> float:
- """How much the text was compressed."""
- if self.original_length == 0:
- return 0
- return self.summary_length / self.original_length
-
- def save(self, output_path: Optional[Path] = None) -> Path:
- """Save summary to file."""
- if output_path is None:
- stem = self.source_path.stem if self.source_path else "summary"
- output_path = settings.summaries_dir / f"{stem}_{self.summary_type}.md"
-
- output_path.parent.mkdir(parents=True, exist_ok=True)
- output_path.write_text(self.text)
- return output_path
-
-
- # Prompts for different summary types
- PROMPTS = {
- "quick": """Summarize the following text in 2-3 paragraphs. Focus on the main points and key takeaways.
-
- TEXT:
- {text}
-
- SUMMARY:""",
-
- "detailed": """Create a detailed summary of the following text. Include:
- - Main topics covered
- - Key points and concepts
- - Important details and examples
- - Actionable insights
-
- TEXT:
- {text}
-
- DETAILED SUMMARY:""",
-
- "study_notes": """Create comprehensive study notes from the following text. Format as:
-
- ## Key Concepts
- - List main concepts with brief explanations
-
- ## Important Points
- - Bullet points of critical information
-
- ## Definitions
- - Any important terms defined
-
- ## Action Items
- - Practical steps or strategies mentioned
-
- ## Summary
- - Brief overall summary
-
- TEXT:
- {text}
-
- STUDY NOTES:""",
-
- "real_estate": """You are a real estate expert. Analyze the following content and extract:
-
- ## Key Real Estate Concepts
- - Investment strategies mentioned
- - Market analysis techniques
- - Deal evaluation methods
-
- ## Financial Metrics
- - ROI, Cap Rate, Cash-on-Cash calculations if mentioned
- - Financing strategies
-
- ## Negotiation & Strategy
- - Negotiation tactics
- - Deal structuring advice
-
- ## Action Items
- - Practical steps to take
-
- ## Critical Warnings
- - Risks or pitfalls mentioned
-
- TEXT:
- {text}
-
- REAL ESTATE ANALYSIS:"""
- }
-
-
- class OllamaClient:
- """Client for interacting with Ollama."""
-
- def __init__(self, model: str = "llama3"):
- self.model = model
- self._verified = False
-
- def is_available(self) -> bool:
- """Check if Ollama is running."""
- try:
- result = subprocess.run(
- ["ollama", "list"],
- capture_output=True,
- text=True,
- timeout=5
- )
- return result.returncode == 0
- except (subprocess.TimeoutExpired, FileNotFoundError):
- return False
-
- def list_models(self) -> list[str]:
- """List available Ollama models."""
- try:
- result = subprocess.run(
- ["ollama", "list"],
- capture_output=True,
- text=True,
- timeout=10
- )
- if result.returncode != 0:
- return []
-
- models = []
- for line in result.stdout.strip().split("\n")[1:]: # Skip header
- if line.strip():
- model_name = line.split()[0]
- models.append(model_name)
- return models
- except Exception:
- return []
-
- def pull_model(self, model: Optional[str] = None) -> bool:
- """Pull/download a model."""
- model = model or self.model
- console.print(f"[bold green]Pulling model:[/] {model}")
-
- try:
- result = subprocess.run(
- ["ollama", "pull", model],
- capture_output=False,
- timeout=600 # 10 minutes
- )
- return result.returncode == 0
- except Exception as e:
- console.print(f"[red]Error pulling model:[/] {e}")
- return False
-
- def generate(
- self,
- prompt: str,
- system: Optional[str] = None,
- temperature: float = 0.7,
- max_tokens: int = 2000
- ) -> str:
- """Generate text using Ollama.
-
- Args:
- prompt: The prompt to send
- system: Optional system message
- temperature: Creativity (0-1)
- max_tokens: Maximum response length
-
- Returns:
- Generated text
- """
- # Build the request
- request = {
- "model": self.model,
- "prompt": prompt,
- "stream": False,
- "options": {
- "temperature": temperature,
- "num_predict": max_tokens
- }
- }
-
- if system:
- request["system"] = system
-
- try:
- # Use ollama CLI with run command
- full_prompt = prompt
- if system:
- full_prompt = f"System: {system}\n\n{prompt}"
-
- result = subprocess.run(
- ["ollama", "run", self.model],
- input=full_prompt,
- capture_output=True,
- text=True,
- timeout=300 # 5 minutes
- )
-
- if result.returncode != 0:
- raise Exception(f"Ollama error: {result.stderr}")
-
- return result.stdout.strip()
-
- except subprocess.TimeoutExpired:
- raise Exception("Ollama request timed out")
- except FileNotFoundError:
- raise Exception(
- "Ollama not found. Install it:\n"
- " curl -fsSL https://ollama.com/install.sh | sh\n"
- " ollama pull llama3"
- )
-
-
- class Summarizer:
- """Summarize text using Ollama."""
-
- def __init__(self, model: str = "llama3"):
- self.client = OllamaClient(model)
- self.model = model
-
- def summarize(
- self,
- text: str,
- summary_type: str = "detailed",
- source_path: Optional[Path] = None
- ) -> Summary:
- """Summarize text using Ollama.
-
- Args:
- text: Text to summarize
- summary_type: Type of summary (quick, detailed, study_notes, real_estate)
- source_path: Optional path to source file
-
- Returns:
- Summary object
- """
- # Check Ollama availability
- if not self.client.is_available():
- raise Exception(
- "Ollama is not running. Start it with:\n"
- " ollama serve\n"
- "Or install: curl -fsSL https://ollama.com/install.sh | sh"
- )
-
- # Check if model is available
- models = self.client.list_models()
- if self.model not in models and f"{self.model}:latest" not in models:
- console.print(f"[yellow]Model {self.model} not found. Pulling...[/]")
- self.client.pull_model()
-
- # Get prompt template
- prompt_template = PROMPTS.get(summary_type, PROMPTS["detailed"])
-
- # Check if text needs chunking
- tokens = estimate_tokens(text)
-
- if tokens > 3000:
- # Process in chunks
- console.print(f"[bold blue]Text is long ({tokens} tokens). Processing in chunks...[/]")
- return self._summarize_chunks(text, summary_type, source_path, prompt_template)
- else:
- # Process directly
- return self._summarize_single(text, summary_type, source_path, prompt_template)
-
- def _summarize_single(
- self,
- text: str,
- summary_type: str,
- source_path: Optional[Path],
- prompt_template: str
- ) -> Summary:
- """Summarize a single chunk of text."""
- prompt = prompt_template.format(text=text)
-
- with Progress(
- SpinnerColumn(),
- TextColumn("[progress.description]{task.description}"),
- console=console
- ) as progress:
- progress.add_task(f"Generating {summary_type} summary...", total=None)
-
- response = self.client.generate(prompt)
-
- console.print(f"[green]✓[/] Summary generated")
-
- return Summary(
- text=response,
- source_path=source_path,
- model=self.model,
- summary_type=summary_type,
- original_length=len(text),
- summary_length=len(response)
- )
-
- def _summarize_chunks(
- self,
- text: str,
- summary_type: str,
- source_path: Optional[Path],
- prompt_template: str
- ) -> Summary:
- """Summarize text in chunks, then combine."""
- chunks = chunk_for_summarization(text)
- console.print(f"[bold blue]Split into {len(chunks)} chunks[/]")
-
- chunk_summaries = []
-
- with Progress(
- SpinnerColumn(),
- TextColumn("[progress.description]{task.description}"),
- console=console
- ) as progress:
- task = progress.add_task("Processing chunks...", total=len(chunks))
-
- for i, chunk in enumerate(chunks):
- progress.update(task, description=f"Processing chunk {i+1}/{len(chunks)}...")
-
- prompt = prompt_template.format(text=chunk.text)
- response = self.client.generate(prompt)
- chunk_summaries.append(response)
-
- progress.advance(task)
-
- # Combine chunk summaries
- if len(chunk_summaries) > 1:
- console.print("[bold blue]Combining chunk summaries...[/]")
-
- combined_text = "\n\n---\n\n".join(chunk_summaries)
-
- combine_prompt = f"""Combine these summaries into one coherent {summary_type} summary.
- Remove redundancy and organize the information clearly.
-
- SUMMARIES:
- {combined_text}
-
- COMBINED SUMMARY:"""
-
- final_response = self.client.generate(combine_prompt)
- else:
- final_response = chunk_summaries[0]
-
- console.print(f"[green]✓[/] Summary generated")
-
- return Summary(
- text=final_response,
- source_path=source_path,
- model=self.model,
- summary_type=summary_type,
- original_length=len(text),
- summary_length=len(final_response)
- )
-
-
- def summarize_file(
- path: Path,
- summary_type: str = "detailed",
- model: str = "llama3",
- output_dir: Optional[Path] = None
- ) -> Summary:
- """Summarize a transcript or document file.
-
- Args:
- path: Path to text file
- summary_type: Type of summary
- model: Ollama model to use
- output_dir: Output directory for summary
-
- Returns:
- Summary object
- """
- path = Path(path)
- output_dir = output_dir or settings.summaries_dir
-
- if not path.exists():
- raise FileNotFoundError(f"File not found: {path}")
-
- console.print(f"[bold green]Summarizing:[/] {path.name}")
-
- text = path.read_text(encoding="utf-8", errors="ignore")
-
- summarizer = Summarizer(model=model)
- summary = summarizer.summarize(text, summary_type=summary_type, source_path=path)
-
- # Save summary
- output_path = output_dir / f"{path.stem}_{summary_type}.md"
- summary.save(output_path)
- console.print(f"[green]✓[/] Saved: {output_path}")
-
- return summary

src/config.py DELETED
@@ -1,50 +0,0 @@
- """Configuration management for Video Analyzer."""
-
- from pathlib import Path
- from typing import Optional
- from pydantic_settings import BaseSettings, SettingsConfigDict
-
-
- class Settings(BaseSettings):
-     """Application settings."""
-
-     # Paths
-     base_dir: Path = Path(__file__).parent.parent
-     data_dir: Path = base_dir / "data"
-     downloads_dir: Path = data_dir / "downloads"
-     audio_dir: Path = data_dir / "audio"
-     transcripts_dir: Path = data_dir / "transcripts"
-     summaries_dir: Path = data_dir / "summaries"
-
-     # yt-dlp settings
-     video_format: str = "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best"
-     audio_format: str = "bestaudio[ext=m4a]/bestaudio/best"
-
-     # Whisper settings
-     whisper_model: str = "base"  # tiny, base, small, medium, large-v3
-     whisper_device: str = "auto"  # auto, cpu, cuda
-
-     # AI settings
-     ai_backend: str = "huggingface"  # ollama, huggingface, huggingface-api
-     huggingface_api_key: Optional[str] = None
-     ollama_model: str = "llama3"
-     huggingface_model: str = "facebook/bart-large-cnn"
-
-     # Processing
-     max_concurrent_downloads: int = 3
-
-     model_config = SettingsConfigDict(
-         env_prefix="VIDEO_ANALYZER_",
-         env_file=".env",
-         env_file_encoding="utf-8",
-         extra="ignore"  # Ignore extra env vars
-     )
-
-
- # Global settings instance
- settings = Settings()
-
- # Ensure directories exist
- for dir_path in [settings.downloads_dir, settings.audio_dir,
-                  settings.transcripts_dir, settings.summaries_dir]:
-     dir_path.mkdir(parents=True, exist_ok=True)
 
src/downloaders/__init__.py DELETED
@@ -1,6 +0,0 @@
- """Video downloaders and file handling."""
-
- from .youtube import YouTubeDownloader
- from .files import scan_files, import_files, FileInfo, get_file_type
-
- __all__ = ["YouTubeDownloader", "scan_files", "import_files", "FileInfo", "get_file_type"]
 
src/downloaders/files.py DELETED
@@ -1,177 +0,0 @@
- """Direct file and folder processing support."""
-
- import shutil
- from dataclasses import dataclass
- from pathlib import Path
- from typing import Optional
-
- from rich.console import Console
- from rich.table import Table
-
- from src.config import settings
-
- console = Console()
-
-
- @dataclass
- class FileInfo:
-     """Information about a local file."""
-
-     path: Path
-     name: str
-     size: int
-     file_type: str  # video, audio, document, image
-     extension: str
-
-     @property
-     def size_formatted(self) -> str:
-         """Return human-readable file size."""
-         if self.size >= 1024 * 1024 * 1024:
-             return f"{self.size / (1024**3):.1f} GB"
-         elif self.size >= 1024 * 1024:
-             return f"{self.size / (1024**2):.1f} MB"
-         elif self.size >= 1024:
-             return f"{self.size / 1024:.1f} KB"
-         return f"{self.size} B"
-
-
- # File type mappings
- VIDEO_EXTENSIONS = {".mp4", ".mkv", ".avi", ".mov", ".webm", ".flv", ".wmv", ".m4v"}
- AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".aac", ".ogg", ".wma"}
- DOCUMENT_EXTENSIONS = {".pdf", ".docx", ".doc", ".pptx", ".ppt", ".txt", ".md", ".rtf"}
- IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".webp"}
-
- ALL_SUPPORTED = VIDEO_EXTENSIONS | AUDIO_EXTENSIONS | DOCUMENT_EXTENSIONS | IMAGE_EXTENSIONS
-
-
- def get_file_type(path: Path) -> str:
-     """Determine file type from extension."""
-     ext = path.suffix.lower()
-     if ext in VIDEO_EXTENSIONS:
-         return "video"
-     elif ext in AUDIO_EXTENSIONS:
-         return "audio"
-     elif ext in DOCUMENT_EXTENSIONS:
-         return "document"
-     elif ext in IMAGE_EXTENSIONS:
-         return "image"
-     return "unknown"
-
-
- def scan_files(
-     path: Path,
-     recursive: bool = True,
-     file_types: Optional[list[str]] = None
- ) -> list[FileInfo]:
-     """Scan a file or directory for supported files.
-
-     Args:
-         path: File or directory path
-         recursive: If True, scan subdirectories
-         file_types: Filter by type - ['video', 'audio', 'document', 'image']
-
-     Returns:
-         List of FileInfo objects
-     """
-     path = Path(path)
-     files = []
-
-     if path.is_file():
-         # Single file
-         if path.suffix.lower() in ALL_SUPPORTED:
-             file_type = get_file_type(path)
-             if file_types is None or file_type in file_types:
-                 files.append(FileInfo(
-                     path=path,
-                     name=path.name,
-                     size=path.stat().st_size,
-                     file_type=file_type,
-                     extension=path.suffix.lower()
-                 ))
-     elif path.is_dir():
-         # Directory
-         pattern = "**/*" if recursive else "*"
-         for file_path in path.glob(pattern):
-             if file_path.is_file() and file_path.suffix.lower() in ALL_SUPPORTED:
-                 file_type = get_file_type(file_path)
-                 if file_types is None or file_type in file_types:
-                     files.append(FileInfo(
-                         path=file_path,
-                         name=file_path.name,
-                         size=file_path.stat().st_size,
-                         file_type=file_type,
-                         extension=file_path.suffix.lower()
-                     ))
-
-     # Sort by name
-     files.sort(key=lambda f: f.name.lower())
-     return files
-
-
- def import_files(
-     source: Path,
-     dest_dir: Optional[Path] = None,
-     copy: bool = True,
-     recursive: bool = True
- ) -> list[FileInfo]:
-     """Import files from a source location to the data directory.
-
-     Args:
-         source: Source file or directory
-         dest_dir: Destination directory (default: data/downloads)
-         copy: If True, copy files. If False, move files.
-         recursive: If True, scan subdirectories
-
-     Returns:
-         List of imported FileInfo objects
-     """
-     source = Path(source)
-     dest_dir = dest_dir or settings.downloads_dir
-     dest_dir.mkdir(parents=True, exist_ok=True)
-
-     files = scan_files(source, recursive=recursive)
-     imported = []
-
-     for file_info in files:
-         dest_path = dest_dir / file_info.name
-
-         # Handle duplicates
-         if dest_path.exists():
-             stem = dest_path.stem
-             suffix = dest_path.suffix
-             counter = 1
-             while dest_path.exists():
-                 dest_path = dest_dir / f"{stem}_{counter}{suffix}"
-                 counter += 1
-
-         # Copy or move
-         if copy:
-             shutil.copy2(file_info.path, dest_path)
-             console.print(f"[green]✓[/] Copied: {file_info.name}")
-         else:
-             shutil.move(file_info.path, dest_path)
-             console.print(f"[green]✓[/] Moved: {file_info.name}")
-
-         imported.append(FileInfo(
-             path=dest_path,
-             name=dest_path.name,
-             size=file_info.size,
-             file_type=file_info.file_type,
-             extension=file_info.extension
-         ))
-
-     return imported
-
-
- def list_supported_formats():
-     """Display all supported file formats."""
-     table = Table(title="Supported File Formats")
-     table.add_column("Type", style="cyan")
-     table.add_column("Extensions", style="white")
-
-     table.add_row("Video", ", ".join(sorted(VIDEO_EXTENSIONS)))
-     table.add_row("Audio", ", ".join(sorted(AUDIO_EXTENSIONS)))
-     table.add_row("Document", ", ".join(sorted(DOCUMENT_EXTENSIONS)))
-     table.add_row("Image (OCR)", ", ".join(sorted(IMAGE_EXTENSIONS)))
-
-     console.print(table)
 
src/downloaders/youtube.py DELETED
@@ -1,264 +0,0 @@
- """YouTube video downloader using yt-dlp."""
-
- import json
- import subprocess
- from dataclasses import dataclass, field
- from pathlib import Path
- from typing import Optional
-
- from rich.console import Console
- from rich.progress import Progress, SpinnerColumn, TextColumn
-
- from src.config import settings
-
- console = Console()
-
- # Add local bin paths
- import os
- os.environ["PATH"] = os.environ.get("PATH", "") + ":/home/ubuntu/.local/bin:/home/ubuntu/.deno/bin"
-
-
- @dataclass
- class VideoInfo:
-     """Information about a downloaded video."""
-
-     id: str
-     title: str
-     description: str
-     duration: int  # seconds
-     uploader: str
-     upload_date: str
-     url: str
-     filepath: Optional[Path] = None
-     audio_filepath: Optional[Path] = None
-     subtitles: Optional[str] = None
-     chapters: list = field(default_factory=list)
-
-     @property
-     def duration_formatted(self) -> str:
-         """Return duration in HH:MM:SS format."""
-         hours, remainder = divmod(self.duration, 3600)
-         minutes, seconds = divmod(remainder, 60)
-         if hours:
-             return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
-         return f"{minutes:02d}:{seconds:02d}"
-
-
- class YouTubeDownloader:
-     """Download videos from YouTube using yt-dlp."""
-
-     def __init__(self, output_dir: Optional[Path] = None, cookies_file: Optional[Path] = None):
-         self.output_dir = output_dir or settings.downloads_dir
-         self.output_dir.mkdir(parents=True, exist_ok=True)
-         self.cookies_file = cookies_file  # Path to cookies.txt for authenticated downloads
-
-     def get_info(self, url: str) -> VideoInfo:
-         """Get video information without downloading."""
-         cmd = [
-             "yt-dlp",
-             "--dump-json",
-             "--no-download",
-         ]
-
-         # Add cookies if provided
-         if self.cookies_file and Path(self.cookies_file).exists():
-             cmd.extend(["--cookies", str(self.cookies_file)])
-
-         cmd.append(url)
-
-         result = subprocess.run(cmd, capture_output=True, text=True)
-         if result.returncode != 0:
-             error_msg = result.stderr
-             if "Sign in to confirm you're not a bot" in error_msg:
-                 raise Exception(
-                     "YouTube requires authentication. Please provide a cookies file:\n"
-                     "1. Install a browser extension to export cookies (e.g., 'Get cookies.txt LOCALLY')\n"
-                     "2. Export cookies from youtube.com\n"
-                     "3. Use --cookies path/to/cookies.txt"
-                 )
-             raise Exception(f"Failed to get video info: {error_msg}")
-
-         data = json.loads(result.stdout)
-
-         return VideoInfo(
-             id=data.get("id", ""),
-             title=data.get("title", "Unknown"),
-             description=data.get("description", ""),
-             duration=data.get("duration", 0),
-             uploader=data.get("uploader", "Unknown"),
-             upload_date=data.get("upload_date", ""),
-             url=url,
-             chapters=data.get("chapters", [])
-         )
-
-     def download_video(
-         self,
-         url: str,
-         audio_only: bool = False,
-         get_subtitles: bool = True,
-         quality: str = "best"
-     ) -> VideoInfo:
-         """Download a video from YouTube.
-
-         Args:
-             url: YouTube video URL
-             audio_only: If True, download only audio (faster for transcription)
-             get_subtitles: If True, download auto-generated subtitles if available
-             quality: Video quality - 'best', '1080p', '720p', '480p', 'audio'
-
-         Returns:
-             VideoInfo with filepath set
-         """
-         # Get video info first
-         with console.status("[bold green]Getting video info..."):
-             info = self.get_info(url)
-
-         console.print(f"[bold blue]Title:[/] {info.title}")
-         console.print(f"[bold blue]Duration:[/] {info.duration_formatted}")
-         console.print(f"[bold blue]Uploader:[/] {info.uploader}")
-
-         # Build output template
-         output_template = str(self.output_dir / "%(id)s.%(ext)s")
-
-         # Build yt-dlp command
-         cmd = [
-             "yt-dlp",
-             "--output", output_template,
-             "--no-playlist",  # Download single video only
-             "--newline",  # Progress on new lines
-         ]
-
-         # Add cookies if provided
-         if self.cookies_file and Path(self.cookies_file).exists():
-             cmd.extend(["--cookies", str(self.cookies_file)])
-
-         # Format selection
-         if audio_only:
-             cmd.extend([
-                 "-x",  # Extract audio
-                 "--audio-format", "mp3",
-                 "--audio-quality", "0",  # Best quality
-             ])
-         else:
-             if quality == "best":
-                 cmd.extend(["-f", "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best"])
-             elif quality == "1080p":
-                 cmd.extend(["-f", "bestvideo[height<=1080][ext=mp4]+bestaudio[ext=m4a]/best[height<=1080]"])
-             elif quality == "720p":
-                 cmd.extend(["-f", "bestvideo[height<=720][ext=mp4]+bestaudio[ext=m4a]/best[height<=720]"])
-             elif quality == "480p":
-                 cmd.extend(["-f", "bestvideo[height<=480][ext=mp4]+bestaudio[ext=m4a]/best[height<=480]"])
-
-         # Subtitles
-         if get_subtitles:
-             cmd.extend([
-                 "--write-auto-sub",
-                 "--sub-lang", "en",
-                 "--sub-format", "srt/vtt/best",
-                 "--convert-subs", "srt",
-             ])
-
-         cmd.append(url)
-
-         # Download
-         console.print(f"\n[bold green]Downloading...[/]")
-
-         result = subprocess.run(cmd, capture_output=True, text=True)
-
-         if result.returncode != 0:
-             console.print(f"[red]Error:[/] {result.stderr}")
-             raise Exception(f"Download failed: {result.stderr}")
-
-         # Find the downloaded file
-         if audio_only:
-             filepath = self.output_dir / f"{info.id}.mp3"
-             info.audio_filepath = filepath
-         else:
-             # Try common extensions
-             for ext in ["mp4", "webm", "mkv"]:
-                 filepath = self.output_dir / f"{info.id}.{ext}"
-                 if filepath.exists():
-                     break
-             info.filepath = filepath
-
-         # Check for subtitles
-         subtitle_path = self.output_dir / f"{info.id}.en.srt"
-         if subtitle_path.exists():
-             info.subtitles = subtitle_path.read_text()
-             console.print(f"[green]✓[/] Subtitles downloaded")
-
-         console.print(f"[green]✓[/] Downloaded to: {filepath}")
-
-         return info
-
-     def download_playlist(
-         self,
-         url: str,
-         audio_only: bool = False,
-         max_videos: Optional[int] = None
-     ) -> list[VideoInfo]:
-         """Download all videos from a playlist.
-
-         Args:
-             url: YouTube playlist URL
-             audio_only: If True, download only audio
-             max_videos: Maximum number of videos to download
-
-         Returns:
-             List of VideoInfo objects
-         """
-         # Get playlist info
-         cmd = [
-             "yt-dlp",
-             "--dump-json",
-             "--flat-playlist",
-             url
-         ]
-
-         result = subprocess.run(cmd, capture_output=True, text=True)
-         if result.returncode != 0:
-             raise Exception(f"Failed to get playlist info: {result.stderr}")
-
-         # Parse each video entry
-         videos = []
-         for line in result.stdout.strip().split("\n"):
-             if line:
-                 data = json.loads(line)
-                 video_url = f"https://www.youtube.com/watch?v={data['id']}"
-                 videos.append(video_url)
-
-         if max_videos:
-             videos = videos[:max_videos]
-
-         console.print(f"[bold blue]Found {len(videos)} videos in playlist[/]")
-
-         # Download each video
-         downloaded = []
-         for i, video_url in enumerate(videos, 1):
-             console.print(f"\n[bold]Downloading {i}/{len(videos)}[/]")
-             try:
-                 info = self.download_video(video_url, audio_only=audio_only)
-                 downloaded.append(info)
-             except Exception as e:
-                 console.print(f"[red]Failed to download:[/] {e}")
-
-         return downloaded
-
-
- def download_youtube(
-     url: str,
-     audio_only: bool = False,
-     output_dir: Optional[Path] = None
- ) -> VideoInfo:
-     """Convenience function to download a YouTube video.
-
-     Args:
-         url: YouTube video URL
-         audio_only: If True, download only audio
-         output_dir: Output directory (default: data/downloads)
-
-     Returns:
-         VideoInfo with download details
-     """
-     downloader = YouTubeDownloader(output_dir)
-     return downloader.download_video(url, audio_only=audio_only)
 
src/knowledge/__init__.py DELETED
@@ -1,19 +0,0 @@
- """Knowledge base with vector storage."""
-
- from .embeddings import EmbeddingModel, embed_text, embed_texts
- from .vectorstore import KnowledgeBase, SearchResult, get_knowledge_base, search
- from .indexer import index_text, index_file, index_directory, reindex_all
-
- __all__ = [
-     "EmbeddingModel",
-     "embed_text",
-     "embed_texts",
-     "KnowledgeBase",
-     "SearchResult",
-     "get_knowledge_base",
-     "search",
-     "index_text",
-     "index_file",
-     "index_directory",
-     "reindex_all",
- ]
 
src/knowledge/embeddings.py DELETED
@@ -1,107 +0,0 @@
- """Embedding generation using sentence-transformers (local, free)."""
-
- from typing import Optional
- import numpy as np
-
- from rich.console import Console
-
- console = Console()
-
-
- class EmbeddingModel:
-     """Generate embeddings using sentence-transformers."""
-
-     # Recommended models (all free, run locally)
-     MODELS = {
-         "fast": "all-MiniLM-L6-v2",  # 384 dims, very fast
-         "balanced": "all-mpnet-base-v2",  # 768 dims, good quality
-         "multilingual": "paraphrase-multilingual-MiniLM-L12-v2",  # 384 dims
-     }
-
-     def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
-         """Initialize embedding model.
-
-         Args:
-             model_name: Model name or key from MODELS dict
-         """
-         # Allow shorthand names
-         if model_name in self.MODELS:
-             model_name = self.MODELS[model_name]
-
-         self.model_name = model_name
-         self._model = None
-
-     def _load_model(self):
-         """Lazy load the model."""
-         if self._model is None:
-             console.print(f"[bold green]Loading embedding model:[/] {self.model_name}")
-
-             try:
-                 from sentence_transformers import SentenceTransformer
-             except ImportError:
-                 raise ImportError(
-                     "sentence-transformers not installed. Run:\n"
-                     "  pip install sentence-transformers"
-                 )
-
-             self._model = SentenceTransformer(self.model_name)
-             console.print(f"[green]✓[/] Model loaded (dim={self._model.get_sentence_embedding_dimension()})")
-
-     @property
-     def dimension(self) -> int:
-         """Get embedding dimension."""
-         self._load_model()
-         return self._model.get_sentence_embedding_dimension()
-
-     def embed(self, text: str) -> list[float]:
-         """Generate embedding for a single text.
-
-         Args:
-             text: Text to embed
-
-         Returns:
-             List of floats (embedding vector)
-         """
-         self._load_model()
-         embedding = self._model.encode(text, convert_to_numpy=True)
-         return embedding.tolist()
-
-     def embed_batch(self, texts: list[str], show_progress: bool = True) -> list[list[float]]:
-         """Generate embeddings for multiple texts.
-
-         Args:
-             texts: List of texts to embed
-             show_progress: Show progress bar
-
-         Returns:
-             List of embedding vectors
-         """
-         self._load_model()
-         embeddings = self._model.encode(
-             texts,
-             convert_to_numpy=True,
-             show_progress_bar=show_progress
-         )
-         return embeddings.tolist()
-
-
- # Global instance for convenience
- _default_model: Optional[EmbeddingModel] = None
-
-
- def get_embedding_model(model_name: str = "all-MiniLM-L6-v2") -> EmbeddingModel:
-     """Get or create the default embedding model."""
-     global _default_model
-     if _default_model is None or _default_model.model_name != model_name:
-         _default_model = EmbeddingModel(model_name)
-     return _default_model
-
-
- def embed_text(text: str) -> list[float]:
-     """Convenience function to embed a single text."""
-     return get_embedding_model().embed(text)
-
-
- def embed_texts(texts: list[str]) -> list[list[float]]:
-     """Convenience function to embed multiple texts."""
-     return get_embedding_model().embed_batch(texts)
 
src/knowledge/indexer.py DELETED
@@ -1,151 +0,0 @@
- """Index content into the knowledge base."""
-
- from pathlib import Path
- from typing import Optional
-
- from rich.console import Console
- from rich.progress import Progress, SpinnerColumn, TextColumn
-
- from src.config import settings
- from src.analyzers.chunker import chunk_for_summarization
- from src.knowledge.vectorstore import KnowledgeBase, get_knowledge_base
-
- console = Console()
-
-
- def index_text(
-     text: str,
-     source: str,
-     kb: Optional[KnowledgeBase] = None,
-     chunk_size: int = 1000
- ) -> int:
-     """Index a text into the knowledge base.
-
-     Args:
-         text: Text content to index
-         source: Source identifier
-         kb: Knowledge base (uses default if None)
-         chunk_size: Characters per chunk
-
-     Returns:
-         Number of chunks indexed
-     """
-     kb = kb or get_knowledge_base()
-
-     # Chunk the text
-     chunks = chunk_for_summarization(text, max_tokens=chunk_size // 4)
-
-     if not chunks:
-         return 0
-
-     # Extract just the text from chunks
-     texts = [c.text for c in chunks]
-     metadatas = [{"start_char": c.start_char, "end_char": c.end_char} for c in chunks]
-
-     # Add to knowledge base
-     kb.add_texts(texts, source=source, metadatas=metadatas)
-
-     return len(chunks)
-
-
- def index_file(
-     path: Path,
-     kb: Optional[KnowledgeBase] = None
- ) -> int:
-     """Index a file into the knowledge base.
-
-     Args:
-         path: Path to text file
-         kb: Knowledge base (uses default if None)
-
-     Returns:
-         Number of chunks indexed
-     """
-     path = Path(path)
-
-     if not path.exists():
-         console.print(f"[red]File not found:[/] {path}")
-         return 0
-
-     text = path.read_text(encoding="utf-8", errors="ignore")
-
-     if not text.strip():
-         console.print(f"[yellow]Empty file:[/] {path.name}")
-         return 0
-
-     return index_text(text, source=str(path), kb=kb)
-
-
- def index_directory(
-     path: Optional[Path] = None,
-     kb: Optional[KnowledgeBase] = None,
-     extensions: list[str] = [".txt", ".md"]
- ) -> dict:
-     """Index all text files in a directory.
-
-     Args:
-         path: Directory path (defaults to transcripts_dir)
-         kb: Knowledge base
-         extensions: File extensions to index
-
-     Returns:
-         Dict with stats {files: int, chunks: int}
-     """
-     path = path or settings.transcripts_dir
-     path = Path(path)
-     kb = kb or get_knowledge_base()
-
-     # Find all text files
-     files = []
-     for ext in extensions:
-         files.extend(path.glob(f"*{ext}"))
-
-     if not files:
-         console.print(f"[yellow]No files found in {path}[/]")
-         return {"files": 0, "chunks": 0}
-
-     console.print(f"[bold blue]Indexing {len(files)} files...[/]")
-
-     total_chunks = 0
-     indexed_files = 0
-
-     for file_path in files:
-         try:
-             chunks = index_file(file_path, kb=kb)
-             if chunks > 0:
-                 indexed_files += 1
-                 total_chunks += chunks
-         except Exception as e:
-             console.print(f"[red]Error indexing {file_path.name}:[/] {e}")
-
-     console.print(f"[green]✓[/] Indexed {indexed_files} files, {total_chunks} chunks")
-
-     return {"files": indexed_files, "chunks": total_chunks}
-
-
- def reindex_all(kb: Optional[KnowledgeBase] = None) -> dict:
-     """Clear and reindex everything.
-
-     Args:
-         kb: Knowledge base
-
-     Returns:
-         Dict with stats
-     """
-     kb = kb or get_knowledge_base()
-
-     console.print("[bold yellow]Clearing existing index...[/]")
-     kb.clear()
-
-     # Index transcripts
-     console.print("\n[bold blue]Indexing transcripts...[/]")
-     transcript_stats = index_directory(settings.transcripts_dir, kb=kb)
-
-     # Index summaries
-     console.print("\n[bold blue]Indexing summaries...[/]")
-     summary_stats = index_directory(settings.summaries_dir, kb=kb, extensions=[".md", ".txt"])
-
-     return {
-         "files": transcript_stats["files"] + summary_stats["files"],
-         "chunks": transcript_stats["chunks"] + summary_stats["chunks"]
-     }
 
src/knowledge/vectorstore.py DELETED
@@ -1,316 +0,0 @@
1
- """Vector store using ChromaDB (local, free, persistent)."""
2
-
3
- import hashlib
4
- import json
5
- from dataclasses import dataclass
6
- from pathlib import Path
7
- from typing import Optional
8
-
9
- from rich.console import Console
10
-
11
- from src.config import settings
12
-
13
- console = Console()
14
-
15
-
16
- @dataclass
17
- class SearchResult:
18
- """A search result from the knowledge base."""
19
-
20
- text: str
21
- source: str
22
- score: float # Similarity score (higher = more similar)
23
- metadata: dict
24
-
25
- @property
26
- def source_name(self) -> str:
27
- """Get just the filename from source path."""
28
- return Path(self.source).stem if self.source else "unknown"
29
-
30
-
31
- class KnowledgeBase:
32
- """Vector store for semantic search using ChromaDB."""
33
-
34
- def __init__(
35
- self,
36
- persist_dir: Optional[Path] = None,
37
- collection_name: str = "video_analyzer"
38
- ):
39
- """Initialize knowledge base.
40
-
41
- Args:
42
- persist_dir: Directory for persistent storage
43
- collection_name: Name of the ChromaDB collection
44
- """
45
- self.persist_dir = persist_dir or (settings.data_dir / "chromadb")
46
- self.collection_name = collection_name
47
- self._client = None
48
- self._collection = None
49
- self._embedding_model = None
50
-
51
- def _init_db(self):
52
- """Initialize ChromaDB client and collection."""
53
- if self._client is None:
54
- try:
55
- import chromadb
56
- from chromadb.config import Settings as ChromaSettings
57
- except ImportError:
58
- raise ImportError(
59
- "ChromaDB not installed. Run:\n"
60
- " pip install chromadb"
61
- )
62
-
63
- # Create persistent client
64
- self.persist_dir.mkdir(parents=True, exist_ok=True)
65
-
66
- self._client = chromadb.PersistentClient(
67
- path=str(self.persist_dir),
68
- settings=ChromaSettings(anonymized_telemetry=False)
69
- )
70
-
71
- # Get or create collection
72
- self._collection = self._client.get_or_create_collection(
73
- name=self.collection_name,
74
- metadata={"description": "Video Analyzer Knowledge Base"}
75
- )
76
-
77
- console.print(f"[green]✓[/] Knowledge base loaded: {self._collection.count()} documents")
78
-
79
- def _get_embedding_model(self):
80
- """Get the embedding model."""
81
- if self._embedding_model is None:
82
- from src.knowledge.embeddings import EmbeddingModel
83
- self._embedding_model = EmbeddingModel()
84
- return self._embedding_model
85
-
86
- def _generate_id(self, text: str, source: str) -> str:
87
- """Generate a unique ID for a document."""
88
- content = f"{source}:{text[:100]}"
89
- return hashlib.md5(content.encode()).hexdigest()
90
-
91
- def add_text(
92
- self,
93
- text: str,
94
- source: str,
95
- metadata: Optional[dict] = None
96
- ) -> str:
97
- """Add a single text to the knowledge base.
98
-
99
- Args:
100
- text: Text content
101
- source: Source file path or identifier
102
- metadata: Additional metadata
103
-
104
- Returns:
105
- Document ID
106
- """
107
- self._init_db()
108
-
109
- # Generate embedding
110
- model = self._get_embedding_model()
111
- embedding = model.embed(text)
112
-
113
- # Generate ID
114
- doc_id = self._generate_id(text, source)
115
-
116
- # Prepare metadata
117
- meta = metadata or {}
118
- meta["source"] = source
119
- meta["text_length"] = len(text)
120
-
121
- # Add to collection
122
- self._collection.add(
123
- ids=[doc_id],
124
- embeddings=[embedding],
125
- documents=[text],
126
- metadatas=[meta]
127
- )
128
-
129
- return doc_id
130
-
131
- def add_texts(
132
- self,
133
- texts: list[str],
134
- source: str,
135
- metadatas: Optional[list[dict]] = None,
136
- show_progress: bool = True
137
- ) -> list[str]:
138
- """Add multiple texts to the knowledge base.
139
-
140
- Args:
141
- texts: List of text content
142
- source: Source file path
143
- metadatas: List of metadata dicts
144
- show_progress: Show progress bar
145
-
146
- Returns:
147
- List of document IDs
148
- """
149
- self._init_db()
150
-
151
- if not texts:
152
- return []
153
-
154
- console.print(f"[bold blue]Indexing {len(texts)} chunks from {Path(source).name}[/]")
155
-
156
- # Generate embeddings in batch
157
- model = self._get_embedding_model()
158
- embeddings = model.embed_batch(texts, show_progress=show_progress)
159
-
160
- # Generate IDs and prepare metadata
161
- ids = []
162
- metas = []
163
- for i, text in enumerate(texts):
164
- doc_id = self._generate_id(text, f"{source}:{i}")
165
- ids.append(doc_id)
166
-
167
- meta = metadatas[i] if metadatas else {}
168
- meta["source"] = source
169
- meta["chunk_index"] = i
170
- meta["text_length"] = len(text)
171
- metas.append(meta)
172
-
173
- # Add to collection
174
- self._collection.add(
175
- ids=ids,
176
- embeddings=embeddings,
177
- documents=texts,
178
- metadatas=metas
179
- )
180
-
181
- console.print(f"[green]✓[/] Added {len(texts)} chunks to knowledge base")
182
-
183
- return ids
184
-
185
- def search(
186
- self,
187
- query: str,
188
- n_results: int = 5,
189
- filter_source: Optional[str] = None
190
- ) -> list[SearchResult]:
-        """Search the knowledge base semantically.
-
-        Args:
-            query: Search query
-            n_results: Number of results to return
-            filter_source: Filter by source file
-
-        Returns:
-            List of SearchResult objects
-        """
-        self._init_db()
-
-        # Generate query embedding
-        model = self._get_embedding_model()
-        query_embedding = model.embed(query)
-
-        # Build filter
-        where_filter = None
-        if filter_source:
-            where_filter = {"source": filter_source}
-
-        # Search
-        results = self._collection.query(
-            query_embeddings=[query_embedding],
-            n_results=n_results,
-            where=where_filter,
-            include=["documents", "metadatas", "distances"]
-        )
-
-        # Convert to SearchResult objects
-        search_results = []
-        if results["documents"] and results["documents"][0]:
-            for i, doc in enumerate(results["documents"][0]):
-                # Convert distance to similarity score (1 - distance for cosine)
-                distance = results["distances"][0][i] if results["distances"] else 0
-                score = 1 - distance  # Higher = more similar
-
-                metadata = results["metadatas"][0][i] if results["metadatas"] else {}
-                source = metadata.pop("source", "unknown")
-
-                search_results.append(SearchResult(
-                    text=doc,
-                    source=source,
-                    score=score,
-                    metadata=metadata
-                ))
-
-        return search_results
-
-    def count(self) -> int:
-        """Get total number of documents in the knowledge base."""
-        self._init_db()
-        return self._collection.count()
-
-    def get_sources(self) -> list[str]:
-        """Get list of all sources in the knowledge base."""
-        self._init_db()
-
-        # Get all metadata
-        results = self._collection.get(include=["metadatas"])
-
-        sources = set()
-        if results["metadatas"]:
-            for meta in results["metadatas"]:
-                if "source" in meta:
-                    sources.add(meta["source"])
-
-        return sorted(sources)
-
-    def delete_source(self, source: str) -> int:
-        """Delete all documents from a specific source.
-
-        Args:
-            source: Source path to delete
-
-        Returns:
-            Number of documents deleted
-        """
-        self._init_db()
-
-        # Get IDs for this source
-        results = self._collection.get(
-            where={"source": source},
-            include=["metadatas"]
-        )
-
-        if not results["ids"]:
-            return 0
-
-        # Delete
-        count = len(results["ids"])
-        self._collection.delete(ids=results["ids"])
-
-        console.print(f"[green]✓[/] Deleted {count} chunks from {source}")
-
-        return count
-
-    def clear(self):
-        """Clear all documents from the knowledge base."""
-        self._init_db()
-
-        # Delete and recreate collection
-        self._client.delete_collection(self.collection_name)
-        self._collection = self._client.create_collection(
-            name=self.collection_name,
-            metadata={"description": "Video Analyzer Knowledge Base"}
-        )
-
-        console.print("[green]✓[/] Knowledge base cleared")
-
-
-# Convenience functions
-_default_kb: Optional[KnowledgeBase] = None
-
-
-def get_knowledge_base() -> KnowledgeBase:
-    """Get the default knowledge base instance."""
-    global _default_kb
-    if _default_kb is None:
-        _default_kb = KnowledgeBase()
-    return _default_kb
-
-
-def search(query: str, n_results: int = 5) -> list[SearchResult]:
-    """Search the knowledge base."""
-    return get_knowledge_base().search(query, n_results)
 
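The removed `search` method above converts ChromaDB cosine distances into similarity scores and repackages the nested query response into flat result objects. A standalone sketch of that conversion, with a hypothetical response dict shaped like `collection.query()` output (no ChromaDB dependency; `to_results` and the sample data are illustrative, not part of the original module):

```python
from dataclasses import dataclass, field

@dataclass
class SearchResult:
    text: str
    source: str
    score: float
    metadata: dict = field(default_factory=dict)

def to_results(query_response: dict) -> list[SearchResult]:
    """Flatten a Chroma-style query response into scored results.

    Cosine distance d lies in [0, 2]; score = 1 - d, so higher = more similar.
    """
    docs = query_response.get("documents", [[]])[0]
    dists = query_response.get("distances", [[]])[0]
    metas = query_response.get("metadatas", [[]])[0]
    results = []
    for doc, dist, meta in zip(docs, dists, metas):
        meta = dict(meta)  # copy so pop() does not mutate the caller's data
        results.append(SearchResult(
            text=doc,
            source=meta.pop("source", "unknown"),
            score=1 - dist,
            metadata=meta,
        ))
    return results

# Hypothetical response in the nested list-of-lists shape Chroma returns
response = {
    "documents": [["chunk one", "chunk two"]],
    "distances": [[0.1, 0.4]],
    "metadatas": [[{"source": "a.txt", "page": 1}, {"source": "b.txt"}]],
}
hits = to_results(response)
```

Popping `source` out of the metadata dict before storing it, as the original did, keeps the remaining metadata free of the field that is already promoted to a top-level attribute.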
src/main.py DELETED
@@ -1,6 +0,0 @@
-"""Main entry point for Video Analyzer."""
-
-from src.ui.cli import app
-
-if __name__ == "__main__":
-    app()
 
src/mentor/__init__.py DELETED
@@ -1 +0,0 @@
-"""Virtual mentor with RAG (Phase 4+)."""
 
 
src/processors/__init__.py DELETED
@@ -1,18 +0,0 @@
-"""Content processors for audio, documents, and transcription."""
-
-from .audio import extract_audio
-from .transcriber import transcribe_audio, WhisperTranscriber
-from .documents import extract_document, process_documents, DocumentContent
-from .ocr import extract_text_from_image, process_images, OCRResult
-
-__all__ = [
-    "extract_audio",
-    "transcribe_audio",
-    "WhisperTranscriber",
-    "extract_document",
-    "process_documents",
-    "DocumentContent",
-    "extract_text_from_image",
-    "process_images",
-    "OCRResult"
-]
 
src/processors/__pycache__/__init__.cpython-312.pyc DELETED
Binary file (618 Bytes)
 
src/processors/__pycache__/audio.cpython-312.pyc DELETED
Binary file (3.13 kB)
 
src/processors/__pycache__/transcriber.cpython-312.pyc DELETED
Binary file (10.6 kB)
 
src/processors/audio.py DELETED
@@ -1,83 +0,0 @@
-"""Audio extraction from video files using ffmpeg."""
-
-import subprocess
-from pathlib import Path
-from typing import Optional
-
-from rich.console import Console
-
-from src.config import settings
-
-console = Console()
-
-
-def extract_audio(
-    video_path: Path,
-    output_path: Optional[Path] = None,
-    audio_format: str = "mp3",
-    sample_rate: int = 16000,  # Whisper prefers 16kHz
-) -> Path:
-    """Extract audio from a video file using ffmpeg.
-
-    Args:
-        video_path: Path to the video file
-        output_path: Output path for audio file (default: data/audio/<video_name>.mp3)
-        audio_format: Output audio format (mp3, wav, m4a)
-        sample_rate: Audio sample rate in Hz (16000 recommended for Whisper)
-
-    Returns:
-        Path to the extracted audio file
-    """
-    video_path = Path(video_path)
-
-    if not video_path.exists():
-        raise FileNotFoundError(f"Video not found: {video_path}")
-
-    # Default output path
-    if output_path is None:
-        output_path = settings.audio_dir / f"{video_path.stem}.{audio_format}"
-    else:
-        output_path = Path(output_path)
-
-    output_path.parent.mkdir(parents=True, exist_ok=True)
-
-    console.print(f"[bold green]Extracting audio from:[/] {video_path.name}")
-
-    # Build ffmpeg command
-    cmd = [
-        "ffmpeg",
-        "-i", str(video_path),
-        "-vn",  # No video
-        "-acodec", "libmp3lame" if audio_format == "mp3" else "pcm_s16le",
-        "-ar", str(sample_rate),  # Sample rate
-        "-ac", "1",  # Mono (better for speech recognition)
-        "-y",  # Overwrite output
-        str(output_path)
-    ]
-
-    result = subprocess.run(cmd, capture_output=True, text=True)
-
-    if result.returncode != 0:
-        raise Exception(f"Audio extraction failed: {result.stderr}")
-
-    console.print(f"[green]✓[/] Audio extracted to: {output_path}")
-
-    return output_path
-
-
-def get_audio_duration(audio_path: Path) -> float:
-    """Get the duration of an audio file in seconds."""
-    cmd = [
-        "ffprobe",
-        "-v", "error",
-        "-show_entries", "format=duration",
-        "-of", "default=noprint_wrappers=1:nokey=1",
-        str(audio_path)
-    ]
-
-    result = subprocess.run(cmd, capture_output=True, text=True)
-
-    if result.returncode != 0:
-        raise Exception(f"Failed to get audio duration: {result.stderr}")
-
-    return float(result.stdout.strip())
 
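The removed `extract_audio` builds its ffmpeg invocation as an argument list: `-vn` drops the video stream, the codec is chosen from the target format, and audio is downmixed to mono at a fixed sample rate. That command construction can be isolated as a pure function and checked without running ffmpeg (`build_ffmpeg_cmd` and the file names here are illustrative):

```python
from pathlib import Path

def build_ffmpeg_cmd(video: Path, out: Path, audio_format: str = "mp3",
                     sample_rate: int = 16000) -> list[str]:
    # Mirrors the removed extract_audio(): no video stream, mono, fixed rate
    codec = "libmp3lame" if audio_format == "mp3" else "pcm_s16le"
    return [
        "ffmpeg",
        "-i", str(video),
        "-vn",                    # drop the video stream
        "-acodec", codec,
        "-ar", str(sample_rate),  # output sample rate
        "-ac", "1",               # mono is enough for speech recognition
        "-y",                     # overwrite existing output
        str(out),
    ]

cmd = build_ffmpeg_cmd(Path("talk.mp4"), Path("talk.wav"), audio_format="wav")
```

Passing the command as a list (rather than a shell string) avoids quoting problems with paths that contain spaces, which is why the original used `subprocess.run(cmd, ...)` without `shell=True`.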
src/processors/documents.py DELETED
@@ -1,278 +0,0 @@
-"""Document processing for PDF, Word, PowerPoint, and text files."""
-
-from dataclasses import dataclass
-from pathlib import Path
-from typing import Optional
-
-from rich.console import Console
-
-from src.config import settings
-
-console = Console()
-
-
-@dataclass
-class DocumentContent:
-    """Extracted content from a document."""
-
-    text: str
-    title: str
-    pages: int
-    source_path: Path
-    doc_type: str  # pdf, docx, pptx, txt, md
-    metadata: dict
-
-    def save(self, output_path: Optional[Path] = None) -> Path:
-        """Save extracted text to file."""
-        if output_path is None:
-            output_path = settings.transcripts_dir / f"{self.source_path.stem}.txt"
-
-        output_path.parent.mkdir(parents=True, exist_ok=True)
-        output_path.write_text(self.text)
-        return output_path
-
-
-def extract_pdf(path: Path) -> DocumentContent:
-    """Extract text from a PDF file.
-
-    Args:
-        path: Path to PDF file
-
-    Returns:
-        DocumentContent with extracted text
-    """
-    try:
-        import fitz  # PyMuPDF
-    except ImportError:
-        raise ImportError("PyMuPDF not installed. Run: pip install PyMuPDF")
-
-    path = Path(path)
-    console.print(f"[bold green]Extracting PDF:[/] {path.name}")
-
-    doc = fitz.open(path)
-
-    text_parts = []
-    for page_num, page in enumerate(doc, 1):
-        text = page.get_text()
-        if text.strip():
-            text_parts.append(f"--- Page {page_num} ---\n{text}")
-
-    full_text = "\n\n".join(text_parts)
-
-    # Extract metadata
-    metadata = {
-        "author": doc.metadata.get("author", ""),
-        "title": doc.metadata.get("title", ""),
-        "subject": doc.metadata.get("subject", ""),
-        "creator": doc.metadata.get("creator", ""),
-    }
-
-    title = metadata.get("title") or path.stem
-    page_count = len(doc)  # Capture before close: len(doc) is invalid after doc.close()
-    doc.close()
-
-    console.print(f"[green]✓[/] Extracted {page_count} pages, {len(full_text)} characters")
-
-    return DocumentContent(
-        text=full_text,
-        title=title,
-        pages=page_count,
-        source_path=path,
-        doc_type="pdf",
-        metadata=metadata
-    )
-
-
-def extract_docx(path: Path) -> DocumentContent:
-    """Extract text from a Word document.
-
-    Args:
-        path: Path to .docx file
-
-    Returns:
-        DocumentContent with extracted text
-    """
-    try:
-        from docx import Document
-    except ImportError:
-        raise ImportError("python-docx not installed. Run: pip install python-docx")
-
-    path = Path(path)
-    console.print(f"[bold green]Extracting Word doc:[/] {path.name}")
-
-    doc = Document(path)
-
-    text_parts = []
-    for para in doc.paragraphs:
-        if para.text.strip():
-            text_parts.append(para.text)
-
-    # Also extract from tables
-    for table in doc.tables:
-        for row in table.rows:
-            row_text = " | ".join(cell.text.strip() for cell in row.cells if cell.text.strip())
-            if row_text:
-                text_parts.append(row_text)
-
-    full_text = "\n\n".join(text_parts)
-
-    # Extract metadata
-    metadata = {
-        "author": doc.core_properties.author or "",
-        "title": doc.core_properties.title or "",
-        "subject": doc.core_properties.subject or "",
-    }
-
-    title = metadata.get("title") or path.stem
-
-    console.print(f"[green]✓[/] Extracted {len(text_parts)} paragraphs, {len(full_text)} characters")
-
-    return DocumentContent(
-        text=full_text,
-        title=title,
-        pages=1,  # Word docs don't have fixed pages
-        source_path=path,
-        doc_type="docx",
-        metadata=metadata
-    )
-
-
-def extract_pptx(path: Path) -> DocumentContent:
-    """Extract text from a PowerPoint presentation.
-
-    Args:
-        path: Path to .pptx file
-
-    Returns:
-        DocumentContent with extracted text
-    """
-    try:
-        from pptx import Presentation
-    except ImportError:
-        raise ImportError("python-pptx not installed. Run: pip install python-pptx")
-
-    path = Path(path)
-    console.print(f"[bold green]Extracting PowerPoint:[/] {path.name}")
-
-    prs = Presentation(path)
-
-    text_parts = []
-    for slide_num, slide in enumerate(prs.slides, 1):
-        slide_text = [f"--- Slide {slide_num} ---"]
-
-        for shape in slide.shapes:
-            if hasattr(shape, "text") and shape.text.strip():
-                slide_text.append(shape.text)
-
-        if len(slide_text) > 1:  # Has content beyond header
-            text_parts.append("\n".join(slide_text))
-
-    full_text = "\n\n".join(text_parts)
-
-    console.print(f"[green]✓[/] Extracted {len(prs.slides)} slides, {len(full_text)} characters")
-
-    return DocumentContent(
-        text=full_text,
-        title=path.stem,
-        pages=len(prs.slides),
-        source_path=path,
-        doc_type="pptx",
-        metadata={}
-    )
-
-
-def extract_text_file(path: Path) -> DocumentContent:
-    """Extract text from plain text or markdown files.
-
-    Args:
-        path: Path to .txt or .md file
-
-    Returns:
-        DocumentContent with text
-    """
-    path = Path(path)
-    console.print(f"[bold green]Reading text file:[/] {path.name}")
-
-    text = path.read_text(encoding="utf-8", errors="ignore")
-
-    console.print(f"[green]✓[/] Read {len(text)} characters")
-
-    return DocumentContent(
-        text=text,
-        title=path.stem,
-        pages=1,
-        source_path=path,
-        doc_type=path.suffix.lstrip("."),
-        metadata={}
-    )
-
-
-def extract_document(path: Path) -> DocumentContent:
-    """Extract text from any supported document type.
-
-    Args:
-        path: Path to document
-
-    Returns:
-        DocumentContent with extracted text
-    """
-    path = Path(path)
-    ext = path.suffix.lower()
-
-    if ext == ".pdf":
-        return extract_pdf(path)
-    elif ext in {".docx", ".doc"}:
-        return extract_docx(path)
-    elif ext in {".pptx", ".ppt"}:
-        return extract_pptx(path)
-    elif ext in {".txt", ".md", ".rtf"}:
-        return extract_text_file(path)
-    else:
-        raise ValueError(f"Unsupported document type: {ext}")
-
-
-def process_documents(
-    path: Path,
-    output_dir: Optional[Path] = None,
-    recursive: bool = True
-) -> list[DocumentContent]:
-    """Process all documents in a file or directory.
-
-    Args:
-        path: File or directory path
-        output_dir: Output directory for extracted text
-        recursive: If True, scan subdirectories
-
-    Returns:
-        List of DocumentContent objects
-    """
-    from src.downloaders.files import scan_files
-
-    path = Path(path)
-    output_dir = output_dir or settings.transcripts_dir
-
-    # Get document files
-    files = scan_files(path, recursive=recursive, file_types=["document"])
-
-    if not files:
-        console.print("[yellow]No documents found[/]")
-        return []
-
-    console.print(f"[bold blue]Found {len(files)} documents to process[/]")
-
-    results = []
-    for file_info in files:
-        try:
-            content = extract_document(file_info.path)
-
-            # Save extracted text
-            output_path = output_dir / f"{file_info.path.stem}.txt"
-            content.save(output_path)
-            console.print(f"[green]✓[/] Saved: {output_path.name}")
-
-            results.append(content)
-
-        except Exception as e:
-            console.print(f"[red]Error processing {file_info.name}:[/] {e}")
-
-    return results
 
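The removed `extract_document` dispatches on the lowercased file extension via an if/elif chain. The same routing can be expressed as a dispatch table, which keeps the mapping declarative and easy to extend; the stub extractors below are hypothetical stand-ins for the per-format functions in the deleted module:

```python
from pathlib import Path
from typing import Callable

# Hypothetical stand-ins for the real per-format extractors
def extract_pdf(p: Path) -> str: return f"pdf:{p.stem}"
def extract_docx(p: Path) -> str: return f"docx:{p.stem}"
def extract_text_file(p: Path) -> str: return f"text:{p.stem}"

_EXTRACTORS: dict[str, Callable[[Path], str]] = {
    ".pdf": extract_pdf,
    ".docx": extract_docx, ".doc": extract_docx,
    ".txt": extract_text_file, ".md": extract_text_file,
}

def extract_document(path: Path) -> str:
    # Lowercase the suffix so "notes.MD" and "notes.md" route identically
    try:
        return _EXTRACTORS[path.suffix.lower()](path)
    except KeyError:
        raise ValueError(f"Unsupported document type: {path.suffix}") from None

out = extract_document(Path("notes.MD"))
```

A table also makes the supported-extension set queryable (`_EXTRACTORS.keys()`), which the if/elif form does not.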
src/processors/ocr.py DELETED
@@ -1,133 +0,0 @@
-"""OCR (Optical Character Recognition) for images using Tesseract."""
-
-from dataclasses import dataclass
-from pathlib import Path
-from typing import Optional
-
-from rich.console import Console
-
-from src.config import settings
-
-console = Console()
-
-
-@dataclass
-class OCRResult:
-    """Result of OCR processing."""
-
-    text: str
-    source_path: Path
-    confidence: float  # 0-100
-
-    def save(self, output_path: Optional[Path] = None) -> Path:
-        """Save extracted text to file."""
-        if output_path is None:
-            output_path = settings.transcripts_dir / f"{self.source_path.stem}_ocr.txt"
-
-        output_path.parent.mkdir(parents=True, exist_ok=True)
-        output_path.write_text(self.text)
-        return output_path
-
-
-def extract_text_from_image(path: Path, language: str = "eng") -> OCRResult:
-    """Extract text from an image using Tesseract OCR.
-
-    Args:
-        path: Path to image file
-        language: Tesseract language code (eng, spa, fra, deu, etc.)
-
-    Returns:
-        OCRResult with extracted text
-    """
-    try:
-        import pytesseract
-        from PIL import Image
-    except ImportError:
-        raise ImportError(
-            "OCR dependencies not installed. Run:\n"
-            "  pip install pytesseract Pillow\n"
-            "  sudo apt install tesseract-ocr  # Linux\n"
-            "  brew install tesseract  # macOS"
-        )
-
-    path = Path(path)
-    console.print(f"[bold green]OCR processing:[/] {path.name}")
-
-    # Open and process image
-    image = Image.open(path)
-
-    # Get OCR data with confidence scores
-    data = pytesseract.image_to_data(image, lang=language, output_type=pytesseract.Output.DICT)
-
-    # Extract text and calculate average confidence
-    words = []
-    confidences = []
-
-    for i, word in enumerate(data["text"]):
-        if word.strip():
-            words.append(word)
-            conf = data["conf"][i]
-            if conf > 0:  # -1 means no confidence data
-                confidences.append(conf)
-
-    text = " ".join(words)
-    avg_confidence = sum(confidences) / len(confidences) if confidences else 0
-
-    console.print(f"[green]✓[/] Extracted {len(words)} words, confidence: {avg_confidence:.1f}%")
-
-    return OCRResult(
-        text=text,
-        source_path=path,
-        confidence=avg_confidence
-    )
-
-
-def process_images(
-    path: Path,
-    output_dir: Optional[Path] = None,
-    language: str = "eng",
-    recursive: bool = True
-) -> list[OCRResult]:
-    """Process all images in a file or directory with OCR.
-
-    Args:
-        path: File or directory path
-        output_dir: Output directory for extracted text
-        language: Tesseract language code
-        recursive: If True, scan subdirectories
-
-    Returns:
-        List of OCRResult objects
-    """
-    from src.downloaders.files import scan_files
-
-    path = Path(path)
-    output_dir = output_dir or settings.transcripts_dir
-
-    # Get image files
-    files = scan_files(path, recursive=recursive, file_types=["image"])
-
-    if not files:
-        console.print("[yellow]No images found[/]")
-        return []
-
-    console.print(f"[bold blue]Found {len(files)} images to process[/]")
-
-    results = []
-    for file_info in files:
-        try:
-            result = extract_text_from_image(file_info.path, language=language)
-
-            if result.text.strip():
-                # Save extracted text
-                output_path = output_dir / f"{file_info.path.stem}_ocr.txt"
-                result.save(output_path)
-                console.print(f"[green]✓[/] Saved: {output_path.name}")
-                results.append(result)
-            else:
-                console.print(f"[yellow]No text found in {file_info.name}[/]")
-
-        except Exception as e:
-            console.print(f"[red]Error processing {file_info.name}:[/] {e}")
-
-    return results
 
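The removed `extract_text_from_image` averages Tesseract's per-word confidences while skipping the `-1` sentinel that `image_to_data` emits for non-word boxes. That small aggregation is easy to get wrong (dividing by the full list length, or crashing on an all-sentinel input), so here it is isolated with synthetic confidence values (`mean_confidence` is an illustrative helper, not part of the original module):

```python
def mean_confidence(confs: list[float]) -> float:
    """Average Tesseract per-word confidences, ignoring the -1 sentinel
    that image_to_data emits for layout boxes that contain no word."""
    valid = [c for c in confs if c > 0]
    return sum(valid) / len(valid) if valid else 0.0

# Synthetic values: three real word confidences plus two sentinels
avg = mean_confidence([95, -1, 85, -1, 90])
```

Returning `0.0` for an empty (or all-sentinel) input mirrors the original's `if confidences else 0` fallback and avoids a ZeroDivisionError on blank images.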
src/processors/transcriber.py DELETED
@@ -1,243 +0,0 @@
-"""Audio transcription using Whisper (local, free)."""
-
-import json
-from dataclasses import dataclass
-from pathlib import Path
-from typing import Optional
-
-from rich.console import Console
-from rich.progress import Progress, SpinnerColumn, TextColumn
-
-from src.config import settings
-
-console = Console()
-
-
-@dataclass
-class TranscriptSegment:
-    """A segment of transcribed text with timing."""
-
-    start: float  # Start time in seconds
-    end: float  # End time in seconds
-    text: str  # Transcribed text
-
-    @property
-    def start_formatted(self) -> str:
-        """Format start time as HH:MM:SS."""
-        return self._format_time(self.start)
-
-    @property
-    def end_formatted(self) -> str:
-        """Format end time as HH:MM:SS."""
-        return self._format_time(self.end)
-
-    def _format_time(self, seconds: float) -> str:
-        hours, remainder = divmod(int(seconds), 3600)
-        minutes, secs = divmod(remainder, 60)
-        return f"{hours:02d}:{minutes:02d}:{secs:02d}"
-
-
-@dataclass
-class Transcript:
-    """Complete transcript with segments and metadata."""
-
-    text: str  # Full transcript text
-    segments: list[TranscriptSegment]  # Timed segments
-    language: str  # Detected language
-    duration: float  # Audio duration in seconds
-
-    def to_srt(self) -> str:
-        """Convert to SRT subtitle format."""
-        lines = []
-        for i, seg in enumerate(self.segments, 1):
-            start = self._format_srt_time(seg.start)
-            end = self._format_srt_time(seg.end)
-            lines.append(f"{i}")
-            lines.append(f"{start} --> {end}")
-            lines.append(seg.text.strip())
-            lines.append("")
-        return "\n".join(lines)
-
-    def _format_srt_time(self, seconds: float) -> str:
-        hours, remainder = divmod(int(seconds), 3600)
-        minutes, secs = divmod(remainder, 60)
-        ms = int((seconds % 1) * 1000)
-        return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"
-
-    def save(self, output_path: Path, format: str = "txt") -> Path:
-        """Save transcript to file.
-
-        Args:
-            output_path: Output file path
-            format: 'txt', 'srt', or 'json'
-        """
-        output_path = Path(output_path)
-        output_path.parent.mkdir(parents=True, exist_ok=True)
-
-        if format == "txt":
-            output_path.write_text(self.text)
-        elif format == "srt":
-            output_path.write_text(self.to_srt())
-        elif format == "json":
-            data = {
-                "text": self.text,
-                "language": self.language,
-                "duration": self.duration,
-                "segments": [
-                    {"start": s.start, "end": s.end, "text": s.text}
-                    for s in self.segments
-                ]
-            }
-            output_path.write_text(json.dumps(data, indent=2))
-
-        return output_path
-
-
-class WhisperTranscriber:
-    """Transcribe audio using faster-whisper (local, free)."""
-
-    def __init__(
-        self,
-        model_size: str = "base",
-        device: str = "auto",
-        compute_type: str = "auto"
-    ):
-        """Initialize the transcriber.
-
-        Args:
-            model_size: Whisper model size - tiny, base, small, medium, large-v3
-            device: Device to use - 'auto', 'cpu', 'cuda'
-            compute_type: Computation type - 'auto', 'int8', 'float16', 'float32'
-        """
-        self.model_size = model_size
-        self.device = device
-        self.compute_type = compute_type
-        self._model = None
-
-    def _load_model(self):
-        """Lazy load the Whisper model."""
-        if self._model is None:
-            console.print(f"[bold green]Loading Whisper model:[/] {self.model_size}")
-
-            from faster_whisper import WhisperModel
-
-            # Determine device and compute type
-            device = self.device
-            compute_type = self.compute_type
-
-            if device == "auto":
-                try:
-                    import torch
-                    device = "cuda" if torch.cuda.is_available() else "cpu"
-                except ImportError:
-                    device = "cpu"
-
-            if compute_type == "auto":
-                compute_type = "float16" if device == "cuda" else "int8"
-
-            self._model = WhisperModel(
-                self.model_size,
-                device=device,
-                compute_type=compute_type
-            )
-
-            console.print(f"[green]✓[/] Model loaded on {device}")
-
-    def transcribe(
-        self,
-        audio_path: Path,
-        language: Optional[str] = None,
-    ) -> Transcript:
-        """Transcribe an audio file.
-
-        Args:
-            audio_path: Path to audio file
-            language: Language code (e.g., 'en') or None for auto-detect
-
-        Returns:
-            Transcript object with full text and segments
-        """
-        audio_path = Path(audio_path)
-
-        if not audio_path.exists():
-            raise FileNotFoundError(f"Audio file not found: {audio_path}")
-
-        self._load_model()
-
-        console.print(f"[bold green]Transcribing:[/] {audio_path.name}")
-
-        # Transcribe
-        segments_generator, info = self._model.transcribe(
-            str(audio_path),
-            language=language,
-            beam_size=5,
-            word_timestamps=True,
-            vad_filter=True,  # Filter out non-speech
-        )
-
-        # Collect segments
-        segments = []
-        full_text_parts = []
-
-        with Progress(
-            SpinnerColumn(),
-            TextColumn("[progress.description]{task.description}"),
-            console=console
-        ) as progress:
-            task = progress.add_task("Processing segments...", total=None)
-
-            for segment in segments_generator:
-                segments.append(TranscriptSegment(
-                    start=segment.start,
-                    end=segment.end,
-                    text=segment.text
-                ))
-                full_text_parts.append(segment.text)
-                progress.update(task, description=f"Processed {len(segments)} segments...")
-
-        transcript = Transcript(
-            text=" ".join(full_text_parts).strip(),
-            segments=segments,
-            language=info.language,
-            duration=info.duration
-        )
-
-        console.print(f"[green]✓[/] Transcription complete")
-        console.print(f"[bold blue]Language:[/] {info.language}")
-        console.print(f"[bold blue]Duration:[/] {info.duration:.1f}s")
-        console.print(f"[bold blue]Segments:[/] {len(segments)}")
-
-        return transcript
-
-
-def transcribe_audio(
-    audio_path: Path,
-    model_size: str = "base",
-    output_dir: Optional[Path] = None,
-    save_formats: list[str] = ["txt", "json"]
-) -> Transcript:
-    """Convenience function to transcribe audio and save results.
-
-    Args:
-        audio_path: Path to audio file
-        model_size: Whisper model size
-        output_dir: Output directory (default: data/transcripts)
-        save_formats: List of formats to save ('txt', 'srt', 'json')
-
-    Returns:
-        Transcript object
-    """
-    audio_path = Path(audio_path)
-    output_dir = output_dir or settings.transcripts_dir
-
-    # Transcribe
-    transcriber = WhisperTranscriber(model_size=model_size)
-    transcript = transcriber.transcribe(audio_path)
-
-    # Save in requested formats
-    for fmt in save_formats:
-        output_path = output_dir / f"{audio_path.stem}.{fmt}"
-        transcript.save(output_path, format=fmt)
-        console.print(f"[green]✓[/] Saved: {output_path}")
-
-    return transcript
 
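The removed `Transcript._format_srt_time` converts a float second count into the SRT timestamp form `HH:MM:SS,mmm` via two `divmod` steps plus a millisecond remainder. The same arithmetic as a free function, checked on a couple of known values:

```python
def format_srt_time(seconds: float) -> str:
    # Same arithmetic as the removed Transcript._format_srt_time:
    # whole seconds split into H/M/S, fractional part becomes milliseconds
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    ms = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

stamp = format_srt_time(3661.5)  # 1 h, 1 min, 1 s, 500 ms
```

SRT uses a comma before the milliseconds (unlike WebVTT's dot), which is why the format string ends in `,{ms:03d}`.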
src/ui/__init__.py DELETED
@@ -1 +0,0 @@
-"""User interfaces for Video Analyzer."""
 
 
src/ui/__pycache__/__init__.cpython-312.pyc DELETED
Binary file (176 Bytes)
 
src/ui/__pycache__/cli.cpython-312.pyc DELETED
Binary file (32.7 kB)