Runtime error
Claude committed on
refactor: Replace video analyzer with blank Gradio 6 project
- Remove all existing source code, tests, and documentation
- Add minimal pyproject.toml with Gradio 6 dependency
- Add blank app.py with simple Gradio interface
- Add README.md with HuggingFace Spaces YAML frontmatter
- Generate uv.lock for dependency management
This view is limited to 50 files because it contains too many changes.
- .cursorrules +0 -71
- .env.example +0 -12
- .gitignore +22 -21
- DEPLOY_TO_HF_SPACES.md +0 -134
- PLAN.md +0 -437
- README.md +14 -192
- VOICE_COMMANDS_PLAN.md +0 -323
- app.py +11 -0
- data/audio/test_silence.wav +0 -0
- data/summaries/sample_real_estate_summary.md +0 -1
- data/transcripts/sample_real_estate.txt +0 -98
- hf_space/README.md +0 -39
- hf_space/app.py +0 -413
- hf_space/requirements.txt +0 -5
- pyproject.toml +9 -0
- pytest.ini +0 -9
- requirements.txt +0 -44
- src/__init__.py +0 -3
- src/__pycache__/__init__.cpython-312.pyc +0 -0
- src/__pycache__/config.cpython-312.pyc +0 -0
- src/__pycache__/main.cpython-312.pyc +0 -0
- src/analyzers/__init__.py +0 -26
- src/analyzers/__pycache__/__init__.cpython-312.pyc +0 -0
- src/analyzers/__pycache__/chunker.cpython-312.pyc +0 -0
- src/analyzers/__pycache__/huggingface.cpython-312.pyc +0 -0
- src/analyzers/__pycache__/summarizer.cpython-312.pyc +0 -0
- src/analyzers/chunker.py +0 -118
- src/analyzers/huggingface.py +0 -407
- src/analyzers/summarizer.py +0 -410
- src/config.py +0 -50
- src/downloaders/__init__.py +0 -6
- src/downloaders/files.py +0 -177
- src/downloaders/youtube.py +0 -264
- src/knowledge/__init__.py +0 -19
- src/knowledge/embeddings.py +0 -107
- src/knowledge/indexer.py +0 -151
- src/knowledge/vectorstore.py +0 -316
- src/main.py +0 -6
- src/mentor/__init__.py +0 -1
- src/processors/__init__.py +0 -18
- src/processors/__pycache__/__init__.cpython-312.pyc +0 -0
- src/processors/__pycache__/audio.cpython-312.pyc +0 -0
- src/processors/__pycache__/transcriber.cpython-312.pyc +0 -0
- src/processors/audio.py +0 -83
- src/processors/documents.py +0 -278
- src/processors/ocr.py +0 -133
- src/processors/transcriber.py +0 -243
- src/ui/__init__.py +0 -1
- src/ui/__pycache__/__init__.cpython-312.pyc +0 -0
- src/ui/__pycache__/cli.cpython-312.pyc +0 -0
.cursorrules
DELETED
@@ -1,71 +0,0 @@
# Cursor Engineering Ruleset

## 1. Context First
Always request full context and constraints before proposing any decision.
- Understand the problem completely before suggesting solutions
- Ask clarifying questions when requirements are ambiguous
- Consider existing codebase patterns and conventions

## 2. Tech Stack Principles
Recommend tech stacks using:
- Idiomatic, native patterns for the language/framework
- Simple and maintainable components
- Minimal unnecessary abstraction
- Prefer standard library over external dependencies when reasonable

## 3. Scaffold Before Implementation
Scaffold the project structure BEFORE implementation:
- Clear domain boundaries
- Clean folder organization
- Conventional naming (language-specific conventions)
- Consistent imports/exports
- Document the structure in README

## 4. Test-Driven Development (TDD)
Use TDD approach:
- Tests define behavior before implementation
- Define what failure looks like explicitly
- No implementation until tests exist
- Edge cases explicitly covered
- Tests should be readable as documentation

## 5. Idempotent Functions
All core functions must be idempotent:
- Deterministic behavior (same input → same output)
- Safe to re-run multiple times
- No hidden state or side effects
- Pure functions where possible

## 6. Simplicity First
Optimize for simplicity:
- Low cognitive load
- Readable and clean code
- Avoid cleverness and "magic"
- Avoid premature optimization
- YAGNI (You Aren't Gonna Need It)
- DRY (Don't Repeat Yourself) but not at the cost of clarity

## 7. Idiomatic Code
Use idiomatic language patterns at all times:
- Follow language-specific style guides
- Use conventional patterns for the ecosystem
- Leverage language features appropriately
- Write code that looks familiar to other developers

---

## Project-Specific Rules

### Video Analyzer Project
- Use 100% free and open-source tools
- Prefer local processing over cloud APIs
- Keep user data private (process locally)
- Support both CLI and future web UI
- Modular architecture for easy extension

### Python Conventions
- Type hints on all function signatures
- Docstrings for public functions
- Use pathlib for file paths
- Rich for CLI output
- Pydantic for configuration/validation
.env.example
DELETED
@@ -1,12 +0,0 @@
# Video Analyzer - Environment Variables
# Copy this file to .env and fill in your values

# Hugging Face API Key (optional - for faster API-based summarization)
# Get your free key at: https://huggingface.co/settings/tokens
HUGGINGFACE_API_KEY=your_token_here

# Whisper Model Size (tiny, base, small, medium, large-v3)
VIDEO_ANALYZER_WHISPER_MODEL=base

# Default AI Backend (ollama, huggingface, huggingface-api)
VIDEO_ANALYZER_AI_BACKEND=huggingface
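The variables in the deleted `.env.example` were consumed by the application's configuration (the project rules call for Pydantic; `src/config.py` itself is not shown in this view). A minimal stdlib-only sketch of reading those keys with the documented defaults, using a hypothetical `load_settings` helper:

```python
import os

def load_settings(env=os.environ):
    """Read the Video Analyzer settings, falling back to the
    documented defaults when a variable is unset."""
    return {
        # optional - only needed for API-based summarization
        "hf_api_key": env.get("HUGGINGFACE_API_KEY"),
        # tiny, base, small, medium, large-v3
        "whisper_model": env.get("VIDEO_ANALYZER_WHISPER_MODEL", "base"),
        # ollama, huggingface, huggingface-api
        "ai_backend": env.get("VIDEO_ANALYZER_AI_BACKEND", "huggingface"),
    }

settings = load_settings(env={})  # no variables set -> all defaults
```

Passing `env` explicitly keeps the function deterministic and easy to test, in line with the idempotency rule above.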
.gitignore
CHANGED
@@ -1,38 +1,39 @@
-# Environment variables (contains secrets!)
-.env
-.env.local
-
 # Python
 __pycache__/
 *.py[cod]
 *$py.class
 *.so
 .Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# Virtual environments
 .venv/
 venv/
 ENV/
 
-# Data directories (large files)
-data/downloads/
-data/audio/
-data/chromadb/
-
-# Keep transcripts and summaries (text files are small)
-# data/transcripts/
-# data/summaries/
-
-# Models cache
-models/
-
 # IDE
 .idea/
 .vscode/
 *.swp
 *.swo
 
-#
-.
-
-#
+# Environment
+.env
+.env.local
+
+# uv
+.python-version
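The `*.py[cod]` entry kept in both versions uses a glob character class to cover the compiled-Python suffixes `.pyc`, `.pyo`, and `.pyd` with one pattern. Full gitignore semantics (directory rules, negation) go beyond plain globbing, but for a simple file pattern like this Python's `fnmatch` uses the same class syntax, so the behavior can be checked directly:

```python
from fnmatch import fnmatch

# one pattern covers the three compiled-Python extensions
pattern = "*.py[cod]"
candidates = ("mod.pyc", "mod.pyo", "mod.pyd", "mod.py")
matches = [name for name in candidates if fnmatch(name, pattern)]
print(matches)  # ['mod.pyc', 'mod.pyo', 'mod.pyd'] - plain .py files are kept
```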
DEPLOY_TO_HF_SPACES.md
DELETED
@@ -1,134 +0,0 @@
# Deploy to HuggingFace Spaces

This guide will help you deploy your Real Estate Mentor to HuggingFace Spaces for free.

## What You'll Get

- 🌐 **Public URL** - Access from anywhere
- 💾 **Persistent Storage** - Your data is saved
- 🆓 **100% Free** - No cost on free tier
- 🔒 **Private Option** - Can make it private

---

## Step 1: Create a New Space

1. Go to: https://huggingface.co/new-space

2. Fill in:
   ```
   Space name: real-estate-mentor
   License: MIT
   SDK: Gradio
   Hardware: CPU Basic (Free)
   Visibility: Public (or Private)
   ```

3. Click **"Create Space"**

---

## Step 2: Upload Files

### Option A: Upload via Web Interface

1. In your new Space, click the **"Files"** tab
2. Click **"+ Add file"** → **"Upload files"**
3. Upload these files from the `hf_space/` folder:
   - `app.py`
   - `requirements.txt`
   - `README.md`

### Option B: Use Git (Recommended)

```bash
# Clone your space
git clone https://huggingface.co/spaces/YOUR_USERNAME/real-estate-mentor
cd real-estate-mentor

# Copy files from hf_space/
cp /path/to/video_analyzer/hf_space/* .

# Push to HuggingFace
git add .
git commit -m "Initial deployment"
git push
```

---

## Step 3: Wait for Build

1. Go to your Space URL: `https://huggingface.co/spaces/YOUR_USERNAME/real-estate-mentor`
2. Watch the **"Building"** status
3. First build takes ~3-5 minutes (downloading models)
4. When ready, you'll see **"Running"** ✅

---

## Step 4: Enable Persistent Storage

**Important:** To keep your data between restarts:

1. Go to Space **Settings**
2. Find **"Persistent Storage"**
3. Enable it (free tier: up to 50GB)

This ensures your indexed content survives Space restarts.

---

## Step 5: Start Using It!

1. **Upload Tab** - Add your course transcripts
2. **Search Tab** - Find content semantically
3. **Ask Tab** - Chat with your AI mentor
4. **Status Tab** - See what's indexed

---

## Troubleshooting

### Space is "Sleeping"

Free Spaces sleep after ~15 minutes of inactivity. Just visit the URL and it will wake up (takes ~30 seconds).

### Build Failed

Check the **Logs** tab for errors. Common issues:
- Missing dependencies → Check `requirements.txt`
- Syntax errors → Check `app.py`

### Data Disappeared

Make sure **Persistent Storage** is enabled in Settings.

---

## Upgrading (Optional)

For faster performance, you can upgrade hardware:

| Tier | Cost | Benefits |
|------|------|----------|
| CPU Basic | Free | Works fine, sleeps after 15 min |
| CPU Upgrade | $0.03/hr | Faster, no sleep |
| GPU | $0.60/hr | Much faster embeddings |

---

## Files Reference

```
hf_space/
├── app.py            # Main Gradio application
├── requirements.txt  # Python dependencies
└── README.md         # Space description (shows on page)
```

---

## Need Help?

- HuggingFace Docs: https://huggingface.co/docs/hub/spaces
- Gradio Docs: https://gradio.app/docs/
PLAN.md
DELETED
@@ -1,437 +0,0 @@
# Video Analyzer - Project Plan

## Overview
A comprehensive tool to download videos from multiple sources, transcribe them to text, summarize content, and build a searchable knowledge base. The end goal is to create a **Virtual Real Estate Mentor** from course materials.

**🆓 100% Free & Open Source - No API costs!**

---

## Tech Stack (All Free & Open Source)

| Component | Technology | License | Notes |
|-----------|------------|---------|-------|
| **Language** | Python 3.11+ | PSF | Main language |
| **Video Download** | yt-dlp | Unlicense | Supports 1000+ sites |
| **Audio Processing** | ffmpeg | LGPL/GPL | Industry standard |
| **Transcription** | Whisper.cpp / faster-whisper | MIT | Local, fast, accurate |
| **Document Parsing** | PyMuPDF, python-docx | AGPL/MIT | PDF, Word support |
| **OCR** | Tesseract | Apache 2.0 | Image text extraction |
| **Vector DB** | ChromaDB | Apache 2.0 | Local vector storage |
| **Embeddings** | sentence-transformers | Apache 2.0 | all-MiniLM-L6-v2 model |
| **LLM** | Ollama + Llama3/Mistral/Phi | Various OSS | Local AI, no API costs |
| **Web UI** | Gradio | Apache 2.0 | Simple, beautiful UI |
| **CLI** | Typer | MIT | Command-line interface |
| **Database** | SQLite | Public Domain | Metadata storage |

---

## Core Features

### 1. Multi-Source Video Downloader
- **Supported Platforms:**
  - YouTube, Vimeo, Dailymotion
  - Udemy (with cookies/auth)
  - Teachable, Thinkific, Kajabi
  - Direct video URLs (MP4, WebM, etc.)
  - Google Drive, Dropbox links
- **Technology:** `yt-dlp` (free, actively maintained)
- **Features:**
  - Playlist/batch downloading
  - Quality selection
  - Resume interrupted downloads
  - Metadata extraction (title, description, chapters)
  - Cookie-based authentication for paid courses

### 2. Audio Extraction & Transcription
- **Audio Extraction:** `ffmpeg` (free)
- **Speech-to-Text:**
  - **faster-whisper** - CTranslate2 optimized, 4x faster than original
  - Models: tiny, base, small, medium, large-v3
  - Runs entirely local - no internet needed
- **Features:**
  - Speaker diarization (with pyannote - free for research)
  - Word-level timestamps
  - Multiple language support (99 languages)
  - Auto language detection

### 3. Document Processing
- **Supported Formats:**
  - PDF (PyMuPDF - fast, accurate)
  - Word documents (python-docx)
  - PowerPoint slides (python-pptx)
  - Images with text (Tesseract OCR)
  - Markdown, TXT, HTML
- **All libraries are free and open source**

### 4. Local LLM for Summarization & Analysis
- **Ollama** - Run LLMs locally with simple API
- **Recommended Models (all free):**

| Model | Size | Speed | Quality | Best For |
|-------|------|-------|---------|----------|
| Phi-3 | 3.8B | ⚡⚡⚡ | Good | Fast summaries |
| Mistral | 7B | ⚡⚡ | Great | Balanced |
| Llama3 | 8B | ⚡⚡ | Excellent | Best quality |
| Llama3 | 70B | ⚡ | Outstanding | If you have GPU |

- **Features:**
  - Quick summaries
  - Detailed study notes
  - Key concept extraction
  - Action items and strategies
  - Q&A over content

### 5. Knowledge Base & Vector Storage
- **ChromaDB** - Local vector database (free)
- **Embeddings:** sentence-transformers
  - Model: `all-MiniLM-L6-v2` (fast, 384 dimensions)
  - Alternative: `all-mpnet-base-v2` (better quality, slower)
- **Features:**
  - Semantic search across all content
  - Source attribution with timestamps
  - Hybrid search (semantic + keyword)
  - No cloud, all local

### 6. Virtual Mentor Chat Interface
- **RAG (Retrieval Augmented Generation):**
  - Query → Find relevant chunks → Generate response
  - All runs locally with Ollama
- **Interfaces:**
  - CLI chat (terminal)
  - Web UI (Gradio - beautiful, easy)
- **Features:**
  - Context-aware responses
  - Source citations
  - Conversation memory
  - Export chat history

---

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                   VIDEO ANALYZER (100% Local)                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │
│  │  Ingestion   │  │  Processing  │  │   Knowledge Base     │   │
│  ├──────────────┤  ├──────────────┤  ├──────────────────────┤   │
│  │ • yt-dlp     │  │ • ffmpeg     │  │ • ChromaDB           │   │
│  │ • Cookies    │→ │ • Whisper    │→ │ • sentence-transform │   │
│  │ • File input │  │ • Tesseract  │  │ • SQLite metadata    │   │
│  └──────────────┘  └──────────────┘  └──────────────────────┘   │
│                                                                 │
│                              ↓                                  │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │              Virtual Mentor (Ollama + RAG)               │   │
│  ├──────────────────────────────────────────────────────────┤   │
│  │ • Llama3 / Mistral / Phi-3 (your choice)                 │   │
│  │ • Context retrieval from ChromaDB                        │   │
│  │ • Local inference - no API calls                         │   │
│  │ • Gradio web interface                                   │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

---

## System Requirements

### Minimum (CPU only)
- **RAM:** 8GB (16GB recommended)
- **Storage:** 20GB+ for models and data
- **CPU:** Any modern x64 processor
- **Whisper:** Use "small" or "base" model
- **LLM:** Use Phi-3 (3.8B) model

### Recommended (with GPU)
- **RAM:** 16GB+
- **GPU:** NVIDIA with 8GB+ VRAM (RTX 3060+)
- **Whisper:** Use "medium" or "large-v3" model
- **LLM:** Use Llama3 8B or Mistral 7B

### Optimal (power user)
- **GPU:** RTX 4090 or similar (24GB VRAM)
- **LLM:** Llama3 70B for best quality

---

## Project Structure

```
video_analyzer/
├── src/
│   ├── __init__.py
│   ├── main.py               # Entry point
│   ├── config.py             # Configuration
│   │
│   ├── downloaders/          # Video/content downloaders
│   │   ├── __init__.py
│   │   ├── base.py           # Base downloader class
│   │   ├── ytdlp.py          # yt-dlp wrapper
│   │   └── files.py          # Local file handling
│   │
│   ├── processors/           # Content processors
│   │   ├── __init__.py
│   │   ├── audio.py          # Audio extraction (ffmpeg)
│   │   ├── transcriber.py    # Whisper transcription
│   │   ├── documents.py      # PDF, Word, PPT
│   │   └── ocr.py            # Tesseract OCR
│   │
│   ├── analyzers/            # AI analysis
│   │   ├── __init__.py
│   │   ├── summarizer.py     # Ollama summarization
│   │   ├── extractor.py      # Key info extraction
│   │   └── chunker.py        # Text chunking
│   │
│   ├── knowledge/            # Knowledge base
│   │   ├── __init__.py
│   │   ├── vectorstore.py    # ChromaDB
│   │   ├── embeddings.py     # sentence-transformers
│   │   └── search.py         # Semantic search
│   │
│   ├── mentor/               # Virtual mentor
│   │   ├── __init__.py
│   │   ├── rag.py            # RAG pipeline
│   │   ├── ollama_client.py  # Ollama integration
│   │   └── prompts.py        # System prompts
│   │
│   └── ui/                   # User interfaces
│       ├── __init__.py
│       ├── cli.py            # Typer CLI
│       └── web.py            # Gradio web app
│
├── data/                     # Data storage
│   ├── downloads/            # Downloaded videos
│   ├── audio/                # Extracted audio
│   ├── transcripts/          # Transcriptions
│   ├── summaries/            # Summaries
│   └── chromadb/             # Vector database
│
├── models/                   # Local model cache
│   └── whisper/              # Whisper models
│
├── tests/
├── requirements.txt
├── install.sh                # One-click setup script
├── .cursorrules
└── README.md
```

---

## Dependencies (requirements.txt)

```
# Core
python-dotenv>=1.0.0
typer[all]>=0.9.0
rich>=13.0.0

# Video/Audio
yt-dlp>=2024.1.0
ffmpeg-python>=0.2.0

# Transcription
faster-whisper>=1.0.0
# or: openai-whisper>=20231117

# Document Processing
PyMuPDF>=1.23.0
python-docx>=1.0.0
python-pptx>=0.6.23
pytesseract>=0.3.10

# AI/ML
sentence-transformers>=2.2.0
chromadb>=0.4.0
ollama>=0.1.0

# Web UI
gradio>=4.0.0

# Utilities
tqdm>=4.66.0
pydantic>=2.0.0
```

---

## External Dependencies (System)

```bash
# Ubuntu/Debian
sudo apt install ffmpeg tesseract-ocr

# macOS
brew install ffmpeg tesseract

# Windows
# Download ffmpeg and tesseract installers

# Ollama (all platforms)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3  # or mistral, phi3
```

---

## Development Phases

### Phase 1: Foundation (Week 1-2)
- [ ] Project setup & dependencies
- [ ] yt-dlp video downloader
- [ ] ffmpeg audio extraction
- [ ] faster-whisper transcription
- [ ] Basic CLI with Typer

### Phase 2: Processing Pipeline (Week 3-4)
- [ ] PDF/Word/PPT processing
- [ ] OCR for images
- [ ] Text chunking strategy
- [ ] SQLite metadata storage
- [ ] Batch processing

### Phase 3: Knowledge Base (Week 5-6)
- [ ] sentence-transformers embeddings
- [ ] ChromaDB integration
- [ ] Semantic search
- [ ] Hybrid search (semantic + keyword)
- [ ] Source attribution

### Phase 4: Virtual Mentor (Week 7-8)
- [ ] Ollama integration
- [ ] RAG implementation
- [ ] Real estate prompts
- [ ] Conversation memory
- [ ] CLI chat interface

### Phase 5: Polish & UI (Week 9-10)
- [ ] Gradio web interface
- [ ] Progress tracking
- [ ] Export features
- [ ] Error handling
- [ ] Documentation

---

## Real Estate Mentor - Special Features

### Domain-Specific Prompts
```python
REAL_ESTATE_SYSTEM_PROMPT = """
You are a knowledgeable real estate mentor with expertise from
the user's course materials. Help them with:
- Deal analysis (cash flow, ROI, cap rates)
- Negotiation strategies
- Market analysis
- Legal considerations
- Financing options

Always cite which video/document your advice comes from.
"""
```

### Deal Analysis Helper
- Input property details
- Get relevant strategies from course content
- Calculate key metrics
- Risk assessment based on learned material

### Study Features
- Auto-generate flashcards
- Create quizzes from content
- Build glossary of terms
- Track learning progress

---

## CLI Commands

```bash
# Download video(s)
video-analyzer download "https://youtube.com/watch?v=..."
video-analyzer download --playlist "https://youtube.com/playlist?..."
video-analyzer download --cookies cookies.txt "https://udemy.com/course/..."

# Process content
video-analyzer transcribe ./data/downloads/
video-analyzer process ./documents/   # PDFs, Word, etc.

# Build knowledge base
video-analyzer index                  # Index all processed content
video-analyzer search "what is cap rate"

# Summarize
video-analyzer summarize ./data/transcripts/video1.txt
video-analyzer summarize --all        # Summarize everything

# Chat with mentor
video-analyzer chat                   # CLI chat
video-analyzer ui                     # Launch web UI

# Utilities
video-analyzer status                 # Show processing status
video-analyzer export                 # Export all notes
```

---

## Web UI Preview

```
┌─────────────────────────────────────────────────────────────┐
│  🎓 Real Estate Mentor                               [⚙️]   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │ 📚 Knowledge Base: 47 videos, 12 documents indexed  │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │ You: How do I calculate cash-on-cash return?        │    │
│  │                                                     │    │
│  │ Mentor: Cash-on-cash return measures the annual     │    │
│  │ pre-tax cash flow relative to the total cash        │    │
│  │ invested. The formula is:                           │    │
│  │                                                     │    │
│  │ CoC Return = (Annual Cash Flow / Total Cash) × 100  │    │
│  │                                                     │    │
│  │ 📚 Source: Module 3 - Investment Analysis (12:34)   │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                             │
│  [Type your question here...                    ] [Send]    │
│                                                             │
│  [📥 Add Content] [📊 Analyze Deal] [📝 Study Mode]         │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

---

## Cost Comparison

| Approach | Monthly Cost | Our Approach |
|----------|--------------|--------------|
| OpenAI GPT-4 | $20-100+ | **$0** (Ollama) |
| OpenAI Whisper API | $0.006/min | **$0** (local Whisper) |
| Pinecone Vector DB | $70+ | **$0** (ChromaDB) |
| Cloud transcription | $0.01-0.05/min | **$0** (local) |
| **Total** | **$100+/month** | **$0** |

**Only costs:** Electricity to run your computer 💡

---

## Next Steps

1. ✅ Plan complete - 100% free & open source
2. **Ready to start coding!**

Shall I begin with Phase 1?
- Set up project structure
- Install dependencies
- Build the video downloader
README.md
CHANGED
@@ -1,192 +1,14 @@
-## 🆓 100% Free Stack
-
-| Component | Tool | Cost |
-|-----------|------|------|
-| Video Download | yt-dlp | Free |
-| Transcription | Whisper (local) | Free |
-| Document Processing | PyMuPDF, python-docx | Free |
-| OCR | Tesseract | Free |
-| Summarization | Ollama (Llama3/Mistral) | Free |
-| Vector Database | ChromaDB | Free |
-| Web UI | Gradio | Free |
-
-**Total monthly cost: $0** 💰
-
-## 📋 Features
-
-### Phase 1 ✅
-- **YouTube video downloading** with yt-dlp
-- **AI transcription** using local Whisper
-- **Audio extraction** with ffmpeg
-
-### Phase 2 ✅
-- **Direct file/folder import** - drop files and process
-- **PDF processing** with PyMuPDF
-- **Word/PowerPoint processing**
-- **OCR for images** with Tesseract
-- **AI summarization** with Ollama (local LLM)
-- **Smart text chunking** for long documents
-
-### Coming Soon
-- **Phase 3:** Vector database + semantic search
-- **Phase 4:** Virtual mentor RAG chat
-- **Phase 5:** Web UI with Gradio
-
-## 💻 Requirements
-
-**Minimum:**
-- 8GB RAM (16GB recommended)
-- Any modern CPU
-- 20GB storage
-
-**Recommended (for faster processing):**
-- NVIDIA GPU with 8GB+ VRAM
-- 16GB+ RAM
-
-## 🚀 Quick Start
-
-### 1. Install Dependencies
-
-```bash
-# Clone and setup
-git clone <repo>
-cd video_analyzer
-
-# Install Python dependencies
-pip install -r requirements.txt
-```
-
-### 2. Install Ollama (for AI summaries)
-
-```bash
-# Install Ollama
-curl -fsSL https://ollama.com/install.sh | sh
-
-# Pull a model (choose one)
-ollama pull llama3    # Best quality (8B params)
-ollama pull mistral   # Good balance
-ollama pull phi3      # Fastest (3.8B params)
-
-# Start Ollama server
-ollama serve
-```
-
-### 3. Process Your Content
-
-```bash
-# Add local files (videos, PDFs, Word docs, etc.)
-./video-analyzer add /path/to/your/course/files
-
-# Process everything (transcribe videos, extract docs)
-./video-analyzer process-all
-
-# Generate AI summaries
-./video-analyzer summarize --all --type study_notes
-
-# Check status
-./video-analyzer status
-```
-
-## 📖 CLI Commands
-
-### Content Management
-```bash
-./video-analyzer add PATH        # Add files/folders
-./video-analyzer status          # Show statistics
-./video-analyzer list-content    # List processed content
-```
-
-### Processing
-```bash
-./video-analyzer transcribe PATH        # Transcribe audio/video
-./video-analyzer process-docs [PATH]    # Process PDF/Word/PPT
-./video-analyzer process-images [PATH]  # OCR images
-./video-analyzer process-all [PATH]     # Process everything
-```
-
-### YouTube (requires cookies)
-```bash
-./video-analyzer download URL --cookies cookies.txt
-./video-analyzer process URL --cookies cookies.txt
-```
-
-### AI Summarization
-```bash
-./video-analyzer summarize PATH            # Summarize one file
-./video-analyzer summarize --all           # Summarize all transcripts
-./video-analyzer summarize -t real_estate  # Real estate focus
-./video-analyzer summarize -t study_notes  # Study notes format
-```
-
-### Summary Types
-
-| Type | Description |
-|------|-------------|
-| `quick` | 2-3 paragraph overview |
-| `detailed` | Comprehensive summary with key points |
-| `study_notes` | Formatted notes with concepts, definitions, action items |
-| `real_estate` | Specialized for real estate content with deal analysis |
-
-## 🔧 Whisper Models
-
-| Model | Size | Speed | Quality | RAM |
-|-------|------|-------|---------|-----|
-| tiny | 39M | ⚡⚡⚡⚡ | Basic | 1GB |
-| base | 74M | ⚡⚡⚡ | Good | 1GB |
-| small | 244M | ⚡⚡ | Great | 2GB |
-| medium | 769M | ⚡ | Excellent | 5GB |
-| large-v3 | 1550M | 🐢 | Best | 10GB |
-
-## 📁 Project Structure
-
-```
-video_analyzer/
-├── src/
-│   ├── downloaders/   # yt-dlp, file handling
-│   ├── processors/    # Whisper, documents, OCR
-│   ├── analyzers/     # Ollama summarization, chunking
-│   ├── knowledge/     # Vector DB (Phase 3)
-│   ├── mentor/        # RAG chat (Phase 4)
-│   └── ui/            # CLI, web interface
-├── data/
-│   ├── downloads/     # Source files
-│   ├── audio/         # Extracted audio
-│   ├── transcripts/   # Text content
-│   └── summaries/     # AI summaries
-└── video-analyzer     # CLI script
-```
-
-## 📁 Supported Formats
-
-| Type | Formats |
-|------|---------|
-| Video | .mp4, .mkv, .avi, .mov, .webm, .flv |
-| Audio | .mp3, .wav, .m4a, .flac, .aac, .ogg |
-| Document | .pdf, .docx, .pptx, .txt, .md |
-| Image (OCR) | .png, .jpg, .jpeg, .gif, .bmp |
-
-## 🛠️ Development Status
-
-- [x] **Phase 1:** Video downloading + transcription
-- [x] **Phase 2:** Document processing + AI summarization
- - [ ] **Phase 3:** Knowledge base + vector search
-- [ ] **Phase 4:** Virtual mentor + RAG chat
-- [ ] **Phase 5:** Web UI + polish
-
-## 📜 License
-
-MIT - Free for personal and commercial use
+---
+title: Video Analyzer
+emoji: "🎬"
+colorFrom: blue
+colorTo: purple
+sdk: gradio
+sdk_version: "6.2.0"
+app_file: app.py
+pinned: false
+---
+
+# Video Analyzer
+
+A Gradio application.
VOICE_COMMANDS_PLAN.md
DELETED
@@ -1,323 +0,0 @@
-# Voice Commands Plan - 100% Local & Private
-
-## BLUF (Bottom Line Up Front)
-
-**Add voice control to video_analyzer using Whisper (STT) + Piper (TTS) - both run entirely on your machine. No audio leaves your computer. No voice fingerprinting. No cloud APIs.**
-
----
-
-## ELI5 (Explain Like I'm 5)
-
-| What | How | Privacy |
-|------|-----|---------|
-| **You speak** | Microphone → Whisper (already in project!) | Audio never leaves your PC |
-| **App understands** | Whisper converts speech → text command | All processing is local |
-| **App responds** | Piper TTS converts text → speech | No voice profile created |
-| **Loop** | Wake word → listen → execute → respond | 100% offline capable |
-
-**Why this is private:**
-- Whisper runs locally - OpenAI never sees your voice
-- Piper TTS runs locally - no cloud synthesis
-- No internet required after initial setup
-- Your voice patterns stay on YOUR machine
-
----
-
-## Tech Stack (All Free & Local)
-
-| Component | Technology | Why This One |
-|-----------|------------|--------------|
-| **Speech-to-Text** | Whisper (faster-whisper) | Already in project! Fast, accurate, local |
-| **Text-to-Speech** | Piper TTS | Fast, natural voices, 100% local, tiny models |
-| **Wake Word** | Porcupine (free tier) or OpenWakeWord | Local detection, low CPU |
-| **Audio Capture** | sounddevice + numpy | Cross-platform, real-time |
-| **Command Parser** | Simple pattern matching → Ollama for complex | Start simple, add AI later |
-
-### Alternative TTS Options
-
-| TTS Engine | Quality | Speed | Size | Notes |
-|------------|---------|-------|------|-------|
-| **Piper** ⭐ | Great | ⚡⚡⚡ | 20-60MB | Best balance, recommended |
-| Coqui TTS | Excellent | ⚡⚡ | 200MB+ | More natural, heavier |
-| espeak-ng | Basic | ⚡⚡⚡⚡ | 5MB | Robotic but lightweight |
-| Bark | Amazing | ⚡ | 5GB+ | Too heavy for real-time |
-
----
-
-## Architecture
-
-```
-┌─────────────────────────────────────────────────────────────────────┐
-│              VOICE COMMAND SYSTEM (100% Local)                      │
-├─────────────────────────────────────────────────────────────────────┤
-│                                                                     │
-│  ┌──────────────┐     ┌──────────────┐     ┌──────────────────┐     │
-│  │  Microphone  │     │  Wake Word   │     │  Command Parser  │     │
-│  │  (sounddev)  │ ──▶ │  (Porcupine) │ ──▶ │ (pattern/Ollama) │     │
-│  └──────────────┘     └──────────────┘     └──────────────────┘     │
-│         │                                          │                │
-│         ▼                                          ▼                │
-│  ┌──────────────┐                         ┌──────────────────┐      │
-│  │   Whisper    │                         │ Execute Command  │      │
-│  │    (STT)     │                         │  (existing CLI)  │      │
-│  │    LOCAL     │                         └──────────────────┘      │
-│  └──────────────┘                                  │                │
-│                                                    ▼                │
-│                                           ┌──────────────────┐      │
-│  ┌──────────────┐                         │    Piper TTS     │      │
-│  │   Speaker    │ ◀───────────────────────│     (local)      │      │
-│  └──────────────┘                         └──────────────────┘      │
-│                                                                     │
-│   🔒 ALL PROCESSING ON LOCAL MACHINE - NOTHING SENT TO CLOUD 🔒     │
-└─────────────────────────────────────────────────────────────────────┘
-```
-
----
-
-## Voice Command Flow
-
-```
-1. IDLE STATE
-   └─▶ Listening for wake word ("Hey Analyzer" / "Computer")
-
-2. WAKE WORD DETECTED
-   └─▶ Play acknowledgment sound
-   └─▶ Start recording user speech
-
-3. USER SPEAKS COMMAND
-   └─▶ "Summarize my latest video"
-   └─▶ Silence detection → stop recording
-
-4. SPEECH-TO-TEXT (Whisper)
-   └─▶ Audio → "summarize my latest video"
-
-5. COMMAND PARSING
-   └─▶ Match to CLI command: `./video-analyzer summarize --latest`
-   └─▶ For complex queries → use Ollama to interpret
-
-6. EXECUTE & RESPOND
-   └─▶ Run command
-   └─▶ Get result text
-   └─▶ Piper TTS → Speak result
-
-7. RETURN TO IDLE
-```
-
----
-
-## Supported Voice Commands (Examples)
-
-| Voice Command | Maps To | Category |
-|---------------|---------|----------|
-| "What's my status" | `./video-analyzer status` | Info |
-| "Summarize latest video" | `./video-analyzer summarize --latest` | Processing |
-| "Add files from downloads" | `./video-analyzer add ~/Downloads` | Content |
-| "Process all videos" | `./video-analyzer process-all` | Processing |
-| "Search for cap rate" | `./video-analyzer search "cap rate"` | Knowledge |
-| "Start chat mode" | `./video-analyzer chat` | Interactive |
-| "What did I learn about negotiation" | RAG query via Ollama | Q&A |
-
----
-
-## Project Structure (New Files)
-
-```
-src/
-├── voice/                 # NEW MODULE
-│   ├── __init__.py
-│   ├── listener.py        # Microphone capture + wake word
-│   ├── stt.py             # Whisper wrapper for real-time
-│   ├── tts.py             # Piper TTS wrapper
-│   ├── commands.py        # Command pattern matching
-│   └── assistant.py       # Main voice assistant loop
-│
-├── processors/
-│   └── transcriber.py     # Already exists - reuse for STT
-```
-
----
-
-## Implementation Phases
-
-### Phase 1: Basic TTS (Speak Responses) — 2-3 hours
-- [ ] Install Piper TTS
-- [ ] Create `src/voice/tts.py`
-- [ ] Add `--speak` flag to CLI commands
-- [ ] Test: `./video-analyzer status --speak`
-
-### Phase 2: Real-time STT (Hear Commands) — 3-4 hours
-- [ ] Install sounddevice for audio capture
-- [ ] Create `src/voice/stt.py` (wrap existing Whisper)
-- [ ] Implement silence detection (stop recording)
-- [ ] Test: record → transcribe → print
-
-### Phase 3: Command Parsing — 2-3 hours
-- [ ] Create `src/voice/commands.py`
-- [ ] Pattern matching for simple commands
-- [ ] Ollama fallback for complex/natural queries
-- [ ] Map voice → CLI commands
-
-### Phase 4: Wake Word Detection — 2-3 hours
-- [ ] Choose: Porcupine (easier) or OpenWakeWord (more private)
-- [ ] Create `src/voice/listener.py`
-- [ ] Continuous low-power listening
-- [ ] Wake → record → process cycle
-
-### Phase 5: Voice Assistant Loop — 2-3 hours
-- [ ] Create `src/voice/assistant.py`
-- [ ] Full loop: wake → listen → parse → execute → speak
-- [ ] Add `./video-analyzer voice` command
-- [ ] Handle errors gracefully with voice feedback
-
-### Phase 6: Polish — 2-3 hours
-- [ ] Acknowledgment sounds (beeps/chimes)
-- [ ] Voice feedback for long operations ("Processing, please wait...")
-- [ ] Configurable wake word
-- [ ] Voice selection for TTS
-
----
-
-## Dependencies to Add
-
-```txt
-# Voice Commands - requirements.txt additions
-
-# Audio capture
-sounddevice>=0.4.6
-numpy>=1.24.0
-
-# Text-to-Speech (local)
-piper-tts>=1.2.0
-# Alternative: TTS>=0.22.0  # Coqui TTS
-
-# Wake Word Detection (choose one)
-pvporcupine>=3.0.0   # Easier setup, free tier
-# openwakeword>=0.5.0  # Fully open source
-
-# Voice Activity Detection
-webrtcvad>=2.0.10
-```
-
----
-
-## System Dependencies
-
-```bash
-# Ubuntu/Debian
-sudo apt install portaudio19-dev python3-pyaudio
-
-# For Piper TTS voices (download once)
-mkdir -p ~/.local/share/piper
-cd ~/.local/share/piper
-wget https://github.com/rhasspy/piper/releases/download/v1.2.0/voice-en_US-lessac-medium.onnx.json
-wget https://github.com/rhasspy/piper/releases/download/v1.2.0/voice-en_US-lessac-medium.onnx
-```
-
----
-
-## Privacy Guarantees
-
-### What NEVER Leaves Your Machine
-- ❌ Raw audio recordings
-- ❌ Voice patterns/fingerprints
-- ❌ Transcribed text
-- ❌ Commands you speak
-- ❌ Any biometric data
-
-### What Stays 100% Local
-- ✅ Whisper model runs locally
-- ✅ Piper TTS runs locally
-- ✅ Wake word detection runs locally
-- ✅ All audio processing is local
-- ✅ Works completely offline after setup
-
-### Compared to Cloud Alternatives
-
-| Cloud Service | What They Collect | Our Approach |
-|---------------|-------------------|--------------|
-| Alexa/Siri | Voice recordings, patterns | Nothing - all local |
-| Google Assistant | Voice data, usage patterns | Nothing - all local |
-| OpenAI Whisper API | Audio sent to cloud | Local Whisper - never sent |
-| ElevenLabs | Voice for cloning | Local Piper - no upload |
-
----
-
-## Configuration Options
-
-```json
-// config/voice.json
-{
-  "wake_word": "hey analyzer",
-  "stt_model": "base",          // tiny/base/small/medium
-  "tts_voice": "en_US-lessac-medium",
-  "tts_speed": 1.0,
-  "silence_threshold": 0.5,     // seconds of silence to stop
-  "confirmation_sounds": true,
-  "speak_responses": true,
-  "max_listen_time": 30         // seconds
-}
-```
-
----
-
-## Example Usage
-
-```bash
-# Start voice assistant mode
-./video-analyzer voice
-
-# One-shot voice command
-./video-analyzer voice --once
-
-# Status with spoken response
-./video-analyzer status --speak
-
-# Process with voice feedback
-./video-analyzer process-all --speak
-```
-
-### Voice Session Example
-
-```
-[System]: Listening for "Hey Analyzer"...
-[You]: "Hey Analyzer"
-[System]: *beep* "Yes?"
-[You]: "What's my current status?"
-[System]: "You have 12 videos transcribed, 8 documents processed,
-           and 47 items in your knowledge base. 3 videos are
-           pending transcription."
-[System]: Listening for "Hey Analyzer"...
-[You]: "Hey Analyzer"
-[System]: *beep* "Yes?"
-[You]: "Summarize the latest video about negotiation"
-[System]: "Working on it... The latest video covers 5 key
-           negotiation tactics: anchoring, the flinch,
-           bracketing, nibbling, and the walk-away..."
-```
-
----
-
-## Why This Approach?
-
-| Requirement | Solution |
-|-------------|----------|
-| **No voice collection** | All STT via local Whisper |
-| **No fingerprinting** | No cloud = no profile building |
-| **Works offline** | Everything runs locally |
-| **Fast response** | Piper TTS is <100ms latency |
-| **Natural voices** | Piper neural voices sound great |
-| **Low resources** | Base Whisper + Piper = ~500MB RAM |
-
----
-
-## Next Steps
-
-1. **Start with Phase 1** - Get TTS working first (instant gratification)
-2. **Then Phase 2** - Add STT (reuse existing Whisper code)
-3. **Phases 3-5** - Build up the full assistant
-4. **Phase 6** - Polish and customize
-
-Ready to start implementing? Just say the word! 🎤
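The Phase 3 pattern-matching step in the deleted plan can be sketched roughly as below. The phrase patterns and CLI strings are illustrative assumptions modeled on the plan's examples table, not the project's actual mapping:

```python
import re

# Illustrative voice-phrase → CLI-command table (assumed, not from the project).
COMMAND_PATTERNS = [
    (re.compile(r"\b(what'?s my |current )?status\b"), "./video-analyzer status"),
    (re.compile(r"\bsummarize (the )?latest\b"), "./video-analyzer summarize --latest"),
    (re.compile(r"\bprocess all\b"), "./video-analyzer process-all"),
    (re.compile(r"\bsearch for (?P<q>.+)"), './video-analyzer search "{q}"'),
]

def parse_command(utterance: str):
    """Map a transcribed utterance to a CLI command string.

    Returns None when nothing matches, which is where the plan's
    Ollama fallback for complex/natural queries would take over.
    """
    text = utterance.lower().strip()
    for pattern, template in COMMAND_PATTERNS:
        m = pattern.search(text)
        if m:
            # Fill named groups (e.g. the search query) into the template.
            return template.format(**m.groupdict())
    return None  # hand off to the LLM-based interpreter

print(parse_command("What's my status?"))
print(parse_command("search for cap rate"))
```

Keeping the table as (compiled regex, template) pairs makes the simple path cheap and leaves one clear seam where an LLM interpreter can be bolted on later.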
app.py
ADDED
@@ -0,0 +1,11 @@
+import gradio as gr
+
+demo = gr.Interface(
+    fn=lambda x: x,
+    inputs=gr.Textbox(label="Input"),
+    outputs=gr.Textbox(label="Output"),
+    title="Video Analyzer",
+)
+
+if __name__ == "__main__":
+    demo.launch()
data/audio/test_silence.wav
DELETED
Binary file (32.1 kB)
data/summaries/sample_real_estate_summary.md
DELETED
@@ -1 +0,0 @@
-Real estate investing success comes from: understanding your numbers, doing thorough due diligence, Negotiating, and avoiding common pitfalls. In the next module, we'll dive into deeper into strategies and how to structure deals for maximum returns. Real Estate Investment Fundamentals - Course Transcript is available in English and Spanish. For more information, visit the Real Estate Investing Course Transcripts website or click here for the English version. For the Spanish version, go to the Real estate Investment Course Transcript website or visit the Dutch version.
data/transcripts/sample_real_estate.txt
DELETED
@@ -1,98 +0,0 @@
-Real Estate Investment Fundamentals - Course Transcript
-
-Welcome to Module 1: Understanding Real Estate Investment Basics
-
-Today we're going to cover the fundamental concepts every real estate investor needs to know. Whether you're just starting out or looking to expand your portfolio, these principles will guide your decision-making.
-
-CASH FLOW ANALYSIS
-
-Cash flow is the lifeblood of any real estate investment. Simply put, it's the money left over after you've collected rent and paid all expenses. Here's the basic formula:
-
-Monthly Cash Flow = Gross Rent - Operating Expenses - Mortgage Payment
-
-Let's break this down with an example. Say you have a rental property that brings in $2,000 per month in rent. Your expenses include:
-- Property taxes: $200/month
-- Insurance: $100/month
-- Maintenance reserve: $150/month
-- Property management: $160/month (8% of rent)
-- Vacancy allowance: $100/month (5%)
-
-Total operating expenses: $710/month
-Mortgage payment: $900/month
-
-Cash flow = $2,000 - $710 - $900 = $390/month positive cash flow
-
-This is a healthy cash-flowing property!
-
-CAP RATE (CAPITALIZATION RATE)
-
-Cap rate helps you compare properties and determine value. It's calculated as:
-
-Cap Rate = Net Operating Income (NOI) / Property Value
-
-NOI is your annual income minus operating expenses (not including mortgage). Using our example:
-- Annual gross rent: $24,000
-- Annual operating expenses: $8,520
-- NOI: $15,480
-
-If the property is worth $200,000:
-Cap Rate = $15,480 / $200,000 = 7.74%
-
-Generally, higher cap rates mean higher returns but often come with more risk. Markets like New York might have 4% cap rates while smaller cities might offer 8-10%.
-
-CASH-ON-CASH RETURN
-
-This metric tells you how hard your actual invested cash is working:
-
-Cash-on-Cash Return = Annual Cash Flow / Total Cash Invested
-
-If you put $50,000 down on our example property:
-- Annual cash flow: $390 x 12 = $4,680
-- Cash-on-cash return: $4,680 / $50,000 = 9.36%
-
-That means you're earning 9.36% on your actual cash investment - much better than a savings account!
-
-THE 1% RULE
-
-A quick screening tool: the monthly rent should be at least 1% of the purchase price. For a $200,000 property, you'd want at least $2,000/month in rent.
-
-Our example property meets this rule: $2,000 / $200,000 = 1%
-
-NEGOTIATION STRATEGIES
-
-When making offers:
-1. Always start below asking price - leave room to negotiate
-2. Use inspection findings as leverage
-3. Ask for seller concessions on closing costs
-4. Be prepared to walk away - this is your strongest tool
-5. Build rapport with the seller when possible
-
-DUE DILIGENCE CHECKLIST
-
-Before closing, verify:
-- Rent rolls and actual income
-- All operating expenses with documentation
-- Property condition (get professional inspection)
-- Comparable sales in the area
-- Zoning and any restrictions
-- Title search for liens or encumbrances
-
-COMMON MISTAKES TO AVOID
-
-1. Overestimating rental income
-2. Underestimating repairs and maintenance
-3. Not accounting for vacancy
-4. Skipping proper inspections
-5. Emotional decision-making
-6. Over-leveraging (too much debt)
-
-SUMMARY
-
-Real estate investing success comes from:
-- Understanding your numbers (cash flow, cap rate, CoC return)
-- Doing thorough due diligence
-- Negotiating effectively
-- Avoiding common pitfalls
-- Building for long-term wealth
-
-In the next module, we'll dive deeper into financing strategies and how to structure deals for maximum returns.
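The deleted transcript walks through cash flow, cap rate, cash-on-cash return, and the 1% rule with one worked example. Its arithmetic checks out, as a few lines of Python confirm (variable names are ours; the numbers are the transcript's):

```python
# The transcript's example property, expressed as a quick sanity check.
rent = 2000.0                                  # monthly gross rent
expenses = 200 + 100 + 150 + 160 + 100         # taxes, insurance, maintenance, mgmt, vacancy
mortgage = 900.0
cash_flow = rent - expenses - mortgage         # monthly cash flow: $390

noi = (rent - expenses) * 12                   # annual net operating income: $15,480
value = 200_000.0
cap_rate = noi / value                         # 7.74%

down_payment = 50_000.0
coc_return = (cash_flow * 12) / down_payment   # cash-on-cash: 9.36%

meets_one_percent_rule = rent / value >= 0.01  # $2,000 is exactly 1% of $200,000

print(cash_flow, noi, round(cap_rate * 100, 2), round(coc_return * 100, 2))
```

Note that NOI deliberately excludes the mortgage payment, which is why cap rate (a property metric) and cash-on-cash return (an investor metric) differ.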
hf_space/README.md
DELETED
@@ -1,39 +0,0 @@
----
-title: Real Estate Mentor
-emoji: 🏠
-colorFrom: blue
-colorTo: green
-sdk: gradio
-sdk_version: 4.44.0
-app_file: app.py
-pinned: false
-license: mit
----
-
-# 🏠 Real Estate Mentor
-
-Your AI-powered course assistant for real estate investing education.
-
-## Features
-
-- **🔍 Semantic Search** - Search your course content by meaning, not just keywords
-- **💬 Ask Questions** - Get answers based on your indexed materials
-- **📤 Easy Upload** - Add transcripts and notes with one click
-- **💾 Persistent Storage** - Your data is saved between sessions
-
-## How to Use
-
-1. **Upload Content** - Go to the Upload tab and add your course transcripts
-2. **Search** - Use natural language to find relevant information
-3. **Ask** - Chat with your AI mentor about the content
-
-## Tech Stack
-
-- **Gradio** - Web interface
-- **ChromaDB** - Vector database for semantic search
-- **Sentence Transformers** - Text embeddings
-- **100% Free** - Runs entirely on HuggingFace Spaces
-
-## Privacy
-
-Your uploaded content is stored in this Space's persistent storage. No data is sent to external services.
hf_space/app.py
DELETED
@@ -1,413 +0,0 @@
-"""
-Real Estate Mentor - HuggingFace Spaces App
-
-A semantic search and Q&A system for course content.
-Upload transcripts, search by meaning, and get answers.
-"""
-
-import os
-import sys
-from pathlib import Path
-
-# Add src to path for imports
-sys.path.insert(0, str(Path(__file__).parent))
-
-import gradio as gr
-
-# Set up persistent storage paths for HF Spaces
-DATA_DIR = Path(os.getenv("PERSISTENT_DIR", "/data" if os.path.exists("/data") else "./data"))
-CHROMA_DIR = DATA_DIR / "chromadb"
-TRANSCRIPTS_DIR = DATA_DIR / "transcripts"
-
-# Ensure directories exist
-for d in [CHROMA_DIR, TRANSCRIPTS_DIR]:
-    d.mkdir(parents=True, exist_ok=True)
-
-print(f"Data directory: {DATA_DIR}")
-print(f"ChromaDB directory: {CHROMA_DIR}")
-
-
-# ============== KNOWLEDGE BASE ==============
-
-class SimpleKnowledgeBase:
-    """Simplified knowledge base for HF Spaces."""
-
-    def __init__(self):
-        self._client = None
-        self._collection = None
-        self._model = None
-
-    def _init(self):
-        if self._client is not None:
-            return
-
-        import chromadb
-        from chromadb.config import Settings
-        from sentence_transformers import SentenceTransformer
-
-        # Initialize ChromaDB
-        self._client = chromadb.PersistentClient(
-            path=str(CHROMA_DIR),
-            settings=Settings(anonymized_telemetry=False)
-        )
-        self._collection = self._client.get_or_create_collection(
-            name="real_estate_mentor",
-            metadata={"hnsw:space": "cosine"}
-        )
-
-        # Initialize embedding model
-        self._model = SentenceTransformer("all-MiniLM-L6-v2")
-
-        print(f"Knowledge base initialized: {self._collection.count()} documents")
-
-    def add_text(self, text: str, source: str, chunk_size: int = 500):
def add_text(self, text: str, source: str, chunk_size: int = 500):
|
| 64 |
-
"""Add text to the knowledge base in chunks."""
|
| 65 |
-
self._init()
|
| 66 |
-
|
| 67 |
-
# Simple chunking by sentences/paragraphs
|
| 68 |
-
chunks = self._chunk_text(text, chunk_size)
|
| 69 |
-
|
| 70 |
-
if not chunks:
|
| 71 |
-
return 0
|
| 72 |
-
|
| 73 |
-
# Generate embeddings
|
| 74 |
-
embeddings = self._model.encode(chunks).tolist()
|
| 75 |
-
|
| 76 |
-
# Generate IDs
|
| 77 |
-
import hashlib
|
| 78 |
-
ids = [
|
| 79 |
-
hashlib.md5(f"{source}:{i}:{c[:50]}".encode()).hexdigest()
|
| 80 |
-
for i, c in enumerate(chunks)
|
| 81 |
-
]
|
| 82 |
-
|
| 83 |
-
# Add to collection
|
| 84 |
-
self._collection.add(
|
| 85 |
-
ids=ids,
|
| 86 |
-
embeddings=embeddings,
|
| 87 |
-
documents=chunks,
|
| 88 |
-
metadatas=[{"source": source, "chunk_idx": i} for i in range(len(chunks))]
|
| 89 |
-
)
|
| 90 |
-
|
| 91 |
-
return len(chunks)
|
| 92 |
-
|
| 93 |
-
def _chunk_text(self, text: str, chunk_size: int = 500) -> list[str]:
|
| 94 |
-
"""Split text into chunks."""
|
| 95 |
-
if len(text) <= chunk_size:
|
| 96 |
-
return [text] if text.strip() else []
|
| 97 |
-
|
| 98 |
-
chunks = []
|
| 99 |
-
paragraphs = text.split("\n\n")
|
| 100 |
-
current_chunk = ""
|
| 101 |
-
|
| 102 |
-
for para in paragraphs:
|
| 103 |
-
if len(current_chunk) + len(para) <= chunk_size:
|
| 104 |
-
current_chunk += para + "\n\n"
|
| 105 |
-
else:
|
| 106 |
-
if current_chunk.strip():
|
| 107 |
-
chunks.append(current_chunk.strip())
|
| 108 |
-
current_chunk = para + "\n\n"
|
| 109 |
-
|
| 110 |
-
if current_chunk.strip():
|
| 111 |
-
chunks.append(current_chunk.strip())
|
| 112 |
-
|
| 113 |
-
return chunks
|
| 114 |
-
|
| 115 |
-
def search(self, query: str, n_results: int = 5) -> list[dict]:
|
| 116 |
-
"""Search the knowledge base."""
|
| 117 |
-
self._init()
|
| 118 |
-
|
| 119 |
-
if self._collection.count() == 0:
|
| 120 |
-
return []
|
| 121 |
-
|
| 122 |
-
# Generate query embedding
|
| 123 |
-
query_embedding = self._model.encode(query).tolist()
|
| 124 |
-
|
| 125 |
-
# Search
|
| 126 |
-
results = self._collection.query(
|
| 127 |
-
query_embeddings=[query_embedding],
|
| 128 |
-
n_results=min(n_results, self._collection.count()),
|
| 129 |
-
include=["documents", "metadatas", "distances"]
|
| 130 |
-
)
|
| 131 |
-
|
| 132 |
-
# Format results
|
| 133 |
-
output = []
|
| 134 |
-
if results["documents"] and results["documents"][0]:
|
| 135 |
-
for i, doc in enumerate(results["documents"][0]):
|
| 136 |
-
meta = results["metadatas"][0][i] if results["metadatas"] else {}
|
| 137 |
-
dist = results["distances"][0][i] if results["distances"] else 0
|
| 138 |
-
output.append({
|
| 139 |
-
"text": doc,
|
| 140 |
-
"source": meta.get("source", "unknown"),
|
| 141 |
-
"score": 1 - dist # Convert distance to similarity
|
| 142 |
-
})
|
| 143 |
-
|
| 144 |
-
return output
|
| 145 |
-
|
| 146 |
-
def count(self) -> int:
|
| 147 |
-
"""Get document count."""
|
| 148 |
-
self._init()
|
| 149 |
-
return self._collection.count()
|
| 150 |
-
|
| 151 |
-
def get_sources(self) -> list[str]:
|
| 152 |
-
"""Get all sources."""
|
| 153 |
-
self._init()
|
| 154 |
-
results = self._collection.get(include=["metadatas"])
|
| 155 |
-
sources = set()
|
| 156 |
-
if results["metadatas"]:
|
| 157 |
-
for meta in results["metadatas"]:
|
| 158 |
-
if "source" in meta:
|
| 159 |
-
sources.add(meta["source"])
|
| 160 |
-
return sorted(sources)
|
| 161 |
-
|
| 162 |
-
def clear(self):
|
| 163 |
-
"""Clear the knowledge base."""
|
| 164 |
-
self._init()
|
| 165 |
-
self._client.delete_collection("real_estate_mentor")
|
| 166 |
-
self._collection = self._client.create_collection(
|
| 167 |
-
name="real_estate_mentor",
|
| 168 |
-
metadata={"hnsw:space": "cosine"}
|
| 169 |
-
)
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
# Global instance
|
| 173 |
-
kb = SimpleKnowledgeBase()
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
# ============== UI FUNCTIONS ==============
|
| 177 |
-
|
| 178 |
-
def search_knowledge(query: str, n_results: int = 5) -> str:
|
| 179 |
-
"""Search the knowledge base."""
|
| 180 |
-
if not query.strip():
|
| 181 |
-
return "⚠️ Please enter a search query."
|
| 182 |
-
|
| 183 |
-
try:
|
| 184 |
-
results = kb.search(query, n_results=int(n_results))
|
| 185 |
-
|
| 186 |
-
if not results:
|
| 187 |
-
return "📭 No results found. Upload some content first!"
|
| 188 |
-
|
| 189 |
-
output = ["## 🔍 Search Results\n"]
|
| 190 |
-
for i, r in enumerate(results, 1):
|
| 191 |
-
source = Path(r["source"]).stem if r["source"] != "unknown" else "unknown"
|
| 192 |
-
score = r["score"] * 100
|
| 193 |
-
text = r["text"][:400] + "..." if len(r["text"]) > 400 else r["text"]
|
| 194 |
-
|
| 195 |
-
output.append(f"### Result {i} — {score:.0f}% match")
|
| 196 |
-
output.append(f"📄 **Source:** {source}\n")
|
| 197 |
-
output.append(f"```\n{text}\n```\n")
|
| 198 |
-
|
| 199 |
-
return "\n".join(output)
|
| 200 |
-
|
| 201 |
-
except Exception as e:
|
| 202 |
-
return f"❌ Error: {str(e)}"
|
| 203 |
-
|
| 204 |
-
|
| 205 |
-
def upload_file(file, source_name: str) -> str:
|
| 206 |
-
"""Process uploaded file."""
|
| 207 |
-
if file is None:
|
| 208 |
-
return "⚠️ Please select a file."
|
| 209 |
-
|
| 210 |
-
try:
|
| 211 |
-
# Read content
|
| 212 |
-
with open(file.name, "r", encoding="utf-8", errors="ignore") as f:
|
| 213 |
-
content = f.read()
|
| 214 |
-
|
| 215 |
-
if not content.strip():
|
| 216 |
-
return "⚠️ File is empty."
|
| 217 |
-
|
| 218 |
-
# Use custom name or filename
|
| 219 |
-
name = source_name.strip() if source_name.strip() else Path(file.name).stem
|
| 220 |
-
|
| 221 |
-
# Save locally
|
| 222 |
-
save_path = TRANSCRIPTS_DIR / f"{name}.txt"
|
| 223 |
-
save_path.write_text(content)
|
| 224 |
-
|
| 225 |
-
# Index
|
| 226 |
-
chunks = kb.add_text(content, source=str(save_path))
|
| 227 |
-
|
| 228 |
-
return f"""✅ **Successfully indexed!**
|
| 229 |
-
|
| 230 |
-
- **Source:** {name}
|
| 231 |
-
- **Chunks created:** {chunks}
|
| 232 |
-
- **Characters:** {len(content):,}
|
| 233 |
-
"""
|
| 234 |
-
except Exception as e:
|
| 235 |
-
return f"❌ Error: {str(e)}"
|
| 236 |
-
|
| 237 |
-
|
| 238 |
-
def upload_text(text: str, source_name: str) -> str:
|
| 239 |
-
"""Process pasted text."""
|
| 240 |
-
if not text.strip():
|
| 241 |
-
return "⚠️ Please enter some text."
|
| 242 |
-
if not source_name.strip():
|
| 243 |
-
return "⚠️ Please provide a source name."
|
| 244 |
-
|
| 245 |
-
try:
|
| 246 |
-
# Save locally
|
| 247 |
-
save_path = TRANSCRIPTS_DIR / f"{source_name.strip()}.txt"
|
| 248 |
-
save_path.write_text(text)
|
| 249 |
-
|
| 250 |
-
# Index
|
| 251 |
-
chunks = kb.add_text(text, source=str(save_path))
|
| 252 |
-
|
| 253 |
-
return f"""✅ **Successfully indexed!**
|
| 254 |
-
|
| 255 |
-
- **Source:** {source_name}
|
| 256 |
-
- **Chunks created:** {chunks}
|
| 257 |
-
- **Characters:** {len(text):,}
|
| 258 |
-
"""
|
| 259 |
-
except Exception as e:
|
| 260 |
-
return f"❌ Error: {str(e)}"
|
| 261 |
-
|
| 262 |
-
|
| 263 |
-
def get_status() -> str:
|
| 264 |
-
"""Get knowledge base status."""
|
| 265 |
-
try:
|
| 266 |
-
count = kb.count()
|
| 267 |
-
sources = kb.get_sources()
|
| 268 |
-
|
| 269 |
-
output = [f"## 📊 Knowledge Base Status\n"]
|
| 270 |
-
output.append(f"**Total chunks:** {count}")
|
| 271 |
-
output.append(f"**Sources:** {len(sources)}\n")
|
| 272 |
-
|
| 273 |
-
if sources:
|
| 274 |
-
output.append("### 📁 Indexed Sources:")
|
| 275 |
-
for s in sources[:15]:
|
| 276 |
-
name = Path(s).stem
|
| 277 |
-
output.append(f"- {name}")
|
| 278 |
-
if len(sources) > 15:
|
| 279 |
-
output.append(f"- *...and {len(sources) - 15} more*")
|
| 280 |
-
else:
|
| 281 |
-
output.append("*No content indexed yet. Upload some files to get started!*")
|
| 282 |
-
|
| 283 |
-
return "\n".join(output)
|
| 284 |
-
except Exception as e:
|
| 285 |
-
return f"❌ Error: {str(e)}"
|
| 286 |
-
|
| 287 |
-
|
| 288 |
-
def clear_all() -> str:
|
| 289 |
-
"""Clear knowledge base."""
|
| 290 |
-
try:
|
| 291 |
-
kb.clear()
|
| 292 |
-
return "✅ Knowledge base cleared!"
|
| 293 |
-
except Exception as e:
|
| 294 |
-
return f"❌ Error: {str(e)}"
|
| 295 |
-
|
| 296 |
-
|
| 297 |
-
def chat_respond(message: str, history: list) -> tuple:
|
| 298 |
-
"""Respond to chat message using RAG."""
|
| 299 |
-
if not message.strip():
|
| 300 |
-
return "", history
|
| 301 |
-
|
| 302 |
-
try:
|
| 303 |
-
# Search for context
|
| 304 |
-
results = kb.search(message, n_results=3)
|
| 305 |
-
|
| 306 |
-
if not results:
|
| 307 |
-
response = "I don't have any relevant information yet. Please upload some course content first! 📚"
|
| 308 |
-
else:
|
| 309 |
-
# Build response from context
|
| 310 |
-
sources = set()
|
| 311 |
-
context_parts = []
|
| 312 |
-
|
| 313 |
-
for r in results:
|
| 314 |
-
source = Path(r["source"]).stem if r["source"] != "unknown" else "unknown"
|
| 315 |
-
sources.add(source)
|
| 316 |
-
context_parts.append(r["text"])
|
| 317 |
-
|
| 318 |
-
context = "\n\n---\n\n".join(context_parts)
|
| 319 |
-
|
| 320 |
-
response = f"""Based on your course materials:
|
| 321 |
-
|
| 322 |
-
{context[:1500]}{"..." if len(context) > 1500 else ""}
|
| 323 |
-
|
| 324 |
-
---
|
| 325 |
-
📚 *Sources: {", ".join(sources)}*"""
|
| 326 |
-
|
| 327 |
-
history.append((message, response))
|
| 328 |
-
return "", history
|
| 329 |
-
|
| 330 |
-
except Exception as e:
|
| 331 |
-
history.append((message, f"❌ Error: {str(e)}"))
|
| 332 |
-
return "", history
|
| 333 |
-
|
| 334 |
-
|
| 335 |
-
# ============== BUILD APP ==============
|
| 336 |
-
|
| 337 |
-
with gr.Blocks(
|
| 338 |
-
title="Real Estate Mentor",
|
| 339 |
-
theme=gr.themes.Soft()
|
| 340 |
-
) as demo:
|
| 341 |
-
|
| 342 |
-
gr.Markdown("""
|
| 343 |
-
# 🏠 Real Estate Mentor
|
| 344 |
-
|
| 345 |
-
Your AI-powered course assistant. Upload transcripts, search semantically, and ask questions.
|
| 346 |
-
|
| 347 |
-
---
|
| 348 |
-
""")
|
| 349 |
-
|
| 350 |
-
with gr.Tabs():
|
| 351 |
-
# Search Tab
|
| 352 |
-
with gr.TabItem("🔍 Search"):
|
| 353 |
-
with gr.Row():
|
| 354 |
-
with gr.Column(scale=4):
|
| 355 |
-
search_input = gr.Textbox(
|
| 356 |
-
label="Search Query",
|
| 357 |
-
placeholder="e.g., How do I calculate cash-on-cash return?",
|
| 358 |
-
lines=2
|
| 359 |
-
)
|
| 360 |
-
with gr.Column(scale=1):
|
| 361 |
-
n_results_slider = gr.Slider(1, 10, value=5, step=1, label="Results")
|
| 362 |
-
search_btn = gr.Button("🔍 Search", variant="primary")
|
| 363 |
-
search_output = gr.Markdown()
|
| 364 |
-
|
| 365 |
-
search_btn.click(search_knowledge, [search_input, n_results_slider], search_output)
|
| 366 |
-
search_input.submit(search_knowledge, [search_input, n_results_slider], search_output)
|
| 367 |
-
|
| 368 |
-
# Chat Tab
|
| 369 |
-
with gr.TabItem("💬 Ask"):
|
| 370 |
-
chatbot = gr.Chatbot(height=400, label="Chat")
|
| 371 |
-
chat_input = gr.Textbox(label="Your Question", placeholder="Ask about your course content...")
|
| 372 |
-
chat_btn = gr.Button("💬 Send", variant="primary")
|
| 373 |
-
|
| 374 |
-
chat_btn.click(chat_respond, [chat_input, chatbot], [chat_input, chatbot])
|
| 375 |
-
chat_input.submit(chat_respond, [chat_input, chatbot], [chat_input, chatbot])
|
| 376 |
-
|
| 377 |
-
# Upload Tab
|
| 378 |
-
with gr.TabItem("📤 Upload"):
|
| 379 |
-
with gr.Row():
|
| 380 |
-
with gr.Column():
|
| 381 |
-
gr.Markdown("### Upload File")
|
| 382 |
-
file_input = gr.File(label="Select .txt or .md file", file_types=[".txt", ".md"])
|
| 383 |
-
file_name = gr.Textbox(label="Custom Name (optional)", placeholder="e.g., Module 1")
|
| 384 |
-
file_btn = gr.Button("📤 Upload", variant="primary")
|
| 385 |
-
file_output = gr.Markdown()
|
| 386 |
-
|
| 387 |
-
file_btn.click(upload_file, [file_input, file_name], file_output)
|
| 388 |
-
|
| 389 |
-
with gr.Column():
|
| 390 |
-
gr.Markdown("### Paste Text")
|
| 391 |
-
text_input = gr.Textbox(label="Text Content", lines=8, placeholder="Paste transcript here...")
|
| 392 |
-
text_name = gr.Textbox(label="Source Name", placeholder="e.g., Video 1 Notes")
|
| 393 |
-
text_btn = gr.Button("📥 Index", variant="primary")
|
| 394 |
-
text_output = gr.Markdown()
|
| 395 |
-
|
| 396 |
-
text_btn.click(upload_text, [text_input, text_name], text_output)
|
| 397 |
-
|
| 398 |
-
# Status Tab
|
| 399 |
-
with gr.TabItem("📊 Status"):
|
| 400 |
-
status_output = gr.Markdown()
|
| 401 |
-
with gr.Row():
|
| 402 |
-
refresh_btn = gr.Button("🔄 Refresh")
|
| 403 |
-
clear_btn = gr.Button("🗑️ Clear All", variant="stop")
|
| 404 |
-
|
| 405 |
-
refresh_btn.click(get_status, outputs=status_output)
|
| 406 |
-
clear_btn.click(clear_all, outputs=status_output)
|
| 407 |
-
demo.load(get_status, outputs=status_output)
|
| 408 |
-
|
| 409 |
-
gr.Markdown("---\n*Built with Gradio, ChromaDB & Sentence Transformers • 100% Free*")
|
| 410 |
-
|
| 411 |
-
|
| 412 |
-
if __name__ == "__main__":
|
| 413 |
-
demo.launch()
|
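The `_chunk_text` helper deleted above is plain Python and easy to exercise on its own. A minimal standalone sketch of the same paragraph-accumulation strategy (the function and sample text here are illustrative, not taken from the deleted module):

```python
def chunk_text(text: str, chunk_size: int = 500) -> list[str]:
    # Accumulate whole paragraphs until adding the next one would exceed chunk_size,
    # mirroring the paragraph-based chunking in the deleted SimpleKnowledgeBase
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if len(current) + len(para) <= chunk_size:
            current += para + "\n\n"
        else:
            if current.strip():
                chunks.append(current.strip())
            current = para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

# Six ~132-character paragraphs with a 300-character budget -> three chunks
text = "\n\n".join(f"Paragraph {i} " + "x" * 120 for i in range(6))
chunks = chunk_text(text, chunk_size=300)
print(len(chunks))  # → 3
```

Note the trade-off this approach makes: chunks never split a paragraph, so a single paragraph longer than `chunk_size` still becomes one oversized chunk.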
hf_space/requirements.txt
DELETED
@@ -1,5 +0,0 @@
-# HuggingFace Spaces Requirements
-gradio>=4.0.0
-chromadb>=0.4.0
-sentence-transformers>=2.2.0
-torch>=2.0.0
pyproject.toml
ADDED
@@ -0,0 +1,9 @@
+[project]
+name = "video-analyzer"
+version = "0.1.0"
+description = "A Gradio application"
+readme = "README.md"
+requires-python = ">=3.11"
+dependencies = [
+    "gradio>=6.0.0",
+]
pytest.ini
DELETED
@@ -1,9 +0,0 @@
-[pytest]
-testpaths = tests
-python_files = test_*.py
-python_classes = Test*
-python_functions = test_*
-addopts = -v --tb=short
-filterwarnings =
-    ignore::DeprecationWarning
-    ignore::UserWarning
requirements.txt
DELETED
@@ -1,44 +0,0 @@
-# Video Analyzer - Dependencies
-# 100% Free & Open Source
-
-# Core
-python-dotenv>=1.0.0
-typer[all]>=0.9.0
-rich>=13.0.0
-pydantic>=2.0.0
-pydantic-settings>=2.0.0
-tqdm>=4.66.0
-
-# Video/Audio
-yt-dlp>=2024.1.0
-
-# Transcription
-faster-whisper>=1.0.0
-
-# Document Processing
-PyMuPDF>=1.23.0
-python-docx>=1.0.0
-python-pptx>=0.6.23
-
-# OCR (optional - requires system tesseract)
-pytesseract>=0.3.10
-Pillow>=10.0.0
-
-# AI/ML - Multiple options
-transformers>=4.36.0  # Hugging Face models
-torch>=2.0.0  # PyTorch backend
-# ollama>=0.1.0  # Optional: Ollama client
-
-# Testing
-pytest>=7.4.0
-pytest-cov>=4.1.0
-
-# Phase 3: Knowledge Base
-sentence-transformers>=2.2.0
-chromadb>=0.4.0
-
-# Phase 5: Web UI
-gradio>=4.0.0
-
-# Web UI (Phase 5)
-# gradio>=4.0.0
src/__init__.py
DELETED
@@ -1,3 +0,0 @@
-"""Video Analyzer - Download, transcribe, and learn from video content."""
-
-__version__ = "0.1.0"
src/__pycache__/__init__.cpython-312.pyc
DELETED
Binary file (235 Bytes)

src/__pycache__/config.cpython-312.pyc
DELETED
Binary file (2.08 kB)

src/__pycache__/main.cpython-312.pyc
DELETED
Binary file (278 Bytes)
src/analyzers/__init__.py
DELETED
@@ -1,26 +0,0 @@
-"""AI analyzers for summarization and extraction."""
-
-from .chunker import chunk_text, chunk_for_summarization, TextChunk
-from .summarizer import Summarizer, summarize_file, Summary, OllamaClient
-from .huggingface import (
-    HuggingFaceLocal,
-    HuggingFaceAPI,
-    HuggingFaceTextGen,
-    summarize_with_huggingface,
-    list_recommended_models
-)
-
-__all__ = [
-    "chunk_text",
-    "chunk_for_summarization",
-    "TextChunk",
-    "Summarizer",
-    "summarize_file",
-    "Summary",
-    "OllamaClient",
-    "HuggingFaceLocal",
-    "HuggingFaceAPI",
-    "HuggingFaceTextGen",
-    "summarize_with_huggingface",
-    "list_recommended_models"
-]
src/analyzers/__pycache__/__init__.cpython-312.pyc
DELETED
Binary file (685 Bytes)

src/analyzers/__pycache__/chunker.cpython-312.pyc
DELETED
Binary file (3.85 kB)

src/analyzers/__pycache__/huggingface.cpython-312.pyc
DELETED
Binary file (14.9 kB)

src/analyzers/__pycache__/summarizer.cpython-312.pyc
DELETED
Binary file (14.3 kB)
src/analyzers/chunker.py
DELETED
@@ -1,118 +0,0 @@
-"""Text chunking for processing long documents with LLMs."""
-
-from dataclasses import dataclass
-from typing import Optional
-
-
-@dataclass
-class TextChunk:
-    """A chunk of text with metadata."""
-
-    text: str
-    index: int
-    start_char: int
-    end_char: int
-
-    @property
-    def word_count(self) -> int:
-        return len(self.text.split())
-
-
-def chunk_text(
-    text: str,
-    chunk_size: int = 4000,
-    chunk_overlap: int = 200,
-    separator: str = "\n\n"
-) -> list[TextChunk]:
-    """Split text into overlapping chunks.
-
-    Args:
-        text: Text to split
-        chunk_size: Maximum characters per chunk
-        chunk_overlap: Characters to overlap between chunks
-        separator: Preferred split point (paragraphs, sentences, etc.)
-
-    Returns:
-        List of TextChunk objects
-    """
-    if len(text) <= chunk_size:
-        return [TextChunk(text=text, index=0, start_char=0, end_char=len(text))]
-
-    chunks = []
-    start = 0
-    index = 0
-
-    while start < len(text):
-        # Find end of chunk
-        end = start + chunk_size
-
-        if end >= len(text):
-            # Last chunk
-            chunk_text = text[start:]
-            chunks.append(TextChunk(
-                text=chunk_text,
-                index=index,
-                start_char=start,
-                end_char=len(text)
-            ))
-            break
-
-        # Try to find a good break point
-        # Look for separator near the end of the chunk
-        search_start = max(start + chunk_size - 500, start)
-        search_end = min(start + chunk_size + 200, len(text))
-        search_text = text[search_start:search_end]
-
-        # Find last separator in search range
-        sep_pos = search_text.rfind(separator)
-        if sep_pos != -1:
-            end = search_start + sep_pos + len(separator)
-        else:
-            # Fall back to sentence end
-            for punct in [". ", "! ", "? ", "\n"]:
-                punct_pos = search_text.rfind(punct)
-                if punct_pos != -1:
-                    end = search_start + punct_pos + len(punct)
-                    break
-
-        # Create chunk
-        chunk_text = text[start:end].strip()
-        if chunk_text:
-            chunks.append(TextChunk(
-                text=chunk_text,
-                index=index,
-                start_char=start,
-                end_char=end
-            ))
-            index += 1
-
-        # Move start with overlap
-        start = end - chunk_overlap
-
-    return chunks
-
-
-def chunk_for_summarization(
-    text: str,
-    max_tokens: int = 3000,
-    chars_per_token: float = 4.0
-) -> list[TextChunk]:
-    """Chunk text optimized for LLM summarization.
-
-    Args:
-        text: Text to chunk
-        max_tokens: Maximum tokens per chunk (for LLM context)
-        chars_per_token: Approximate characters per token
-
-    Returns:
-        List of TextChunk objects
-    """
-    chunk_size = int(max_tokens * chars_per_token)
-    overlap = int(chunk_size * 0.05)  # 5% overlap for context
-
-    return chunk_text(text, chunk_size=chunk_size, chunk_overlap=overlap)
-
-
-def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
-    """Estimate number of tokens in text."""
-    return int(len(text) / chars_per_token)
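The deleted `chunk_for_summarization` sized its chunks with a simple characters-per-token heuristic. The arithmetic can be checked standalone; this sketch reproduces just that sizing logic (names mirror the deleted module, but the snippet is illustrative):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    # Rough heuristic: ~4 characters per English token
    return int(len(text) / chars_per_token)

# chunk_for_summarization(max_tokens=3000) translated into character budgets
max_tokens = 3000
chars_per_token = 4.0
chunk_size = int(max_tokens * chars_per_token)  # 12000 characters per chunk
overlap = int(chunk_size * 0.05)                # 600 characters (5% overlap)

print(chunk_size, overlap, estimate_tokens("x" * 12000))  # → 12000 600 3000
```

The 5% overlap means consecutive chunks share roughly 150 tokens of context, at the cost of re-summarizing that text once per boundary.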
src/analyzers/huggingface.py
DELETED
@@ -1,407 +0,0 @@
-"""AI summarization using Hugging Face (local models or API)."""
-
-from dataclasses import dataclass
-from pathlib import Path
-from typing import Optional
-import os
-
-from rich.console import Console
-from rich.progress import Progress, SpinnerColumn, TextColumn
-
-from src.config import settings
-from src.analyzers.chunker import chunk_for_summarization, estimate_tokens
-
-console = Console()
-
-
-# Recommended models for different tasks
-RECOMMENDED_MODELS = {
-    "summarization": {
-        "small": "facebook/bart-large-cnn",  # Fast, good for news-style
-        "medium": "google/flan-t5-base",  # Balanced
-        "large": "google/flan-t5-large",  # Better quality
-        "best": "facebook/bart-large-xsum",  # Abstractive summaries
-    },
-    "text_generation": {
-        "small": "microsoft/phi-2",  # 2.7B, very fast
-        "medium": "mistralai/Mistral-7B-Instruct-v0.2",  # 7B, good quality
-        "large": "meta-llama/Llama-2-7b-chat-hf",  # Requires access
-    }
-}
-
-
-class HuggingFaceLocal:
-    """Run Hugging Face models locally."""
-
-    def __init__(
-        self,
-        model_name: str = "facebook/bart-large-cnn",
-        device: str = "auto"
-    ):
-        self.model_name = model_name
-        self.device = device
-        self._model = None
-        self._tokenizer = None
-
-    def _load_model(self):
-        """Lazy load the model."""
-        if self._model is None:
-            console.print(f"[bold green]Loading model:[/] {self.model_name}")
-            console.print("[dim]This may take a few minutes on first run...[/]")
-
-            try:
-                from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
-                import torch
-            except ImportError:
-                raise ImportError(
-                    "Transformers not installed. Run:\n"
-                    "  pip install transformers torch"
-                )
-
-            # Determine device
-            if self.device == "auto":
-                device = 0 if torch.cuda.is_available() else -1
-            else:
-                device = 0 if self.device == "cuda" else -1
-
-            # Load tokenizer and model
-            self._tokenizer = AutoTokenizer.from_pretrained(self.model_name)
-
-            # Use pipeline for easier inference
-            self._pipeline = pipeline(
-                "summarization",
-                model=self.model_name,
-                tokenizer=self._tokenizer,
-                device=device
-            )
-
-            device_name = "GPU" if device >= 0 else "CPU"
-            console.print(f"[green]✓[/] Model loaded on {device_name}")
-
-    def summarize(
-        self,
-        text: str,
-        max_length: int = 500,
-        min_length: int = 100
-    ) -> str:
-        """Summarize text using local model.
-
-        Args:
-            text: Text to summarize
-            max_length: Maximum summary length in tokens
-            min_length: Minimum summary length in tokens
-
-        Returns:
-            Summary text
-        """
-        self._load_model()
-
-        # Handle long texts by chunking
-        tokens = estimate_tokens(text)
-
-        if tokens > 1000:  # BART/T5 have ~1024 token limit
-            return self._summarize_chunks(text, max_length, min_length)
-
-        result = self._pipeline(
-            text,
-            max_length=max_length,
-            min_length=min_length,
-            do_sample=False
-        )
-
-        return result[0]["summary_text"]
-
-    def _summarize_chunks(
-        self,
-        text: str,
-        max_length: int,
-        min_length: int
-    ) -> str:
-        """Summarize long text in chunks."""
-        chunks = chunk_for_summarization(text, max_tokens=800)
-        console.print(f"[bold blue]Processing {len(chunks)} chunks...[/]")
-
-        chunk_summaries = []
-
-        with Progress(
-            SpinnerColumn(),
-            TextColumn("[progress.description]{task.description}"),
-            console=console
-        ) as progress:
-            task = progress.add_task("Summarizing...", total=len(chunks))
-
-            for i, chunk in enumerate(chunks):
-                progress.update(task, description=f"Chunk {i+1}/{len(chunks)}")
-
-                result = self._pipeline(
-                    chunk.text,
-                    max_length=max_length // len(chunks) + 50,
-                    min_length=min_length // len(chunks),
-                    do_sample=False
-                )
-                chunk_summaries.append(result[0]["summary_text"])
-                progress.advance(task)
-
-        # Combine summaries
-        combined = " ".join(chunk_summaries)
-
-        # If combined is still long, summarize again
-        if len(combined) > 2000:
-            console.print("[bold blue]Creating final summary...[/]")
-            result = self._pipeline(
-                combined,
-                max_length=max_length,
-                min_length=min_length,
-                do_sample=False
-            )
-            return result[0]["summary_text"]
-
-        return combined
-
-
-class HuggingFaceAPI:
-    """Use Hugging Face Inference API (free tier available)."""
-
-    def __init__(
-        self,
-        model_name: str = "facebook/bart-large-cnn",
-        api_key: Optional[str] = None
-    ):
-        self.model_name = model_name
-        # Check multiple sources for API key
-        self.api_key = (
-            api_key or
-            os.getenv("HUGGINGFACE_API_KEY") or
-            os.getenv("VIDEO_ANALYZER_HUGGINGFACE_API_KEY")
-        )
-
-        # Also try loading from settings
-        if not self.api_key:
-            try:
-                from src.config import settings
-                self.api_key = settings.huggingface_api_key
-            except:
-                pass
-
-        self.api_url = f"https://router.huggingface.co/hf-inference/models/{model_name}"
-
-    def _check_api_key(self):
-        if not self.api_key:
-            raise ValueError(
-                "Hugging Face API key not found.\n"
-                "Set it via:\n"
-                "  export HUGGINGFACE_API_KEY=your_key\n"
-                "Or get a free key at: https://huggingface.co/settings/tokens"
-            )
-
-    def summarize(
-        self,
-        text: str,
-        max_length: int = 500,
-        min_length: int = 100
-    ) -> str:
-        """Summarize text using Hugging Face API.
-
-        Args:
-            text: Text to summarize
-            max_length: Maximum summary length
-            min_length: Minimum summary length
-
-        Returns:
-            Summary text
-        """
-        self._check_api_key()
-
-        try:
-            import requests
-        except ImportError:
-            raise ImportError("requests not installed. Run: pip install requests")
-
-        headers = {"Authorization": f"Bearer {self.api_key}"}
-
-        # Handle long texts
-        tokens = estimate_tokens(text)
-        if tokens > 1000:
-            return self._summarize_chunks_api(text, max_length, min_length, headers)
|
| 226 |
-
|
| 227 |
-
payload = {
|
| 228 |
-
"inputs": text,
|
| 229 |
-
"parameters": {
|
| 230 |
-
"max_length": max_length,
|
| 231 |
-
"min_length": min_length,
|
| 232 |
-
"do_sample": False
|
| 233 |
-
}
|
| 234 |
-
}
|
| 235 |
-
|
| 236 |
-
with console.status("[bold green]Calling Hugging Face API..."):
|
| 237 |
-
response = requests.post(self.api_url, headers=headers, json=payload)
|
| 238 |
-
|
| 239 |
-
if response.status_code != 200:
|
| 240 |
-
error = response.json().get("error", response.text)
|
| 241 |
-
raise Exception(f"API error: {error}")
|
| 242 |
-
|
| 243 |
-
result = response.json()
|
| 244 |
-
|
| 245 |
-
if isinstance(result, list) and len(result) > 0:
|
| 246 |
-
return result[0].get("summary_text", str(result))
|
| 247 |
-
|
| 248 |
-
return str(result)
|
| 249 |
-
|
| 250 |
-
def _summarize_chunks_api(
|
| 251 |
-
self,
|
| 252 |
-
text: str,
|
| 253 |
-
max_length: int,
|
| 254 |
-
min_length: int,
|
| 255 |
-
headers: dict
|
| 256 |
-
) -> str:
|
| 257 |
-
"""Summarize chunks via API."""
|
| 258 |
-
import requests
|
| 259 |
-
|
| 260 |
-
chunks = chunk_for_summarization(text, max_tokens=800)
|
| 261 |
-
console.print(f"[bold blue]Processing {len(chunks)} chunks via API...[/]")
|
| 262 |
-
|
| 263 |
-
chunk_summaries = []
|
| 264 |
-
|
| 265 |
-
for i, chunk in enumerate(chunks):
|
| 266 |
-
console.print(f"[dim]Chunk {i+1}/{len(chunks)}...[/]")
|
| 267 |
-
|
| 268 |
-
payload = {
|
| 269 |
-
"inputs": chunk.text,
|
| 270 |
-
"parameters": {
|
| 271 |
-
"max_length": max_length // len(chunks) + 50,
|
| 272 |
-
"min_length": min(30, min_length // len(chunks)),
|
| 273 |
-
"do_sample": False
|
| 274 |
-
}
|
| 275 |
-
}
|
| 276 |
-
|
| 277 |
-
response = requests.post(self.api_url, headers=headers, json=payload)
|
| 278 |
-
|
| 279 |
-
if response.status_code == 200:
|
| 280 |
-
result = response.json()
|
| 281 |
-
if isinstance(result, list) and len(result) > 0:
|
| 282 |
-
chunk_summaries.append(result[0].get("summary_text", ""))
|
| 283 |
-
|
| 284 |
-
return " ".join(chunk_summaries)
|
| 285 |
-
|
| 286 |
-
|
| 287 |
-
class HuggingFaceTextGen:
|
| 288 |
-
"""Use Hugging Face for text generation (like Ollama alternative)."""
|
| 289 |
-
|
| 290 |
-
def __init__(
|
| 291 |
-
self,
|
| 292 |
-
model_name: str = "microsoft/phi-2",
|
| 293 |
-
device: str = "auto"
|
| 294 |
-
):
|
| 295 |
-
self.model_name = model_name
|
| 296 |
-
self.device = device
|
| 297 |
-
self._pipeline = None
|
| 298 |
-
|
| 299 |
-
def _load_model(self):
|
| 300 |
-
"""Lazy load the model."""
|
| 301 |
-
if self._pipeline is None:
|
| 302 |
-
console.print(f"[bold green]Loading model:[/] {self.model_name}")
|
| 303 |
-
console.print("[dim]This may download several GB on first run...[/]")
|
| 304 |
-
|
| 305 |
-
try:
|
| 306 |
-
from transformers import pipeline
|
| 307 |
-
import torch
|
| 308 |
-
except ImportError:
|
| 309 |
-
raise ImportError(
|
| 310 |
-
"Transformers not installed. Run:\n"
|
| 311 |
-
" pip install transformers torch accelerate"
|
| 312 |
-
)
|
| 313 |
-
|
| 314 |
-
# Determine device
|
| 315 |
-
if self.device == "auto":
|
| 316 |
-
device = "cuda" if torch.cuda.is_available() else "cpu"
|
| 317 |
-
else:
|
| 318 |
-
device = self.device
|
| 319 |
-
|
| 320 |
-
self._pipeline = pipeline(
|
| 321 |
-
"text-generation",
|
| 322 |
-
model=self.model_name,
|
| 323 |
-
device_map="auto" if device == "cuda" else None,
|
| 324 |
-
torch_dtype="auto"
|
| 325 |
-
)
|
| 326 |
-
|
| 327 |
-
console.print(f"[green]✓[/] Model loaded on {device}")
|
| 328 |
-
|
| 329 |
-
def generate(
|
| 330 |
-
self,
|
| 331 |
-
prompt: str,
|
| 332 |
-
max_new_tokens: int = 500,
|
| 333 |
-
temperature: float = 0.7
|
| 334 |
-
) -> str:
|
| 335 |
-
"""Generate text from prompt.
|
| 336 |
-
|
| 337 |
-
Args:
|
| 338 |
-
prompt: Input prompt
|
| 339 |
-
max_new_tokens: Maximum tokens to generate
|
| 340 |
-
temperature: Creativity (0-1)
|
| 341 |
-
|
| 342 |
-
Returns:
|
| 343 |
-
Generated text
|
| 344 |
-
"""
|
| 345 |
-
self._load_model()
|
| 346 |
-
|
| 347 |
-
result = self._pipeline(
|
| 348 |
-
prompt,
|
| 349 |
-
max_new_tokens=max_new_tokens,
|
| 350 |
-
temperature=temperature,
|
| 351 |
-
do_sample=temperature > 0,
|
| 352 |
-
pad_token_id=self._pipeline.tokenizer.eos_token_id
|
| 353 |
-
)
|
| 354 |
-
|
| 355 |
-
generated = result[0]["generated_text"]
|
| 356 |
-
|
| 357 |
-
# Remove the prompt from output
|
| 358 |
-
if generated.startswith(prompt):
|
| 359 |
-
generated = generated[len(prompt):].strip()
|
| 360 |
-
|
| 361 |
-
return generated
|
| 362 |
-
|
| 363 |
-
|
| 364 |
-
def summarize_with_huggingface(
|
| 365 |
-
text: str,
|
| 366 |
-
model: str = "facebook/bart-large-cnn",
|
| 367 |
-
use_api: bool = False,
|
| 368 |
-
api_key: Optional[str] = None,
|
| 369 |
-
max_length: int = 500
|
| 370 |
-
) -> str:
|
| 371 |
-
"""Convenience function to summarize with Hugging Face.
|
| 372 |
-
|
| 373 |
-
Args:
|
| 374 |
-
text: Text to summarize
|
| 375 |
-
model: Model name
|
| 376 |
-
use_api: If True, use API instead of local
|
| 377 |
-
api_key: API key (if using API)
|
| 378 |
-
max_length: Maximum summary length
|
| 379 |
-
|
| 380 |
-
Returns:
|
| 381 |
-
Summary text
|
| 382 |
-
"""
|
| 383 |
-
if use_api:
|
| 384 |
-
client = HuggingFaceAPI(model, api_key)
|
| 385 |
-
else:
|
| 386 |
-
client = HuggingFaceLocal(model)
|
| 387 |
-
|
| 388 |
-
return client.summarize(text, max_length=max_length)
|
| 389 |
-
|
| 390 |
-
|
| 391 |
-
def list_recommended_models():
|
| 392 |
-
"""Display recommended Hugging Face models."""
|
| 393 |
-
from rich.table import Table
|
| 394 |
-
|
| 395 |
-
table = Table(title="Recommended Hugging Face Models")
|
| 396 |
-
table.add_column("Task", style="cyan")
|
| 397 |
-
table.add_column("Size", style="white")
|
| 398 |
-
table.add_column("Model", style="green")
|
| 399 |
-
table.add_column("Notes", style="dim")
|
| 400 |
-
|
| 401 |
-
table.add_row("Summarization", "Small", "facebook/bart-large-cnn", "Fast, news-style")
|
| 402 |
-
table.add_row("Summarization", "Medium", "google/flan-t5-base", "Balanced")
|
| 403 |
-
table.add_row("Summarization", "Large", "google/flan-t5-large", "Better quality")
|
| 404 |
-
table.add_row("Text Gen", "Small", "microsoft/phi-2", "2.7B, very fast")
|
| 405 |
-
table.add_row("Text Gen", "Medium", "mistralai/Mistral-7B-Instruct-v0.2", "7B, good")
|
| 406 |
-
|
| 407 |
-
console.print(table)
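Both deleted `_summarize_chunks` paths split the overall length budget evenly across chunks (`max_length // len(chunks) + 50`, with the API path capping per-chunk `min_length` at 30). A standalone sketch of that budgeting logic, with an illustrative function name that is not part of the removed module:

```python
def chunk_length_budget(max_length: int, min_length: int, n_chunks: int) -> tuple[int, int]:
    """Per-chunk (max_length, min_length) budget, mirroring the deleted logic."""
    # Each chunk gets an equal share of the budget, plus 50 tokens of slack.
    per_chunk_max = max_length // n_chunks + 50
    # The API path additionally capped per-chunk min_length at 30.
    per_chunk_min = min(30, min_length // n_chunks)
    return per_chunk_max, per_chunk_min

print(chunk_length_budget(500, 100, 4))  # (175, 25)
```

With the defaults (`max_length=500`, `min_length=100`) and four chunks, each chunk is asked for a 25–175 token summary.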
src/analyzers/summarizer.py
DELETED
@@ -1,410 +0,0 @@
-"""AI-powered summarization using Ollama (local, free)."""
-
-import json
-import subprocess
-from dataclasses import dataclass
-from pathlib import Path
-from typing import Optional
-
-from rich.console import Console
-from rich.progress import Progress, SpinnerColumn, TextColumn
-
-from src.config import settings
-from src.analyzers.chunker import chunk_for_summarization, estimate_tokens
-
-console = Console()
-
-
-@dataclass
-class Summary:
-    """A generated summary."""
-
-    text: str
-    source_path: Optional[Path]
-    model: str
-    summary_type: str  # quick, detailed, study_notes
-    original_length: int
-    summary_length: int
-
-    @property
-    def compression_ratio(self) -> float:
-        """How much the text was compressed."""
-        if self.original_length == 0:
-            return 0
-        return self.summary_length / self.original_length
-
-    def save(self, output_path: Optional[Path] = None) -> Path:
-        """Save summary to file."""
-        if output_path is None:
-            stem = self.source_path.stem if self.source_path else "summary"
-            output_path = settings.summaries_dir / f"{stem}_{self.summary_type}.md"
-
-        output_path.parent.mkdir(parents=True, exist_ok=True)
-        output_path.write_text(self.text)
-        return output_path
-
-
-# Prompts for different summary types
-PROMPTS = {
-    "quick": """Summarize the following text in 2-3 paragraphs. Focus on the main points and key takeaways.
-
-TEXT:
-{text}
-
-SUMMARY:""",
-
-    "detailed": """Create a detailed summary of the following text. Include:
-- Main topics covered
-- Key points and concepts
-- Important details and examples
-- Actionable insights
-
-TEXT:
-{text}
-
-DETAILED SUMMARY:""",
-
-    "study_notes": """Create comprehensive study notes from the following text. Format as:
-
-## Key Concepts
-- List main concepts with brief explanations
-
-## Important Points
-- Bullet points of critical information
-
-## Definitions
-- Any important terms defined
-
-## Action Items
-- Practical steps or strategies mentioned
-
-## Summary
-- Brief overall summary
-
-TEXT:
-{text}
-
-STUDY NOTES:""",

-    "real_estate": """You are a real estate expert. Analyze the following content and extract:
-
-## Key Real Estate Concepts
-- Investment strategies mentioned
-- Market analysis techniques
-- Deal evaluation methods
-
-## Financial Metrics
-- ROI, Cap Rate, Cash-on-Cash calculations if mentioned
-- Financing strategies
-
-## Negotiation & Strategy
-- Negotiation tactics
-- Deal structuring advice
-
-## Action Items
-- Practical steps to take
-
-## Critical Warnings
-- Risks or pitfalls mentioned
-
-TEXT:
-{text}
-
-REAL ESTATE ANALYSIS:"""
-}
-
-
-class OllamaClient:
-    """Client for interacting with Ollama."""
-
-    def __init__(self, model: str = "llama3"):
-        self.model = model
-        self._verified = False
-
-    def is_available(self) -> bool:
-        """Check if Ollama is running."""
-        try:
-            result = subprocess.run(
-                ["ollama", "list"],
-                capture_output=True,
-                text=True,
-                timeout=5
-            )
-            return result.returncode == 0
-        except (subprocess.TimeoutExpired, FileNotFoundError):
-            return False
-
-    def list_models(self) -> list[str]:
-        """List available Ollama models."""
-        try:
-            result = subprocess.run(
-                ["ollama", "list"],
-                capture_output=True,
-                text=True,
-                timeout=10
-            )
-            if result.returncode != 0:
-                return []
-
-            models = []
-            for line in result.stdout.strip().split("\n")[1:]:  # Skip header
-                if line.strip():
-                    model_name = line.split()[0]
-                    models.append(model_name)
-            return models
-        except Exception:
-            return []
-
-    def pull_model(self, model: Optional[str] = None) -> bool:
-        """Pull/download a model."""
-        model = model or self.model
-        console.print(f"[bold green]Pulling model:[/] {model}")
-
-        try:
-            result = subprocess.run(
-                ["ollama", "pull", model],
-                capture_output=False,
-                timeout=600  # 10 minutes
-            )
-            return result.returncode == 0
-        except Exception as e:
-            console.print(f"[red]Error pulling model:[/] {e}")
-            return False
-
-    def generate(
-        self,
-        prompt: str,
-        system: Optional[str] = None,
-        temperature: float = 0.7,
-        max_tokens: int = 2000
-    ) -> str:
-        """Generate text using Ollama.
-
-        Args:
-            prompt: The prompt to send
-            system: Optional system message
-            temperature: Creativity (0-1)
-            max_tokens: Maximum response length
-
-        Returns:
-            Generated text
-        """
-        # Build the request
-        request = {
-            "model": self.model,
-            "prompt": prompt,
-            "stream": False,
-            "options": {
-                "temperature": temperature,
-                "num_predict": max_tokens
-            }
-        }
-
-        if system:
-            request["system"] = system
-
-        try:
-            # Use ollama CLI with run command
-            full_prompt = prompt
-            if system:
-                full_prompt = f"System: {system}\n\n{prompt}"
-
-            result = subprocess.run(
-                ["ollama", "run", self.model],
-                input=full_prompt,
-                capture_output=True,
-                text=True,
-                timeout=300  # 5 minutes
-            )
-
-            if result.returncode != 0:
-                raise Exception(f"Ollama error: {result.stderr}")
-
-            return result.stdout.strip()
-
-        except subprocess.TimeoutExpired:
-            raise Exception("Ollama request timed out")
-        except FileNotFoundError:
-            raise Exception(
-                "Ollama not found. Install it:\n"
-                "  curl -fsSL https://ollama.com/install.sh | sh\n"
-                "  ollama pull llama3"
-            )
-
-
-class Summarizer:
-    """Summarize text using Ollama."""
-
-    def __init__(self, model: str = "llama3"):
-        self.client = OllamaClient(model)
-        self.model = model
-
-    def summarize(
-        self,
-        text: str,
-        summary_type: str = "detailed",
-        source_path: Optional[Path] = None
-    ) -> Summary:
-        """Summarize text using Ollama.
-
-        Args:
-            text: Text to summarize
-            summary_type: Type of summary (quick, detailed, study_notes, real_estate)
-            source_path: Optional path to source file
-
-        Returns:
-            Summary object
-        """
-        # Check Ollama availability
-        if not self.client.is_available():
-            raise Exception(
-                "Ollama is not running. Start it with:\n"
-                "  ollama serve\n"
-                "Or install: curl -fsSL https://ollama.com/install.sh | sh"
-            )
-
-        # Check if model is available
-        models = self.client.list_models()
-        if self.model not in models and f"{self.model}:latest" not in models:
-            console.print(f"[yellow]Model {self.model} not found. Pulling...[/]")
-            self.client.pull_model()
-
-        # Get prompt template
-        prompt_template = PROMPTS.get(summary_type, PROMPTS["detailed"])
-
-        # Check if text needs chunking
-        tokens = estimate_tokens(text)
-
-        if tokens > 3000:
-            # Process in chunks
-            console.print(f"[bold blue]Text is long ({tokens} tokens). Processing in chunks...[/]")
-            return self._summarize_chunks(text, summary_type, source_path, prompt_template)
-        else:
-            # Process directly
-            return self._summarize_single(text, summary_type, source_path, prompt_template)
-
-    def _summarize_single(
-        self,
-        text: str,
-        summary_type: str,
-        source_path: Optional[Path],
-        prompt_template: str
-    ) -> Summary:
-        """Summarize a single chunk of text."""
-        prompt = prompt_template.format(text=text)
-
-        with Progress(
-            SpinnerColumn(),
-            TextColumn("[progress.description]{task.description}"),
-            console=console
-        ) as progress:
-            progress.add_task(f"Generating {summary_type} summary...", total=None)
-
-            response = self.client.generate(prompt)
-
-        console.print(f"[green]✓[/] Summary generated")
-
-        return Summary(
-            text=response,
-            source_path=source_path,
-            model=self.model,
-            summary_type=summary_type,
-            original_length=len(text),
-            summary_length=len(response)
-        )
-
-    def _summarize_chunks(
-        self,
-        text: str,
-        summary_type: str,
-        source_path: Optional[Path],
-        prompt_template: str
-    ) -> Summary:
-        """Summarize text in chunks, then combine."""
-        chunks = chunk_for_summarization(text)
-        console.print(f"[bold blue]Split into {len(chunks)} chunks[/]")
-
-        chunk_summaries = []
-
-        with Progress(
-            SpinnerColumn(),
-            TextColumn("[progress.description]{task.description}"),
-            console=console
-        ) as progress:
-            task = progress.add_task("Processing chunks...", total=len(chunks))
-
-            for i, chunk in enumerate(chunks):
-                progress.update(task, description=f"Processing chunk {i+1}/{len(chunks)}...")
-
-                prompt = prompt_template.format(text=chunk.text)
-                response = self.client.generate(prompt)
-                chunk_summaries.append(response)
-
-                progress.advance(task)
-
-        # Combine chunk summaries
-        if len(chunk_summaries) > 1:
-            console.print("[bold blue]Combining chunk summaries...[/]")
-
-            combined_text = "\n\n---\n\n".join(chunk_summaries)
-
-            combine_prompt = f"""Combine these summaries into one coherent {summary_type} summary.
-Remove redundancy and organize the information clearly.
-
-SUMMARIES:
-{combined_text}
-
-COMBINED SUMMARY:"""
-
-            final_response = self.client.generate(combine_prompt)
-        else:
-            final_response = chunk_summaries[0]
-
-        console.print(f"[green]✓[/] Summary generated")
-
-        return Summary(
-            text=final_response,
-            source_path=source_path,
-            model=self.model,
-            summary_type=summary_type,
-            original_length=len(text),
-            summary_length=len(final_response)
-        )
-
-
-def summarize_file(
-    path: Path,
-    summary_type: str = "detailed",
-    model: str = "llama3",
-    output_dir: Optional[Path] = None
-) -> Summary:
-    """Summarize a transcript or document file.
-
-    Args:
-        path: Path to text file
-        summary_type: Type of summary
-        model: Ollama model to use
-        output_dir: Output directory for summary
-
-    Returns:
-        Summary object
-    """
-    path = Path(path)
-    output_dir = output_dir or settings.summaries_dir
-
-    if not path.exists():
-        raise FileNotFoundError(f"File not found: {path}")
-
-    console.print(f"[bold green]Summarizing:[/] {path.name}")
-
-    text = path.read_text(encoding="utf-8", errors="ignore")
-
-    summarizer = Summarizer(model=model)
-    summary = summarizer.summarize(text, summary_type=summary_type, source_path=path)
-
-    # Save summary
-    output_path = output_dir / f"{path.stem}_{summary_type}.md"
-    summary.save(output_path)
-    console.print(f"[green]✓[/] Saved: {output_path}")
-
-    return summary
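The deleted `Summarizer._summarize_chunks` followed a map-then-reduce pattern: summarize each chunk, then ask the model to combine the partial summaries. A minimal, self-contained sketch of that flow, with a stub standing in for the Ollama call (function and variable names here are illustrative, not the removed API):

```python
def summarize_chunks(chunks: list[str], generate) -> str:
    """Summarize each chunk, then combine if there is more than one."""
    # Map step: one summary per chunk.
    chunk_summaries = [generate(f"Summarize:\n{c}") for c in chunks]
    if len(chunk_summaries) > 1:
        # Reduce step: join partials and ask for one coherent summary.
        combined = "\n\n---\n\n".join(chunk_summaries)
        return generate(f"Combine these summaries:\n{combined}")
    return chunk_summaries[0]

# Stub model: "summarizes" by returning the first line of the prompt body.
stub = lambda prompt: prompt.splitlines()[1][:20]
print(summarize_chunks(["alpha text", "beta text"], stub))  # alpha text
```

In the real module, `generate` was `OllamaClient.generate` and the prompts came from the `PROMPTS` templates above.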
src/config.py
DELETED
@@ -1,50 +0,0 @@
-"""Configuration management for Video Analyzer."""
-
-from pathlib import Path
-from typing import Optional
-from pydantic_settings import BaseSettings, SettingsConfigDict
-
-
-class Settings(BaseSettings):
-    """Application settings."""
-
-    # Paths
-    base_dir: Path = Path(__file__).parent.parent
-    data_dir: Path = base_dir / "data"
-    downloads_dir: Path = data_dir / "downloads"
-    audio_dir: Path = data_dir / "audio"
-    transcripts_dir: Path = data_dir / "transcripts"
-    summaries_dir: Path = data_dir / "summaries"
-
-    # yt-dlp settings
-    video_format: str = "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best"
-    audio_format: str = "bestaudio[ext=m4a]/bestaudio/best"
-
-    # Whisper settings
-    whisper_model: str = "base"  # tiny, base, small, medium, large-v3
-    whisper_device: str = "auto"  # auto, cpu, cuda
-
-    # AI settings
-    ai_backend: str = "huggingface"  # ollama, huggingface, huggingface-api
-    huggingface_api_key: Optional[str] = None
-    ollama_model: str = "llama3"
-    huggingface_model: str = "facebook/bart-large-cnn"
-
-    # Processing
-    max_concurrent_downloads: int = 3
-
-    model_config = SettingsConfigDict(
-        env_prefix="VIDEO_ANALYZER_",
-        env_file=".env",
-        env_file_encoding="utf-8",
-        extra="ignore"  # Ignore extra env vars
-    )
-
-
-# Global settings instance
-settings = Settings()
-
-# Ensure directories exist
-for dir_path in [settings.downloads_dir, settings.audio_dir,
-                 settings.transcripts_dir, settings.summaries_dir]:
-    dir_path.mkdir(parents=True, exist_ok=True)
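The deleted `Settings` class relied on pydantic-settings to resolve each field from a `VIDEO_ANALYZER_`-prefixed environment variable before falling back to the declared default (which is also why `VIDEO_ANALYZER_HUGGINGFACE_API_KEY` worked alongside `HUGGINGFACE_API_KEY`). A hand-rolled sketch of that lookup rule, without the pydantic dependency:

```python
import os

def resolve_setting(name: str, default: str, prefix: str = "VIDEO_ANALYZER_") -> str:
    """Resolve a setting the way env_prefix="VIDEO_ANALYZER_" did:
    read PREFIX + FIELD_NAME from the environment, else use the default."""
    return os.environ.get(prefix + name.upper(), default)

os.environ["VIDEO_ANALYZER_WHISPER_MODEL"] = "small"
print(resolve_setting("whisper_model", "base"))      # small
print(resolve_setting("ai_backend", "huggingface"))  # huggingface
```

pydantic-settings additionally merges values from the `.env` file and validates types, which this sketch omits.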
src/downloaders/__init__.py
DELETED
@@ -1,6 +0,0 @@
-"""Video downloaders and file handling."""
-
-from .youtube import YouTubeDownloader
-from .files import scan_files, import_files, FileInfo, get_file_type
-
-__all__ = ["YouTubeDownloader", "scan_files", "import_files", "FileInfo", "get_file_type"]
src/downloaders/files.py
DELETED
@@ -1,177 +0,0 @@
-"""Direct file and folder processing support."""
-
-import shutil
-from dataclasses import dataclass
-from pathlib import Path
-from typing import Optional
-
-from rich.console import Console
-from rich.table import Table
-
-from src.config import settings
-
-console = Console()
-
-
-@dataclass
-class FileInfo:
-    """Information about a local file."""
-
-    path: Path
-    name: str
-    size: int
-    file_type: str  # video, audio, document, image
-    extension: str
-
-    @property
-    def size_formatted(self) -> str:
-        """Return human-readable file size."""
-        if self.size >= 1024 * 1024 * 1024:
-            return f"{self.size / (1024**3):.1f} GB"
-        elif self.size >= 1024 * 1024:
-            return f"{self.size / (1024**2):.1f} MB"
-        elif self.size >= 1024:
-            return f"{self.size / 1024:.1f} KB"
-        return f"{self.size} B"
-
-
-# File type mappings
-VIDEO_EXTENSIONS = {".mp4", ".mkv", ".avi", ".mov", ".webm", ".flv", ".wmv", ".m4v"}
-AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".aac", ".ogg", ".wma"}
-DOCUMENT_EXTENSIONS = {".pdf", ".docx", ".doc", ".pptx", ".ppt", ".txt", ".md", ".rtf"}
-IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".webp"}
-
-ALL_SUPPORTED = VIDEO_EXTENSIONS | AUDIO_EXTENSIONS | DOCUMENT_EXTENSIONS | IMAGE_EXTENSIONS
-
-
-def get_file_type(path: Path) -> str:
-    """Determine file type from extension."""
-    ext = path.suffix.lower()
-    if ext in VIDEO_EXTENSIONS:
-        return "video"
-    elif ext in AUDIO_EXTENSIONS:
-        return "audio"
-    elif ext in DOCUMENT_EXTENSIONS:
-        return "document"
-    elif ext in IMAGE_EXTENSIONS:
-        return "image"
-    return "unknown"
-
-
-def scan_files(
-    path: Path,
-    recursive: bool = True,
-    file_types: Optional[list[str]] = None
-) -> list[FileInfo]:
-    """Scan a file or directory for supported files.
-
-    Args:
-        path: File or directory path
-        recursive: If True, scan subdirectories
-        file_types: Filter by type - ['video', 'audio', 'document', 'image']
-
-    Returns:
-        List of FileInfo objects
-    """
-    path = Path(path)
-    files = []
-
-    if path.is_file():
-        # Single file
-        if path.suffix.lower() in ALL_SUPPORTED:
-            file_type = get_file_type(path)
-            if file_types is None or file_type in file_types:
-                files.append(FileInfo(
-                    path=path,
-                    name=path.name,
-                    size=path.stat().st_size,
-                    file_type=file_type,
-                    extension=path.suffix.lower()
-                ))
-    elif path.is_dir():
-        # Directory
-        pattern = "**/*" if recursive else "*"
-        for file_path in path.glob(pattern):
-            if file_path.is_file() and file_path.suffix.lower() in ALL_SUPPORTED:
-                file_type = get_file_type(file_path)
-                if file_types is None or file_type in file_types:
-                    files.append(FileInfo(
-                        path=file_path,
-                        name=file_path.name,
-                        size=file_path.stat().st_size,
-                        file_type=file_type,
-                        extension=file_path.suffix.lower()
-                    ))
-
-    # Sort by name
-    files.sort(key=lambda f: f.name.lower())
-    return files
-
-
-def import_files(
-    source: Path,
-    dest_dir: Optional[Path] = None,
-    copy: bool = True,
-    recursive: bool = True
-) -> list[FileInfo]:
-    """Import files from a source location to the data directory.
|
| 118 |
-
|
| 119 |
-
Args:
|
| 120 |
-
source: Source file or directory
|
| 121 |
-
dest_dir: Destination directory (default: data/downloads)
|
| 122 |
-
copy: If True, copy files. If False, move files.
|
| 123 |
-
recursive: If True, scan subdirectories
|
| 124 |
-
|
| 125 |
-
Returns:
|
| 126 |
-
List of imported FileInfo objects
|
| 127 |
-
"""
|
| 128 |
-
source = Path(source)
|
| 129 |
-
dest_dir = dest_dir or settings.downloads_dir
|
| 130 |
-
dest_dir.mkdir(parents=True, exist_ok=True)
|
| 131 |
-
|
| 132 |
-
files = scan_files(source, recursive=recursive)
|
| 133 |
-
imported = []
|
| 134 |
-
|
| 135 |
-
for file_info in files:
|
| 136 |
-
dest_path = dest_dir / file_info.name
|
| 137 |
-
|
| 138 |
-
# Handle duplicates
|
| 139 |
-
if dest_path.exists():
|
| 140 |
-
stem = dest_path.stem
|
| 141 |
-
suffix = dest_path.suffix
|
| 142 |
-
counter = 1
|
| 143 |
-
while dest_path.exists():
|
| 144 |
-
dest_path = dest_dir / f"{stem}_{counter}{suffix}"
|
| 145 |
-
counter += 1
|
| 146 |
-
|
| 147 |
-
# Copy or move
|
| 148 |
-
if copy:
|
| 149 |
-
shutil.copy2(file_info.path, dest_path)
|
| 150 |
-
console.print(f"[green]✓[/] Copied: {file_info.name}")
|
| 151 |
-
else:
|
| 152 |
-
shutil.move(file_info.path, dest_path)
|
| 153 |
-
console.print(f"[green]✓[/] Moved: {file_info.name}")
|
| 154 |
-
|
| 155 |
-
imported.append(FileInfo(
|
| 156 |
-
path=dest_path,
|
| 157 |
-
name=dest_path.name,
|
| 158 |
-
size=file_info.size,
|
| 159 |
-
file_type=file_info.file_type,
|
| 160 |
-
extension=file_info.extension
|
| 161 |
-
))
|
| 162 |
-
|
| 163 |
-
return imported
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
def list_supported_formats():
|
| 167 |
-
"""Display all supported file formats."""
|
| 168 |
-
table = Table(title="Supported File Formats")
|
| 169 |
-
table.add_column("Type", style="cyan")
|
| 170 |
-
table.add_column("Extensions", style="white")
|
| 171 |
-
|
| 172 |
-
table.add_row("Video", ", ".join(sorted(VIDEO_EXTENSIONS)))
|
| 173 |
-
table.add_row("Audio", ", ".join(sorted(AUDIO_EXTENSIONS)))
|
| 174 |
-
table.add_row("Document", ", ".join(sorted(DOCUMENT_EXTENSIONS)))
|
| 175 |
-
table.add_row("Image (OCR)", ", ".join(sorted(IMAGE_EXTENSIONS)))
|
| 176 |
-
|
| 177 |
-
console.print(table)
|
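The deleted `FileInfo.size_formatted` property picks a unit with power-of-1024 thresholds. A standalone sketch of the same thresholds, written as a plain function rather than the dataclass property:

```python
def size_formatted(size: int) -> str:
    """Human-readable size using the same 1024-based cutoffs as FileInfo."""
    if size >= 1024 ** 3:
        return f"{size / 1024 ** 3:.1f} GB"
    if size >= 1024 ** 2:
        return f"{size / 1024 ** 2:.1f} MB"
    if size >= 1024:
        return f"{size / 1024:.1f} KB"
    return f"{size} B"

print(size_formatted(512))           # 512 B
print(size_formatted(2048))          # 2.0 KB
print(size_formatted(5 * 1024 ** 2)) # 5.0 MB
```

Note the binary (1024-based) units: a 5,000,000-byte file reports as roughly 4.8 MB, not 5.0.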
src/downloaders/youtube.py DELETED
@@ -1,264 +0,0 @@
"""YouTube video downloader using yt-dlp."""

import json
import subprocess
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional

from rich.console import Console
from rich.progress import Progress, SpinnerColumn, TextColumn

from src.config import settings

console = Console()

# Add local bin paths
import os
os.environ["PATH"] = os.environ.get("PATH", "") + ":/home/ubuntu/.local/bin:/home/ubuntu/.deno/bin"


@dataclass
class VideoInfo:
    """Information about a downloaded video."""

    id: str
    title: str
    description: str
    duration: int  # seconds
    uploader: str
    upload_date: str
    url: str
    filepath: Optional[Path] = None
    audio_filepath: Optional[Path] = None
    subtitles: Optional[str] = None
    chapters: list = field(default_factory=list)

    @property
    def duration_formatted(self) -> str:
        """Return duration in HH:MM:SS format."""
        hours, remainder = divmod(self.duration, 3600)
        minutes, seconds = divmod(remainder, 60)
        if hours:
            return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
        return f"{minutes:02d}:{seconds:02d}"


class YouTubeDownloader:
    """Download videos from YouTube using yt-dlp."""

    def __init__(self, output_dir: Optional[Path] = None, cookies_file: Optional[Path] = None):
        self.output_dir = output_dir or settings.downloads_dir
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.cookies_file = cookies_file  # Path to cookies.txt for authenticated downloads

    def get_info(self, url: str) -> VideoInfo:
        """Get video information without downloading."""
        cmd = [
            "yt-dlp",
            "--dump-json",
            "--no-download",
        ]

        # Add cookies if provided
        if self.cookies_file and Path(self.cookies_file).exists():
            cmd.extend(["--cookies", str(self.cookies_file)])

        cmd.append(url)

        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            error_msg = result.stderr
            if "Sign in to confirm you're not a bot" in error_msg:
                raise Exception(
                    "YouTube requires authentication. Please provide a cookies file:\n"
                    "1. Install a browser extension to export cookies (e.g., 'Get cookies.txt LOCALLY')\n"
                    "2. Export cookies from youtube.com\n"
                    "3. Use --cookies path/to/cookies.txt"
                )
            raise Exception(f"Failed to get video info: {error_msg}")

        data = json.loads(result.stdout)

        return VideoInfo(
            id=data.get("id", ""),
            title=data.get("title", "Unknown"),
            description=data.get("description", ""),
            duration=data.get("duration", 0),
            uploader=data.get("uploader", "Unknown"),
            upload_date=data.get("upload_date", ""),
            url=url,
            chapters=data.get("chapters", [])
        )

    def download_video(
        self,
        url: str,
        audio_only: bool = False,
        get_subtitles: bool = True,
        quality: str = "best"
    ) -> VideoInfo:
        """Download a video from YouTube.

        Args:
            url: YouTube video URL
            audio_only: If True, download only audio (faster for transcription)
            get_subtitles: If True, download auto-generated subtitles if available
            quality: Video quality - 'best', '1080p', '720p', '480p', 'audio'

        Returns:
            VideoInfo with filepath set
        """
        # Get video info first
        with console.status("[bold green]Getting video info..."):
            info = self.get_info(url)

        console.print(f"[bold blue]Title:[/] {info.title}")
        console.print(f"[bold blue]Duration:[/] {info.duration_formatted}")
        console.print(f"[bold blue]Uploader:[/] {info.uploader}")

        # Build output template
        output_template = str(self.output_dir / "%(id)s.%(ext)s")

        # Build yt-dlp command
        cmd = [
            "yt-dlp",
            "--output", output_template,
            "--no-playlist",  # Download single video only
            "--newline",  # Progress on new lines
        ]

        # Add cookies if provided
        if self.cookies_file and Path(self.cookies_file).exists():
            cmd.extend(["--cookies", str(self.cookies_file)])

        # Format selection
        if audio_only:
            cmd.extend([
                "-x",  # Extract audio
                "--audio-format", "mp3",
                "--audio-quality", "0",  # Best quality
            ])
        else:
            if quality == "best":
                cmd.extend(["-f", "bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best"])
            elif quality == "1080p":
                cmd.extend(["-f", "bestvideo[height<=1080][ext=mp4]+bestaudio[ext=m4a]/best[height<=1080]"])
            elif quality == "720p":
                cmd.extend(["-f", "bestvideo[height<=720][ext=mp4]+bestaudio[ext=m4a]/best[height<=720]"])
            elif quality == "480p":
                cmd.extend(["-f", "bestvideo[height<=480][ext=mp4]+bestaudio[ext=m4a]/best[height<=480]"])

        # Subtitles
        if get_subtitles:
            cmd.extend([
                "--write-auto-sub",
                "--sub-lang", "en",
                "--sub-format", "srt/vtt/best",
                "--convert-subs", "srt",
            ])

        cmd.append(url)

        # Download
        console.print(f"\n[bold green]Downloading...[/]")

        result = subprocess.run(cmd, capture_output=True, text=True)

        if result.returncode != 0:
            console.print(f"[red]Error:[/] {result.stderr}")
            raise Exception(f"Download failed: {result.stderr}")

        # Find the downloaded file
        if audio_only:
            filepath = self.output_dir / f"{info.id}.mp3"
            info.audio_filepath = filepath
        else:
            # Try common extensions
            for ext in ["mp4", "webm", "mkv"]:
                filepath = self.output_dir / f"{info.id}.{ext}"
                if filepath.exists():
                    break
            info.filepath = filepath

        # Check for subtitles
        subtitle_path = self.output_dir / f"{info.id}.en.srt"
        if subtitle_path.exists():
            info.subtitles = subtitle_path.read_text()
            console.print(f"[green]✓[/] Subtitles downloaded")

        console.print(f"[green]✓[/] Downloaded to: {filepath}")

        return info

    def download_playlist(
        self,
        url: str,
        audio_only: bool = False,
        max_videos: Optional[int] = None
    ) -> list[VideoInfo]:
        """Download all videos from a playlist.

        Args:
            url: YouTube playlist URL
            audio_only: If True, download only audio
            max_videos: Maximum number of videos to download

        Returns:
            List of VideoInfo objects
        """
        # Get playlist info
        cmd = [
            "yt-dlp",
            "--dump-json",
            "--flat-playlist",
            url
        ]

        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            raise Exception(f"Failed to get playlist info: {result.stderr}")

        # Parse each video entry
        videos = []
        for line in result.stdout.strip().split("\n"):
            if line:
                data = json.loads(line)
                video_url = f"https://www.youtube.com/watch?v={data['id']}"
                videos.append(video_url)

        if max_videos:
            videos = videos[:max_videos]

        console.print(f"[bold blue]Found {len(videos)} videos in playlist[/]")

        # Download each video
        downloaded = []
        for i, video_url in enumerate(videos, 1):
            console.print(f"\n[bold]Downloading {i}/{len(videos)}[/]")
            try:
                info = self.download_video(video_url, audio_only=audio_only)
                downloaded.append(info)
            except Exception as e:
                console.print(f"[red]Failed to download:[/] {e}")

        return downloaded


def download_youtube(
    url: str,
    audio_only: bool = False,
    output_dir: Optional[Path] = None
) -> VideoInfo:
    """Convenience function to download a YouTube video.

    Args:
        url: YouTube video URL
        audio_only: If True, download only audio
        output_dir: Output directory (default: data/downloads)

    Returns:
        VideoInfo with download details
    """
    downloader = YouTubeDownloader(output_dir)
    return downloader.download_video(url, audio_only=audio_only)
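`VideoInfo.duration_formatted` in the deleted downloader folds raw seconds into `HH:MM:SS`, dropping the hours field for videos under an hour. The same two-step `divmod` as a free function:

```python
def duration_formatted(duration: int) -> str:
    """Seconds -> HH:MM:SS, or MM:SS when under an hour (as in VideoInfo)."""
    hours, remainder = divmod(duration, 3600)
    minutes, seconds = divmod(remainder, 60)
    if hours:
        return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
    return f"{minutes:02d}:{seconds:02d}"

print(duration_formatted(90))    # 01:30
print(duration_formatted(3725))  # 01:02:05
```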
src/knowledge/__init__.py DELETED
@@ -1,19 +0,0 @@
"""Knowledge base with vector storage."""

from .embeddings import EmbeddingModel, embed_text, embed_texts
from .vectorstore import KnowledgeBase, SearchResult, get_knowledge_base, search
from .indexer import index_text, index_file, index_directory, reindex_all

__all__ = [
    "EmbeddingModel",
    "embed_text",
    "embed_texts",
    "KnowledgeBase",
    "SearchResult",
    "get_knowledge_base",
    "search",
    "index_text",
    "index_file",
    "index_directory",
    "reindex_all",
]
src/knowledge/embeddings.py DELETED
@@ -1,107 +0,0 @@
"""Embedding generation using sentence-transformers (local, free)."""

from typing import Optional
import numpy as np

from rich.console import Console

console = Console()


class EmbeddingModel:
    """Generate embeddings using sentence-transformers."""

    # Recommended models (all free, run locally)
    MODELS = {
        "fast": "all-MiniLM-L6-v2",  # 384 dims, very fast
        "balanced": "all-mpnet-base-v2",  # 768 dims, good quality
        "multilingual": "paraphrase-multilingual-MiniLM-L12-v2",  # 384 dims
    }

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        """Initialize embedding model.

        Args:
            model_name: Model name or key from MODELS dict
        """
        # Allow shorthand names
        if model_name in self.MODELS:
            model_name = self.MODELS[model_name]

        self.model_name = model_name
        self._model = None

    def _load_model(self):
        """Lazy load the model."""
        if self._model is None:
            console.print(f"[bold green]Loading embedding model:[/] {self.model_name}")

            try:
                from sentence_transformers import SentenceTransformer
            except ImportError:
                raise ImportError(
                    "sentence-transformers not installed. Run:\n"
                    "  pip install sentence-transformers"
                )

            self._model = SentenceTransformer(self.model_name)
            console.print(f"[green]✓[/] Model loaded (dim={self._model.get_sentence_embedding_dimension()})")

    @property
    def dimension(self) -> int:
        """Get embedding dimension."""
        self._load_model()
        return self._model.get_sentence_embedding_dimension()

    def embed(self, text: str) -> list[float]:
        """Generate embedding for a single text.

        Args:
            text: Text to embed

        Returns:
            List of floats (embedding vector)
        """
        self._load_model()
        embedding = self._model.encode(text, convert_to_numpy=True)
        return embedding.tolist()

    def embed_batch(self, texts: list[str], show_progress: bool = True) -> list[list[float]]:
        """Generate embeddings for multiple texts.

        Args:
            texts: List of texts to embed
            show_progress: Show progress bar

        Returns:
            List of embedding vectors
        """
        self._load_model()
        embeddings = self._model.encode(
            texts,
            convert_to_numpy=True,
            show_progress_bar=show_progress
        )
        return embeddings.tolist()


# Global instance for convenience
_default_model: Optional[EmbeddingModel] = None


def get_embedding_model(model_name: str = "all-MiniLM-L6-v2") -> EmbeddingModel:
    """Get or create the default embedding model."""
    global _default_model
    if _default_model is None or _default_model.model_name != model_name:
        _default_model = EmbeddingModel(model_name)
    return _default_model


def embed_text(text: str) -> list[float]:
    """Convenience function to embed a single text."""
    return get_embedding_model().embed(text)


def embed_texts(texts: list[str]) -> list[list[float]]:
    """Convenience function to embed multiple texts."""
    return get_embedding_model().embed_batch(texts)
src/knowledge/indexer.py DELETED
@@ -1,151 +0,0 @@
"""Index content into the knowledge base."""

from pathlib import Path
from typing import Optional

from rich.console import Console
from rich.progress import Progress, SpinnerColumn, TextColumn

from src.config import settings
from src.analyzers.chunker import chunk_for_summarization
from src.knowledge.vectorstore import KnowledgeBase, get_knowledge_base

console = Console()


def index_text(
    text: str,
    source: str,
    kb: Optional[KnowledgeBase] = None,
    chunk_size: int = 1000
) -> int:
    """Index a text into the knowledge base.

    Args:
        text: Text content to index
        source: Source identifier
        kb: Knowledge base (uses default if None)
        chunk_size: Characters per chunk

    Returns:
        Number of chunks indexed
    """
    kb = kb or get_knowledge_base()

    # Chunk the text
    chunks = chunk_for_summarization(text, max_tokens=chunk_size // 4)

    if not chunks:
        return 0

    # Extract just the text from chunks
    texts = [c.text for c in chunks]
    metadatas = [{"start_char": c.start_char, "end_char": c.end_char} for c in chunks]

    # Add to knowledge base
    kb.add_texts(texts, source=source, metadatas=metadatas)

    return len(chunks)


def index_file(
    path: Path,
    kb: Optional[KnowledgeBase] = None
) -> int:
    """Index a file into the knowledge base.

    Args:
        path: Path to text file
        kb: Knowledge base (uses default if None)

    Returns:
        Number of chunks indexed
    """
    path = Path(path)

    if not path.exists():
        console.print(f"[red]File not found:[/] {path}")
        return 0

    text = path.read_text(encoding="utf-8", errors="ignore")

    if not text.strip():
        console.print(f"[yellow]Empty file:[/] {path.name}")
        return 0

    return index_text(text, source=str(path), kb=kb)


def index_directory(
    path: Optional[Path] = None,
    kb: Optional[KnowledgeBase] = None,
    extensions: list[str] = [".txt", ".md"]
) -> dict:
    """Index all text files in a directory.

    Args:
        path: Directory path (defaults to transcripts_dir)
        kb: Knowledge base
        extensions: File extensions to index

    Returns:
        Dict with stats {files: int, chunks: int}
    """
    path = path or settings.transcripts_dir
    path = Path(path)
    kb = kb or get_knowledge_base()

    # Find all text files
    files = []
    for ext in extensions:
        files.extend(path.glob(f"*{ext}"))

    if not files:
        console.print(f"[yellow]No files found in {path}[/]")
        return {"files": 0, "chunks": 0}

    console.print(f"[bold blue]Indexing {len(files)} files...[/]")

    total_chunks = 0
    indexed_files = 0

    for file_path in files:
        try:
            chunks = index_file(file_path, kb=kb)
            if chunks > 0:
                indexed_files += 1
                total_chunks += chunks
        except Exception as e:
            console.print(f"[red]Error indexing {file_path.name}:[/] {e}")

    console.print(f"[green]✓[/] Indexed {indexed_files} files, {total_chunks} chunks")

    return {"files": indexed_files, "chunks": total_chunks}


def reindex_all(kb: Optional[KnowledgeBase] = None) -> dict:
    """Clear and reindex everything.

    Args:
        kb: Knowledge base

    Returns:
        Dict with stats
    """
    kb = kb or get_knowledge_base()

    console.print("[bold yellow]Clearing existing index...[/]")
    kb.clear()

    # Index transcripts
    console.print("\n[bold blue]Indexing transcripts...[/]")
    transcript_stats = index_directory(settings.transcripts_dir, kb=kb)

    # Index summaries
    console.print("\n[bold blue]Indexing summaries...[/]")
    summary_stats = index_directory(settings.summaries_dir, kb=kb, extensions=[".md", ".txt"])

    return {
        "files": transcript_stats["files"] + summary_stats["files"],
        "chunks": transcript_stats["chunks"] + summary_stats["chunks"]
    }
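The deleted `index_text` converts its character budget into a token budget with `chunk_size // 4`, a rough four-characters-per-token heuristic, before handing the text to `chunk_for_summarization` (defined elsewhere in the removed tree and not shown here). A minimal fixed-window character chunker, as a hypothetical stand-in for that call, just to illustrate the budget arithmetic:

```python
def chunk_by_chars(text: str, chunk_size: int = 1000) -> list[str]:
    """Hypothetical stand-in for chunk_for_summarization: fixed character windows.
    The real chunker is sentence/token aware; this only shows the size budget."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = chunk_by_chars("a" * 2500, chunk_size=1000)
print([len(c) for c in chunks])  # [1000, 1000, 500]
```

With the default `chunk_size=1000`, the token budget passed downstream would be `1000 // 4 = 250` tokens per chunk.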
src/knowledge/vectorstore.py DELETED
@@ -1,316 +0,0 @@
"""Vector store using ChromaDB (local, free, persistent)."""

import hashlib
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

from rich.console import Console

from src.config import settings

console = Console()


@dataclass
class SearchResult:
    """A search result from the knowledge base."""

    text: str
    source: str
    score: float  # Similarity score (higher = more similar)
    metadata: dict

    @property
    def source_name(self) -> str:
        """Get just the filename from source path."""
        return Path(self.source).stem if self.source else "unknown"


class KnowledgeBase:
    """Vector store for semantic search using ChromaDB."""

    def __init__(
        self,
        persist_dir: Optional[Path] = None,
        collection_name: str = "video_analyzer"
    ):
        """Initialize knowledge base.

        Args:
            persist_dir: Directory for persistent storage
            collection_name: Name of the ChromaDB collection
        """
        self.persist_dir = persist_dir or (settings.data_dir / "chromadb")
        self.collection_name = collection_name
        self._client = None
        self._collection = None
        self._embedding_model = None

    def _init_db(self):
        """Initialize ChromaDB client and collection."""
        if self._client is None:
            try:
                import chromadb
                from chromadb.config import Settings as ChromaSettings
            except ImportError:
                raise ImportError(
                    "ChromaDB not installed. Run:\n"
                    "  pip install chromadb"
                )

            # Create persistent client
            self.persist_dir.mkdir(parents=True, exist_ok=True)

            self._client = chromadb.PersistentClient(
                path=str(self.persist_dir),
                settings=ChromaSettings(anonymized_telemetry=False)
            )

            # Get or create collection
            self._collection = self._client.get_or_create_collection(
                name=self.collection_name,
                metadata={"description": "Video Analyzer Knowledge Base"}
            )

            console.print(f"[green]✓[/] Knowledge base loaded: {self._collection.count()} documents")

    def _get_embedding_model(self):
        """Get the embedding model."""
        if self._embedding_model is None:
            from src.knowledge.embeddings import EmbeddingModel
            self._embedding_model = EmbeddingModel()
        return self._embedding_model

    def _generate_id(self, text: str, source: str) -> str:
        """Generate a unique ID for a document."""
        content = f"{source}:{text[:100]}"
        return hashlib.md5(content.encode()).hexdigest()

    def add_text(
        self,
        text: str,
        source: str,
        metadata: Optional[dict] = None
    ) -> str:
        """Add a single text to the knowledge base.

        Args:
            text: Text content
            source: Source file path or identifier
            metadata: Additional metadata

        Returns:
            Document ID
        """
        self._init_db()

        # Generate embedding
        model = self._get_embedding_model()
        embedding = model.embed(text)

        # Generate ID
        doc_id = self._generate_id(text, source)

        # Prepare metadata
        meta = metadata or {}
        meta["source"] = source
        meta["text_length"] = len(text)

        # Add to collection
        self._collection.add(
            ids=[doc_id],
            embeddings=[embedding],
            documents=[text],
            metadatas=[meta]
        )

        return doc_id

    def add_texts(
        self,
        texts: list[str],
        source: str,
        metadatas: Optional[list[dict]] = None,
        show_progress: bool = True
    ) -> list[str]:
        """Add multiple texts to the knowledge base.

        Args:
            texts: List of text content
            source: Source file path
            metadatas: List of metadata dicts
            show_progress: Show progress bar

        Returns:
            List of document IDs
        """
        self._init_db()

        if not texts:
            return []

        console.print(f"[bold blue]Indexing {len(texts)} chunks from {Path(source).name}[/]")

        # Generate embeddings in batch
        model = self._get_embedding_model()
        embeddings = model.embed_batch(texts, show_progress=show_progress)

        # Generate IDs and prepare metadata
        ids = []
        metas = []
        for i, text in enumerate(texts):
            doc_id = self._generate_id(text, f"{source}:{i}")
            ids.append(doc_id)

            meta = metadatas[i] if metadatas else {}
            meta["source"] = source
            meta["chunk_index"] = i
            meta["text_length"] = len(text)
            metas.append(meta)
-
|
| 173 |
-
# Add to collection
|
| 174 |
-
self._collection.add(
|
| 175 |
-
ids=ids,
|
| 176 |
-
embeddings=embeddings,
|
| 177 |
-
documents=texts,
|
| 178 |
-
metadatas=metas
|
| 179 |
-
)
|
| 180 |
-
|
| 181 |
-
console.print(f"[green]✓[/] Added {len(texts)} chunks to knowledge base")
|
| 182 |
-
|
| 183 |
-
return ids
|
| 184 |
-
|
| 185 |
-
def search(
|
| 186 |
-
self,
|
| 187 |
-
query: str,
|
| 188 |
-
n_results: int = 5,
|
| 189 |
-
filter_source: Optional[str] = None
|
| 190 |
-
) -> list[SearchResult]:
|
| 191 |
-
"""Search the knowledge base semantically.
|
| 192 |
-
|
| 193 |
-
Args:
|
| 194 |
-
query: Search query
|
| 195 |
-
n_results: Number of results to return
|
| 196 |
-
filter_source: Filter by source file
|
| 197 |
-
|
| 198 |
-
Returns:
|
| 199 |
-
List of SearchResult objects
|
| 200 |
-
"""
|
| 201 |
-
self._init_db()
|
| 202 |
-
|
| 203 |
-
# Generate query embedding
|
| 204 |
-
model = self._get_embedding_model()
|
| 205 |
-
query_embedding = model.embed(query)
|
| 206 |
-
|
| 207 |
-
# Build filter
|
| 208 |
-
where_filter = None
|
| 209 |
-
if filter_source:
|
| 210 |
-
where_filter = {"source": filter_source}
|
| 211 |
-
|
| 212 |
-
# Search
|
| 213 |
-
results = self._collection.query(
|
| 214 |
-
query_embeddings=[query_embedding],
|
| 215 |
-
n_results=n_results,
|
| 216 |
-
where=where_filter,
|
| 217 |
-
include=["documents", "metadatas", "distances"]
|
| 218 |
-
)
|
| 219 |
-
|
| 220 |
-
# Convert to SearchResult objects
|
| 221 |
-
search_results = []
|
| 222 |
-
if results["documents"] and results["documents"][0]:
|
| 223 |
-
for i, doc in enumerate(results["documents"][0]):
|
| 224 |
-
# Convert distance to similarity score (1 - distance for cosine)
|
| 225 |
-
distance = results["distances"][0][i] if results["distances"] else 0
|
| 226 |
-
score = 1 - distance # Higher = more similar
|
| 227 |
-
|
| 228 |
-
metadata = results["metadatas"][0][i] if results["metadatas"] else {}
|
| 229 |
-
source = metadata.pop("source", "unknown")
|
| 230 |
-
|
| 231 |
-
search_results.append(SearchResult(
|
| 232 |
-
text=doc,
|
| 233 |
-
source=source,
|
| 234 |
-
score=score,
|
| 235 |
-
metadata=metadata
|
| 236 |
-
))
|
| 237 |
-
|
| 238 |
-
return search_results
|
| 239 |
-
|
| 240 |
-
def count(self) -> int:
|
| 241 |
-
"""Get total number of documents in the knowledge base."""
|
| 242 |
-
self._init_db()
|
| 243 |
-
return self._collection.count()
|
| 244 |
-
|
| 245 |
-
def get_sources(self) -> list[str]:
|
| 246 |
-
"""Get list of all sources in the knowledge base."""
|
| 247 |
-
self._init_db()
|
| 248 |
-
|
| 249 |
-
# Get all metadata
|
| 250 |
-
results = self._collection.get(include=["metadatas"])
|
| 251 |
-
|
| 252 |
-
sources = set()
|
| 253 |
-
if results["metadatas"]:
|
| 254 |
-
for meta in results["metadatas"]:
|
| 255 |
-
if "source" in meta:
|
| 256 |
-
sources.add(meta["source"])
|
| 257 |
-
|
| 258 |
-
return sorted(sources)
|
| 259 |
-
|
| 260 |
-
def delete_source(self, source: str) -> int:
|
| 261 |
-
"""Delete all documents from a specific source.
|
| 262 |
-
|
| 263 |
-
Args:
|
| 264 |
-
source: Source path to delete
|
| 265 |
-
|
| 266 |
-
Returns:
|
| 267 |
-
Number of documents deleted
|
| 268 |
-
"""
|
| 269 |
-
self._init_db()
|
| 270 |
-
|
| 271 |
-
# Get IDs for this source
|
| 272 |
-
results = self._collection.get(
|
| 273 |
-
where={"source": source},
|
| 274 |
-
include=["metadatas"]
|
| 275 |
-
)
|
| 276 |
-
|
| 277 |
-
if not results["ids"]:
|
| 278 |
-
return 0
|
| 279 |
-
|
| 280 |
-
# Delete
|
| 281 |
-
count = len(results["ids"])
|
| 282 |
-
self._collection.delete(ids=results["ids"])
|
| 283 |
-
|
| 284 |
-
console.print(f"[green]✓[/] Deleted {count} chunks from {source}")
|
| 285 |
-
|
| 286 |
-
return count
|
| 287 |
-
|
| 288 |
-
def clear(self):
|
| 289 |
-
"""Clear all documents from the knowledge base."""
|
| 290 |
-
self._init_db()
|
| 291 |
-
|
| 292 |
-
# Delete and recreate collection
|
| 293 |
-
self._client.delete_collection(self.collection_name)
|
| 294 |
-
self._collection = self._client.create_collection(
|
| 295 |
-
name=self.collection_name,
|
| 296 |
-
metadata={"description": "Video Analyzer Knowledge Base"}
|
| 297 |
-
)
|
| 298 |
-
|
| 299 |
-
console.print("[green]✓[/] Knowledge base cleared")
|
| 300 |
-
|
| 301 |
-
|
| 302 |
-
# Convenience functions
|
| 303 |
-
_default_kb: Optional[KnowledgeBase] = None
|
| 304 |
-
|
| 305 |
-
|
| 306 |
-
def get_knowledge_base() -> KnowledgeBase:
|
| 307 |
-
"""Get the default knowledge base instance."""
|
| 308 |
-
global _default_kb
|
| 309 |
-
if _default_kb is None:
|
| 310 |
-
_default_kb = KnowledgeBase()
|
| 311 |
-
return _default_kb
|
| 312 |
-
|
| 313 |
-
|
| 314 |
-
def search(query: str, n_results: int = 5) -> list[SearchResult]:
|
| 315 |
-
"""Search the knowledge base."""
|
| 316 |
-
return get_knowledge_base().search(query, n_results)
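The deleted vector store keyed every chunk by a content hash, which made re-indexing idempotent. A minimal, dependency-free sketch of that ID scheme (function name `generate_doc_id` is illustrative; the original was the private method `_generate_id`):

```python
import hashlib

def generate_doc_id(text: str, source: str) -> str:
    """Deterministic document ID: MD5 over the source identifier plus
    the first 100 characters of the text, mirroring the removed
    KnowledgeBase._generate_id."""
    content = f"{source}:{text[:100]}"
    return hashlib.md5(content.encode()).hexdigest()

# The same (source, text) pair always maps to the same 32-hex-char ID,
# so re-indexing a chunk replaces it instead of duplicating it.
id_a = generate_doc_id("hello world", "video.txt:0")
id_b = generate_doc_id("hello world", "video.txt:0")
assert id_a == id_b
assert len(id_a) == 32
```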
src/main.py
DELETED
@@ -1,6 +0,0 @@
"""Main entry point for Video Analyzer."""

from src.ui.cli import app

if __name__ == "__main__":
    app()
src/mentor/__init__.py
DELETED
@@ -1 +0,0 @@
"""Virtual mentor with RAG (Phase 4+)."""
src/processors/__init__.py
DELETED
@@ -1,18 +0,0 @@
"""Content processors for audio, documents, and transcription."""

from .audio import extract_audio
from .transcriber import transcribe_audio, WhisperTranscriber
from .documents import extract_document, process_documents, DocumentContent
from .ocr import extract_text_from_image, process_images, OCRResult

__all__ = [
    "extract_audio",
    "transcribe_audio",
    "WhisperTranscriber",
    "extract_document",
    "process_documents",
    "DocumentContent",
    "extract_text_from_image",
    "process_images",
    "OCRResult"
]
src/processors/__pycache__/__init__.cpython-312.pyc
DELETED
Binary file (618 Bytes)

src/processors/__pycache__/audio.cpython-312.pyc
DELETED
Binary file (3.13 kB)

src/processors/__pycache__/transcriber.cpython-312.pyc
DELETED
Binary file (10.6 kB)
src/processors/audio.py
DELETED
@@ -1,83 +0,0 @@
"""Audio extraction from video files using ffmpeg."""

import subprocess
from pathlib import Path
from typing import Optional

from rich.console import Console

from src.config import settings

console = Console()


def extract_audio(
    video_path: Path,
    output_path: Optional[Path] = None,
    audio_format: str = "mp3",
    sample_rate: int = 16000,  # Whisper prefers 16kHz
) -> Path:
    """Extract audio from a video file using ffmpeg.

    Args:
        video_path: Path to the video file
        output_path: Output path for audio file (default: data/audio/<video_name>.mp3)
        audio_format: Output audio format (mp3, wav, m4a)
        sample_rate: Audio sample rate in Hz (16000 recommended for Whisper)

    Returns:
        Path to the extracted audio file
    """
    video_path = Path(video_path)

    if not video_path.exists():
        raise FileNotFoundError(f"Video not found: {video_path}")

    # Default output path
    if output_path is None:
        output_path = settings.audio_dir / f"{video_path.stem}.{audio_format}"
    else:
        output_path = Path(output_path)

    output_path.parent.mkdir(parents=True, exist_ok=True)

    console.print(f"[bold green]Extracting audio from:[/] {video_path.name}")

    # Build ffmpeg command
    cmd = [
        "ffmpeg",
        "-i", str(video_path),
        "-vn",  # No video
        "-acodec", "libmp3lame" if audio_format == "mp3" else "pcm_s16le",
        "-ar", str(sample_rate),  # Sample rate
        "-ac", "1",  # Mono (better for speech recognition)
        "-y",  # Overwrite output
        str(output_path)
    ]

    result = subprocess.run(cmd, capture_output=True, text=True)

    if result.returncode != 0:
        raise Exception(f"Audio extraction failed: {result.stderr}")

    console.print(f"[green]✓[/] Audio extracted to: {output_path}")

    return output_path


def get_audio_duration(audio_path: Path) -> float:
    """Get the duration of an audio file in seconds."""
    cmd = [
        "ffprobe",
        "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        str(audio_path)
    ]

    result = subprocess.run(cmd, capture_output=True, text=True)

    if result.returncode != 0:
        raise Exception(f"Failed to get audio duration: {result.stderr}")

    return float(result.stdout.strip())
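The heart of the removed `extract_audio()` was the ffmpeg argument list. It can be sketched in isolation without touching the filesystem (the helper name `build_ffmpeg_cmd` is illustrative, not from the original module):

```python
def build_ffmpeg_cmd(video_path: str, output_path: str,
                     audio_format: str = "mp3",
                     sample_rate: int = 16000) -> list[str]:
    """Assemble the ffmpeg invocation the removed extract_audio() used."""
    return [
        "ffmpeg",
        "-i", video_path,
        "-vn",                    # drop the video stream entirely
        "-acodec", "libmp3lame" if audio_format == "mp3" else "pcm_s16le",
        "-ar", str(sample_rate),  # 16 kHz is the rate Whisper prefers
        "-ac", "1",               # downmix to mono for speech recognition
        "-y",                     # overwrite any existing output file
        output_path,
    ]

cmd = build_ffmpeg_cmd("talk.mp4", "talk.mp3")
assert cmd[0] == "ffmpeg" and "-vn" in cmd and "16000" in cmd
```

Passing `audio_format="wav"` switches the codec to `pcm_s16le`, matching the conditional in the deleted code.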
src/processors/documents.py
DELETED
@@ -1,278 +0,0 @@
"""Document processing for PDF, Word, PowerPoint, and text files."""

from dataclasses import dataclass
from pathlib import Path
from typing import Optional

from rich.console import Console

from src.config import settings

console = Console()


@dataclass
class DocumentContent:
    """Extracted content from a document."""

    text: str
    title: str
    pages: int
    source_path: Path
    doc_type: str  # pdf, docx, pptx, txt, md
    metadata: dict

    def save(self, output_path: Optional[Path] = None) -> Path:
        """Save extracted text to file."""
        if output_path is None:
            output_path = settings.transcripts_dir / f"{self.source_path.stem}.txt"

        output_path.parent.mkdir(parents=True, exist_ok=True)
        output_path.write_text(self.text)
        return output_path


def extract_pdf(path: Path) -> DocumentContent:
    """Extract text from a PDF file.

    Args:
        path: Path to PDF file

    Returns:
        DocumentContent with extracted text
    """
    try:
        import fitz  # PyMuPDF
    except ImportError:
        raise ImportError("PyMuPDF not installed. Run: pip install PyMuPDF")

    path = Path(path)
    console.print(f"[bold green]Extracting PDF:[/] {path.name}")

    doc = fitz.open(path)

    text_parts = []
    for page_num, page in enumerate(doc, 1):
        text = page.get_text()
        if text.strip():
            text_parts.append(f"--- Page {page_num} ---\n{text}")

    full_text = "\n\n".join(text_parts)

    # Extract metadata
    metadata = {
        "author": doc.metadata.get("author", ""),
        "title": doc.metadata.get("title", ""),
        "subject": doc.metadata.get("subject", ""),
        "creator": doc.metadata.get("creator", ""),
    }

    title = metadata.get("title") or path.stem

    # Capture the page count before closing; len(doc) is invalid on a closed document
    page_count = len(doc)
    doc.close()

    console.print(f"[green]✓[/] Extracted {page_count} pages, {len(full_text)} characters")

    return DocumentContent(
        text=full_text,
        title=title,
        pages=page_count,
        source_path=path,
        doc_type="pdf",
        metadata=metadata
    )


def extract_docx(path: Path) -> DocumentContent:
    """Extract text from a Word document.

    Args:
        path: Path to .docx file

    Returns:
        DocumentContent with extracted text
    """
    try:
        from docx import Document
    except ImportError:
        raise ImportError("python-docx not installed. Run: pip install python-docx")

    path = Path(path)
    console.print(f"[bold green]Extracting Word doc:[/] {path.name}")

    doc = Document(path)

    text_parts = []
    for para in doc.paragraphs:
        if para.text.strip():
            text_parts.append(para.text)

    # Also extract from tables
    for table in doc.tables:
        for row in table.rows:
            row_text = " | ".join(cell.text.strip() for cell in row.cells if cell.text.strip())
            if row_text:
                text_parts.append(row_text)

    full_text = "\n\n".join(text_parts)

    # Extract metadata
    metadata = {
        "author": doc.core_properties.author or "",
        "title": doc.core_properties.title or "",
        "subject": doc.core_properties.subject or "",
    }

    title = metadata.get("title") or path.stem

    console.print(f"[green]✓[/] Extracted {len(text_parts)} paragraphs, {len(full_text)} characters")

    return DocumentContent(
        text=full_text,
        title=title,
        pages=1,  # Word docs don't have fixed pages
        source_path=path,
        doc_type="docx",
        metadata=metadata
    )


def extract_pptx(path: Path) -> DocumentContent:
    """Extract text from a PowerPoint presentation.

    Args:
        path: Path to .pptx file

    Returns:
        DocumentContent with extracted text
    """
    try:
        from pptx import Presentation
    except ImportError:
        raise ImportError("python-pptx not installed. Run: pip install python-pptx")

    path = Path(path)
    console.print(f"[bold green]Extracting PowerPoint:[/] {path.name}")

    prs = Presentation(path)

    text_parts = []
    for slide_num, slide in enumerate(prs.slides, 1):
        slide_text = [f"--- Slide {slide_num} ---"]

        for shape in slide.shapes:
            if hasattr(shape, "text") and shape.text.strip():
                slide_text.append(shape.text)

        if len(slide_text) > 1:  # Has content beyond header
            text_parts.append("\n".join(slide_text))

    full_text = "\n\n".join(text_parts)

    console.print(f"[green]✓[/] Extracted {len(prs.slides)} slides, {len(full_text)} characters")

    return DocumentContent(
        text=full_text,
        title=path.stem,
        pages=len(prs.slides),
        source_path=path,
        doc_type="pptx",
        metadata={}
    )


def extract_text_file(path: Path) -> DocumentContent:
    """Extract text from plain text or markdown files.

    Args:
        path: Path to .txt or .md file

    Returns:
        DocumentContent with text
    """
    path = Path(path)
    console.print(f"[bold green]Reading text file:[/] {path.name}")

    text = path.read_text(encoding="utf-8", errors="ignore")

    console.print(f"[green]✓[/] Read {len(text)} characters")

    return DocumentContent(
        text=text,
        title=path.stem,
        pages=1,
        source_path=path,
        doc_type=path.suffix.lstrip("."),
        metadata={}
    )


def extract_document(path: Path) -> DocumentContent:
    """Extract text from any supported document type.

    Args:
        path: Path to document

    Returns:
        DocumentContent with extracted text
    """
    path = Path(path)
    ext = path.suffix.lower()

    if ext == ".pdf":
        return extract_pdf(path)
    elif ext in {".docx", ".doc"}:
        return extract_docx(path)
    elif ext in {".pptx", ".ppt"}:
        return extract_pptx(path)
    elif ext in {".txt", ".md", ".rtf"}:
        return extract_text_file(path)
    else:
        raise ValueError(f"Unsupported document type: {ext}")


def process_documents(
    path: Path,
    output_dir: Optional[Path] = None,
    recursive: bool = True
) -> list[DocumentContent]:
    """Process all documents in a file or directory.

    Args:
        path: File or directory path
        output_dir: Output directory for extracted text
        recursive: If True, scan subdirectories

    Returns:
        List of DocumentContent objects
    """
    from src.downloaders.files import scan_files

    path = Path(path)
    output_dir = output_dir or settings.transcripts_dir

    # Get document files
    files = scan_files(path, recursive=recursive, file_types=["document"])

    if not files:
        console.print("[yellow]No documents found[/]")
        return []

    console.print(f"[bold blue]Found {len(files)} documents to process[/]")

    results = []
    for file_info in files:
        try:
            content = extract_document(file_info.path)

            # Save extracted text
            output_path = output_dir / f"{file_info.path.stem}.txt"
            content.save(output_path)
            console.print(f"[green]✓[/] Saved: {output_path.name}")

            results.append(content)

        except Exception as e:
            console.print(f"[red]Error processing {file_info.name}:[/] {e}")

    return results
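The removed `extract_document()` dispatched on the file extension. That routing logic stands alone as a small, testable sketch (the `EXTRACTORS` table and `pick_extractor` helper are illustrative names; the original used an if/elif chain):

```python
from pathlib import Path

# Extension -> extractor routing, mirroring the removed extract_document()
EXTRACTORS = {
    ".pdf": "extract_pdf",
    ".docx": "extract_docx", ".doc": "extract_docx",
    ".pptx": "extract_pptx", ".ppt": "extract_pptx",
    ".txt": "extract_text_file", ".md": "extract_text_file", ".rtf": "extract_text_file",
}

def pick_extractor(path: str) -> str:
    """Return the extractor name for a path, case-insensitively."""
    ext = Path(path).suffix.lower()
    if ext not in EXTRACTORS:
        raise ValueError(f"Unsupported document type: {ext}")
    return EXTRACTORS[ext]

assert pick_extractor("slides.PPTX") == "extract_pptx"
assert pick_extractor("notes.md") == "extract_text_file"
```

Lower-casing the suffix first means `report.PDF` and `report.pdf` route identically, which the original if/elif chain also guaranteed via `path.suffix.lower()`.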
src/processors/ocr.py
DELETED
@@ -1,133 +0,0 @@
"""OCR (Optical Character Recognition) for images using Tesseract."""

from dataclasses import dataclass
from pathlib import Path
from typing import Optional

from rich.console import Console

from src.config import settings

console = Console()


@dataclass
class OCRResult:
    """Result of OCR processing."""

    text: str
    source_path: Path
    confidence: float  # 0-100

    def save(self, output_path: Optional[Path] = None) -> Path:
        """Save extracted text to file."""
        if output_path is None:
            output_path = settings.transcripts_dir / f"{self.source_path.stem}_ocr.txt"

        output_path.parent.mkdir(parents=True, exist_ok=True)
        output_path.write_text(self.text)
        return output_path


def extract_text_from_image(path: Path, language: str = "eng") -> OCRResult:
    """Extract text from an image using Tesseract OCR.

    Args:
        path: Path to image file
        language: Tesseract language code (eng, spa, fra, deu, etc.)

    Returns:
        OCRResult with extracted text
    """
    try:
        import pytesseract
        from PIL import Image
    except ImportError:
        raise ImportError(
            "OCR dependencies not installed. Run:\n"
            "  pip install pytesseract Pillow\n"
            "  sudo apt install tesseract-ocr  # Linux\n"
            "  brew install tesseract  # macOS"
        )

    path = Path(path)
    console.print(f"[bold green]OCR processing:[/] {path.name}")

    # Open and process image
    image = Image.open(path)

    # Get OCR data with confidence scores
    data = pytesseract.image_to_data(image, lang=language, output_type=pytesseract.Output.DICT)

    # Extract text and calculate average confidence
    words = []
    confidences = []

    for i, word in enumerate(data["text"]):
        if word.strip():
            words.append(word)
            conf = data["conf"][i]
            if conf > 0:  # -1 means no confidence data
                confidences.append(conf)

    text = " ".join(words)
    avg_confidence = sum(confidences) / len(confidences) if confidences else 0

    console.print(f"[green]✓[/] Extracted {len(words)} words, confidence: {avg_confidence:.1f}%")

    return OCRResult(
        text=text,
        source_path=path,
        confidence=avg_confidence
    )


def process_images(
    path: Path,
    output_dir: Optional[Path] = None,
    language: str = "eng",
    recursive: bool = True
) -> list[OCRResult]:
    """Process all images in a file or directory with OCR.

    Args:
        path: File or directory path
        output_dir: Output directory for extracted text
        language: Tesseract language code
        recursive: If True, scan subdirectories

    Returns:
        List of OCRResult objects
    """
    from src.downloaders.files import scan_files

    path = Path(path)
    output_dir = output_dir or settings.transcripts_dir

    # Get image files
    files = scan_files(path, recursive=recursive, file_types=["image"])

    if not files:
        console.print("[yellow]No images found[/]")
        return []

    console.print(f"[bold blue]Found {len(files)} images to process[/]")

    results = []
    for file_info in files:
        try:
            result = extract_text_from_image(file_info.path, language=language)

            if result.text.strip():
                # Save extracted text
                output_path = output_dir / f"{file_info.path.stem}_ocr.txt"
                result.save(output_path)
                console.print(f"[green]✓[/] Saved: {output_path.name}")
                results.append(result)
            else:
                console.print(f"[yellow]No text found in {file_info.name}[/]")

        except Exception as e:
            console.print(f"[red]Error processing {file_info.name}:[/] {e}")

    return results
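The removed OCR module averaged per-word Tesseract confidences while skipping non-positive values, since pytesseract reports `-1` for boxes that are not recognized words. That filtering rule, extracted as a standalone sketch (function name `average_confidence` is illustrative):

```python
def average_confidence(conf_values: list[int]) -> float:
    """Average Tesseract word confidences, skipping non-positive values
    such as the -1 sentinel pytesseract emits for non-word boxes."""
    valid = [c for c in conf_values if c > 0]
    return sum(valid) / len(valid) if valid else 0.0

# Two recognized words at 90 and 80, plus one layout box at -1:
assert average_confidence([90, -1, 80]) == 85.0
```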
src/processors/transcriber.py
DELETED
@@ -1,243 +0,0 @@
"""Audio transcription using Whisper (local, free)."""

import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

from rich.console import Console
from rich.progress import Progress, SpinnerColumn, TextColumn

from src.config import settings

console = Console()


@dataclass
class TranscriptSegment:
    """A segment of transcribed text with timing."""

    start: float  # Start time in seconds
    end: float  # End time in seconds
    text: str  # Transcribed text

    @property
    def start_formatted(self) -> str:
        """Format start time as HH:MM:SS."""
        return self._format_time(self.start)

    @property
    def end_formatted(self) -> str:
        """Format end time as HH:MM:SS."""
        return self._format_time(self.end)

    def _format_time(self, seconds: float) -> str:
        hours, remainder = divmod(int(seconds), 3600)
        minutes, secs = divmod(remainder, 60)
        return f"{hours:02d}:{minutes:02d}:{secs:02d}"


@dataclass
class Transcript:
    """Complete transcript with segments and metadata."""

    text: str  # Full transcript text
    segments: list[TranscriptSegment]  # Timed segments
    language: str  # Detected language
    duration: float  # Audio duration in seconds

    def to_srt(self) -> str:
        """Convert to SRT subtitle format."""
        lines = []
        for i, seg in enumerate(self.segments, 1):
            start = self._format_srt_time(seg.start)
            end = self._format_srt_time(seg.end)
            lines.append(f"{i}")
            lines.append(f"{start} --> {end}")
            lines.append(seg.text.strip())
            lines.append("")
        return "\n".join(lines)

    def _format_srt_time(self, seconds: float) -> str:
        hours, remainder = divmod(int(seconds), 3600)
        minutes, secs = divmod(remainder, 60)
        ms = int((seconds % 1) * 1000)
        return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

    def save(self, output_path: Path, format: str = "txt") -> Path:
        """Save transcript to file.

        Args:
            output_path: Output file path
            format: 'txt', 'srt', or 'json'
        """
        output_path = Path(output_path)
        output_path.parent.mkdir(parents=True, exist_ok=True)

        if format == "txt":
            output_path.write_text(self.text)
        elif format == "srt":
            output_path.write_text(self.to_srt())
        elif format == "json":
            data = {
                "text": self.text,
                "language": self.language,
                "duration": self.duration,
                "segments": [
                    {"start": s.start, "end": s.end, "text": s.text}
                    for s in self.segments
                ]
            }
            output_path.write_text(json.dumps(data, indent=2))

        return output_path


class WhisperTranscriber:
    """Transcribe audio using faster-whisper (local, free)."""

    def __init__(
        self,
        model_size: str = "base",
        device: str = "auto",
        compute_type: str = "auto"
    ):
        """Initialize the transcriber.

        Args:
            model_size: Whisper model size - tiny, base, small, medium, large-v3
model_size: Whisper model size - tiny, base, small, medium, large-v3
|
| 109 |
-
device: Device to use - 'auto', 'cpu', 'cuda'
|
| 110 |
-
compute_type: Computation type - 'auto', 'int8', 'float16', 'float32'
|
| 111 |
-
"""
|
| 112 |
-
self.model_size = model_size
|
| 113 |
-
self.device = device
|
| 114 |
-
self.compute_type = compute_type
|
| 115 |
-
self._model = None
|
| 116 |
-
|
| 117 |
-
def _load_model(self):
|
| 118 |
-
"""Lazy load the Whisper model."""
|
| 119 |
-
if self._model is None:
|
| 120 |
-
console.print(f"[bold green]Loading Whisper model:[/] {self.model_size}")
|
| 121 |
-
|
| 122 |
-
from faster_whisper import WhisperModel
|
| 123 |
-
|
| 124 |
-
# Determine device and compute type
|
| 125 |
-
device = self.device
|
| 126 |
-
compute_type = self.compute_type
|
| 127 |
-
|
| 128 |
-
if device == "auto":
|
| 129 |
-
try:
|
| 130 |
-
import torch
|
| 131 |
-
device = "cuda" if torch.cuda.is_available() else "cpu"
|
| 132 |
-
except ImportError:
|
| 133 |
-
device = "cpu"
|
| 134 |
-
|
| 135 |
-
if compute_type == "auto":
|
| 136 |
-
compute_type = "float16" if device == "cuda" else "int8"
|
| 137 |
-
|
| 138 |
-
self._model = WhisperModel(
|
| 139 |
-
self.model_size,
|
| 140 |
-
device=device,
|
| 141 |
-
compute_type=compute_type
|
| 142 |
-
)
|
| 143 |
-
|
| 144 |
-
console.print(f"[green]✓[/] Model loaded on {device}")
|
| 145 |
-
|
| 146 |
-
def transcribe(
|
| 147 |
-
self,
|
| 148 |
-
audio_path: Path,
|
| 149 |
-
language: Optional[str] = None,
|
| 150 |
-
) -> Transcript:
|
| 151 |
-
"""Transcribe an audio file.
|
| 152 |
-
|
| 153 |
-
Args:
|
| 154 |
-
audio_path: Path to audio file
|
| 155 |
-
language: Language code (e.g., 'en') or None for auto-detect
|
| 156 |
-
|
| 157 |
-
Returns:
|
| 158 |
-
Transcript object with full text and segments
|
| 159 |
-
"""
|
| 160 |
-
audio_path = Path(audio_path)
|
| 161 |
-
|
| 162 |
-
if not audio_path.exists():
|
| 163 |
-
raise FileNotFoundError(f"Audio file not found: {audio_path}")
|
| 164 |
-
|
| 165 |
-
self._load_model()
|
| 166 |
-
|
| 167 |
-
console.print(f"[bold green]Transcribing:[/] {audio_path.name}")
|
| 168 |
-
|
| 169 |
-
# Transcribe
|
| 170 |
-
segments_generator, info = self._model.transcribe(
|
| 171 |
-
str(audio_path),
|
| 172 |
-
language=language,
|
| 173 |
-
beam_size=5,
|
| 174 |
-
word_timestamps=True,
|
| 175 |
-
vad_filter=True, # Filter out non-speech
|
| 176 |
-
)
|
| 177 |
-
|
| 178 |
-
# Collect segments
|
| 179 |
-
segments = []
|
| 180 |
-
full_text_parts = []
|
| 181 |
-
|
| 182 |
-
with Progress(
|
| 183 |
-
SpinnerColumn(),
|
| 184 |
-
TextColumn("[progress.description]{task.description}"),
|
| 185 |
-
console=console
|
| 186 |
-
) as progress:
|
| 187 |
-
task = progress.add_task("Processing segments...", total=None)
|
| 188 |
-
|
| 189 |
-
for segment in segments_generator:
|
| 190 |
-
segments.append(TranscriptSegment(
|
| 191 |
-
start=segment.start,
|
| 192 |
-
end=segment.end,
|
| 193 |
-
text=segment.text
|
| 194 |
-
))
|
| 195 |
-
full_text_parts.append(segment.text)
|
| 196 |
-
progress.update(task, description=f"Processed {len(segments)} segments...")
|
| 197 |
-
|
| 198 |
-
transcript = Transcript(
|
| 199 |
-
text=" ".join(full_text_parts).strip(),
|
| 200 |
-
segments=segments,
|
| 201 |
-
language=info.language,
|
| 202 |
-
duration=info.duration
|
| 203 |
-
)
|
| 204 |
-
|
| 205 |
-
console.print(f"[green]✓[/] Transcription complete")
|
| 206 |
-
console.print(f"[bold blue]Language:[/] {info.language}")
|
| 207 |
-
console.print(f"[bold blue]Duration:[/] {info.duration:.1f}s")
|
| 208 |
-
console.print(f"[bold blue]Segments:[/] {len(segments)}")
|
| 209 |
-
|
| 210 |
-
return transcript
|
| 211 |
-
|
| 212 |
-
|
| 213 |
-
def transcribe_audio(
|
| 214 |
-
audio_path: Path,
|
| 215 |
-
model_size: str = "base",
|
| 216 |
-
output_dir: Optional[Path] = None,
|
| 217 |
-
save_formats: list[str] = ["txt", "json"]
|
| 218 |
-
) -> Transcript:
|
| 219 |
-
"""Convenience function to transcribe audio and save results.
|
| 220 |
-
|
| 221 |
-
Args:
|
| 222 |
-
audio_path: Path to audio file
|
| 223 |
-
model_size: Whisper model size
|
| 224 |
-
output_dir: Output directory (default: data/transcripts)
|
| 225 |
-
save_formats: List of formats to save ('txt', 'srt', 'json')
|
| 226 |
-
|
| 227 |
-
Returns:
|
| 228 |
-
Transcript object
|
| 229 |
-
"""
|
| 230 |
-
audio_path = Path(audio_path)
|
| 231 |
-
output_dir = output_dir or settings.transcripts_dir
|
| 232 |
-
|
| 233 |
-
# Transcribe
|
| 234 |
-
transcriber = WhisperTranscriber(model_size=model_size)
|
| 235 |
-
transcript = transcriber.transcribe(audio_path)
|
| 236 |
-
|
| 237 |
-
# Save in requested formats
|
| 238 |
-
for fmt in save_formats:
|
| 239 |
-
output_path = output_dir / f"{audio_path.stem}.{fmt}"
|
| 240 |
-
transcript.save(output_path, format=fmt)
|
| 241 |
-
console.print(f"[green]✓[/] Saved: {output_path}")
|
| 242 |
-
|
| 243 |
-
return transcript
|
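The deleted module's SRT timestamp logic is self-contained enough to sketch on its own. The snippet below reproduces `_format_srt_time` and the segment-to-SRT assembly from the removed `Transcript` class as free functions (the standalone names and the sample segment data are illustrative, not part of the original API):

```python
def format_srt_time(seconds: float) -> str:
    """Render seconds as an SRT timestamp: HH:MM:SS,mmm."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    ms = int((seconds % 1) * 1000)  # fractional part as milliseconds
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Build an SRT document from (start, end, text) tuples."""
    lines = []
    for i, (start, end, text) in enumerate(segments, 1):
        lines.append(f"{i}")
        lines.append(f"{format_srt_time(start)} --> {format_srt_time(end)}")
        lines.append(text.strip())
        lines.append("")  # blank line separates SRT cues
    return "\n".join(lines)


print(format_srt_time(3725.5))  # → 01:02:05,500
print(segments_to_srt([(0.0, 2.5, "Hello there.")]))
```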
src/ui/__init__.py
DELETED
@@ -1 +0,0 @@
-"""User interfaces for Video Analyzer."""
src/ui/__pycache__/__init__.cpython-312.pyc
DELETED
Binary file (176 Bytes)

src/ui/__pycache__/cli.cpython-312.pyc
DELETED
Binary file (32.7 kB)