---
title: NotebookLM
emoji: π
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: "5.12.0"
python_version: "3.12"
app_file: app.py
hf_oauth: true
hf_oauth_expiration_minutes: 480
tags:
- gradio
pinned: false
short_description: NotebookLM - AI-Powered Study Companion
license: mit
---
# NotebookLM - AI-Powered Study Companion
A full-featured, Google NotebookLM-inspired study tool built with **Gradio** on **Hugging Face Spaces**. Upload documents, chat with your sources using RAG, and generate study artifacts like summaries, podcasts, and quizzes, all from one interface.
---
## Features
### Chat with Your Sources (RAG)
- Ask questions about uploaded documents and get grounded, cited answers
- Two-stage retrieval: semantic vector search (top-40) followed by hybrid reranking (top-8) combining embedding similarity with lexical keyword overlap
- Inline citation chips (`[S1]`, `[S2]`) link back to source passages
- Conversational context: last 5 messages included for follow-up questions
- Powered by **Qwen/Qwen2.5-7B-Instruct** via HF Inference API
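
As a rough illustration of the citation and context bullets above, the sketch below numbers retrieved passages as `[S1]`, `[S2]`, and so on, keeps only the last 5 messages, and extracts the citations found in an answer. The function names, the message dictionary shape, and the prompt layout are assumptions made for this example; the real prompt lives in `artifacts/prompt.poml` and may differ.

```python
import re

HISTORY_WINDOW = 5  # README: last 5 messages are included for follow-ups


def build_prompt(system_rules, chunks, history, question):
    """Assemble one prompt string with numbered source passages (assumed layout)."""
    context = "\n\n".join(f"[S{i}] {text}" for i, text in enumerate(chunks, start=1))
    recent = history[-HISTORY_WINDOW:]  # drop everything older than the window
    dialogue = "\n".join(f"{m['role']}: {m['content']}" for m in recent)
    return (
        f"{system_rules}\n\nSources:\n{context}\n\n"
        f"Conversation:\n{dialogue}\n\nQuestion: {question}"
    )


def cited_sources(answer):
    """Return distinct source numbers cited as [S<n>], in order of first use."""
    seen = []
    for match in re.finditer(r"\[S(\d+)\]", answer):
        number = int(match.group(1))
        if number not in seen:
            seen.append(number)
    return seen
```

The list returned by `cited_sources` could then drive the citation chips, mapping each number back to the passage at that position in `chunks`.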
### Multi-Format Document Ingestion
| Format | Extractor | Notes |
|--------|-----------|-------|
| **PDF** | PyMuPDF (fitz) | Full-page text extraction |
| **PPTX** | python-pptx | Text from all slides and shapes |
| **TXT** | Built-in | UTF-8 plain text |
| **Web URLs** | BeautifulSoup | Strips nav/footer/scripts, extracts article content |
| **YouTube** | youtube-transcript-api | Auto-fetches video transcripts |
- Max file size: **15 MB**
- Max sources per notebook: **20**
- Duplicate detection for both files and URLs
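
The limits above could be enforced with a check along these lines. This is a minimal sketch: the function name is hypothetical, 15 MB is interpreted as binary megabytes, and duplicates are detected by case-insensitive name only, which may not match what the app actually compares.

```python
MAX_FILE_BYTES = 15 * 1024 * 1024  # README: 15 MB (binary megabytes assumed)
MAX_SOURCES = 20                   # README: max sources per notebook


def validate_new_source(name, size_bytes, existing_names):
    """Return an error message, or None if the source may be added."""
    if len(existing_names) >= MAX_SOURCES:
        return f"Notebook already has {MAX_SOURCES} sources."
    if size_bytes > MAX_FILE_BYTES:
        return "File exceeds the 15 MB limit."
    # Assumption: duplicates are matched on the normalized file name or URL.
    normalized = {existing.strip().lower() for existing in existing_names}
    if name.strip().lower() in normalized:
        return "This source was already added."
    return None
```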
### Ingestion Pipeline
```
Upload/URL → Text Extraction → Recursive Chunking → Embedding Generation → Pinecone Upsert
```
- **Chunking:** Recursive character splitting (2000 chars per chunk, 200 char overlap) with separators `\n\n` → `\n` → `. ` → ` `
- **Embeddings:** `sentence-transformers/all-MiniLM-L6-v2` (384 dimensions) via HF Inference API
- **Vector Store:** Pinecone with namespace-per-notebook isolation
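
The chunking step can be sketched as follows. This is not the code in `chunker.py`; it is an illustrative version under the stated parameters (2000 characters, 200 overlap, separators tried from coarse to fine). It rejoins pieces with a single space, so original separators are not preserved, and the function names are made up for the example.

```python
SEPARATORS = ["\n\n", "\n", ". ", " "]


def split_recursive(text, limit, separators=SEPARATORS):
    """Break text into pieces of at most `limit` chars, coarse separators first."""
    if len(text) <= limit:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep in text:
            pieces = []
            for part in text.split(sep):
                # Oversized parts are split again with the finer separators.
                pieces.extend(split_recursive(part, limit, separators[i + 1:]))
            return pieces
    # No separator applies: fall back to a hard cut.
    return [text[j:j + limit] for j in range(0, len(text), limit)]


def chunk_text(text, chunk_size=2000, overlap=200):
    """Greedily pack pieces into chunks, carrying `overlap` chars forward."""
    # Reserve room for the overlap tail plus one joining space.
    pieces = split_recursive(text, chunk_size - overlap - 1)
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip() if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            tail = current[-overlap:] if overlap else ""
            current = f"{tail} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk after the first begins with the tail of the previous one, so a sentence that straddles a boundary is still retrievable from at least one chunk.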
### Document Summary
- Summarize content from your uploaded sources using semantic retrieval
- **Source selection:** Choose specific sources to include via checkbox selector
- **Styles:** Brief (3-5 bullets, <150 words) or Detailed (sectioned, 300-500 words)
- Powered by **Claude Haiku 4.5**
### Conversation Summary
- Summarize your chat history into structured notes
- **Styles:** Brief or Detailed
- Powered by **Claude Haiku 4.5**
### Podcast Generation
- Generates a natural two-host dialogue script (Alex & Sam) from your summary
- Script converted to audio via **OpenAI TTS** (`tts-1` model, `alloy` voice)
- In-browser audio player with download option
- Falls back to script-only if TTS is unavailable
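
The script-only fallback might look like the sketch below. It assumes the script uses one `Alex:` or `Sam:` line per turn and takes the TTS step as a caller-supplied function (`synthesize` is a hypothetical placeholder, not an OpenAI API call), so any failure there leaves the script usable on its own.

```python
import re

# Assumed script format: "Alex: ..." / "Sam: ..." with one turn per line.
TURN_PATTERN = re.compile(r"^(Alex|Sam):\s*(.+)$")


def parse_turns(script):
    """Return (speaker, line) pairs, skipping lines that are not dialogue."""
    turns = []
    for line in script.splitlines():
        match = TURN_PATTERN.match(line.strip())
        if match:
            turns.append((match.group(1), match.group(2)))
    return turns


def generate_podcast(script, synthesize):
    """Try to produce audio; fall back to script-only if synthesis fails."""
    try:
        audio = synthesize(script)  # hypothetical callable returning MP3 bytes
    except Exception:
        audio = None  # the UI can still show the script
    return {"script": script, "audio": audio}
```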
### Quiz Generation
- Creates multiple-choice quizzes (5 or 10 questions) from your source material
- Interactive HTML with "Show Answer" reveal buttons
- Includes explanations for each correct answer
- Downloadable as standalone HTML
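
To make the JSON-to-HTML step concrete, here is a small renderer sketch. The question schema (`question`, `options`, `answer_index`, `explanation`) is an assumption, and a native `<details>` element stands in for the app's "Show Answer" buttons so the output needs no JavaScript.

```python
import html


def render_quiz(questions):
    """Render quiz questions as a standalone HTML page (assumed schema)."""
    parts = ["<html><body><h1>Quiz</h1>"]
    for number, q in enumerate(questions, start=1):
        parts.append(f"<h3>{number}. {html.escape(q['question'])}</h3><ol type='A'>")
        parts.extend(f"<li>{html.escape(option)}</li>" for option in q["options"])
        answer = html.escape(q["options"][q["answer_index"]])
        # <details> hides the answer until the reader clicks the summary.
        parts.append(
            "</ol><details><summary>Show Answer</summary>"
            f"<p><b>{answer}</b>: {html.escape(q['explanation'])}</p></details>"
        )
    parts.append("</body></html>")
    return "\n".join(parts)
```

Escaping every model-generated string before embedding it keeps stray `<` or `&` characters in a question from breaking the downloaded page.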
### Multi-Notebook Support
- Create, rename, and delete notebooks from the sidebar
- Each notebook has its own sources, chat history, and artifacts
- Switch between notebooks instantly
### Authentication & Persistence
- **OAuth** via Hugging Face (session expiration: 480 minutes)
- User data serialized to JSON and stored in a **private HF Dataset repo** (`Group-1-5010/notebooklm-data`)
- Manual save button with unsaved-changes warning on page unload
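
Serializing user data to JSON could follow the pattern below. The dataclass fields are guesses for illustration (the real models in `state.py` also cover `Source` and `Artifact`); only the round-trip technique is the point.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Message:
    role: str
    content: str


@dataclass
class Notebook:
    id: str
    title: str
    messages: list = field(default_factory=list)


@dataclass
class UserData:
    username: str
    notebooks: list = field(default_factory=list)


def to_json(user):
    """Serialize nested dataclasses; asdict() recurses into lists."""
    return json.dumps(asdict(user), indent=2)


def from_json(payload):
    """Rebuild the dataclass tree from its JSON form."""
    raw = json.loads(payload)
    notebooks = [
        Notebook(
            id=n["id"],
            title=n["title"],
            messages=[Message(**m) for m in n["messages"]],
        )
        for n in raw["notebooks"]
    ]
    return UserData(username=raw["username"], notebooks=notebooks)
```

The resulting string could then be written to the dataset repo, for instance with `huggingface_hub.HfApi.upload_file(..., repo_type="dataset")`, though the exact call used by `storage_service.py` is not shown in this README.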
---
## Architecture
```
NotebookLM/
├── app.py                      # Gradio UI, event wiring, refresh logic
├── state.py                    # Data models: UserData, Notebook, Source, Message, Artifact
├── theme.py                    # Dark theme, custom CSS, SVG logos
├── mock_data.py                # Mock responses for offline testing
├── requirements.txt            # Python dependencies
│
├── artifacts/
│   └── prompt.poml             # RAG system prompt (POML format)
│
├── assets/
│   ├── logo.svg                # App logo (SVG with gradients)
│   └── podcasts/               # Generated podcast MP3 files
│
├── ingestion_engine/           # Document processing pipeline
│   ├── ingestion_manager.py    # Orchestrates: extract → chunk → embed → upsert
│   ├── pdf_extractor.py        # PDF text extraction (PyMuPDF)
│   ├── text_extractor.py       # Plain text file reading
│   ├── url_scrapper.py         # Web page scraping (BeautifulSoup)
│   ├── transcripter.py         # YouTube transcript fetching
│   ├── chunker.py              # Recursive text chunking with overlap
│   └── embedding_generator.py  # Sentence-transformer embeddings (HF API)
│
├── persistence/                # Storage layer
│   ├── vector_store.py         # Pinecone CRUD (upsert, query, delete)
│   └── storage_service.py      # HF Dataset repo for user data persistence
│
├── services/                   # Core AI features
│   ├── rag_engine.py           # RAG pipeline: retrieve → rerank → generate
│   ├── summary_service.py      # Conversation & document summaries (Claude)
│   ├── podcast_service.py      # Podcast script generation + OpenAI TTS
│   └── quiz_service.py         # Quiz generation + interactive HTML renderer
│
└── pages/                      # UI tab handlers
    ├── chat.py                 # Chat interface with citation rendering
    ├── sources.py              # Source upload, URL add, delete, status display
    └── artifacts.py            # Summary, podcast, quiz display & generation
```
---
## Tech Stack
| Layer | Technology |
|-------|-----------|
| **UI Framework** | Gradio 5.12.0 |
| **Hosting** | Hugging Face Spaces |
| **Auth** | HF OAuth |
| **Chat LLM** | Qwen/Qwen2.5-7B-Instruct (HF Inference API) |
| **Summary / Podcast / Quiz LLM** | Claude Haiku 4.5 (Anthropic API) |
| **Text-to-Speech** | OpenAI TTS (`tts-1`, `alloy` voice) |
| **Embeddings** | sentence-transformers/all-MiniLM-L6-v2 (HF Inference API) |
| **Vector Database** | Pinecone |
| **User Data Storage** | HF Dataset repo (private JSON) |
| **PDF Extraction** | PyMuPDF |
| **PPTX Extraction** | python-pptx |
| **Web Scraping** | BeautifulSoup4 + Requests |
| **YouTube Transcripts** | youtube-transcript-api |
---
## Setup
### Prerequisites
- Python 3.12+
- A Hugging Face account
- API keys for Pinecone, Anthropic, and OpenAI
### Environment Variables
Set these as **Secrets** in your HF Space settings (or in a `.env` for local dev):
| Variable | Description |
|----------|-------------|
| `HF_TOKEN` | Hugging Face API token (read/write access) |
| `Pinecone_API` | Pinecone API key |
| `ANTHROPIC_API_KEY` | Anthropic API key (for Claude Haiku) |
| `OPENAI_API_KEY` | OpenAI API key (for TTS audio generation) |
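
A startup check for these variables might look like the following sketch. The helper name is hypothetical; the variable names, including the mixed-case `Pinecone_API`, are taken from the table above.

```python
import os

REQUIRED_VARS = ("HF_TOKEN", "Pinecone_API", "ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def missing_secrets(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_secrets()` before launching the UI gives a single clear error listing every absent key, rather than a failure on the first API call that needs one.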
### Local Development
```bash
# Clone the repo
git clone https://huggingface.co/spaces/Group-1-5010/NotebookLM
cd NotebookLM
# Install dependencies
pip install -r requirements.txt
# Export your API keys
export HF_TOKEN="hf_..."
export Pinecone_API="..."
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
# Run the app
python app.py
```
The app launches at `http://localhost:7860`.
### Deploying to HF Spaces
1. Create a new Gradio Space on Hugging Face
2. Push the code to the Space repo
3. Add all four API keys as Secrets in the Space settings
4. The Space auto-builds and deploys
---
## How It Works
### RAG Pipeline (Chat)
```
User Question
    │
    ├── Embed query (MiniLM-L6-v2, 384d)
    │
    ├── Pinecone semantic search (top-40 candidates)
    │
    ├── Hybrid rerank: embedding_score + 0.05 × keyword_overlap
    │   └── Deduplicate → top-8 final chunks
    │
    ├── Build prompt (system rules + context + last 5 messages + question)
    │
    └── Qwen 2.5-7B-Instruct generates grounded answer with [S1], [S2] citations
```
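The rerank step can be illustrated with a short sketch that applies the formula from the diagram, `embedding_score + 0.05 × keyword_overlap`, then deduplicates and keeps the top 8. How `keyword_overlap` is computed in `rag_engine.py` is not stated here; the version below assumes it is the fraction of query terms present in the passage, and duplicates are assumed to be passages with identical normalized text.

```python
import re

KEYWORD_WEIGHT = 0.05  # from the pipeline diagram
TOP_K = 8


def tokenize(text):
    """Lowercased alphanumeric terms as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(query, passage):
    """Assumed definition: share of query terms that occur in the passage."""
    terms = tokenize(query)
    return len(terms & tokenize(passage)) / len(terms) if terms else 0.0


def rerank(query, candidates, top_k=TOP_K):
    """candidates: (text, embedding_score) pairs from the vector search."""
    scored = [
        (score + KEYWORD_WEIGHT * keyword_overlap(query, text), text)
        for text, score in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    results, seen = [], set()
    for _, text in scored:
        key = " ".join(text.split()).lower()  # whitespace/case-insensitive match
        if key in seen:
            continue  # keep only the best-scoring copy
        seen.add(key)
        results.append(text)
        if len(results) == top_k:
            break
    return results
```

With a weight of 0.05 the lexical term acts as a tie-breaker: it can reorder passages whose embedding scores are close, but it cannot lift a semantically weak match over a strong one.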
### Artifact Generation
```
Sources (Pinecone) ──▶ Claude Haiku ──▶ Summary (Markdown)
    │
    ├──▶ Claude Haiku ──▶ Podcast Script
    │                         │
    │                     OpenAI TTS ──▶ MP3 Audio
    │
    └──▶ Claude Haiku ──▶ Quiz (JSON → Interactive HTML)
```
### Data Flow
```
Upload File ──▶ Extract Text ──▶ Chunk (2000 chars, 200 overlap; about 500 tokens, 50 overlap)
    │
    ├──▶ Embed (HF Inference API)
    │
    └──▶ Upsert to Pinecone (namespace = notebook_id)
         metadata: {source_id, source_filename, chunk_index, text}
```
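The records sent to Pinecone could be assembled as in this sketch. The metadata keys come from the diagram above; the `"<source_id>-<chunk_index>"` ID scheme and the function name are assumptions for the example.

```python
def build_records(source_id, source_filename, chunks, embeddings):
    """Pair each chunk with its embedding in Pinecone's dict record format."""
    if len(chunks) != len(embeddings):
        raise ValueError("chunks and embeddings must have the same length")
    return [
        {
            "id": f"{source_id}-{index}",  # hypothetical ID scheme
            "values": list(vector),
            "metadata": {
                "source_id": source_id,
                "source_filename": source_filename,
                "chunk_index": index,
                "text": chunk,
            },
        }
        for index, (chunk, vector) in enumerate(zip(chunks, embeddings))
    ]
```

A list like this can be passed to the Pinecone client as `index.upsert(vectors=records, namespace=notebook_id)`. Storing the chunk text in the metadata means a query result already carries the passage, so no second lookup is needed to build the prompt.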
---
## UI Overview
### Sidebar
- Create / rename / delete notebooks
- Notebook selector (radio buttons with source & message counts)
- Save button with unsaved-changes indicator
### Chat Tab
- Chatbot with message bubbles and citation chips
- Warning banner if no sources are uploaded yet
### Sources Tab
- Drag-and-drop file uploader (PDF, PPTX, TXT)
- URL input for web pages and YouTube videos
- Source cards showing type icon, file size, chunk count, and status badge
- Status indicators: Processing (yellow pulse), Ready (green), Failed (red with tooltip)
- Delete source dropdown
### Artifacts Tab
**Summary Sub-tab:**
- Conversation Summary: style selector (brief/detailed) + generate button
- Document Summary: source selector (checkboxes) + style selector + generate button
- Download buttons for each (`.md`)
**Podcast Sub-tab:**
- Locked until a summary is generated
- Generate button produces dialogue script + MP3 audio
- In-browser audio player
- Download button (`.mp3`)
**Quiz Sub-tab:**
- Question count selector (5 or 10)
- Interactive multiple-choice with "Show Answer" reveals
- Download button (`.html`)
---
## Design
- **Theme:** Custom dark theme (Indigo/Purple gradient)
- **Background:** `#0e1117`
- **Font:** Inter (Google Fonts)
- **Color palette:** Indigo (`#667eea`), Purple (`#764ba2`), Gold accent (`#fbbf24`), Green success (`#22c55e`)
- **Custom SVG logo** with notebook + sparkle motif
---
## Dependencies
```
gradio>=5.0.0
huggingface_hub>=0.20.0
pinecone>=5.0.0
PyMuPDF>=1.23.0
python-pptx>=0.6.21
beautifulsoup4>=4.12.0
requests>=2.31.0
youtube-transcript-api>=0.6.0
scipy>=1.11.0
anthropic>=0.40.0
openai>=1.0.0
```
---
## License
MIT