File size: 10,877 Bytes
7e9812c
254776d
7e9812c
e90d887
 
 
 
 
 
04f4b47
 
7e9812c
e90d887
7e9812c
e90d887
254776d
7e9812c
 
e0ffa13
7e9812c
e0ffa13
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
---
title: NotebookLM
emoji: πŸš€
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: "5.12.0"
python_version: "3.12"
app_file: app.py
hf_oauth: true
hf_oauth_expiration_minutes: 480
tags:
- gradio
pinned: false
short_description: NotebookLM - AI-Powered Study Companion
license: mit
---

# NotebookLM β€” AI-Powered Study Companion

A full-featured, Google NotebookLM-inspired study tool built with **Gradio** on **Hugging Face Spaces**. Upload documents, chat with your sources using RAG, and generate study artifacts like summaries, podcasts, and quizzes β€” all from one interface.

---

## Features

### Chat with Your Sources (RAG)
- Ask questions about uploaded documents and get grounded, cited answers
- Two-stage retrieval: semantic vector search (top-40) followed by hybrid reranking (top-8) combining embedding similarity with lexical keyword overlap
- Inline citation chips (`[S1]`, `[S2]`) link back to source passages
- Conversational context: last 5 messages included for follow-up questions
- Powered by **Qwen/Qwen2.5-7B-Instruct** via HF Inference API

### Multi-Format Document Ingestion
| Format | Extractor | Notes |
|--------|-----------|-------|
| **PDF** | PyMuPDF (fitz) | Full-page text extraction |
| **PPTX** | python-pptx | Text from all slides and shapes |
| **TXT** | Built-in | UTF-8 plain text |
| **Web URLs** | BeautifulSoup | Strips nav/footer/scripts, extracts article content |
| **YouTube** | youtube-transcript-api | Auto-fetches video transcripts |

- Max file size: **15 MB**
- Max sources per notebook: **20**
- Duplicate detection for both files and URLs

### Ingestion Pipeline
```
Upload/URL β†’ Text Extraction β†’ Recursive Chunking β†’ Embedding Generation β†’ Pinecone Upsert
```
- **Chunking:** Recursive character splitting (2000 chars per chunk, 200 char overlap) with separators `\n\n` β†’ `\n` β†’ `. ` β†’ ` `
- **Embeddings:** `sentence-transformers/all-MiniLM-L6-v2` (384 dimensions) via HF Inference API
- **Vector Store:** Pinecone with namespace-per-notebook isolation

### Document Summary
- Summarize content from your uploaded sources using semantic retrieval
- **Source selection:** Choose specific sources to include via checkbox selector
- **Styles:** Brief (3-5 bullets, <150 words) or Detailed (sectioned, 300-500 words)
- Powered by **Claude Haiku 4.5**

### Conversation Summary
- Summarize your chat history into structured notes
- **Styles:** Brief or Detailed
- Powered by **Claude Haiku 4.5**

### Podcast Generation
- Generates a natural two-host dialogue script (Alex & Sam) from your summary
- Script converted to audio via **OpenAI TTS** (`tts-1` model, `alloy` voice)
- In-browser audio player with download option
- Falls back to script-only if TTS is unavailable

### Quiz Generation
- Creates multiple-choice quizzes (5 or 10 questions) from your source material
- Interactive HTML with "Show Answer" reveal buttons
- Includes explanations for each correct answer
- Downloadable as standalone HTML

### Multi-Notebook Support
- Create, rename, and delete notebooks from the sidebar
- Each notebook has its own sources, chat history, and artifacts
- Switch between notebooks instantly

### Authentication & Persistence
- **OAuth** via Hugging Face (session expiration: 480 minutes)
- User data serialized to JSON and stored in a **private HF Dataset repo** (`Group-1-5010/notebooklm-data`)
- Manual save button with unsaved-changes warning on page unload

---

## Architecture

```
NotebookLM/
β”œβ”€β”€ app.py                          # Gradio UI, event wiring, refresh logic
β”œβ”€β”€ state.py                        # Data models: UserData, Notebook, Source, Message, Artifact
β”œβ”€β”€ theme.py                        # Dark theme, custom CSS, SVG logos
β”œβ”€β”€ mock_data.py                    # Mock responses for offline testing
β”œβ”€β”€ requirements.txt                # Python dependencies
β”‚
β”œβ”€β”€ artifacts/
β”‚   └── prompt.poml                 # RAG system prompt (POML format)
β”‚
β”œβ”€β”€ assets/
β”‚   β”œβ”€β”€ logo.svg                    # App logo (SVG with gradients)
β”‚   └── podcasts/                   # Generated podcast MP3 files
β”‚
β”œβ”€β”€ ingestion_engine/               # Document processing pipeline
β”‚   β”œβ”€β”€ ingestion_manager.py        # Orchestrates: extract β†’ chunk β†’ embed β†’ upsert
β”‚   β”œβ”€β”€ pdf_extractor.py            # PDF text extraction (PyMuPDF)
β”‚   β”œβ”€β”€ text_extractor.py           # Plain text file reading
β”‚   β”œβ”€β”€ url_scrapper.py             # Web page scraping (BeautifulSoup)
β”‚   β”œβ”€β”€ transcripter.py             # YouTube transcript fetching
β”‚   β”œβ”€β”€ chunker.py                  # Recursive text chunking with overlap
β”‚   └── embedding_generator.py      # Sentence-transformer embeddings (HF API)
β”‚
β”œβ”€β”€ persistence/                    # Storage layer
β”‚   β”œβ”€β”€ vector_store.py             # Pinecone CRUD (upsert, query, delete)
β”‚   └── storage_service.py          # HF Dataset repo for user data persistence
β”‚
β”œβ”€β”€ services/                       # Core AI features
β”‚   β”œβ”€β”€ rag_engine.py               # RAG pipeline: retrieve β†’ rerank β†’ generate
β”‚   β”œβ”€β”€ summary_service.py          # Conversation & document summaries (Claude)
β”‚   β”œβ”€β”€ podcast_service.py          # Podcast script generation + OpenAI TTS
β”‚   └── quiz_service.py             # Quiz generation + interactive HTML renderer
β”‚
└── pages/                          # UI tab handlers
    β”œβ”€β”€ chat.py                     # Chat interface with citation rendering
    β”œβ”€β”€ sources.py                  # Source upload, URL add, delete, status display
    └── artifacts.py                # Summary, podcast, quiz display & generation
```

---

## Tech Stack

| Layer | Technology |
|-------|-----------|
| **UI Framework** | Gradio 5.12.0 |
| **Hosting** | Hugging Face Spaces |
| **Auth** | HF OAuth |
| **Chat LLM** | Qwen/Qwen2.5-7B-Instruct (HF Inference API) |
| **Summary / Podcast / Quiz LLM** | Claude Haiku 4.5 (Anthropic API) |
| **Text-to-Speech** | OpenAI TTS (`tts-1`, `alloy` voice) |
| **Embeddings** | sentence-transformers/all-MiniLM-L6-v2 (HF Inference API) |
| **Vector Database** | Pinecone |
| **User Data Storage** | HF Dataset repo (private JSON) |
| **PDF Extraction** | PyMuPDF |
| **PPTX Extraction** | python-pptx |
| **Web Scraping** | BeautifulSoup4 + Requests |
| **YouTube Transcripts** | youtube-transcript-api |

---

## Setup

### Prerequisites

- Python 3.12+
- A Hugging Face account
- API keys for Pinecone, Anthropic, and OpenAI

### Environment Variables

Set these as **Secrets** in your HF Space settings (or in a `.env` for local dev):

| Variable | Description |
|----------|-------------|
| `HF_TOKEN` | Hugging Face API token (read/write access) |
| `Pinecone_API` | Pinecone API key |
| `ANTHROPIC_API_KEY` | Anthropic API key (for Claude Haiku) |
| `OPENAI_API_KEY` | OpenAI API key (for TTS audio generation) |

### Local Development

```bash
# Clone the repo
git clone https://huggingface.co/spaces/Group-1-5010/NotebookLM
cd NotebookLM

# Install dependencies
pip install -r requirements.txt

# Export your API keys
export HF_TOKEN="hf_..."
export Pinecone_API="..."
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."

# Run the app
python app.py
```

The app launches at `http://localhost:7860`.

### Deploying to HF Spaces

1. Create a new Gradio Space on Hugging Face
2. Push the code to the Space repo
3. Add all four API keys as Secrets in the Space settings
4. The Space auto-builds and deploys

---

## How It Works

### RAG Pipeline (Chat)

```
User Question
    β”‚
    β”œβ”€β†’ Embed query (MiniLM-L6-v2, 384d)
    β”‚
    β”œβ”€β†’ Pinecone semantic search (top-40 candidates)
    β”‚
    β”œβ”€β†’ Hybrid rerank: embedding_score + 0.05 Γ— keyword_overlap
    β”‚       └─→ Deduplicate β†’ top-8 final chunks
    β”‚
    β”œβ”€β†’ Build prompt (system rules + context + last 5 messages + question)
    β”‚
    └─→ Qwen 2.5-7B-Instruct generates grounded answer with [S1], [S2] citations
```

### Artifact Generation

```
Sources (Pinecone) ──→ Claude Haiku ──→ Summary (Markdown)
                                            β”‚
                                            β”œβ”€β”€β†’ Claude Haiku ──→ Podcast Script
                                            β”‚                         β”‚
                                            β”‚                    OpenAI TTS ──→ MP3 Audio
                                            β”‚
                                            └──→ Claude Haiku ──→ Quiz (JSON β†’ Interactive HTML)
```

### Data Flow

```
Upload File ──→ Extract Text ──→ Chunk (500 tokens, 50 overlap)
                                      β”‚
                                      β”œβ”€β”€β†’ Embed (HF Inference API)
                                      β”‚
                                      └──→ Upsert to Pinecone (namespace = notebook_id)
                                                metadata: {source_id, source_filename, chunk_index, text}
```

---

## UI Overview

### Sidebar
- Create / rename / delete notebooks
- Notebook selector (radio buttons with source & message counts)
- Save button with unsaved-changes indicator

### Chat Tab
- Chatbot with message bubbles and citation chips
- Warning banner if no sources are uploaded yet

### Sources Tab
- Drag-and-drop file uploader (PDF, PPTX, TXT)
- URL input for web pages and YouTube videos
- Source cards showing type icon, file size, chunk count, and status badge
- Status indicators: Processing (yellow pulse), Ready (green), Failed (red with tooltip)
- Delete source dropdown

### Artifacts Tab

**Summary Sub-tab:**
- Conversation Summary: style selector (brief/detailed) + generate button
- Document Summary: source selector (checkboxes) + style selector + generate button
- Download buttons for each (`.md`)

**Podcast Sub-tab:**
- Locked until a summary is generated
- Generate button produces dialogue script + MP3 audio
- In-browser audio player
- Download button (`.mp3`)

**Quiz Sub-tab:**
- Question count selector (5 or 10)
- Interactive multiple-choice with "Show Answer" reveals
- Download button (`.html`)

---

## Design

- **Theme:** Custom dark theme (Indigo/Purple gradient)
- **Background:** `#0e1117`
- **Font:** Inter (Google Fonts)
- **Color palette:** Indigo (`#667eea`), Purple (`#764ba2`), Gold accent (`#fbbf24`), Green success (`#22c55e`)
- **Custom SVG logo** with notebook + sparkle motif

---

## Dependencies

```
gradio>=5.0.0
huggingface_hub>=0.20.0
pinecone>=5.0.0
PyMuPDF>=1.23.0
python-pptx>=0.6.21
beautifulsoup4>=4.12.0
requests>=2.31.0
youtube-transcript-api>=0.6.0
scipy>=1.11.0
anthropic>=0.40.0
openai>=1.0.0
```

---

## License

MIT