internomega-terrablue committed on
Commit e0ffa13 · 1 Parent(s): 690fe5e

Readme updation

Files changed (1)
  1. README.md +297 -2
README.md CHANGED
@@ -16,6 +16,301 @@ short_description: NotebookLM - AI-Powered Study Companion
  license: mit
  ---

- # NotebookLM Clone

- AI-powered study companion built with Gradio on Hugging Face Spaces.
+ # NotebookLM - AI-Powered Study Companion

+ A full-featured, Google NotebookLM-inspired study tool built with **Gradio** on **Hugging Face Spaces**. Upload documents, chat with your sources using RAG, and generate study artifacts like summaries, podcasts, and quizzes, all from one interface.
+
+ ---
+
+ ## Features
+
+ ### Chat with Your Sources (RAG)
+ - Ask questions about uploaded documents and get grounded, cited answers
+ - Two-stage retrieval: semantic vector search (top-40) followed by hybrid reranking (top-8) combining embedding similarity with lexical keyword overlap
+ - Inline citation chips (`[S1]`, `[S2]`) link back to source passages
+ - Conversational context: last 5 messages included for follow-up questions
+ - Powered by **Qwen/Qwen2.5-7B-Instruct** via HF Inference API
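
The citation and follow-up behaviour listed above can be pictured as a prompt-assembly step. The sketch below is illustrative only: `build_prompt`, its parameters, and the prompt layout are assumptions made for this example, not the project's actual code in `services/rag_engine.py`. Only the `[S1]`-style labels and the 5-message window come from the feature list.

```python
from typing import Dict, List

HISTORY_WINDOW = 5  # the feature list states that the last 5 messages are included


def build_prompt(system_rules: str, chunks: List[str],
                 history: List[Dict[str, str]], question: str) -> str:
    """Assemble one prompt string; the layout here is a hypothetical example."""
    # Label each retrieved chunk so the model can cite it as [S1], [S2], ...
    context = "\n\n".join(f"[S{i}] {text}" for i, text in enumerate(chunks, start=1))
    # Keep only the most recent messages for follow-up context.
    recent = history[-HISTORY_WINDOW:]
    dialogue = "\n".join(f"{m['role']}: {m['content']}" for m in recent)
    return (
        f"{system_rules}\n\n"
        f"Sources:\n{context}\n\n"
        f"Conversation so far:\n{dialogue}\n\n"
        f"Question: {question}"
    )
```

Numbering the chunks at prompt time keeps the labels stable within a single answer, which is what lets a UI map `[S2]` back to the second retrieved passage.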
+
+ ### Multi-Format Document Ingestion
+ | Format | Extractor | Notes |
+ |--------|-----------|-------|
+ | **PDF** | PyMuPDF (fitz) | Full-page text extraction |
+ | **PPTX** | python-pptx | Text from all slides and shapes |
+ | **TXT** | Built-in | UTF-8 plain text |
+ | **Web URLs** | BeautifulSoup | Strips nav/footer/scripts, extracts article content |
+ | **YouTube** | youtube-transcript-api | Auto-fetches video transcripts |
+
+ - Max file size: **15 MB**
+ - Max sources per notebook: **20**
+ - Duplicate detection for both files and URLs
+
+ ### Ingestion Pipeline
+ ```
+ Upload/URL → Text Extraction → Recursive Chunking → Embedding Generation → Pinecone Upsert
+ ```
+ - **Chunking:** Recursive character splitting (2000 chars per chunk, 200 char overlap) with separators `\n\n` → `\n` → `. ` → ` `
+ - **Embeddings:** `sentence-transformers/all-MiniLM-L6-v2` (384 dimensions) via HF Inference API
+ - **Vector Store:** Pinecone with namespace-per-notebook isolation
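
To make the chunking bullet concrete, here is a minimal sketch of recursive character splitting with overlap. It is not the code in `ingestion_engine/chunker.py`; the function names are hypothetical, and the overlap strategy (prefixing each chunk with the tail of the previous piece) is one plausible reading of "200 char overlap". The separator order and default sizes come from the bullet above.

```python
from typing import List

SEPARATORS = ["\n\n", "\n", ". ", " "]  # coarse to fine, as listed above


def split_recursive(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Split text into pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces: List[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small parts
            continue
        if current:
            pieces.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # A single part is still too long: retry with finer separators.
            pieces.extend(split_recursive(part, chunk_size, rest))
    if current:
        pieces.append(current)
    return pieces


def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> List[str]:
    """Chunk text, prefixing each chunk with the tail of the previous piece."""
    base = split_recursive(text, chunk_size - overlap, SEPARATORS)
    chunks: List[str] = []
    for i, piece in enumerate(base):
        prefix = base[i - 1][-overlap:] if i > 0 else ""
        chunks.append(prefix + piece)
    return chunks
```

Splitting to `chunk_size - overlap` first and then adding the overlap keeps every final chunk within the configured size.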
+
+ ### Document Summary
+ - Summarize content from your uploaded sources using semantic retrieval
+ - **Source selection:** Choose specific sources to include via checkbox selector
+ - **Styles:** Brief (3-5 bullets, <150 words) or Detailed (sectioned, 300-500 words)
+ - Powered by **Claude Haiku 4.5**
+
+ ### Conversation Summary
+ - Summarize your chat history into structured notes
+ - **Styles:** Brief or Detailed
+ - Powered by **Claude Haiku 4.5**
+
+ ### Podcast Generation
+ - Generates a natural two-host dialogue script (Alex & Sam) from your summary
+ - Script converted to audio via **OpenAI TTS** (`tts-1` model, `alloy` voice)
+ - In-browser audio player with download option
+ - Falls back to script-only if TTS is unavailable
+
+ ### Quiz Generation
+ - Creates multiple-choice quizzes (5 or 10 questions) from your source material
+ - Interactive HTML with "Show Answer" reveal buttons
+ - Includes explanations for each correct answer
+ - Downloadable as standalone HTML
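
As a rough illustration of the quiz output, the sketch below renders a list of question dictionaries into standalone HTML. The JSON shape (`question`, `options`, `answer_index`, `explanation`) and the use of `<details>` for the reveal are assumptions made for this example; the real renderer in `services/quiz_service.py` may use buttons and a different schema.

```python
import html
from typing import Dict, List


def render_quiz_html(questions: List[Dict]) -> str:
    """Render quiz items as standalone HTML with click-to-reveal answers."""
    blocks = []
    for number, q in enumerate(questions, start=1):
        options = "".join(f"<li>{html.escape(opt)}</li>" for opt in q["options"])
        answer = html.escape(q["options"][q["answer_index"]])
        blocks.append(
            f"<section><p><strong>Q{number}.</strong> {html.escape(q['question'])}</p>"
            f"<ol type='A'>{options}</ol>"
            # <details> gives a "Show Answer" toggle without any JavaScript.
            f"<details><summary>Show Answer</summary>"
            f"<p>{answer}</p><p>{html.escape(q['explanation'])}</p></details></section>"
        )
    return "<!DOCTYPE html><html><body>" + "".join(blocks) + "</body></html>"
```

Escaping every model-generated string before embedding it matters here, because the file is meant to be downloaded and opened directly in a browser.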
+
+ ### Multi-Notebook Support
+ - Create, rename, and delete notebooks from the sidebar
+ - Each notebook has its own sources, chat history, and artifacts
+ - Switch between notebooks instantly
+
+ ### Authentication & Persistence
+ - **OAuth** via Hugging Face (session expiration: 480 minutes)
+ - User data serialized to JSON and stored in a **private HF Dataset repo** (`Group-1-5010/notebooklm-data`)
+ - Manual save button with unsaved-changes warning on page unload
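
The JSON serialization mentioned above might look roughly like the following round trip. The class names `UserData`, `Notebook`, and `Message` match the models listed for `state.py`, but every field shown here is a guess for illustration, and the upload to the Dataset repo is deliberately left out.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Message:
    role: str      # hypothetical field
    content: str   # hypothetical field


@dataclass
class Notebook:
    id: str
    name: str
    messages: List[Message] = field(default_factory=list)


@dataclass
class UserData:
    user_id: str
    notebooks: List[Notebook] = field(default_factory=list)


def to_json(data: UserData) -> str:
    """Serialize the nested dataclasses to a JSON string."""
    return json.dumps(asdict(data), indent=2)


def from_json(raw: str) -> UserData:
    """Rebuild the dataclass tree from JSON produced by to_json."""
    payload = json.loads(raw)
    notebooks = [
        Notebook(id=nb["id"], name=nb["name"],
                 messages=[Message(**m) for m in nb["messages"]])
        for nb in payload["notebooks"]
    ]
    return UserData(user_id=payload["user_id"], notebooks=notebooks)
```

Keeping the whole user record in one JSON document fits the manual-save model: one button press maps to one file write.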
+
+ ---
+
+ ## Architecture
+
+ ```
+ NotebookLM/
+ ├── app.py                      # Gradio UI, event wiring, refresh logic
+ ├── state.py                    # Data models: UserData, Notebook, Source, Message, Artifact
+ ├── theme.py                    # Dark theme, custom CSS, SVG logos
+ ├── mock_data.py                # Mock responses for offline testing
+ ├── requirements.txt            # Python dependencies
+ │
+ ├── artifacts/
+ │   └── prompt.poml             # RAG system prompt (POML format)
+ │
+ ├── assets/
+ │   ├── logo.svg                # App logo (SVG with gradients)
+ │   └── podcasts/               # Generated podcast MP3 files
+ │
+ ├── ingestion_engine/           # Document processing pipeline
+ │   ├── ingestion_manager.py    # Orchestrates: extract → chunk → embed → upsert
+ │   ├── pdf_extractor.py        # PDF text extraction (PyMuPDF)
+ │   ├── text_extractor.py       # Plain text file reading
+ │   ├── url_scrapper.py         # Web page scraping (BeautifulSoup)
+ │   ├── transcripter.py         # YouTube transcript fetching
+ │   ├── chunker.py              # Recursive text chunking with overlap
+ │   └── embedding_generator.py  # Sentence-transformer embeddings (HF API)
+ │
+ ├── persistence/                # Storage layer
+ │   ├── vector_store.py         # Pinecone CRUD (upsert, query, delete)
+ │   └── storage_service.py      # HF Dataset repo for user data persistence
+ │
+ ├── services/                   # Core AI features
+ │   ├── rag_engine.py           # RAG pipeline: retrieve → rerank → generate
+ │   ├── summary_service.py      # Conversation & document summaries (Claude)
+ │   ├── podcast_service.py      # Podcast script generation + OpenAI TTS
+ │   └── quiz_service.py         # Quiz generation + interactive HTML renderer
+ │
+ └── pages/                      # UI tab handlers
+     ├── chat.py                 # Chat interface with citation rendering
+     ├── sources.py              # Source upload, URL add, delete, status display
+     └── artifacts.py            # Summary, podcast, quiz display & generation
+ ```
+
+ ---
+
+ ## Tech Stack
+
+ | Layer | Technology |
+ |-------|-----------|
+ | **UI Framework** | Gradio 5.12.0 |
+ | **Hosting** | Hugging Face Spaces |
+ | **Auth** | HF OAuth |
+ | **Chat LLM** | Qwen/Qwen2.5-7B-Instruct (HF Inference API) |
+ | **Summary / Podcast / Quiz LLM** | Claude Haiku 4.5 (Anthropic API) |
+ | **Text-to-Speech** | OpenAI TTS (`tts-1`, `alloy` voice) |
+ | **Embeddings** | sentence-transformers/all-MiniLM-L6-v2 (HF Inference API) |
+ | **Vector Database** | Pinecone |
+ | **User Data Storage** | HF Dataset repo (private JSON) |
+ | **PDF Extraction** | PyMuPDF |
+ | **PPTX Extraction** | python-pptx |
+ | **Web Scraping** | BeautifulSoup4 + Requests |
+ | **YouTube Transcripts** | youtube-transcript-api |
+
+ ---
+
+ ## Setup
+
+ ### Prerequisites
+
+ - Python 3.12+
+ - A Hugging Face account
+ - API keys for Pinecone, Anthropic, and OpenAI
+
+ ### Environment Variables
+
+ Set these as **Secrets** in your HF Space settings (or in a `.env` for local dev):
+
+ | Variable | Description |
+ |----------|-------------|
+ | `HF_TOKEN` | Hugging Face API token (read/write access) |
+ | `Pinecone_API` | Pinecone API key |
+ | `ANTHROPIC_API_KEY` | Anthropic API key (for Claude Haiku) |
+ | `OPENAI_API_KEY` | OpenAI API key (for TTS audio generation) |
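
A small startup check can catch a missing secret before the first API call fails. This is a sketch, not part of the app: `missing_env_vars` is a hypothetical helper, and only the four variable names are taken from the table above.

```python
import os
from typing import List, Mapping

# Names copied from the table above (note the mixed case of Pinecone_API).
REQUIRED_VARS = ["HF_TOKEN", "Pinecone_API", "ANTHROPIC_API_KEY", "OPENAI_API_KEY"]


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
    print("All required environment variables are set.")
```

Passing the environment as a parameter keeps the check easy to test without touching the real process environment.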
+
+ ### Local Development
+
+ ```bash
+ # Clone the repo
+ git clone https://huggingface.co/spaces/Group-1-5010/NotebookLM
+ cd NotebookLM
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Export your API keys
+ export HF_TOKEN="hf_..."
+ export Pinecone_API="..."
+ export ANTHROPIC_API_KEY="sk-ant-..."
+ export OPENAI_API_KEY="sk-..."
+
+ # Run the app
+ python app.py
+ ```
+
+ The app launches at `http://localhost:7860`.
+
+ ### Deploying to HF Spaces
+
+ 1. Create a new Gradio Space on Hugging Face
+ 2. Push the code to the Space repo
+ 3. Add all four API keys as Secrets in the Space settings
+ 4. The Space auto-builds and deploys
+
+ ---
+
+ ## How It Works
+
+ ### RAG Pipeline (Chat)
+
+ ```
+ User Question
+     │
+     ├─→ Embed query (MiniLM-L6-v2, 384d)
+     │
+     ├─→ Pinecone semantic search (top-40 candidates)
+     │
+     ├─→ Hybrid rerank: embedding_score + 0.05 × keyword_overlap
+     │       └─→ Deduplicate → top-8 final chunks
+     │
+     ├─→ Build prompt (system rules + context + last 5 messages + question)
+     │
+     └─→ Qwen 2.5-7B-Instruct generates grounded answer with [S1], [S2] citations
+ ```
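
The rerank step in the diagram above can be sketched as follows. The weight `0.05` and the top-8 cutoff come from the diagram; how `keyword_overlap` is computed and how duplicates are detected are not specified in the README, so the definitions below (fraction of query words found in the chunk, exact-text deduplication) are assumptions made for illustration.

```python
import re
from typing import Dict, List

KEYWORD_WEIGHT = 0.05  # weight shown in the diagram above
FINAL_TOP_K = 8        # final chunk count shown in the diagram above


def keyword_overlap(query: str, text: str) -> float:
    """Fraction of distinct query words that also appear in the chunk (assumed definition)."""
    query_words = set(re.findall(r"\w+", query.lower()))
    text_words = set(re.findall(r"\w+", text.lower()))
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def hybrid_rerank(query: str, candidates: List[Dict],
                  top_k: int = FINAL_TOP_K) -> List[Dict]:
    """Combine the vector score with lexical overlap, drop duplicates, keep top_k."""
    scored = []
    seen = set()
    for cand in candidates:
        key = cand["text"].strip().lower()
        if key in seen:  # assumed dedup rule: identical text, ignoring case
            continue
        seen.add(key)
        combined = cand["score"] + KEYWORD_WEIGHT * keyword_overlap(query, cand["text"])
        scored.append({**cand, "combined": combined})
    scored.sort(key=lambda c: c["combined"], reverse=True)
    return scored[:top_k]
```

With a weight this small, the keyword term mostly acts as a tie-breaker among candidates whose embedding scores are already close.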
+
+ ### Artifact Generation
+
+ ```
+ Sources (Pinecone) ──→ Claude Haiku ──→ Summary (Markdown)
+                                             │
+                                             ├──→ Claude Haiku ──→ Podcast Script
+                                             │                          │
+                                             │                     OpenAI TTS ──→ MP3 Audio
+                                             │
+                                             └──→ Claude Haiku ──→ Quiz (JSON → Interactive HTML)
+ ```
+
+ ### Data Flow
+
+ ```
+ Upload File ──→ Extract Text ──→ Chunk (2000 chars / ~500 tokens, 200-char overlap)
+                                      │
+                                      ├──→ Embed (HF Inference API)
+                                      │
+                                      └──→ Upsert to Pinecone (namespace = notebook_id)
+                                           metadata: {source_id, source_filename, chunk_index, text}
+ ```
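
The upsert step pairs each chunk with its embedding and the metadata fields listed in the diagram. The sketch below only builds the records and does not call Pinecone. The `build_records` helper and the `"{source_id}-{i}"` ID scheme are assumptions for illustration; the metadata keys are the ones named above.

```python
from typing import Dict, List


def build_records(source_id: str, source_filename: str,
                  chunks: List[str], embeddings: List[List[float]]) -> List[Dict]:
    """Pair each chunk with its vector and the metadata fields from the diagram."""
    if len(chunks) != len(embeddings):
        raise ValueError("each chunk needs exactly one embedding")
    return [
        {
            "id": f"{source_id}-{i}",  # hypothetical ID scheme
            "values": vector,
            "metadata": {
                "source_id": source_id,
                "source_filename": source_filename,
                "chunk_index": i,
                "text": chunk,
            },
        }
        for i, (chunk, vector) in enumerate(zip(chunks, embeddings))
    ]
```

Records in this shape could then be handed to a Pinecone index with the notebook ID as the namespace, for example `index.upsert(vectors=records, namespace=notebook_id)`, assuming `index` is an already-configured Pinecone index object.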
+
+ ---
+
+ ## UI Overview
+
+ ### Sidebar
+ - Create / rename / delete notebooks
+ - Notebook selector (radio buttons with source & message counts)
+ - Save button with unsaved-changes indicator
+
+ ### Chat Tab
+ - Chatbot with message bubbles and citation chips
+ - Warning banner if no sources are uploaded yet
+
+ ### Sources Tab
+ - Drag-and-drop file uploader (PDF, PPTX, TXT)
+ - URL input for web pages and YouTube videos
+ - Source cards showing type icon, file size, chunk count, and status badge
+ - Status indicators: Processing (yellow pulse), Ready (green), Failed (red with tooltip)
+ - Delete source dropdown
+
+ ### Artifacts Tab
+
+ **Summary Sub-tab:**
+ - Conversation Summary: style selector (brief/detailed) + generate button
+ - Document Summary: source selector (checkboxes) + style selector + generate button
+ - Download buttons for each (`.md`)
+
+ **Podcast Sub-tab:**
+ - Locked until a summary is generated
+ - Generate button produces dialogue script + MP3 audio
+ - In-browser audio player
+ - Download button (`.mp3`)
+
+ **Quiz Sub-tab:**
+ - Question count selector (5 or 10)
+ - Interactive multiple-choice with "Show Answer" reveals
+ - Download button (`.html`)
+
+ ---
+
+ ## Design
+
+ - **Theme:** Custom dark theme (Indigo/Purple gradient)
+ - **Background:** `#0e1117`
+ - **Font:** Inter (Google Fonts)
+ - **Color palette:** Indigo (`#667eea`), Purple (`#764ba2`), Gold accent (`#fbbf24`), Green success (`#22c55e`)
+ - **Custom SVG logo** with notebook + sparkle motif
+
+ ---
+
+ ## Dependencies
+
+ ```
+ gradio>=5.0.0
+ huggingface_hub>=0.20.0
+ pinecone>=5.0.0
+ PyMuPDF>=1.23.0
+ python-pptx>=0.6.21
+ beautifulsoup4>=4.12.0
+ requests>=2.31.0
+ youtube-transcript-api>=0.6.0
+ scipy>=1.11.0
+ anthropic>=0.40.0
+ openai>=1.0.0
+ ```
+
+ ---
+
+ ## License
+
+ MIT