| title: NotebookLM | |
| emoji: π | |
| colorFrom: purple | |
| colorTo: blue | |
| sdk: gradio | |
| sdk_version: "5.12.0" | |
| python_version: "3.12" | |
| app_file: app.py | |
| hf_oauth: true | |
| hf_oauth_expiration_minutes: 480 | |
| tags: | |
| - gradio | |
| pinned: false | |
| short_description: NotebookLM - AI-Powered Study Companion | |
| license: mit | |
| # NotebookLM - AI-Powered Study Companion | |
| A full-featured, Google NotebookLM-inspired study tool built with **Gradio** on **Hugging Face Spaces**. Upload documents, chat with your sources using RAG, and generate study artifacts like summaries, podcasts, and quizzes, all from one interface. | |
| --- | |
| ## Features | |
| ### Chat with Your Sources (RAG) | |
| - Ask questions about uploaded documents and get grounded, cited answers | |
| - Two-stage retrieval: semantic vector search (top-40) followed by hybrid reranking (top-8) combining embedding similarity with lexical keyword overlap | |
| - Inline citation chips (`[S1]`, `[S2]`) link back to source passages | |
| - Conversational context: last 5 messages included for follow-up questions | |
| - Powered by **Qwen/Qwen2.5-7B-Instruct** via HF Inference API | |
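The inline citation chips described above could be rendered with a small post-processing step. The following is a minimal sketch, not the code in `pages/chat.py`: the function name `render_citation_chips`, the `citation-chip` CSS class, and the `data-source` attribute are all assumptions made for illustration.

```python
import html
import re

# Matches markers such as [S1] or [S12] in the model's answer.
CITATION_RE = re.compile(r"\[S(\d+)\]")


def render_citation_chips(answer: str, num_sources: int) -> str:
    """Turn [S<n>] markers into HTML chips (hypothetical markup).

    Markers that point past the retrieved passages are left as plain text,
    so a hallucinated citation never becomes a clickable chip.
    """
    escaped = html.escape(answer)  # square brackets survive escaping

    def to_chip(match: re.Match) -> str:
        index = int(match.group(1))
        if 1 <= index <= num_sources:
            return f'<span class="citation-chip" data-source="{index}">S{index}</span>'
        return match.group(0)

    return CITATION_RE.sub(to_chip, escaped)
```

Escaping the answer before substituting keeps any HTML in the model output inert while the chips themselves stay as real markup.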
| ### Multi-Format Document Ingestion | |
| | Format | Extractor | Notes | | |
| |--------|-----------|-------| | |
| | **PDF** | PyMuPDF (fitz) | Full-page text extraction | | |
| | **PPTX** | python-pptx | Text from all slides and shapes | | |
| | **TXT** | Built-in | UTF-8 plain text | | |
| | **Web URLs** | BeautifulSoup | Strips nav/footer/scripts, extracts article content | | |
| | **YouTube** | youtube-transcript-api | Auto-fetches video transcripts | | |
| - Max file size: **15 MB** | |
| - Max sources per notebook: **20** | |
| - Duplicate detection for both files and URLs | |
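The three limits above could be enforced before a source enters the ingestion pipeline. This is a sketch under stated assumptions: `check_new_source` is a hypothetical helper, "15 MB" is interpreted as 15 MiB, and duplicates are matched on a case-insensitive name or URL, whereas the app may compare content instead.

```python
from typing import Optional

MAX_FILE_BYTES = 15 * 1024 * 1024  # assumption: "15 MB" read as 15 MiB
MAX_SOURCES = 20


def check_new_source(name: str, size_bytes: int, existing: list[str]) -> Optional[str]:
    """Return an error message if the source should be rejected, else None."""
    if len(existing) >= MAX_SOURCES:
        return f"Notebook already has {MAX_SOURCES} sources."
    if size_bytes > MAX_FILE_BYTES:
        return "File exceeds the 15 MB limit."
    key = name.strip().lower()
    # Assumption: a duplicate is a file name or URL that matches ignoring case.
    if any(key == other.strip().lower() for other in existing):
        return "This source has already been added."
    return None
```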
| ### Ingestion Pipeline | |
| ``` | |
| Upload/URL → Text Extraction → Recursive Chunking → Embedding Generation → Pinecone Upsert | |
| ``` | |
| - **Chunking:** Recursive character splitting (2000 chars per chunk, 200 char overlap) with separators `\n\n` → `\n` → `. ` → ` ` | |
| - **Embeddings:** `sentence-transformers/all-MiniLM-L6-v2` (384 dimensions) via HF Inference API | |
| - **Vector Store:** Pinecone with namespace-per-notebook isolation | |
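To make the chunking rule concrete, here is a minimal sketch of recursive character splitting with overlap. It is not the code in `ingestion_engine/chunker.py`. The function names are hypothetical, and joining the overlap to the next piece with a single space is an assumption made to keep the example short.

```python
SEPARATORS = ["\n\n", "\n", ". ", " "]


def split_recursive(text: str, chunk_size: int, separators: list[str] = SEPARATORS) -> list[str]:
    """Split text into pieces of at most chunk_size characters.

    Coarse separators are tried first; a part that is still too long is
    split again with the finer separators that remain.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        pieces: list[str] = []
        current = ""
        for part in text.split(sep):
            candidate = f"{current}{sep}{part}" if current else part
            if len(candidate) <= chunk_size:
                current = candidate
                continue
            if current:
                pieces.append(current)
            if len(part) > chunk_size:
                pieces.extend(split_recursive(part, chunk_size, separators[i + 1:]))
                current = ""
            else:
                current = part
        if current:
            pieces.append(current)
        return pieces
    # No separator applies: fall back to a hard cut.
    return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]


def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Chunk text so each chunk starts with the tail of the previous piece."""
    if not 0 <= overlap < chunk_size - 1:
        raise ValueError("overlap must be non-negative and smaller than chunk_size - 1")
    # Reserve room for the overlap plus one joining space.
    pieces = split_recursive(text, chunk_size - overlap - 1)
    chunks: list[str] = []
    for index, piece in enumerate(pieces):
        if index == 0 or overlap == 0:
            chunks.append(piece)
        else:
            chunks.append(pieces[index - 1][-overlap:] + " " + piece)
    return chunks
```

Splitting to `chunk_size - overlap - 1` first means no chunk exceeds 2000 characters once the overlap is prepended.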
| ### Document Summary | |
| - Summarize content from your uploaded sources using semantic retrieval | |
| - **Source selection:** Choose specific sources to include via checkbox selector | |
| - **Styles:** Brief (3-5 bullets, <150 words) or Detailed (sectioned, 300-500 words) | |
| - Powered by **Claude Haiku 4.5** | |
| ### Conversation Summary | |
| - Summarize your chat history into structured notes | |
| - **Styles:** Brief or Detailed | |
| - Powered by **Claude Haiku 4.5** | |
| ### Podcast Generation | |
| - Generates a natural two-host dialogue script (Alex & Sam) from your summary | |
| - Script converted to audio via **OpenAI TTS** (`tts-1` model, `alloy` voice) | |
| - In-browser audio player with download option | |
| - Falls back to script-only if TTS is unavailable | |
| ### Quiz Generation | |
| - Creates multiple-choice quizzes (5 or 10 questions) from your source material | |
| - Interactive HTML with "Show Answer" reveal buttons | |
| - Includes explanations for each correct answer | |
| - Downloadable as standalone HTML | |
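One dependency-free way to get a "Show Answer" reveal in a standalone HTML file is the native `<details>` element. The sketch below takes that approach. The JSON schema (`question`, `options`, `answer_index`, `explanation`) and the function name are assumptions, and the renderer in `services/quiz_service.py` may work differently.

```python
import html
import json


def render_quiz_html(quiz_json: str) -> str:
    """Render a quiz (hypothetical JSON schema) as a standalone HTML page."""
    questions = json.loads(quiz_json)
    parts = ["<!DOCTYPE html>", "<html><body>", "<h1>Quiz</h1>"]
    for number, q in enumerate(questions, start=1):
        parts.append(f"<h3>{number}. {html.escape(q['question'])}</h3>")
        parts.append("<ol type='A'>")
        for option in q["options"]:
            parts.append(f"<li>{html.escape(option)}</li>")
        parts.append("</ol>")
        correct = html.escape(q["options"][q["answer_index"]])
        # <details> hides the answer until the reader clicks the summary.
        parts.append(
            "<details><summary>Show Answer</summary>"
            f"<p><strong>{correct}</strong>: {html.escape(q['explanation'])}</p></details>"
        )
    parts.append("</body></html>")
    return "\n".join(parts)
```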
| ### Multi-Notebook Support | |
| - Create, rename, and delete notebooks from the sidebar | |
| - Each notebook has its own sources, chat history, and artifacts | |
| - Switch between notebooks instantly | |
| ### Authentication & Persistence | |
| - **OAuth** via Hugging Face (session expiration: 480 minutes) | |
| - User data serialized to JSON and stored in a **private HF Dataset repo** (`Group-1-5010/notebooklm-data`) | |
| - Manual save button with unsaved-changes warning on page unload | |
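The JSON serialization step might look like the sketch below. The class names come from `state.py`, but every field shown here is an assumption, and the real models certainly carry more data, such as sources and artifacts.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Message:
    role: str      # hypothetical field
    content: str   # hypothetical field


@dataclass
class Notebook:
    id: str
    title: str
    messages: list[Message] = field(default_factory=list)


@dataclass
class UserData:
    user_name: str
    notebooks: list[Notebook] = field(default_factory=list)


def to_json(data: UserData) -> str:
    """Serialize nested dataclasses to a JSON string."""
    return json.dumps(asdict(data), indent=2)


def from_json(raw: str) -> UserData:
    """Rebuild the dataclass tree from the JSON produced by to_json."""
    payload = json.loads(raw)
    notebooks = [
        Notebook(
            id=nb["id"],
            title=nb["title"],
            messages=[Message(**m) for m in nb["messages"]],
        )
        for nb in payload["notebooks"]
    ]
    return UserData(user_name=payload["user_name"], notebooks=notebooks)
```

The resulting string could then be written to the dataset repo, for example with `HfApi.upload_file(..., repo_type="dataset")` from `huggingface_hub`. Whether `storage_service.py` uses that call is not stated here.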
| --- | |
| ## Architecture | |
| ``` | |
| NotebookLM/ | |
| ├── app.py                        # Gradio UI, event wiring, refresh logic | |
| ├── state.py                      # Data models: UserData, Notebook, Source, Message, Artifact | |
| ├── theme.py                      # Dark theme, custom CSS, SVG logos | |
| ├── mock_data.py                  # Mock responses for offline testing | |
| ├── requirements.txt              # Python dependencies | |
| │ | |
| ├── artifacts/ | |
| │   └── prompt.poml               # RAG system prompt (POML format) | |
| │ | |
| ├── assets/ | |
| │   ├── logo.svg                  # App logo (SVG with gradients) | |
| │   └── podcasts/                 # Generated podcast MP3 files | |
| │ | |
| ├── ingestion_engine/             # Document processing pipeline | |
| │   ├── ingestion_manager.py      # Orchestrates: extract → chunk → embed → upsert | |
| │   ├── pdf_extractor.py          # PDF text extraction (PyMuPDF) | |
| │   ├── text_extractor.py         # Plain text file reading | |
| │   ├── url_scrapper.py           # Web page scraping (BeautifulSoup) | |
| │   ├── transcripter.py           # YouTube transcript fetching | |
| │   ├── chunker.py                # Recursive text chunking with overlap | |
| │   └── embedding_generator.py    # Sentence-transformer embeddings (HF API) | |
| │ | |
| ├── persistence/                  # Storage layer | |
| │   ├── vector_store.py           # Pinecone CRUD (upsert, query, delete) | |
| │   └── storage_service.py        # HF Dataset repo for user data persistence | |
| │ | |
| ├── services/                     # Core AI features | |
| │   ├── rag_engine.py             # RAG pipeline: retrieve → rerank → generate | |
| │   ├── summary_service.py        # Conversation & document summaries (Claude) | |
| │   ├── podcast_service.py        # Podcast script generation + OpenAI TTS | |
| │   └── quiz_service.py           # Quiz generation + interactive HTML renderer | |
| │ | |
| └── pages/                        # UI tab handlers | |
|     ├── chat.py                   # Chat interface with citation rendering | |
|     ├── sources.py                # Source upload, URL add, delete, status display | |
|     └── artifacts.py              # Summary, podcast, quiz display & generation | |
| ``` | |
| --- | |
| ## Tech Stack | |
| | Layer | Technology | | |
| |-------|-----------| | |
| | **UI Framework** | Gradio 5.12.0 | | |
| | **Hosting** | Hugging Face Spaces | | |
| | **Auth** | HF OAuth | | |
| | **Chat LLM** | Qwen/Qwen2.5-7B-Instruct (HF Inference API) | | |
| | **Summary / Podcast / Quiz LLM** | Claude Haiku 4.5 (Anthropic API) | | |
| | **Text-to-Speech** | OpenAI TTS (`tts-1`, `alloy` voice) | | |
| | **Embeddings** | sentence-transformers/all-MiniLM-L6-v2 (HF Inference API) | | |
| | **Vector Database** | Pinecone | | |
| | **User Data Storage** | HF Dataset repo (private JSON) | | |
| | **PDF Extraction** | PyMuPDF | | |
| | **PPTX Extraction** | python-pptx | | |
| | **Web Scraping** | BeautifulSoup4 + Requests | | |
| | **YouTube Transcripts** | youtube-transcript-api | | |
| --- | |
| ## Setup | |
| ### Prerequisites | |
| - Python 3.12+ | |
| - A Hugging Face account | |
| - API keys for Pinecone, Anthropic, and OpenAI | |
| ### Environment Variables | |
| Set these as **Secrets** in your HF Space settings (or in a `.env` for local dev): | |
| | Variable | Description | | |
| |----------|-------------| | |
| | `HF_TOKEN` | Hugging Face API token (read/write access) | | |
| | `Pinecone_API` | Pinecone API key | | |
| | `ANTHROPIC_API_KEY` | Anthropic API key (for Claude Haiku) | | |
| | `OPENAI_API_KEY` | OpenAI API key (for TTS audio generation) | | |
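A startup check along these lines can surface a missing secret before the first API call fails. It is a sketch only: `missing_env_vars` is a hypothetical helper, while the variable names are the four listed in the table above.

```python
import os
from typing import Mapping, Optional

REQUIRED_VARS = ["HF_TOKEN", "Pinecone_API", "ANTHROPIC_API_KEY", "OPENAI_API_KEY"]


def missing_env_vars(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```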
| ### Local Development | |
| ```bash | |
| # Clone the repo | |
| git clone https://huggingface.co/spaces/Group-1-5010/NotebookLM | |
| cd NotebookLM | |
| # Install dependencies | |
| pip install -r requirements.txt | |
| # Export your API keys | |
| export HF_TOKEN="hf_..." | |
| export Pinecone_API="..." | |
| export ANTHROPIC_API_KEY="sk-ant-..." | |
| export OPENAI_API_KEY="sk-..." | |
| # Run the app | |
| python app.py | |
| ``` | |
| The app launches at `http://localhost:7860`. | |
| ### Deploying to HF Spaces | |
| 1. Create a new Gradio Space on Hugging Face | |
| 2. Push the code to the Space repo | |
| 3. Add all four API keys as Secrets in the Space settings | |
| 4. The Space auto-builds and deploys | |
| --- | |
| ## How It Works | |
| ### RAG Pipeline (Chat) | |
| ``` | |
| User Question | |
| │ | |
| ├── Embed query (MiniLM-L6-v2, 384d) | |
| │ | |
| ├── Pinecone semantic search (top-40 candidates) | |
| │ | |
| ├── Hybrid rerank: embedding_score + 0.05 × keyword_overlap | |
| │   └── Deduplicate → top-8 final chunks | |
| │ | |
| ├── Build prompt (system rules + context + last 5 messages + question) | |
| │ | |
| └── Qwen 2.5-7B-Instruct generates grounded answer with [S1], [S2] citations | |
| ``` | |
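The hybrid rerank step in the diagram can be sketched as follows. The combined score `embedding_score + 0.05 × keyword_overlap` comes from the diagram. Everything else is an assumption: the README does not define how `keyword_overlap` is computed, so this sketch uses the fraction of query terms found in the passage, and the function names and candidate shape (`text`, `score`) are hypothetical.

```python
import re

KEYWORD_WEIGHT = 0.05  # weight taken from the pipeline diagram


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(query: str, passage: str) -> float:
    """Assumed definition: share of query terms that occur in the passage."""
    query_terms = tokenize(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokenize(passage)) / len(query_terms)


def hybrid_rerank(query: str, candidates: list[dict], top_k: int = 8) -> list[dict]:
    """Deduplicate candidates by text, rescore them, and keep the best top_k.

    Each candidate is a dict with 'text' and 'score' (embedding similarity).
    When texts repeat, the first occurrence is kept, which is the highest
    scoring one if the input is already sorted by similarity.
    """
    seen: set[str] = set()
    scored: list[dict] = []
    for candidate in candidates:
        key = candidate["text"].strip()
        if key in seen:
            continue
        seen.add(key)
        combined = candidate["score"] + KEYWORD_WEIGHT * keyword_overlap(query, candidate["text"])
        scored.append({**candidate, "combined": combined})
    scored.sort(key=lambda c: c["combined"], reverse=True)
    return scored[:top_k]
```

With a weight of 0.05, lexical overlap acts as a tiebreaker among passages with similar embedding scores and does not override the semantic ranking.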
| ### Artifact Generation | |
| ``` | |
| Sources (Pinecone) ──► Claude Haiku ──► Summary (Markdown) | |
|        │ | |
|        ├──► Claude Haiku ──► Podcast Script | |
|        │                          │ | |
|        │                     OpenAI TTS ──► MP3 Audio | |
|        │ | |
|        └──► Claude Haiku ──► Quiz (JSON → Interactive HTML) | |
| ``` | |
| ### Data Flow | |
| ``` | |
| Upload File ──► Extract Text ──► Chunk (2000 chars, 200-char overlap; roughly 500 and 50 tokens) | |
|                                    │ | |
|                                    ├──► Embed (HF Inference API) | |
|                                    │ | |
|                                    └──► Upsert to Pinecone (namespace = notebook_id) | |
|                                         metadata: {source_id, source_filename, chunk_index, text} | |
| ``` | |
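The upsert payload implied by the diagram could be assembled like this. The metadata keys are the ones listed above. The record ID scheme (`<source_id>-<chunk_index>`) and the function name `build_records` are assumptions.

```python
def build_records(
    source_id: str,
    source_filename: str,
    chunks: list[str],
    embeddings: list[list[float]],
) -> list[dict]:
    """Pair each chunk with its embedding and the metadata stored alongside it."""
    if len(chunks) != len(embeddings):
        raise ValueError("chunks and embeddings must have the same length")
    return [
        {
            "id": f"{source_id}-{index}",  # hypothetical ID scheme
            "values": vector,
            "metadata": {
                "source_id": source_id,
                "source_filename": source_filename,
                "chunk_index": index,
                "text": chunk,
            },
        }
        for index, (chunk, vector) in enumerate(zip(chunks, embeddings))
    ]
```

With the Pinecone Python client, records of this shape are typically passed as `index.upsert(vectors=records, namespace=notebook_id)`, which keeps each notebook's vectors in its own namespace.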
| --- | |
| ## UI Overview | |
| ### Sidebar | |
| - Create / rename / delete notebooks | |
| - Notebook selector (radio buttons with source & message counts) | |
| - Save button with unsaved-changes indicator | |
| ### Chat Tab | |
| - Chatbot with message bubbles and citation chips | |
| - Warning banner if no sources are uploaded yet | |
| ### Sources Tab | |
| - Drag-and-drop file uploader (PDF, PPTX, TXT) | |
| - URL input for web pages and YouTube videos | |
| - Source cards showing type icon, file size, chunk count, and status badge | |
| - Status indicators: Processing (yellow pulse), Ready (green), Failed (red with tooltip) | |
| - Delete source dropdown | |
| ### Artifacts Tab | |
| **Summary Sub-tab:** | |
| - Conversation Summary: style selector (brief/detailed) + generate button | |
| - Document Summary: source selector (checkboxes) + style selector + generate button | |
| - Download buttons for each (`.md`) | |
| **Podcast Sub-tab:** | |
| - Locked until a summary is generated | |
| - Generate button produces dialogue script + MP3 audio | |
| - In-browser audio player | |
| - Download button (`.mp3`) | |
| **Quiz Sub-tab:** | |
| - Question count selector (5 or 10) | |
| - Interactive multiple-choice with "Show Answer" reveals | |
| - Download button (`.html`) | |
| --- | |
| ## Design | |
| - **Theme:** Custom dark theme (Indigo/Purple gradient) | |
| - **Background:** `#0e1117` | |
| - **Font:** Inter (Google Fonts) | |
| - **Color palette:** Indigo (`#667eea`), Purple (`#764ba2`), Gold accent (`#fbbf24`), Green success (`#22c55e`) | |
| - **Custom SVG logo** with notebook + sparkle motif | |
| --- | |
| ## Dependencies | |
| ``` | |
| gradio>=5.0.0 | |
| huggingface_hub>=0.20.0 | |
| pinecone>=5.0.0 | |
| PyMuPDF>=1.23.0 | |
| python-pptx>=0.6.21 | |
| beautifulsoup4>=4.12.0 | |
| requests>=2.31.0 | |
| youtube-transcript-api>=0.6.0 | |
| scipy>=1.11.0 | |
| anthropic>=0.40.0 | |
| openai>=1.0.0 | |
| ``` | |
| --- | |
| ## License | |
| MIT | |