---
title: NotebookLM
emoji: 🚀
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: "5.12.0"
python_version: "3.12"
app_file: app.py
hf_oauth: true
hf_oauth_expiration_minutes: 480
tags:
  - gradio
pinned: false
short_description: NotebookLM - AI-Powered Study Companion
license: mit
---

# NotebookLM — AI-Powered Study Companion

A full-featured, Google NotebookLM-inspired study tool built with **Gradio** on **Hugging Face Spaces**. Upload documents, chat with your sources using RAG, and generate study artifacts like summaries, podcasts, and quizzes — all from one interface.

---

## Features

### Chat with Your Sources (RAG)

- Ask questions about uploaded documents and get grounded, cited answers
- Two-stage retrieval: semantic vector search (top-40) followed by hybrid reranking (top-8) combining embedding similarity with lexical keyword overlap
- Inline citation chips (`[S1]`, `[S2]`) link back to source passages
- Conversational context: the last 5 messages are included for follow-up questions
- Powered by **Qwen/Qwen2.5-7B-Instruct** via the HF Inference API

### Multi-Format Document Ingestion

| Format | Extractor | Notes |
|--------|-----------|-------|
| **PDF** | PyMuPDF (fitz) | Full-page text extraction |
| **PPTX** | python-pptx | Text from all slides and shapes |
| **TXT** | Built-in | UTF-8 plain text |
| **Web URLs** | BeautifulSoup | Strips nav/footer/scripts, extracts article content |
| **YouTube** | youtube-transcript-api | Auto-fetches video transcripts |

- Max file size: **15 MB**
- Max sources per notebook: **20**
- Duplicate detection for both files and URLs

### Ingestion Pipeline

```
Upload/URL → Text Extraction → Recursive Chunking → Embedding Generation → Pinecone Upsert
```

- **Chunking:** Recursive character splitting (2000 chars per chunk, 200-char overlap) with separators `\n\n` → `\n` → `. ` → ` `
- **Embeddings:** `sentence-transformers/all-MiniLM-L6-v2` (384 dimensions) via the HF Inference API
- **Vector Store:** Pinecone with namespace-per-notebook isolation
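The recursive chunking step can be sketched roughly as follows. This is a minimal illustration of the split-then-merge strategy described above, not the project's actual `chunker.py`; the function name and merge details are assumptions:

```python
def chunk_text(text, chunk_size=2000, overlap=200,
               separators=("\n\n", "\n", ". ", " ")):
    """Split text on coarse→fine separators, then merge pieces into
    chunks of at most chunk_size chars, carrying `overlap` chars over."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    # Find the coarsest separator that actually occurs in the text.
    for sep in separators:
        if sep in text:
            break
    else:
        # No separator left: hard-split with a sliding window.
        step = chunk_size - overlap
        return [text[i:i + chunk_size] for i in range(0, len(text), step)]
    finer = separators[separators.index(sep) + 1:]
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
    chunks, current = [], ""
    for piece in pieces:
        if len(piece) > chunk_size:
            # A single piece is still too big: recurse with finer separators.
            if current.strip():
                chunks.append(current)
            chunks.extend(chunk_text(piece, chunk_size, overlap, finer))
            current = ""
            continue
        if current and len(current) + len(piece) > chunk_size:
            chunks.append(current)
            current = current[-overlap:]  # carry overlap into the next chunk
            if len(current) + len(piece) > chunk_size:
                current = ""
        current += piece
    if current.strip():
        chunks.append(current)
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighboring chunks.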
### Document Summary

- Summarize content from your uploaded sources using semantic retrieval
- **Source selection:** Choose specific sources to include via checkbox selector
- **Styles:** Brief (3-5 bullets, <150 words) or Detailed (sectioned, 300-500 words)
- Powered by **Claude Haiku 4.5**

### Conversation Summary

- Summarize your chat history into structured notes
- **Styles:** Brief or Detailed
- Powered by **Claude Haiku 4.5**

### Podcast Generation

- Generates a natural two-host dialogue script (Alex & Sam) from your summary
- Script converted to audio via **OpenAI TTS** (`tts-1` model, `alloy` voice)
- In-browser audio player with download option
- Falls back to script-only if TTS is unavailable

### Quiz Generation

- Creates multiple-choice quizzes (5 or 10 questions) from your source material
- Interactive HTML with "Show Answer" reveal buttons
- Includes explanations for each correct answer
- Downloadable as standalone HTML

### Multi-Notebook Support

- Create, rename, and delete notebooks from the sidebar
- Each notebook has its own sources, chat history, and artifacts
- Switch between notebooks instantly

### Authentication & Persistence

- **OAuth** via Hugging Face (session expiration: 480 minutes)
- User data serialized to JSON and stored in a **private HF Dataset repo** (`Group-1-5010/notebooklm-data`)
- Manual save button with an unsaved-changes warning on page unload

---

## Architecture

```
NotebookLM/
├── app.py                      # Gradio UI, event wiring, refresh logic
├── state.py                    # Data models: UserData, Notebook, Source, Message, Artifact
├── theme.py                    # Dark theme, custom CSS, SVG logos
├── mock_data.py                # Mock responses for offline testing
├── requirements.txt            # Python dependencies
│
├── artifacts/
│   └── prompt.poml             # RAG system prompt (POML format)
│
├── assets/
│   ├── logo.svg                # App logo (SVG with gradients)
│   └── podcasts/               # Generated podcast MP3 files
│
├── ingestion_engine/           # Document processing pipeline
│   ├── ingestion_manager.py    # Orchestrates: extract → chunk → embed → upsert
│   ├── pdf_extractor.py        # PDF text extraction (PyMuPDF)
│   ├── text_extractor.py       # Plain text file reading
│   ├── url_scrapper.py         # Web page scraping (BeautifulSoup)
│   ├── transcripter.py         # YouTube transcript fetching
│   ├── chunker.py              # Recursive text chunking with overlap
│   └── embedding_generator.py  # Sentence-transformer embeddings (HF API)
│
├── persistence/                # Storage layer
│   ├── vector_store.py         # Pinecone CRUD (upsert, query, delete)
│   └── storage_service.py      # HF Dataset repo for user data persistence
│
├── services/                   # Core AI features
│   ├── rag_engine.py           # RAG pipeline: retrieve → rerank → generate
│   ├── summary_service.py      # Conversation & document summaries (Claude)
│   ├── podcast_service.py      # Podcast script generation + OpenAI TTS
│   └── quiz_service.py         # Quiz generation + interactive HTML renderer
│
└── pages/                      # UI tab handlers
    ├── chat.py                 # Chat interface with citation rendering
    ├── sources.py              # Source upload, URL add, delete, status display
    └── artifacts.py            # Summary, podcast, quiz display & generation
```

---

## Tech Stack

| Layer | Technology |
|-------|------------|
| **UI Framework** | Gradio 5.12.0 |
| **Hosting** | Hugging Face Spaces |
| **Auth** | HF OAuth |
| **Chat LLM** | Qwen/Qwen2.5-7B-Instruct (HF Inference API) |
| **Summary / Podcast / Quiz LLM** | Claude Haiku 4.5 (Anthropic API) |
| **Text-to-Speech** | OpenAI TTS (`tts-1`, `alloy` voice) |
| **Embeddings** | sentence-transformers/all-MiniLM-L6-v2 (HF Inference API) |
| **Vector Database** | Pinecone |
| **User Data Storage** | HF Dataset repo (private JSON) |
| **PDF Extraction** | PyMuPDF |
| **PPTX Extraction** | python-pptx |
| **Web Scraping** | BeautifulSoup4 + Requests |
| **YouTube Transcripts** | youtube-transcript-api |

---
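The vector-store layer can be illustrated with a short sketch. This is a hypothetical outline of the namespace-per-notebook pattern, not the actual `persistence/vector_store.py`; the helper names are assumptions, `index` stands for a `pinecone.Index` instance, and the metadata schema follows the one listed under Data Flow:

```python
def make_vectors(chunks, embeddings, source_id, filename):
    """Build Pinecone upsert payloads for one source's chunks."""
    return [
        {
            "id": f"{source_id}-{i}",
            "values": emb,
            "metadata": {
                "source_id": source_id,
                "source_filename": filename,
                "chunk_index": i,
                "text": chunk,
            },
        }
        for i, (chunk, emb) in enumerate(zip(chunks, embeddings))
    ]

def upsert_chunks(index, notebook_id, vectors):
    # namespace = notebook_id keeps each notebook's vectors isolated
    index.upsert(vectors=vectors, namespace=notebook_id)

def query_chunks(index, notebook_id, query_embedding, top_k=40):
    # Retrieve candidate chunks for the RAG rerank stage
    return index.query(vector=query_embedding, top_k=top_k,
                       namespace=notebook_id, include_metadata=True)
```

Because every notebook writes to its own namespace, deleting a notebook can drop all of its vectors in one call without touching other notebooks.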
## Setup

### Prerequisites

- Python 3.12+
- A Hugging Face account
- API keys for Pinecone, Anthropic, and OpenAI

### Environment Variables

Set these as **Secrets** in your HF Space settings (or in a `.env` for local dev):

| Variable | Description |
|----------|-------------|
| `HF_TOKEN` | Hugging Face API token (read/write access) |
| `Pinecone_API` | Pinecone API key |
| `ANTHROPIC_API_KEY` | Anthropic API key (for Claude Haiku) |
| `OPENAI_API_KEY` | OpenAI API key (for TTS audio generation) |

### Local Development

```bash
# Clone the repo
git clone https://huggingface.co/spaces/Group-1-5010/NotebookLM
cd NotebookLM

# Install dependencies
pip install -r requirements.txt

# Export your API keys
export HF_TOKEN="hf_..."
export Pinecone_API="..."
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."

# Run the app
python app.py
```

The app launches at `http://localhost:7860`.

### Deploying to HF Spaces

1. Create a new Gradio Space on Hugging Face
2. Push the code to the Space repo
3. Add all four API keys as Secrets in the Space settings
4. The Space auto-builds and deploys

---

## How It Works

### RAG Pipeline (Chat)

```
User Question
  │
  ├─→ Embed query (MiniLM-L6-v2, 384d)
  │
  ├─→ Pinecone semantic search (top-40 candidates)
  │
  ├─→ Hybrid rerank: embedding_score + 0.05 × keyword_overlap
  │
  └─→ Deduplicate → top-8 final chunks
        │
        ├─→ Build prompt (system rules + context + last 5 messages + question)
        │
        └─→ Qwen 2.5-7B-Instruct generates grounded answer with [S1], [S2] citations
```

### Artifact Generation

```
Sources (Pinecone) ──→ Claude Haiku ──→ Summary (Markdown)
        │
        ├──→ Claude Haiku ──→ Podcast Script
        │         │
        │         OpenAI TTS ──→ MP3 Audio
        │
        └──→ Claude Haiku ──→ Quiz (JSON → Interactive HTML)
```

### Data Flow

```
Upload File ──→ Extract Text ──→ Chunk (2000 chars, 200 overlap)
                                   │
                                   ├──→ Embed (HF Inference API)
                                   │
                                   └──→ Upsert to Pinecone (namespace = notebook_id)
                                          metadata: {source_id, source_filename, chunk_index, text}
```
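The hybrid rerank step above fits in a few lines. A minimal, hypothetical illustration (not the actual `rag_engine.py`) using the `embedding_score + 0.05 × keyword_overlap` formula from the diagram; each candidate is assumed to arrive from Pinecone as a dict with an embedding similarity `score` and its chunk `text`:

```python
def keyword_overlap(query, text):
    """Fraction of query terms that also appear in the chunk text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / max(len(q_terms), 1)

def hybrid_rerank(query, candidates, top_k=8, lexical_weight=0.05):
    """Combine embedding similarity with a small lexical bonus,
    then deduplicate and keep the top_k chunks."""
    scored = [
        (c["score"] + lexical_weight * keyword_overlap(query, c["text"]), c)
        for c in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    seen, result = set(), []
    for _, c in scored:
        if c["text"] not in seen:
            seen.add(c["text"])
            result.append(c)
        if len(result) == top_k:
            break
    return result
```

The small lexical weight acts as a tiebreaker: it nudges chunks that literally contain the query's keywords above semantically similar but off-topic candidates, without overriding the embedding ranking.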
---

## UI Overview

### Sidebar

- Create / rename / delete notebooks
- Notebook selector (radio buttons with source & message counts)
- Save button with unsaved-changes indicator

### Chat Tab

- Chatbot with message bubbles and citation chips
- Warning banner if no sources are uploaded yet

### Sources Tab

- Drag-and-drop file uploader (PDF, PPTX, TXT)
- URL input for web pages and YouTube videos
- Source cards showing type icon, file size, chunk count, and status badge
- Status indicators: Processing (yellow pulse), Ready (green), Failed (red with tooltip)
- Delete source dropdown

### Artifacts Tab

**Summary Sub-tab:**
- Conversation Summary: style selector (brief/detailed) + generate button
- Document Summary: source selector (checkboxes) + style selector + generate button
- Download buttons for each (`.md`)

**Podcast Sub-tab:**
- Locked until a summary is generated
- Generate button produces dialogue script + MP3 audio
- In-browser audio player
- Download button (`.mp3`)

**Quiz Sub-tab:**
- Question count selector (5 or 10)
- Interactive multiple-choice with "Show Answer" reveals
- Download button (`.html`)

---

## Design

- **Theme:** Custom dark theme (Indigo/Purple gradient)
- **Background:** `#0e1117`
- **Font:** Inter (Google Fonts)
- **Color palette:** Indigo (`#667eea`), Purple (`#764ba2`), Gold accent (`#fbbf24`), Green success (`#22c55e`)
- **Custom SVG logo** with notebook + sparkle motif

---

## Dependencies

```
gradio>=5.0.0
huggingface_hub>=0.20.0
pinecone>=5.0.0
PyMuPDF>=1.23.0
python-pptx>=0.6.21
beautifulsoup4>=4.12.0
requests>=2.31.0
youtube-transcript-api>=0.6.0
scipy>=1.11.0
anthropic>=0.40.0
openai>=1.0.0
```

---

## License

MIT