NotebookLM / README.md
internomega-terrablue
Readme updation
e0ffa13

A newer version of the Gradio SDK is available: 6.9.0

Upgrade
metadata
title: NotebookLM
emoji: πŸš€
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.12.0
python_version: '3.12'
app_file: app.py
hf_oauth: true
hf_oauth_expiration_minutes: 480
tags:
  - gradio
pinned: false
short_description: NotebookLM - AI-Powered Study Companion
license: mit

NotebookLM β€” AI-Powered Study Companion

A full-featured, Google NotebookLM-inspired study tool built with Gradio on Hugging Face Spaces. Upload documents, chat with your sources using RAG, and generate study artifacts like summaries, podcasts, and quizzes β€” all from one interface.


Features

Chat with Your Sources (RAG)

  • Ask questions about uploaded documents and get grounded, cited answers
  • Two-stage retrieval: semantic vector search (top-40) followed by hybrid reranking (top-8) combining embedding similarity with lexical keyword overlap
  • Inline citation chips ([S1], [S2]) link back to source passages
  • Conversational context: last 5 messages included for follow-up questions
  • Powered by Qwen/Qwen2.5-7B-Instruct via HF Inference API

Multi-Format Document Ingestion

Format Extractor Notes
PDF PyMuPDF (fitz) Full-page text extraction
PPTX python-pptx Text from all slides and shapes
TXT Built-in UTF-8 plain text
Web URLs BeautifulSoup Strips nav/footer/scripts, extracts article content
YouTube youtube-transcript-api Auto-fetches video transcripts
  • Max file size: 15 MB
  • Max sources per notebook: 20
  • Duplicate detection for both files and URLs

Ingestion Pipeline

Upload/URL β†’ Text Extraction β†’ Recursive Chunking β†’ Embedding Generation β†’ Pinecone Upsert
  • Chunking: Recursive character splitting (2000 chars per chunk, 200 char overlap) with separators \n\n β†’ \n β†’ . β†’
  • Embeddings: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions) via HF Inference API
  • Vector Store: Pinecone with namespace-per-notebook isolation

Document Summary

  • Summarize content from your uploaded sources using semantic retrieval
  • Source selection: Choose specific sources to include via checkbox selector
  • Styles: Brief (3-5 bullets, <150 words) or Detailed (sectioned, 300-500 words)
  • Powered by Claude Haiku 4.5

Conversation Summary

  • Summarize your chat history into structured notes
  • Styles: Brief or Detailed
  • Powered by Claude Haiku 4.5

Podcast Generation

  • Generates a natural two-host dialogue script (Alex & Sam) from your summary
  • Script converted to audio via OpenAI TTS (tts-1 model, alloy voice)
  • In-browser audio player with download option
  • Falls back to script-only if TTS is unavailable

Quiz Generation

  • Creates multiple-choice quizzes (5 or 10 questions) from your source material
  • Interactive HTML with "Show Answer" reveal buttons
  • Includes explanations for each correct answer
  • Downloadable as standalone HTML

Multi-Notebook Support

  • Create, rename, and delete notebooks from the sidebar
  • Each notebook has its own sources, chat history, and artifacts
  • Switch between notebooks instantly

Authentication & Persistence

  • OAuth via Hugging Face (session expiration: 480 minutes)
  • User data serialized to JSON and stored in a private HF Dataset repo (Group-1-5010/notebooklm-data)
  • Manual save button with unsaved-changes warning on page unload

Architecture

NotebookLM/
β”œβ”€β”€ app.py                          # Gradio UI, event wiring, refresh logic
β”œβ”€β”€ state.py                        # Data models: UserData, Notebook, Source, Message, Artifact
β”œβ”€β”€ theme.py                        # Dark theme, custom CSS, SVG logos
β”œβ”€β”€ mock_data.py                    # Mock responses for offline testing
β”œβ”€β”€ requirements.txt                # Python dependencies
β”‚
β”œβ”€β”€ artifacts/
β”‚   └── prompt.poml                 # RAG system prompt (POML format)
β”‚
β”œβ”€β”€ assets/
β”‚   β”œβ”€β”€ logo.svg                    # App logo (SVG with gradients)
β”‚   └── podcasts/                   # Generated podcast MP3 files
β”‚
β”œβ”€β”€ ingestion_engine/               # Document processing pipeline
β”‚   β”œβ”€β”€ ingestion_manager.py        # Orchestrates: extract β†’ chunk β†’ embed β†’ upsert
β”‚   β”œβ”€β”€ pdf_extractor.py            # PDF text extraction (PyMuPDF)
β”‚   β”œβ”€β”€ text_extractor.py           # Plain text file reading
β”‚   β”œβ”€β”€ url_scrapper.py             # Web page scraping (BeautifulSoup)
β”‚   β”œβ”€β”€ transcripter.py             # YouTube transcript fetching
β”‚   β”œβ”€β”€ chunker.py                  # Recursive text chunking with overlap
β”‚   └── embedding_generator.py      # Sentence-transformer embeddings (HF API)
β”‚
β”œβ”€β”€ persistence/                    # Storage layer
β”‚   β”œβ”€β”€ vector_store.py             # Pinecone CRUD (upsert, query, delete)
β”‚   └── storage_service.py          # HF Dataset repo for user data persistence
β”‚
β”œβ”€β”€ services/                       # Core AI features
β”‚   β”œβ”€β”€ rag_engine.py               # RAG pipeline: retrieve β†’ rerank β†’ generate
β”‚   β”œβ”€β”€ summary_service.py          # Conversation & document summaries (Claude)
β”‚   β”œβ”€β”€ podcast_service.py          # Podcast script generation + OpenAI TTS
β”‚   └── quiz_service.py             # Quiz generation + interactive HTML renderer
β”‚
└── pages/                          # UI tab handlers
    β”œβ”€β”€ chat.py                     # Chat interface with citation rendering
    β”œβ”€β”€ sources.py                  # Source upload, URL add, delete, status display
    └── artifacts.py                # Summary, podcast, quiz display & generation

Tech Stack

Layer Technology
UI Framework Gradio 5.12.0
Hosting Hugging Face Spaces
Auth HF OAuth
Chat LLM Qwen/Qwen2.5-7B-Instruct (HF Inference API)
Summary / Podcast / Quiz LLM Claude Haiku 4.5 (Anthropic API)
Text-to-Speech OpenAI TTS (tts-1, alloy voice)
Embeddings sentence-transformers/all-MiniLM-L6-v2 (HF Inference API)
Vector Database Pinecone
User Data Storage HF Dataset repo (private JSON)
PDF Extraction PyMuPDF
PPTX Extraction python-pptx
Web Scraping BeautifulSoup4 + Requests
YouTube Transcripts youtube-transcript-api

Setup

Prerequisites

  • Python 3.12+
  • A Hugging Face account
  • API keys for Pinecone, Anthropic, and OpenAI

Environment Variables

Set these as Secrets in your HF Space settings (or in a .env for local dev):

Variable Description
HF_TOKEN Hugging Face API token (read/write access)
Pinecone_API Pinecone API key
ANTHROPIC_API_KEY Anthropic API key (for Claude Haiku)
OPENAI_API_KEY OpenAI API key (for TTS audio generation)

Local Development

# Clone the repo
git clone https://huggingface.co/spaces/Group-1-5010/NotebookLM
cd NotebookLM

# Install dependencies
pip install -r requirements.txt

# Export your API keys
export HF_TOKEN="hf_..."
export Pinecone_API="..."
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."

# Run the app
python app.py

The app launches at http://localhost:7860.

Deploying to HF Spaces

  1. Create a new Gradio Space on Hugging Face
  2. Push the code to the Space repo
  3. Add all four API keys as Secrets in the Space settings
  4. The Space auto-builds and deploys

How It Works

RAG Pipeline (Chat)

User Question
    β”‚
    β”œβ”€β†’ Embed query (MiniLM-L6-v2, 384d)
    β”‚
    β”œβ”€β†’ Pinecone semantic search (top-40 candidates)
    β”‚
    β”œβ”€β†’ Hybrid rerank: embedding_score + 0.05 Γ— keyword_overlap
    β”‚       └─→ Deduplicate β†’ top-8 final chunks
    β”‚
    β”œβ”€β†’ Build prompt (system rules + context + last 5 messages + question)
    β”‚
    └─→ Qwen 2.5-7B-Instruct generates grounded answer with [S1], [S2] citations

Artifact Generation

Sources (Pinecone) ──→ Claude Haiku ──→ Summary (Markdown)
                                            β”‚
                                            β”œβ”€β”€β†’ Claude Haiku ──→ Podcast Script
                                            β”‚                         β”‚
                                            β”‚                    OpenAI TTS ──→ MP3 Audio
                                            β”‚
                                            └──→ Claude Haiku ──→ Quiz (JSON β†’ Interactive HTML)

Data Flow

Upload File ──→ Extract Text ──→ Chunk (500 tokens, 50 overlap)
                                      β”‚
                                      β”œβ”€β”€β†’ Embed (HF Inference API)
                                      β”‚
                                      └──→ Upsert to Pinecone (namespace = notebook_id)
                                                metadata: {source_id, source_filename, chunk_index, text}

UI Overview

Sidebar

  • Create / rename / delete notebooks
  • Notebook selector (radio buttons with source & message counts)
  • Save button with unsaved-changes indicator

Chat Tab

  • Chatbot with message bubbles and citation chips
  • Warning banner if no sources are uploaded yet

Sources Tab

  • Drag-and-drop file uploader (PDF, PPTX, TXT)
  • URL input for web pages and YouTube videos
  • Source cards showing type icon, file size, chunk count, and status badge
  • Status indicators: Processing (yellow pulse), Ready (green), Failed (red with tooltip)
  • Delete source dropdown

Artifacts Tab

Summary Sub-tab:

  • Conversation Summary: style selector (brief/detailed) + generate button
  • Document Summary: source selector (checkboxes) + style selector + generate button
  • Download buttons for each (.md)

Podcast Sub-tab:

  • Locked until a summary is generated
  • Generate button produces dialogue script + MP3 audio
  • In-browser audio player
  • Download button (.mp3)

Quiz Sub-tab:

  • Question count selector (5 or 10)
  • Interactive multiple-choice with "Show Answer" reveals
  • Download button (.html)

Design

  • Theme: Custom dark theme (Indigo/Purple gradient)
  • Background: #0e1117
  • Font: Inter (Google Fonts)
  • Color palette: Indigo (#667eea), Purple (#764ba2), Gold accent (#fbbf24), Green success (#22c55e)
  • Custom SVG logo with notebook + sparkle motif

Dependencies

gradio>=5.0.0
huggingface_hub>=0.20.0
pinecone>=5.0.0
PyMuPDF>=1.23.0
python-pptx>=0.6.21
beautifulsoup4>=4.12.0
requests>=2.31.0
youtube-transcript-api>=0.6.0
scipy>=1.11.0
anthropic>=0.40.0
openai>=1.0.0

License

MIT