title: NotebookLM
emoji: 🚀
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.12.0
python_version: '3.12'
app_file: app.py
hf_oauth: true
hf_oauth_expiration_minutes: 480
tags:
  - gradio
pinned: false
short_description: NotebookLM - AI-Powered Study Companion
license: mit

NotebookLM — AI-Powered Study Companion

A full-featured, Google NotebookLM-inspired study tool built with Gradio on Hugging Face Spaces. Upload documents, chat with your sources using RAG, and generate study artifacts like summaries, podcasts, and quizzes — all from one interface.

Features

Chat with Your Sources (RAG)

Ask questions about uploaded documents and get grounded, cited answers
Two-stage retrieval: semantic vector search (top-40) followed by hybrid reranking (top-8) combining embedding similarity with lexical keyword overlap
Inline citation chips ([S1], [S2]) link back to source passages
Conversational context: last 5 messages included for follow-up questions
Powered by Qwen/Qwen2.5-7B-Instruct via HF Inference API
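
As a rough illustration of the citation and follow-up behaviour described above, the sketch below labels retrieved chunks as [S1], [S2], ..., keeps the last 5 messages, and pulls the cited source numbers back out of an answer. The helper names (`format_context`, `recent_history`, `extract_citations`) are hypothetical and are not taken from `rag_engine.py`.

```python
import re


def format_context(chunks: list[str]) -> str:
    """Label each retrieved chunk so the model can cite it as [S1], [S2], ..."""
    return "\n\n".join(f"[S{i}] {text}" for i, text in enumerate(chunks, start=1))


def recent_history(messages: list[dict], limit: int = 5) -> list[dict]:
    """Keep only the most recent messages for follow-up questions."""
    return messages[-limit:]


def extract_citations(answer: str, num_chunks: int) -> list[int]:
    """Return distinct, in-range source numbers in order of first appearance."""
    cited: list[int] = []
    for match in re.finditer(r"\[S(\d+)\]", answer):
        number = int(match.group(1))
        if 1 <= number <= num_chunks and number not in cited:
            cited.append(number)
    return cited
```

Dropping out-of-range numbers is one way to avoid rendering a chip for a source the model invented; whether the app does this is not stated here.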

Multi-Format Document Ingestion

Format	Extractor	Notes
PDF	PyMuPDF (fitz)	Full-page text extraction
PPTX	python-pptx	Text from all slides and shapes
TXT	Built-in	UTF-8 plain text
Web URLs	BeautifulSoup	Strips nav/footer/scripts, extracts article content
YouTube	youtube-transcript-api	Auto-fetches video transcripts

Max file size: 15 MB
Max sources per notebook: 20
Duplicate detection for both files and URLs

Ingestion Pipeline

Upload/URL → Text Extraction → Recursive Chunking → Embedding Generation → Pinecone Upsert

Chunking: Recursive character splitting (2000 chars per chunk, 200 char overlap), trying separators from coarsest to finest: "\n\n" → "\n" → ". " → " "
Embeddings: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions) via HF Inference API
Vector Store: Pinecone with namespace-per-notebook isolation
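
To make the chunk size, overlap, and separator priority concrete, here is a simplified greedy splitter. It is not the recursive algorithm in `chunker.py`: it cuts each 2000-character window at the last occurrence of the coarsest separator it can find, then starts the next chunk 200 characters earlier. The separator list, including the trailing space, is an assumption.

```python
def chunk_text(
    text: str,
    chunk_size: int = 2000,
    overlap: int = 200,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Greedy separator-priority chunking with a fixed character overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer the coarsest separator; search past the overlap region
            # so that every chunk contributes some new text.
            for sep in separators:
                cut = text.rfind(sep, start + overlap + 1, end)
                if cut != -1:
                    end = cut + len(sep)
                    break
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # the next chunk repeats the last `overlap` chars
    return chunks
```

Because each chunk begins exactly `overlap` characters before the previous one ended, the original text can be rebuilt by dropping the first 200 characters of every chunk after the first.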

Document Summary

Summarize content from your uploaded sources using semantic retrieval
Source selection: Choose specific sources to include via checkbox selector
Styles: Brief (3-5 bullets, <150 words) or Detailed (sectioned, 300-500 words)
Powered by Claude Haiku 4.5

Conversation Summary

Summarize your chat history into structured notes
Styles: Brief or Detailed
Powered by Claude Haiku 4.5

Podcast Generation

Generates a natural two-host dialogue script (Alex & Sam) from your summary
Script converted to audio via OpenAI TTS (tts-1 model, alloy voice)
In-browser audio player with download option
Falls back to script-only if TTS is unavailable
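
The script-only fallback can be pictured as follows. This sketch assumes the script uses one `Alex: ...` or `Sam: ...` line per turn, and it takes the text-to-speech step as an injected callable rather than calling the OpenAI client directly. `parse_script` and `make_podcast` are hypothetical names.

```python
from typing import Callable, Optional


def parse_script(script: str) -> list[tuple[str, str]]:
    """Split a dialogue script into (speaker, line) turns for the two hosts."""
    turns: list[tuple[str, str]] = []
    for line in script.splitlines():
        speaker, sep, text = line.partition(":")
        if sep and speaker.strip() in ("Alex", "Sam") and text.strip():
            turns.append((speaker.strip(), text.strip()))
    return turns


def make_podcast(script: str, synthesize: Callable[[str], bytes]) -> dict:
    """Return the script plus audio bytes, or audio=None if synthesis fails."""
    spoken = " ".join(text for _, text in parse_script(script))
    audio: Optional[bytes]
    try:
        audio = synthesize(spoken)
    except Exception:
        audio = None  # fall back to script-only
    return {"script": script, "audio": audio}
```

Passing the synthesizer in as a parameter keeps the fallback path easy to exercise without network access.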

Quiz Generation

Creates multiple-choice quizzes (5 or 10 questions) from your source material
Interactive HTML with "Show Answer" reveal buttons
Includes explanations for each correct answer
Downloadable as standalone HTML
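
A standalone quiz page might be rendered roughly like this. The JSON shape (`question`, `options`, `answer`, `explanation`) is an assumed schema, and the sketch uses the HTML `<details>` element for the reveal instead of a scripted button, so it may differ from what `quiz_service.py` produces.

```python
import html
import json


def render_quiz(quiz_json: str) -> str:
    """Turn a JSON list of questions into a self-contained HTML page."""
    questions = json.loads(quiz_json)
    parts = ["<!DOCTYPE html><html><body>"]
    for number, q in enumerate(questions, start=1):
        parts.append(f"<h3>{number}. {html.escape(q['question'])}</h3>")
        parts.append("<ol type='A'>")
        parts.extend(f"<li>{html.escape(option)}</li>" for option in q["options"])
        parts.append("</ol>")
        # <details> gives a click-to-reveal answer without any JavaScript.
        parts.append(
            "<details><summary>Show Answer</summary>"
            f"<p><b>{html.escape(q['answer'])}</b>: "
            f"{html.escape(q['explanation'])}</p></details>"
        )
    parts.append("</body></html>")
    return "\n".join(parts)
```

Escaping every model-generated string keeps stray `<` or `&` characters in a question from breaking the downloaded file.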

Multi-Notebook Support

Create, rename, and delete notebooks from the sidebar
Each notebook has its own sources, chat history, and artifacts
Switch between notebooks instantly

Authentication & Persistence

OAuth via Hugging Face (session expiration: 480 minutes)
User data serialized to JSON and stored in a private HF Dataset repo (Group-1-5010/notebooklm-data)
Manual save button with unsaved-changes warning on page unload
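
The JSON round trip could look something like the sketch below. The class names echo those listed for `state.py`, but every field shown here is a guess made for illustration, and `Source` and `Artifact` are omitted for brevity.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Message:
    role: str      # hypothetical field
    content: str   # hypothetical field


@dataclass
class Notebook:
    id: str
    title: str
    messages: list[Message] = field(default_factory=list)


@dataclass
class UserData:
    user_name: str
    notebooks: list[Notebook] = field(default_factory=list)


def to_json(data: UserData) -> str:
    """Serialize nested dataclasses to a JSON string."""
    return json.dumps(asdict(data), indent=2)


def from_json(raw: str) -> UserData:
    """Rebuild the dataclass tree from its JSON form."""
    obj = json.loads(raw)
    notebooks = [
        Notebook(
            id=nb["id"],
            title=nb["title"],
            messages=[Message(**m) for m in nb["messages"]],
        )
        for nb in obj["notebooks"]
    ]
    return UserData(user_name=obj["user_name"], notebooks=notebooks)
```

The resulting string is what would be written to the private dataset repo, for example with `huggingface_hub`'s `upload_file` and `repo_type="dataset"`.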

Architecture

NotebookLM/
├── app.py                          # Gradio UI, event wiring, refresh logic
├── state.py                        # Data models: UserData, Notebook, Source, Message, Artifact
├── theme.py                        # Dark theme, custom CSS, SVG logos
├── mock_data.py                    # Mock responses for offline testing
├── requirements.txt                # Python dependencies
│
├── artifacts/
│   └── prompt.poml                 # RAG system prompt (POML format)
│
├── assets/
│   ├── logo.svg                    # App logo (SVG with gradients)
│   └── podcasts/                   # Generated podcast MP3 files
│
├── ingestion_engine/               # Document processing pipeline
│   ├── ingestion_manager.py        # Orchestrates: extract → chunk → embed → upsert
│   ├── pdf_extractor.py            # PDF text extraction (PyMuPDF)
│   ├── text_extractor.py           # Plain text file reading
│   ├── url_scrapper.py             # Web page scraping (BeautifulSoup)
│   ├── transcripter.py             # YouTube transcript fetching
│   ├── chunker.py                  # Recursive text chunking with overlap
│   └── embedding_generator.py      # Sentence-transformer embeddings (HF API)
│
├── persistence/                    # Storage layer
│   ├── vector_store.py             # Pinecone CRUD (upsert, query, delete)
│   └── storage_service.py          # HF Dataset repo for user data persistence
│
├── services/                       # Core AI features
│   ├── rag_engine.py               # RAG pipeline: retrieve → rerank → generate
│   ├── summary_service.py          # Conversation & document summaries (Claude)
│   ├── podcast_service.py          # Podcast script generation + OpenAI TTS
│   └── quiz_service.py             # Quiz generation + interactive HTML renderer
│
└── pages/                          # UI tab handlers
    ├── chat.py                     # Chat interface with citation rendering
    ├── sources.py                  # Source upload, URL add, delete, status display
    └── artifacts.py                # Summary, podcast, quiz display & generation

Tech Stack

Layer	Technology
UI Framework	Gradio 5.12.0
Hosting	Hugging Face Spaces
Auth	HF OAuth
Chat LLM	Qwen/Qwen2.5-7B-Instruct (HF Inference API)
Summary / Podcast / Quiz LLM	Claude Haiku 4.5 (Anthropic API)
Text-to-Speech	OpenAI TTS (`tts-1`, `alloy` voice)
Embeddings	sentence-transformers/all-MiniLM-L6-v2 (HF Inference API)
Vector Database	Pinecone
User Data Storage	HF Dataset repo (private JSON)
PDF Extraction	PyMuPDF
PPTX Extraction	python-pptx
Web Scraping	BeautifulSoup4 + Requests
YouTube Transcripts	youtube-transcript-api

Setup

Prerequisites

Python 3.12+
A Hugging Face account
API keys for Pinecone, Anthropic, and OpenAI

Environment Variables

Set these as Secrets in your HF Space settings (or in a .env for local dev):

Variable	Description
`HF_TOKEN`	Hugging Face API token (read/write access)
`Pinecone_API`	Pinecone API key
`ANTHROPIC_API_KEY`	Anthropic API key (for Claude Haiku)
`OPENAI_API_KEY`	OpenAI API key (for TTS audio generation)
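
A quick startup check for these four variables might look like the following. `missing_secrets` is a hypothetical helper, not part of `app.py`.

```python
import os
from collections.abc import Mapping

REQUIRED_SECRETS = ("HF_TOKEN", "Pinecone_API", "ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def missing_secrets(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_secrets()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
```

Note that `Pinecone_API` is mixed-case, unlike the other three names, and environment variable names are case-sensitive on Linux.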

Local Development

# Clone the repo
git clone https://huggingface.co/spaces/Group-1-5010/NotebookLM
cd NotebookLM

# Install dependencies
pip install -r requirements.txt

# Export your API keys
export HF_TOKEN="hf_..."
export Pinecone_API="..."
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."

# Run the app
python app.py

The app launches at http://localhost:7860.

Deploying to HF Spaces

1. Create a new Gradio Space on Hugging Face.
2. Push the code to the Space repo.
3. Add all four API keys as Secrets in the Space settings.
4. The Space then builds and deploys automatically.

How It Works

RAG Pipeline (Chat)

User Question
    │
    ├─→ Embed query (MiniLM-L6-v2, 384d)
    │
    ├─→ Pinecone semantic search (top-40 candidates)
    │
    ├─→ Hybrid rerank: embedding_score + 0.05 × keyword_overlap
    │       └─→ Deduplicate → top-8 final chunks
    │
    ├─→ Build prompt (system rules + context + last 5 messages + question)
    │
    └─→ Qwen 2.5-7B-Instruct generates grounded answer with [S1], [S2] citations
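
The rerank step in the diagram can be sketched as below. The formula `embedding_score + 0.05 × keyword_overlap` comes from the diagram; how `keyword_overlap` is computed and how duplicates are identified are assumptions (here: the fraction of distinct query words found in the chunk, and exact text match).

```python
import re


def keyword_overlap(query: str, text: str) -> float:
    """Fraction of distinct query words that also occur in the text (assumed definition)."""
    query_words = set(re.findall(r"\w+", query.lower()))
    text_words = set(re.findall(r"\w+", text.lower()))
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def rerank(
    query: str,
    candidates: list[tuple[str, float]],
    top_k: int = 8,
    weight: float = 0.05,
) -> list[tuple[str, float]]:
    """Rescore (text, embedding_score) pairs, drop duplicate texts, keep top_k."""
    best: dict[str, float] = {}
    for text, embedding_score in candidates:
        score = embedding_score + weight * keyword_overlap(query, text)
        if text not in best or score > best[text]:
            best[text] = score  # deduplicate by exact text, keeping the best score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]
```

With a weight of 0.05, the lexical term acts mostly as a tie-breaker between chunks whose embedding scores are close.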

Artifact Generation

Sources (Pinecone) ──→ Claude Haiku ──→ Summary (Markdown)
                                            │
                                            ├──→ Claude Haiku ──→ Podcast Script
                                            │                         │
                                            │                    OpenAI TTS ──→ MP3 Audio
                                            │
                                            └──→ Claude Haiku ──→ Quiz (JSON → Interactive HTML)

Data Flow

Upload File ──→ Extract Text ──→ Chunk (2000 chars, 200 overlap)
                                      │
                                      ├──→ Embed (HF Inference API)
                                      │
                                      └──→ Upsert to Pinecone (namespace = notebook_id)
                                                metadata: {source_id, source_filename, chunk_index, text}
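
For illustration, the records sent to Pinecone could be assembled like this. The metadata keys match the diagram; the `"{source_id}-{chunk_index}"` ID scheme and the `build_records` helper are assumptions.

```python
def build_records(
    source_id: str,
    source_filename: str,
    chunks: list[str],
    embeddings: list[list[float]],
) -> list[dict]:
    """Pair each chunk with its embedding in Pinecone's id/values/metadata shape."""
    if len(chunks) != len(embeddings):
        raise ValueError("chunks and embeddings must have the same length")
    return [
        {
            "id": f"{source_id}-{index}",  # hypothetical ID scheme
            "values": vector,
            "metadata": {
                "source_id": source_id,
                "source_filename": source_filename,
                "chunk_index": index,
                "text": chunk,
            },
        }
        for index, (chunk, vector) in enumerate(zip(chunks, embeddings))
    ]
```

A list like this is the shape the Pinecone client accepts in `index.upsert(vectors=records, namespace=notebook_id)`, which is how namespace-per-notebook isolation would be applied.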

UI Overview

Sidebar

Create / rename / delete notebooks
Notebook selector (radio buttons with source & message counts)
Save button with unsaved-changes indicator

Chat Tab

Chatbot with message bubbles and citation chips
Warning banner if no sources are uploaded yet

Sources Tab

Drag-and-drop file uploader (PDF, PPTX, TXT)
URL input for web pages and YouTube videos
Source cards showing type icon, file size, chunk count, and status badge
Status indicators: Processing (yellow pulse), Ready (green), Failed (red with tooltip)
Delete source dropdown

Artifacts Tab

Summary Sub-tab:

Conversation Summary: style selector (brief/detailed) + generate button
Document Summary: source selector (checkboxes) + style selector + generate button
Download buttons for each (.md)

Podcast Sub-tab:

Locked until a summary is generated
Generate button produces dialogue script + MP3 audio
In-browser audio player
Download button (.mp3)

Quiz Sub-tab:

Question count selector (5 or 10)
Interactive multiple-choice with "Show Answer" reveals
Download button (.html)

Design

Theme: Custom dark theme (Indigo/Purple gradient)
Background: #0e1117
Font: Inter (Google Fonts)
Color palette: Indigo (#667eea), Purple (#764ba2), Gold accent (#fbbf24), Green success (#22c55e)
Custom SVG logo with notebook + sparkle motif

Dependencies

gradio>=5.0.0
huggingface_hub>=0.20.0
pinecone>=5.0.0
PyMuPDF>=1.23.0
python-pptx>=0.6.21
beautifulsoup4>=4.12.0
requests>=2.31.0
youtube-transcript-api>=0.6.0
scipy>=1.11.0
anthropic>=0.40.0
openai>=1.0.0

License

MIT