---
title: NotebookLM
emoji: 🚀
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: "5.12.0"
python_version: "3.12"
app_file: app.py
hf_oauth: true
hf_oauth_expiration_minutes: 480
tags:
- gradio
pinned: false
short_description: NotebookLM - AI-Powered Study Companion
license: mit
---
# NotebookLM - AI-Powered Study Companion
A full-featured, Google NotebookLM-inspired study tool built with **Gradio** on **Hugging Face Spaces**. Upload documents, chat with your sources using RAG, and generate study artifacts like summaries, podcasts, and quizzes, all from one interface.
---
## Features
### Chat with Your Sources (RAG)
- Ask questions about uploaded documents and get grounded, cited answers
- Two-stage retrieval: semantic vector search (top-40) followed by hybrid reranking (top-8) combining embedding similarity with lexical keyword overlap
- Inline citation chips (`[S1]`, `[S2]`) link back to source passages
- Conversational context: last 5 messages included for follow-up questions
- Powered by **Qwen/Qwen2.5-7B-Instruct** via HF Inference API
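The two-stage retrieval above (semantic top-40, then hybrid rerank to top-8) can be sketched as follows. This is a minimal illustration of the scoring formula, not the actual `rag_engine.py` code; the function name, dict keys, and the 0.05 weight default are assumptions based on the description above.

```python
def hybrid_rerank(query, candidates, weight=0.05, top_k=8):
    """Re-score semantic hits with a lexical keyword-overlap bonus.

    `candidates` is a list of dicts with "text" and "score" keys,
    where "score" is the vector-search similarity.
    """
    query_terms = set(query.lower().split())
    seen, scored = set(), []
    for c in candidates:
        text = c["text"]
        if text in seen:  # drop duplicate chunks
            continue
        seen.add(text)
        overlap = len(query_terms & set(text.lower().split()))
        scored.append((c["score"] + weight * overlap, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]
```

The small weight keeps the embedding similarity dominant while letting exact keyword matches break near-ties.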
### Multi-Format Document Ingestion
| Format | Extractor | Notes |
|--------|-----------|-------|
| **PDF** | PyMuPDF (fitz) | Full-page text extraction |
| **PPTX** | python-pptx | Text from all slides and shapes |
| **TXT** | Built-in | UTF-8 plain text |
| **Web URLs** | BeautifulSoup | Strips nav/footer/scripts, extracts article content |
| **YouTube** | youtube-transcript-api | Auto-fetches video transcripts |
- Max file size: **15 MB**
- Max sources per notebook: **20**
- Duplicate detection for both files and URLs
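Duplicate detection can be done by fingerprinting each source; a plausible sketch is hashing file bytes (or a normalized URL) and comparing against fingerprints already in the notebook. The helper below is hypothetical, not the app's actual check.

```python
import hashlib

def source_fingerprint(data: bytes = b"", url: str = "") -> str:
    """Return a stable hash for a file's bytes or a normalized URL,
    so re-adding the same source can be detected."""
    if url:
        data = url.strip().rstrip("/").lower().encode("utf-8")
    return hashlib.sha256(data).hexdigest()
```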
### Ingestion Pipeline
```
Upload/URL → Text Extraction → Recursive Chunking → Embedding Generation → Pinecone Upsert
```
- **Chunking:** Recursive character splitting (2000 chars per chunk, 200 char overlap) with separators `\n\n` → `\n` → `. ` → ` `
- **Embeddings:** `sentence-transformers/all-MiniLM-L6-v2` (384 dimensions) via HF Inference API
- **Vector Store:** Pinecone with namespace-per-notebook isolation
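The recursive chunking step can be sketched like this: try separators from coarsest to finest until the pieces fit, then greedily pack them back up to the size limit, carrying an overlap between chunks. This is a simplified illustration of the strategy described above, not the exact `chunker.py` implementation.

```python
def recursive_chunk(text, size=2000, overlap=200,
                    separators=("\n\n", "\n", ". ", " ")):
    """Split text on the coarsest separator whose pieces fit `size`,
    then re-pack pieces into chunks with `overlap` chars of carry-over."""
    if len(text) <= size:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1 and max(len(p) for p in parts) <= size:
            break
    else:
        # No separator helps: fall back to a hard cut with overlap.
        return [text[i:i + size] for i in range(0, len(text), size - overlap)]
    chunks, current = [], ""
    for part in parts:
        candidate = (current + sep + part) if current else part
        if len(candidate) > size:
            chunks.append(current)
            current = current[-overlap:] + sep + part  # carry the overlap
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```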
### Document Summary
- Summarize content from your uploaded sources using semantic retrieval
- **Source selection:** Choose specific sources to include via checkbox selector
- **Styles:** Brief (3-5 bullets, <150 words) or Detailed (sectioned, 300-500 words)
- Powered by **Claude Haiku 4.5**
### Conversation Summary
- Summarize your chat history into structured notes
- **Styles:** Brief or Detailed
- Powered by **Claude Haiku 4.5**
### Podcast Generation
- Generates a natural two-host dialogue script (Alex & Sam) from your summary
- Script converted to audio via **OpenAI TTS** (`tts-1` model, `alloy` voice)
- In-browser audio player with download option
- Falls back to script-only if TTS is unavailable
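The script-to-audio step might look like the sketch below: parse the Alex/Sam dialogue into lines, then send the text to OpenAI's speech endpoint (`tts-1`, `alloy`). The API call shape is OpenAI's real `audio.speech.create`; the helper names and script format are assumptions, not `podcast_service.py` itself.

```python
def parse_script(script: str) -> list[tuple[str, str]]:
    """Split an 'Alex: ... / Sam: ...' script into (speaker, line) pairs."""
    pairs = []
    for raw in script.splitlines():
        if ":" in raw:
            speaker, _, line = raw.partition(":")
            if speaker.strip() in ("Alex", "Sam"):
                pairs.append((speaker.strip(), line.strip()))
    return pairs

def synthesize(script: str, out_path: str = "podcast.mp3") -> None:
    """Convert the full dialogue to a single MP3 (needs OPENAI_API_KEY)."""
    from openai import OpenAI
    text = " ".join(line for _, line in parse_script(script))
    audio = OpenAI().audio.speech.create(model="tts-1", voice="alloy",
                                         input=text)
    audio.write_to_file(out_path)
```

If the TTS call raises (missing key, quota), the app can return just the parsed script, matching the script-only fallback noted above.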
### Quiz Generation
- Creates multiple-choice quizzes (5 or 10 questions) from your source material
- Interactive HTML with "Show Answer" reveal buttons
- Includes explanations for each correct answer
- Downloadable as standalone HTML
### Multi-Notebook Support
- Create, rename, and delete notebooks from the sidebar
- Each notebook has its own sources, chat history, and artifacts
- Switch between notebooks instantly
### Authentication & Persistence
- **OAuth** via Hugging Face (session expiration: 480 minutes)
- User data serialized to JSON and stored in a **private HF Dataset repo** (`Group-1-5010/notebooklm-data`)
- Manual save button with unsaved-changes warning on page unload
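Persistence can be sketched as: serialize the user's notebooks to JSON, then push the blob into the private dataset repo with `huggingface_hub`'s real `HfApi.upload_file`. The per-user path layout below is an assumption, not taken from `storage_service.py`.

```python
import io
import json

def serialize_user(user: dict) -> bytes:
    """Dump user data to JSON bytes ready for upload."""
    return json.dumps(user, ensure_ascii=False, indent=2).encode("utf-8")

def save_to_hub(user: dict, username: str,
                repo_id: str = "Group-1-5010/notebooklm-data") -> None:
    """Push the JSON blob into the private HF Dataset repo (needs HF_TOKEN)."""
    from huggingface_hub import HfApi
    HfApi().upload_file(
        path_or_fileobj=io.BytesIO(serialize_user(user)),
        path_in_repo=f"users/{username}.json",  # assumed layout
        repo_id=repo_id,
        repo_type="dataset",
    )
```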
---
## Architecture
```
NotebookLM/
├── app.py                  # Gradio UI, event wiring, refresh logic
├── state.py                # Data models: UserData, Notebook, Source, Message, Artifact
├── theme.py                # Dark theme, custom CSS, SVG logos
├── mock_data.py            # Mock responses for offline testing
├── requirements.txt        # Python dependencies
│
├── artifacts/
│   └── prompt.poml         # RAG system prompt (POML format)
│
├── assets/
│   ├── logo.svg            # App logo (SVG with gradients)
│   └── podcasts/           # Generated podcast MP3 files
│
├── ingestion_engine/       # Document processing pipeline
│   ├── ingestion_manager.py    # Orchestrates: extract → chunk → embed → upsert
│   ├── pdf_extractor.py        # PDF text extraction (PyMuPDF)
│   ├── text_extractor.py       # Plain text file reading
│   ├── url_scrapper.py         # Web page scraping (BeautifulSoup)
│   ├── transcripter.py         # YouTube transcript fetching
│   ├── chunker.py              # Recursive text chunking with overlap
│   └── embedding_generator.py  # Sentence-transformer embeddings (HF API)
│
├── persistence/            # Storage layer
│   ├── vector_store.py         # Pinecone CRUD (upsert, query, delete)
│   └── storage_service.py      # HF Dataset repo for user data persistence
│
├── services/               # Core AI features
│   ├── rag_engine.py           # RAG pipeline: retrieve → rerank → generate
│   ├── summary_service.py      # Conversation & document summaries (Claude)
│   ├── podcast_service.py      # Podcast script generation + OpenAI TTS
│   └── quiz_service.py         # Quiz generation + interactive HTML renderer
│
└── pages/                  # UI tab handlers
    ├── chat.py                 # Chat interface with citation rendering
    ├── sources.py              # Source upload, URL add, delete, status display
    └── artifacts.py            # Summary, podcast, quiz display & generation
```
---
## Tech Stack
| Layer | Technology |
|-------|-----------|
| **UI Framework** | Gradio 5.12.0 |
| **Hosting** | Hugging Face Spaces |
| **Auth** | HF OAuth |
| **Chat LLM** | Qwen/Qwen2.5-7B-Instruct (HF Inference API) |
| **Summary / Podcast / Quiz LLM** | Claude Haiku 4.5 (Anthropic API) |
| **Text-to-Speech** | OpenAI TTS (`tts-1`, `alloy` voice) |
| **Embeddings** | sentence-transformers/all-MiniLM-L6-v2 (HF Inference API) |
| **Vector Database** | Pinecone |
| **User Data Storage** | HF Dataset repo (private JSON) |
| **PDF Extraction** | PyMuPDF |
| **PPTX Extraction** | python-pptx |
| **Web Scraping** | BeautifulSoup4 + Requests |
| **YouTube Transcripts** | youtube-transcript-api |
---
## Setup
### Prerequisites
- Python 3.12+
- A Hugging Face account
- API keys for Pinecone, Anthropic, and OpenAI
### Environment Variables
Set these as **Secrets** in your HF Space settings (or in a `.env` for local dev):
| Variable | Description |
|----------|-------------|
| `HF_TOKEN` | Hugging Face API token (read/write access) |
| `Pinecone_API` | Pinecone API key |
| `ANTHROPIC_API_KEY` | Anthropic API key (for Claude Haiku) |
| `OPENAI_API_KEY` | OpenAI API key (for TTS audio generation) |
### Local Development
```bash
# Clone the repo
git clone https://huggingface.co/spaces/Group-1-5010/NotebookLM
cd NotebookLM
# Install dependencies
pip install -r requirements.txt
# Export your API keys
export HF_TOKEN="hf_..."
export Pinecone_API="..."
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
# Run the app
python app.py
```
The app launches at `http://localhost:7860`.
### Deploying to HF Spaces
1. Create a new Gradio Space on Hugging Face
2. Push the code to the Space repo
3. Add all four API keys as Secrets in the Space settings
4. The Space auto-builds and deploys
---
## How It Works
### RAG Pipeline (Chat)
```
User Question
     │
     ├─→ Embed query (MiniLM-L6-v2, 384d)
     │
     ├─→ Pinecone semantic search (top-40 candidates)
     │
     ├─→ Hybrid rerank: embedding_score + 0.05 × keyword_overlap
     │       └─→ Deduplicate → top-8 final chunks
     │
     ├─→ Build prompt (system rules + context + last 5 messages + question)
     │
     └─→ Qwen 2.5-7B-Instruct generates grounded answer with [S1], [S2] citations
```
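The prompt-building step in the pipeline above can be sketched as a plain chat-completion payload: system rules plus retrieved context, the last 5 turns of history, then the new question. The function name and message shape are illustrative, not the exact `rag_engine.py` code.

```python
def build_messages(system: str, context: str,
                   history: list[dict], question: str) -> list[dict]:
    """Assemble the LLM payload: system rules + context, a 5-message
    conversational window, and the user's new question."""
    messages = [{"role": "system",
                 "content": f"{system}\n\nContext:\n{context}"}]
    messages.extend(history[-5:])  # last 5 messages for follow-ups
    messages.append({"role": "user", "content": question})
    return messages
```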
### Artifact Generation
```
Sources (Pinecone) ──→ Claude Haiku ──→ Summary (Markdown)
                                            │
                                            ├──→ Claude Haiku ──→ Podcast Script
                                            │                         │
                                            │                         └─→ OpenAI TTS ──→ MP3 Audio
                                            │
                                            └──→ Claude Haiku ──→ Quiz (JSON → Interactive HTML)
```
### Data Flow
```
Upload File ──→ Extract Text ──→ Chunk (2000 chars, 200 overlap)
                                    │
                                    ├──→ Embed (HF Inference API)
                                    │
                                    └──→ Upsert to Pinecone (namespace = notebook_id)
                                             metadata: {source_id, source_filename, chunk_index, text}
```
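The upsert step above can be sketched by shaping each chunk and its embedding into a Pinecone record carrying the metadata fields shown in the diagram. The metadata keys follow the diagram; the function name and index name are assumptions, not the app's actual code.

```python
def build_vectors(chunks, embeddings, source_id, filename):
    """Shape (chunk, embedding) pairs into Pinecone upsert records."""
    return [
        {
            "id": f"{source_id}-{i}",
            "values": emb,
            "metadata": {"source_id": source_id,
                         "source_filename": filename,
                         "chunk_index": i,
                         "text": chunk},
        }
        for i, (chunk, emb) in enumerate(zip(chunks, embeddings))
    ]
```

These records would then go to Pinecone via something like `Pinecone(api_key=...).Index("notebooklm").upsert(vectors=..., namespace=notebook_id)` (index name assumed), which is what gives each notebook its isolated namespace.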
---
## UI Overview
### Sidebar
- Create / rename / delete notebooks
- Notebook selector (radio buttons with source & message counts)
- Save button with unsaved-changes indicator
### Chat Tab
- Chatbot with message bubbles and citation chips
- Warning banner if no sources are uploaded yet
### Sources Tab
- Drag-and-drop file uploader (PDF, PPTX, TXT)
- URL input for web pages and YouTube videos
- Source cards showing type icon, file size, chunk count, and status badge
- Status indicators: Processing (yellow pulse), Ready (green), Failed (red with tooltip)
- Delete source dropdown
### Artifacts Tab
**Summary Sub-tab:**
- Conversation Summary: style selector (brief/detailed) + generate button
- Document Summary: source selector (checkboxes) + style selector + generate button
- Download buttons for each (`.md`)
**Podcast Sub-tab:**
- Locked until a summary is generated
- Generate button produces dialogue script + MP3 audio
- In-browser audio player
- Download button (`.mp3`)
**Quiz Sub-tab:**
- Question count selector (5 or 10)
- Interactive multiple-choice with "Show Answer" reveals
- Download button (`.html`)
---
## Design
- **Theme:** Custom dark theme (Indigo/Purple gradient)
- **Background:** `#0e1117`
- **Font:** Inter (Google Fonts)
- **Color palette:** Indigo (`#667eea`), Purple (`#764ba2`), Gold accent (`#fbbf24`), Green success (`#22c55e`)
- **Custom SVG logo** with notebook + sparkle motif
---
## Dependencies
```
gradio>=5.0.0
huggingface_hub>=0.20.0
pinecone>=5.0.0
PyMuPDF>=1.23.0
python-pptx>=0.6.21
beautifulsoup4>=4.12.0
requests>=2.31.0
youtube-transcript-api>=0.6.0
scipy>=1.11.0
anthropic>=0.40.0
openai>=1.0.0
```
---
## License
MIT