---
title: NotebookLM
emoji: 🚀
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: "5.12.0"
python_version: "3.12"
app_file: app.py
hf_oauth: true
hf_oauth_expiration_minutes: 480
tags:
- gradio
pinned: false
short_description: NotebookLM - AI-Powered Study Companion
license: mit
---

# NotebookLM — AI-Powered Study Companion

A full-featured, Google NotebookLM-inspired study tool built with **Gradio** on **Hugging Face Spaces**. Upload documents, chat with your sources using RAG, and generate study artifacts like summaries, podcasts, and quizzes — all from one interface.

---

## Features

### Chat with Your Sources (RAG)
- Ask questions about uploaded documents and get grounded, cited answers
- Two-stage retrieval: semantic vector search (top-40) followed by hybrid reranking (top-8) combining embedding similarity with lexical keyword overlap
- Inline citation chips (`[S1]`, `[S2]`) link back to source passages
- Conversational context: last 5 messages included for follow-up questions
- Powered by **Qwen/Qwen2.5-7B-Instruct** via HF Inference API
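
As a rough illustration of the citation and context behavior described above, the sketch below assembles retrieved chunks and recent history into one prompt. The function name `build_prompt`, the message dict shape, and the prompt wording are assumptions for illustration only; the actual system prompt lives in `artifacts/prompt.poml`.

```python
from typing import Dict, List

HISTORY_WINDOW = 5  # the feature list states the last 5 messages are included


def build_prompt(question: str, chunks: List[str], history: List[Dict[str, str]]) -> str:
    """Assemble a grounded prompt with numbered source labels (hypothetical layout)."""
    # Label each retrieved chunk [S1], [S2], ... so the model can cite it.
    context = "\n\n".join(f"[S{i}] {text}" for i, text in enumerate(chunks, start=1))
    # Keep only the most recent messages for follow-up context.
    recent = history[-HISTORY_WINDOW:]
    dialogue = "\n".join(f"{m['role']}: {m['content']}" for m in recent)
    return (
        "Answer using only the sources below and cite them as [S1], [S2].\n\n"
        f"Sources:\n{context}\n\n"
        f"Recent conversation:\n{dialogue}\n\n"
        f"Question: {question}"
    )
```

Numbering the chunks in the prompt is what lets the UI map a `[S1]` marker in the answer back to a specific passage.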

### Multi-Format Document Ingestion
| Format | Extractor | Notes |
|--------|-----------|-------|
| **PDF** | PyMuPDF (fitz) | Full-page text extraction |
| **PPTX** | python-pptx | Text from all slides and shapes |
| **TXT** | Built-in | UTF-8 plain text |
| **Web URLs** | BeautifulSoup | Strips nav/footer/scripts, extracts article content |
| **YouTube** | youtube-transcript-api | Auto-fetches video transcripts |

- Max file size: **15 MB**
- Max sources per notebook: **20**
- Duplicate detection for both files and URLs
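
The limits above could be enforced with a small pre-ingestion check like the one sketched here. The fingerprinting strategy (content hash for files, lightly normalized string for URLs), the function names, and the reading of 15 MB as binary megabytes are all assumptions, not a description of the app's actual code.

```python
import hashlib
from typing import Set

MAX_FILE_BYTES = 15 * 1024 * 1024  # 15 MB limit (assuming binary megabytes)
MAX_SOURCES = 20                   # per-notebook source limit


def file_fingerprint(data: bytes) -> str:
    """Hash file contents so a renamed copy is still caught (assumed strategy)."""
    return hashlib.sha256(data).hexdigest()


def url_fingerprint(url: str) -> str:
    """Trim whitespace and a trailing slash; case is kept because video IDs are case-sensitive."""
    return url.strip().rstrip("/")


def check_source(fingerprint: str, size_bytes: int, existing: Set[str]) -> str:
    """Return "ok" or a short rejection reason for a candidate source."""
    if size_bytes > MAX_FILE_BYTES:
        return "too_large"
    if len(existing) >= MAX_SOURCES:
        return "notebook_full"
    if fingerprint in existing:
        return "duplicate"
    return "ok"
```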

### Ingestion Pipeline
```
Upload/URL → Text Extraction → Recursive Chunking → Embedding Generation → Pinecone Upsert
```
- **Chunking:** Recursive character splitting (2000 chars per chunk, 200 char overlap) with separators `\n\n` → `\n` → `. ` → ` `
- **Embeddings:** `sentence-transformers/all-MiniLM-L6-v2` (384 dimensions) via HF Inference API
- **Vector Store:** Pinecone with namespace-per-notebook isolation
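
The chunking step can be sketched as follows. This is a minimal illustration of recursive character splitting with the separators and sizes listed above, not the code in `chunker.py`; the function names and the choice to implement overlap by prefixing each chunk with the tail of the previous piece are assumptions.

```python
from typing import List, Sequence

SEPARATORS = ("\n\n", "\n", ". ", " ")


def split_recursive(text: str, chunk_size: int,
                    separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Split text into pieces no longer than chunk_size, preferring coarse separators."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces: List[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            pieces.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # A single part is still too long: retry with finer separators.
            pieces.extend(split_recursive(part, chunk_size, rest))
            current = ""
    if current:
        pieces.append(current)
    return pieces


def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> List[str]:
    """Chunk text, then prefix each chunk with the tail of the previous piece."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    # Reserve room for the overlap so no final chunk exceeds chunk_size.
    pieces = split_recursive(text, chunk_size - overlap)
    chunks = []
    for i, piece in enumerate(pieces):
        prefix = pieces[i - 1][-overlap:] if i > 0 and overlap > 0 else ""
        chunks.append(prefix + piece)
    return chunks
```

Trying `\n\n` first keeps paragraphs intact where possible; the finer separators are only used for parts that are still too long.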

### Document Summary
- Summarize content from your uploaded sources using semantic retrieval
- **Source selection:** Choose specific sources to include via checkbox selector
- **Styles:** Brief (3-5 bullets, <150 words) or Detailed (sectioned, 300-500 words)
- Powered by **Claude Haiku 4.5**

### Conversation Summary
- Summarize your chat history into structured notes
- **Styles:** Brief or Detailed
- Powered by **Claude Haiku 4.5**

### Podcast Generation
- Generates a natural two-host dialogue script (Alex & Sam) from your summary
- Script converted to audio via **OpenAI TTS** (`tts-1` model, `alloy` voice)
- In-browser audio player with download option
- Falls back to script-only if TTS is unavailable
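
The script-only fallback can be pictured as a thin wrapper around the TTS call. In this sketch the `synthesize` callable is a hypothetical stand-in for whatever wraps the OpenAI request; the wrapper name and return shape are assumptions.

```python
from typing import Callable, Optional, Tuple


def generate_podcast(script: str,
                     synthesize: Callable[[str], bytes]) -> Tuple[str, Optional[bytes]]:
    """Return (script, audio); audio is None when synthesis fails."""
    try:
        audio = synthesize(script)
    except Exception:
        # TTS unavailable (missing key, quota, network): keep the script only.
        return script, None
    return script, audio
```

In the app, `synthesize` would presumably wrap a call such as `client.audio.speech.create(model="tts-1", voice="alloy", input=text)` from the `openai` package.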

### Quiz Generation
- Creates multiple-choice quizzes (5 or 10 questions) from your source material
- Interactive HTML with "Show Answer" reveal buttons
- Includes explanations for each correct answer
- Downloadable as standalone HTML
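
One way to turn quiz data into standalone HTML is sketched below. The JSON field names (`question`, `options`, `answer_index`, `explanation`) are hypothetical, and a native `<details>` element stands in for the "Show Answer" button; the real renderer in `quiz_service.py` may differ.

```python
import html
from typing import Dict, List


def render_quiz(questions: List[Dict]) -> str:
    """Render quiz items as standalone HTML with a collapsible answer per question."""
    blocks = []
    for number, q in enumerate(questions, start=1):
        # Escape all model-generated text before embedding it in markup.
        options = "".join(f"<li>{html.escape(opt)}</li>" for opt in q["options"])
        answer = html.escape(q["options"][q["answer_index"]])
        blocks.append(
            f"<section><h3>Q{number}. {html.escape(q['question'])}</h3>"
            f"<ol type='A'>{options}</ol>"
            f"<details><summary>Show Answer</summary>"
            f"<p><strong>{answer}</strong></p>"
            f"<p>{html.escape(q['explanation'])}</p></details></section>"
        )
    return "<!DOCTYPE html><html><body>" + "".join(blocks) + "</body></html>"
```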

### Multi-Notebook Support
- Create, rename, and delete notebooks from the sidebar
- Each notebook has its own sources, chat history, and artifacts
- Switch between notebooks instantly

### Authentication & Persistence
- **OAuth** via Hugging Face (session expiration: 480 minutes)
- User data serialized to JSON and stored in a **private HF Dataset repo** (`Group-1-5010/notebooklm-data`)
- Manual save button with unsaved-changes warning on page unload
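
The JSON serialization step might look like the sketch below. The class names echo those listed for `state.py`, but every field shown here is an assumption, and the real models also cover sources and artifacts.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Message:
    role: str
    content: str


@dataclass
class Notebook:
    id: str
    name: str
    messages: List[Message] = field(default_factory=list)


@dataclass
class UserData:
    user_id: str
    notebooks: List[Notebook] = field(default_factory=list)


def to_json(data: UserData) -> str:
    """Serialize the nested dataclasses to a JSON string."""
    return json.dumps(asdict(data), ensure_ascii=False, indent=2)


def from_json(raw: str) -> UserData:
    """Rebuild the dataclass tree from JSON produced by to_json."""
    payload = json.loads(raw)
    notebooks = [
        Notebook(id=n["id"], name=n["name"],
                 messages=[Message(**m) for m in n["messages"]])
        for n in payload["notebooks"]
    ]
    return UserData(user_id=payload["user_id"], notebooks=notebooks)
```

The resulting string could then be pushed to the dataset repo with `huggingface_hub.HfApi.upload_file(..., repo_type="dataset")`; the file layout inside the repo is not documented here.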

---

## Architecture

```
NotebookLM/
├── app.py                          # Gradio UI, event wiring, refresh logic
├── state.py                        # Data models: UserData, Notebook, Source, Message, Artifact
├── theme.py                        # Dark theme, custom CSS, SVG logos
├── mock_data.py                    # Mock responses for offline testing
├── requirements.txt                # Python dependencies
│
├── artifacts/
│   └── prompt.poml                 # RAG system prompt (POML format)
│
├── assets/
│   ├── logo.svg                    # App logo (SVG with gradients)
│   └── podcasts/                   # Generated podcast MP3 files
│
├── ingestion_engine/               # Document processing pipeline
│   ├── ingestion_manager.py        # Orchestrates: extract → chunk → embed → upsert
│   ├── pdf_extractor.py            # PDF text extraction (PyMuPDF)
│   ├── text_extractor.py           # Plain text file reading
│   ├── url_scrapper.py             # Web page scraping (BeautifulSoup)
│   ├── transcripter.py             # YouTube transcript fetching
│   ├── chunker.py                  # Recursive text chunking with overlap
│   └── embedding_generator.py      # Sentence-transformer embeddings (HF API)
│
├── persistence/                    # Storage layer
│   ├── vector_store.py             # Pinecone CRUD (upsert, query, delete)
│   └── storage_service.py          # HF Dataset repo for user data persistence
│
├── services/                       # Core AI features
│   ├── rag_engine.py               # RAG pipeline: retrieve → rerank → generate
│   ├── summary_service.py          # Conversation & document summaries (Claude)
│   ├── podcast_service.py          # Podcast script generation + OpenAI TTS
│   └── quiz_service.py             # Quiz generation + interactive HTML renderer
│
└── pages/                          # UI tab handlers
    ├── chat.py                     # Chat interface with citation rendering
    ├── sources.py                  # Source upload, URL add, delete, status display
    └── artifacts.py                # Summary, podcast, quiz display & generation
```

---

## Tech Stack

| Layer | Technology |
|-------|-----------|
| **UI Framework** | Gradio 5.12.0 |
| **Hosting** | Hugging Face Spaces |
| **Auth** | HF OAuth |
| **Chat LLM** | Qwen/Qwen2.5-7B-Instruct (HF Inference API) |
| **Summary / Podcast / Quiz LLM** | Claude Haiku 4.5 (Anthropic API) |
| **Text-to-Speech** | OpenAI TTS (`tts-1`, `alloy` voice) |
| **Embeddings** | sentence-transformers/all-MiniLM-L6-v2 (HF Inference API) |
| **Vector Database** | Pinecone |
| **User Data Storage** | HF Dataset repo (private JSON) |
| **PDF Extraction** | PyMuPDF |
| **PPTX Extraction** | python-pptx |
| **Web Scraping** | BeautifulSoup4 + Requests |
| **YouTube Transcripts** | youtube-transcript-api |

---

## Setup

### Prerequisites

- Python 3.12+
- A Hugging Face account
- API keys for Pinecone, Anthropic, and OpenAI

### Environment Variables

Set these as **Secrets** in your HF Space settings (or in a `.env` for local dev):

| Variable | Description |
|----------|-------------|
| `HF_TOKEN` | Hugging Face API token (read/write access) |
| `Pinecone_API` | Pinecone API key |
| `ANTHROPIC_API_KEY` | Anthropic API key (for Claude Haiku) |
| `OPENAI_API_KEY` | OpenAI API key (for TTS audio generation) |
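
A quick startup check for these variables can save debugging time. The helper below is a hypothetical sketch, not part of the app; note that `Pinecone_API` is case-sensitive and must be spelled exactly as in the table.

```python
import os
from typing import List, Mapping

REQUIRED_VARS = ("HF_TOKEN", "Pinecone_API", "ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def missing_secrets(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_secrets()
    if missing:
        print("Missing secrets:", ", ".join(missing))
```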

### Local Development

```bash
# Clone the repo
git clone https://huggingface.co/spaces/Group-1-5010/NotebookLM
cd NotebookLM

# Install dependencies
pip install -r requirements.txt

# Export your API keys
export HF_TOKEN="hf_..."
export Pinecone_API="..."
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."

# Run the app
python app.py
```

The app launches at `http://localhost:7860`.

### Deploying to HF Spaces

1. Create a new Gradio Space on Hugging Face
2. Push the code to the Space repo
3. Add the four environment variables listed above as Secrets in the Space settings
4. The Space auto-builds and deploys

---

## How It Works

### RAG Pipeline (Chat)

```
User Question
    │
    ├─→ Embed query (MiniLM-L6-v2, 384d)
    │
    ├─→ Pinecone semantic search (top-40 candidates)
    │
    ├─→ Hybrid rerank: embedding_score + 0.05 × keyword_overlap
    │       └─→ Deduplicate → top-8 final chunks
    │
    ├─→ Build prompt (system rules + context + last 5 messages + question)
    │
    └─→ Qwen 2.5-7B-Instruct generates grounded answer with [S1], [S2] citations
```
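
The rerank step in the diagram above can be sketched as follows. The 0.05 weight and top-8 cutoff come from the diagram; the definition of `keyword_overlap` (fraction of distinct query words found in the chunk), the candidate dict shape, and deduplication by normalized text are assumptions.

```python
import re
from typing import Dict, List

KEYWORD_WEIGHT = 0.05  # weight from the pipeline diagram
TOP_K = 8              # final number of chunks kept


def keyword_overlap(query: str, text: str) -> float:
    """Fraction of distinct query words that also appear in the chunk (assumed definition)."""
    query_words = set(re.findall(r"\w+", query.lower()))
    if not query_words:
        return 0.0
    text_words = set(re.findall(r"\w+", text.lower()))
    return len(query_words & text_words) / len(query_words)


def rerank(query: str, candidates: List[Dict], top_k: int = TOP_K) -> List[Dict]:
    """Score candidates, drop duplicate texts, and keep the best top_k."""
    scored = sorted(
        candidates,
        key=lambda c: c["score"] + KEYWORD_WEIGHT * keyword_overlap(query, c["text"]),
        reverse=True,
    )
    seen, result = set(), []
    for candidate in scored:
        key = candidate["text"].strip().lower()
        if key in seen:
            continue  # same passage retrieved twice, e.g. via chunk overlap
        seen.add(key)
        result.append(candidate)
        if len(result) == top_k:
            break
    return result
```

With a weight of 0.05 the keyword term acts as a tie-breaker between chunks of similar embedding score and rarely overrides a clear semantic match.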

### Artifact Generation

```
Sources (Pinecone) ──→ Claude Haiku ──→ Summary (Markdown)
                                            │
                                            ├──→ Claude Haiku ──→ Podcast Script
                                            │                         │
                                            │                    OpenAI TTS ──→ MP3 Audio
                                            │
                                            └──→ Claude Haiku ──→ Quiz (JSON → Interactive HTML)
```

### Data Flow

```
Upload File ──→ Extract Text ──→ Chunk (2000 chars, 200 overlap)
                                      │
                                      ├──→ Embed (HF Inference API)
                                      │
                                      └──→ Upsert to Pinecone (namespace = notebook_id)
                                                metadata: {source_id, source_filename, chunk_index, text}
```
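
The records sent to Pinecone could be built along these lines. The metadata keys match the diagram above; the function name and the `"{source_id}-{index}"` ID scheme are assumptions.

```python
from typing import Dict, List, Sequence


def build_records(source_id: str, filename: str, chunks: Sequence[str],
                  embeddings: Sequence[Sequence[float]]) -> List[Dict]:
    """Pair each chunk with its embedding in the dict shape Pinecone's upsert accepts."""
    if len(chunks) != len(embeddings):
        raise ValueError("chunks and embeddings must have the same length")
    return [
        {
            "id": f"{source_id}-{index}",  # hypothetical ID scheme
            "values": list(vector),
            "metadata": {
                "source_id": source_id,
                "source_filename": filename,
                "chunk_index": index,
                "text": chunk,
            },
        }
        for index, (chunk, vector) in enumerate(zip(chunks, embeddings))
    ]
```

The list can then be passed to `index.upsert(vectors=records, namespace=notebook_id)`, so that deleting a notebook only requires clearing its namespace.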

---

## UI Overview

### Sidebar
- Create / rename / delete notebooks
- Notebook selector (radio buttons with source & message counts)
- Save button with unsaved-changes indicator

### Chat Tab
- Chatbot with message bubbles and citation chips
- Warning banner if no sources are uploaded yet

### Sources Tab
- Drag-and-drop file uploader (PDF, PPTX, TXT)
- URL input for web pages and YouTube videos
- Source cards showing type icon, file size, chunk count, and status badge
- Status indicators: Processing (yellow pulse), Ready (green), Failed (red with tooltip)
- Delete source dropdown

### Artifacts Tab

**Summary Sub-tab:**
- Conversation Summary: style selector (brief/detailed) + generate button
- Document Summary: source selector (checkboxes) + style selector + generate button
- Download buttons for each (`.md`)

**Podcast Sub-tab:**
- Locked until a summary is generated
- Generate button produces dialogue script + MP3 audio
- In-browser audio player
- Download button (`.mp3`)

**Quiz Sub-tab:**
- Question count selector (5 or 10)
- Interactive multiple-choice with "Show Answer" reveals
- Download button (`.html`)

---

## Design

- **Theme:** Custom dark theme (Indigo/Purple gradient)
- **Background:** `#0e1117`
- **Font:** Inter (Google Fonts)
- **Color palette:** Indigo (`#667eea`), Purple (`#764ba2`), Gold accent (`#fbbf24`), Green success (`#22c55e`)
- **Custom SVG logo** with notebook + sparkle motif

---

## Dependencies

```
gradio>=5.0.0
huggingface_hub>=0.20.0
pinecone>=5.0.0
PyMuPDF>=1.23.0
python-pptx>=0.6.21
beautifulsoup4>=4.12.0
requests>=2.31.0
youtube-transcript-api>=0.6.0
scipy>=1.11.0
anthropic>=0.40.0
openai>=1.0.0
```

---

## License

MIT