---
title: NotebookLM
emoji: 🚀
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: "5.12.0"
python_version: "3.12"
app_file: app.py
hf_oauth: true
hf_oauth_expiration_minutes: 480
tags:
- gradio
pinned: false
short_description: NotebookLM - AI-Powered Study Companion
license: mit
---
# NotebookLM - AI-Powered Study Companion
A full-featured, Google NotebookLM-inspired study tool built with **Gradio** on **Hugging Face Spaces**. Upload documents, chat with your sources using RAG, and generate study artifacts like summaries, podcasts, and quizzes, all from one interface.
---
## Features
### Chat with Your Sources (RAG)
- Ask questions about uploaded documents and get grounded, cited answers
- Two-stage retrieval: semantic vector search (top-40) followed by hybrid reranking (top-8) combining embedding similarity with lexical keyword overlap
- Inline citation chips (`[S1]`, `[S2]`) link back to source passages
- Conversational context: last 5 messages included for follow-up questions
- Powered by **Qwen/Qwen2.5-7B-Instruct** via HF Inference API
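The two-stage retrieval above (semantic top-40, then hybrid rerank to top-8) can be sketched as follows. This is a minimal illustration of the scoring formula, not the actual `rag_engine.py` code; the function name, dict keys, and the 0.05 weight default are assumptions based on the description above.

```python
def hybrid_rerank(query, candidates, weight=0.05, top_k=8):
    """Re-score semantic hits with a lexical keyword-overlap bonus.

    `candidates` is a list of dicts with "text" and "score" keys,
    where "score" is the vector-search similarity.
    """
    query_terms = set(query.lower().split())
    seen, scored = set(), []
    for c in candidates:
        text = c["text"]
        if text in seen:  # drop duplicate chunks
            continue
        seen.add(text)
        overlap = len(query_terms & set(text.lower().split()))
        scored.append((c["score"] + weight * overlap, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]
```

The small weight keeps the embedding similarity dominant while letting exact keyword matches break near-ties.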
### Multi-Format Document Ingestion
| Format | Extractor | Notes |
|--------|-----------|-------|
| **PDF** | PyMuPDF (fitz) | Full-page text extraction |
| **PPTX** | python-pptx | Text from all slides and shapes |
| **TXT** | Built-in | UTF-8 plain text |
| **Web URLs** | BeautifulSoup | Strips nav/footer/scripts, extracts article content |
| **YouTube** | youtube-transcript-api | Auto-fetches video transcripts |
- Max file size: **15 MB**
- Max sources per notebook: **20**
- Duplicate detection for both files and URLs
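Duplicate detection can be done by fingerprinting each source; a plausible sketch is hashing file bytes (or a normalized URL) and comparing against fingerprints already in the notebook. The helper below is hypothetical, not the app's actual check.

```python
import hashlib

def source_fingerprint(data: bytes = b"", url: str = "") -> str:
    """Return a stable hash for a file's bytes or a normalized URL,
    so re-adding the same source can be detected."""
    if url:
        data = url.strip().rstrip("/").lower().encode("utf-8")
    return hashlib.sha256(data).hexdigest()
```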
### Ingestion Pipeline
```
Upload/URL → Text Extraction → Recursive Chunking → Embedding Generation → Pinecone Upsert
```
- **Chunking:** Recursive character splitting (2000 chars per chunk, 200 char overlap) with separators `\n\n` → `\n` → `. ` → ` `
- **Embeddings:** `sentence-transformers/all-MiniLM-L6-v2` (384 dimensions) via HF Inference API
- **Vector Store:** Pinecone with namespace-per-notebook isolation
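The recursive chunking step can be sketched like this: try separators from coarsest to finest until the pieces fit, then greedily pack them back up to the size limit, carrying an overlap between chunks. This is a simplified illustration of the strategy described above, not the exact `chunker.py` implementation.

```python
def recursive_chunk(text, size=2000, overlap=200,
                    separators=("\n\n", "\n", ". ", " ")):
    """Split text on the coarsest separator whose pieces fit `size`,
    then re-pack pieces into chunks with `overlap` chars of carry-over."""
    if len(text) <= size:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1 and max(len(p) for p in parts) <= size:
            break
    else:
        # No separator helps: fall back to a hard cut with overlap.
        return [text[i:i + size] for i in range(0, len(text), size - overlap)]
    chunks, current = [], ""
    for part in parts:
        candidate = (current + sep + part) if current else part
        if len(candidate) > size:
            chunks.append(current)
            current = current[-overlap:] + sep + part  # carry the overlap
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```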
### Document Summary
- Summarize content from your uploaded sources using semantic retrieval
- **Source selection:** Choose specific sources to include via checkbox selector
- **Styles:** Brief (3-5 bullets, <150 words) or Detailed (sectioned, 300-500 words)
- Powered by **Claude Haiku 4.5**
### Conversation Summary
- Summarize your chat history into structured notes
- **Styles:** Brief or Detailed
- Powered by **Claude Haiku 4.5**
### Podcast Generation
- Generates a natural two-host dialogue script (Alex & Sam) from your summary
- Script converted to audio via **OpenAI TTS** (`tts-1` model, `alloy` voice)
- In-browser audio player with download option
- Falls back to script-only if TTS is unavailable
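The script-to-audio step might look like the sketch below: parse the Alex/Sam dialogue into lines, then send the text to OpenAI's speech endpoint (`tts-1`, `alloy`). The API call shape is OpenAI's real `audio.speech.create`; the helper names and script format are assumptions, not `podcast_service.py` itself.

```python
def parse_script(script: str) -> list[tuple[str, str]]:
    """Split an 'Alex: ... / Sam: ...' script into (speaker, line) pairs."""
    pairs = []
    for raw in script.splitlines():
        if ":" in raw:
            speaker, _, line = raw.partition(":")
            if speaker.strip() in ("Alex", "Sam"):
                pairs.append((speaker.strip(), line.strip()))
    return pairs

def synthesize(script: str, out_path: str = "podcast.mp3") -> None:
    """Convert the full dialogue to a single MP3 (needs OPENAI_API_KEY)."""
    from openai import OpenAI
    text = " ".join(line for _, line in parse_script(script))
    audio = OpenAI().audio.speech.create(model="tts-1", voice="alloy",
                                         input=text)
    audio.write_to_file(out_path)
```

If the TTS call raises (missing key, quota), the app can return just the parsed script, matching the script-only fallback noted above.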
### Quiz Generation
- Creates multiple-choice quizzes (5 or 10 questions) from your source material
- Interactive HTML with "Show Answer" reveal buttons
- Includes explanations for each correct answer
- Downloadable as standalone HTML
### Multi-Notebook Support
- Create, rename, and delete notebooks from the sidebar
- Each notebook has its own sources, chat history, and artifacts
- Switch between notebooks instantly
### Authentication & Persistence
- **OAuth** via Hugging Face (session expiration: 480 minutes)
- User data serialized to JSON and stored in a **private HF Dataset repo** (`Group-1-5010/notebooklm-data`)
- Manual save button with unsaved-changes warning on page unload
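Persistence can be sketched as: serialize the user's notebooks to JSON, then push the blob into the private dataset repo with `huggingface_hub`'s real `HfApi.upload_file`. The per-user path layout below is an assumption, not taken from `storage_service.py`.

```python
import io
import json

def serialize_user(user: dict) -> bytes:
    """Dump user data to JSON bytes ready for upload."""
    return json.dumps(user, ensure_ascii=False, indent=2).encode("utf-8")

def save_to_hub(user: dict, username: str,
                repo_id: str = "Group-1-5010/notebooklm-data") -> None:
    """Push the JSON blob into the private HF Dataset repo (needs HF_TOKEN)."""
    from huggingface_hub import HfApi
    HfApi().upload_file(
        path_or_fileobj=io.BytesIO(serialize_user(user)),
        path_in_repo=f"users/{username}.json",  # assumed layout
        repo_id=repo_id,
        repo_type="dataset",
    )
```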
---
## Architecture
```
NotebookLM/
├── app.py                  # Gradio UI, event wiring, refresh logic
├── state.py                # Data models: UserData, Notebook, Source, Message, Artifact
├── theme.py                # Dark theme, custom CSS, SVG logos
├── mock_data.py            # Mock responses for offline testing
├── requirements.txt        # Python dependencies
│
├── artifacts/
│   └── prompt.poml         # RAG system prompt (POML format)
│
├── assets/
│   ├── logo.svg            # App logo (SVG with gradients)
│   └── podcasts/           # Generated podcast MP3 files
│
├── ingestion_engine/       # Document processing pipeline
│   ├── ingestion_manager.py    # Orchestrates: extract → chunk → embed → upsert
│   ├── pdf_extractor.py        # PDF text extraction (PyMuPDF)
│   ├── text_extractor.py       # Plain text file reading
│   ├── url_scrapper.py         # Web page scraping (BeautifulSoup)
│   ├── transcripter.py         # YouTube transcript fetching
│   ├── chunker.py              # Recursive text chunking with overlap
│   └── embedding_generator.py  # Sentence-transformer embeddings (HF API)
│
├── persistence/            # Storage layer
│   ├── vector_store.py         # Pinecone CRUD (upsert, query, delete)
│   └── storage_service.py      # HF Dataset repo for user data persistence
│
├── services/               # Core AI features
│   ├── rag_engine.py           # RAG pipeline: retrieve → rerank → generate
│   ├── summary_service.py      # Conversation & document summaries (Claude)
│   ├── podcast_service.py      # Podcast script generation + OpenAI TTS
│   └── quiz_service.py         # Quiz generation + interactive HTML renderer
│
└── pages/                  # UI tab handlers
    ├── chat.py                 # Chat interface with citation rendering
    ├── sources.py              # Source upload, URL add, delete, status display
    └── artifacts.py            # Summary, podcast, quiz display & generation
```
---
## Tech Stack
| Layer | Technology |
|-------|-----------|
| **UI Framework** | Gradio 5.12.0 |
| **Hosting** | Hugging Face Spaces |
| **Auth** | HF OAuth |
| **Chat LLM** | Qwen/Qwen2.5-7B-Instruct (HF Inference API) |
| **Summary / Podcast / Quiz LLM** | Claude Haiku 4.5 (Anthropic API) |
| **Text-to-Speech** | OpenAI TTS (`tts-1`, `alloy` voice) |
| **Embeddings** | sentence-transformers/all-MiniLM-L6-v2 (HF Inference API) |
| **Vector Database** | Pinecone |
| **User Data Storage** | HF Dataset repo (private JSON) |
| **PDF Extraction** | PyMuPDF |
| **PPTX Extraction** | python-pptx |
| **Web Scraping** | BeautifulSoup4 + Requests |
| **YouTube Transcripts** | youtube-transcript-api |
---
## Setup
### Prerequisites
- Python 3.12+
- A Hugging Face account
- API keys for Pinecone, Anthropic, and OpenAI
### Environment Variables
Set these as **Secrets** in your HF Space settings (or in a `.env` for local dev):
| Variable | Description |
|----------|-------------|
| `HF_TOKEN` | Hugging Face API token (read/write access) |
| `Pinecone_API` | Pinecone API key |
| `ANTHROPIC_API_KEY` | Anthropic API key (for Claude Haiku) |
| `OPENAI_API_KEY` | OpenAI API key (for TTS audio generation) |
### Local Development
```bash
# Clone the repo
git clone https://huggingface.co/spaces/Group-1-5010/NotebookLM
cd NotebookLM
# Install dependencies
pip install -r requirements.txt
# Export your API keys
export HF_TOKEN="hf_..."
export Pinecone_API="..."
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
# Run the app
python app.py
```
The app launches at `http://localhost:7860`.
### Deploying to HF Spaces
1. Create a new Gradio Space on Hugging Face
2. Push the code to the Space repo
3. Add all four API keys as Secrets in the Space settings
4. The Space auto-builds and deploys
---
## How It Works
### RAG Pipeline (Chat)
```
User Question
     │
     ├─→ Embed query (MiniLM-L6-v2, 384d)
     │
     ├─→ Pinecone semantic search (top-40 candidates)
     │
     ├─→ Hybrid rerank: embedding_score + 0.05 × keyword_overlap
     │       └─→ Deduplicate → top-8 final chunks
     │
     ├─→ Build prompt (system rules + context + last 5 messages + question)
     │
     └─→ Qwen 2.5-7B-Instruct generates grounded answer with [S1], [S2] citations
```
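The prompt-building step in the pipeline above can be sketched as a plain chat-completion payload: system rules plus retrieved context, the last 5 turns of history, then the new question. The function name and message shape are illustrative, not the exact `rag_engine.py` code.

```python
def build_messages(system: str, context: str,
                   history: list[dict], question: str) -> list[dict]:
    """Assemble the LLM payload: system rules + context, a 5-message
    conversational window, and the user's new question."""
    messages = [{"role": "system",
                 "content": f"{system}\n\nContext:\n{context}"}]
    messages.extend(history[-5:])  # last 5 messages for follow-ups
    messages.append({"role": "user", "content": question})
    return messages
```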
### Artifact Generation
```
Sources (Pinecone) ──→ Claude Haiku ──→ Summary (Markdown)
                                            │
                                            ├──→ Claude Haiku ──→ Podcast Script
                                            │                         │
                                            │                         └─→ OpenAI TTS ──→ MP3 Audio
                                            │
                                            └──→ Claude Haiku ──→ Quiz (JSON → Interactive HTML)
```
### Data Flow
```
Upload File ──→ Extract Text ──→ Chunk (2000 chars, 200 overlap)
                                    │
                                    ├──→ Embed (HF Inference API)
                                    │
                                    └──→ Upsert to Pinecone (namespace = notebook_id)
                                             metadata: {source_id, source_filename, chunk_index, text}
```
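The upsert step above can be sketched by shaping each chunk and its embedding into a Pinecone record carrying the metadata fields shown in the diagram. The metadata keys follow the diagram; the function name and index name are assumptions, not the app's actual code.

```python
def build_vectors(chunks, embeddings, source_id, filename):
    """Shape (chunk, embedding) pairs into Pinecone upsert records."""
    return [
        {
            "id": f"{source_id}-{i}",
            "values": emb,
            "metadata": {"source_id": source_id,
                         "source_filename": filename,
                         "chunk_index": i,
                         "text": chunk},
        }
        for i, (chunk, emb) in enumerate(zip(chunks, embeddings))
    ]
```

These records would then go to Pinecone via something like `Pinecone(api_key=...).Index("notebooklm").upsert(vectors=..., namespace=notebook_id)` (index name assumed), which is what gives each notebook its isolated namespace.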
---
## UI Overview
### Sidebar
- Create / rename / delete notebooks
- Notebook selector (radio buttons with source & message counts)
- Save button with unsaved-changes indicator
### Chat Tab
- Chatbot with message bubbles and citation chips
- Warning banner if no sources are uploaded yet
### Sources Tab
- Drag-and-drop file uploader (PDF, PPTX, TXT)
- URL input for web pages and YouTube videos
- Source cards showing type icon, file size, chunk count, and status badge
- Status indicators: Processing (yellow pulse), Ready (green), Failed (red with tooltip)
- Delete source dropdown
### Artifacts Tab
**Summary Sub-tab:**
- Conversation Summary: style selector (brief/detailed) + generate button
- Document Summary: source selector (checkboxes) + style selector + generate button
- Download buttons for each (`.md`)
**Podcast Sub-tab:**
- Locked until a summary is generated
- Generate button produces dialogue script + MP3 audio
- In-browser audio player
- Download button (`.mp3`)
**Quiz Sub-tab:**
- Question count selector (5 or 10)
- Interactive multiple-choice with "Show Answer" reveals
- Download button (`.html`)
---
## Design
- **Theme:** Custom dark theme (Indigo/Purple gradient)
- **Background:** `#0e1117`
- **Font:** Inter (Google Fonts)
- **Color palette:** Indigo (`#667eea`), Purple (`#764ba2`), Gold accent (`#fbbf24`), Green success (`#22c55e`)
- **Custom SVG logo** with notebook + sparkle motif
---
## Dependencies
```
gradio>=5.0.0
huggingface_hub>=0.20.0
pinecone>=5.0.0
PyMuPDF>=1.23.0
python-pptx>=0.6.21
beautifulsoup4>=4.12.0
requests>=2.31.0
youtube-transcript-api>=0.6.0
scipy>=1.11.0
anthropic>=0.40.0
openai>=1.0.0
```
---
## License
MIT