---
title: NotebookLM Clone ITCS4681 Group5
emoji: 🌖
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 6.8.0
app_file: app.py
pinned: false
license: mit
short_description: A replica of NotebookLM
hf_oauth: true
hf_oauth_scopes:
  - email
---

# NotebookLM Clone 🪐

An open-source, locally runnable replica of Google's NotebookLM. Upload documents (PDF, TXT, PPTX) or URLs, and the application creates a persistent notebook environment where you can query your documents through a Retrieval-Augmented Generation (RAG) pipeline.

🔗 **[Live Demo on Hugging Face Spaces](https://huggingface.co/spaces/abiju/notebook_lm_clone)**

---

## ✨ Core Features

* **🔒 Secure & Isolated Identity:** Integrates with Hugging Face OAuth. All notebooks, documents, and chat histories are strictly isolated per user.
* **📓 Multi-Notebook Management:** Create, rename, delete, and switch between multiple isolated notebooks.
* **🧠 Advanced RAG Pipeline:**
    * **Contextual Headers:** An LLM generates a summary of each document and prepends it to that document's chunks, reducing cross-document ambiguity.
    * **Adaptive Semantic Chunking:** Text is split at topic boundaries rather than at arbitrary character counts.
    * **Query Expansion:** Automatically generates alternative query phrasings to combat vocabulary mismatch.
    * **Cross-Encoder Reranking:** Attention-based relevance scoring filters out noisy distractors, which helps reduce hallucinations.
    * **Fast / Reasoning Toggle:** Switch between instant BM25+Vector answers ("Fast") and the full reasoning pipeline ("Reasoning") directly in the UI.
* **💬 Grounded Chat with Citations:** The assistant answers *only* from the provided sources and cites them inline (e.g., `[S1]`, `[S2]`).
* **🎨 Multi-Modal Artifact Generation:**
    * **Reports:** Detailed markdown summaries.
    * **Quizzes:** Auto-generated multiple-choice questions with answer keys.
    * **Deep Dive Audio Podcasts:** Transcripts and Text-To-Speech (TTS) generated `.mp3` podcasts discussing your documents.
* **💾 Persistent Memory:** Uploaded sources, metadata, and full chat histories survive across sessions.

---

## 🏗️ Architecture & Stack

### Environment & Libraries

The project uses a modern Python/ML stack:

* **UI Framework:** `gradio (6.8.0)` for the responsive frontend and OAuth handling.
* **Vector Database:** `chromadb (1.5.2)` for persistent, local embedding storage.
* **Embeddings & Reranking:** `sentence-transformers (5.2.3)` powered by local Hugging Face models (`all-MiniLM-L6-v2`, `ms-marco-MiniLM-L-6-v2`).
* **LLM Provider:** `openai (2.24.0)` handles chat generation, query expansion, and contextual headers.
* **Document Parsing:** `pypdf (6.7.5)` for processing uploaded resources.
* **Hub Integration:** `huggingface_hub` for Space syncing and authentication.

### High-Level Flow

1. **Ingest:** Text is extracted from files and URLs, semantically chunked, embedded, and stored in ChromaDB within `/data`.
2. **Retrieve:** User queries are expanded via LLM. A hybrid search (BM25 + Vector) fetches candidates, which are then re-scored by a Cross-Encoder.
3. **Generate:** The LLM receives only the retrieved context and returns a cited answer. The chat is appended to `messages.jsonl` for persistent memory.

---

## 🚀 Running Locally

### Prerequisites

1. Open a terminal.
2. Ensure you have Python `3.10+` or [`uv`](https://docs.astral.sh/uv/) installed.
3. Clone the repository.

### Setup

1. **Environment Variables:** Create a `.env` file in the root directory:
   ```env
   OPENAI_API_KEY=sk-your-actual-api-key
   NOTEBOOKLM_CHAT_MODEL=gpt-4o-mini
   NOTEBOOKLM_CHUNKING_METHOD=semantic
   ```
2. **Install & Run:**
   Using `uv` (recommended):
   ```bash
   uv run --env-file .env python app.py
   ```
   Or using standard `pip`:
   ```bash
   pip install -r requirements.txt
   python app.py
   ```
3. **Login Mocking:** When running locally, Gradio automatically mocks the Hugging Face OAuth flow.
   Click "Sign in with Hugging Face" and a mock local profile is injected so you can test the app. Depending on your Gradio version, the mock may rely on a locally logged-in Hugging Face account (for example via `huggingface-cli login`).

---

## ☁️ Deployment & CI

This repository is designed to run out-of-the-box on Hugging Face Spaces.

* **HF Spaces Deployment:** The `README.md` contains the YAML frontmatter needed to configure the Gradio SDK and OAuth scopes at launch.
* **Secrets:** You must configure `OPENAI_API_KEY` in the Space settings under **Variables and secrets**.
* **Storage Ephemerality:** On the free HF Spaces tier, the `/data` folder is ephemeral and resets when the Space restarts, for example after it goes to sleep following 48 hours of inactivity. For persistent production storage, consider linking a **Hugging Face Dataset** volume to the Space.
* **Continuous Integration:** Updates pushed to the `main` branch of the connected GitHub repository automatically trigger a GitHub Action or webhook that syncs the files to the Space.
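
The **Retrieve** step under High-Level Flow combines BM25 and vector candidates before reranking, but this README does not say how the two result lists are merged. The sketch below shows one common option, Reciprocal Rank Fusion (RRF). The function name, the chunk IDs, and the constant `k=60` are illustrative assumptions, not the project's actual code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list, best first.

    Hypothetical helper: each list contributes 1 / (k + rank) per chunk,
    so chunks ranked highly by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Made-up candidate lists standing in for the BM25 and vector searches.
bm25_hits = ["c3", "c1", "c7"]
vector_hits = ["c1", "c4", "c3"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['c1', 'c3', 'c4', 'c7']
```

RRF uses only ranks, so BM25 scores and cosine similarities never need to be put on the same scale. The fused list would then be passed to the Cross-Encoder for final scoring.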
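
The **Generate** step appends each chat turn to `messages.jsonl`. As a minimal sketch of what append-only JSONL persistence can look like, the example below writes one JSON object per line and reads the history back. The record fields (`role`, `content`), the helper names, and the directory layout are assumptions for illustration; the app's real schema may differ.

```python
import json
import tempfile
from pathlib import Path


def append_message(path, role, content):
    """Append one chat turn as a single JSON line (hypothetical schema)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"role": role, "content": content}, ensure_ascii=False) + "\n")


def load_messages(path):
    """Read the full history back; a missing file means an empty history."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Use a temporary directory so the demo leaves no files behind in the repo.
demo_dir = Path(tempfile.mkdtemp())
log_path = demo_dir / "notebook-1" / "messages.jsonl"

append_message(log_path, "user", "What is RAG?")
append_message(log_path, "assistant", "Retrieval-augmented generation [S1].")

history = load_messages(log_path)
print(len(history), history[0]["role"])
```

Appending line by line means a crash mid-session loses at most the turn being written, and earlier turns never have to be rewritten.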
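
The grounded chat feature cites sources inline as `[S1]`, `[S2]`. One way to support that, sketched here under stated assumptions, is to label each retrieved chunk before it goes into the prompt and then check which labels the answer actually uses. Both helper names and the sample text are hypothetical and not taken from the project.

```python
import re


def build_context(chunks):
    """Label each retrieved chunk as [S1], [S2], ... for the prompt."""
    return "\n\n".join(f"[S{i}] {text}" for i, text in enumerate(chunks, start=1))


def cited_sources(answer, num_sources):
    """Return the sorted source numbers cited in the answer that really exist."""
    found = {int(n) for n in re.findall(r"\[S(\d+)\]", answer)}
    return sorted(n for n in found if 1 <= n <= num_sources)


# Made-up chunks and a made-up model answer, for illustration only.
chunks = [
    "RAG retrieves passages before generating.",
    "BM25 is a lexical ranking function.",
]
context = build_context(chunks)
answer = "RAG retrieves passages first [S1], and BM25 ranks by term overlap [S2][S5]."

valid = cited_sources(answer, len(chunks))
print(valid)  # [1, 2]
```

Here `[S5]` is dropped because only two sources were supplied, which shows how a citation check can flag references the model invented.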