Abhinav Biju

metadata
title: NotebookLM Clone ITCS4681 Group5
emoji: 🌖
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 6.8.0
app_file: app.py
pinned: false
license: mit
short_description: A replica of NotebookLM
hf_oauth: true
hf_oauth_scopes:
  - email

NotebookLM Clone 🪐

An open-source, locally runnable replica of Google's NotebookLM. Upload documents (PDF, TXT, PPTX) or URLs to build a persistent notebook, then query your sources through a Retrieval-Augmented Generation (RAG) pipeline.

🔗 Live Demo on Hugging Face Spaces


✨ Core Features

  • 🔒 Secure & Isolated Identity: Integrates with Hugging Face OAuth. All notebooks, documents, and chat histories are strictly isolated per user.
  • 📓 Multi-Notebook Management: Create, rename, delete, and seamlessly switch between multiple isolated notebooks.
  • 🧠 Advanced RAG Pipeline:
    • Contextual Headers: An LLM generates a summary of each document and prepends it to that document's chunks, reducing cross-document ambiguity.
    • Adaptive Semantic Chunking: Intelligent text splitting based on continuous thought boundaries rather than arbitrary character counts.
    • Query Expansion: Automatically generates alternative query phrasings to combat vocabulary mismatch.
    • Cross-Encoder Reranking: Attention-based relevance scoring filters out noisy distractors, reducing the risk of hallucinated answers.
    • Fast / Reasoning Toggle: Switch between instant BM25+Vector answers ("Fast") and the full reasoning pipeline ("Reasoning") directly in the UI.
  • 💬 Grounded Chat with Citations: The Assistant only answers using the provided sources and explicitly cites them inline (e.g., [S1], [S2]).
  • 🎨 Multi-Modal Artifact Generation:
    • Reports: Detailed markdown summaries.
    • Quizzes: Auto-generated multiple-choice questions with answer keys.
    • Deep Dive Audio Podcasts: Transcripts and Text-To-Speech (TTS) generated .mp3 podcasts discussing your documents.
  • 💾 Persistent Memory: Uploaded sources, metadata, and full chat histories survive across sessions.
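
The Contextual Headers feature listed above can be pictured with a small sketch. This is not the project's actual code: `add_contextual_headers` and `first_sentence` are hypothetical names, and `first_sentence` is a stand-in for the LLM-generated summary so the example runs without an API key.

```python
from typing import Callable, List


def first_sentence(document: str) -> str:
    """Hypothetical stand-in for an LLM summary: keeps the first sentence only."""
    return document.split(". ")[0].rstrip(".") + "."


def add_contextual_headers(
    title: str,
    document: str,
    chunks: List[str],
    summarize: Callable[[str], str] = first_sentence,
) -> List[str]:
    """Prefix every chunk with the document title and a one-line summary."""
    header = f"[Document: {title}] {summarize(document)}"
    return [f"{header}\n{chunk}" for chunk in chunks]


document = "This handbook covers onboarding. New hires get a laptop. Badges expire yearly."
chunks = ["New hires get a laptop.", "Badges expire yearly."]
contextualized = add_contextual_headers("handbook.pdf", document, chunks)
```

Because every chunk carries its document's title and summary, a chunk such as "Badges expire yearly." stays attributable to the right source when chunks from several documents are retrieved together.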

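Adaptive Semantic Chunking can be approximated as follows. This is a minimal sketch under stated assumptions, not the app's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model such as all-MiniLM-L6-v2, and the `threshold` value is illustrative.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(sentence: str) -> Vector:
    """Hypothetical stand-in for a real embedding model: bag-of-words counts."""
    return dict(Counter(re.findall(r"[a-z0-9]+", sentence.lower())))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(
    text: str,
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.15,  # illustrative value, not the app's setting
) -> List[str]:
    """Group consecutive sentences; start a new chunk when similarity drops."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(curr)) < threshold:
            chunks.append([curr])  # topic shift: open a new chunk
        else:
            chunks[-1].append(curr)  # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


sample = (
    "Cats are small pets. Cats sleep most of the day. "
    "Rust is a systems language. Rust has a borrow checker."
)
chunks = semantic_chunks(sample)
```

With the sample text, the two sentences about cats land in one chunk and the two about Rust in another, because the split happens where adjacent sentences stop sharing vocabulary rather than at a fixed character count.
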
🏗️ Architecture & Stack

Environment & Libraries

The project utilizes a modern Python/ML stack:

  • UI Framework: gradio (6.8.0) for the responsive frontend and OAuth handling.
  • Vector Database: chromadb (1.5.2) for persistent, local embedding storage.
  • Embeddings & Reranking: sentence-transformers (5.2.3) powered by local Hugging Face models (all-MiniLM-L6-v2, ms-marco-MiniLM-L-6-v2).
  • LLM Provider: openai (2.24.0) handles chat generation, query expansion, and contextual headers.
  • Document Parsing: pypdf (6.7.5) for extracting text from uploaded PDFs.
  • Hub Integration: huggingface_hub for Space syncing and authentication.

High-Level Flow

  1. Ingest: Files/URLs are extracted, semantically chunked, embedded, and stored in ChromaDB within /data.
  2. Retrieve: User queries are expanded via LLM. A hybrid search (BM25 + Vector) fetches candidates, which are then re-scored by a Cross-Encoder.
  3. Generate: The LLM receives only the retrieved context and returns a cited answer. Each chat turn is appended to messages.jsonl for persistent memory.
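
The hybrid search in the Retrieve step can be sketched in plain Python. The README does not say how the BM25 and vector results are merged, so reciprocal rank fusion is used here as an assumption. The function names and the hard-coded `vector_scores` are hypothetical; a real pipeline would take the vector scores from ChromaDB and pass the fused candidates on to the cross-encoder.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of every document against the query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank(scores: List[float]) -> List[int]:
    """Document indices sorted from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def reciprocal_rank_fusion(rankings: List[List[int]], k: int = 60) -> List[int]:
    """Fuse several rankings: each list contributes 1 / (k + position)."""
    fused: Dict[int, float] = {}
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + position)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)


docs = [
    "The borrow checker enforces ownership rules in Rust.",
    "Cats sleep for most of the day.",
    "Ownership and borrowing keep memory safe in Rust programs.",
]
query = "rust ownership"
keyword_ranking = rank(bm25_scores(query, docs))
vector_scores = [0.71, 0.05, 0.83]  # made-up similarities standing in for embeddings
vector_ranking = rank(vector_scores)
candidates = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
```

Rank fusion only needs the order of each result list, so the BM25 and cosine scores never have to be put on a common scale.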

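The Generate step, with its inline [S1]-style citations and the messages.jsonl history, might look roughly like this. It is a sketch under assumptions: the helper names, the record fields (`role`, `content`, `source`, `text`), and the file location are hypothetical, since the README only names the file and the citation format.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def build_context(chunks: List[Dict[str, str]]) -> str:
    """Label each retrieved chunk [S1], [S2], ... so the model can cite it."""
    return "\n\n".join(
        f"[S{i}] ({chunk['source']}) {chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )


def append_message(path: Path, role: str, content: str) -> None:
    """Append one chat turn as a single JSON line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps({"role": role, "content": content}) + "\n")


def load_messages(path: Path) -> List[Dict[str, str]]:
    """Read the full history back, one JSON object per line."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


retrieved = [
    {"source": "notes.pdf", "text": "BM25 ranks documents by term overlap."},
    {"source": "slides.pptx", "text": "Cross-encoders score query-passage pairs."},
]
context = build_context(retrieved)

# A temporary directory stands in for the real notebook folder.
history_path = Path(tempfile.mkdtemp()) / "messages.jsonl"
append_message(history_path, "user", "How are candidates ranked?")
append_message(history_path, "assistant", "BM25 uses term overlap [S1].")
history = load_messages(history_path)
```

Appending one JSON object per line means a new turn never requires rewriting the file, and a partially written last line can be skipped on load without losing earlier history.
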
🚀 Running Locally

Prerequisites

  1. Open up a terminal.
  2. Ensure Python 3.10+ or uv is installed.
  3. Clone the repository.

Setup

  1. Environment Variables: Create a .env file in the root directory:

    OPENAI_API_KEY=sk-your-actual-api-key
    NOTEBOOKLM_CHAT_MODEL=gpt-4o-mini
    NOTEBOOKLM_CHUNKING_METHOD=semantic
    
  2. Install & Run: Using uv (recommended):

    uv run --env-file .env python app.py
    

    Or using standard pip:

    pip install -r requirements.txt
    python app.py
    
  3. Login Mocking: When running locally, Gradio automatically mocks the Hugging Face OAuth flow. Just click "Sign in with Hugging Face" and it will inject a dummy local profile allowing you to test the app seamlessly.
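
For reference, the three variables from step 1 could be read at startup along these lines. This is a sketch, not the contents of app.py: the `Settings` class and `load_settings` function are hypothetical, and the fallback values are illustrative rather than the app's confirmed defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    api_key: str
    chat_model: str
    chunking_method: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the three documented variables, failing fast if the key is missing."""
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or the Space secrets.")
    return Settings(
        api_key=api_key,
        # The fallbacks below are illustrative, not the app's confirmed defaults.
        chat_model=env.get("NOTEBOOKLM_CHAT_MODEL", "gpt-4o-mini"),
        chunking_method=env.get("NOTEBOOKLM_CHUNKING_METHOD", "semantic"),
    )


# A plain dict is passed here so the example does not depend on the real environment.
settings = load_settings({"OPENAI_API_KEY": "sk-example", "NOTEBOOKLM_CHAT_MODEL": "gpt-4o-mini"})
```

Failing at startup when the key is missing surfaces a configuration mistake immediately instead of on the first chat request.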


☁️ Deployment & CI

This repository is designed to run out-of-the-box on Hugging Face Spaces.

  • HF Spaces Deployment: The README.md contains the necessary YAML frontmatter to configure the Gradio SDK and OAuth scopes upon launch.
  • Secrets: You must configure OPENAI_API_KEY in the Space settings under Variables and secrets.
  • Storage Ephemerality: On the free Hugging Face Spaces tier, the /data folder is ephemeral and resets when the Space goes to sleep (after 48 hours of inactivity). For persistent production storage, consider linking a Hugging Face Dataset volume to the Space.
  • Continuous Integration: Updates pushed to the main branch of the connected GitHub repository automatically trigger a GitHub Action or webhook to sync files directly to the Space.