metadata
title: NotebookLM Clone
emoji: 📓
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
python_version: '3.10'
pinned: false

NotebookLM Clone

A production-style NotebookLM clone as a Hugging Face Space: RAG chat with citations, source ingestion (PDF/PPTX/TXT/URL), artifact generation (report, quiz, podcast), and per-user isolation.

Tech stack

  • Runtime: Python 3.10+
  • UI: Gradio Blocks
  • Auth: Hugging Face OAuth (or MOCK_USER for local)
  • Vector DB: ChromaDB (persisted under /data)
  • Embeddings: Hugging Face Inference API or sentence-transformers
  • Chat: Hugging Face Inference API (configurable model) or local fallback
  • Artifacts (report, quiz, podcast transcript): Google Gemini API (context-only; citations from retrieved chunks)
  • TTS: HF Inference API or gTTS (for podcast .mp3 only)

Repository layout

.
├── app.py                 # Gradio entry
├── backend/
│   ├── auth.py
│   ├── storage.py
│   ├── notebooks.py
│   ├── ingestion.py
│   ├── retriever.py
│   ├── rag.py
│   ├── gemini_client.py   # Gemini API for artifacts
│   ├── artifacts.py
│   ├── tts.py
│   ├── utils.py
│   └── config.py
├── docs/
│   ├── ARCHITECTURE.md
│   └── rag_comparison.md
├── .github/workflows/
│   └── deploy_hf_space.yml
├── requirements.txt
└── README.md

Configuration (env vars)

| Variable | Description | Default |
|---|---|---|
| GEMINI_API_KEY | Required for artifacts. Google Gemini API key (get one at Google AI Studio) | (none) |
| GEMINI_MODEL | Gemini model for report/quiz/podcast transcript | gemini-1.5-flash |
| HF_TOKEN | Hugging Face API token (recommended for embeddings + chat LLM) | (none) |
| HF_LLM_MODEL | LLM for RAG chat (not artifacts) | HuggingFaceH4/zephyr-7b-beta |
| HF_EMBED_MODEL | Embedding model | sentence-transformers/all-MiniLM-L6-v2 |
| HF_TTS_MODEL | TTS model (optional) | (none) |
| CHUNK_SIZE | RAG chunk size | 1000 |
| CHUNK_OVERLAP | Chunk overlap | 200 |
| TOP_K | Retrieved chunks per query | 5 |
| MMR_LAMBDA | MMR diversity (0–1) | 0.7 |
| MOCK_USER | Username when not using HF OAuth (e.g. local) | (none) |
| DATA_ROOT | Root for user/notebook data | /data |

Without GEMINI_API_KEY, artifact generation (report, quiz, podcast transcript) does not crash the app; the UI shows a clear message asking you to set the key. Without HF_TOKEN, chat falls back to local embeddings and a minimal local LLM, or reports a clear error.
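As an illustration of how the variables in the table above could be read, here is a minimal sketch. It is not the contents of backend/config.py: the `Settings` class, `load_settings`, and `chunk_text` are hypothetical names, and the assumption that CHUNK_SIZE and CHUNK_OVERLAP are measured in characters is mine, not something this README states.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container mirroring the env-var table (not the app's real config)."""
    gemini_api_key: Optional[str]
    gemini_model: str
    hf_token: Optional[str]
    hf_llm_model: str
    hf_embed_model: str
    hf_tts_model: Optional[str]
    chunk_size: int
    chunk_overlap: int
    top_k: int
    mmr_lambda: float
    mock_user: Optional[str]
    data_root: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read each variable, falling back to the default listed in the table."""
    def optional(name: str) -> Optional[str]:
        # Treat unset and blank values the same way: "not configured".
        return env.get(name, "").strip() or None

    return Settings(
        gemini_api_key=optional("GEMINI_API_KEY"),
        gemini_model=env.get("GEMINI_MODEL", "gemini-1.5-flash"),
        hf_token=optional("HF_TOKEN"),
        hf_llm_model=env.get("HF_LLM_MODEL", "HuggingFaceH4/zephyr-7b-beta"),
        hf_embed_model=env.get("HF_EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
        hf_tts_model=optional("HF_TTS_MODEL"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        top_k=int(env.get("TOP_K", "5")),
        mmr_lambda=float(env.get("MMR_LAMBDA", "0.7")),
        mock_user=optional("MOCK_USER"),
        data_root=env.get("DATA_ROOT", "/data"),
    )


def chunk_text(text: str, size: int, overlap: int) -> List[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks
```

With the defaults (1000 and 200), each chunk starts 800 characters after the previous one, so neighbouring chunks share 200 characters of context. A caller could check `settings.gemini_api_key is None` to decide whether to show the "set GEMINI_API_KEY" message instead of calling the API.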


Do I need a .env file?

No. The app runs without one. If you want to set a custom username or use a Hugging Face token, copy .env.example to .env and edit it. See RUN_LOCALLY.md for a full step-by-step guide.

How to run locally

  1. Clone and install

    cd notebooklm__clone
    python -m venv .venv
    .venv\Scripts\activate   # Windows
    # source .venv/bin/activate  # Linux/macOS
    pip install -r requirements.txt
    
  2. Set env

    Artifacts (report, quiz, podcast) require a Gemini API key:

    rem Windows cmd syntax; on Linux/macOS use: export NAME=value
    set GEMINI_API_KEY=your_key_here
    set GEMINI_MODEL=gemini-1.5-flash
    set MOCK_USER=localuser
    rem Optional; used for chat + embeddings
    set HF_TOKEN=hf_...
    
  3. Run

    python app.py
    

    If MOCK_USER is not set, the app sets MOCK_USER=localuser when run with python app.py. Open http://localhost:7860.

  4. Data

    Data is written under ./data (or DATA_ROOT). Layout: data/users/<username>/notebooks/<id>/....
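The username fallback from step 3 and the directory layout from step 4 can be sketched together as follows. This is a hedged illustration only: `resolve_username` and `notebook_dir` are hypothetical helpers (the app's own handler is named get_username_from_request), the `username` attribute on the request object is an assumption, and the path-segment check is an extra safeguard I added rather than something this README describes.

```python
import os
import re
from pathlib import Path
from typing import Any, Mapping, Optional

# Hypothetical rule: a segment starts with a letter or digit and then uses
# only letters, digits, '.', '_' or '-'. This rejects '/', '..' and ''.
_SAFE_SEGMENT = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def resolve_username(request: Any = None,
                     env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Prefer the signed-in user, then MOCK_USER, else None."""
    username = getattr(request, "username", None)  # assumed attribute name
    if username:
        return username
    return env.get("MOCK_USER") or None


def notebook_dir(data_root: str, username: str, notebook_id: str) -> Path:
    """Build <data_root>/users/<username>/notebooks/<notebook_id>."""
    for segment in (username, notebook_id):
        if not _SAFE_SEGMENT.fullmatch(segment):
            raise ValueError(f"unsafe path segment: {segment!r}")
    return Path(data_root) / "users" / username / "notebooks" / notebook_id
```

For example, `notebook_dir("data", "localuser", "nb1")` yields `data/users/localuser/notebooks/nb1`. Validating both segments before joining keeps one user's request from resolving to a path outside their own directory.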


How to deploy to HF Space via GitHub Actions

  1. Create a Space on huggingface.co/spaces:

    • SDK: Gradio
    • Name it e.g. notebooklm-clone
  2. GitHub repo secrets
    In the GitHub repo → Settings → Secrets and variables → Actions, add:

    • HF_TOKEN: your Hugging Face token (with write access)
    • HF_SPACE: your Space repo id, e.g. your-username/notebooklm-clone
  3. Push to main
    Pushing to main runs the workflow .github/workflows/deploy_hf_space.yml, which pushes the repo to your Space:

    git remote add origin https://github.com/your-org/notebooklm-clone.git
    git push -u origin main
    
  4. Space Secrets (required for artifacts)
    In your Space → Settings → Secrets, add:

    • GEMINI_API_KEY = your Gemini API key (get one at Google AI Studio)
    • Optionally GEMINI_MODEL = gemini-1.5-flash
  5. First-time Space setup
    If the Space was created empty, you may need to add it as a git remote and push once manually; the Action then keeps it in sync:

    git remote add space https://huggingface.co/spaces/YOUR_USERNAME/notebooklm-clone
    git push space main
    

    After that, the Action can push to https://...@huggingface.co/spaces/YOUR_USERNAME/notebooklm-clone on each push to main.

  6. Space env
    In the Space → Settings → Variables and secrets, set HF_TOKEN (and optionally HF_LLM_MODEL, HF_EMBED_MODEL, etc.).


Feature checklist (vs rubric)

| Requirement | Implementation |
|---|---|
| Source ingestion (PDF/PPTX/TXT/URL) | backend/ingestion.py: pypdf, python-pptx, readability-lxml + bs4 |
| RAG chat with citations | backend/rag.py + app chat; citations in message + Accordion |
| Artifact: report (Markdown) | backend/artifacts.py → artifacts/reports/ |
| Artifact: quiz (Markdown + answer key) | backend/artifacts.py → artifacts/quizzes/ |
| Artifact: podcast (transcript + MP3) | backend/artifacts.py + backend/tts.py → artifacts/podcasts/ |
| Per-user isolation | backend/auth.py + paths under data/users/<username>/ |
| Multiple notebooks per user | backend/notebooks.py + index.json; UI dropdown |
| Persistent storage under /data | backend/storage.py; exact tree in docs/ARCHITECTURE.md |
| HF OAuth + MOCK_USER fallback | get_username_from_request(request) in handlers |
| ChromaDB persistence | chroma/ per notebook; backend/retriever.py, ingestion.py |
| Two retrieval strategies | Similarity + MMR in backend/retriever.py; UI dropdown |
| Retrieval/generation timing | Logged and shown in UI after each reply |
| RAG comparison doc | docs/rag_comparison.md |
| File enable/disable per source | ingestion.set_source_enabled; UI Enable/Disable + Refresh |
| GitHub Actions deploy to HF Space | .github/workflows/deploy_hf_space.yml |
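To make the "Two retrieval strategies" row concrete, the sketch below shows how MMR (maximal marginal relevance) re-ranking differs from plain similarity ranking. It is a self-contained illustration in pure Python, not the code in backend/retriever.py: `cosine` and `mmr_select` are hypothetical names, and it assumes the common convention that a higher MMR_LAMBDA weights relevance more and diversity less.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query: Sequence[float],
               candidates: Sequence[Sequence[float]],
               k: int = 5,
               lam: float = 0.7) -> List[int]:
    """Return indices of up to k candidates chosen greedily by MMR.

    Each step picks the candidate maximising
        lam * sim(query, c) - (1 - lam) * max(sim(c, s) for s already selected)
    so lam=1.0 reduces to plain similarity ranking.
    """
    relevance = [cosine(query, c) for c in candidates]
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    while remaining and len(selected) < k:
        def score(i: int) -> float:
            # Penalise candidates that are close to something already chosen.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With the default TOP_K of 5 and MMR_LAMBDA of 0.7, relevance still dominates, but a chunk that nearly duplicates an already selected one is pushed down the ranking. Lowering `lam` trades relevance for broader coverage of the sources.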