| --- |
| title: NotebookLM Clone |
| emoji: π |
| colorFrom: purple |
| colorTo: indigo |
| sdk: gradio |
| sdk_version: "4.44.0" |
| app_file: app.py |
| python_version: "3.10" |
| pinned: false |
| --- |
| |
| # NotebookLM Clone |
|
|
| A production-style **NotebookLM clone** as a Hugging Face Space: RAG chat with citations, source ingestion (PDF/PPTX/TXT/URL), artifact generation (report, quiz, podcast), and per-user isolation. |
|
|
| ## Tech stack |
|
|
| - **Runtime**: Python 3.10+ |
| - **UI**: Gradio Blocks |
| - **Auth**: Hugging Face OAuth (or `MOCK_USER` for local) |
| - **Vector DB**: ChromaDB (persisted under `/data`) |
| - **Embeddings**: Hugging Face Inference API or sentence-transformers |
| - **Chat**: Hugging Face Inference API (configurable model) or local fallback |
| - **Artifacts** (report, quiz, podcast transcript): **Google Gemini API** (context-only; citations from retrieved chunks) |
| - **TTS**: HF Inference API or gTTS (for podcast .mp3 only) |
|
|
| ## Repository layout |
|
|
| ``` |
| . |
| ├── app.py # Gradio entry |
| ├── backend/ |
| │   ├── auth.py |
| │   ├── storage.py |
| │   ├── notebooks.py |
| │   ├── ingestion.py |
| │   ├── retriever.py |
| │   ├── rag.py |
| │   ├── gemini_client.py # Gemini API for artifacts |
| │   ├── artifacts.py |
| │   ├── tts.py |
| │   ├── utils.py |
| │   └── config.py |
| ├── docs/ |
| │   ├── ARCHITECTURE.md |
| │   └── rag_comparison.md |
| ├── .github/workflows/ |
| │   └── deploy_hf_space.yml |
| ├── requirements.txt |
| └── README.md |
| ``` |
|
|
| ## Configuration (env vars) |
|
|
| | Variable | Description | Default | |
| |----------|-------------|---------| |
| | **`GEMINI_API_KEY`** | **Required for artifacts.** Google Gemini API key ([get one](https://aistudio.google.com/apikey)) | (none) | |
| | `GEMINI_MODEL` | Gemini model for report/quiz/podcast transcript | `gemini-1.5-flash` | |
| | `HF_TOKEN` | Hugging Face API token (recommended for embeddings + chat LLM) | (none) | |
| | `HF_LLM_MODEL` | LLM for RAG chat (not artifacts) | `HuggingFaceH4/zephyr-7b-beta` | |
| | `HF_EMBED_MODEL` | Embedding model | `sentence-transformers/all-MiniLM-L6-v2` | |
| | `HF_TTS_MODEL` | TTS model (optional) | (none) | |
| | `CHUNK_SIZE` | RAG chunk size | `1000` | |
| | `CHUNK_OVERLAP` | Chunk overlap | `200` | |
| | `TOP_K` | Retrieved chunks per query | `5` | |
| | `MMR_LAMBDA` | MMR diversity (0–1) | `0.7` | |
| | `MOCK_USER` | Username when not using HF OAuth (e.g. local) | (none) | |
| | `DATA_ROOT` | Root for user/notebook data | `/data` | |
|
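As a rough illustration of how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a minimal sliding-window splitter. It assumes both values are measured in characters and that chunks are cut at fixed offsets. The function name `chunk_text` is hypothetical, and the actual logic in `backend/ingestion.py` may differ (for example, by splitting on sentence boundaries).

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap.

    Assumption: sizes are in characters, not tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, each window advances by 800 characters, so consecutive chunks share 200 characters of context.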
|
| Without `GEMINI_API_KEY`, artifact generation (report, quiz, podcast transcript) shows a clear UI message asking you to set it; the app does not crash. Without `HF_TOKEN`, chat falls back to local embeddings and a minimal local LLM, or shows a clear error. |
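
One way to read the table and the fallback behavior above is as a small settings object built from the environment. The sketch below is illustrative only: `Settings`, `load_settings`, and `artifacts_enabled` are hypothetical names, not necessarily what `backend/config.py` defines. The defaults are copied from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    gemini_api_key: Optional[str]
    gemini_model: str
    hf_token: Optional[str]
    hf_llm_model: str
    hf_embed_model: str
    hf_tts_model: Optional[str]
    chunk_size: int
    chunk_overlap: int
    top_k: int
    mmr_lambda: float
    mock_user: Optional[str]
    data_root: str

    @property
    def artifacts_enabled(self) -> bool:
        # Artifacts need a Gemini key; without one the UI can show a message
        # instead of raising.
        return bool(self.gemini_api_key)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, using the documented defaults."""
    return Settings(
        gemini_api_key=env.get("GEMINI_API_KEY") or None,
        gemini_model=env.get("GEMINI_MODEL", "gemini-1.5-flash"),
        hf_token=env.get("HF_TOKEN") or None,
        hf_llm_model=env.get("HF_LLM_MODEL", "HuggingFaceH4/zephyr-7b-beta"),
        hf_embed_model=env.get("HF_EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
        hf_tts_model=env.get("HF_TTS_MODEL") or None,
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        top_k=int(env.get("TOP_K", "5")),
        mmr_lambda=float(env.get("MMR_LAMBDA", "0.7")),
        mock_user=env.get("MOCK_USER") or None,
        data_root=env.get("DATA_ROOT", "/data"),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests, and checking `artifacts_enabled` before calling Gemini keeps a missing key from turning into an exception.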
|
|
| --- |
|
|
| ## Do I need a .env file? |
|
|
| **No.** The app runs without one. If you want to set a custom username or use a Hugging Face token, copy `.env.example` to `.env` and edit it. See **RUN_LOCALLY.md** for a full step-by-step guide. |
| |
| ## How to run locally |
| |
| 1. **Clone and install** |
| |
| ```bash |
| git clone https://github.com/your-org/notebooklm-clone.git |
| cd notebooklm-clone |
| python -m venv .venv |
| .venv\Scripts\activate # Windows |
| # source .venv/bin/activate # Linux/macOS |
| pip install -r requirements.txt |
| ``` |
| |
| 2. **Set env** |
| |
| **Artifacts (report, quiz, podcast) require a Gemini API key:** |
| |
| ```bash |
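| # Windows (cmd) syntax; on Linux/macOS use `export NAME=value` instead of `set NAME=value` |
| set GEMINI_API_KEY=your_key_here |
| set GEMINI_MODEL=gemini-1.5-flash |
| set MOCK_USER=localuser |
| # Optional; for chat + embeddings (kept on its own line: cmd would treat a trailing comment as part of the value) |
| set HF_TOKEN=hf_... |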
| ``` |
| |
| 3. **Run** |
| |
| ```bash |
| python app.py |
| ``` |
| |
| If `MOCK_USER` is not set, the app sets `MOCK_USER=localuser` when run with `python app.py`. Open http://localhost:7860. |
| |
| 4. **Data** |
| |
| Data is written under `./data` (or `DATA_ROOT`). Layout: `data/users/<username>/notebooks/<id>/...`. |
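
To make the layout concrete, here is a minimal sketch of a path helper that builds a notebook directory and rejects names that could escape the per-user tree. The function name `notebook_dir` and the allowed-character rule are assumptions for illustration; see `backend/storage.py` and `docs/ARCHITECTURE.md` for the actual tree.

```python
import re
from pathlib import Path

# Assumption: usernames and notebook ids contain only these characters.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+")


def notebook_dir(data_root: str, username: str, notebook_id: str) -> Path:
    """Return <data_root>/users/<username>/notebooks/<notebook_id>."""
    for label, value in (("username", username), ("notebook_id", notebook_id)):
        if not _SAFE_NAME.fullmatch(value):
            # Rejects "", "..", and anything with a path separator.
            raise ValueError(f"unsafe {label}: {value!r}")
    return Path(data_root) / "users" / username / "notebooks" / notebook_id
```

Validating both components before joining them is what keeps one user's request from resolving to a path under another user's directory.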
| |
| --- |
| |
| ## How to deploy to HF Space via GitHub Actions |
| |
| 1. **Create a Space** on [huggingface.co/spaces](https://huggingface.co/spaces): |
| - SDK: **Gradio** |
| - Name it e.g. `notebooklm-clone` |
|
|
| 2. **GitHub repo secrets** |
| In the GitHub repo → Settings → Secrets and variables → Actions, add: |
| - `HF_TOKEN`: your Hugging Face token (with write access) |
| - `HF_SPACE`: your Space repo id, e.g. `your-username/notebooklm-clone` |
|
|
| 3. **Push to main** |
| Pushing to `main` runs the workflow `.github/workflows/deploy_hf_space.yml`, which pushes the repo to your Space: |
|
|
| ```bash |
| git remote add origin https://github.com/your-org/notebooklm-clone.git |
| git push -u origin main |
| ``` |
|
|
| 4. **Space Secrets (required for artifacts)** |
| In your Space → **Settings → Secrets**, add: |
| - **`GEMINI_API_KEY`** = your Gemini API key (get one at [Google AI Studio](https://aistudio.google.com/apikey)) |
| - Optionally **`GEMINI_MODEL`** = `gemini-1.5-flash` |
| |
| 5. **First-time Space setup** |
| If the Space was created empty, you may need to add the HF Space as a git remote and push once manually: |
| |
| ```bash |
| git remote add space https://huggingface.co/spaces/YOUR_USERNAME/notebooklm-clone |
| git push space main |
| ``` |
| |
| After that, the Action can push to `https://...@huggingface.co/spaces/YOUR_USERNAME/notebooklm-clone` on each push to `main`. |
| |
| 6. **Space env** |
| In the Space → Settings → Variables and secrets, set `HF_TOKEN` (and optionally `HF_LLM_MODEL`, `HF_EMBED_MODEL`, etc.). |
| |
| --- |
| |
| ## Feature checklist (vs rubric) |
| |
| | Requirement | Implementation | |
| |-------------|----------------| |
| | Source ingestion (PDF/PPTX/TXT/URL) | `backend/ingestion.py`: pypdf, python-pptx, readability-lxml + bs4 | |
| | RAG chat with citations | `backend/rag.py` + app chat; citations in message + Accordion | |
| | Artifact: report (Markdown) | `backend/artifacts.py` → `artifacts/reports/` | |
| | Artifact: quiz (Markdown + answer key) | `backend/artifacts.py` → `artifacts/quizzes/` | |
| | Artifact: podcast (transcript + MP3) | `backend/artifacts.py` + `backend/tts.py` → `artifacts/podcasts/` | |
| | Per-user isolation | `backend/auth.py` + paths under `data/users/<username>/` | |
| | Multiple notebooks per user | `backend/notebooks.py` + index.json; UI dropdown | |
| | Persistent storage under /data | `backend/storage.py`; exact tree in docs/ARCHITECTURE.md | |
| | HF OAuth + MOCK_USER fallback | `get_username_from_request(request)` in handlers | |
| | ChromaDB persistence | `chroma/` per notebook; `backend/retriever.py`, `ingestion.py` | |
| | Two retrieval strategies | Similarity + MMR in `backend/retriever.py`; UI dropdown | |
| | Retrieval/generation timing | Logged and shown in UI after each reply | |
| | RAG comparison doc | `docs/rag_comparison.md` | |
| | File enable/disable per source | `ingestion.set_source_enabled`; UI Enable/Disable + Refresh | |
| | GitHub Actions deploy to HF Space | `.github/workflows/deploy_hf_space.yml` | |
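
For readers unfamiliar with the MMR strategy listed in the table, the following is a minimal pure-Python sketch of maximal marginal relevance selection. It assumes the common formulation in which `MMR_LAMBDA` weights relevance to the query and `1 - MMR_LAMBDA` penalizes similarity to chunks already chosen. The names `cosine` and `mmr_select` are hypothetical, and `backend/retriever.py` may implement this differently (for example, by delegating to the vector store).

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query_vec: Vector, doc_vecs: Sequence[Vector],
               top_k: int = 5, mmr_lambda: float = 0.7) -> list[int]:
    """Return indices of up to top_k documents chosen greedily by MMR."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected: list[int] = []

    while candidates and len(selected) < top_k:
        def score(i: int) -> float:
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return mmr_lambda * relevance[i] - (1 - mmr_lambda) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Under this convention, `mmr_lambda=1.0` reduces to plain similarity ranking, while lower values favor chunks that differ from those already selected.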
| |