---
title: NotebookLM Clone
emoji: 📓
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: "4.44.0"
app_file: app.py
python_version: "3.10"
pinned: false
---
# NotebookLM Clone
A production-style **NotebookLM clone** as a Hugging Face Space: RAG chat with citations, source ingestion (PDF/PPTX/TXT/URL), artifact generation (report, quiz, podcast), and per-user isolation.
## Tech stack
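- **Runtime**: Python 3.10+ (the Space pins 3.10 via `python_version`; avoid 3.13, where `audioop`, which `pydub` imports, is no longer in the standard library)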
- **UI**: Gradio Blocks
- **Auth**: Hugging Face OAuth (or `MOCK_USER` for local)
- **Vector DB**: ChromaDB (persisted under `/data`)
- **Embeddings**: Hugging Face Inference API or sentence-transformers
- **Chat**: Hugging Face Inference API (configurable model) or local fallback
- **Artifacts** (report, quiz, podcast transcript): **Google Gemini API** (context-only; citations from retrieved chunks)
- **TTS**: HF Inference API or gTTS (for podcast .mp3 only)
## Repository layout
```
.
├── app.py                  # Gradio entry
├── backend/
│   ├── auth.py
│   ├── storage.py
│   ├── notebooks.py
│   ├── ingestion.py
│   ├── retriever.py
│   ├── rag.py
│   ├── gemini_client.py    # Gemini API for artifacts
│   ├── artifacts.py
│   ├── tts.py
│   ├── utils.py
│   └── config.py
├── docs/
│   ├── ARCHITECTURE.md
│   └── rag_comparison.md
├── .github/workflows/
│   └── deploy_hf_space.yml
├── requirements.txt
└── README.md
```
## Configuration (env vars)
| Variable | Description | Default |
|----------|-------------|---------|
| **`GEMINI_API_KEY`** | **Required for artifacts.** Google Gemini API key ([get one](https://aistudio.google.com/apikey)) | (none) |
| `GEMINI_MODEL` | Gemini model for report/quiz/podcast transcript | `gemini-1.5-flash` |
| `HF_TOKEN` | Hugging Face API token (recommended for embeddings + chat LLM) | (none) |
| `HF_LLM_MODEL` | LLM for RAG chat (not artifacts) | `HuggingFaceH4/zephyr-7b-beta` |
| `HF_EMBED_MODEL` | Embedding model | `sentence-transformers/all-MiniLM-L6-v2` |
| `HF_TTS_MODEL` | TTS model (optional) | (none) |
| `CHUNK_SIZE` | RAG chunk size | `1000` |
| `CHUNK_OVERLAP` | Chunk overlap | `200` |
| `TOP_K` | Retrieved chunks per query | `5` |
| `MMR_LAMBDA` | MMR diversity (0–1) | `0.7` |
| `MOCK_USER` | Username when not using HF OAuth (e.g. local) | (none) |
| `DATA_ROOT` | Root for user/notebook data | `/data` |
Without `GEMINI_API_KEY`, artifact generation (report, quiz, podcast transcript) shows a clear UI message asking you to set it; the app does not crash. Without `HF_TOKEN`, chat falls back to local embeddings and a minimal local LLM, or shows a clear error.
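As a rough illustration of how the defaults in the table could be read and how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a minimal sketch. The `Settings` class and the `load_settings` and `chunk_text` helpers are hypothetical names, not taken from `backend/config.py`, and the sketch assumes chunk sizes are measured in characters, which the table does not state. A missing `GEMINI_API_KEY` or `HF_TOKEN` becomes `None` so that callers can show a message instead of failing.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class Settings:
    # Hypothetical container; the real backend/config.py may be organised differently.
    gemini_api_key: Optional[str]
    hf_token: Optional[str]
    chunk_size: int
    chunk_overlap: int
    top_k: int
    mmr_lambda: float


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        gemini_api_key=env.get("GEMINI_API_KEY") or None,  # None means artifacts are disabled
        hf_token=env.get("HF_TOKEN") or None,              # None means local fallback
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        top_k=int(env.get("TOP_K", "5")),
        mmr_lambda=float(env.get("MMR_LAMBDA", "0.7")),
    )


def chunk_text(text: str, size: int, overlap: int) -> List[str]:
    """Split text into windows of `size` characters; neighbours share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


settings = load_settings({})  # empty mapping: every value takes its default
chunks = chunk_text("x" * 2500, settings.chunk_size, settings.chunk_overlap)
```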
---
## Do I need a .env file?
**No.** The app runs without one. If you want to set a custom username or use a Hugging Face token, copy `.env.example` to `.env` and edit it. See **RUN_LOCALLY.md** for a full step-by-step guide.
## How to run locally
1. **Clone and install**
```bash
cd notebooklm__clone
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # Linux/macOS
pip install -r requirements.txt
```
2. **Set env**
**Artifacts (report, quiz, podcast) require a Gemini API key:**
```bash
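# Windows (cmd) syntax; on Linux/macOS use `export NAME=value` instead of `set NAME=value`
set GEMINI_API_KEY=your_key_here
set GEMINI_MODEL=gemini-1.5-flash
set MOCK_USER=localuser
# Optional; used for chat + embeddings (keep comments on their own line so cmd does not store them in the value)
set HF_TOKEN=hf_...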
```
3. **Run**
```bash
python app.py
```
If `MOCK_USER` is not set, the app sets `MOCK_USER=localuser` when run with `python app.py`. Open http://localhost:7860.
4. **Data**
Data is written under `./data` (or `DATA_ROOT`). Layout: `data/users/<username>/notebooks/<id>/...`.
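The layout above can be sketched as a small path helper. `notebook_dir` is a hypothetical function for illustration only; `backend/storage.py` may build its paths differently. Rejecting path separators and `..` is an assumption about how per-user isolation might be enforced, not a description of the actual code.

```python
from pathlib import Path


def notebook_dir(data_root: str, username: str, notebook_id: str) -> Path:
    """Return <data_root>/users/<username>/notebooks/<notebook_id>."""
    for part in (username, notebook_id):
        # Reject empty names and anything that could step outside the user's tree.
        if not part or part in {".", ".."} or "/" in part or "\\" in part:
            raise ValueError(f"unsafe path component: {part!r}")
    return Path(data_root) / "users" / username / "notebooks" / notebook_id


# Example with the local default root and the default mock user.
path = notebook_dir("data", "localuser", "nb1")
```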
---
## How to deploy to HF Space via GitHub Actions
1. **Create a Space** on [huggingface.co/spaces](https://huggingface.co/spaces):
- SDK: **Gradio**
- Name it e.g. `notebooklm-clone`
2. **GitHub repo secrets**
In the GitHub repo → Settings → Secrets and variables → Actions, add:
- `HF_TOKEN`: your Hugging Face token (with write access)
- `HF_SPACE`: your Space repo id, e.g. `your-username/notebooklm-clone`
3. **Push to main**
Pushing to `main` runs the workflow `.github/workflows/deploy_hf_space.yml`, which pushes the repo to your Space:
```bash
git remote add origin https://github.com/your-org/notebooklm-clone.git
git push -u origin main
```
4. **Space Secrets (required for artifacts)**
In your Space → **Settings → Secrets**, add:
- **`GEMINI_API_KEY`** = your Gemini API key (get one at [Google AI Studio](https://aistudio.google.com/apikey))
- Optionally **`GEMINI_MODEL`** = `gemini-1.5-flash`
5. **First-time Space setup**
If the Space was created empty, you may need to add the HF Space as a git remote and push once manually; the Action then keeps it in sync:
```bash
git remote add space https://huggingface.co/spaces/YOUR_USERNAME/notebooklm-clone
git push space main
```
After that, the Action can push to `https://...@huggingface.co/spaces/YOUR_USERNAME/notebooklm-clone` on each push to `main`.
6. **Space env**
In the Space → Settings → Variables and secrets, set `HF_TOKEN` (and optionally `HF_LLM_MODEL`, `HF_EMBED_MODEL`, etc.).
---
## Feature checklist (vs rubric)
| Requirement | Implementation |
|-------------|----------------|
| Source ingestion (PDF/PPTX/TXT/URL) | `backend/ingestion.py`: pypdf, python-pptx, readability-lxml + bs4 |
| RAG chat with citations | `backend/rag.py` + app chat; citations in message + Accordion |
| Artifact: report (Markdown) | `backend/artifacts.py` → `artifacts/reports/` |
| Artifact: quiz (Markdown + answer key) | `backend/artifacts.py` → `artifacts/quizzes/` |
| Artifact: podcast (transcript + MP3) | `backend/artifacts.py` + `backend/tts.py` → `artifacts/podcasts/` |
| Per-user isolation | `backend/auth.py` + paths under `data/users/<username>/` |
| Multiple notebooks per user | `backend/notebooks.py` + index.json; UI dropdown |
| Persistent storage under /data | `backend/storage.py`; exact tree in docs/ARCHITECTURE.md |
| HF OAuth + MOCK_USER fallback | `get_username_from_request(request)` in handlers |
| ChromaDB persistence | `chroma/` per notebook; `backend/retriever.py`, `ingestion.py` |
| Two retrieval strategies | Similarity + MMR in `backend/retriever.py`; UI dropdown |
| Retrieval/generation timing | Logged and shown in UI after each reply |
| RAG comparison doc | `docs/rag_comparison.md` |
| File enable/disable per source | `ingestion.set_source_enabled`; UI Enable/Disable + Refresh |
| GitHub Actions deploy to HF Space | `.github/workflows/deploy_hf_space.yml` |
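To make the "Two retrieval strategies" row more concrete, here is a minimal, dependency-free sketch of maximal marginal relevance (MMR) next to plain similarity ranking. It is not the code in `backend/retriever.py`; the function names are hypothetical, and it assumes the common convention in which `lam` (compare `MMR_LAMBDA`) weights relevance to the query, so lower values favour diversity. With `lam=1.0` the selection reduces to ordinary similarity ranking.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k: int = 5, lam: float = 0.7) -> List[int]:
    """Return indices of up to k documents chosen greedily by MMR."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected: List[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            # Penalise similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy vectors: doc 1 is a near-duplicate of doc 0, doc 2 points elsewhere.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [1.0, 0.01], [0.6, 0.8]]
by_similarity = mmr_select(query, docs, k=2, lam=1.0)
with_diversity = mmr_select(query, docs, k=2, lam=0.3)
```

In this toy example, pure similarity picks the two near-duplicates, while the lower `lam` swaps the second pick for the less redundant document.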