	---
	title: NotebookLM Clone
	emoji: 📓
	colorFrom: purple
	colorTo: indigo
	sdk: gradio
	sdk_version: "4.44.0"
	app_file: app.py
	python_version: "3.10"
	pinned: false
	---

	# NotebookLM Clone

	A production-style NotebookLM clone as a Hugging Face Space: RAG chat with citations, source ingestion (PDF/PPTX/TXT/URL), artifact generation (report, quiz, podcast), and per-user isolation.

	## Tech stack

	- Runtime: Python 3.10 (pinned for the Space via `python_version`; pydub's `audioop` dependency breaks on 3.13)
	- UI: Gradio Blocks
	- Auth: Hugging Face OAuth (or `MOCK_USER` for local)
	- Vector DB: ChromaDB (persisted under `/data`)
	- Embeddings: Hugging Face Inference API or sentence-transformers
	- Chat: Hugging Face Inference API (configurable model) or local fallback
	- Artifacts (report, quiz, podcast transcript): Google Gemini API (context-only; citations from retrieved chunks)
	- TTS: HF Inference API or gTTS (for podcast .mp3 only)

	## Repository layout

	```
	.
	├── app.py # Gradio entry
	├── backend/
	│ ├── auth.py
	│ ├── storage.py
	│ ├── notebooks.py
	│ ├── ingestion.py
	│ ├── retriever.py
	│ ├── rag.py
	│ ├── gemini_client.py # Gemini API for artifacts
	│ ├── artifacts.py
	│ ├── tts.py
	│ ├── utils.py
	│ └── config.py
	├── docs/
	│ ├── ARCHITECTURE.md
	│ └── rag_comparison.md
	├── .github/workflows/
	│ └── deploy_hf_space.yml
	├── requirements.txt
	└── README.md
	```

	## Configuration (env vars)

	| Variable | Description | Default |
	|----------|-------------|---------|
	| `GEMINI_API_KEY` | Required for artifacts. Google Gemini API key ([get one](https://aistudio.google.com/apikey)) | — |
	| `GEMINI_MODEL` | Gemini model for report/quiz/podcast transcript | `gemini-1.5-flash` |
	| `HF_TOKEN` | Hugging Face API token (recommended for embeddings + chat LLM) | — |
	| `HF_LLM_MODEL` | LLM for RAG chat (not artifacts) | `HuggingFaceH4/zephyr-7b-beta` |
	| `HF_EMBED_MODEL` | Embedding model | `sentence-transformers/all-MiniLM-L6-v2` |
	| `HF_TTS_MODEL` | TTS model (optional) | — |
	| `CHUNK_SIZE` | RAG chunk size | `1000` |
	| `CHUNK_OVERLAP` | Chunk overlap | `200` |
	| `TOP_K` | Retrieved chunks per query | `5` |
	| `MMR_LAMBDA` | MMR diversity (0–1) | `0.7` |
	| `MOCK_USER` | Username when not using HF OAuth (e.g. local) | — |
	| `DATA_ROOT` | Root for user/notebook data | `/data` |
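
As a rough illustration of how these settings could be consumed, the sketch below reads a few of them from the environment using the defaults from the table, then applies `CHUNK_SIZE` and `CHUNK_OVERLAP` in a simple character-based splitter. The names `Settings`, `load_settings`, and `chunk_text` are hypothetical and are not taken from `backend/config.py` or `backend/ingestion.py`; the actual chunker may split on tokens or sentences rather than characters.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    chunk_size: int
    chunk_overlap: int
    top_k: int
    mmr_lambda: float
    data_root: str


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (os.environ by default), using the table's defaults."""
    env = os.environ if env is None else env
    return Settings(
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        top_k=int(env.get("TOP_K", "5")),
        mmr_lambda=float(env.get("MMR_LAMBDA", "0.7")),
        data_root=env.get("DATA_ROOT", "/data"),
    )


def chunk_text(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` characters; each window shares `overlap` characters with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

With the default values, each window advances by 800 characters, so consecutive chunks share 200 characters of context.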

	Without `GEMINI_API_KEY`, artifact generation (report, quiz, podcast transcript) shows a clear UI message asking you to set it; the app does not crash. Without `HF_TOKEN`, chat falls back to local embeddings and a minimal local LLM, or reports a clear error.
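
The fallback behaviour described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `artifact_precheck`, `chat_backend`, the message text, and the backend labels are hypothetical names, not the app's actual code.

```python
import os

# Hypothetical wording; the real UI message may differ.
MISSING_KEY_MESSAGE = "Set GEMINI_API_KEY to generate reports, quizzes, and podcast transcripts."


def artifact_precheck(env=None):
    """Return a user-facing message if artifacts cannot run, else None (never raises)."""
    env = os.environ if env is None else env
    if not env.get("GEMINI_API_KEY", "").strip():
        return MISSING_KEY_MESSAGE
    return None


def chat_backend(env=None) -> str:
    """Pick the hosted backend when HF_TOKEN is present, otherwise the local fallback."""
    env = os.environ if env is None else env
    return "hf_inference" if env.get("HF_TOKEN", "").strip() else "local_fallback"
```

A handler would call `artifact_precheck()` first and show the returned message in the UI instead of calling Gemini when the key is absent.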

	---

	## Do I need a .env file?

	No. The app runs without one. If you want to set a custom username or use a Hugging Face token, copy `.env.example` to `.env` and edit it. See RUN_LOCALLY.md for a full step-by-step guide.

	## How to run locally

	1. Clone and install

	```bash
	cd notebooklm__clone
	python -m venv .venv
	.venv\Scripts\activate # Windows
	# source .venv/bin/activate # Linux/macOS
	pip install -r requirements.txt
	```

	2. Set env

	Artifacts (report, quiz, podcast) require a Gemini API key:

	```bash
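	# Windows (cmd); on Linux/macOS use `export NAME=value` instead of `set NAME=value`
	set GEMINI_API_KEY=your_key_here
	set GEMINI_MODEL=gemini-1.5-flash
	set MOCK_USER=localuser
	# Optional; for chat + embeddings. Keep comments off the `set` line, or cmd stores them in the value.
	set HF_TOKEN=hf_...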
	```

	3. Run

	```bash
	python app.py
	```

	If `MOCK_USER` is not set, running `python app.py` defaults it to `localuser`. Then open http://localhost:7860.

	4. Data

	Data is written under `./data` (or `DATA_ROOT`). Layout: `data/users/<username>/notebooks/<id>/...`.
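
The per-user layout above can be sketched with `pathlib`. This is a minimal example, not the code in `backend/storage.py`: the function name `notebook_dir` and the rule restricting names to letters, digits, `_`, and `-` are assumptions made here to show how path traversal could be ruled out.

```python
import os
import re
from pathlib import Path

# Assumed naming rule for this sketch; the app may accept a different character set.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+")


def notebook_dir(username: str, notebook_id: str, data_root=None) -> Path:
    """Build <root>/users/<username>/notebooks/<id>, rejecting unsafe path parts."""
    root = Path(data_root or os.environ.get("DATA_ROOT", "./data"))
    for part in (username, notebook_id):
        if not _SAFE_NAME.fullmatch(part):
            raise ValueError(f"unsafe path component: {part!r}")
    return root / "users" / username / "notebooks" / notebook_id
```

For example, `notebook_dir("localuser", "nb1", "data")` resolves to `data/users/localuser/notebooks/nb1`, while a username such as `../other` raises `ValueError`.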

	---

	## How to deploy to HF Space via GitHub Actions

	1. Create a Space on [huggingface.co/spaces](https://huggingface.co/spaces):
	- SDK: Gradio
	- Name it e.g. `notebooklm-clone`

	2. GitHub repo secrets
	In the GitHub repo → Settings → Secrets and variables → Actions, add:
	- `HF_TOKEN`: your Hugging Face token (with write access)
	- `HF_SPACE`: your Space repo id, e.g. `your-username/notebooklm-clone`

	3. Push to main
	Pushing to `main` runs the workflow `.github/workflows/deploy_hf_space.yml`, which pushes the repo to your Space:

	```bash
	git remote add origin https://github.com/your-org/notebooklm-clone.git
	git push -u origin main
	```

	4. Space Secrets (required for artifacts)
	In your Space → Settings → Secrets, add:
	- `GEMINI_API_KEY` = your Gemini API key (get one at [Google AI Studio](https://aistudio.google.com/apikey))
	- Optionally `GEMINI_MODEL` = `gemini-1.5-flash`

	5. First-time Space setup
	If the Space was created empty, you may need to add the HF Space as a git remote and push once manually; after that, the Action keeps it in sync:

	```bash
	git remote add space https://huggingface.co/spaces/YOUR_USERNAME/notebooklm-clone
	git push space main
	```

	After that, the Action can push to `https://...@huggingface.co/spaces/YOUR_USERNAME/notebooklm-clone` on each push to `main`.

	6. Space env
	In the Space → Settings → Variables and secrets, set `HF_TOKEN` (and optionally `HF_LLM_MODEL`, `HF_EMBED_MODEL`, etc.).

	---

	## Feature checklist (vs rubric)

	| Requirement | Implementation |
	|-------------|----------------|
	| Source ingestion (PDF/PPTX/TXT/URL) | `backend/ingestion.py`: pypdf, python-pptx, readability-lxml + bs4 |
	| RAG chat with citations | `backend/rag.py` + app chat; citations in message + Accordion |
	| Artifact: report (Markdown) | `backend/artifacts.py` → `artifacts/reports/` |
	| Artifact: quiz (Markdown + answer key) | `backend/artifacts.py` → `artifacts/quizzes/` |
	| Artifact: podcast (transcript + MP3) | `backend/artifacts.py` + `backend/tts.py` → `artifacts/podcasts/` |
	| Per-user isolation | `backend/auth.py` + paths under `data/users/<username>/` |
	| Multiple notebooks per user | `backend/notebooks.py` + index.json; UI dropdown |
	| Persistent storage under /data | `backend/storage.py`; exact tree in docs/ARCHITECTURE.md |
	| HF OAuth + MOCK_USER fallback | `get_username_from_request(request)` in handlers |
	| ChromaDB persistence | `chroma/` per notebook; `backend/retriever.py`, `ingestion.py` |
	| Two retrieval strategies | Similarity + MMR in `backend/retriever.py`; UI dropdown |
	| Retrieval/generation timing | Logged and shown in UI after each reply |
	| RAG comparison doc | `docs/rag_comparison.md` |
	| File enable/disable per source | `ingestion.set_source_enabled`; UI Enable/Disable + Refresh |
	| GitHub Actions deploy to HF Space | `.github/workflows/deploy_hf_space.yml` |
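
To make the difference between the two retrieval strategies concrete, here is a small pure-Python sketch of Maximal Marginal Relevance (MMR) selection. It is not the code in `backend/retriever.py`, which may delegate to ChromaDB or another library; the function names and the toy vectors in the usage note are assumptions. `lam` plays the role of `MMR_LAMBDA`: at `1.0` the ranking reduces to plain similarity, and lower values penalise chunks that resemble ones already selected.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query_vec, doc_vecs, k=5, lam=0.7):
    """Return indices of up to k documents, greedily chosen by MMR score."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a query of `[1.0, 0.0]` and documents `[[0.9, 0.1], [0.9, 0.11], [0.7, -0.7]]`, plain similarity (`lam=1.0`, `k=2`) keeps the two near-duplicates, whereas `lam=0.5` replaces the second one with the more distinct third document.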