Clone_Lm / docs /ARCHITECTURE.md

skumar54

NotebookLM clone: Gradio app, backend, Gemini artifacts

9c9ce67 3 months ago


	# Architecture: NotebookLM Clone

	## Data flow

	```
	User (HF OAuth / MOCK_USER)
	  → username
	  → /data/users/<username>/notebooks/
	      → index.json (list of notebooks)
	      → <notebook-uuid>/
	          → files_raw/ (uploaded PDF/PPTX/TXT)
	          → files_extracted/ (extracted text JSON per source)
	          → sources.json (source registry: id, filename/url, type, enabled)
	          → chroma/ (ChromaDB persistence)
	          → chat/messages.jsonl (conversation history)
	          → artifacts/
	              → reports/
	              → quizzes/
	              → podcasts/ (transcript_.md, podcast_.mp3)
	              → index.json (artifact metadata)
	```
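
The first hop in this flow, from request to username, might look like the minimal sketch below. It relies only on the request object exposing a `username` attribute, as `gr.Request` does for logged-in users. The lookup order (`MOCK_USER` first) and the `_safe` sanitizer are assumptions made for illustration, not taken from `backend/auth.py`.

```python
import os
import re
from typing import Any, Optional


def _safe(name: str) -> str:
    """Hypothetical helper: keep only characters that are safe in a directory name."""
    cleaned = re.sub(r"[^A-Za-z0-9_.-]", "_", name)
    return cleaned.strip(".") or "anonymous"


def get_username_from_request(request: Optional[Any]) -> str:
    """Resolve the current user. The precedence used here is an assumption."""
    # Assumed: a non-empty MOCK_USER env var overrides the request (local runs).
    mock_user = os.environ.get("MOCK_USER", "").strip()
    if mock_user:
        return _safe(mock_user)
    # getattr keeps the sketch runnable without Gradio installed.
    username = getattr(request, "username", None)
    if username:
        return _safe(str(username))
    return "anonymous"
```

Sanitizing the name before it becomes a path segment keeps a crafted username from escaping `/data/users/`.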

	## Modules

	| Module | Responsibility |
	|--------|----------------|
	| `backend/config.py` | Env vars, paths, constants. No global mutable state. |
	| `backend/auth.py` | Derive username from `gr.Request` or `MOCK_USER`. |
	| `backend/storage.py` | Path helpers for user/notebook dirs and files. |
	| `backend/notebooks.py` | CRUD: list, create, rename, delete notebooks. |
	| `backend/ingestion.py` | File/URL ingestion: extract text → chunk → embed → Chroma upsert. Source enable/disable. |
	| `backend/retriever.py` | Embedding function (HF API or sentence-transformers). Chroma collection. Retrieval: similarity or MMR. |
	| `backend/rag.py` | Retrieve → build prompt → LLM (HF API or local) → format answer with citations. Timing. Chat only. |
	| `backend/gemini_client.py` | Gemini API client for artifact generation only (context-only; no API key logged). |
	| `backend/artifacts.py` | Report, quiz, podcast transcript via Gemini; citations from chunk metadata; TTS for podcast .mp3; persist under `artifacts/`. |
	| `backend/tts.py` | TTS: HF Inference API or gTTS fallback. |
	| `backend/utils.py` | Logging, `user_data_dir`, `read_json`/`write_json`/`read_jsonl`/`append_jsonl`, `normalize_text`. |
	| `app.py` | Gradio UI: notebooks, sources, chat, citations, artifacts. All handlers take `request` for username. |
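
As an illustration of the JSONL helpers listed for `backend/utils.py`, the sketch below shows one way `append_jsonl` and `read_jsonl` could behave for a file such as `chat/messages.jsonl`. Only the function names come from the table; the signatures, the record shape, and the handling of a missing file are assumptions.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def append_jsonl(path: Path, record: Dict[str, Any]) -> None:
    """Append one JSON object as a single line, creating parent dirs if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Read every non-empty line as JSON; a missing file yields an empty list."""
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Appending one line per message means the chat history never has to be rewritten in full, which suits a log that only grows.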

	## Storage tree (exact)

	```
	/data/
	└── users/
	    └── <username>/
	        └── notebooks/
	            ├── index.json
	            └── <notebook-uuid>/
	                ├── files_raw/
	                ├── files_extracted/
	                ├── sources.json
	                ├── chroma/
	                ├── chat/
	                │   └── messages.jsonl
	                └── artifacts/
	                    ├── index.json
	                    ├── reports/
	                    ├── quizzes/
	                    └── podcasts/
	```
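
The tree maps naturally onto a small set of path helpers. The sketch below is one possible shape for them; `NotebookPaths` and `notebook_paths` are hypothetical names, not the actual API of `backend/storage.py`.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class NotebookPaths:
    """Hypothetical bundle of the per-notebook paths shown in the tree."""

    root: Path  # .../users/<username>/notebooks/<notebook-uuid>

    @property
    def files_raw(self) -> Path:
        return self.root / "files_raw"

    @property
    def files_extracted(self) -> Path:
        return self.root / "files_extracted"

    @property
    def sources(self) -> Path:
        return self.root / "sources.json"

    @property
    def chroma(self) -> Path:
        return self.root / "chroma"

    @property
    def messages(self) -> Path:
        return self.root / "chat" / "messages.jsonl"

    @property
    def artifacts(self) -> Path:
        return self.root / "artifacts"

    @property
    def artifact_index(self) -> Path:
        return self.artifacts / "index.json"

    def ensure_dirs(self) -> None:
        """Create every directory in the tree; files are written on demand."""
        for directory in (
            self.files_raw, self.files_extracted, self.chroma,
            self.messages.parent, self.artifacts / "reports",
            self.artifacts / "quizzes", self.artifacts / "podcasts",
        ):
            directory.mkdir(parents=True, exist_ok=True)


def notebook_paths(data_root: Path, username: str, notebook_id: str) -> NotebookPaths:
    """Build the paths for one (username, notebook_id) pair."""
    return NotebookPaths(data_root / "users" / username / "notebooks" / notebook_id)
```

For example, `notebook_paths(Path("/data"), "alice", "1234").messages` resolves to `/data/users/alice/notebooks/1234/chat/messages.jsonl`.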

	## Request flow

	1. Auth: Every state-changing handler receives a `gr.Request`; `get_username_from_request(request)` returns the username (or `MOCK_USER` / `anonymous`). All paths live under `user_data_dir(username)`.

	2. Notebook: The user selects, creates, renames, or deletes notebooks. The current `notebook_id` is kept in state and in a hidden textbox; all source, chat, and artifact operations use `(username, notebook_id)`.

	3. Ingestion: Upload or URL → extract text (pypdf/python-pptx/readability) → chunk (recursive split, overlap) → embed (HF or local) → upsert into Chroma with metadata (`source_id`, `source_name`, `page_or_slide`, `enabled`). `sources.json` is updated.

	4. RAG: Query → embed → Chroma query (filtered to enabled sources) → optional MMR → build context string → LLM with citation instructions → append to `messages.jsonl`. Retrieval and generation times are logged and shown in the UI.

	5. Artifacts: Reports, quizzes, and podcasts run retrieval again (artifact-specific query plus an extra instruction). Gemini generates Markdown from the retrieved context only; citations `[1]`, `[2]` are mapped from chunk metadata. The podcast transcript also comes from Gemini, and TTS (HF or gTTS) produces the .mp3. Entries are appended to `artifacts/index.json`.
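
Steps 4 and 5 both depend on numbering the retrieved chunks so that `[1]`, `[2]` in the model output can be traced back to chunk metadata. The sketch below shows one way to do that. The chunk dictionary shape, the function names `build_context` and `build_prompt`, and the prompt wording are all assumptions for illustration.

```python
from typing import Any, Dict, List, Tuple


def build_context(chunks: List[Dict[str, Any]]) -> Tuple[str, List[Dict[str, Any]]]:
    """Number retrieved chunks [1], [2], ... and return (context, citations)."""
    blocks: List[str] = []
    citations: List[Dict[str, Any]] = []
    for number, chunk in enumerate(chunks, start=1):
        meta = chunk["metadata"]
        # Each block starts with its citation number and source location.
        blocks.append(
            f"[{number}] {meta['source_name']} ({meta['page_or_slide']})\n{chunk['text']}"
        )
        citations.append({
            "number": number,
            "source_id": meta["source_id"],
            "source_name": meta["source_name"],
            "page_or_slide": meta["page_or_slide"],
        })
    return "\n\n".join(blocks), citations


def build_prompt(question: str, context: str) -> str:
    """Assemble a context-only prompt; the wording is illustrative."""
    return (
        "Answer using only the numbered context below. "
        "Cite sources inline as [1], [2].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Returning the citation list alongside the context lets the UI render a source for each number without parsing the context string again.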

	## Chroma

	- One collection per notebook, named `chunks`.
	- Documents stored with metadata: `source_id`, `source_name`, `source_type`, `page_or_slide`, `chunk_index`, `enabled`.
	- Chunk IDs: `{source_id}::{chunk_index}`.
	- Retrieval is restricted to chunks from enabled sources and supports similarity-only or MMR ranking.
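
To make the difference between similarity-only and MMR retrieval concrete, here is a small pure-Python sketch of maximal marginal relevance over candidates that already carry embeddings. It is not the code in `backend/retriever.py`: the candidate shape, the `mmr_select` name, and the in-Python `enabled` filter are assumptions. IDs follow the `{source_id}::{chunk_index}` convention above.

```python
import math
from typing import Any, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query_vec: Sequence[float],
    candidates: List[Dict[str, Any]],
    top_k: int,
    mmr_lambda: float = 0.5,
) -> List[str]:
    """Pick up to top_k chunk IDs, trading relevance against redundancy."""
    # Assumed: disabled sources are dropped via the `enabled` metadata flag.
    pool = [c for c in candidates if c["metadata"].get("enabled", True)]
    selected: List[Dict[str, Any]] = []

    def score(c: Dict[str, Any]) -> float:
        relevance = cosine(query_vec, c["embedding"])
        redundancy = max(
            (cosine(c["embedding"], s["embedding"]) for s in selected),
            default=0.0,
        )
        return mmr_lambda * relevance - (1.0 - mmr_lambda) * redundancy

    while pool and len(selected) < top_k:
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return [c["id"] for c in selected]
```

With `mmr_lambda=1.0` the redundancy term vanishes and the result is plain similarity ranking. Lower values push the selection toward chunks that differ from those already chosen.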

	## Configuration (env)

	See the README and `backend/config.py`. Artifacts: `GEMINI_API_KEY`, `GEMINI_MODEL`. Chat, embeddings, and TTS: `HF_TOKEN`, `HF_LLM_MODEL`, `HF_EMBED_MODEL`, `HF_TTS_MODEL`. Chunking and retrieval: `CHUNK_SIZE`, `CHUNK_OVERLAP`, `TOP_K`, `MMR_LAMBDA`. Runtime: `MOCK_USER`, `DATA_ROOT`.
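
A configuration module with no global mutable state could read these variables into a frozen dataclass, roughly as sketched below. The `Settings` and `load_settings` names and every default value except the `/data` root are placeholders chosen for illustration; the real defaults live in `backend/config.py`.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical immutable view of a subset of the env configuration."""

    data_root: Path
    chunk_size: int
    chunk_overlap: int
    top_k: int
    mmr_lambda: float
    mock_user: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Parse and validate settings; the default values are placeholders."""
    settings = Settings(
        data_root=Path(env.get("DATA_ROOT", "/data")),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "100")),
        top_k=int(env.get("TOP_K", "5")),
        mmr_lambda=float(env.get("MMR_LAMBDA", "0.5")),
        mock_user=env.get("MOCK_USER", ""),
    )
    # Overlap must be smaller than the chunk, or chunking never advances.
    if not 0 <= settings.chunk_overlap < settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be >= 0 and < CHUNK_SIZE")
    if not 0.0 <= settings.mmr_lambda <= 1.0:
        raise ValueError("MMR_LAMBDA must be between 0 and 1")
    return settings
```

Passing the environment in as a mapping keeps the loader easy to test: `load_settings({"CHUNK_SIZE": "500"})` works without touching the process environment.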