# Architecture: NotebookLM Clone

## Data flow

```
User (HF OAuth / MOCK_USER)
  → username
  → /data/users//notebooks/
      → index.json (list of notebooks)
      → /
          → files_raw/ (uploaded PDF/PPTX/TXT)
          → files_extracted/ (extracted text JSON per source)
          → sources.json (source registry: id, filename/url, type, enabled)
          → chroma/ (ChromaDB persistence)
          → chat/messages.jsonl (conversation history)
          → artifacts/
              → reports/
              → quizzes/
              → podcasts/ (transcript_*.md, podcast_*.mp3)
              → index.json (artifact metadata)
```

## Modules

| Module | Responsibility |
|--------|----------------|
| `backend/config.py` | Env vars, paths, constants. No global mutable state. |
| `backend/auth.py` | Derive username from `gr.Request` or `MOCK_USER`. |
| `backend/storage.py` | Path helpers for user/notebook dirs and files. |
| `backend/notebooks.py` | CRUD: list, create, rename, delete notebooks. |
| `backend/ingestion.py` | File/URL ingestion: extract text → chunk → embed → Chroma upsert. Source enable/disable. |
| `backend/retriever.py` | Embedding function (HF API or sentence-transformers). Chroma collection. Retrieval: similarity or MMR. |
| `backend/rag.py` | Retrieve → build prompt → LLM (HF API or local) → format answer with citations. Timing. Chat only. |
| `backend/gemini_client.py` | Gemini API client for artifact generation only (context-only; no API key logged). |
| `backend/artifacts.py` | Report, quiz, podcast transcript via Gemini; citations from chunk metadata; TTS for podcast .mp3; persist under artifacts/. |
| `backend/tts.py` | TTS: HF Inference API or gTTS fallback. |
| `backend/utils.py` | Logging, user_data_dir, read_json/write_json/read_jsonl/append_jsonl, normalize_text. |
| `app.py` | Gradio UI: notebooks, sources, chat, citations, artifacts. All handlers take `request` for username. |

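
The per-notebook layout described in the data flow, and the path-helper role listed for `backend/storage.py`, could look roughly like the sketch below. The function names (`user_notebooks_dir`, `notebook_paths`), the hard-coded `/data` default, and the returned dictionary shape are assumptions made for illustration; they are not taken from the actual module.

```python
from pathlib import Path

# Assumption: the data root defaults to /data, as in the data flow above.
DATA_ROOT = Path("/data")


def user_notebooks_dir(username: str, data_root: Path = DATA_ROOT) -> Path:
    """Directory holding one user's notebooks and their index.json."""
    return data_root / "users" / username / "notebooks"


def notebook_paths(
    username: str, notebook_id: str, data_root: Path = DATA_ROOT
) -> dict[str, Path]:
    """Hypothetical helper: every per-notebook path from the data flow."""
    base = user_notebooks_dir(username, data_root) / notebook_id
    artifacts = base / "artifacts"
    return {
        "files_raw": base / "files_raw",
        "files_extracted": base / "files_extracted",
        "sources": base / "sources.json",
        "chroma": base / "chroma",
        "messages": base / "chat" / "messages.jsonl",
        "artifacts_index": artifacts / "index.json",
        "reports": artifacts / "reports",
        "quizzes": artifacts / "quizzes",
        "podcasts": artifacts / "podcasts",
    }
```

Keeping all path construction in one place means the other modules only ever pass `(username, notebook_id)` and never assemble paths by hand.
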

## Storage tree (exact)

```
/data/
└── users/
    └── /
        └── notebooks/
            ├── index.json
            └── /
                ├── files_raw/
                ├── files_extracted/
                ├── sources.json
                ├── chroma/
                ├── chat/
                │   └── messages.jsonl
                └── artifacts/
                    ├── index.json
                    ├── reports/
                    ├── quizzes/
                    └── podcasts/
```

## Request flow

1. **Auth**: Every state-changing handler receives `gr.Request`; `get_username_from_request(request)` returns the username (or `MOCK_USER` / `anonymous`). All paths are under `user_data_dir(username)`.
2. **Notebook**: User selects/creates/renames/deletes notebooks. The current `notebook_id` is kept in state and in a hidden textbox; all source/chat/artifact ops use `(username, notebook_id)`.
3. **Ingestion**: Upload or URL → extract text (pypdf/python-pptx/readability) → chunk (recursive split, overlap) → embed (HF or local) → upsert into Chroma with metadata (source_id, source_name, page_or_slide, enabled). `sources.json` is updated.
4. **RAG**: Query → embed → Chroma query (filter enabled sources) → optional MMR → build context string → LLM with citation instructions → append to `messages.jsonl`. Retrieval and generation times are logged and shown in the UI.
5. **Artifacts**: Report/quiz/podcast use retrieval again (artifact-specific query + extra instruction). **Gemini** generates Markdown (context-only; citations [1], [2] mapped from chunk metadata). The podcast transcript comes from Gemini; TTS (HF or gTTS) produces the .mp3. Entries are appended to `artifacts/index.json`.

## Chroma

- One collection per notebook: name `chunks`.
- Documents stored with metadata: `source_id`, `source_name`, `source_type`, `page_or_slide`, `chunk_index`, `enabled`.
- Chunk IDs: `{source_id}::{chunk_index}`.
- Retrieval is restricted to chunks from enabled sources; supports similarity-only or MMR.

## Configuration (env)

See README and `backend/config.py`: `GEMINI_API_KEY`, `GEMINI_MODEL` (artifacts); `HF_TOKEN`, `HF_LLM_MODEL`, `HF_EMBED_MODEL`, `HF_TTS_MODEL` (chat/embeddings/TTS); `CHUNK_SIZE`, `CHUNK_OVERLAP`, `TOP_K`, `MMR_LAMBDA`, `MOCK_USER`, `DATA_ROOT`.
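
To make the chunk ID scheme and the `CHUNK_SIZE` / `CHUNK_OVERLAP` settings concrete, here is a minimal sketch of how ingestion might turn extracted text into Chroma-ready records. It is not the project's code: it uses a plain fixed-size sliding window instead of the recursive splitter mentioned in the request flow, the defaults `1000` and `200` are placeholders, and `chunk_text` and `build_records` are hypothetical names.

```python
import os

# Assumption: placeholder defaults; the real values live in backend/config.py.
CHUNK_SIZE = int(os.environ.get("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.environ.get("CHUNK_OVERLAP", "200"))


def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Split text into windows of `size` characters sharing `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def build_records(source_id, source_name, source_type, page_or_slide, text,
                  size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Hypothetical helper: IDs, documents, and metadata for one page or slide."""
    ids, documents, metadatas = [], [], []
    for chunk_index, chunk in enumerate(chunk_text(text, size, overlap)):
        ids.append(f"{source_id}::{chunk_index}")  # ID scheme from the Chroma section
        documents.append(chunk)
        metadatas.append({
            "source_id": source_id,
            "source_name": source_name,
            "source_type": source_type,
            "page_or_slide": page_or_slide,
            "chunk_index": chunk_index,
            "enabled": True,  # assumption: new sources start enabled
        })
    return ids, documents, metadatas
```

The three returned lists line up with the `ids`, `documents`, and `metadatas` arguments that a Chroma collection's `upsert` accepts, so re-ingesting the same source overwrites its chunks rather than duplicating them.
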