| # Architecture: NotebookLM Clone |
|
|
| ## Data flow |
|
|
| ``` |
| User (HF OAuth / MOCK_USER) |
| β username |
| β /data/users/<username>/notebooks/ |
| β index.json (list of notebooks) |
| β <notebook-uuid>/ |
| β files_raw/ (uploaded PDF/PPTX/TXT) |
| β files_extracted/ (extracted text JSON per source) |
| β sources.json (source registry: id, filename/url, type, enabled) |
| β chroma/ (ChromaDB persistence) |
| β chat/messages.jsonl (conversation history) |
| β artifacts/ |
| β reports/ |
| β quizzes/ |
| β podcasts/ (transcript_*.md, podcast_*.mp3) |
| β index.json (artifact metadata) |
| ``` |
|
|
| ## Modules |
|
|
| | Module | Responsibility | |
| |--------|----------------| |
| | `backend/config.py` | Env vars, paths, constants. No global mutable state. | |
| | `backend/auth.py` | Derive username from `gr.Request` or `MOCK_USER`. | |
| | `backend/storage.py` | Path helpers for user/notebook dirs and files. | |
| | `backend/notebooks.py` | CRUD: list, create, rename, delete notebooks. | |
| | `backend/ingestion.py` | File/URL ingestion: extract text β chunk β embed β Chroma upsert. Source enable/disable. | |
| | `backend/retriever.py` | Embedding function (HF API or sentence-transformers). Chroma collection. Retrieval: similarity or MMR. | |
| | `backend/rag.py` | Retrieve β build prompt β LLM (HF API or local) β format answer with citations. Timing. Chat only. | |
| | `backend/gemini_client.py` | Gemini API client for artifact generation only (context-only; no API key logged). | |
| | `backend/artifacts.py` | Report, quiz, podcast transcript via Gemini; citations from chunk metadata; TTS for podcast .mp3; persist under artifacts/. | |
| | `backend/tts.py` | TTS: HF Inference API or gTTS fallback. | |
| | `backend/utils.py` | Logging, user_data_dir, read_json/write_json/read_jsonl/append_jsonl, normalize_text. | |
| | `app.py` | Gradio UI: notebooks, sources, chat, citations, artifacts. All handlers take `request` for username. | |
| |
| ## Storage tree (exact) |
| |
| ``` |
| /data/ |
| βββ users/ |
| βββ <username>/ |
| βββ notebooks/ |
| βββ index.json |
| βββ <notebook-uuid>/ |
| βββ files_raw/ |
| βββ files_extracted/ |
| βββ sources.json |
| βββ chroma/ |
| βββ chat/ |
| β βββ messages.jsonl |
| βββ artifacts/ |
| βββ index.json |
| βββ reports/ |
| βββ quizzes/ |
| βββ podcasts/ |
| ``` |
| |
| ## Request flow |
|
|
| 1. **Auth**: Every state-changing handler receives `gr.Request`; `get_username_from_request(request)` returns username (or `MOCK_USER` / `anonymous`). All paths are under `user_data_dir(username)`. |
|
|
| 2. **Notebook**: User selects/creates/renames/deletes notebooks. Current `notebook_id` is kept in state and hidden textbox; all source/chat/artifact ops use `(username, notebook_id)`. |
|
|
| 3. **Ingestion**: Upload or URL β extract text (pypdf/python-pptx/readability) β chunk (recursive split, overlap) β embed (HF or local) β upsert Chroma with metadata (source_id, source_name, page_or_slide, enabled). sources.json updated. |
|
|
| 4. **RAG**: Query β embed β Chroma query (filter enabled sources) β optional MMR β build context string β LLM with citation instructions β append to messages.jsonl. Retrieval and generation times logged and shown in UI. |
|
|
| 5. **Artifacts**: Report/quiz/podcast use retrieval again (artifact-specific query + extra instruction). **Gemini** generates Markdown (context-only; citations [1], [2] mapped from chunk metadata). Podcast transcript from Gemini; TTS (HF or gTTS) for .mp3. Entries appended to artifacts/index.json. |
|
|
| ## Chroma |
|
|
| - One collection per notebook: name `chunks`. |
| - Documents stored with metadata: `source_id`, `source_name`, `source_type`, `page_or_slide`, `chunk_index`, `enabled`. |
| - Chunk IDs: `{source_id}::{chunk_index}`. |
| - Retrieval filters by `enabled` source_id; supports similarity-only or MMR. |
| |
| ## Configuration (env) |
| |
| See README and `backend/config.py`: `GEMINI_API_KEY`, `GEMINI_MODEL` (artifacts); `HF_TOKEN`, `HF_LLM_MODEL`, `HF_EMBED_MODEL`, `HF_TTS_MODEL` (chat/embeddings); `CHUNK_SIZE`, `CHUNK_OVERLAP`, `TOP_K`, `MMR_LAMBDA`, `MOCK_USER`, `DATA_ROOT`. |
| |