# Architecture: NotebookLM Clone

## Data flow

```
User (HF OAuth / MOCK_USER)
    → username
    → /data/users/<username>/notebooks/
        → index.json (list of notebooks)
        → <notebook-uuid>/
            → files_raw/          (uploaded PDF/PPTX/TXT)
            → files_extracted/    (extracted text JSON per source)
            → sources.json        (source registry: id, filename/url, type, enabled)
            → chroma/             (ChromaDB persistence)
            → chat/messages.jsonl (conversation history)
            → artifacts/
                → reports/
                → quizzes/
                → podcasts/       (transcript_*.md, podcast_*.mp3)
                → index.json      (artifact metadata)
```

## Modules

| Module | Responsibility |
|--------|----------------|
| `backend/config.py` | Env vars, paths, constants. No global mutable state. |
| `backend/auth.py` | Derive username from `gr.Request` or `MOCK_USER`. |
| `backend/storage.py` | Path helpers for user/notebook dirs and files. |
| `backend/notebooks.py` | CRUD: list, create, rename, delete notebooks. |
| `backend/ingestion.py` | File/URL ingestion: extract text → chunk → embed → Chroma upsert. Source enable/disable. |
| `backend/retriever.py` | Embedding function (HF API or sentence-transformers). Chroma collection. Retrieval: similarity or MMR. |
| `backend/rag.py` | Retrieve → build prompt → LLM (HF API or local) → format answer with citations. Timing. Chat only. |
| `backend/gemini_client.py` | Gemini API client for artifact generation only (context-only; no API key logged). |
| `backend/artifacts.py` | Report, quiz, podcast transcript via Gemini; citations from chunk metadata; TTS for podcast .mp3; persist under artifacts/. |
| `backend/tts.py` | TTS: HF Inference API or gTTS fallback. |
| `backend/utils.py` | Logging, user_data_dir, read_json/write_json/read_jsonl/append_jsonl, normalize_text. |
| `app.py` | Gradio UI: notebooks, sources, chat, citations, artifacts. All handlers take `request` for username. |
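
The JSONL helpers listed for `backend/utils.py` back the chat history file. A minimal sketch of what `append_jsonl` and `read_jsonl` could look like follows; the signatures and the missing-file behaviour are assumptions, not taken from the project code.

```python
import json
from pathlib import Path
from typing import Any


def append_jsonl(path: Path, record: dict[str, Any]) -> None:
    """Append one record as a single JSON line, creating parent dirs if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list[dict[str, Any]]:
    """Return all records in file order; a missing file yields an empty list (assumed)."""
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8") as f:
        # Skip blank lines so a trailing newline does not break parsing.
        return [json.loads(line) for line in f if line.strip()]
```

Appending one line per message keeps writes cheap and avoids rewriting the whole history on every chat turn.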

## Storage tree (exact)

```
/data/
  └── users/
      └── <username>/
          └── notebooks/
              ├── index.json
              └── <notebook-uuid>/
                  ├── files_raw/
                  ├── files_extracted/
                  ├── sources.json
                  ├── chroma/
                  ├── chat/
                  │   └── messages.jsonl
                  └── artifacts/
                      ├── index.json
                      ├── reports/
                      ├── quizzes/
                      └── podcasts/
```
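
The tree above maps directly onto a few path helpers. The sketch below is illustrative only: `user_data_dir` is named in this document, but its signature here (including the optional `data_root` argument), `notebook_dir`, and `notebook_paths` are assumptions rather than the actual contents of `backend/storage.py`.

```python
import os
from pathlib import Path

# DATA_ROOT is a documented env var; "/data" matches the tree above.
DATA_ROOT = Path(os.environ.get("DATA_ROOT", "/data"))


def user_data_dir(username: str, data_root: Path = DATA_ROOT) -> Path:
    """Root of everything owned by one user."""
    return data_root / "users" / username


def notebook_dir(username: str, notebook_id: str, data_root: Path = DATA_ROOT) -> Path:
    """Directory of a single notebook (hypothetical helper)."""
    return user_data_dir(username, data_root) / "notebooks" / notebook_id


def notebook_paths(username: str, notebook_id: str, data_root: Path = DATA_ROOT) -> dict[str, Path]:
    """Every per-notebook location from the storage tree, keyed by a short name."""
    root = notebook_dir(username, notebook_id, data_root)
    return {
        "files_raw": root / "files_raw",
        "files_extracted": root / "files_extracted",
        "sources": root / "sources.json",
        "chroma": root / "chroma",
        "messages": root / "chat" / "messages.jsonl",
        "artifacts_index": root / "artifacts" / "index.json",
        "reports": root / "artifacts" / "reports",
        "quizzes": root / "artifacts" / "quizzes",
        "podcasts": root / "artifacts" / "podcasts",
    }
```

Deriving every path from `(username, notebook_id)` in one place keeps handlers from building paths by hand.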

## Request flow

1. **Auth**: Every state-changing handler receives `gr.Request`; `get_username_from_request(request)` returns the username (or `MOCK_USER` / `anonymous`). All paths resolve under `user_data_dir(username)`.

2. **Notebook**: The user selects, creates, renames, or deletes notebooks. The current `notebook_id` is kept in state and in a hidden textbox; all source, chat, and artifact operations are keyed by `(username, notebook_id)`.

3. **Ingestion**: Upload or URL → extract text (pypdf/python-pptx/readability) → chunk (recursive split, overlap) → embed (HF or local) → upsert into Chroma with metadata (`source_id`, `source_name`, `page_or_slide`, `enabled`). `sources.json` is updated to match.

4. **RAG**: Query → embed → Chroma query (filtered to enabled sources) → optional MMR → build context string → LLM with citation instructions → append to `messages.jsonl`. Retrieval and generation times are logged and shown in the UI.

5. **Artifacts**: Report, quiz, and podcast generation run retrieval again (artifact-specific query plus an extra instruction). **Gemini** generates Markdown from the retrieved context only; citations `[1]`, `[2]` are mapped from chunk metadata. The podcast transcript also comes from Gemini, and TTS (HF or gTTS) produces the .mp3. Entries are appended to `artifacts/index.json`.
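
Two pure-Python pieces of the flow above can be sketched briefly: chunking with overlap (step 3) and building a numbered context for citations (step 4). This is a simplified illustration: it uses a fixed-size sliding window instead of the recursive split the project describes, and the chunk dict shape (`text`, `metadata`) and default sizes are assumptions.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size chars that overlap by chunk_overlap.

    Simplification: a plain sliding window, not a recursive separator-aware split.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def build_context(chunks: list[dict]) -> str:
    """Number retrieved chunks so the LLM can cite them as [1], [2], ...

    Each chunk is assumed to look like {"text": ..., "metadata": {...}} with the
    metadata keys listed in the ingestion step.
    """
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        meta = chunk["metadata"]
        parts.append(f"[{i}] ({meta['source_name']}, {meta['page_or_slide']}) {chunk['text']}")
    return "\n\n".join(parts)
```

Because the citation number is simply the chunk's position in the retrieved list, the same list can later map `[n]` in the answer back to its source metadata.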

## Chroma

- One collection per notebook, named `chunks`.
- Documents stored with metadata: `source_id`, `source_name`, `source_type`, `page_or_slide`, `chunk_index`, `enabled`.
- Chunk IDs: `{source_id}::{chunk_index}`.
- Retrieval is restricted to chunks from enabled sources and supports similarity-only or MMR ranking.
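
For readers unfamiliar with MMR (maximal marginal relevance), the sketch below shows the selection loop in plain Python. It is not the project's `backend/retriever.py`: the function names are hypothetical, candidates are assumed to be already filtered to enabled sources, and `lambda_mult` follows the common convention where 1.0 means pure relevance, which may differ from how `MMR_LAMBDA` is interpreted in the actual code.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query_vec: list[float],
    candidates: list[tuple[str, list[float]]],
    k: int,
    lambda_mult: float = 0.5,
) -> list[str]:
    """Greedily pick up to k chunk IDs, trading relevance against redundancy."""
    remaining = list(candidates)
    selected: list[tuple[str, list[float]]] = []
    while remaining and len(selected) < k:
        best, best_score = None, -math.inf
        for cid, vec in remaining:
            relevance = cosine(query_vec, vec)
            # Redundancy: similarity to the closest chunk already selected.
            redundancy = max((cosine(vec, s) for _, s in selected), default=0.0)
            score = lambda_mult * relevance - (1.0 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = (cid, vec), score
        selected.append(best)
        remaining.remove(best)
    return [cid for cid, _ in selected]
```

With two near-identical chunks from the same source, a balanced `lambda_mult` tends to pick one of them and then a more distinct chunk, whereas `lambda_mult=1.0` reduces to plain similarity ranking.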

## Configuration (env)

See README and `backend/config.py`: `GEMINI_API_KEY`, `GEMINI_MODEL` (artifacts); `HF_TOKEN`, `HF_LLM_MODEL`, `HF_EMBED_MODEL`, `HF_TTS_MODEL` (chat, embeddings, TTS); `CHUNK_SIZE`, `CHUNK_OVERLAP`, `TOP_K`, `MMR_LAMBDA` (chunking and retrieval); `MOCK_USER`, `DATA_ROOT` (auth and storage).
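
One way to read these variables without global mutable state is a frozen settings object built once from the environment. The sketch below is an assumption about shape only: the `Settings` class, `load_settings`, and every default value are placeholders, not the values used in `backend/config.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Immutable snapshot of the tunable env vars (subset shown)."""
    chunk_size: int
    chunk_overlap: int
    top_k: int
    mmr_lambda: float
    data_root: str
    mock_user: Optional[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Parse settings from env (defaults to os.environ); defaults are placeholders."""
    env = os.environ if env is None else env
    return Settings(
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        top_k=int(env.get("TOP_K", "5")),
        mmr_lambda=float(env.get("MMR_LAMBDA", "0.5")),
        data_root=env.get("DATA_ROOT", "/data"),
        mock_user=env.get("MOCK_USER") or None,  # empty string counts as unset
    )
```

Passing a plain dict as `env` makes the parsing easy to test without touching the real process environment.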