Architecture: NotebookLM Clone

Data flow

User (HF OAuth / MOCK_USER)
    → username
    → /data/users/<username>/notebooks/
        → index.json (list of notebooks)
        → <notebook-uuid>/
            → files_raw/          (uploaded PDF/PPTX/TXT)
            → files_extracted/    (extracted text JSON per source)
            → sources.json        (source registry: id, filename/url, type, enabled)
            → chroma/             (ChromaDB persistence)
            → chat/messages.jsonl (conversation history)
            → artifacts/
                → reports/
                → quizzes/
                → podcasts/       (transcript_*.md, podcast_*.mp3)
                → index.json      (artifact metadata)
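
The first step of the flow above, turning a request into a username, could be sketched as follows. This is an illustration only: the precedence (authenticated user, then MOCK_USER, then "anonymous") is an assumption, and the request is duck-typed through its `username` attribute so the sketch runs without importing Gradio.

```python
import os
from typing import Any, Optional


def get_username_from_request(request: Optional[Any]) -> str:
    """Return the username that scopes all storage paths.

    Assumed precedence (not confirmed by the source):
    1. the authenticated username carried by the request,
    2. the MOCK_USER environment variable,
    3. the literal "anonymous".
    """
    # gr.Request exposes the logged-in user as `.username`; getattr keeps
    # this sketch usable with None or any object that has that attribute.
    username = getattr(request, "username", None)
    if username:
        return str(username)
    return os.environ.get("MOCK_USER") or "anonymous"
```

Everything downstream (notebooks, sources, chat, artifacts) then keys off the returned string, which is why every handler needs the request.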

Modules

| Module | Responsibility |
| --- | --- |
| backend/config.py | Env vars, paths, constants. No global mutable state. |
| backend/auth.py | Derive username from gr.Request or MOCK_USER. |
| backend/storage.py | Path helpers for user/notebook dirs and files. |
| backend/notebooks.py | CRUD: list, create, rename, delete notebooks. |
| backend/ingestion.py | File/URL ingestion: extract text → chunk → embed → Chroma upsert. Source enable/disable. |
| backend/retriever.py | Embedding function (HF API or sentence-transformers). Chroma collection. Retrieval: similarity or MMR. |
| backend/rag.py | Retrieve → build prompt → LLM (HF API or local) → format answer with citations. Timing. Chat only. |
| backend/gemini_client.py | Gemini API client for artifact generation only (context-only; no API key logged). |
| backend/artifacts.py | Report, quiz, podcast transcript via Gemini; citations from chunk metadata; TTS for podcast .mp3; persist under artifacts/. |
| backend/tts.py | TTS: HF Inference API or gTTS fallback. |
| backend/utils.py | Logging, user_data_dir, read_json/write_json/read_jsonl/append_jsonl, normalize_text. |
| app.py | Gradio UI: notebooks, sources, chat, citations, artifacts. All handlers take request for username. |
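
The JSON helpers named for backend/utils.py are small enough to sketch. The function names come from the table above; the signatures, the `default` argument, and the choice to create parent directories on write are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Any, List


def read_json(path: Path, default: Any = None) -> Any:
    """Load a JSON file, returning `default` when the file does not exist."""
    if not path.exists():
        return default
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


def write_json(path: Path, data: Any) -> None:
    """Write `data` as pretty-printed JSON, creating parent directories."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


def append_jsonl(path: Path, record: dict) -> None:
    """Append one record as a single line of JSON (used for chat history)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> List[dict]:
    """Read every non-empty line of a JSONL file into a list of records."""
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Appending to messages.jsonl rather than rewriting a JSON array keeps each chat turn a cheap, append-only write.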

Storage tree (exact)

/data/
  └── users/
      └── <username>/
          └── notebooks/
              ├── index.json
              └── <notebook-uuid>/
                  ├── files_raw/
                  ├── files_extracted/
                  ├── sources.json
                  ├── chroma/
                  ├── chat/
                  │   └── messages.jsonl
                  └── artifacts/
                      ├── index.json
                      ├── reports/
                      ├── quizzes/
                      └── podcasts/
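
Path helpers that produce this tree might look like the sketch below. Only `user_data_dir` and `DATA_ROOT` are named in this document; `notebook_dir`, `chat_log_path`, `ensure_notebook_dirs`, and the `/data` default are hypothetical names and values chosen for illustration.

```python
import os
from pathlib import Path

# Subdirectories created for every notebook, mirroring the tree above.
NOTEBOOK_SUBDIRS = (
    "files_raw",
    "files_extracted",
    "chroma",
    "chat",
    "artifacts/reports",
    "artifacts/quizzes",
    "artifacts/podcasts",
)


def data_root() -> Path:
    # Read at call time so nothing is cached in module-level state.
    # The "/data" default is an assumption based on the tree above.
    return Path(os.environ.get("DATA_ROOT", "/data"))


def user_data_dir(username: str) -> Path:
    return data_root() / "users" / username


def notebook_dir(username: str, notebook_id: str) -> Path:
    # Hypothetical helper: root directory of a single notebook.
    return user_data_dir(username) / "notebooks" / notebook_id


def chat_log_path(username: str, notebook_id: str) -> Path:
    # Hypothetical helper: location of the conversation history.
    return notebook_dir(username, notebook_id) / "chat" / "messages.jsonl"


def ensure_notebook_dirs(username: str, notebook_id: str) -> Path:
    """Create the per-notebook directory skeleton and return its root."""
    root = notebook_dir(username, notebook_id)
    for sub in NOTEBOOK_SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root
```

A real implementation would probably also validate `username` and `notebook_id` before joining them into a path; that is left out here.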

Request flow

  1. Auth: Every state-changing handler receives gr.Request; get_username_from_request(request) returns username (or MOCK_USER / anonymous). All paths are under user_data_dir(username).

  2. Notebook: User selects/creates/renames/deletes notebooks. The current notebook_id is kept in state and in a hidden textbox; all source/chat/artifact ops use (username, notebook_id).

  3. Ingestion: Upload or URL → extract text (pypdf/python-pptx/readability) → chunk (recursive split, overlap) → embed (HF or local) → upsert into Chroma with metadata (source_id, source_name, page_or_slide, enabled). sources.json is then updated.

  4. RAG: Query → embed → Chroma query (filter enabled sources) → optional MMR → build context string → LLM with citation instructions → append to messages.jsonl. Retrieval and generation times are logged and shown in the UI.

  5. Artifacts: Report/quiz/podcast use retrieval again (artifact-specific query + extra instruction). Gemini generates Markdown (context-only; citations [1], [2] mapped from chunk metadata). Podcast transcript from Gemini; TTS (HF or gTTS) for .mp3. Entries appended to artifacts/index.json.
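
Two of the steps above lend themselves to a short sketch: chunking with overlap (step 3) and building a numbered context whose [1], [2] markers map back to chunk metadata (steps 4 and 5). Note that the chunker below is a plain fixed-size sliding window, swapped in for brevity; it is not the recursive splitter the pipeline describes. The function names and the shape of the `chunks` dictionaries are assumptions.

```python
from typing import Dict, List, Tuple


def split_with_overlap(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Fixed-size sliding-window chunking (a simplification of recursive split)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text, so the
        # final chunk is not followed by a redundant overlap-only tail.
        if start + chunk_size >= len(text):
            break
    return chunks


def build_context(chunks: List[Dict]) -> Tuple[str, Dict[int, Dict]]:
    """Number retrieved chunks and record which source each number refers to.

    Each chunk is assumed to look like
    {"text": str, "metadata": {"source_name": str, "page_or_slide": ...}}.
    """
    lines: List[str] = []
    citations: Dict[int, Dict] = {}
    for n, chunk in enumerate(chunks, start=1):
        meta = chunk["metadata"]
        lines.append(f"[{n}] {chunk['text']}")
        citations[n] = {
            "source_name": meta["source_name"],
            "page_or_slide": meta.get("page_or_slide"),
        }
    return "\n\n".join(lines), citations
```

The context string goes into the prompt, and the citation map lets the UI resolve a marker such as [2] in the model's answer to a source name and page or slide.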

Chroma

  • One collection per notebook, named chunks.
  • Documents stored with metadata: source_id, source_name, source_type, page_or_slide, chunk_index, enabled.
  • Chunk IDs: {source_id}::{chunk_index}.
  • Retrieval filters by enabled source_id; supports similarity-only or MMR.
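
The conventions in this list can be illustrated without a running Chroma instance. The sketch below builds a chunk ID, a metadata filter restricted to enabled sources (using Chroma's `$in` operator in a `where` clause), and a small pure-Python MMR re-ranker. Function names and the MMR scoring details are assumptions; the project's own retriever may differ.

```python
import math
from typing import Dict, List, Sequence


def chunk_id(source_id: str, chunk_index: int) -> str:
    """Compose the ID format described above: {source_id}::{chunk_index}."""
    return f"{source_id}::{chunk_index}"


def enabled_where(enabled_source_ids: Sequence[str]) -> Dict:
    """Build a `where` filter matching only chunks from enabled sources."""
    if not enabled_source_ids:
        # With nothing enabled there is nothing to retrieve; callers
        # should skip the query rather than send an empty filter.
        raise ValueError("no enabled sources")
    return {"source_id": {"$in": list(enabled_source_ids)}}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query: Sequence[float], candidates: Sequence[Sequence[float]],
        k: int, lam: float = 0.5) -> List[int]:
    """Return indices of up to k candidates chosen by maximal marginal relevance.

    Each round picks the candidate maximizing
    lam * sim(query, c) - (1 - lam) * max sim(c, already selected).
    """
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam` set to 1.0 this reduces to similarity-only ranking; lower values trade relevance for diversity among the selected chunks.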

Configuration (env)

See README and backend/config.py: GEMINI_API_KEY, GEMINI_MODEL (artifacts); HF_TOKEN, HF_LLM_MODEL, HF_EMBED_MODEL, HF_TTS_MODEL (chat, embeddings, TTS); CHUNK_SIZE, CHUNK_OVERLAP, TOP_K, MMR_LAMBDA, MOCK_USER, DATA_ROOT.
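
One way to read these variables without global mutable state is a frozen dataclass built on demand. This is a sketch, not the contents of backend/config.py: the `Settings` class, `load_settings`, and every default value below are placeholders chosen for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    chunk_size: int
    chunk_overlap: int
    top_k: int
    mmr_lambda: float
    data_root: str
    mock_user: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build an immutable Settings object from environment variables.

    All defaults here are illustrative placeholders, not the project's values.
    """
    env = os.environ if env is None else env
    settings = Settings(
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        top_k=int(env.get("TOP_K", "5")),
        mmr_lambda=float(env.get("MMR_LAMBDA", "0.5")),
        data_root=env.get("DATA_ROOT", "/data"),
        mock_user=env.get("MOCK_USER", ""),
    )
    # An overlap as large as the chunk itself would prevent chunking
    # from ever advancing through the text.
    if settings.chunk_overlap >= settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    return settings
```

Passing a plain dictionary as `env` makes the loader easy to exercise in tests without touching the process environment.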