NotebookLM Clone Design Brief

1. System Overview

This system is a full-stack NotebookLM-style application that supports:

source ingestion (.pdf, .pptx, .txt, web URL)
retrieval-augmented chat with citations
artifact generation (report, quiz, podcast transcript + audio)
strict per-user data isolation with multiple notebooks per user

The stack is optimized for Hugging Face Spaces deployment:

frontend: Streamlit (frontend/app.py)
backend API: FastAPI (app.py)
metadata store: SQLite via SQLAlchemy (data/models.py, data/crud.py)
vector store: ChromaDB per user+notebook (src/ingestion/vectorstore.py)
ingestion/artifact services: src/ingestion/*, src/artifacts/*

2. Architecture Diagram

flowchart TD
    A[Streamlit Frontend] --> B[FastAPI Backend]
    B --> C[Auth Layer<br/>HF OAuth / Dev Auth]
    B --> D[Notebook & Source APIs]
    B --> E[Thread & Chat APIs]
    B --> F[Artifact APIs]

    D --> G[Ingestion Service]
    G --> H[Extractors<br/>PDF/PPTX/TXT/URL]
    G --> I[Chunker]
    G --> J[Embedding Adapter]
    G --> K[ChromaDB]

    E --> K
    E --> L[LLM Client]
    E --> M[Message + Citation Tables]

    F --> L
    F --> N[TTS Adapter<br/>Edge/OpenAI/ElevenLabs]
    F --> O[Artifacts on Disk]

    B --> P[(SQLite DB)]
    B --> Q["/data + uploads Storage"]

3. Component Responsibilities

frontend/app.py
- authentication-aware UI
- notebook switching
- source upload/URL ingestion
- chat interface + citation display
- artifact generation, preview, and downloads
app.py
- route orchestration and auth enforcement
- notebook/source/thread/artifact lifecycle endpoints
- chat orchestration with retrieval + prompting
- background podcast generation
auth/oauth.py, auth/session.py
- HF OAuth code exchange
- secure session bridging to Streamlit
- current-user resolution
src/ingestion/*
- extraction, chunking, embedding, vector upsert/query
src/artifacts/*
- report/quiz/podcast generation and storage
- pluggable TTS providers (edge, openai, elevenlabs)
data/models.py, data/crud.py
- relational schema and ownership-scoped queries
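
The pluggable TTS providers listed under src/artifacts/* could be wired through a small registry keyed by provider name. The sketch below is illustrative only: `TTSProvider`, `register_provider`, `get_provider`, and the stub `SilentTTS` are hypothetical names, and no real Edge, OpenAI, or ElevenLabs calls are made.

```python
from typing import Callable, Dict, Protocol


class TTSProvider(Protocol):
    """Hypothetical interface: turn transcript text into audio bytes."""

    def synthesize(self, text: str) -> bytes: ...


# Maps a provider name (for example "edge") to a zero-argument factory.
_PROVIDERS: Dict[str, Callable[[], TTSProvider]] = {}


def register_provider(name: str, factory: Callable[[], TTSProvider]) -> None:
    """Register a provider factory under a case-insensitive name."""
    _PROVIDERS[name.lower()] = factory


def get_provider(name: str) -> TTSProvider:
    """Instantiate the provider registered under `name`, or raise ValueError."""
    factory = _PROVIDERS.get(name.lower())
    if factory is None:
        known = ", ".join(sorted(_PROVIDERS)) or "none"
        raise ValueError(f"unknown TTS provider {name!r}; registered: {known}")
    return factory()


class SilentTTS:
    """Stand-in provider for tests; returns placeholder bytes, not real audio."""

    def synthesize(self, text: str) -> bytes:
        return b"\x00" * len(text)


register_provider("silent", SilentTTS)
```

With this shape, adding a provider means registering one more factory; the podcast generation code would only depend on `synthesize`.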

4. Data Model and Storage Strategy

Relational entities:

users
notebooks (owner_user_id foreign key)
sources (per notebook)
chat_threads and messages
message_citations (assistant message -> source references)
artifacts (status, metadata, content, file path)
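
The ownership chain implied by these entities can be sketched as SQLite DDL. This is an illustration under assumptions, not the schema in data/models.py: apart from owner_user_id, every column name below is hypothetical, and the repo defines its tables through SQLAlchemy rather than raw `sqlite3`.

```python
import sqlite3

# Hypothetical columns; only the table names and owner_user_id come from the brief.
SCHEMA = """
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    username TEXT NOT NULL UNIQUE
);
CREATE TABLE notebooks (
    id INTEGER PRIMARY KEY,
    owner_user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    title TEXT NOT NULL
);
CREATE TABLE sources (
    id INTEGER PRIMARY KEY,
    notebook_id INTEGER NOT NULL REFERENCES notebooks(id) ON DELETE CASCADE,
    name TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'processing'
);
CREATE TABLE chat_threads (
    id INTEGER PRIMARY KEY,
    notebook_id INTEGER NOT NULL REFERENCES notebooks(id) ON DELETE CASCADE
);
CREATE TABLE messages (
    id INTEGER PRIMARY KEY,
    thread_id INTEGER NOT NULL REFERENCES chat_threads(id) ON DELETE CASCADE,
    role TEXT NOT NULL,
    content TEXT NOT NULL
);
CREATE TABLE message_citations (
    id INTEGER PRIMARY KEY,
    message_id INTEGER NOT NULL REFERENCES messages(id) ON DELETE CASCADE,
    source_id INTEGER NOT NULL REFERENCES sources(id) ON DELETE CASCADE,
    snippet TEXT
);
CREATE TABLE artifacts (
    id INTEGER PRIMARY KEY,
    notebook_id INTEGER NOT NULL REFERENCES notebooks(id) ON DELETE CASCADE,
    kind TEXT NOT NULL,
    status TEXT NOT NULL,
    metadata TEXT,
    content TEXT,
    file_path TEXT
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the sketch schema with foreign-key enforcement enabled."""
    # SQLite ignores foreign keys unless this pragma is set per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
```

Because every table hangs off notebooks, and notebooks hang off users, deleting a user or notebook cascades to its sources, threads, messages, citations, and artifact rows.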

Filesystem layout:

<STORAGE_BASE_DIR>/users/<user_id>/notebooks/<notebook_id>/
  files_raw/
  files_extracted/
  chroma/
  artifacts/reports/
  artifacts/quizzes/
  artifacts/podcasts/
uploads/notebook_<notebook_id>/
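
The per-notebook tree above can be built with a couple of `pathlib` helpers. This is a minimal sketch: `notebook_root` and `ensure_notebook_dirs` are hypothetical names, and integer IDs are an assumption (coercing to `int` keeps path separators out of the directory names).

```python
from pathlib import Path
from typing import Union

# Subdirectories taken from the layout in this brief.
NOTEBOOK_SUBDIRS = (
    "files_raw",
    "files_extracted",
    "chroma",
    "artifacts/reports",
    "artifacts/quizzes",
    "artifacts/podcasts",
)


def notebook_root(storage_base_dir: Union[str, Path], user_id: int, notebook_id: int) -> Path:
    """Return <STORAGE_BASE_DIR>/users/<user_id>/notebooks/<notebook_id>."""
    # int() rejects values such as "../x" before they reach the filesystem.
    return (
        Path(storage_base_dir)
        / "users" / str(int(user_id))
        / "notebooks" / str(int(notebook_id))
    )


def ensure_notebook_dirs(storage_base_dir: Union[str, Path], user_id: int, notebook_id: int) -> Path:
    """Create the notebook directory tree if missing and return its root."""
    root = notebook_root(storage_base_dir, user_id, notebook_id)
    for sub in NOTEBOOK_SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root
```

Deleting a notebook then reduces to removing one directory, which is what makes the layout convenient for cleanup.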

Design rationale:

SQLite keeps operational complexity low for MVP.
A separate Chroma store per notebook enables practical retrieval for RAG with low infrastructure overhead.
Disk layout mirrors ownership boundaries for simple cleanup and auditability.

5. End-to-End Flow

5.1 Ingestion

User uploads file or submits URL from Streamlit.
Backend verifies notebook ownership and validates URL safety (if URL).
Source record is created with status "processing".
Ingestion service extracts text, chunks, embeds, and upserts into Chroma.
Source status transitions to ready or failed.
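
The ingestion steps above can be sketched as a single function that drives the status transitions. Everything here is an assumption for illustration: the character-window chunker, its default sizes, and the `extract`, `embed`, and `upsert` callables are hypothetical stand-ins for the real extractors, embedding adapter, and Chroma upsert.

```python
from typing import Any, Callable, Dict, List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def ingest_source(
    source: Dict[str, Any],
    extract: Callable[[Dict[str, Any]], str],
    embed: Callable[[List[str]], List[List[float]]],
    upsert: Callable[[List[str], List[List[float]]], None],
) -> Dict[str, Any]:
    """Run extract -> chunk -> embed -> upsert and record the final status."""
    source["status"] = "processing"
    try:
        chunks = chunk_text(extract(source))
        if not chunks:
            raise ValueError("no text extracted")
        upsert(chunks, embed(chunks))
        source["status"] = "ready"
    except Exception as exc:  # any failure marks the source as failed
        source["status"] = "failed"
        source["error"] = str(exc)
    return source
```

Catching every exception at this boundary is a deliberate choice in the sketch so that a source never stays stuck in the processing state.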

5.2 Retrieval + Chat

User sends a message in a notebook thread.
Backend checks notebook/thread ownership.
Query embedding is computed and top-k chunks are retrieved from the notebook's Chroma store.
Prompt is assembled with conversation history and retrieved context.
LLM generates an answer.
Assistant message and structured citations are persisted.
UI shows answer and citations; citations remain available on subsequent reloads.
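
The retrieval and prompt-assembly steps can be illustrated without any external services. In this sketch a brute-force cosine ranking stands in for the Chroma query, and the chunk dictionary keys (`source_id`, `text`, `embedding`), the prompt wording, and the function names are all hypothetical.

```python
import math
from typing import Any, Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float], chunks: List[Dict[str, Any]], k: int = 3) -> List[Dict[str, Any]]:
    """Return the k chunks most similar to the query embedding."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return ranked[:k]


def build_prompt(question: str, history: List[Tuple[str, str]], retrieved: List[Dict[str, Any]]) -> str:
    """Assemble numbered context, prior turns, and the new question."""
    context = "\n".join(
        f"[{i}] ({c['source_id']}) {c['text']}" for i, c in enumerate(retrieved, 1)
    )
    turns = "\n".join(f"{role}: {content}" for role, content in history)
    return (
        "Answer using only the numbered context and cite it as [n].\n\n"
        f"Context:\n{context}\n\n"
        f"Conversation:\n{turns}\n\n"
        f"User: {question}\nAssistant:"
    )
```

Numbering the context entries gives the model stable markers to cite, and the same indices can be mapped back to source IDs when citations are persisted.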

6. Security Plan

Authentication and identity:

AUTH_MODE=hf_oauth for production deployments.
Session-based current-user identity with signed bridge tokens.
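
A signed bridge token can be as simple as an HMAC over a short-lived payload. The sketch below is one possible format, not necessarily the one used in auth/session.py: the payload fields (`uid`, `exp`), the 60-second lifetime, and both function names are assumptions, and the secret would presumably come from APP_SESSION_SECRET.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii")


def sign_bridge_token(user_id: str, secret: str, ttl_seconds: int = 60,
                      now: Optional[float] = None) -> str:
    """Return '<payload>.<signature>' with an expiry baked into the payload."""
    issued = time.time() if now is None else now
    payload = json.dumps(
        {"uid": user_id, "exp": issued + ttl_seconds}, separators=(",", ":")
    ).encode("utf-8")
    sig = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).digest()
    return f"{_b64(payload)}.{_b64(sig)}"


def verify_bridge_token(token: str, secret: str,
                        now: Optional[float] = None) -> Optional[str]:
    """Return the user id if the token is authentic and unexpired, else None."""
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return None
    expected = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):  # constant-time comparison
        return None
    data = json.loads(payload)  # safe to parse only after the signature check
    current = time.time() if now is None else now
    if current >= data["exp"]:
        return None
    return data["uid"]
```

The signature is verified before the payload is parsed, so untrusted JSON is never interpreted, and the short expiry limits how long a leaked token stays useful.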

User isolation:

all notebook/thread/source/artifact endpoints verify ownership (owner_user_id)
retrieval path binds queries to current user and notebook
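
The ownership rule can be expressed as a lookup that every endpoint goes through before touching notebook data. This is a sketch under assumptions: the in-memory dictionary stands in for the ownership-scoped queries in data/crud.py, and `Notebook`, `NotFoundError`, `get_owned_notebook`, and `retrieval_scope` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Notebook:
    id: int
    owner_user_id: int
    title: str


class NotFoundError(Exception):
    """Raised for both missing and foreign notebooks."""


def get_owned_notebook(notebooks: Dict[int, Notebook], notebook_id: int,
                       current_user_id: int) -> Notebook:
    """Return the notebook only if it exists and belongs to the current user."""
    notebook = notebooks.get(notebook_id)
    # Same error in both cases, so callers cannot probe for other users' IDs.
    if notebook is None or notebook.owner_user_id != current_user_id:
        raise NotFoundError(f"notebook {notebook_id} not found")
    return notebook


def retrieval_scope(notebooks: Dict[int, Notebook], notebook_id: int,
                    current_user_id: int) -> Dict[str, int]:
    """Build the user+notebook scope for a vector query after the ownership check."""
    notebook = get_owned_notebook(notebooks, notebook_id, current_user_id)
    return {"user_id": notebook.owner_user_id, "notebook_id": notebook.id}
```

Deriving the retrieval scope from the checked notebook, rather than from request parameters, means a query cannot be pointed at another user's vector store.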

Path/data protection:

upload filenames are sanitized and constrained to notebook upload roots
deletion is bounded to expected storage roots to prevent unsafe recursive deletes
URL ingestion blocks local/private network targets (SSRF mitigation)
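
The three protections above can be sketched with the standard library alone. The helper names, the filename whitelist, and the injectable `resolve` parameter are assumptions made for illustration; the redirect and content-size checks a real URL fetcher also needs are not shown.

```python
import ipaddress
import re
import socket
from pathlib import Path
from urllib.parse import urlparse


def sanitize_filename(name: str) -> str:
    """Keep only the final path component and a conservative character set."""
    base = name.replace("\\", "/").split("/")[-1]
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", base).lstrip(".")
    return cleaned or "upload"


def resolve_inside(root: Path, filename: str) -> Path:
    """Return a path for filename that is guaranteed to sit under root."""
    root = root.resolve()
    candidate = (root / sanitize_filename(filename)).resolve()
    if root not in candidate.parents:
        raise ValueError("path escapes the storage root")
    return candidate


def is_blocked_ip(ip: str) -> bool:
    """True for loopback, private, link-local, and other non-public ranges."""
    addr = ipaddress.ip_address(ip)
    return (addr.is_private or addr.is_loopback or addr.is_link_local
            or addr.is_multicast or addr.is_reserved or addr.is_unspecified)


def check_url(url: str, resolve=socket.getaddrinfo) -> None:
    """Raise ValueError unless the URL is http(s) and resolves only to public IPs."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError("only http(s) URLs with a host are allowed")
    for info in resolve(parsed.hostname, None):
        # info[4] is the socket address tuple; its first item is the IP string.
        if is_blocked_ip(info[4][0]):
            raise ValueError("URL resolves to a blocked address")
```

The same `resolve_inside` check can guard deletion: resolving the target first and confirming it sits under the expected root is one way to keep a recursive delete from following a crafted path.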

Operational controls:

environment-based secrets (APP_SESSION_SECRET, API keys)
CI test gate before deploy

7. Milestone Plan

MVP (Milestone 1)

auth + sessions
notebook CRUD + isolation checks
ingestion for PDF/PPTX/TXT/URL
notebook-scoped RAG chat with citations

Milestone 2

artifact generation endpoints (report/quiz/podcast)
transcript/audio persistence and frontend playback/download
improved chat UX and citation persistence in history

Milestone 3 (Extensions)

compare retrieval techniques (baseline semantic vs hybrid/rerank)
latency/quality benchmarking and report
stronger observability and error analytics

8. Key Risks and Mitigations

LLM/API cost volatility
- mitigate with model selection defaults, request limits, caching opportunities
HF /data ephemerality on free tier
- document tradeoff; optional HF dataset persistence extension
retrieval quality drift across document types
- tune chunking and top-k; evaluate reranking/hybrid methods
URL ingestion abuse
- strict scheme/host/IP/redirect/content-size checks
dependency/runtime mismatch
- CI tests and pinned dependency strategy where practical

9. Specifications and References in Repo

ingestion spec: docs/INGESTION_SPEC.md
architecture spec: docs/STREAMLIT_ARCHITECTURE_SPEC.md
integration notes: INTEGRATION.md
schema docs: ER_DIAGRAM.md, DATABASE_SCHEMA.md

This brief is intended for export to PDF as the 2-4 page design deliverable.