Spaces:
Sleeping
Sleeping
github-actions[bot] commited on
Commit ·
dba1a8e
1
Parent(s): 6e9aef8
Sync from GitHub e2e802be5157aa05d1251459f529eb7eb4242ef2
Browse files- DATABASE_SCHEMA.md +69 -139
- ER_DIAGRAM.md +44 -41
- app.py +91 -14
- auth/session.py +18 -4
- data/crud.py +28 -0
- docs/DESIGN_BRIEF.md +164 -0
- frontend/app.py +4 -0
- requirements.txt +1 -0
- tests/test_chat_citations.py +119 -0
- tests/test_notebook_management_api.py +46 -1
DATABASE_SCHEMA.md
CHANGED
|
@@ -1,161 +1,91 @@
|
|
| 1 |
# Database Schema
|
| 2 |
|
| 3 |
-
This document
|
| 4 |
|
| 5 |
## Engine and Initialization
|
| 6 |
- ORM: SQLAlchemy 2.x
|
| 7 |
- Base class: `data.db.Base`
|
| 8 |
-
- Default
|
| 9 |
-
-
|
| 10 |
-
|
| 11 |
-
##
|
| 12 |
-
- `users` 1:N `
|
| 13 |
-
- `
|
| 14 |
-
- `
|
| 15 |
-
- `
|
| 16 |
-
- `
|
|
|
|
|
|
|
| 17 |
|
| 18 |
## Tables
|
| 19 |
|
| 20 |
### `users`
|
| 21 |
-
Stores app users.
|
| 22 |
-
|
| 23 |
-
Columns:
|
| 24 |
-
- `id` INTEGER, PK
|
| 25 |
-
- `email` VARCHAR(255), nullable, UNIQUE, indexed
|
| 26 |
-
- `display_name` VARCHAR(255), nullable
|
| 27 |
-
- `avatar_url` VARCHAR(1024), nullable
|
| 28 |
-
- `is_active` BOOLEAN, NOT NULL, default `true`
|
| 29 |
-
- `created_at` DATETIME(timezone=True), NOT NULL, default `now()`
|
| 30 |
-
- `updated_at` DATETIME(timezone=True), NOT NULL, default `now()`, auto-updated on row update
|
| 31 |
-
|
| 32 |
-
Relationships:
|
| 33 |
-
- One-to-many with `oauth_accounts`
|
| 34 |
-
- One-to-many with `documents`
|
| 35 |
-
- One-to-many with `conversations`
|
| 36 |
-
|
| 37 |
-
Indexes and constraints:
|
| 38 |
-
- PK: `id`
|
| 39 |
-
- UNIQUE: `email`
|
| 40 |
-
- INDEX: `email` (implicit from `index=True`)
|
| 41 |
-
|
| 42 |
-
---
|
| 43 |
-
|
| 44 |
-
### `oauth_accounts`
|
| 45 |
-
OAuth provider identities linked to users (supports Hugging Face via `provider='huggingface'`).
|
| 46 |
-
|
| 47 |
Columns:
|
| 48 |
-
- `id` INTEGER
|
| 49 |
-
- `
|
| 50 |
-
- `
|
| 51 |
-
- `
|
| 52 |
-
- `
|
| 53 |
-
- `access_token` TEXT, nullable
|
| 54 |
-
- `refresh_token` TEXT, nullable
|
| 55 |
-
- `token_type` VARCHAR(50), nullable
|
| 56 |
-
- `scope` TEXT, nullable
|
| 57 |
-
- `expires_at` DATETIME(timezone=True), nullable
|
| 58 |
-
- `created_at` DATETIME(timezone=True), NOT NULL, default `now()`
|
| 59 |
-
- `updated_at` DATETIME(timezone=True), NOT NULL, default `now()`, auto-updated on row update
|
| 60 |
-
|
| 61 |
-
Relationships:
|
| 62 |
-
- Many-to-one with `users`
|
| 63 |
-
|
| 64 |
-
Indexes and constraints:
|
| 65 |
-
- PK: `id`
|
| 66 |
-
- UNIQUE: (`provider`, `provider_user_id`) as `uq_provider_user`
|
| 67 |
-
- INDEX: (`user_id`, `provider`) as `ix_oauth_user_provider`
|
| 68 |
-
- INDEX: `provider` (implicit)
|
| 69 |
-
- INDEX: `provider_user_id` (implicit)
|
| 70 |
-
|
| 71 |
-
---
|
| 72 |
-
|
| 73 |
-
### `documents`
|
| 74 |
-
Uploaded/ingested source documents owned by users.
|
| 75 |
|
|
|
|
| 76 |
Columns:
|
| 77 |
-
- `id` INTEGER
|
| 78 |
-
- `
|
| 79 |
-
- `title` VARCHAR(255)
|
| 80 |
-
- `
|
| 81 |
-
- `
|
| 82 |
-
- `storage_path` VARCHAR(1024), nullable
|
| 83 |
-
- `summary` TEXT, nullable
|
| 84 |
-
- `created_at` DATETIME(timezone=True), NOT NULL, default `now()`
|
| 85 |
-
- `updated_at` DATETIME(timezone=True), NOT NULL, default `now()`, auto-updated on row update
|
| 86 |
-
|
| 87 |
-
Relationships:
|
| 88 |
-
- Many-to-one with `users`
|
| 89 |
-
- One-to-many with `chunks`
|
| 90 |
-
|
| 91 |
-
Indexes and constraints:
|
| 92 |
-
- PK: `id`
|
| 93 |
-
- INDEX: (`user_id`, `created_at`) as `ix_documents_user_created`
|
| 94 |
-
|
| 95 |
-
---
|
| 96 |
-
|
| 97 |
-
### `chunks`
|
| 98 |
-
Document chunks for retrieval and embedding linkage.
|
| 99 |
|
|
|
|
| 100 |
Columns:
|
| 101 |
-
- `id` INTEGER
|
| 102 |
-
- `
|
| 103 |
-
- `
|
| 104 |
-
- `
|
| 105 |
-
- `
|
| 106 |
-
- `
|
| 107 |
-
- `
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
Indexes and constraints:
|
| 113 |
-
- PK: `id`
|
| 114 |
-
- UNIQUE: (`document_id`, `chunk_index`) as `uq_document_chunk_index`
|
| 115 |
-
- INDEX: (`document_id`, `chunk_index`) as `ix_chunks_document_index`
|
| 116 |
-
- INDEX: `embedding_id` (implicit)
|
| 117 |
-
|
| 118 |
-
---
|
| 119 |
-
|
| 120 |
-
### `conversations`
|
| 121 |
-
User chat sessions.
|
| 122 |
-
|
| 123 |
Columns:
|
| 124 |
-
- `id` INTEGER
|
| 125 |
-
- `
|
| 126 |
-
- `title` VARCHAR(255)
|
| 127 |
-
- `created_at` DATETIME(timezone=True)
|
| 128 |
-
- `updated_at` DATETIME(timezone=True), NOT NULL, default `now()`, auto-updated on row update
|
| 129 |
-
|
| 130 |
-
Relationships:
|
| 131 |
-
- Many-to-one with `users`
|
| 132 |
-
- One-to-many with `messages`
|
| 133 |
-
|
| 134 |
-
Indexes and constraints:
|
| 135 |
-
- PK: `id`
|
| 136 |
-
- INDEX: (`user_id`, `created_at`) as `ix_conversations_user_created`
|
| 137 |
-
|
| 138 |
-
---
|
| 139 |
|
| 140 |
### `messages`
|
| 141 |
-
Conversation messages, including optional citation payload.
|
| 142 |
-
|
| 143 |
Columns:
|
| 144 |
-
- `id` INTEGER
|
| 145 |
-
- `
|
| 146 |
-
- `role` VARCHAR(20)
|
| 147 |
-
- `content` TEXT
|
| 148 |
-
- `
|
| 149 |
-
- `created_at` DATETIME(timezone=True), NOT NULL, default `now()`
|
| 150 |
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
|
| 155 |
-
-
|
| 156 |
-
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 157 |
|
| 158 |
## Notes
|
| 159 |
-
-
|
| 160 |
-
-
|
| 161 |
-
-
|
|
|
|
| 1 |
# Database Schema
|
| 2 |
|
| 3 |
+
This document reflects the active SQLAlchemy models in `data/models.py`.
|
| 4 |
|
| 5 |
## Engine and Initialization
|
| 6 |
- ORM: SQLAlchemy 2.x
|
| 7 |
- Base class: `data.db.Base`
|
| 8 |
+
- Default DB: `sqlite:///./notebooklm.db`
|
| 9 |
+
- Initialization: `data.db.init_db()`
|
| 10 |
+
|
| 11 |
+
## Relationship Overview
|
| 12 |
+
- `users` 1:N `notebooks`
|
| 13 |
+
- `notebooks` 1:N `sources`
|
| 14 |
+
- `notebooks` 1:N `chat_threads`
|
| 15 |
+
- `chat_threads` 1:N `messages`
|
| 16 |
+
- `messages` 1:N `message_citations`
|
| 17 |
+
- `sources` 1:N `message_citations`
|
| 18 |
+
- `notebooks` 1:N `artifacts`
|
| 19 |
|
| 20 |
## Tables
|
| 21 |
|
| 22 |
### `users`
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 23 |
Columns:
|
| 24 |
+
- `id` INTEGER PK
|
| 25 |
+
- `email` VARCHAR(255) NOT NULL UNIQUE INDEX
|
| 26 |
+
- `display_name` VARCHAR(255) NULL
|
| 27 |
+
- `avatar_url` VARCHAR(1024) NULL
|
| 28 |
+
- `created_at` DATETIME(timezone=True) NOT NULL
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 29 |
|
| 30 |
+
### `notebooks`
|
| 31 |
Columns:
|
| 32 |
+
- `id` INTEGER PK
|
| 33 |
+
- `owner_user_id` INTEGER NOT NULL FK -> `users.id` ON DELETE CASCADE INDEX
|
| 34 |
+
- `title` VARCHAR(255) NOT NULL
|
| 35 |
+
- `created_at` DATETIME(timezone=True) NOT NULL
|
| 36 |
+
- `updated_at` DATETIME(timezone=True) NOT NULL
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 37 |
|
| 38 |
+
### `sources`
|
| 39 |
Columns:
|
| 40 |
+
- `id` INTEGER PK
|
| 41 |
+
- `notebook_id` INTEGER NOT NULL FK -> `notebooks.id` ON DELETE CASCADE INDEX
|
| 42 |
+
- `type` VARCHAR(50) NOT NULL
|
| 43 |
+
- `title` VARCHAR(255) NULL
|
| 44 |
+
- `original_name` VARCHAR(1024) NULL
|
| 45 |
+
- `url` VARCHAR(2048) NULL
|
| 46 |
+
- `storage_path` VARCHAR(1024) NULL
|
| 47 |
+
- `status` VARCHAR(50) NOT NULL
|
| 48 |
+
- `ingested_at` DATETIME(timezone=True) NULL
|
| 49 |
+
|
| 50 |
+
### `chat_threads`
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 51 |
Columns:
|
| 52 |
+
- `id` INTEGER PK
|
| 53 |
+
- `notebook_id` INTEGER NOT NULL FK -> `notebooks.id` ON DELETE CASCADE INDEX
|
| 54 |
+
- `title` VARCHAR(255) NULL
|
| 55 |
+
- `created_at` DATETIME(timezone=True) NOT NULL
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 56 |
|
| 57 |
### `messages`
|
|
|
|
|
|
|
| 58 |
Columns:
|
| 59 |
+
- `id` INTEGER PK
|
| 60 |
+
- `thread_id` INTEGER NOT NULL FK -> `chat_threads.id` ON DELETE CASCADE INDEX
|
| 61 |
+
- `role` VARCHAR(20) NOT NULL
|
| 62 |
+
- `content` TEXT NOT NULL
|
| 63 |
+
- `created_at` DATETIME(timezone=True) NOT NULL
|
|
|
|
| 64 |
|
| 65 |
+
### `message_citations`
|
| 66 |
+
Columns:
|
| 67 |
+
- `id` INTEGER PK
|
| 68 |
+
- `message_id` INTEGER NOT NULL FK -> `messages.id` ON DELETE CASCADE INDEX
|
| 69 |
+
- `source_id` INTEGER NOT NULL FK -> `sources.id` ON DELETE CASCADE INDEX
|
| 70 |
+
- `chunk_ref` VARCHAR(255) NULL
|
| 71 |
+
- `quote` TEXT NULL
|
| 72 |
+
- `score` FLOAT NULL
|
| 73 |
+
|
| 74 |
+
### `artifacts`
|
| 75 |
+
Columns:
|
| 76 |
+
- `id` INTEGER PK
|
| 77 |
+
- `notebook_id` INTEGER NOT NULL FK -> `notebooks.id` ON DELETE CASCADE INDEX
|
| 78 |
+
- `type` VARCHAR(50) NOT NULL
|
| 79 |
+
- `title` VARCHAR(255) NULL
|
| 80 |
+
- `status` VARCHAR(50) NOT NULL
|
| 81 |
+
- `file_path` VARCHAR(1024) NULL
|
| 82 |
+
- `metadata` JSON NULL (mapped as `artifact_metadata`)
|
| 83 |
+
- `content` TEXT NULL
|
| 84 |
+
- `error_message` TEXT NULL
|
| 85 |
+
- `created_at` DATETIME(timezone=True) NOT NULL
|
| 86 |
+
- `generated_at` DATETIME(timezone=True) NULL
|
| 87 |
|
| 88 |
## Notes
|
| 89 |
+
- Ownership and isolation are anchored by `notebooks.owner_user_id`.
|
| 90 |
+
- Child records are deleted via `ON DELETE CASCADE`.
|
| 91 |
+
- Schema creation is currently handled with `Base.metadata.create_all(...)` (no Alembic yet).
|
ER_DIAGRAM.md
CHANGED
|
@@ -2,79 +2,82 @@
|
|
| 2 |
|
| 3 |
```mermaid
|
| 4 |
erDiagram
|
| 5 |
-
users ||--o{
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
|
|
|
|
|
|
| 10 |
|
| 11 |
users {
|
| 12 |
int id PK
|
| 13 |
string email UK
|
| 14 |
string display_name
|
| 15 |
string avatar_url
|
| 16 |
-
boolean is_active
|
| 17 |
datetime created_at
|
| 18 |
-
datetime updated_at
|
| 19 |
}
|
| 20 |
|
| 21 |
-
|
| 22 |
int id PK
|
| 23 |
-
int
|
| 24 |
-
string
|
| 25 |
-
string provider_user_id
|
| 26 |
-
string username
|
| 27 |
-
text access_token
|
| 28 |
-
text refresh_token
|
| 29 |
-
string token_type
|
| 30 |
-
text scope
|
| 31 |
-
datetime expires_at
|
| 32 |
datetime created_at
|
| 33 |
datetime updated_at
|
| 34 |
}
|
| 35 |
|
| 36 |
-
|
| 37 |
int id PK
|
| 38 |
-
int
|
|
|
|
| 39 |
string title
|
| 40 |
-
string
|
| 41 |
-
string
|
| 42 |
string storage_path
|
| 43 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 44 |
datetime created_at
|
| 45 |
-
datetime updated_at
|
| 46 |
}
|
| 47 |
|
| 48 |
-
|
| 49 |
int id PK
|
| 50 |
-
int
|
| 51 |
-
|
| 52 |
text content
|
| 53 |
-
int token_count
|
| 54 |
-
string embedding_id
|
| 55 |
datetime created_at
|
| 56 |
}
|
| 57 |
|
| 58 |
-
|
| 59 |
int id PK
|
| 60 |
-
int
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
|
|
|
| 64 |
}
|
| 65 |
|
| 66 |
-
|
| 67 |
int id PK
|
| 68 |
-
int
|
| 69 |
-
string
|
|
|
|
|
|
|
|
|
|
|
|
|
| 70 |
text content
|
| 71 |
-
|
| 72 |
datetime created_at
|
|
|
|
| 73 |
}
|
| 74 |
```
|
| 75 |
|
| 76 |
## Notes
|
| 77 |
-
-
|
| 78 |
-
-
|
| 79 |
-
|
| 80 |
-
- `uq_document_chunk_index` on (`document_id`, `chunk_index`)
|
|
|
|
| 2 |
|
| 3 |
```mermaid
|
| 4 |
erDiagram
|
| 5 |
+
users ||--o{ notebooks : owns
|
| 6 |
+
notebooks ||--o{ sources : contains
|
| 7 |
+
notebooks ||--o{ chat_threads : has
|
| 8 |
+
chat_threads ||--o{ messages : contains
|
| 9 |
+
messages ||--o{ message_citations : has
|
| 10 |
+
sources ||--o{ message_citations : cited_by
|
| 11 |
+
notebooks ||--o{ artifacts : generates
|
| 12 |
|
| 13 |
users {
|
| 14 |
int id PK
|
| 15 |
string email UK
|
| 16 |
string display_name
|
| 17 |
string avatar_url
|
|
|
|
| 18 |
datetime created_at
|
|
|
|
| 19 |
}
|
| 20 |
|
| 21 |
+
notebooks {
|
| 22 |
int id PK
|
| 23 |
+
int owner_user_id FK
|
| 24 |
+
string title
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 25 |
datetime created_at
|
| 26 |
datetime updated_at
|
| 27 |
}
|
| 28 |
|
| 29 |
+
sources {
|
| 30 |
int id PK
|
| 31 |
+
int notebook_id FK
|
| 32 |
+
string type
|
| 33 |
string title
|
| 34 |
+
string original_name
|
| 35 |
+
string url
|
| 36 |
string storage_path
|
| 37 |
+
string status
|
| 38 |
+
datetime ingested_at
|
| 39 |
+
}
|
| 40 |
+
|
| 41 |
+
chat_threads {
|
| 42 |
+
int id PK
|
| 43 |
+
int notebook_id FK
|
| 44 |
+
string title
|
| 45 |
datetime created_at
|
|
|
|
| 46 |
}
|
| 47 |
|
| 48 |
+
messages {
|
| 49 |
int id PK
|
| 50 |
+
int thread_id FK
|
| 51 |
+
string role
|
| 52 |
text content
|
|
|
|
|
|
|
| 53 |
datetime created_at
|
| 54 |
}
|
| 55 |
|
| 56 |
+
message_citations {
|
| 57 |
int id PK
|
| 58 |
+
int message_id FK
|
| 59 |
+
int source_id FK
|
| 60 |
+
string chunk_ref
|
| 61 |
+
text quote
|
| 62 |
+
float score
|
| 63 |
}
|
| 64 |
|
| 65 |
+
artifacts {
|
| 66 |
int id PK
|
| 67 |
+
int notebook_id FK
|
| 68 |
+
string type
|
| 69 |
+
string title
|
| 70 |
+
string status
|
| 71 |
+
string file_path
|
| 72 |
+
json metadata
|
| 73 |
text content
|
| 74 |
+
text error_message
|
| 75 |
datetime created_at
|
| 76 |
+
datetime generated_at
|
| 77 |
}
|
| 78 |
```
|
| 79 |
|
| 80 |
## Notes
|
| 81 |
+
- User isolation is enforced through ownership on `notebooks.owner_user_id`.
|
| 82 |
+
- Thread, source, citation, and artifact records are notebook-scoped.
|
| 83 |
+
- Artifact metadata is stored in JSON (`artifacts.metadata`).
|
|
|
app.py
CHANGED
|
@@ -2,9 +2,12 @@ from __future__ import annotations
|
|
| 2 |
|
| 3 |
from contextlib import asynccontextmanager
|
| 4 |
import os
|
|
|
|
|
|
|
| 5 |
from datetime import datetime, timezone
|
| 6 |
from pathlib import Path
|
| 7 |
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
|
|
|
|
| 8 |
|
| 9 |
from fastapi.concurrency import run_in_threadpool
|
| 10 |
from fastapi import APIRouter, BackgroundTasks, Depends, FastAPI, File, Form, HTTPException, Request, UploadFile, status
|
|
@@ -101,14 +104,6 @@ class ThreadResponse(BaseModel):
|
|
| 101 |
created_at: datetime
|
| 102 |
|
| 103 |
|
| 104 |
-
class MessageResponse(BaseModel):
|
| 105 |
-
id: int
|
| 106 |
-
thread_id: int
|
| 107 |
-
role: str
|
| 108 |
-
content: str
|
| 109 |
-
created_at: datetime
|
| 110 |
-
|
| 111 |
-
|
| 112 |
class CitationResponse(BaseModel):
|
| 113 |
source_title: str | None = None
|
| 114 |
source_id: int
|
|
@@ -117,6 +112,15 @@ class CitationResponse(BaseModel):
|
|
| 117 |
score: float | None = None
|
| 118 |
|
| 119 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 120 |
class ChatRequest(BaseModel):
|
| 121 |
question: str = Field(min_length=1)
|
| 122 |
top_k: int = Field(default=5, ge=1, le=12)
|
|
@@ -191,6 +195,9 @@ class ArtifactResponse(BaseModel):
|
|
| 191 |
|
| 192 |
MAX_HISTORY_MESSAGES = 8
|
| 193 |
MAX_HISTORY_CHARS_PER_MESSAGE = 1000
|
|
|
|
|
|
|
|
|
|
| 194 |
|
| 195 |
|
| 196 |
def _build_conversation_history(
|
|
@@ -246,6 +253,53 @@ def _append_query_param(url: str, key: str, value: str) -> str:
|
|
| 246 |
return urlunsplit((split.scheme, split.netloc, split.path, updated_query, split.fragment))
|
| 247 |
|
| 248 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 249 |
@app.get("/health", tags=["system"])
|
| 250 |
def health_check() -> dict[str, str]:
|
| 251 |
return {"status": "ok"}
|
|
@@ -461,6 +515,11 @@ def delete_notebook(
|
|
| 461 |
if notebook is None:
|
| 462 |
raise HTTPException(status_code=404, detail="Notebook not found for this user.")
|
| 463 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 464 |
crud.delete_notebook(db=db, notebook=notebook)
|
| 465 |
return NotebookDeleteResponse(status="deleted", notebook_id=notebook_id)
|
| 466 |
|
|
@@ -544,18 +603,18 @@ async def upload_source_for_notebook(
|
|
| 544 |
if notebook is None:
|
| 545 |
raise HTTPException(status_code=404, detail="Notebook not found for this user.")
|
| 546 |
|
| 547 |
-
|
| 548 |
-
upload_dir.mkdir(parents=True, exist_ok=True)
|
| 549 |
-
destination = upload_dir / file.filename
|
| 550 |
content = await file.read()
|
| 551 |
destination.write_bytes(content)
|
|
|
|
|
|
|
| 552 |
|
| 553 |
source = crud.create_source(
|
| 554 |
db=db,
|
| 555 |
notebook_id=notebook_id,
|
| 556 |
source_type="file",
|
| 557 |
-
title=
|
| 558 |
-
original_name=
|
| 559 |
url=None,
|
| 560 |
storage_path=str(destination),
|
| 561 |
status=status,
|
|
@@ -680,9 +739,25 @@ def list_messages_for_thread(
|
|
| 680 |
raise HTTPException(status_code=404, detail="Thread not found for this notebook.")
|
| 681 |
|
| 682 |
messages = crud.list_messages_for_thread(db=db, thread_id=thread_id)
|
|
|
|
| 683 |
return [
|
| 684 |
MessageResponse(
|
| 685 |
-
id=m.id,
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 686 |
)
|
| 687 |
for m in messages
|
| 688 |
]
|
|
@@ -783,6 +858,7 @@ def chat_on_thread(
|
|
| 783 |
role=user_message.role,
|
| 784 |
content=user_message.content,
|
| 785 |
created_at=user_message.created_at,
|
|
|
|
| 786 |
),
|
| 787 |
assistant_message=MessageResponse(
|
| 788 |
id=assistant_message.id,
|
|
@@ -790,6 +866,7 @@ def chat_on_thread(
|
|
| 790 |
role=assistant_message.role,
|
| 791 |
content=assistant_message.content,
|
| 792 |
created_at=assistant_message.created_at,
|
|
|
|
| 793 |
),
|
| 794 |
citations=citations,
|
| 795 |
)
|
|
|
|
| 2 |
|
| 3 |
from contextlib import asynccontextmanager
|
| 4 |
import os
|
| 5 |
+
import re
|
| 6 |
+
import shutil
|
| 7 |
from datetime import datetime, timezone
|
| 8 |
from pathlib import Path
|
| 9 |
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
|
| 10 |
+
from uuid import uuid4
|
| 11 |
|
| 12 |
from fastapi.concurrency import run_in_threadpool
|
| 13 |
from fastapi import APIRouter, BackgroundTasks, Depends, FastAPI, File, Form, HTTPException, Request, UploadFile, status
|
|
|
|
| 104 |
created_at: datetime
|
| 105 |
|
| 106 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 107 |
class CitationResponse(BaseModel):
|
| 108 |
source_title: str | None = None
|
| 109 |
source_id: int
|
|
|
|
| 112 |
score: float | None = None
|
| 113 |
|
| 114 |
|
| 115 |
+
class MessageResponse(BaseModel):
|
| 116 |
+
id: int
|
| 117 |
+
thread_id: int
|
| 118 |
+
role: str
|
| 119 |
+
content: str
|
| 120 |
+
created_at: datetime
|
| 121 |
+
citations: list[CitationResponse] = Field(default_factory=list)
|
| 122 |
+
|
| 123 |
+
|
| 124 |
class ChatRequest(BaseModel):
|
| 125 |
question: str = Field(min_length=1)
|
| 126 |
top_k: int = Field(default=5, ge=1, le=12)
|
|
|
|
| 195 |
|
| 196 |
MAX_HISTORY_MESSAGES = 8
|
| 197 |
MAX_HISTORY_CHARS_PER_MESSAGE = 1000
|
| 198 |
+
MAX_UPLOAD_FILENAME_LENGTH = 255
|
| 199 |
+
SAFE_FILENAME_RE = re.compile(r"[^A-Za-z0-9._-]+")
|
| 200 |
+
UPLOADS_ROOT = Path("uploads")
|
| 201 |
|
| 202 |
|
| 203 |
def _build_conversation_history(
|
|
|
|
| 253 |
return urlunsplit((split.scheme, split.netloc, split.path, updated_query, split.fragment))
|
| 254 |
|
| 255 |
|
| 256 |
+
def _sanitize_upload_filename(filename: str | None) -> str:
|
| 257 |
+
raw_name = Path(str(filename or "")).name.replace("\x00", "").strip()
|
| 258 |
+
sanitized = SAFE_FILENAME_RE.sub("_", raw_name).strip("._-")
|
| 259 |
+
if not sanitized:
|
| 260 |
+
sanitized = f"upload_{uuid4().hex[:10]}.bin"
|
| 261 |
+
if len(sanitized) > MAX_UPLOAD_FILENAME_LENGTH:
|
| 262 |
+
ext = Path(sanitized).suffix[:20]
|
| 263 |
+
stem_limit = max(1, MAX_UPLOAD_FILENAME_LENGTH - len(ext))
|
| 264 |
+
sanitized = f"{Path(sanitized).stem[:stem_limit]}{ext}"
|
| 265 |
+
return sanitized
|
| 266 |
+
|
| 267 |
+
|
| 268 |
+
def _resolve_notebook_upload_path(notebook_id: int, filename: str | None) -> Path:
|
| 269 |
+
upload_dir = UPLOADS_ROOT / f"notebook_{notebook_id}"
|
| 270 |
+
upload_dir.mkdir(parents=True, exist_ok=True)
|
| 271 |
+
upload_dir_resolved = upload_dir.resolve()
|
| 272 |
+
|
| 273 |
+
safe_name = _sanitize_upload_filename(filename)
|
| 274 |
+
destination = (upload_dir_resolved / safe_name).resolve()
|
| 275 |
+
if destination.parent != upload_dir_resolved:
|
| 276 |
+
raise HTTPException(status_code=400, detail="Invalid upload filename.")
|
| 277 |
+
|
| 278 |
+
if destination.exists():
|
| 279 |
+
destination = (upload_dir_resolved / f"{destination.stem}_{uuid4().hex[:8]}{destination.suffix}").resolve()
|
| 280 |
+
return destination
|
| 281 |
+
|
| 282 |
+
|
| 283 |
+
def _remove_tree_within_root(root: Path, target: Path) -> None:
|
| 284 |
+
if not target.exists():
|
| 285 |
+
return
|
| 286 |
+
root_resolved = root.resolve()
|
| 287 |
+
target_resolved = target.resolve()
|
| 288 |
+
if target_resolved == root_resolved or root_resolved not in target_resolved.parents:
|
| 289 |
+
raise RuntimeError(f"Refusing to delete path outside root: {target_resolved}")
|
| 290 |
+
shutil.rmtree(target_resolved)
|
| 291 |
+
|
| 292 |
+
|
| 293 |
+
def _cleanup_notebook_storage(owner_user_id: int, notebook_id: int) -> None:
|
| 294 |
+
storage_base = Path(os.getenv("STORAGE_BASE_DIR", "data"))
|
| 295 |
+
notebook_root = storage_base / "users" / str(owner_user_id) / "notebooks"
|
| 296 |
+
notebook_path = notebook_root / str(notebook_id)
|
| 297 |
+
_remove_tree_within_root(notebook_root, notebook_path)
|
| 298 |
+
|
| 299 |
+
upload_path = UPLOADS_ROOT / f"notebook_{notebook_id}"
|
| 300 |
+
_remove_tree_within_root(UPLOADS_ROOT, upload_path)
|
| 301 |
+
|
| 302 |
+
|
| 303 |
@app.get("/health", tags=["system"])
|
| 304 |
def health_check() -> dict[str, str]:
|
| 305 |
return {"status": "ok"}
|
|
|
|
| 515 |
if notebook is None:
|
| 516 |
raise HTTPException(status_code=404, detail="Notebook not found for this user.")
|
| 517 |
|
| 518 |
+
try:
|
| 519 |
+
_cleanup_notebook_storage(owner_user_id=current_user.id, notebook_id=notebook_id)
|
| 520 |
+
except Exception as exc:
|
| 521 |
+
raise HTTPException(status_code=500, detail=f"Failed to delete notebook storage: {exc}") from exc
|
| 522 |
+
|
| 523 |
crud.delete_notebook(db=db, notebook=notebook)
|
| 524 |
return NotebookDeleteResponse(status="deleted", notebook_id=notebook_id)
|
| 525 |
|
|
|
|
| 603 |
if notebook is None:
|
| 604 |
raise HTTPException(status_code=404, detail="Notebook not found for this user.")
|
| 605 |
|
| 606 |
+
destination = _resolve_notebook_upload_path(notebook_id=notebook_id, filename=file.filename)
|
|
|
|
|
|
|
| 607 |
content = await file.read()
|
| 608 |
destination.write_bytes(content)
|
| 609 |
+
original_name = Path(str(file.filename or destination.name)).name
|
| 610 |
+
source_title = title or original_name or destination.name
|
| 611 |
|
| 612 |
source = crud.create_source(
|
| 613 |
db=db,
|
| 614 |
notebook_id=notebook_id,
|
| 615 |
source_type="file",
|
| 616 |
+
title=source_title,
|
| 617 |
+
original_name=original_name,
|
| 618 |
url=None,
|
| 619 |
storage_path=str(destination),
|
| 620 |
status=status,
|
|
|
|
| 739 |
raise HTTPException(status_code=404, detail="Thread not found for this notebook.")
|
| 740 |
|
| 741 |
messages = crud.list_messages_for_thread(db=db, thread_id=thread_id)
|
| 742 |
+
citations_by_message = crud.list_message_citations_for_thread(db=db, thread_id=thread_id)
|
| 743 |
return [
|
| 744 |
MessageResponse(
|
| 745 |
+
id=m.id,
|
| 746 |
+
thread_id=m.thread_id,
|
| 747 |
+
role=m.role,
|
| 748 |
+
content=m.content,
|
| 749 |
+
created_at=m.created_at,
|
| 750 |
+
citations=[
|
| 751 |
+
CitationResponse(
|
| 752 |
+
source_title=entry.get("source_title"),
|
| 753 |
+
source_id=int(entry.get("source_id", 0)),
|
| 754 |
+
chunk_ref=(str(entry.get("chunk_ref")) if entry.get("chunk_ref") else None),
|
| 755 |
+
quote=(str(entry.get("quote")) if entry.get("quote") else None),
|
| 756 |
+
score=(float(entry["score"]) if entry.get("score") is not None else None),
|
| 757 |
+
)
|
| 758 |
+
for entry in citations_by_message.get(m.id, [])
|
| 759 |
+
if int(entry.get("source_id", 0)) > 0
|
| 760 |
+
],
|
| 761 |
)
|
| 762 |
for m in messages
|
| 763 |
]
|
|
|
|
| 858 |
role=user_message.role,
|
| 859 |
content=user_message.content,
|
| 860 |
created_at=user_message.created_at,
|
| 861 |
+
citations=[],
|
| 862 |
),
|
| 863 |
assistant_message=MessageResponse(
|
| 864 |
id=assistant_message.id,
|
|
|
|
| 866 |
role=assistant_message.role,
|
| 867 |
content=assistant_message.content,
|
| 868 |
created_at=assistant_message.created_at,
|
| 869 |
+
citations=citations,
|
| 870 |
),
|
| 871 |
citations=citations,
|
| 872 |
)
|
auth/session.py
CHANGED
|
@@ -15,6 +15,7 @@ from data.db import get_db
|
|
| 15 |
AUTH_MODE_DEV = "dev"
|
| 16 |
AUTH_MODE_HF = "hf_oauth"
|
| 17 |
AUTH_BRIDGE_SALT = "streamlit-auth-bridge"
|
|
|
|
| 18 |
|
| 19 |
|
| 20 |
@dataclass(frozen=True)
|
|
@@ -36,18 +37,31 @@ def get_auth_mode() -> str:
|
|
| 36 |
|
| 37 |
def configure_session_middleware(app) -> None:
|
| 38 |
"""Attach Starlette session middleware once during app setup."""
|
| 39 |
-
secret = os.getenv("APP_SESSION_SECRET",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 40 |
app.add_middleware(
|
| 41 |
SessionMiddleware,
|
| 42 |
secret_key=secret,
|
| 43 |
-
same_site=
|
| 44 |
-
https_only=
|
| 45 |
max_age=60 * 60 * 24 * 7, # 7 days
|
| 46 |
)
|
| 47 |
|
| 48 |
|
| 49 |
def _bridge_serializer() -> URLSafeTimedSerializer:
|
| 50 |
-
secret = os.getenv("APP_SESSION_SECRET",
|
| 51 |
return URLSafeTimedSerializer(secret_key=secret, salt=AUTH_BRIDGE_SALT)
|
| 52 |
|
| 53 |
|
|
|
|
| 15 |
AUTH_MODE_DEV = "dev"
|
| 16 |
AUTH_MODE_HF = "hf_oauth"
|
| 17 |
AUTH_BRIDGE_SALT = "streamlit-auth-bridge"
|
| 18 |
+
DEFAULT_DEV_SESSION_SECRET = "dev-only-session-secret-change-me"
|
| 19 |
|
| 20 |
|
| 21 |
@dataclass(frozen=True)
|
|
|
|
| 37 |
|
| 38 |
def configure_session_middleware(app) -> None:
|
| 39 |
"""Attach Starlette session middleware once during app setup."""
|
| 40 |
+
secret = os.getenv("APP_SESSION_SECRET", DEFAULT_DEV_SESSION_SECRET).strip()
|
| 41 |
+
auth_mode = get_auth_mode()
|
| 42 |
+
if auth_mode == AUTH_MODE_HF and (not secret or secret == DEFAULT_DEV_SESSION_SECRET):
|
| 43 |
+
raise RuntimeError("APP_SESSION_SECRET must be set to a non-default value in hf_oauth mode.")
|
| 44 |
+
same_site = os.getenv("SESSION_COOKIE_SAMESITE", "lax").strip().lower()
|
| 45 |
+
if same_site not in {"lax", "strict", "none"}:
|
| 46 |
+
same_site = "lax"
|
| 47 |
+
secure_default = "1" if auth_mode == AUTH_MODE_HF else "0"
|
| 48 |
+
https_only = os.getenv("SESSION_COOKIE_SECURE", secure_default).strip().lower() in {
|
| 49 |
+
"1",
|
| 50 |
+
"true",
|
| 51 |
+
"yes",
|
| 52 |
+
"on",
|
| 53 |
+
}
|
| 54 |
app.add_middleware(
|
| 55 |
SessionMiddleware,
|
| 56 |
secret_key=secret,
|
| 57 |
+
same_site=same_site,
|
| 58 |
+
https_only=https_only,
|
| 59 |
max_age=60 * 60 * 24 * 7, # 7 days
|
| 60 |
)
|
| 61 |
|
| 62 |
|
| 63 |
def _bridge_serializer() -> URLSafeTimedSerializer:
|
| 64 |
+
secret = os.getenv("APP_SESSION_SECRET", DEFAULT_DEV_SESSION_SECRET)
|
| 65 |
return URLSafeTimedSerializer(secret_key=secret, salt=AUTH_BRIDGE_SALT)
|
| 66 |
|
| 67 |
|
data/crud.py
CHANGED
|
@@ -1,5 +1,6 @@
|
|
| 1 |
from __future__ import annotations
|
| 2 |
|
|
|
|
| 3 |
from datetime import datetime
|
| 4 |
from data.models import Artifact
|
| 5 |
from sqlalchemy.orm import Session
|
|
@@ -212,6 +213,33 @@ def create_message_citations(
|
|
| 212 |
db.refresh(row)
|
| 213 |
return rows
|
| 214 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 215 |
def get_artifact(db: Session, artifact_id: int) -> Artifact | None:
|
| 216 |
return db.get(Artifact, artifact_id)
|
| 217 |
|
|
|
|
| 1 |
from __future__ import annotations
|
| 2 |
|
| 3 |
+
from collections import defaultdict
|
| 4 |
from datetime import datetime
|
| 5 |
from data.models import Artifact
|
| 6 |
from sqlalchemy.orm import Session
|
|
|
|
| 213 |
db.refresh(row)
|
| 214 |
return rows
|
| 215 |
|
| 216 |
+
|
| 217 |
+
def list_message_citations_for_thread(
|
| 218 |
+
db: Session, thread_id: int
|
| 219 |
+
) -> dict[int, list[dict[str, int | str | float | None]]]:
|
| 220 |
+
rows = (
|
| 221 |
+
db.query(MessageCitation, Source.title)
|
| 222 |
+
.join(Source, Source.id == MessageCitation.source_id)
|
| 223 |
+
.join(Message, Message.id == MessageCitation.message_id)
|
| 224 |
+
.filter(Message.thread_id == thread_id)
|
| 225 |
+
.order_by(MessageCitation.id.asc())
|
| 226 |
+
.all()
|
| 227 |
+
)
|
| 228 |
+
|
| 229 |
+
citations_by_message: dict[int, list[dict[str, int | str | float | None]]] = defaultdict(list)
|
| 230 |
+
for citation, source_title in rows:
|
| 231 |
+
citations_by_message[int(citation.message_id)].append(
|
| 232 |
+
{
|
| 233 |
+
"source_id": int(citation.source_id),
|
| 234 |
+
"source_title": source_title,
|
| 235 |
+
"chunk_ref": citation.chunk_ref,
|
| 236 |
+
"quote": citation.quote,
|
| 237 |
+
"score": citation.score,
|
| 238 |
+
}
|
| 239 |
+
)
|
| 240 |
+
return dict(citations_by_message)
|
| 241 |
+
|
| 242 |
+
|
| 243 |
def get_artifact(db: Session, artifact_id: int) -> Artifact | None:
|
| 244 |
return db.get(Artifact, artifact_id)
|
| 245 |
|
docs/DESIGN_BRIEF.md
ADDED
|
@@ -0,0 +1,164 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# NotebookLM Clone Design Brief
|
| 2 |
+
|
| 3 |
+
## 1. System Overview
|
| 4 |
+
This system is a full-stack NotebookLM-style application that supports:
|
| 5 |
+
- source ingestion (`.pdf`, `.pptx`, `.txt`, web URL)
|
| 6 |
+
- retrieval-augmented chat with citations
|
| 7 |
+
- artifact generation (report, quiz, podcast transcript + audio)
|
| 8 |
+
- strict per-user data isolation with multiple notebooks per user
|
| 9 |
+
|
| 10 |
+
The stack is optimized for Hugging Face Spaces deployment:
|
| 11 |
+
- frontend: Streamlit (`frontend/app.py`)
|
| 12 |
+
- backend API: FastAPI (`app.py`)
|
| 13 |
+
- metadata store: SQLite via SQLAlchemy (`data/models.py`, `data/crud.py`)
|
| 14 |
+
- vector store: ChromaDB per user+notebook (`src/ingestion/vectorstore.py`)
|
| 15 |
+
- ingestion/artifact services: `src/ingestion/*`, `src/artifacts/*`
|
| 16 |
+
|
| 17 |
+
## 2. Architecture Diagram
|
| 18 |
+
```mermaid
|
| 19 |
+
flowchart TD
|
| 20 |
+
A[Streamlit Frontend] --> B[FastAPI Backend]
|
| 21 |
+
B --> C[Auth Layer<br/>HF OAuth / Dev Auth]
|
| 22 |
+
B --> D[Notebook & Source APIs]
|
| 23 |
+
B --> E[Thread & Chat APIs]
|
| 24 |
+
B --> F[Artifact APIs]
|
| 25 |
+
|
| 26 |
+
D --> G[Ingestion Service]
|
| 27 |
+
G --> H[Extractors<br/>PDF/PPTX/TXT/URL]
|
| 28 |
+
G --> I[Chunker]
|
| 29 |
+
G --> J[Embedding Adapter]
|
| 30 |
+
G --> K[ChromaDB]
|
| 31 |
+
|
| 32 |
+
E --> K
|
| 33 |
+
E --> L[LLM Client]
|
| 34 |
+
E --> M[Message + Citation Tables]
|
| 35 |
+
|
| 36 |
+
F --> L
|
| 37 |
+
F --> N[TTS Adapter<br/>Edge/OpenAI/ElevenLabs]
|
| 38 |
+
F --> O[Artifacts on Disk]
|
| 39 |
+
|
| 40 |
+
B --> P[(SQLite DB)]
|
| 41 |
+
    B --> Q["/data + uploads Storage"]
|
| 42 |
+
```
|
| 43 |
+
|
| 44 |
+
## 3. Component Responsibilities
|
| 45 |
+
- `frontend/app.py`
|
| 46 |
+
- authentication-aware UI
|
| 47 |
+
- notebook switching
|
| 48 |
+
- source upload/URL ingestion
|
| 49 |
+
- chat interface + citation display
|
| 50 |
+
- artifact generation, preview, and downloads
|
| 51 |
+
- `app.py`
|
| 52 |
+
- route orchestration and auth enforcement
|
| 53 |
+
- notebook/source/thread/artifact lifecycle endpoints
|
| 54 |
+
- chat orchestration with retrieval + prompting
|
| 55 |
+
- background podcast generation
|
| 56 |
+
- `auth/oauth.py`, `auth/session.py`
|
| 57 |
+
- HF OAuth code exchange
|
| 58 |
+
- secure session bridging to Streamlit
|
| 59 |
+
- current-user resolution
|
| 60 |
+
- `src/ingestion/*`
|
| 61 |
+
- extraction, chunking, embedding, vector upsert/query
|
| 62 |
+
- `src/artifacts/*`
|
| 63 |
+
- report/quiz/podcast generation and storage
|
| 64 |
+
- pluggable TTS providers (`edge`, `openai`, `elevenlabs`)
|
| 65 |
+
- `data/models.py`, `data/crud.py`
|
| 66 |
+
- relational schema and ownership-scoped queries
|
| 67 |
+
|
| 68 |
+
## 4. Data Model and Storage Strategy
|
| 69 |
+
Relational entities:
|
| 70 |
+
- `users`
|
| 71 |
+
- `notebooks` (`owner_user_id` foreign key)
|
| 72 |
+
- `sources` (per notebook)
|
| 73 |
+
- `chat_threads` and `messages`
|
| 74 |
+
- `message_citations` (assistant message -> source references)
|
| 75 |
+
- `artifacts` (status, metadata, content, file path)
|
| 76 |
+
|
| 77 |
+
Filesystem layout:
|
| 78 |
+
```text
|
| 79 |
+
<STORAGE_BASE_DIR>/users/<user_id>/notebooks/<notebook_id>/
|
| 80 |
+
files_raw/
|
| 81 |
+
files_extracted/
|
| 82 |
+
chroma/
|
| 83 |
+
artifacts/reports/
|
| 84 |
+
artifacts/quizzes/
|
| 85 |
+
artifacts/podcasts/
|
| 86 |
+
uploads/notebook_<notebook_id>/
|
| 87 |
+
```
|
| 88 |
+
|
| 89 |
+
Design rationale:
|
| 90 |
+
- SQLite keeps operational complexity low for MVP.
|
| 91 |
+
- Chroma per notebook enables practical RAG retrieval with low infra overhead.
|
| 92 |
+
- Disk layout mirrors ownership boundaries for simple cleanup and auditability.
|
| 93 |
+
|
| 94 |
+
## 5. End-to-End Flow
|
| 95 |
+
### 5.1 Ingestion
|
| 96 |
+
1. User uploads file or submits URL from Streamlit.
|
| 97 |
+
2. Backend verifies notebook ownership and validates URL safety (if URL).
|
| 98 |
+
3. Source record is created with `processing` status.
|
| 99 |
+
4. Ingestion service extracts text, chunks, embeds, and upserts into Chroma.
|
| 100 |
+
5. Source status transitions to `ready` or `failed`.
|
| 101 |
+
|
| 102 |
+
### 5.2 Retrieval + Chat
|
| 103 |
+
1. User sends a message in a notebook thread.
|
| 104 |
+
2. Backend checks notebook/thread ownership.
|
| 105 |
+
3. Query embedding is computed and top-k chunks are retrieved from notebook Chroma.
|
| 106 |
+
4. Prompt is assembled with conversation history and retrieved context.
|
| 107 |
+
5. LLM generates an answer.
|
| 108 |
+
6. Assistant message and structured citations are persisted.
|
| 109 |
+
7. UI shows answer and citations; citations remain available on subsequent reloads.
|
| 110 |
+
|
| 111 |
+
## 6. Security Plan
|
| 112 |
+
Authentication and identity:
|
| 113 |
+
- `AUTH_MODE=hf_oauth` for production deployments.
|
| 114 |
+
- Session-based current-user identity with signed bridge tokens.
|
| 115 |
+
|
| 116 |
+
User isolation:
|
| 117 |
+
- all notebook/thread/source/artifact endpoints verify ownership (`owner_user_id`)
|
| 118 |
+
- retrieval path binds queries to current user and notebook
|
| 119 |
+
|
| 120 |
+
Path/data protection:
|
| 121 |
+
- upload filenames are sanitized and constrained to notebook upload roots
|
| 122 |
+
- deletion is bounded to expected storage roots to prevent unsafe recursive deletes
|
| 123 |
+
- URL ingestion blocks local/private network targets (SSRF reduction)
|
| 124 |
+
|
| 125 |
+
Operational controls:
|
| 126 |
+
- environment-based secrets (`APP_SESSION_SECRET`, API keys)
|
| 127 |
+
- CI test gate before deploy
|
| 128 |
+
|
| 129 |
+
## 7. Milestone Plan
|
| 130 |
+
### MVP (Milestone 1)
|
| 131 |
+
- auth + sessions
|
| 132 |
+
- notebook CRUD + isolation checks
|
| 133 |
+
- ingestion for PDF/PPTX/TXT/URL
|
| 134 |
+
- notebook-scoped RAG chat with citations
|
| 135 |
+
|
| 136 |
+
### Milestone 2
|
| 137 |
+
- artifact generation endpoints (report/quiz/podcast)
|
| 138 |
+
- transcript/audio persistence and frontend playback/download
|
| 139 |
+
- improved chat UX and citation persistence in history
|
| 140 |
+
|
| 141 |
+
### Milestone 3 (Extensions)
|
| 142 |
+
- compare retrieval techniques (baseline semantic vs hybrid/rerank)
|
| 143 |
+
- latency/quality benchmarking and report
|
| 144 |
+
- stronger observability and error analytics
|
| 145 |
+
|
| 146 |
+
## 8. Key Risks and Mitigations
|
| 147 |
+
- LLM/API cost volatility
|
| 148 |
+
- mitigate with model selection defaults, request limits, caching opportunities
|
| 149 |
+
- HF `/data` ephemerality on free tier
|
| 150 |
+
- document tradeoff; optional HF dataset persistence extension
|
| 151 |
+
- retrieval quality drift across document types
|
| 152 |
+
- tune chunking and top-k; evaluate reranking/hybrid methods
|
| 153 |
+
- URL ingestion abuse
|
| 154 |
+
- strict scheme/host/IP/redirect/content-size checks
|
| 155 |
+
- dependency/runtime mismatch
|
| 156 |
+
- CI tests and pinned dependency strategy where practical
|
| 157 |
+
|
| 158 |
+
## 9. Specifications and References in Repo
|
| 159 |
+
- ingestion spec: `docs/INGESTION_SPEC.md`
|
| 160 |
+
- architecture spec: `docs/STREAMLIT_ARCHITECTURE_SPEC.md`
|
| 161 |
+
- integration notes: `INTEGRATION.md`
|
| 162 |
+
- schema docs: `ER_DIAGRAM.md`, `DATABASE_SCHEMA.md`
|
| 163 |
+
|
| 164 |
+
This brief is intended for export to PDF as the 2-4 page design deliverable.
|
frontend/app.py
CHANGED
|
@@ -525,8 +525,12 @@ elif page == "Notebooks":
|
|
| 525 |
for msg in message_result:
|
| 526 |
role = msg.get("role", "unknown")
|
| 527 |
content = msg.get("content", "")
|
|
|
|
| 528 |
if role == "assistant":
|
| 529 |
st.markdown(f"**Assistant:** {content}")
|
|
|
|
|
|
|
|
|
|
| 530 |
else:
|
| 531 |
st.markdown(f"**You:** {content}")
|
| 532 |
else:
|
|
|
|
| 525 |
for msg in message_result:
|
| 526 |
role = msg.get("role", "unknown")
|
| 527 |
content = msg.get("content", "")
|
| 528 |
+
citations = msg.get("citations", [])
|
| 529 |
if role == "assistant":
|
| 530 |
st.markdown(f"**Assistant:** {content}")
|
| 531 |
+
if isinstance(citations, list) and citations:
|
| 532 |
+
with st.expander("Citations", expanded=False):
|
| 533 |
+
st.dataframe(citations, use_container_width=True)
|
| 534 |
else:
|
| 535 |
st.markdown(f"**You:** {content}")
|
| 536 |
else:
|
requirements.txt
CHANGED
|
@@ -25,6 +25,7 @@ nltk
|
|
| 25 |
tqdm
|
| 26 |
pytest
|
| 27 |
edge-tts
|
|
|
|
| 28 |
pydub
|
| 29 |
ffmpeg-python
|
| 30 |
# NOTE: install ffmpeg system binary separately (e.g., `brew install ffmpeg`)
|
|
|
|
| 25 |
tqdm
|
| 26 |
pytest
|
| 27 |
edge-tts
|
| 28 |
+
elevenlabs>=1.0.0
|
| 29 |
pydub
|
| 30 |
ffmpeg-python
|
| 31 |
# NOTE: install ffmpeg system binary separately (e.g., `brew install ffmpeg`)
|
tests/test_chat_citations.py
ADDED
|
@@ -0,0 +1,119 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Integration tests for citation persistence in chat threads.
|
| 3 |
+
"""
|
| 4 |
+
from __future__ import annotations
|
| 5 |
+
|
| 6 |
+
import pathlib
|
| 7 |
+
import sys
|
| 8 |
+
from unittest.mock import patch
|
| 9 |
+
|
| 10 |
+
import pytest
|
| 11 |
+
from fastapi.testclient import TestClient
|
| 12 |
+
from sqlalchemy import create_engine
|
| 13 |
+
from sqlalchemy.orm import sessionmaker
|
| 14 |
+
|
| 15 |
+
ROOT = pathlib.Path(__file__).resolve().parents[1]
|
| 16 |
+
sys.path.insert(0, str(ROOT))
|
| 17 |
+
|
| 18 |
+
from app import app
|
| 19 |
+
from data.db import Base, get_db
|
| 20 |
+
|
| 21 |
+
|
| 22 |
+
@pytest.fixture()
def db_engine(tmp_path):
    """Provide a throwaway SQLite engine with the full app schema.

    The database file lives under pytest's ``tmp_path``, so each test
    function gets a fully isolated database; all tables are dropped and
    the engine disposed on teardown.
    """
    db_file = tmp_path / "test_chat_citations.db"
    engine = create_engine(
        f"sqlite:///{db_file}",
        # SQLite rejects cross-thread connection use by default; disabled
        # here since the FastAPI TestClient may service requests on a
        # different thread than the one that opened the connection.
        connect_args={"check_same_thread": False},
    )
    # Imported for its side effect only: registering every ORM model on
    # Base so create_all() below emits all tables.
    import data.models  # noqa: F401

    Base.metadata.create_all(bind=engine)
    yield engine
    # Teardown: remove the schema, then release pooled connections.
    Base.metadata.drop_all(bind=engine)
    engine.dispose()
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
@pytest.fixture()
def db_session(db_engine):
    """Yield a SQLAlchemy session bound to the per-test engine, closing it on teardown."""
    session_factory = sessionmaker(autocommit=False, autoflush=False, bind=db_engine)
    db = session_factory()
    yield db
    db.close()
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
@pytest.fixture()
def client(db_session, monkeypatch):
    """FastAPI test client running in dev-auth mode against the test DB."""
    # Dev auth mode avoids the HF OAuth flow; the session secret is pinned
    # to a fixed test value (presumably consumed by the session layer —
    # TODO confirm against auth/session.py).
    monkeypatch.setenv("AUTH_MODE", "dev")
    monkeypatch.setenv("APP_SESSION_SECRET", "chat-citations-test-secret")

    def _override_get_db():
        # Hand every request the single per-test session instead of a
        # freshly created production session.
        yield db_session

    app.dependency_overrides[get_db] = _override_get_db
    with TestClient(app, raise_server_exceptions=True) as c:
        yield c
    # Teardown: remove the override so later tests see a clean app.
    app.dependency_overrides.clear()
|
| 57 |
+
|
| 58 |
+
|
| 59 |
+
def test_thread_messages_include_persisted_citations(client):
    """End-to-end check that chat citations survive a round trip.

    Creates a notebook, a ready source, and a thread; runs one chat turn
    with retrieval and LLM completion mocked out; then re-fetches the
    thread's messages and verifies the stored assistant message carries
    the same citation the chat response reported.
    """
    # --- Arrange: notebook -> source -> thread, all owned by the dev user.
    create_notebook = client.post("/notebooks", json={"title": "Citation Notebook"})
    assert create_notebook.status_code == 200
    notebook_id = int(create_notebook.json()["id"])

    create_source = client.post(
        f"/notebooks/{notebook_id}/sources",
        json={
            "type": "text",
            "title": "Lecture Notes",
            "status": "ready",
        },
    )
    assert create_source.status_code == 200
    source_id = int(create_source.json()["id"])

    create_thread = client.post(
        f"/notebooks/{notebook_id}/threads",
        json={"title": "Q&A"},
    )
    assert create_thread.status_code == 200
    thread_id = int(create_thread.json()["id"])

    # Canned retrieval hit pointing at the source created above
    # (presumably mirrors the row shape returned by
    # app.query_notebook_chunks — confirm against the real query path).
    retrieval_rows = [
        {
            "chunk_id": "chunk-1",
            "score": 0.12,
            "document": "Neural networks learn from examples.",
            "metadata": {
                "source_id": str(source_id),
                "source_title": "Lecture Notes",
                "chunk_index": 0,
            },
        }
    ]

    # --- Act: one chat turn with retrieval and completion both mocked.
    with patch("app.query_notebook_chunks", return_value=retrieval_rows), patch(
        "app.generate_chat_completion", return_value="They learn from examples in the data."
    ):
        chat_resp = client.post(
            f"/threads/{thread_id}/chat",
            params={"notebook_id": notebook_id},
            json={"question": "How do neural networks learn?", "top_k": 5},
        )

    # --- Assert: the immediate chat response reports the citation...
    assert chat_resp.status_code == 200
    chat_payload = chat_resp.json()
    assert len(chat_payload["citations"]) == 1
    assert int(chat_payload["citations"][0]["source_id"]) == source_id

    # ...and the citation was persisted with the stored assistant message.
    messages_resp = client.get(
        f"/threads/{thread_id}/messages",
        params={"notebook_id": notebook_id},
    )
    assert messages_resp.status_code == 200
    messages = messages_resp.json()
    assistant_message = next((m for m in messages if m["role"] == "assistant"), None)
    assert assistant_message is not None
    assert len(assistant_message["citations"]) == 1
    assert int(assistant_message["citations"][0]["source_id"]) == source_id
    assert assistant_message["citations"][0]["source_title"] == "Lecture Notes"
|
tests/test_notebook_management_api.py
CHANGED
|
@@ -3,6 +3,7 @@ Integration tests for notebook rename/delete management endpoints.
|
|
| 3 |
"""
|
| 4 |
from __future__ import annotations
|
| 5 |
|
|
|
|
| 6 |
import pathlib
|
| 7 |
import sys
|
| 8 |
from unittest.mock import patch
|
|
@@ -15,6 +16,7 @@ from sqlalchemy.orm import sessionmaker
|
|
| 15 |
ROOT = pathlib.Path(__file__).resolve().parents[1]
|
| 16 |
sys.path.insert(0, str(ROOT))
|
| 17 |
|
|
|
|
| 18 |
from app import app
|
| 19 |
from data.db import Base, get_db
|
| 20 |
|
|
@@ -43,9 +45,11 @@ def db_session(db_engine):
|
|
| 43 |
|
| 44 |
|
| 45 |
@pytest.fixture()
|
| 46 |
-
def client(db_session, monkeypatch):
|
| 47 |
monkeypatch.setenv("AUTH_MODE", "dev")
|
| 48 |
monkeypatch.setenv("APP_SESSION_SECRET", "notebook-mgmt-test-secret")
|
|
|
|
|
|
|
| 49 |
|
| 50 |
def _override_get_db():
|
| 51 |
yield db_session
|
|
@@ -179,3 +183,44 @@ def test_create_url_source_accepts_public_url(client):
|
|
| 179 |
assert payload["status"] == "ready"
|
| 180 |
assert payload["ingested_at"] is not None
|
| 181 |
mock_ingest.assert_called_once()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 3 |
"""
|
| 4 |
from __future__ import annotations
|
| 5 |
|
| 6 |
+
import os
|
| 7 |
import pathlib
|
| 8 |
import sys
|
| 9 |
from unittest.mock import patch
|
|
|
|
| 16 |
ROOT = pathlib.Path(__file__).resolve().parents[1]
|
| 17 |
sys.path.insert(0, str(ROOT))
|
| 18 |
|
| 19 |
+
import app as app_module
|
| 20 |
from app import app
|
| 21 |
from data.db import Base, get_db
|
| 22 |
|
|
|
|
| 45 |
|
| 46 |
|
| 47 |
@pytest.fixture()
|
| 48 |
+
def client(db_session, monkeypatch, tmp_path):
|
| 49 |
monkeypatch.setenv("AUTH_MODE", "dev")
|
| 50 |
monkeypatch.setenv("APP_SESSION_SECRET", "notebook-mgmt-test-secret")
|
| 51 |
+
monkeypatch.setenv("STORAGE_BASE_DIR", str(tmp_path / "storage"))
|
| 52 |
+
monkeypatch.setattr("app.UPLOADS_ROOT", tmp_path / "uploads")
|
| 53 |
|
| 54 |
def _override_get_db():
|
| 55 |
yield db_session
|
|
|
|
| 183 |
assert payload["status"] == "ready"
|
| 184 |
assert payload["ingested_at"] is not None
|
| 185 |
mock_ingest.assert_called_once()
|
| 186 |
+
|
| 187 |
+
|
| 188 |
+
def test_upload_source_sanitizes_filename(client):
    """A path-traversal upload filename is reduced to its safe basename."""
    create_resp = client.post("/notebooks", json={"title": "Uploads"})
    assert create_resp.status_code == 200
    notebook_id = create_resp.json()["id"]

    # Skip real ingestion; only upload/storage handling is under test here.
    with patch("app.ingest_source", return_value=1):
        upload_resp = client.post(
            f"/notebooks/{notebook_id}/sources/upload",
            data={"status": "pending"},
            # Hostile filename attempting to escape the upload root.
            files={"file": ("../../../../evil.txt", b"hello world", "text/plain")},
        )

    assert upload_resp.status_code == 200
    payload = upload_resp.json()
    # Traversal components stripped; only the basename is recorded.
    assert payload["original_name"] == "evil.txt"
    assert payload["storage_path"] is not None
    # Stored path must stay inside this notebook's upload directory.
    assert ".." not in payload["storage_path"]
    assert f"notebook_{notebook_id}" in payload["storage_path"]
    assert pathlib.Path(payload["storage_path"]).exists()
|
| 207 |
+
|
| 208 |
+
|
| 209 |
+
def test_delete_notebook_removes_notebook_storage_and_uploads(client):
    """Deleting a notebook also removes its on-disk storage and upload trees."""
    create_resp = client.post("/notebooks", json={"title": "Delete storage"})
    assert create_resp.status_code == 200
    notebook_id = create_resp.json()["id"]

    # Seed the notebook's storage tree with a marker file (user id "1" is
    # presumably the dev-auth user — TODO confirm against the auth fixture).
    storage_root = pathlib.Path(os.environ["STORAGE_BASE_DIR"]) / "users" / "1" / "notebooks" / str(notebook_id)
    storage_root.mkdir(parents=True, exist_ok=True)
    (storage_root / "marker.txt").write_text("x", encoding="utf-8")

    # Seed the upload directory for the same notebook.
    upload_root = pathlib.Path(app_module.UPLOADS_ROOT) / f"notebook_{notebook_id}"
    upload_root.mkdir(parents=True, exist_ok=True)
    (upload_root / "upload.txt").write_text("x", encoding="utf-8")

    delete_resp = client.delete(f"/notebooks/{notebook_id}")
    assert delete_resp.status_code == 200

    # Both trees must be gone after deletion.
    assert not storage_root.exists()
    assert not upload_root.exists()
|