# Collab Editor - Project Specification

## 1. Overview

A collaborative, real-time scientific article editor deployed as a Hugging Face Space. Users write rich content (math, citations, custom components) in a TipTap-based editor synced via Yjs/Hocuspocus, then publish a self-contained static HTML article. An AI assistant helps with writing and editing.

**Stack**: React 18 + TipTap 3 + Yjs (frontend) / Express + Hocuspocus + Node 20 (backend) / Docker on HF Spaces. No CSS-in-JS - all styling via vanilla CSS custom properties.

**Relationship to `research-article-template`**: the CSS foundation, design tokens, and visual language come from the [research-article-template](https://huggingface.co/spaces/tfrere/research-article-template) project. The editor imports the template's CSS files (`_variables.css`, `_reset.css`, `_base.css`, `_layout.css`, component partials) and the publisher injects them inline into published HTML. The published output is designed to look identical to articles built with the Astro-based template.

---

## 2. Architecture

```mermaid
graph TB
    subgraph browser [Browser]
        SPA["React SPA<br/>TipTap + Yjs"]
    end
    subgraph server [Node Backend - port 8080]
        Express["Express HTTP"]
        Hocuspocus["Hocuspocus<br/>WebSocket"]
        Publisher["Publisher Pipeline<br/>HTML + PDF"]
        Agent["AI Agent<br/>HF Inference"]
        Auth["HF OAuth"]
    end
    subgraph storage [Persistence]
        LocalFS["Local FS<br/>data/*.yjs"]
        HFDataset["HF Dataset<br/>articles/ published/"]
    end
    SPA -->|"WebSocket /collab"| Hocuspocus
    SPA -->|"REST /api/*"| Express
    Hocuspocus -->|"Database ext"| LocalFS
    LocalFS -->|"schedulePush"| HFDataset
    HFDataset -->|"pullDocument"| LocalFS
    Express --> Publisher
    Express --> Agent
    Express --> Auth
    Publisher --> LocalFS
    Publisher -->|"uploadPublishedAssets"| HFDataset
```

**Single process in production**: the backend serves the Vite-built frontend, all REST APIs, the WebSocket collab channel, and static published articles. No reverse proxy needed.

---

## 3. Data Model (Yjs Shared Types)

The entire collaborative state lives in a single `Y.Doc`:

- **`Y.XmlFragment("default")`** - TipTap document content (ProseMirror nodes synced via Collaboration extension)
- **`Y.Map("frontmatter")`** - scalar metadata: `title`, `subtitle`, `description`, `published`, `doi`, `template`, `licence`
- **`Y.Array("frontmatter.authors")`** - `{ name, url?, affiliations: number[] }[]`
- **`Y.Array("frontmatter.affiliations")`** - `{ name, url? }[]`
- **`Y.Map("citations")`** - CSL-JSON entries keyed by citation ID
- **`Y.Map("settings")`** - `citationStyle`, `primaryHue`, and future editor preferences
- **`Y.Map("comments")`** - comment threads keyed by `commentId`, each with author/text/resolved

All types are concurrently editable by multiple users and persist to `data/default.yjs`.

---

## 4. Backend Components

### 4.1 HTTP Routes

| Method | Path | Auth | Purpose |
|--------|------|------|---------|
| `GET` | `/oauth/authorize` | Public | Redirect to HF OAuth |
| `GET` | `/auth/callback` | Public (CSRF state) | Exchange code, set cookie, redirect to `/editor` |
| `GET` | `/api/auth/status` | Cookie | Return `{ authenticated, canEdit, user }` |
| `POST` | `/api/chat` | OAuth (optional) | Stream AI agent responses (HF Inference Providers) |
| `POST` | `/api/publish` | OAuth (canEdit) | Run publish pipeline, generate HTML/PDF |
| `POST` | `/api/admin/reset-document` | OAuth (canEdit) | Delete local `.yjs`, close connections |
| `POST` | `/api/upload` | None (uses cookie for HF) | Upload image (multipart, max 10MB) |
| `POST` | `/api/citations/resolve` | None | Resolve DOI/URL to CSL-JSON |
| `POST` | `/api/citations/format` | None | Format entries to HTML bibliography |
| `POST` | `/api/citations/import-bib` | None | Parse BibTeX to CSL-JSON |
| `GET` | `/editor` | OAuth (canEdit) | Serve SPA (or login page) |
| `GET` | `*` | Public | Serve published article (or login page) |

### 4.2 WebSocket Collaboration

- Upgrade on `/collab` only; all other paths rejected
- Single document: `DEFAULT_DOC_NAME = "default"`
- Hocuspocus `onAuthenticate`: validates OAuth token if enabled, checks `canEdit`
- `Database` extension: `fetch` reads local `.yjs` or pulls from HF; `store` writes local + schedules HF push (10s debounce)

### 4.3 HF Storage

- Dataset ID: `HF_DATASET_ID` or `{SPACE_ID}-data`
- Dataset is created **private by default** (`createRepo({ private: true })`). The OAuth grant needs `manage-repos` on the user's first write; subsequent containers reuse the cached token.
- Token: `HF_TOKEN` (env) or cached OAuth token from the last authenticated user
- **Documents**: `articles/.yjs` - debounced push on every Hocuspocus store
- **Published assets**: `published//{index.html, article.pdf, thumb.jpg, meta.json, llms.txt}`
- **Images**: `images/` referenced from articles via `/d/images/...` proxy URLs
- `flushAll()` on `SIGTERM`/`SIGINT` to push pending changes

### 4.3.1 Storage Status & Recovery

The persistence pipeline used to fail silently in multiple places (`createRepo` 403 on a missing scope, `uploadFile` 5xx mid-debounce, `writeFileSync` on a read-only FS, ...) and the editor would happily keep showing "Saved". To make data first-class:

- **In-memory tracker** in `hf-storage.ts` records `datasetReady`, `lastLocalSaveAt`, `lastCloudPushAt`, `pendingPush`, `lastError {stage, message, statusCode, at, docName}`. Every write path updates it; every error path records the failure.
- **`GET /api/storage/status`** exposes the tracker (canEdit-gated). The frontend `SyncIndicator` polls it every 5s and displays a three-state badge: green "Saved" / amber "Saving..." / **red "Storage error"** (pulsing, with the exact reason in the tooltip + actionable hint for the 403 / missing-scope case).
- **Eager `ensureDatasetExists`** on first `/api/auth/status` for a canEdit user. A misconfigured fork now surfaces its error within ~10s of login instead of waiting for an edit + 12s debounce cycle.
- **`beforeunload` guard** on the editor: if a local edit is in flight, a push is armed, the WS is offline, or the tracker reports an error, the browser pops the standard "Leave site?" confirm.
- **`GET /api/admin/export-doc`** (canEdit-gated) streams the on-disk `.yjs` snapshot as a download. This is the escape hatch for disaster recovery: when the cloud push has been failing and the container is about to rebuild, an admin can grab the doc bytes manually.
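The tracker and the three-state badge described above could be sketched roughly as follows. This is a minimal illustration, not the code in `hf-storage.ts`: the field names come from the list above, while the function names (`createTracker`, `recordLocalSave`, `recordCloudPush`, `recordError`, `badgeFor`), the `stage` strings, and the rule that a successful push clears `lastError` are all assumptions.

```typescript
// Hypothetical sketch of the in-memory storage tracker.
// Field names follow the spec; every function name here is an assumption.
interface StorageError {
  stage: string; // e.g. "createRepo", "uploadFile" (illustrative values)
  message: string;
  statusCode?: number;
  at: number; // epoch milliseconds
  docName?: string;
}

interface StorageStatus {
  datasetReady: boolean;
  lastLocalSaveAt: number | null;
  lastCloudPushAt: number | null;
  pendingPush: boolean;
  lastError: StorageError | null;
}

type Badge = "saved" | "saving" | "error";

function createTracker(): StorageStatus {
  return {
    datasetReady: false,
    lastLocalSaveAt: null,
    lastCloudPushAt: null,
    pendingPush: false,
    lastError: null,
  };
}

// A local write succeeded; a debounced cloud push is now armed.
function recordLocalSave(s: StorageStatus, now: number): void {
  s.lastLocalSaveAt = now;
  s.pendingPush = true;
}

// The cloud push succeeded. Assumption: success clears the last error.
function recordCloudPush(s: StorageStatus, now: number): void {
  s.datasetReady = true;
  s.lastCloudPushAt = now;
  s.pendingPush = false;
  s.lastError = null;
}

// Any failing write path records what went wrong and where.
function recordError(s: StorageStatus, err: StorageError): void {
  s.lastError = err;
}

// Map the tracker onto the badge: an error wins over a pending push.
function badgeFor(s: StorageStatus): Badge {
  if (s.lastError !== null) return "error";
  if (s.pendingPush) return "saving";
  return "saved";
}
```

Checking `lastError` first means a failed push keeps the badge red even while another push is armed, which matches the stated goal of never showing "Saved" over a broken pipeline.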
### 4.3.2 Dataset Reverse Proxy (`/d/*`)

Since the dataset is private, anonymous viewers of a published article can't fetch its images / PDF / og:image directly from `huggingface.co/datasets/...`. The editor server exposes `GET /d/:path*` as an authenticating reverse proxy:

- **Whitelist**: only `images/` and `published/` are reachable; `articles/` (raw `.yjs` drafts) is **always 404** regardless of caller.
- **Token cascade**: request cookie → cached user token → `HF_TOKEN` env → anonymous fetch. The cookie token is also promoted into the cache opportunistically, so the first signed-in viewer warms the proxy for subsequent anonymous viewers within the same container lifetime.
- **Streaming**: WHATWG body piped straight to the Express response - no buffering of full PDFs in Node memory.
- **Caching**: `images/*` is served as `immutable, max-age=1y` (UUID names, never overwritten); `published/*` as `max-age=300, stale-while-revalidate=60` (re-published in place).
- **Error mapping**: upstream 401/403 collapse to 502 so the browser never gets prompted for credentials it can't supply; upstream 404 passes through.

### 4.4 Publisher Pipeline

```mermaid
flowchart LR
    YDoc["Y.Doc (.yjs)"] --> Extract["extractFromYDoc<br/>frontmatter + JSON"]
    Extract --> GenHTML["generateHTML<br/>@tiptap/html"]
    GenHTML --> PostProc["postProcess<br/>accordion, biblio,<br/>mermaid, htmlEmbed"]
    PostProc --> Render["renderArticleHTML<br/>full HTML page"]
    CSS["loadCSS<br/>template styles"] --> Render
    Render --> LocalWrite["Write local<br/>index.html"]
    Render --> PDF["Playwright<br/>PDF + thumbnail"]
    LocalWrite --> HFUpload["uploadPublishedAssets<br/>HF dataset"]
```

- **CSS loading**: reads template CSS files, resolves `@custom-media` queries via `resolveCustomMedia()`, splits into variables/reset/base/layout/components/article/print
- **Post-processing**: accordion divs to `
`, bibliography injection, mermaid to `
`, htmlEmbed to `