# Collab Editor - Project Specification
## 1. Overview
A collaborative, real-time scientific article editor deployed as a Hugging Face Space. Users write rich content (math, citations, custom components) in a TipTap-based editor synced via Yjs/Hocuspocus, then publish a self-contained static HTML article. An AI assistant helps with writing and editing.
**Stack**: React 18 + TipTap 3 + Yjs (frontend) / Express + Hocuspocus + Node 20 (backend) / Docker on HF Spaces. No CSS-in-JS - all styling via vanilla CSS custom properties.
**Relationship to `research-article-template`**: the CSS foundation, design tokens, and visual language come from the [research-article-template](https://huggingface.co/spaces/tfrere/research-article-template) project. The editor imports the template's CSS files (`_variables.css`, `_reset.css`, `_base.css`, `_layout.css`, component partials) and the publisher injects them inline into published HTML. The published output is designed to look identical to articles built with the Astro-based template.
---
## 2. Architecture
```mermaid
graph TB
subgraph browser [Browser]
SPA["React SPA TipTap + Yjs"]
end
subgraph server [Node Backend - port 8080]
Express["Express HTTP"]
Hocuspocus["Hocuspocus WebSocket"]
Publisher["Publisher Pipeline HTML + PDF"]
Agent["AI Agent HF Inference"]
Auth["HF OAuth"]
end
subgraph storage [Persistence]
LocalFS["Local FS data/*.yjs"]
HFDataset["HF Dataset articles/ published/"]
end
SPA -->|"WebSocket /collab"| Hocuspocus
SPA -->|"REST /api/*"| Express
Hocuspocus -->|"Database ext"| LocalFS
LocalFS -->|"schedulePush"| HFDataset
HFDataset -->|"pullDocument"| LocalFS
Express --> Publisher
Express --> Agent
Express --> Auth
Publisher --> LocalFS
Publisher -->|"uploadPublishedAssets"| HFDataset
```
**Single process in production**: the backend serves the Vite-built frontend, all REST APIs, the WebSocket collab channel, and static published articles. No reverse proxy needed.
---
## 3. Data Model (Yjs Shared Types)
The entire collaborative state lives in a single `Y.Doc`:
- **`Y.XmlFragment("default")`** - TipTap document content (ProseMirror nodes synced via Collaboration extension)
- **`Y.Map("frontmatter")`** - scalar metadata: `title`, `subtitle`, `description`, `published`, `doi`, `template`, `licence`
- **`Y.Array("frontmatter.authors")`** - `{ name, url?, affiliations: number[] }[]`
- **`Y.Array("frontmatter.affiliations")`** - `{ name, url? }[]`
- **`Y.Map("citations")`** - CSL-JSON entries keyed by citation ID
- **`Y.Map("settings")`** - `citationStyle`, `primaryHue`, and future editor preferences
- **`Y.Map("comments")`** - comment threads keyed by `commentId`, each with author/text/resolved
All types are concurrently editable by multiple users and persist to `data/default.yjs`.
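To make the shapes above concrete, here is a minimal sketch of the author and affiliation entries as plain TypeScript types, with a small consistency check. The interface names and the `findDanglingAffiliations` helper are hypothetical, and the sketch assumes each number in `affiliations` is a 0-based index into the `frontmatter.affiliations` array, which the spec does not state.

```typescript
// Plain-object view of entries stored in the shared arrays.
// Interface names are illustrative, not taken from the codebase.
interface Affiliation {
  name: string;
  url?: string;
}

interface Author {
  name: string;
  url?: string;
  affiliations: number[];
}

// Assumption: each number is a 0-based index into the affiliations array.
// Returns one human-readable message per index that points nowhere.
function findDanglingAffiliations(
  authors: Author[],
  affiliations: Affiliation[],
): string[] {
  const problems: string[] = [];
  for (const author of authors) {
    for (const idx of author.affiliations) {
      const inRange =
        Number.isInteger(idx) && idx >= 0 && idx < affiliations.length;
      if (!inRange) {
        problems.push(`${author.name}: affiliation index ${idx} is out of range`);
      }
    }
  }
  return problems;
}
```

Because authors and affiliations live in two independently edited arrays, a check of this kind could run before publishing to catch an author still pointing at a removed affiliation.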
---
## 4. Backend Components
### 4.1 HTTP Routes
| Method | Path | Auth | Purpose |
|--------|------|------|---------|
| `GET` | `/oauth/authorize` | Public | Redirect to HF OAuth |
| `GET` | `/auth/callback` | Public (CSRF state) | Exchange code, set cookie, redirect to `/editor` |
| `GET` | `/api/auth/status` | Cookie | Return `{ authenticated, canEdit, user }` |
| `POST` | `/api/chat` | OAuth (optional) | Stream AI agent responses (HF Inference Providers) |
| `POST` | `/api/publish` | OAuth (canEdit) | Run publish pipeline, generate HTML/PDF |
| `POST` | `/api/admin/reset-document` | OAuth (canEdit) | Delete local `.yjs`, close connections |
| `POST` | `/api/upload` | None (uses cookie for HF) | Upload image (multipart, max 10MB) |
| `POST` | `/api/citations/resolve` | None | Resolve DOI/URL to CSL-JSON |
| `POST` | `/api/citations/format` | None | Format entries to HTML bibliography |
| `POST` | `/api/citations/import-bib` | None | Parse BibTeX to CSL-JSON |
| `GET` | `/editor` | OAuth (canEdit) | Serve SPA (or login page) |
| `GET` | `*` | Public | Serve published article (or login page) |
### 4.2 WebSocket Collaboration
- Upgrade on `/collab` only; all other paths rejected
- Single document: `DEFAULT_DOC_NAME = "default"`
- Hocuspocus `onAuthenticate`: validates OAuth token if enabled, checks `canEdit`
- `Database` extension: `fetch` reads local `.yjs` or pulls from HF; `store` writes local + schedules HF push (10s debounce)
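The `fetch`/`store` behaviour described in the last bullet can be sketched with the storage operations injected as an interface. This is an illustration under stated assumptions, not the actual extension code: `DocBackend`, `fetchDocument`, and `storeDocument` are hypothetical names, and writing a pulled document back to the local file is inferred from the `pullDocument` arrow in the architecture diagram.

```typescript
// Hypothetical abstraction over the local file system and the HF dataset.
interface DocBackend {
  readLocal(docName: string): Uint8Array | null;
  writeLocal(docName: string, state: Uint8Array): void;
  pullFromHub(docName: string): Promise<Uint8Array | null>;
  schedulePush(docName: string): void;
}

// fetch: prefer the local .yjs file, otherwise pull from the dataset
// and cache the result locally for the next load.
async function fetchDocument(
  backend: DocBackend,
  docName: string,
): Promise<Uint8Array | null> {
  const local = backend.readLocal(docName);
  if (local) return local;
  const remote = await backend.pullFromHub(docName);
  if (remote) backend.writeLocal(docName, remote);
  return remote;
}

// store: write locally first, then arm the debounced cloud push.
function storeDocument(
  backend: DocBackend,
  docName: string,
  state: Uint8Array,
): void {
  backend.writeLocal(docName, state);
  backend.schedulePush(docName);
}
```

Writing locally before scheduling the push keeps the on-disk snapshot current even if the later upload fails.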
### 4.3 HF Storage
- Dataset ID: `HF_DATASET_ID` or `{SPACE_ID}-data`
- Dataset is created **private by default** (`createRepo({ private: true })`). The OAuth grant needs `manage-repos` on the user's first write; subsequent containers reuse the cached token.
- Token: `HF_TOKEN` (env) or cached OAuth token from last authenticated user
- **Documents**: `articles/.yjs` - debounced push on every Hocuspocus store
- **Published assets**: `published//{index.html, article.pdf, thumb.jpg, meta.json, llms.txt}`
- **Images**: `images/` referenced from articles via `/d/images/...` proxy URLs
- `flushAll()` on `SIGTERM`/`SIGINT` to push pending changes
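The debounced push and the shutdown flush can be sketched as a small scheduler. This is an assumption-laden illustration rather than the code in `hf-storage.ts`: the `PushScheduler` class, its method names other than `flushAll`, and the injected `pushFn` are hypothetical, and the default delay simply follows the 10s figure given for the `store` hook.

```typescript
type PushFn = (docName: string) => Promise<void>;

// Hypothetical per-document debouncer: each new store resets the timer,
// so only the last write in a burst triggers an upload.
class PushScheduler {
  private timers = new Map<string, ReturnType<typeof setTimeout>>();
  private pushFn: PushFn;
  private delayMs: number;

  constructor(pushFn: PushFn, delayMs = 10000) {
    this.pushFn = pushFn;
    this.delayMs = delayMs;
  }

  schedulePush(docName: string): void {
    const existing = this.timers.get(docName);
    if (existing !== undefined) clearTimeout(existing);
    const timer = setTimeout(() => {
      this.timers.delete(docName);
      // A real implementation would record the failure; this sketch drops it.
      this.pushFn(docName).catch(() => undefined);
    }, this.delayMs);
    this.timers.set(docName, timer);
  }

  pendingDocs(): string[] {
    return Array.from(this.timers.keys());
  }

  // Cancel all timers and push every pending document immediately.
  async flushAll(): Promise<void> {
    const docs = this.pendingDocs();
    this.timers.forEach((timer) => clearTimeout(timer));
    this.timers.clear();
    await Promise.all(docs.map((doc) => this.pushFn(doc)));
  }
}
```

A signal handler would then await `flushAll()` before exiting, so edits made within the debounce window are not lost when the container stops.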
#### 4.3.1 Storage Status & Recovery
The persistence pipeline used to fail silently in several places (`createRepo` 403 on a missing scope, `uploadFile` 5xx mid-debounce, `writeFileSync` on a read-only FS, ...) while the editor kept showing "Saved". To make storage failures visible and recoverable:
- **In-memory tracker** in `hf-storage.ts` records `datasetReady`, `lastLocalSaveAt`, `lastCloudPushAt`, `pendingPush`, `lastError {stage, message, statusCode, at, docName}`. Every write path updates it; every error path records the failure.
- **`GET /api/storage/status`** exposes the tracker (canEdit-gated). The frontend `SyncIndicator` polls it every 5s and displays a three-state badge: green "Saved" / amber "Saving..." / **red "Storage error"** (pulsing, with the exact reason in the tooltip + actionable hint for the 403 / missing-scope case).
- **Eager `ensureDatasetExists`** on first `/api/auth/status` for a canEdit user. A misconfigured fork now surfaces its error within ~10s of login instead of waiting for an edit + 12s debounce cycle.
- **`beforeunload` guard** on the editor: if a local edit is in flight, a push is armed, the WS is offline, or the tracker reports an error, the browser pops the standard "Leave site?" confirm.
- **`GET /api/admin/export-doc`** (canEdit-gated) streams the on-disk `.yjs` snapshot as a download. This is the escape hatch for disaster recovery: when the cloud push has been failing and the container is about to rebuild, an admin can download the document bytes manually.
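The tracker fields, the three-state badge, and the `beforeunload` conditions listed above can be sketched as pure functions. The field names come from the list; everything else is an assumption: the type names, the use of epoch-millisecond timestamps, the precedence of error over pending, and the function names `badgeFor` and `shouldWarnBeforeUnload` are hypothetical.

```typescript
// Field names follow the tracker description; types are assumed.
interface TrackerError {
  stage: string;
  message: string;
  statusCode?: number;
  at: number; // assumed: epoch milliseconds
  docName?: string;
}

interface TrackerSnapshot {
  datasetReady: boolean;
  lastLocalSaveAt: number | null;
  lastCloudPushAt: number | null;
  pendingPush: boolean;
  lastError: TrackerError | null;
}

type Badge = "saved" | "saving" | "error";

// Assumed precedence: an error outranks a pending push.
function badgeFor(snapshot: TrackerSnapshot): Badge {
  if (snapshot.lastError) return "error";
  if (snapshot.pendingPush) return "saving";
  return "saved";
}

// The four conditions under which the editor asks before leaving the page.
function shouldWarnBeforeUnload(
  localEditInFlight: boolean,
  wsOffline: boolean,
  snapshot: TrackerSnapshot,
): boolean {
  return (
    localEditInFlight ||
    snapshot.pendingPush ||
    wsOffline ||
    snapshot.lastError !== null
  );
}
```

Keeping both decisions as pure functions of the polled snapshot means the indicator and the unload guard cannot disagree about whether data is at risk.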
#### 4.3.2 Dataset Reverse Proxy (`/d/*`)
Since the dataset is private, anonymous viewers of a published article can't fetch its images / PDF / og:image directly from `huggingface.co/datasets/...`. The editor server exposes `GET /d/:path*` as an authenticated reverse proxy:
- **Whitelist**: only `images/` and `published/` are reachable; `articles/` (raw `.yjs` drafts) is **always 404** regardless of caller.
- **Token cascade**: request cookie → cached user token → `HF_TOKEN` env → anonymous fetch. The cookie token is also promoted into the cache opportunistically, so the first signed-in viewer warms the proxy for subsequent anonymous viewers within the same container lifetime.
- **Streaming**: WHATWG body piped straight to the Express response - no buffering of full PDFs in Node memory.
- **Caching**: `images/*` is served as `immutable, max-age=1y` (UUID names, never overwritten); `published/*` as `max-age=300, stale-while-revalidate=60` (re-published in place).
- **Error mapping**: upstream 401/403 collapse to 502 so the browser never gets prompted for credentials it can't supply; upstream 404 passes through.
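The whitelist, cache, token, and error rules above are simple enough to sketch as pure functions. The function names are hypothetical, the `..` rejection is an extra precaution not mentioned in the spec, and `max-age=1y` is written out as 31536000 seconds.

```typescript
type ProxyDecision =
  | { allowed: false }
  | { allowed: true; cacheControl: string };

// `path` is the portion of the URL after `/d/`.
function classifyProxyPath(path: string): ProxyDecision {
  const clean = path.replace(/^\/+/, "");
  // Precaution (not in the spec): refuse traversal segments outright.
  if (clean.split("/").indexOf("..") !== -1) return { allowed: false };
  if (clean.startsWith("images/")) {
    return { allowed: true, cacheControl: "max-age=31536000, immutable" };
  }
  if (clean.startsWith("published/")) {
    return { allowed: true, cacheControl: "max-age=300, stale-while-revalidate=60" };
  }
  // Everything else, including articles/, is answered with 404.
  return { allowed: false };
}

// Token cascade: cookie, then cached user token, then HF_TOKEN.
// `undefined` means the upstream fetch is made anonymously.
function pickToken(
  cookieToken?: string,
  cachedToken?: string,
  envToken?: string,
): string | undefined {
  return cookieToken || cachedToken || envToken || undefined;
}

// Upstream 401/403 collapse to 502; other statuses pass through.
function mapUpstreamStatus(code: number): number {
  return code === 401 || code === 403 ? 502 : code;
}
```

Deciding the whitelist before any token is chosen keeps `articles/` unreachable even for a caller holding a valid cookie.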
### 4.4 Publisher Pipeline
```mermaid
flowchart LR
YDoc["Y.Doc (.yjs)"] --> Extract["extractFromYDoc frontmatter + JSON"]
Extract --> GenHTML["generateHTML @tiptap/html"]
GenHTML --> PostProc["postProcess accordion, biblio, mermaid, htmlEmbed"]
PostProc --> Render["renderArticleHTML full HTML page"]
CSS["loadCSS template styles"] --> Render
Render --> LocalWrite["Write local index.html"]
Render --> PDF["Playwright PDF + thumbnail"]
LocalWrite --> HFUpload["uploadPublishedAssets HF dataset"]
```
- **CSS loading**: reads template CSS files, resolves `@custom-media` queries via `resolveCustomMedia()`, splits into variables/reset/base/layout/components/article/print
- **Post-processing**: accordion divs to ``, bibliography injection, mermaid to `