Commit 2b3982d · 1 parent: d10f68e
Committed by tfrere (HF Staff) and Cursor

revert(readme): restore body and title

A previous "Update README.md" commit (3557bec) was made via the HF
web UI on the carbon-tokenization Space and ended up propagating
to all three Spaces during the last force-push, stripping the
README down to just the frontmatter and renaming every Space to
"Carbon tokenization".

The OAuth frontmatter (hf_oauth + scopes) was untouched so the
auth flow is technically still wired, but having a one-line
README on Spaces meant for the editor template hides the doc
visitors land on. Restoring the full body + original title keeps
the editor and editor2 Spaces self-documenting. The
carbon-tokenization Space owner can rewrite the title via the
web UI in two clicks if they want it back to "Carbon
tokenization".

Co-authored-by: Cursor <cursoragent@cursor.com>

Files changed (1)
  1. README.md +175 -1
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- title: Carbon tokenization
+ title: Research Article Template Editor
  emoji: ✏️
  colorFrom: purple
  colorTo: blue
@@ -11,3 +11,177 @@ hf_oauth_scopes:
  - manage-repos
  - inference-api
  ---
+
+ # Research Article Template Editor
+
+ A collaborative, real-time editor for web-native scientific articles. It lets multiple authors co-write a paper with rich text, math, citations, figures and interactive D3 embeds, then publishes the result as a static HTML page (or a PDF) aligned with the [research-article-template](https://github.com/huggingface/research-article-template).
+
+ ## What it gives you
+
+ - **Real-time collaboration** over WebSocket (Y.js + Hocuspocus), with visible cursors and per-user selection colors
+ - **Rich article authoring**: headings, lists, tables, code blocks with syntax highlighting, LaTeX math (KaTeX), footnotes, sidenotes, block quotes, callouts
+ - **Research-specific blocks**: citations + bibliography (BibTeX), figures with captions, stacks / wide / full-width layouts, glossary terms, Mermaid/Wardley/architecture diagrams
+ - **Interactive D3 embeds** authored inline: each embed is a self-contained HTML file the editor can generate and iterate on via an **AI-assisted "embed studio"**
+ - **Comments & discussion** anchored on any selection
+ - **Slash menu** (`/`) and drag/drop block handles, in the spirit of Notion
+ - **Click-to-edit frontmatter**: title, subtitle, authors, affiliations, links, banner color
+ - **Publishing pipeline**: one-click export to a standalone static HTML bundle, plus PDF generation (Puppeteer) and an `llms.txt` Markdown twin for LLM agents/crawlers (served at `/llms.txt`, advertised in `/robots.txt`)
+ - **Persistence**:
+   - Local mode: documents stored on disk under `DATA_DIR`
+   - HF mode: documents pushed/pulled from a Hugging Face dataset via OAuth
+ - **Dark mode**, responsive layout (TOC drawer on mobile), live table of contents with scroll-spy
+ - **AI chat side-panel** that can edit the article via structured tool calls (agent loop over the current TipTap doc)
+
+ ## Stack
+
+ | Layer | Tech |
+ |---|---|
+ | Editor | React 18, TypeScript, TipTap v3, ProseMirror |
+ | Collaboration | Y.js, Hocuspocus (WebSocket), y-tiptap |
+ | Backend | Node.js, Express, Vite (dev proxy), Hocuspocus server |
+ | Publishing | Custom TipTap-JSON → HTML renderer, Puppeteer for PDF |
+ | AI | Vercel AI SDK v6 (`ai`, `@ai-sdk/react`) → Hugging Face Inference Providers (OpenAI-compatible router) |
+ | Styling | Plain CSS with custom properties, no framework |
+ | Storage | Local FS or Hugging Face datasets (via `@huggingface/hub`) |
+ | Container | Single-image Docker build, runs on port 8080 |
+
+ Around **3.6k LOC backend** and **9.5k LOC frontend** (TypeScript/TSX, excluding generated code).
+
+ ## Repo layout
+
+ ```
+ collab-editor/
+ ├── backend/                  # Express + Hocuspocus server, publisher, AI agent routes
+ │   └── src/
+ │       ├── server.ts         # Entry point
+ │       ├── create-app.ts     # App factory (routes, middleware, Hocuspocus)
+ │       ├── publisher/        # TipTap-JSON → HTML + PDF
+ │       ├── agent/            # LLM agent (tool calls over the doc)
+ │       ├── shared/           # Component defs shared with the frontend
+ │       └── hf-storage.ts     # HF dataset sync
+ ├── frontend/                 # Vite + React + TipTap editor
+ │   └── src/
+ │       ├── App.tsx           # Top-level shell
+ │       ├── editor/           # TipTap editor + extensions + components
+ │       ├── components/       # Shared UI pieces (TOC, Chat, Dialog, ...)
+ │       ├── hooks/            # React hooks (agent chat, selection, ...)
+ │       ├── styles/           # CSS layers (see docs/ARCHITECTURE.md)
+ │       └── utils/
+ ├── docs/
+ │   ├── ARCHITECTURE.md       # Deep dive on layers, data flow, CSS
+ │   ├── SPECIFICATION.md      # Feature spec and contracts
+ │   ├── TESTS.md              # Testing strategy
+ │   └── embed-studio.md       # How the AI-authored embeds pipeline works
+ └── Dockerfile                # Production multi-stage build
+ ```
+
+ See [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) for a diagram and the full tour.
+
+ ## Getting started
+
+ ### Prerequisites
+
+ - Node.js 20+
+ - A Hugging Face token with the `Make calls to Inference Providers` permission for the AI features (embed studio, chat agent). Generate one at https://huggingface.co/settings/tokens. On an HF Space, the logged-in user's OAuth token is used instead, so no manual setup is needed.
+ - A Hugging Face OAuth app (client id/secret) if you want login + HF dataset persistence
+
+ ### Local development
+
+ Backend and frontend run as two separate processes in dev (Vite proxies `/api`, `/collab`, `/uploads`, `/published`, `/oauth`, `/auth` to the backend).
+
+ ```bash
+ # terminal 1: backend (Express + Hocuspocus on :8080)
+ cd backend
+ cp .env.example .env # set HF_TOKEN, optional OAUTH_* and HF_DATASET_ID
+ npm install
+ npm run dev
+
+ # terminal 2: frontend (Vite on :5678)
+ cd frontend
+ npm install
+ npm run dev
+ ```
+
+ Then open http://localhost:5678. Open a second tab or browser to see collaboration in action.
+
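The dev proxy described in the local development notes can be pictured as a small map from path prefix to backend target. The sketch below is an assumption about how such a map might be built, not the project's actual `vite.config.ts`: `buildDevProxy` and `PROXIED_PREFIXES` are hypothetical names, and enabling `ws` only on `/collab` assumes that this is the prefix carrying the Hocuspocus WebSocket traffic. The `target`, `changeOrigin` and `ws` fields mirror options accepted by Vite's `server.proxy` setting.

```typescript
// Hypothetical helper: builds an object shaped like Vite's `server.proxy` option.
type ProxyRule = { target: string; changeOrigin: boolean; ws: boolean };

// Path prefixes the README says Vite forwards to the backend in dev.
const PROXIED_PREFIXES = ["/api", "/collab", "/uploads", "/published", "/oauth", "/auth"];

function buildDevProxy(backendUrl: string): Record<string, ProxyRule> {
  const rules: Record<string, ProxyRule> = {};
  for (const prefix of PROXIED_PREFIXES) {
    rules[prefix] = {
      target: backendUrl,
      changeOrigin: true,
      // Assumption: only /collab needs WebSocket upgrades (Hocuspocus).
      ws: prefix === "/collab",
    };
  }
  return rules;
}

// The backend listens on :8080 in local development.
const devProxy = buildDevProxy("http://localhost:8080");
```

In a real Vite config, an object like `devProxy` would be assigned to `server.proxy`; the exact options used by this repository may differ.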
+ ### Production (Docker / HF Spaces)
+
+ The `Dockerfile` builds both frontend and backend into a single image listening on port 8080. This is the image used by the Hugging Face Space.
+
+ ```bash
+ docker build -t collab-editor .
+ docker run -p 8080:8080 --env-file backend/.env collab-editor
+ ```
+
+ Then open http://localhost:8080.
+
+ ### Run your own copy on a Hugging Face Space
+
+ Want your own editor? One step:
+
+ 1. **Duplicate the Space.** On https://huggingface.co/spaces/tfrere/research-article-template-editor, click `⋯ → Duplicate this Space`. Pick your namespace and visibility. HF copies the Dockerfile and the OAuth wiring, then rebuilds the image automatically.
+
+ That's it. No API key to wire up. The AI features (chat agent + embed studio) call **Hugging Face Inference Providers** at `https://router.huggingface.co/v1` using the OAuth token of whoever is currently logged in. As long as your duplicated Space requests the `inference-api` scope (already declared in the README frontmatter as `hf_oauth_scopes`), every editor gets AI for free under their own Inference Providers quota.
+
+ Optional public variable: `HF_INFERENCE_MODEL` (e.g. `meta-llama/Llama-3.3-70B-Instruct`) to override the default model id. The full list of supported chat-completion models lives at https://huggingface.co/models?inference_provider=all&other=conversational.
+
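To make the token and model selection above concrete, here is a minimal sketch of how a request to the router might be assembled. It is an illustration under stated assumptions, not the backend's actual code: `resolveInferenceToken` and `buildChatRequest` are hypothetical names, the `/chat/completions` path follows the OpenAI-compatible convention, and the `HF_TOKEN` fallback and `openai/gpt-oss-120b` default are taken from the environment variable table. The function only builds the request description and performs no network call.

```typescript
const ROUTER_BASE_URL = "https://router.huggingface.co/v1";
const DEFAULT_MODEL = "openai/gpt-oss-120b";

type Env = Record<string, string | undefined>;

interface ChatRequest {
  url: string;
  headers: Record<string, string>;
  body: { model: string; messages: { role: "user"; content: string }[] };
}

// Prefer the logged-in user's OAuth token; fall back to the server-side HF_TOKEN.
function resolveInferenceToken(userOAuthToken: string | undefined, env: Env): string | undefined {
  return userOAuthToken || env["HF_TOKEN"] || undefined;
}

// Hypothetical helper: describes an OpenAI-compatible chat-completion request.
function buildChatRequest(prompt: string, userOAuthToken: string | undefined, env: Env): ChatRequest {
  const token = resolveInferenceToken(userOAuthToken, env);
  if (!token) {
    throw new Error("No Hugging Face token available: log in or set HF_TOKEN");
  }
  return {
    url: ROUTER_BASE_URL + "/chat/completions",
    headers: { Authorization: "Bearer " + token, "Content-Type": "application/json" },
    body: {
      model: env["HF_INFERENCE_MODEL"] || DEFAULT_MODEL,
      messages: [{ role: "user", content: prompt }],
    },
  };
}
```

Keeping token resolution in one small function makes the precedence (user token first, server token second) easy to test without touching the network.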
+ ## Scripts
+
+ ### Backend (`cd backend`)
+
+ | Command | What it does |
+ |---|---|
+ | `npm run dev` | Start Express + Hocuspocus in watch mode |
+ | `npm run build` | Compile TypeScript to `dist/` |
+ | `npm start` | Run the compiled server |
+ | `npm run test` | Unit + integration tests (Vitest) |
+ | `npm run test:e2e` | End-to-end tests (Playwright) |
+
+ ### Frontend (`cd frontend`)
+
+ | Command | What it does |
+ |---|---|
+ | `npm run dev` | Start Vite dev server on :5678 |
+ | `npm run build` | Production bundle to `dist/` |
+ | `npm run preview` | Preview the built bundle |
+ | `npm run test` | Unit tests (Vitest) |
+ | `npm run typecheck` | `tsc --noEmit` on the whole frontend |
+
+ ## Environment variables
+
+ Copy `backend/.env.example` to `backend/.env` and fill in the relevant values. Key ones:
+
+ | Variable | Purpose |
+ |---|---|
+ | `OAUTH_CLIENT_ID` / `OAUTH_CLIENT_SECRET` | HF OAuth app for user login (required to edit when running on a Space) |
+ | `OAUTH_SCOPES` | OAuth scopes (default `openid profile`). Add `manage-repos` for dataset persistence and `inference-api` to power the AI features with the user's token |
+ | `HF_TOKEN` | Server-side Hugging Face token. Used as a fallback when no user OAuth token is present (e.g. local dev). Needs the `Make calls to Inference Providers` permission to enable the chat agent + embed studio |
+ | `HF_INFERENCE_MODEL` | Override the default chat-completion model id (defaults to `openai/gpt-oss-120b`). Any tool-calling-capable model exposed by HF Inference Providers works |
+ | `HF_DATASET_ID` | Target HF dataset repo for document persistence (when not running on a Space) |
+ | `SPACE_ID` / `SPACE_HOST` | Auto-set by HF Spaces; drive dataset id + secure cookies in production |
+ | `DATA_DIR` | Where documents, uploads and published bundles are stored on disk (default: `./data`) |
+ | `PUBLISH_BASE_URL` | Absolute base URL used when publishing (defaults to `http://127.0.0.1:${PORT}`) |
+ | `ENABLE_PDF` | Set to `false` to disable Puppeteer-based PDF export |
+ | `PORT` | Server port (default 8080) |
+
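The defaults in the table above can be summarized as a small config loader. This is a sketch under assumptions rather than the backend's real startup code: `loadConfig` and `AppConfig` are hypothetical names, and the environment is passed in as a plain object so the example stays self-contained. Only the defaults stated in the table are encoded.

```typescript
type Env = Record<string, string | undefined>;

interface AppConfig {
  port: number;
  dataDir: string;
  publishBaseUrl: string;
  enablePdf: boolean;
  oauthScopes: string[];
  inferenceModel: string;
}

// Hypothetical loader applying the documented defaults.
function loadConfig(env: Env): AppConfig {
  const port = Number(env["PORT"] ?? 8080);
  if (!isFinite(port) || Math.floor(port) !== port || port <= 0) {
    throw new Error("Invalid PORT: " + String(env["PORT"]));
  }
  return {
    port,
    dataDir: env["DATA_DIR"] ?? "./data",
    // PUBLISH_BASE_URL defaults to the loopback address on the chosen port.
    publishBaseUrl: env["PUBLISH_BASE_URL"] ?? "http://127.0.0.1:" + port,
    // PDF export stays on unless ENABLE_PDF is exactly "false".
    enablePdf: env["ENABLE_PDF"] !== "false",
    oauthScopes: (env["OAUTH_SCOPES"] ?? "openid profile").split(/\s+/).filter(Boolean),
    inferenceModel: env["HF_INFERENCE_MODEL"] ?? "openai/gpt-oss-120b",
  };
}
```

In a Node.js process one would typically call `loadConfig(process.env)`; on a Space that wants dataset persistence and AI features, `OAUTH_SCOPES` would additionally list `manage-repos` and `inference-api`.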
+ ## Testing
+
+ - **Backend unit tests**: Vitest covers the publisher (HTML renderer, frontmatter, bibliography), storage, and auth utilities.
+ - **Backend E2E**: Playwright drives the full editor against a real backend.
+ - **Frontend unit tests**: Vitest covers chat persistence and a handful of utilities.
+ - **Type checking**: `npm run typecheck` in both workspaces.
+
+ See [`docs/TESTS.md`](docs/TESTS.md) for the current strategy and gaps.
+
+ ## Known technical debt
+
+ These are tracked explicitly so new contributors don't trip on them:
+
+ - **`useEmbedChat` still lacks dedicated unit tests**; the rest of the stores (frontmatter, comments, embeds) and the agent undo batching primitive are now covered.
+ - **Bundle size warning**: the frontend bundle is over the 500 kB Vite warning threshold. Code-splitting the Mermaid / KaTeX / D3 stacks via dynamic imports would help.
+ - **`addToolOutput` typing**: the ai-sdk v6 `ChatAddToolOutputFunction` is generic over the tool-name union. We currently cast to a plain signature at the two call sites because we don't export a typed tool registry yet.
+ - **`backend/src/publisher/html-renderer.ts` is ~1000 LOC**: a per-node-type registry would make it more maintainable.
+
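The per-node-type registry mentioned in the last item could look roughly like the sketch below. It is an illustration of the idea only, not the existing `html-renderer.ts`: `DocNode`, `NodeRenderer`, `renderers` and `renderNode` are hypothetical names, and only four standard TipTap node types (`doc`, `paragraph`, `heading`, `text`) are covered. Marks, attributes beyond the heading level, and the project's custom blocks are deliberately left out.

```typescript
// Minimal shape of a TipTap/ProseMirror JSON node (only the fields used here).
interface DocNode {
  type: string;
  text?: string;
  attrs?: Record<string, unknown>;
  content?: DocNode[];
}

// Each renderer receives the node plus a callback that renders its children.
type NodeRenderer = (node: DocNode, renderChildren: () => string) => string;

function escapeHtml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// The registry: adding a node type means adding one entry, not editing a large switch.
const renderers: Record<string, NodeRenderer> = {
  doc: (_node, renderChildren) => renderChildren(),
  paragraph: (_node, renderChildren) => "<p>" + renderChildren() + "</p>",
  heading: (node, renderChildren) => {
    const raw = Number(node.attrs?.["level"] ?? 1);
    // Clamp to the valid h1..h6 range; fall back to 1 on non-numeric input.
    const level = Math.min(6, Math.max(1, isFinite(raw) ? Math.floor(raw) : 1));
    return "<h" + level + ">" + renderChildren() + "</h" + level + ">";
  },
  text: (node) => escapeHtml(node.text ?? ""),
};

function renderNode(node: DocNode): string {
  const render = renderers[node.type];
  if (!render) {
    throw new Error('No renderer registered for node type "' + node.type + '"');
  }
  return render(node, () => (node.content ?? []).map(renderNode).join(""));
}
```

With this shape, each node type can live in its own file and be unit-tested in isolation, and an unregistered type fails loudly instead of being silently dropped.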
+ ## License
+
+ This project follows the license of the upstream [research-article-template](https://github.com/huggingface/research-article-template).