carbon-tokenization

tfrere HF Staff Cursor committed on May 20

Commit

2b3982d

1 Parent(s): d10f68e

revert(readme): restore body and title

A previous "Update README.md" commit (3557bec) was made via the HF
web UI on the carbon-tokenization Space and ended up propagating
to all three Spaces during the last force-push, stripping the
README down to just the frontmatter and renaming every Space to
"Carbon tokenization".

The OAuth frontmatter (hf_oauth + scopes) was untouched so the
auth flow is technically still wired, but having a one-line
README on Spaces meant for the editor template hides the doc
visitors land on. Restoring the full body + original title keeps
the editor and editor2 Spaces self-documenting. The
carbon-tokenization Space owner can rewrite the title via the
web UI in two clicks if they want it back to "Carbon
tokenization".

Co-authored-by: Cursor <cursoragent@cursor.com>
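
The regression described above (a README reduced to its frontmatter) is easy to detect mechanically before a force-push. The following is a minimal sketch, assuming the README uses the usual `---`-delimited YAML frontmatter; `splitReadme` and `isFrontmatterOnly` are hypothetical helper names and are not part of this repository.

```typescript
// Hypothetical helpers, not taken from the repository.
interface ReadmeParts {
  frontmatter: string; // text between the two `---` delimiters
  body: string;        // Markdown that follows the closing delimiter
}

// Split a Space README into its YAML frontmatter and its Markdown body.
function splitReadme(source: string): ReadmeParts {
  const lines = source.split(/\r?\n/);
  // A frontmatter block must open with `---` on the very first line.
  if (lines[0]?.trim() !== "---") {
    return { frontmatter: "", body: source.trim() };
  }
  // Locate the closing `---` delimiter after the opening one.
  const close = lines.findIndex((line, i) => i > 0 && line.trim() === "---");
  if (close === -1) {
    // Unterminated block: everything is frontmatter and there is no body.
    return { frontmatter: lines.slice(1).join("\n"), body: "" };
  }
  return {
    frontmatter: lines.slice(1, close).join("\n"),
    body: lines.slice(close + 1).join("\n").trim(),
  };
}

// True when the README has frontmatter but no documentation after it.
function isFrontmatterOnly(source: string): boolean {
  const { frontmatter, body } = splitReadme(source);
  return frontmatter.length > 0 && body.length === 0;
}
```

A check like this could run over each Space's `README.md` and abort the push when `isFrontmatterOnly` returns `true`; how it is wired into the workflow is left open here.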

Files changed (1)

README.md +175 -1

@@ -1,5 +1,5 @@
 ---
-title: Carbon tokenization
+title: Research Article Template Editor
 emoji: ✏️
 colorFrom: purple
 colorTo: blue
@@ -11,3 +11,177 @@ hf_oauth_scopes:
   - manage-repos
   - inference-api
 ---
+# Research Article Template Editor
+A collaborative, real-time editor for web-native scientific articles. It lets multiple authors co-write a paper with rich text, math, citations, figures and interactive D3 embeds, then publishes the result as a static HTML page (or a PDF) aligned with the [research-article-template](https://github.com/huggingface/research-article-template).
+## What it gives you
+- **Real-time collaboration** over WebSocket (Y.js + Hocuspocus), with visible cursors and per-user selection colors
+- **Rich article authoring**: headings, lists, tables, code blocks with syntax highlighting, LaTeX math (KaTeX), footnotes, sidenotes, block quotes, callouts
+- **Research-specific blocks**: citations + bibliography (BibTeX), figures with captions, stacks / wide / full-width layouts, glossary terms, Mermaid/Wardley/architecture diagrams
+- **Interactive D3 embeds** authored inline: each embed is a self-contained HTML file the editor can generate and iterate on via an **AI-assisted "embed studio"**
+- **Comments & discussion** anchored on any selection
+- **Slash menu** (`/`) and drag/drop block handles, in the spirit of Notion
+- **Click-to-edit frontmatter**: title, subtitle, authors, affiliations, links, banner color
+- **Publishing pipeline**: one-click export to a standalone static HTML bundle, plus PDF generation (Puppeteer) and an `llms.txt` Markdown twin for LLM agents/crawlers (served at `/llms.txt`, advertised in `/robots.txt`)
+- **Persistence**:
+  - Local mode: documents stored on disk under `DATA_DIR`
+  - HF mode: documents pushed/pulled from a Hugging Face dataset via OAuth
+- **Dark mode**, responsive layout (TOC drawer on mobile), live table of contents with scroll-spy
+- **AI chat side-panel** that can edit the article via structured tool calls (agent loop over the current TipTap doc)
+## Stack
+| Layer | Tech |
+|---|---|
+| Editor | React 18, TypeScript, TipTap v3, ProseMirror |
+| Collaboration | Y.js, Hocuspocus (WebSocket), y-tiptap |
+| Backend | Node.js, Express, Vite (dev proxy), Hocuspocus server |
+| Publishing | Custom TipTap-JSON → HTML renderer, Puppeteer for PDF |
+| AI | Vercel AI SDK v6 (`ai`, `@ai-sdk/react`) → Hugging Face Inference Providers (OpenAI-compatible router) |
+| Styling | Plain CSS with custom properties, no framework |
+| Storage | Local FS or Hugging Face datasets (via `@huggingface/hub`) |
+| Container | Single-image Docker build, runs on port 8080 |
+Around **3.6k LOC backend** and **9.5k LOC frontend** (TypeScript/TSX, excluding generated code).
+## Repo layout
+```
+collab-editor/
+├── backend/              # Express + Hocuspocus server, publisher, AI agent routes
+│   └── src/
+│       ├── server.ts             # Entry point
+│       ├── create-app.ts         # App factory (routes, middleware, Hocuspocus)
+│       ├── publisher/            # TipTap-JSON → HTML + PDF
+│       ├── agent/                # LLM agent (tool calls over the doc)
+│       ├── shared/               # Component defs shared with the frontend
+│       └── hf-storage.ts         # HF dataset sync
+├── frontend/             # Vite + React + TipTap editor
+│   └── src/
+│       ├── App.tsx               # Top-level shell
+│       ├── editor/               # TipTap editor + extensions + components
+│       ├── components/           # Shared UI pieces (TOC, Chat, Dialog, ...)
+│       ├── hooks/                # React hooks (agent chat, selection, ...)
+│       ├── styles/               # CSS layers (see docs/ARCHITECTURE.md)
+│       └── utils/
+├── docs/
+│   ├── ARCHITECTURE.md           # Deep dive on layers, data flow, CSS
+│   ├── SPECIFICATION.md          # Feature spec and contracts
+│   ├── TESTS.md                  # Testing strategy
+│   └── embed-studio.md           # How the AI-authored embeds pipeline works
+└── Dockerfile            # Production multi-stage build
+```
+See [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) for a diagram and the full tour.
+## Getting started
+### Prerequisites
+- Node.js 20+
+- A Hugging Face token with the `Make calls to Inference Providers` permission for the AI features (embed studio, chat agent). Generate one at https://huggingface.co/settings/tokens. On an HF Space, the logged-in user's OAuth token is used instead, so no manual setup is needed.
+- A Hugging Face OAuth app (client id/secret) if you want login + HF dataset persistence
+### Local development
+Backend and frontend run as two separate processes in dev (Vite proxies `/api`, `/collab`, `/uploads`, `/published`, `/oauth`, `/auth` to the backend).
+```bash
+# terminal 1 — backend (Express + Hocuspocus on :8080)
+cd backend
+cp .env.example .env          # set HF_TOKEN, optional OAUTH_* and HF_DATASET_ID
+npm install
+npm run dev
+# terminal 2 — frontend (Vite on :5678)
+cd frontend
+npm install
+npm run dev
+```
+Then open http://localhost:5678. Open a second tab or browser to see collaboration in action.
+### Production (Docker / HF Spaces)
+The `Dockerfile` builds both frontend and backend into a single image listening on port 8080. This is the image used by the Hugging Face Space.
+```bash
+docker build -t collab-editor .
+docker run -p 8080:8080 --env-file backend/.env collab-editor
+```
+Then open http://localhost:8080.
+### Run your own copy on a Hugging Face Space
+Want your own editor? One step:
+1. **Duplicate the Space.** On https://huggingface.co/spaces/tfrere/research-article-template-editor, click `⋯ → Duplicate this Space`. Pick your namespace and visibility. HF copies the Dockerfile, the OAuth wiring and rebuilds the image automatically.
+That's it. No API key to wire up. The AI features (chat agent + embed studio) call **Hugging Face Inference Providers** at `https://router.huggingface.co/v1` using the OAuth token of whoever is currently logged in. As long as your duplicated Space requests the `inference-api` scope (already declared in the README frontmatter as `hf_oauth_scopes`), every editor gets AI for free under their own Inference Providers quota.
+Optional public variable: `HF_INFERENCE_MODEL` (e.g. `meta-llama/Llama-3.3-70B-Instruct`) to override the default model id. The full list of supported chat-completion models lives at https://huggingface.co/models?inference_provider=all&other=conversational.
+## Scripts
+### Backend (`cd backend`)
+| Command | What it does |
+|---|---|
+| `npm run dev` | Start Express + Hocuspocus in watch mode |
+| `npm run build` | Compile TypeScript to `dist/` |
+| `npm start` | Run the compiled server |
+| `npm run test` | Unit + integration tests (Vitest) |
+| `npm run test:e2e` | End-to-end tests (Playwright) |
+### Frontend (`cd frontend`)
+| Command | What it does |
+|---|---|
+| `npm run dev` | Start Vite dev server on :5678 |
+| `npm run build` | Production bundle to `dist/` |
+| `npm run preview` | Preview the built bundle |
+| `npm run test` | Unit tests (Vitest) |
+| `npm run typecheck` | `tsc --noEmit` on the whole frontend |
+## Environment variables
+Copy `backend/.env.example` to `backend/.env` and fill the relevant values. Key ones:
+| Variable | Purpose |
+|---|---|
+| `OAUTH_CLIENT_ID` / `OAUTH_CLIENT_SECRET` | HF OAuth app for user login (required to edit when running on a Space) |
+| `OAUTH_SCOPES` | OAuth scopes (default `openid profile`). Add `manage-repos` for dataset persistence and `inference-api` to power the AI features with the user's token |
+| `HF_TOKEN` | Server-side Hugging Face token. Used as a fallback when no user OAuth token is present (e.g. local dev). Needs the `Make calls to Inference Providers` permission to enable the chat agent + embed studio |
+| `HF_INFERENCE_MODEL` | Override the default chat-completion model id (defaults to `openai/gpt-oss-120b`). Any tool-calling-capable model exposed by HF Inference Providers works |
+| `HF_DATASET_ID` | Target HF dataset repo for document persistence (when not running on a Space) |
+| `SPACE_ID` / `SPACE_HOST` | Auto-set by HF Spaces; drive dataset id + secure cookies in production |
+| `DATA_DIR` | Where documents, uploads and published bundles are stored on disk (default: `./data`) |
+| `PUBLISH_BASE_URL` | Absolute base URL used when publishing (defaults to `http://127.0.0.1:${PORT}`) |
+| `ENABLE_PDF` | Set to `false` to disable Puppeteer-based PDF export |
+| `PORT` | Server port (default 8080) |
+## Testing
+- **Backend unit tests**: Vitest covers the publisher (HTML renderer, frontmatter, bibliography), storage, auth utilities.
+- **Backend E2E**: Playwright drives the full editor against a real backend.
+- **Frontend unit tests**: Vitest covers chat persistence and a handful of utilities.
+- **Type checking**: `npm run typecheck` in both workspaces.
+See [`docs/TESTS.md`](docs/TESTS.md) for the current strategy and gaps.
+## Known technical debt
+These are tracked explicitly so new contributors don't trip on them:
+- **`useEmbedChat` still lacks dedicated unit tests**; the rest of the stores (frontmatter, comments, embeds) and the agent undo batching primitive are now covered.
+- **Bundle size warning**: the frontend bundle is over the 500 kB Vite warning threshold. Code-splitting the Mermaid / KaTeX / D3 stacks via dynamic imports would help.
+- **`addToolOutput` typing**: the ai-sdk v6 `ChatAddToolOutputFunction` is a generic over the tool name union. We currently cast to a plain signature at the two call sites because we don't export a typed tool registry yet.
+- **`backend/src/publisher/html-renderer.ts` is ~1000 LOC**: a per-node-type registry would make it more maintainable.
+## License
+Follow the upstream [research-article-template](https://github.com/huggingface/research-article-template) license.
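
The defaults listed in the README's environment variables table can be summarised as a small configuration loader. This is a sketch under stated assumptions, not the backend's actual code: `EditorConfig` and `loadConfig` are hypothetical names, `OAUTH_SCOPES` is assumed to be space-separated, and any `ENABLE_PDF` value other than `false` is assumed to leave PDF export enabled.

```typescript
// Hypothetical shape; the field names are illustrative and not from the repository.
interface EditorConfig {
  port: number;
  dataDir: string;
  publishBaseUrl: string;
  enablePdf: boolean;
  oauthScopes: string[];
  inferenceModel: string;
}

type Env = Record<string, string | undefined>;

// Resolve configuration from an environment map, applying the documented defaults.
function loadConfig(env: Env): EditorConfig {
  const rawPort = env["PORT"] ?? "8080"; // documented default: 8080
  const port = Number.parseInt(rawPort, 10);
  if (!Number.isInteger(port) || port <= 0) {
    throw new Error(`Invalid PORT: ${rawPort}`);
  }
  return {
    port,
    dataDir: env["DATA_DIR"] ?? "./data",
    // Documented default: http://127.0.0.1:${PORT}
    publishBaseUrl: env["PUBLISH_BASE_URL"] ?? `http://127.0.0.1:${port}`,
    // Assumption: only the literal string "false" disables PDF export.
    enablePdf: env["ENABLE_PDF"] !== "false",
    // Assumption: scopes are space-separated, as in the default "openid profile".
    oauthScopes: (env["OAUTH_SCOPES"] ?? "openid profile").split(/\s+/).filter(Boolean),
    inferenceModel: env["HF_INFERENCE_MODEL"] ?? "openai/gpt-oss-120b",
  };
}
```

In a Node.js process this would be called as `loadConfig(process.env)`. Taking the environment as a parameter keeps the function pure and easy to unit test.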