# RAQIM — رقيم
Arabic-first document-to-Markdown conversion platform.
## Overview
pnpm workspace monorepo using TypeScript. Full-stack: React+Vite frontend, Express backend, SQLite (node:sqlite built-in) + Drizzle ORM. Deployment target: Hugging Face Spaces (Docker).
## Stack
- **Monorepo tool**: pnpm workspaces
- **Node.js version**: 24
- **Package manager**: pnpm
- **TypeScript version**: 5.9
- **API framework**: Express 5
- **Database**: SQLite via `node:sqlite` (built-in, no native compilation) + `drizzle-orm/sqlite-proxy`
- **Validation**: Zod (`zod/v4`), `drizzle-zod`
- **API codegen**: Orval (from OpenAPI spec in `artifacts/api-spec/openapi.yaml`)
- **Build**: esbuild (ESM bundle)
- **Frontend**: React 19 + Vite 7 + Tailwind CSS v4 + shadcn/ui
## Artifacts
| Artifact | Path | Port |
|---|---|---|
| `@workspace/raqim` | `/` | $PORT (23567) |
| `@workspace/api-server` | `/api` | $PORT (8080) |
## Key Commands
- `pnpm run typecheck` — full typecheck across all packages
- `pnpm run typecheck:libs` — build composite libs (run before API/frontend typecheck)
- `pnpm --filter @workspace/raqim run typecheck` — frontend only
- `pnpm --filter @workspace/api-server run typecheck` — backend only
- `pnpm --filter @workspace/api-spec run codegen` — regenerate API hooks and Zod schemas from OpenAPI spec
- `node scripts/seed-admin.mjs` — seed the admin user (run once after first startup)
## Database
- **Engine**: `node:sqlite` (Node 24 built-in, zero native deps)
- **ORM**: `drizzle-orm/sqlite-proxy` (async callback-based wrapper)
- **Dev DB**: `artifacts/api-server/raqim.db` (auto-created on startup)
- **Docker/HF DB**: `/data/raqim.db` (set via `DB_PATH` env var)
- **Migrations**: run automatically at server startup from `lib/db/drizzle/*.sql`
- **Backup**: `scripts/db-hf-sync.mjs` — git-based push/pull to a private HF Dataset
### sqlite-proxy callback contract (lib/db/src/index.ts)
- `method === "run"` → `stmt.run(...params)`, return `{ rows: [] }`
- `method === "get"` → `stmt.get(...params)`, return `{ rows: Object.values(row) }` (flat array, NOT wrapped)
- `method === "all"` → `stmt.all(...params)`, return `{ rows: result.map(r => Object.values(r)) }` (array of arrays)
- `method === "values"` → same as `all`
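The contract above can be sketched as a single dispatch function. This is a minimal sketch, not the code in `lib/db/src/index.ts`: `StatementLike`, `PrepareFn`, `executeProxyQuery`, and `makeProxyCallback` are hypothetical names, and `prepare` stands in for the `prepare` method of a `node:sqlite` database. The handling of a `get` that matches no row is an assumption, since the contract does not specify it.

```typescript
type Row = Record<string, unknown>;
type ProxyMethod = "run" | "get" | "all" | "values";

// Minimal shape of a prepared statement (hypothetical; mirrors run/get/all).
interface StatementLike {
  run(...params: unknown[]): unknown;
  get(...params: unknown[]): Row | undefined;
  all(...params: unknown[]): Row[];
}
type PrepareFn = (sql: string) => StatementLike;

function executeProxyQuery(
  prepare: PrepareFn,
  sql: string,
  params: unknown[],
  method: ProxyMethod,
): { rows: unknown[] } {
  const stmt = prepare(sql);
  switch (method) {
    case "run":
      stmt.run(...params);
      return { rows: [] };
    case "get": {
      const row = stmt.get(...params);
      // Flat array of column values. Assumption: empty array when nothing matched.
      return { rows: row ? Object.values(row) : [] };
    }
    case "all":
    case "values":
      // Array of arrays, one inner array per row.
      return { rows: stmt.all(...params).map((r) => Object.values(r)) };
  }
}

// Async wrapper in the callback shape (sql, params, method) described above.
function makeProxyCallback(prepare: PrepareFn) {
  return async (sql: string, params: unknown[], method: ProxyMethod) =>
    executeProxyQuery(prepare, sql, params, method);
}
```

Keeping the dispatch synchronous and wrapping it afterwards makes the row-shaping rules easy to unit test with a fake statement.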
### Admin seed
`node scripts/seed-admin.mjs` — inserts `admin@raqim.app` / `Admin1234!` (bcrypt hash, role=admin, status=active).
Respects `DB_PATH`, `ADMIN_EMAIL`, `ADMIN_PASSWORD`, `ADMIN_NAME` env vars.
## Database Schema (lib/db/src/schema/)
- **users** — id, email, passwordHash, displayName, role (user|admin), status (pending|active|suspended), lastLoginAt
- **refresh_tokens** — id, userId, token (unique), expiresAt
- **folders** — id, name, parentId, ownerId, trashed, trashedAt
- **files** — id, name, folderId, ownerId, originalName, originalType, sizeBytes, status (uploading|queued|processing|done|failed|trashed), markdownContent, originalMarkdown, qualityScore, wordCount, language, trashedAt
- **conversions** — id, fileId, userId, status (queued|analyzing|routing|ocr|layout|scoring|merging|cleanup|done|failed), progress, steps (jsonb), elapsedSeconds, estimatedSeconds, errorMessage
- **shares** — id, fileId, folderId, ownerId, sharedWithId, permission (read|edit|full)
All timestamps stored as `INTEGER` (milliseconds since epoch). Enums stored as `TEXT`.
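The storage conventions can be illustrated with a pair of converters and an enum check. This is a sketch only: `toDbTimestamp`, `fromDbTimestamp`, and `parseUserStatus` are hypothetical helper names, and the status values are taken from the **users** entry above.

```typescript
const USER_STATUSES = ["pending", "active", "suspended"] as const;
type UserStatus = (typeof USER_STATUSES)[number];

// Date -> INTEGER column value (milliseconds since epoch).
function toDbTimestamp(date: Date): number {
  return date.getTime();
}

// INTEGER column value -> Date.
function fromDbTimestamp(ms: number): Date {
  return new Date(ms);
}

// TEXT column value -> checked enum member; throws on unknown text.
function parseUserStatus(text: string): UserStatus {
  const match = USER_STATUSES.find((s) => s === text);
  if (match === undefined) {
    throw new Error(`Unknown user status: ${text}`);
  }
  return match;
}
```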
## Auth
- JWT accessToken (1h) in memory + refreshToken (30d) in httpOnly cookie
- `SESSION_SECRET` env var used as JWT secret
- refreshToken includes random `jti` UUID to prevent duplicate key collisions
- Admin seed: `admin@raqim.app` / `Admin1234!`
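The role of `jti` can be shown with the claims alone. A JWT signed with a fixed secret is deterministic, so two refresh tokens issued to the same user within the same second would likely be byte-identical and collide on the unique `token` column unless a random claim is added. The sketch below assumes that reading; `RefreshClaims`, `buildRefreshClaims`, and the use of `sub` for the user id are hypothetical, and signing is omitted.

```typescript
interface RefreshClaims {
  sub: string; // user id (hypothetical claim layout)
  jti: string; // random token id
  iat: number; // issued-at, seconds since epoch
  exp: number; // expiry, seconds since epoch
}

const REFRESH_TTL_SECONDS = 30 * 24 * 60 * 60; // 30 days

// newId is injected so the sketch stays testable; a UUID generator fits here.
function buildRefreshClaims(
  userId: string,
  nowMs: number,
  newId: () => string,
): RefreshClaims {
  const iat = Math.floor(nowMs / 1000);
  return { sub: userId, jti: newId(), iat, exp: iat + REFRESH_TTL_SECONDS };
}
```

In a Node server, `randomUUID` from `node:crypto` would be a natural choice for `newId`.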
## Frontend Pages
| Route | Page | Auth |
|---|---|---|
| `/` | LandingPage | public |
| `/login` | LoginPage | public |
| `/request-access` | RequestAccessPage | public |
| `/dashboard` | DashboardPage | protected |
| `/upload` | UploadPage | protected |
| `/editor/:fileId` | EditorPage | protected |
| `/convert/:jobId` | ConversionPage | protected |
| `/shared` | SharedPage | protected |
| `/trash` | TrashPage | protected |
| `/admin/users` | AdminUsersPage | admin |
| `/admin/requests` | AdminRequestsPage | admin |
| `/admin/stats` | AdminStatsPage | admin |
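The Auth column maps onto a small guard. The following is a sketch under assumptions: `Access`, `SessionUser`, and `canAccess` are hypothetical names, and the check looks only at `role`, not at account `status`.

```typescript
type Access = "public" | "protected" | "admin";
type Role = "user" | "admin";

interface SessionUser {
  role: Role;
}

// Returns true when a visitor (null = not logged in) may open a route
// with the given access level.
function canAccess(access: Access, user: SessionUser | null): boolean {
  if (access === "public") return true;
  if (user === null) return false;
  return access === "protected" || user.role === "admin";
}
```

For example, `canAccess("admin", { role: "user" })` is `false`, while any logged-in user passes the `protected` check.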
## Design System
- Dark sanctuary theme: `--background: #0d0d0f`, `--primary: #f5a623` (amber gold)
- Fonts: Fraunces (display) + Noto Naskh Arabic (body)
- All pages use `dir="rtl"` for Arabic-first layout
- shadcn/ui components with custom CSS vars
## API Routes
- `POST /api/auth/login` — login, returns accessToken + sets refresh cookie
- `POST /api/auth/refresh` — refresh accessToken using cookie
- `POST /api/auth/logout` — clear refresh cookie
- `GET/POST /api/folders` — list/create folders
- `PATCH/DELETE /api/folders/:id` — update/trash folder
- `GET/POST /api/files` — list/upload files
- `GET/PATCH/DELETE /api/files/:id` — get/update/trash file
- `GET/PUT /api/files/:id/content` — get/save markdown content
- `POST /api/files/:id/restore` — restore from trash
- `POST /api/files/:id/export` — export to md/docx/pdf
- `GET /api/files/:id/download` — download file
- `POST /api/convert` — start conversion job
- `GET /api/convert/:jobId` — poll conversion status
- `GET/POST/PATCH/DELETE /api/shares` — share management
- `GET /api/shares/search-users` — search users to share with
- `GET /api/admin/users` — list users
- `POST /api/admin/users` — create user
- `PATCH/DELETE /api/admin/users/:id` — update/delete user
- `GET /api/admin/join-requests` — list pending join requests
- `PATCH /api/admin/join-requests/:id` — approve/reject request
- `GET /api/admin/stats` — platform statistics
- `GET /api/admin/trash` — list all trashed items
- `DELETE /api/admin/trash/empty` — empty trash
## Arabic PDF Text Extraction Pipeline
Many Arabic PDFs have broken ToUnicode CMap font tables. For these files, pdfjs-dist returns the correct Unicode codepoints but in the wrong order (e.g. `امحلد هلل` instead of `الحمد لله`). The pipeline uses three stages, all completely free: no paid API keys and no page limits.
### Stage 1: pdfjs-dist text extraction
Custom RTL-aware extractor that buckets text items by Y coordinate, sorts right-to-left, and joins with gap-based space insertion. Fast, zero dependencies.
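The bucketing and joining steps can be sketched as follows. This is an illustration, not the project's extractor: `TextItem`, `extractRtlLines`, and both default thresholds are assumptions, and the items are assumed to carry PDF-style coordinates (origin at the bottom-left, so a larger `y` is higher on the page).

```typescript
interface TextItem {
  str: string;
  x: number; // left edge of the item
  y: number; // baseline position
  width: number;
}

function extractRtlLines(
  items: TextItem[],
  yTolerance = 2,
  gapThreshold = 1.5,
): string[] {
  // 1. Bucket items whose Y coordinates fall into the same band.
  const buckets = new Map<number, TextItem[]>();
  for (const item of items) {
    const key = Math.round(item.y / yTolerance);
    const bucket = buckets.get(key);
    if (bucket) bucket.push(item);
    else buckets.set(key, [item]);
  }

  // 2. Emit lines top to bottom (larger Y first).
  const keys = Array.from(buckets.keys()).sort((a, b) => b - a);
  return keys.map((key) => {
    // 3. Within a line, read right to left (larger X first).
    const line = (buckets.get(key) ?? []).slice().sort((a, b) => b.x - a.x);
    let text = "";
    let prevLeft: number | null = null;
    for (const item of line) {
      const right = item.x + item.width;
      // 4. Insert a space only when the horizontal gap is wide enough.
      if (prevLeft !== null && prevLeft - right > gapThreshold) text += " ";
      text += item.str;
      prevLeft = item.x;
    }
    return text;
  });
}
```

Rounding-based bucketing can split a line whose items straddle a band boundary; a production extractor may need a tolerance-based merge instead.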
### Stage 2: Garbling detection → VLM-OCR → Tesseract fallback
`isGarbledArabic()` checks for telltale CMap transposition patterns (` يف ` for `في`, `امحلد` for `الحمد`, bidi-wrapped Latin noise like `‎OA‏`, etc.).
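A pattern-counting check of this kind might look like the sketch below. It is not the real `isGarbledArabic()`: the pattern list contains only the examples named above, `countGarbleHits` and `looksGarbled` are hypothetical names, the `minHits` threshold is an assumption, and the bidi marks are assumed to be U+200E and U+200F.

```typescript
// Known transposition fingerprints (examples from the text above only).
const GARBLE_PATTERNS: RegExp[] = [
  / يف /g, // transposed form of the preposition
  /امحلد/g, // transposed form of the word for "praise"
  /\u200E[A-Za-z]{1,4}\u200F/g, // short Latin run wrapped in bidi marks
];

function countGarbleHits(text: string): number {
  let hits = 0;
  for (const pattern of GARBLE_PATTERNS) {
    hits += (text.match(pattern) ?? []).length;
  }
  return hits;
}

// Assumption: two or more fingerprints are enough to trigger OCR.
function looksGarbled(text: string, minHits = 2): boolean {
  return countGarbleHits(text) >= minHits;
}
```

Requiring more than one hit reduces false positives on clean text that happens to contain a single matching sequence.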
If garbled, `extractPdfViaOcr()` runs the following pipeline:
1. `pdftoppm` renders every page to PNG: **200 DPI** when `HF_TOKEN` is available (VLM-optimised), **300 DPI** for Tesseract-only mode
2. **Per page**: try VLM-OCR via the HF Inference API first (best Arabic accuracy)
   - Primary: `allenai/olmOCR-7B-0225-preview` (Allen Institute, fine-tuned on document OCR, #1 on Arabic KITAB-Bench)
   - Fallback: `Qwen/Qwen2.5-VL-7B-Instruct` (strong VLM with excellent Arabic)
3. **If the VLM fails or is rate-limited**: fall back to Tesseract.js with Arabic+English models, lazy-initialised so it is only created when needed
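The rendering and engine-ordering decisions can be captured in a small planning function. This is a sketch: `OcrPlan` and `planOcr` are hypothetical names, the actual OCR calls are omitted, and the argument list uses the standard `pdftoppm` options `-r` (resolution) and `-png`.

```typescript
interface OcrPlan {
  dpi: number;
  pdftoppmArgs: string[]; // arguments for the pdftoppm CLI
  engines: string[]; // tried in order for each page
}

function planOcr(
  pdfPath: string,
  outPrefix: string,
  hasHfToken: boolean,
): OcrPlan {
  // 200 DPI when the VLM path is available, 300 DPI for Tesseract only.
  const dpi = hasHfToken ? 200 : 300;
  const vlmModels = [
    "allenai/olmOCR-7B-0225-preview",
    "Qwen/Qwen2.5-VL-7B-Instruct",
  ];
  return {
    dpi,
    pdftoppmArgs: ["-r", String(dpi), "-png", pdfPath, outPrefix],
    engines: hasHfToken ? [...vlmModels, "tesseract.js"] : ["tesseract.js"],
  };
}
```

A caller would run `pdftoppm` with `pdftoppmArgs`, then walk `engines` for each rendered page until one returns text.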
After OCR, `cleanOcrOutput()` filters each line: drops lines >80% Latin chars with <4 Arabic chars (decorative page noise, artefacts like `Me NY 1`, `dl pl a gl`). Keeps all lines with ≥4 Arabic characters.
**No page cap** — processes every page in the document.
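The line filter described above can be sketched like this. It is not the real `cleanOcrOutput()`: `keepOcrLine` and `cleanOcrSketch` are hypothetical names, the Latin percentage is assumed to be measured against letters only (Latin plus Arabic), and "Arabic characters" is approximated by the U+0600 to U+06FF block.

```typescript
function countMatches(line: string, pattern: RegExp): number {
  return (line.match(pattern) ?? []).length;
}

function keepOcrLine(line: string): boolean {
  const arabic = countMatches(line, /[\u0600-\u06FF]/g);
  // Lines with at least 4 Arabic characters are always kept.
  if (arabic >= 4) return true;

  const latin = countMatches(line, /[A-Za-z]/g);
  const letters = latin + arabic;
  // Digits, punctuation, or blank lines are not covered by the rule: keep.
  if (letters === 0) return true;

  // Drop lines that are more than 80% Latin.
  return latin / letters <= 0.8;
}

function cleanOcrSketch(text: string): string {
  return text.split("\n").filter(keepOcrLine).join("\n");
}
```

With this definition, artefact lines such as `Me NY 1` and `dl pl a gl` are dropped because every letter in them is Latin.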
### Stage 3: AI text correction — full HF model access, 6-model fallback chain
Endpoint priority chain (tried in order, falls back on rate-limit / 5xx):
1. **Replit AI proxy** (`AI_INTEGRATIONS_OPENAI_BASE_URL`) — auto on Replit, uses `gpt-4o`
2. **HF: Qwen/Qwen3-72B** — best open-source Arabic model (Apr 2025), thinking disabled via `/no_think`
3. **HF: Qwen/Qwen3-30B-A3B** — MoE, fast and very capable
4. **HF: Qwen/Qwen2.5-72B-Instruct** — proven Arabic quality
5. **HF: meta-llama/Llama-3.3-70B-Instruct** — strong multilingual
6. **HF: mistralai/Mistral-Nemo-Instruct-2407** — fast 12B fallback
Chunk size: 3000 chars. Timeout: 120s per chunk. Temperature: 0.1. Qwen3 `<think>` blocks stripped from output. On rate-limit, `activeEpIdx` advances to next model and continues without interruption.
No `OPENAI_API_KEY` needed. `HF_TOKEN` secret (already required for HF Spaces) doubles as the AI key.
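The chunking, `<think>` stripping, and endpoint advancing can be sketched as three small pieces. Everything here is illustrative: `chunkText`, `stripThinkBlocks`, `shouldFallBack`, and `EndpointChain` are hypothetical names, the preference for cutting at a newline is an assumption, and HTTP 429 is assumed to be the rate-limit signal. The model identifiers are copied from the list above.

```typescript
const CHUNK_SIZE = 3000;

const ENDPOINTS: string[] = [
  "gpt-4o (Replit AI proxy)",
  "Qwen/Qwen3-72B",
  "Qwen/Qwen3-30B-A3B",
  "Qwen/Qwen2.5-72B-Instruct",
  "meta-llama/Llama-3.3-70B-Instruct",
  "mistralai/Mistral-Nemo-Instruct-2407",
];

// Split text into chunks of at most `size` chars, cutting at a newline
// when one exists inside the window.
function chunkText(text: string, size = CHUNK_SIZE): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + size, text.length);
    if (end < text.length) {
      const lastBreak = text.lastIndexOf("\n", end - 1);
      if (lastBreak > start) end = lastBreak + 1;
    }
    chunks.push(text.slice(start, end));
    start = end;
  }
  return chunks;
}

// Remove Qwen3 reasoning blocks from a model response.
function stripThinkBlocks(output: string): string {
  return output.replace(/<think>[\s\S]*?<\/think>/g, "").trim();
}

// Fall back on rate-limit or server errors only.
function shouldFallBack(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

class EndpointChain {
  activeEpIdx = 0;

  current(): string {
    return ENDPOINTS[this.activeEpIdx];
  }

  // Moves to the next endpoint; returns false when the chain is exhausted.
  advance(): boolean {
    if (this.activeEpIdx >= ENDPOINTS.length - 1) return false;
    this.activeEpIdx += 1;
    return true;
  }
}
```

Because `activeEpIdx` persists across chunks, later chunks start from the endpoint that last worked instead of retrying one that is already rate-limited.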
## Architect Engine — 100% Free, No External APIs, No Limits
The "Super Architect" is a fully deterministic rule-based engine (`runRuleBasedArchitect` in `convert.ts`).
Zero external API calls. Zero cost. Zero rate limits. Runs entirely on the server.
**Arabic document rules:**
- Metadata table: detects مادة/زمن/نموذج/تاريخ (even multiple fields inline on one line) → Markdown table
- Document title: promoted to `# heading` when metadata block precedes it
- Section markers: أولاً/ثانياً/ثالثاً/... → `## heading`
- Questions: س1/سؤال 1/Q1 → `**bold**` with blank lines
- Multiple choice: `أ- X ب- Y ج- Z د- W` or `a) X b) Y c) Z d) W` → vertical bullet list
- Keywords: التعليل:/الإجابة:/المطلوب:/الحل: → separate lines
- OCR noise: box-drawing chars, control chars removed
- Lone short lines (surrounded by blanks) → `### subheading`
**Latin/English document rules:**
- ALL CAPS short lines → `### heading`
- `Q1:`/`Question N:`/`Section`/`Part` → `**bold**` / `## heading`
- `a) b) c) d)` inline choices → vertical list
- Existing Markdown kept as-is
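A few of the rules above can be sketched as line-by-line rewrites. This is not `runRuleBasedArchitect`: `architectSketch` is a hypothetical name, only the section-marker, question, and inline-choice rules are covered, the section pattern lists only the three ordinals named above, and keeping the choice label inside each bullet is an assumption about the output format.

```typescript
const SECTION_MARKER = /^(أولاً|ثانياً|ثالثاً)/;
const QUESTION = /^(س\s*[0-9]+|سؤال\s*[0-9]+|Q\s*[0-9]+)/;
const LATIN_CHOICES = /^a\)\s.*\sb\)\s/;
const ARABIC_CHOICES = /^أ-\s.*\sب-\s/;

// Split an inline choice line at each label and emit one bullet per choice.
function splitChoices(line: string, beforeLabel: RegExp): string[] {
  return line.split(beforeLabel).map((part) => `- ${part.trim()}`);
}

function architectSketch(text: string): string {
  const out: string[] = [];
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (SECTION_MARKER.test(line)) {
      out.push(`## ${line}`); // section marker -> level-2 heading
    } else if (QUESTION.test(line)) {
      out.push("", `**${line}**`, ""); // question -> bold with blank lines
    } else if (LATIN_CHOICES.test(line)) {
      out.push(...splitChoices(line, /\s+(?=[a-d]\)\s)/));
    } else if (ARABIC_CHOICES.test(line)) {
      out.push(...splitChoices(line, /\s+(?=[أبجد]-\s)/));
    } else {
      out.push(line); // everything else passes through unchanged
    }
  }
  return out.join("\n");
}
```

Because each rule is a pure string transformation, the same input always yields the same Markdown, which is what makes the engine deterministic.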
## Supported File Types for Conversion
ALL file types accepted (500MB max):
- **PDF** — pdf-parse
- **DOCX/DOC** — mammoth (HTML extraction)
- **PPTX/PPT** — jszip (slide text extraction)
- **XLSX/XLS/CSV** — xlsx (table to markdown)
- **Images** (PNG, JPG, JPEG, WEBP, GIF, BMP, TIFF) — Tesseract.js OCR (Arabic+English)
- **HTML/HTM** — custom HTML-to-text
- **EPUB** — jszip (chapter text extraction)
- **TXT/MD** — direct read
- **Any other** — UTF-8 text fallback
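The routing from file extension to extractor can be sketched as a lookup table. `EXTRACTORS`, `pickExtractor`, and the label strings are hypothetical; the mapping itself follows the list above, and the 500MB limit is assumed to mean binary megabytes.

```typescript
const MAX_UPLOAD_BYTES = 500 * 1024 * 1024; // assumption: 500MB as MiB

const IMAGE_EXTS = ["png", "jpg", "jpeg", "webp", "gif", "bmp", "tiff"];

const EXTRACTORS = new Map<string, string>([
  ["pdf", "pdf-parse"],
  ["docx", "mammoth"],
  ["doc", "mammoth"],
  ["pptx", "jszip"],
  ["ppt", "jszip"],
  ["xlsx", "xlsx"],
  ["xls", "xlsx"],
  ["csv", "xlsx"],
  ["html", "html-to-text"],
  ["htm", "html-to-text"],
  ["epub", "jszip"],
  ["txt", "direct-read"],
  ["md", "direct-read"],
]);
for (const ext of IMAGE_EXTS) EXTRACTORS.set(ext, "tesseract.js");

// Pick an extractor label from the file name; unknown types fall back
// to reading the file as UTF-8 text.
function pickExtractor(fileName: string): string {
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
  return EXTRACTORS.get(ext) ?? "utf8-fallback";
}
```

A `Map` is used instead of a plain object so that extensions such as `constructor` cannot accidentally match inherited properties.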
## Export
- **Markdown (.md)** — direct download
- **Word (.docx)** — real RTL DOCX via `docx` npm package:
  - Traditional Arabic font (30pt body, 44pt headings)
  - 100% RTL bidirectional support
  - Numbered header/footer with "منسق آلياً بواسطة رقيم"
  - Proper bullet/numbered lists, tables, inline bold/italic/code
- **PDF** — currently exports as markdown (browser-based PDF printing recommended)
## Frontend Animations
Framer Motion is used throughout (already in devDependencies):
- Upload zone: spring scale on hover/drag, AnimatePresence for state transitions
- File upload progress bar with smooth motion
- Conversion steps: staggered fade-in, animated step indicators
- Live AI message cycling during processing steps
- Success/failure states with spring animations
## Hugging Face Spaces Deployment
- `Dockerfile` — multi-stage build, runs as non-root, DB at `/data/raqim.db`
- `deploy-hf.sh` — one-command deploy (requires `HF_TOKEN` + `HF_USERNAME` Replit secrets)
- `scripts/db-hf-sync.mjs` — pull/push SQLite DB to private HF Dataset (`${HF_USERNAME}/raqim-db`) via git
- Space secrets to set after deploy: `JWT_SECRET`, `SESSION_SECRET`
- No `DATABASE_URL` needed — SQLite is fully embedded
See the `pnpm-workspace` skill for workspace structure, TypeScript setup, and package details.