
RAQIM — رقيم

Arabic-first document-to-Markdown conversion platform.

Overview

pnpm workspace monorepo using TypeScript. Full-stack: React+Vite frontend, Express backend, SQLite (node:sqlite built-in) + Drizzle ORM. Deployment target: Hugging Face Spaces (Docker).

Stack

  • Monorepo tool: pnpm workspaces
  • Node.js version: 24
  • Package manager: pnpm
  • TypeScript version: 5.9
  • API framework: Express 5
  • Database: SQLite via node:sqlite (built-in, no native compilation) + drizzle-orm/sqlite-proxy
  • Validation: Zod (zod/v4), drizzle-zod
  • API codegen: Orval (from OpenAPI spec in artifacts/api-spec/openapi.yaml)
  • Build: esbuild (ESM bundle)
  • Frontend: React 19 + Vite 7 + Tailwind CSS v4 + shadcn/ui

Artifacts

| Artifact | Path | Port |
| --- | --- | --- |
| @workspace/raqim | / | $PORT (23567) |
| @workspace/api-server | /api | $PORT (8080) |

Key Commands

  • pnpm run typecheck — full typecheck across all packages
  • pnpm run typecheck:libs — build composite libs (run before API/frontend typecheck)
  • pnpm --filter @workspace/raqim run typecheck — frontend only
  • pnpm --filter @workspace/api-server run typecheck — backend only
  • pnpm --filter @workspace/api-spec run codegen — regenerate API hooks and Zod schemas from OpenAPI spec
  • node scripts/seed-admin.mjs — seed the admin user (run once after first startup)

Database

  • Engine: node:sqlite (Node 24 built-in, zero native deps)
  • ORM: drizzle-orm/sqlite-proxy (async callback-based wrapper)
  • Dev DB: artifacts/api-server/raqim.db (auto-created on startup)
  • Docker/HF DB: /data/raqim.db (set via DB_PATH env var)
  • Migrations: run automatically at server startup from lib/db/drizzle/*.sql
  • Backup: scripts/db-hf-sync.mjs — git-based push/pull to a private HF Dataset

sqlite-proxy callback contract (lib/db/src/index.ts)

  • method === "run" → stmt.run(...params), return { rows: [] }
  • method === "get" → stmt.get(...params), return { rows: Object.values(row) } (flat array, NOT wrapped)
  • method === "all" → stmt.all(...params), return { rows: result.map(r => Object.values(r)) } (array of arrays)
  • method === "values" → same as all
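
The contract above can be sketched as a single dispatch function. This is a minimal illustration, not the code in lib/db/src/index.ts: StatementLike, ProxyResult and executeForProxy are hypothetical names, and the handling of a missing row in the "get" case is an assumption that the source does not state.

```typescript
// Hypothetical minimal statement shape, declared locally so the sketch is
// self-contained; node:sqlite prepared statements offer run/get/all.
type Row = Record<string, unknown>;

interface StatementLike {
  run(...params: unknown[]): unknown;
  get(...params: unknown[]): Row | undefined;
  all(...params: unknown[]): Row[];
}

type ProxyMethod = "run" | "get" | "all" | "values";

interface ProxyResult {
  rows: unknown[] | undefined;
}

// Maps one sqlite-proxy call onto the statement, following the contract above.
function executeForProxy(
  stmt: StatementLike,
  params: unknown[],
  method: ProxyMethod,
): ProxyResult {
  switch (method) {
    case "run":
      stmt.run(...params);
      return { rows: [] };
    case "get": {
      const row = stmt.get(...params);
      // Assumption: a missing row is reported as undefined rows.
      return { rows: row === undefined ? undefined : Object.values(row) };
    }
    case "all":
    case "values":
      // One inner array of column values per result row.
      return { rows: stmt.all(...params).map((r) => Object.values(r)) };
  }
}
```

The important detail is the shape difference: "get" yields one flat array of column values, while "all" and "values" yield an array of such arrays.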

Admin seed

node scripts/seed-admin.mjs — inserts admin@raqim.app / Admin1234! (bcrypt hash, role=admin, status=active). Respects DB_PATH, ADMIN_EMAIL, ADMIN_PASSWORD, ADMIN_NAME env vars.

Database Schema (lib/db/src/schema/)

  • users — id, email, passwordHash, displayName, role (user|admin), status (pending|active|suspended), lastLoginAt
  • refresh_tokens — id, userId, token (unique), expiresAt
  • folders — id, name, parentId, ownerId, trashed, trashedAt
  • files — id, name, folderId, ownerId, originalName, originalType, sizeBytes, status (uploading|queued|processing|done|failed|trashed), markdownContent, originalMarkdown, qualityScore, wordCount, language, trashedAt
  • conversions — id, fileId, userId, status (queued|analyzing|routing|ocr|layout|scoring|merging|cleanup|done|failed), progress, steps (jsonb), elapsedSeconds, estimatedSeconds, errorMessage
  • shares — id, fileId, folderId, ownerId, sharedWithId, permission (read|edit|full)

All timestamps stored as INTEGER (milliseconds since epoch). Enums stored as TEXT.
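
As an illustration of that storage convention, the helpers below convert between Date values and INTEGER milliseconds and validate a TEXT enum on the way out of the database. The helper names are hypothetical and do not come from the schema files; only the role values are taken from the users table above.

```typescript
// Role values as listed for the users table; stored as TEXT in SQLite.
const USER_ROLES = ["user", "admin"] as const;
type UserRole = (typeof USER_ROLES)[number];

// Date -> INTEGER column value (milliseconds since epoch).
function toEpochMs(date: Date): number {
  return date.getTime();
}

// INTEGER column value -> Date.
function fromEpochMs(ms: number): Date {
  return new Date(ms);
}

// TEXT column value -> typed enum, rejecting anything outside the list.
function parseUserRole(value: string): UserRole {
  const found = USER_ROLES.find((role) => role === value);
  if (found === undefined) {
    throw new Error(`Unknown role: ${value}`);
  }
  return found;
}
```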

Auth

  • JWT accessToken (1h) in memory + refreshToken (30d) in httpOnly cookie
  • SESSION_SECRET env var used as JWT secret
  • refreshToken includes random jti UUID to prevent duplicate key collisions
  • Admin seed: admin@raqim.app / Admin1234!

Frontend Pages

| Route | Page | Auth |
| --- | --- | --- |
| / | LandingPage | public |
| /login | LoginPage | public |
| /request-access | RequestAccessPage | public |
| /dashboard | DashboardPage | protected |
| /upload | UploadPage | protected |
| /editor/:fileId | EditorPage | protected |
| /convert/:jobId | ConversionPage | protected |
| /shared | SharedPage | protected |
| /trash | TrashPage | protected |
| /admin/users | AdminUsersPage | admin |
| /admin/requests | AdminRequestsPage | admin |
| /admin/stats | AdminStatsPage | admin |

Design System

  • Dark sanctuary theme: --background: #0d0d0f, --primary: #f5a623 (amber gold)
  • Fonts: Fraunces (display) + Noto Naskh Arabic (body)
  • All pages use dir="rtl" for Arabic-first layout
  • shadcn/ui components with custom CSS vars

API Routes

  • POST /api/auth/login — login, returns accessToken + sets refresh cookie
  • POST /api/auth/refresh — refresh accessToken using cookie
  • POST /api/auth/logout — clear refresh cookie
  • GET/POST /api/folders — list/create folders
  • PATCH/DELETE /api/folders/:id — update/trash folder
  • GET/POST /api/files — list/upload files
  • GET/PATCH/DELETE /api/files/:id — get/update/trash file
  • GET/PUT /api/files/:id/content — get/save markdown content
  • POST /api/files/:id/restore — restore from trash
  • POST /api/files/:id/export — export to md/docx/pdf
  • GET /api/files/:id/download — download file
  • POST /api/convert — start conversion job
  • GET /api/convert/:jobId — poll conversion status
  • GET/POST/PATCH/DELETE /api/shares — share management
  • GET /api/shares/search-users — search users to share with
  • GET /api/admin/users — list users
  • POST /api/admin/users — create user
  • PATCH/DELETE /api/admin/users/:id — update/delete user
  • GET /api/admin/join-requests — list pending join requests
  • PATCH /api/admin/join-requests/:id — approve/reject request
  • GET /api/admin/stats — platform statistics
  • GET /api/admin/trash — list all trashed items
  • DELETE /api/admin/trash/empty — empty trash

Arabic PDF Text Extraction Pipeline

Many Arabic PDFs have broken ToUnicode CMap font tables. For these files pdfjs-dist returns the correct Unicode codepoints but in the wrong order (e.g. امحلد هلل instead of الحمد لله). The pipeline uses three stages, all completely free: no paid API keys and no page limits.

Stage 1: pdfjs-dist text extraction

Custom RTL-aware extractor that buckets text items by Y coordinate, sorts right-to-left, and joins with gap-based space insertion. Fast, zero dependencies.
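
The bucketing and joining steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's extractor: the TextItem shape, the function names and both thresholds (yTolerance, gapThreshold) are hypothetical. With pdfjs-dist, x and y would typically be read from each text item's transform.

```typescript
// Hypothetical simplified text item: position of the left edge plus width.
interface TextItem {
  str: string;
  x: number;
  y: number;
  width: number;
}

interface LineBucket {
  y: number;
  items: TextItem[];
}

// Joins one line right-to-left, inserting a space where the horizontal gap
// between neighbouring items exceeds gapThreshold.
function joinRtl(items: TextItem[], gapThreshold: number): string {
  const ordered = [...items].sort((a, b) => b.x - a.x); // rightmost first
  let out = "";
  let prev: TextItem | undefined;
  for (const cur of ordered) {
    if (prev !== undefined) {
      const gap = prev.x - (cur.x + cur.width);
      if (gap > gapThreshold) out += " ";
    }
    out += cur.str;
    prev = cur;
  }
  return out;
}

// Buckets items by Y coordinate (within yTolerance), top of page first,
// then joins each bucket right-to-left.
function extractRtlLines(
  items: TextItem[],
  yTolerance = 2,
  gapThreshold = 1.5,
): string[] {
  const sorted = [...items].sort((a, b) => b.y - a.y); // PDF y grows upward
  const buckets: LineBucket[] = [];
  for (const item of sorted) {
    const last = buckets[buckets.length - 1];
    if (last !== undefined && Math.abs(last.y - item.y) <= yTolerance) {
      last.items.push(item);
    } else {
      buckets.push({ y: item.y, items: [item] });
    }
  }
  return buckets.map((bucket) => joinRtl(bucket.items, gapThreshold));
}
```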

Stage 2: Garbling detection → VLM-OCR → Tesseract fallback

isGarbledArabic() checks for telltale CMap transposition patterns (يف for في, امحلد for الحمد, bidi-wrapped Latin noise like ‎OA‏, etc.).
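
A detector of this kind might look like the sketch below. It only covers the two transposed tokens and the bidi-wrapped Latin pattern named above; the real isGarbledArabic() checks more patterns, and the minHits threshold and the exact shape of the noise regex are assumptions made for illustration.

```typescript
// Transposed forms named in the text: يف for في and امحلد for الحمد.
const GARBLED_TOKENS: ReadonlySet<string> = new Set(["يف", "امحلد"]);

// Short Latin run wrapped in LEFT-TO-RIGHT MARK / RIGHT-TO-LEFT MARK.
// The 1-3 letter length limit is an assumption.
const BIDI_WRAPPED_LATIN = /\u200E[A-Za-z]{1,3}\u200F/;

function isGarbledArabic(text: string, minHits = 1): boolean {
  if (BIDI_WRAPPED_LATIN.test(text)) return true;
  let hits = 0;
  for (const rawToken of text.split(/\s+/)) {
    // Keep only Arabic-block characters so attached punctuation is ignored.
    const token = rawToken.replace(/[^\u0600-\u06FF]/g, "");
    if (GARBLED_TOKENS.has(token)) hits += 1;
  }
  return hits >= minHits;
}
```

Matching whole tokens rather than substrings avoids flagging correct words that merely contain one of these letter sequences.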

If garbled, extractPdfViaOcr() runs the following pipeline:

  1. pdftoppm renders every page to PNG: 200 DPI when HF_TOKEN is available (VLM-optimised), 300 DPI in Tesseract-only mode
  2. Per page: try VLM-OCR via HF Inference API first (best Arabic accuracy)
    • Primary: allenai/olmOCR-7B-0225-preview — Allen Institute, fine-tuned on document OCR, #1 on Arabic KITAB-Bench
    • Fallback: Qwen/Qwen2.5-VL-7B-Instruct — strong VLM with excellent Arabic
  3. If the VLM call fails or is rate-limited: fall back to Tesseract.js with Arabic+English models, lazily initialised (created only when needed)
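
The rendering and engine-selection choices in the list above can be summarised in a small planning helper. This is a sketch, not extractPdfViaOcr() itself: OcrPlan and planOcr are hypothetical names, and the pdftoppm arguments use the standard -png and -r (resolution in DPI) options.

```typescript
const VLM_PRIMARY = "allenai/olmOCR-7B-0225-preview";
const VLM_FALLBACK = "Qwen/Qwen2.5-VL-7B-Instruct";
const TESSERACT = "tesseract.js";

interface OcrPlan {
  dpi: number;
  engines: string[]; // tried in order for each page
  pdftoppmArgs: string[];
}

// Picks the render DPI and the per-page engine order from HF_TOKEN presence.
function planOcr(
  pdfPath: string,
  outPrefix: string,
  hfToken: string | undefined,
): OcrPlan {
  const useVlm = hfToken !== undefined && hfToken.length > 0;
  const dpi = useVlm ? 200 : 300;
  const engines = useVlm ? [VLM_PRIMARY, VLM_FALLBACK, TESSERACT] : [TESSERACT];
  return {
    dpi,
    engines,
    pdftoppmArgs: ["-png", "-r", String(dpi), pdfPath, outPrefix],
  };
}
```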

After OCR, cleanOcrOutput() filters each line: it drops lines that are more than 80% Latin characters and contain fewer than 4 Arabic characters (decorative page noise and artefacts such as Me NY 1 or dl pl a gl). All lines with at least 4 Arabic characters are kept.
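
A minimal sketch of such a line filter follows. The source does not say what the 80% is measured against; this version assumes the ratio is taken over Latin plus Arabic characters, so digits and punctuation do not count. The helper names countMatches and isNoiseLine are hypothetical.

```typescript
function countMatches(text: string, pattern: RegExp): number {
  const matches = text.match(pattern);
  return matches === null ? 0 : matches.length;
}

// True for lines that are mostly Latin and carry almost no Arabic.
function isNoiseLine(line: string): boolean {
  const arabic = countMatches(line, /[\u0600-\u06FF]/g); // Arabic block
  const latin = countMatches(line, /[A-Za-z]/g);
  const total = arabic + latin;
  if (total === 0) return false; // blank or digit-only lines are kept
  // Assumption: the 80% threshold is measured over Latin + Arabic characters.
  return arabic < 4 && latin / total > 0.8;
}

function cleanOcrOutput(text: string): string {
  return text
    .split("\n")
    .filter((line) => !isNoiseLine(line))
    .join("\n");
}
```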

No page cap — processes every page in the document.

Stage 3: AI text correction — full HF model access, 6-model fallback chain

Endpoint priority chain (tried in order, falls back on rate-limit / 5xx):

  1. Replit AI proxy (AI_INTEGRATIONS_OPENAI_BASE_URL) — auto on Replit, uses gpt-4o
  2. HF: Qwen/Qwen3-72B — best open-source Arabic model (Apr 2025), thinking disabled via /no_think
  3. HF: Qwen/Qwen3-30B-A3B — MoE, fast and very capable
  4. HF: Qwen/Qwen2.5-72B-Instruct — proven Arabic quality
  5. HF: meta-llama/Llama-3.3-70B-Instruct — strong multilingual
  6. HF: mistralai/Mistral-Nemo-Instruct-2407 — fast 12B fallback

Chunk size: 3000 chars. Timeout: 120s per chunk. Temperature: 0.1. Qwen3 <think> blocks are stripped from the output. On rate-limit, activeEpIdx advances to the next model and processing continues without interruption.
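
The chunking, <think> stripping and endpoint-advance behaviour described above can be sketched with three small helpers. These are illustrative only: the function names are hypothetical, the fixed-size slicing may differ from how the server picks chunk boundaries, and treating HTTP 429 as the rate-limit signal is an assumption.

```typescript
const CHUNK_SIZE = 3000;

// Splits text into consecutive chunks of at most `size` characters.
function chunkText(text: string, size: number = CHUNK_SIZE): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

// Removes Qwen3-style <think>...</think> blocks from a model response.
function stripThinkBlocks(output: string): string {
  return output.replace(/<think>[\s\S]*?<\/think>/g, "").trim();
}

// Rate-limit (assumed to be HTTP 429) or any 5xx triggers a fallback.
function shouldFallBack(httpStatus: number): boolean {
  return httpStatus === 429 || (httpStatus >= 500 && httpStatus <= 599);
}

// Returns the endpoint index to use for the next attempt.
function nextEndpointIndex(activeEpIdx: number, httpStatus: number): number {
  return shouldFallBack(httpStatus) ? activeEpIdx + 1 : activeEpIdx;
}
```

A caller would stop retrying once the returned index reaches the length of the endpoint list; what the server does at that point is not described in the source.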

No OPENAI_API_KEY needed. HF_TOKEN secret (already required for HF Spaces) doubles as the AI key.

Architect Engine — 100% Free, No External APIs, No Limits

The "Super Architect" is a fully deterministic rule-based engine (runRuleBasedArchitect in convert.ts). Zero external API calls. Zero cost. Zero rate limits. Runs entirely on the server.

Arabic document rules:

  • Metadata table: detects مادة/زمن/نموذج/تاريخ (even multiple fields inline on one line) → Markdown table
  • Document title: promoted to # heading when metadata block precedes it
  • Section markers: أولاً/ثانياً/ثالثاً/... → ## heading
  • Questions: س1/سؤال 1/Q1 → **bold** with blank lines
  • Multiple choice: أ- X ب- Y ج- Z د- W or a) X b) Y c) Z d) W → vertical bullet list
  • Keywords: التعليل:/الإجابة:/المطلوب:/الحل: → separate lines
  • OCR noise: box-drawing chars, control chars removed
  • Lone short lines (surrounded by blanks) → ### subheading
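
Three of the Arabic rules above (section markers, questions, inline multiple choice) can be sketched as a per-line formatter. This is not runRuleBasedArchitect: the regular expressions and the formatArabicLine name are assumptions, and only the three section markers spelled out in the list are included.

```typescript
// Section markers spelled out in the rule list (the source's list continues).
const SECTION_MARKER = /^(أولاً|ثانياً|ثالثاً)(?=[\s:،.]|$)/;

// س1 / سؤال 1 / Q1, with ASCII or Arabic-Indic digits.
const QUESTION = /^(س|سؤال|Q)\s*[0-9\u0660-\u0669]+/;

// Inline choices such as "أ- X ب- Y ج- Z د- W".
const CHOICE_SPLIT = /\s+(?=[أبجد]-)/;
const CHOICE_START = /^[أبجد]-/;

function formatArabicLine(line: string): string {
  const trimmed = line.trim();
  if (SECTION_MARKER.test(trimmed)) {
    return `## ${trimmed}`;
  }
  if (QUESTION.test(trimmed)) {
    // Surrounding newlines become blank lines once lines are re-joined.
    return `\n**${trimmed}**\n`;
  }
  const parts = trimmed.split(CHOICE_SPLIT);
  if (parts.length >= 2 && parts.every((part) => CHOICE_START.test(part))) {
    return parts.map((part) => `- ${part}`).join("\n");
  }
  return line;
}
```

Because each rule is a pure string transformation, the same input always produces the same Markdown, which is what makes the engine deterministic.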

Latin/English document rules:

  • ALL CAPS short lines → ### heading
  • Q1:/Question N:/Section/Part → bold / ## heading
  • a) b) c) d) inline choices → vertical list
  • Existing Markdown kept as-is
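
For the Latin/English rules, a comparable sketch is shown below. The 60-character limit for a "short" line, the Markdown-detection regex and the formatLatinLine name are assumptions; the source does not give the actual thresholds.

```typescript
// Lines that already look like Markdown headings or list items.
const EXISTING_MARKDOWN = /^(#{1,6} |[-*] |\d+\. )/;

// Inline choices such as "a) X b) Y c) Z d) W".
const LATIN_CHOICE_SPLIT = /\s+(?=[a-d]\)\s)/;
const LATIN_CHOICE_START = /^[a-d]\)/;

function formatLatinLine(line: string, maxHeadingLength = 60): string {
  const trimmed = line.trim();
  if (EXISTING_MARKDOWN.test(trimmed)) {
    return line; // existing Markdown is kept as-is
  }
  const isAllCaps = /[A-Z]/.test(trimmed) && trimmed === trimmed.toUpperCase();
  if (isAllCaps && trimmed.length <= maxHeadingLength) {
    return `### ${trimmed}`;
  }
  const parts = trimmed.split(LATIN_CHOICE_SPLIT);
  if (parts.length >= 2 && parts.every((part) => LATIN_CHOICE_START.test(part))) {
    return parts.map((part) => `- ${part}`).join("\n");
  }
  return line;
}
```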

Supported File Types for Conversion

ALL file types accepted (500MB max):

  • PDF — pdf-parse
  • DOCX/DOC — mammoth (HTML extraction)
  • PPTX/PPT — jszip (slide text extraction)
  • XLSX/XLS/CSV — xlsx (table to markdown)
  • Images (PNG, JPG, JPEG, WEBP, GIF, BMP, TIFF) — Tesseract.js OCR (Arabic+English)
  • HTML/HTM — custom HTML-to-text
  • EPUB — jszip (chapter text extraction)
  • TXT/MD — direct read
  • Any other — UTF-8 text fallback
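
The list above amounts to a dispatch on file extension with a UTF-8 fallback. A minimal sketch, assuming the extension alone decides the converter (the server may also inspect MIME types); the Converter labels and pickConverter are hypothetical names.

```typescript
type Converter =
  | "pdf-parse"
  | "mammoth"
  | "jszip"
  | "xlsx"
  | "tesseract.js"
  | "html-to-text"
  | "direct-read"
  | "utf8-fallback";

const CONVERTER_BY_EXTENSION = new Map<string, Converter>([
  ["pdf", "pdf-parse"],
  ["docx", "mammoth"], ["doc", "mammoth"],
  ["pptx", "jszip"], ["ppt", "jszip"], ["epub", "jszip"],
  ["xlsx", "xlsx"], ["xls", "xlsx"], ["csv", "xlsx"],
  ["png", "tesseract.js"], ["jpg", "tesseract.js"], ["jpeg", "tesseract.js"],
  ["webp", "tesseract.js"], ["gif", "tesseract.js"], ["bmp", "tesseract.js"],
  ["tiff", "tesseract.js"],
  ["html", "html-to-text"], ["htm", "html-to-text"],
  ["txt", "direct-read"], ["md", "direct-read"],
]);

// Chooses a converter from the file extension, case-insensitively.
function pickConverter(fileName: string): Converter {
  const dot = fileName.lastIndexOf(".");
  const ext = dot === -1 ? "" : fileName.slice(dot + 1).toLowerCase();
  return CONVERTER_BY_EXTENSION.get(ext) ?? "utf8-fallback";
}
```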

Export

  • Markdown (.md) — direct download
  • Word (.docx) — real RTL DOCX via docx npm package:
    • Traditional Arabic font (30pt body, 44pt headings)
    • 100% RTL bidirectional support
    • Numbered header/footer with "منسق آلياً بواسطة رقيم"
    • Proper bullet/numbered lists, tables, inline bold/italic/code
  • PDF — currently exports as markdown (browser-based PDF printing recommended)

Frontend Animations

Framer Motion used throughout (already in devDependencies):

  • Upload zone: spring scale on hover/drag, AnimatePresence for state transitions
  • File upload progress bar with smooth motion
  • Conversion steps: staggered fade-in, animated step indicators
  • Live AI message cycling during processing steps
  • Success/failure states with spring animations

Hugging Face Spaces Deployment

  • Dockerfile — multi-stage build, runs as non-root, DB at /data/raqim.db
  • deploy-hf.sh — one-command deploy (requires HF_TOKEN + HF_USERNAME Replit secrets)
  • scripts/db-hf-sync.mjs — pull/push SQLite DB to private HF Dataset (${HF_USERNAME}/raqim-db) via git
  • Space secrets to set after deploy: JWT_SECRET, SESSION_SECRET
  • No DATABASE_URL needed — SQLite is fully embedded

See the pnpm-workspace skill for workspace structure, TypeScript setup, and package details.