# RAQIM (رقيم)

Arabic-first document-to-Markdown conversion platform.

## Overview

pnpm workspace monorepo using TypeScript. Full-stack: React + Vite frontend, Express backend, SQLite (`node:sqlite` built-in) + Drizzle ORM. Deployment target: Hugging Face Spaces (Docker).

## Stack

- **Monorepo tool**: pnpm workspaces
- **Node.js version**: 24
- **Package manager**: pnpm
- **TypeScript version**: 5.9
- **API framework**: Express 5
- **Database**: SQLite via `node:sqlite` (built-in, no native compilation) + `drizzle-orm/sqlite-proxy`
- **Validation**: Zod (`zod/v4`), `drizzle-zod`
- **API codegen**: Orval (from OpenAPI spec in `artifacts/api-spec/openapi.yaml`)
- **Build**: esbuild (ESM bundle)
- **Frontend**: React 19 + Vite 7 + Tailwind CSS v4 + shadcn/ui

## Artifacts

| Artifact | Path | Port |
|---|---|---|
| `@workspace/raqim` | `/` | $PORT (23567) |
| `@workspace/api-server` | `/api` | $PORT (8080) |

## Key Commands

- `pnpm run typecheck`: full typecheck across all packages
- `pnpm run typecheck:libs`: build composite libs (run before API/frontend typecheck)
- `pnpm --filter @workspace/raqim run typecheck`: frontend only
- `pnpm --filter @workspace/api-server run typecheck`: backend only
- `pnpm --filter @workspace/api-spec run codegen`: regenerate API hooks and Zod schemas from OpenAPI spec
- `node scripts/seed-admin.mjs`: seed the admin user (run once after first startup)

## Database

- **Engine**: `node:sqlite` (Node 24 built-in, zero native deps)
- **ORM**: `drizzle-orm/sqlite-proxy` (async callback-based wrapper)
- **Dev DB**: `artifacts/api-server/raqim.db` (auto-created on startup)
- **Docker/HF DB**: `/data/raqim.db` (set via `DB_PATH` env var)
- **Migrations**: run automatically at server startup from `lib/db/drizzle/*.sql`
- **Backup**: `scripts/db-hf-sync.mjs`, git-based push/pull to a private HF Dataset

### sqlite-proxy callback contract (lib/db/src/index.ts)

- `method === "run"` → `stmt.run(...params)`, return `{ rows: [] }`
- `method === "get"` → `stmt.get(...params)`, return `{ rows: Object.values(row) }` (flat array, NOT wrapped)
- `method === "all"` → `stmt.all(...params)`, return `{ rows: result.map(r => Object.values(r)) }` (array of arrays)
- `method === "values"` → same as `all`

### Admin seed

`node scripts/seed-admin.mjs` inserts `admin@raqim.app` / `Admin1234!` (bcrypt hash, role=admin, status=active). Respects `DB_PATH`, `ADMIN_EMAIL`, `ADMIN_PASSWORD`, `ADMIN_NAME` env vars.

## Database Schema (lib/db/src/schema/)

- **users**: id, email, passwordHash, displayName, role (user|admin), status (pending|active|suspended), lastLoginAt
- **refresh_tokens**: id, userId, token (unique), expiresAt
- **folders**: id, name, parentId, ownerId, trashed, trashedAt
- **files**: id, name, folderId, ownerId, originalName, originalType, sizeBytes, status (uploading|queued|processing|done|failed|trashed), markdownContent, originalMarkdown, qualityScore, wordCount, language, trashedAt
- **conversions**: id, fileId, userId, status (queued|analyzing|routing|ocr|layout|scoring|merging|cleanup|done|failed), progress, steps (jsonb), elapsedSeconds, estimatedSeconds, errorMessage
- **shares**: id, fileId, folderId, ownerId, sharedWithId, permission (read|edit|full)

All timestamps stored as `INTEGER` (milliseconds since epoch). Enums stored as `TEXT`.
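
As a rough illustration of the sqlite-proxy callback contract listed in the Database section, the four method branches might be dispatched as below. This is a sketch, not the contents of `lib/db/src/index.ts`: the `StatementLike` interface and the `executeForProxy` name are hypothetical stand-ins for a prepared `node:sqlite` statement and the real callback, and returning an empty array when `get` finds no row is an assumption.

```typescript
// The four method names drizzle-orm/sqlite-proxy passes to its callback.
type ProxyMethod = "run" | "get" | "all" | "values";
type Row = Record<string, unknown>;

// Hypothetical minimal shape of a prepared statement (assumption: it mirrors
// the run/get/all calls named in the contract above).
interface StatementLike {
  run(...params: unknown[]): unknown;
  get(...params: unknown[]): Row | undefined;
  all(...params: unknown[]): Row[];
}

// Executes a statement and shapes the result the way the contract describes.
function executeForProxy(
  stmt: StatementLike,
  params: unknown[],
  method: ProxyMethod,
): { rows: unknown[] } {
  if (method === "run") {
    stmt.run(...params);
    return { rows: [] };
  }
  if (method === "get") {
    const row = stmt.get(...params);
    // Flat array of column values, NOT wrapped in an outer array.
    // Assumption: a missing row maps to an empty array.
    return { rows: row ? Object.values(row) : [] };
  }
  // "all" and "values": one inner array of column values per row.
  return { rows: stmt.all(...params).map((r) => Object.values(r)) };
}
```

The detail most worth noticing is the shape difference: `get` yields `[1, "a@b.c"]`, while `all` yields `[[1, "a@b.c"], ...]`.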
## Auth

- JWT accessToken (1h) in memory + refreshToken (30d) in httpOnly cookie
- `SESSION_SECRET` env var used as JWT secret
- refreshToken includes random `jti` UUID to prevent duplicate key collisions
- Admin seed: `admin@raqim.app` / `Admin1234!`

## Frontend Pages

| Route | Page | Auth |
|---|---|---|
| `/` | LandingPage | public |
| `/login` | LoginPage | public |
| `/request-access` | RequestAccessPage | public |
| `/dashboard` | DashboardPage | protected |
| `/upload` | UploadPage | protected |
| `/editor/:fileId` | EditorPage | protected |
| `/convert/:jobId` | ConversionPage | protected |
| `/shared` | SharedPage | protected |
| `/trash` | TrashPage | protected |
| `/admin/users` | AdminUsersPage | admin |
| `/admin/requests` | AdminRequestsPage | admin |
| `/admin/stats` | AdminStatsPage | admin |

## Design System

- Dark sanctuary theme: `--background: #0d0d0f`, `--primary: #f5a623` (amber gold)
- Fonts: Fraunces (display) + Noto Naskh Arabic (body)
- All pages use `dir="rtl"` for Arabic-first layout
- shadcn/ui components with custom CSS vars

## API Routes

- `POST /api/auth/login`: login, returns accessToken + sets refresh cookie
- `POST /api/auth/refresh`: refresh accessToken using cookie
- `POST /api/auth/logout`: clear refresh cookie
- `GET/POST /api/folders`: list/create folders
- `PATCH/DELETE /api/folders/:id`: update/trash folder
- `GET/POST /api/files`: list/upload files
- `GET/PATCH/DELETE /api/files/:id`: get/update/trash file
- `GET/PUT /api/files/:id/content`: get/save markdown content
- `POST /api/files/:id/restore`: restore from trash
- `POST /api/files/:id/export`: export to md/docx/pdf
- `GET /api/files/:id/download`: download file
- `POST /api/convert`: start conversion job
- `GET /api/convert/:jobId`: poll conversion status
- `GET/POST/PATCH/DELETE /api/shares`: share management
- `GET /api/shares/search-users`: search users to share with
- `GET /api/admin/users`: list users
- `POST /api/admin/users`: create user
- `PATCH/DELETE /api/admin/users/:id`: update/delete user
- `GET /api/admin/join-requests`: list pending join requests
- `PATCH /api/admin/join-requests/:id`: approve/reject request
- `GET /api/admin/stats`: platform statistics
- `GET /api/admin/trash`: list all trashed items
- `DELETE /api/admin/trash/empty`: empty trash

## Arabic PDF Text Extraction Pipeline

Many Arabic PDFs have broken ToUnicode CMap font tables. pdfjs-dist returns the correct Unicode codepoints but in the wrong order (e.g. `امحلد هلل` instead of `الحمد لله`). The pipeline uses three stages, all completely free: no paid API keys, no page limits.

### Stage 1: pdfjs-dist text extraction

Custom RTL-aware extractor that buckets text items by Y coordinate, sorts right-to-left, and joins with gap-based space insertion. Fast, zero dependencies.

### Stage 2: Garbling detection → VLM-OCR → Tesseract fallback

`isGarbledArabic()` checks for telltale CMap transposition patterns (` يف ` for `في`, `امحلد` for `الحمد`, bidi-wrapped Latin noise like `‎OA‏`, etc.). If the text is garbled, `extractPdfViaOcr()` runs the following pipeline:

1. `pdftoppm` renders every page to PNG: **200 DPI** when `HF_TOKEN` is available (VLM-optimised), **300 DPI** for Tesseract-only mode
2. **Per page**: try VLM-OCR via HF Inference API first (best Arabic accuracy)
   - Primary: `allenai/olmOCR-7B-0225-preview` (Allen Institute, fine-tuned on document OCR, #1 on Arabic KITAB-Bench)
   - Fallback: `Qwen/Qwen2.5-VL-7B-Instruct` (strong VLM with excellent Arabic)
3. **If VLM fails or rate-limited**: Tesseract.js lazy-initialised (only created when needed), Arabic+English models

After OCR, `cleanOcrOutput()` filters each line: it drops lines with >80% Latin chars and <4 Arabic chars (decorative page noise, artefacts like `Me NY 1`, `dl pl a gl`) and keeps all lines with ≥4 Arabic characters.

**No page cap**: every page in the document is processed.
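
The `cleanOcrOutput()` line filter described above could look roughly like the following. This is a sketch under stated assumptions, not the project's implementation: the function name `cleanOcrOutputSketch` is hypothetical, the Latin ratio is assumed to be measured over non-whitespace characters, ASCII digits are assumed to count as Latin (so that `Me NY 1` is dropped, as in the example artefacts), and blank lines are assumed to be preserved.

```typescript
// Arabic block (U+0600..U+06FF) and ASCII letters/digits.
// Assumption: digits count toward the "Latin" share of a line.
const ARABIC_CHARS = /[\u0600-\u06FF]/g;
const LATIN_CHARS = /[A-Za-z0-9]/g;

// Counts how many characters in `text` match a global pattern.
function countMatches(text: string, pattern: RegExp): number {
  return (text.match(pattern) ?? []).length;
}

// Hypothetical sketch of the per-line OCR noise filter.
function cleanOcrOutputSketch(text: string): string {
  return text
    .split("\n")
    .filter((line) => {
      // Any line with at least 4 Arabic characters is always kept.
      if (countMatches(line, ARABIC_CHARS) >= 4) return true;

      // Blank lines carry paragraph structure, so keep them (assumption).
      const visible = line.replace(/\s/g, "").length;
      if (visible === 0) return true;

      // Drop the line only when more than 80% of its visible chars are Latin.
      return countMatches(line, LATIN_CHARS) / visible <= 0.8;
    })
    .join("\n");
}
```

With these assumptions, a mixed line such as `Page 3 من` survives because only 5 of its 7 visible characters are Latin, while pure Latin noise lines are removed.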
### Stage 3: AI text correction (full HF model access, 6-model fallback chain)

Endpoint priority chain (tried in order, falls back on rate-limit / 5xx):

1. **Replit AI proxy** (`AI_INTEGRATIONS_OPENAI_BASE_URL`): auto on Replit, uses `gpt-4o`
2. **HF: Qwen/Qwen3-72B**: best open-source Arabic model (Apr 2025), thinking disabled via `/no_think`
3. **HF: Qwen/Qwen3-30B-A3B**: MoE, fast and very capable
4. **HF: Qwen/Qwen2.5-72B-Instruct**: proven Arabic quality
5. **HF: meta-llama/Llama-3.3-70B-Instruct**: strong multilingual
6. **HF: mistralai/Mistral-Nemo-Instruct-2407**: fast 12B fallback

Chunk size: 3000 chars. Timeout: 120s per chunk. Temperature: 0.1. Qwen3 `<think>` blocks are stripped from output. On rate-limit, `activeEpIdx` advances to the next model and processing continues without interruption.

No `OPENAI_API_KEY` needed. The `HF_TOKEN` secret (already required for HF Spaces) doubles as the AI key.

## Architect Engine: 100% Free, No External APIs, No Limits

The "Super Architect" is a fully deterministic rule-based engine (`runRuleBasedArchitect` in `convert.ts`). Zero external API calls. Zero cost. Zero rate limits. Runs entirely on the server.

**Arabic document rules:**

- Metadata table: detects مادة/زمن/نموذج/تاريخ (even multiple fields inline on one line) → Markdown table
- Document title: promoted to `# heading` when metadata block precedes it
- Section markers: أولاً/ثانياً/ثالثاً/... → `## heading`
- Questions: س1/سؤال 1/Q1 → `**bold**` with blank lines
- Multiple choice: `أ- X ب- Y ج- Z د- W` or `a) X b) Y c) Z d) W` → vertical bullet list
- Keywords: التعليل:/الإجابة:/المطلوب:/الحل: → separate lines
- OCR noise: box-drawing chars, control chars removed
- Lone short lines (surrounded by blanks) → `### subheading`

**Latin/English document rules:**

- ALL CAPS short lines → `### heading`
- `Q1:`/`Question N:`/`Section`/`Part` → bold / `## heading`
- `a) b) c) d)` inline choices → vertical list
- Existing Markdown kept as-is

## Supported File Types for Conversion

ALL file types accepted (500MB max):

- **PDF**: pdf-parse
- **DOCX/DOC**: mammoth (HTML extraction)
- **PPTX/PPT**: jszip (slide text extraction)
- **XLSX/XLS/CSV**: xlsx (table to markdown)
- **Images** (PNG, JPG, JPEG, WEBP, GIF, BMP, TIFF): Tesseract.js OCR (Arabic+English)
- **HTML/HTM**: custom HTML-to-text
- **EPUB**: jszip (chapter text extraction)
- **TXT/MD**: direct read
- **Any other**: UTF-8 text fallback

## Export

- **Markdown (.md)**: direct download
- **Word (.docx)**: real RTL DOCX via `docx` npm package:
  - Traditional Arabic font (30pt body, 44pt headings)
  - 100% RTL bidirectional support
  - Numbered header/footer with "منسق آلياً بواسطة رقيم"
  - Proper bullet/numbered lists, tables, inline bold/italic/code
- **PDF**: currently exports as markdown (browser-based PDF printing recommended)

## Frontend Animations

Framer Motion used throughout (already in devDependencies):

- Upload zone: spring scale on hover/drag, AnimatePresence for state transitions
- File upload progress bar with smooth motion
- Conversion steps: staggered fade-in, animated step indicators
- Live AI message cycling during processing steps
- Success/failure states with spring animations

## Hugging Face Spaces Deployment

- `Dockerfile`: multi-stage build, runs as non-root, DB at `/data/raqim.db`
- `deploy-hf.sh`: one-command deploy (requires `HF_TOKEN` + `HF_USERNAME` Replit secrets)
- `scripts/db-hf-sync.mjs`: pull/push SQLite DB to private HF Dataset (`${HF_USERNAME}/raqim-db`) via git
- Space secrets to set after deploy: `JWT_SECRET`, `SESSION_SECRET`
- No `DATABASE_URL` needed: SQLite is fully embedded

See the `pnpm-workspace` skill for workspace structure, TypeScript setup, and package details.
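
For the deployment settings above, a startup check along the following lines may help catch a Space that was deployed without its secrets. It is only a sketch: `loadRuntimeConfig` and `RuntimeConfig` are hypothetical names, and whether the server actually validates its environment this way is an assumption. The variable names and the two database paths are the ones given in this document.

```typescript
// Dev default from the Database section; Docker/HF overrides it via DB_PATH.
const DEV_DB_PATH = "artifacts/api-server/raqim.db";

// Secrets listed under "Space secrets to set after deploy".
const REQUIRED_SECRETS = ["JWT_SECRET", "SESSION_SECRET"];

interface RuntimeConfig {
  dbPath: string;
  sessionSecret: string;
}

// Hypothetical helper: fails fast when a required secret is missing and
// resolves the SQLite file location. No DATABASE_URL is consulted.
function loadRuntimeConfig(env: Record<string, string | undefined>): RuntimeConfig {
  const missing = REQUIRED_SECRETS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required secrets: ${missing.join(", ")}`);
  }
  return {
    // An unset or empty DB_PATH falls back to the dev database file.
    dbPath: env.DB_PATH || DEV_DB_PATH,
    sessionSecret: env.SESSION_SECRET as string,
  };
}
```

On the server this would typically be called once with `process.env` before the database is opened, so a misconfigured Space stops with a clear message instead of failing on the first login.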