RAQIM — رقيم
Arabic-first document-to-Markdown conversion platform.
Overview
pnpm workspace monorepo using TypeScript. Full-stack: React+Vite frontend, Express backend, SQLite (node:sqlite built-in) + Drizzle ORM. Deployment target: Hugging Face Spaces (Docker).
Stack
- Monorepo tool: pnpm workspaces
- Node.js version: 24
- Package manager: pnpm
- TypeScript version: 5.9
- API framework: Express 5
- Database: SQLite via `node:sqlite` (built-in, no native compilation) + `drizzle-orm/sqlite-proxy`
- Validation: Zod (`zod/v4`), `drizzle-zod`
- API codegen: Orval (from OpenAPI spec in `artifacts/api-spec/openapi.yaml`)
- Build: esbuild (ESM bundle)
- Frontend: React 19 + Vite 7 + Tailwind CSS v4 + shadcn/ui
Artifacts
| Artifact | Path | Port |
|---|---|---|
| @workspace/raqim | / | $PORT (23567) |
| @workspace/api-server | /api | $PORT (8080) |
Key Commands
- `pnpm run typecheck`: full typecheck across all packages
- `pnpm run typecheck:libs`: build composite libs (run before API/frontend typecheck)
- `pnpm --filter @workspace/raqim run typecheck`: frontend only
- `pnpm --filter @workspace/api-server run typecheck`: backend only
- `pnpm --filter @workspace/api-spec run codegen`: regenerate API hooks and Zod schemas from OpenAPI spec
- `node scripts/seed-admin.mjs`: seed the admin user (run once after first startup)
Database
- Engine: `node:sqlite` (Node 24 built-in, zero native deps)
- ORM: `drizzle-orm/sqlite-proxy` (async callback-based wrapper)
- Dev DB: `artifacts/api-server/raqim.db` (auto-created on startup)
- Docker/HF DB: `/data/raqim.db` (set via `DB_PATH` env var)
- Migrations: run automatically at server startup from `lib/db/drizzle/*.sql`
- Backup: `scripts/db-hf-sync.mjs`, git-based push/pull to a private HF Dataset
sqlite-proxy callback contract (lib/db/src/index.ts)
- `method === "run"` → `stmt.run(...params)`, return `{ rows: [] }`
- `method === "get"` → `stmt.get(...params)`, return `{ rows: Object.values(row) }` (flat array, NOT wrapped)
- `method === "all"` → `stmt.all(...params)`, return `{ rows: result.map(r => Object.values(r)) }` (array of arrays)
- `method === "values"` → same as `all`
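The contract above can be sketched as a small adapter. `StatementLike` and `toProxyResult` are hypothetical names, not taken from lib/db/src/index.ts, and the interface mirrors only the three statement calls the contract mentions. The source does not say what is returned when `get` matches no row, so the empty array in that branch is an assumption.

```typescript
type Row = Record<string, unknown>;
type ProxyMethod = "run" | "get" | "all" | "values";

// Minimal shape of a prepared statement: only the calls the contract uses.
interface StatementLike {
  run(...params: unknown[]): unknown;
  get(...params: unknown[]): Row | undefined;
  all(...params: unknown[]): Row[];
}

// Hypothetical helper: maps one statement call to the { rows } shape.
function toProxyResult(
  stmt: StatementLike,
  params: unknown[],
  method: ProxyMethod,
): { rows: unknown[] } {
  switch (method) {
    case "run":
      stmt.run(...params);
      return { rows: [] };
    case "get": {
      const row = stmt.get(...params);
      // Flat array of column values. Empty array on no match is an assumption.
      return { rows: row ? Object.values(row) : [] };
    }
    case "all":
    case "values":
      // One inner array of column values per row.
      return { rows: stmt.all(...params).map((r) => Object.values(r)) };
  }
}
```

Keeping the mapping in a pure function like this makes the flat-versus-nested distinction easy to unit test with a fake statement, without opening a database.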
Admin seed
node scripts/seed-admin.mjs — inserts admin@raqim.app / Admin1234! (bcrypt hash, role=admin, status=active).
Respects DB_PATH, ADMIN_EMAIL, ADMIN_PASSWORD, ADMIN_NAME env vars.
Database Schema (lib/db/src/schema/)
- users — id, email, passwordHash, displayName, role (user|admin), status (pending|active|suspended), lastLoginAt
- refresh_tokens — id, userId, token (unique), expiresAt
- folders — id, name, parentId, ownerId, trashed, trashedAt
- files — id, name, folderId, ownerId, originalName, originalType, sizeBytes, status (uploading|queued|processing|done|failed|trashed), markdownContent, originalMarkdown, qualityScore, wordCount, language, trashedAt
- conversions — id, fileId, userId, status (queued|analyzing|routing|ocr|layout|scoring|merging|cleanup|done|failed), progress, steps (jsonb), elapsedSeconds, estimatedSeconds, errorMessage
- shares — id, fileId, folderId, ownerId, sharedWithId, permission (read|edit|full)
All timestamps stored as INTEGER (milliseconds since epoch). Enums stored as TEXT.
Auth
- JWT accessToken (1h) in memory + refreshToken (30d) in httpOnly cookie
- `SESSION_SECRET` env var used as JWT secret
- refreshToken includes random `jti` UUID to prevent duplicate key collisions
- Admin seed: `admin@raqim.app` / `Admin1234!`
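One plausible reading of the `jti` note: two refresh tokens issued for the same user in the same second would otherwise be byte-identical, and the second insert would violate the unique constraint on `refresh_tokens.token`. The sketch below illustrates that effect with a hand-rolled HS256 signer. `signHs256`, the `sub` claim, and the secret value are illustrative assumptions, not the project's code; a real server would use a JWT library.

```typescript
import { createHmac, randomUUID } from "node:crypto";

// base64url-encode a JSON value, as used for JWT header and payload segments.
const encode = (value: object): string =>
  Buffer.from(JSON.stringify(value)).toString("base64url");

// Hypothetical HS256 signer, for illustration only.
function signHs256(payload: object, secret: string): string {
  const body = `${encode({ alg: "HS256", typ: "JWT" })}.${encode(payload)}`;
  const signature = createHmac("sha256", secret).update(body).digest("base64url");
  return `${body}.${signature}`;
}

const secret = "example-session-secret"; // stand-in for SESSION_SECRET
const iat = 1_700_000_000;               // same issue second for every token
const exp = iat + 30 * 24 * 60 * 60;     // 30-day refresh lifetime

// Without jti: same user, same second, identical token string.
const plainA = signHs256({ sub: 1, iat, exp }, secret);
const plainB = signHs256({ sub: 1, iat, exp }, secret);

// With jti: the random UUID makes each token distinct.
const uniqueA = signHs256({ sub: 1, iat, exp, jti: randomUUID() }, secret);
const uniqueB = signHs256({ sub: 1, iat, exp, jti: randomUUID() }, secret);

console.log(plainA === plainB);   // true
console.log(uniqueA === uniqueB); // false
```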
Frontend Pages
| Route | Page | Auth |
|---|---|---|
| / | LandingPage | public |
| /login | LoginPage | public |
| /request-access | RequestAccessPage | public |
| /dashboard | DashboardPage | protected |
| /upload | UploadPage | protected |
| /editor/:fileId | EditorPage | protected |
| /convert/:jobId | ConversionPage | protected |
| /shared | SharedPage | protected |
| /trash | TrashPage | protected |
| /admin/users | AdminUsersPage | admin |
| /admin/requests | AdminRequestsPage | admin |
| /admin/stats | AdminStatsPage | admin |
Design System
- Dark sanctuary theme: `--background: #0d0d0f`, `--primary: #f5a623` (amber gold)
- Fonts: Fraunces (display) + Noto Naskh Arabic (body)
- All pages use `dir="rtl"` for Arabic-first layout
- shadcn/ui components with custom CSS vars
API Routes
- `POST /api/auth/login`: login, returns accessToken + sets refresh cookie
- `POST /api/auth/refresh`: refresh accessToken using cookie
- `POST /api/auth/logout`: clear refresh cookie
- `GET/POST /api/folders`: list/create folders
- `PATCH/DELETE /api/folders/:id`: update/trash folder
- `GET/POST /api/files`: list/upload files
- `GET/PATCH/DELETE /api/files/:id`: get/update/trash file
- `GET/PUT /api/files/:id/content`: get/save markdown content
- `POST /api/files/:id/restore`: restore from trash
- `POST /api/files/:id/export`: export to md/docx/pdf
- `GET /api/files/:id/download`: download file
- `POST /api/convert`: start conversion job
- `GET /api/convert/:jobId`: poll conversion status
- `GET/POST/PATCH/DELETE /api/shares`: share management
- `GET /api/shares/search-users`: search users to share with
- `GET /api/admin/users`: list users
- `POST /api/admin/users`: create user
- `PATCH/DELETE /api/admin/users/:id`: update/delete user
- `GET /api/admin/join-requests`: list pending join requests
- `PATCH /api/admin/join-requests/:id`: approve/reject request
- `GET /api/admin/stats`: platform statistics
- `GET /api/admin/trash`: list all trashed items
- `DELETE /api/admin/trash/empty`: empty trash
Arabic PDF Text Extraction Pipeline
Many Arabic PDFs have broken ToUnicode CMap font tables. pdfjs-dist returns the correct Unicode codepoints but in wrong order (e.g. امحلد هلل instead of الحمد لله). The pipeline uses three stages — all completely free, no paid API keys, no page limits.
Stage 1: pdfjs-dist text extraction
Custom RTL-aware extractor that buckets text items by Y coordinate, sorts right-to-left, and joins with gap-based space insertion. Fast, zero dependencies.
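The bucketing and ordering described above can be sketched as follows. This is a minimal illustration, not the project's extractor: the flat `TextItem` shape, the function name, and both thresholds are assumptions (pdfjs-dist reports position through a transform matrix, which a real extractor would unpack first).

```typescript
// Simplified text item: x is the left edge, y the baseline, width the advance.
interface TextItem {
  str: string;
  x: number;
  y: number;
  width: number;
}

// Hypothetical extractor: groups items into lines and reads each line right-to-left.
function extractRtlLines(items: TextItem[], yTolerance = 2, gapThreshold = 1.5): string[] {
  // Bucket items whose baselines round to the same slot.
  const buckets = new Map<number, TextItem[]>();
  for (const item of items) {
    const key = Math.round(item.y / yTolerance);
    const bucket = buckets.get(key) ?? [];
    bucket.push(item);
    buckets.set(key, bucket);
  }

  // PDF y grows upward, so larger keys are higher on the page.
  const keys = Array.from(buckets.keys()).sort((a, b) => b - a);

  return keys.map((key) => {
    // Rightmost item first.
    const line = (buckets.get(key) ?? []).sort((a, b) => b.x - a.x);
    let text = "";
    let prev: TextItem | undefined;
    for (const item of line) {
      // The next item sits to the left: gap = previous left edge - this right edge.
      if (prev && prev.x - (item.x + item.width) > gapThreshold) text += " ";
      text += item.str;
      prev = item;
    }
    return text;
  });
}
```

Rounding to a slot is the simplest bucketing rule; two baselines that straddle a slot boundary would land in different lines, so a production version would likely cluster by distance instead.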
Stage 2: Garbling detection → VLM-OCR → Tesseract fallback
isGarbledArabic() checks for telltale CMap transposition patterns (يف for في, امحلد for الحمد, bidi-wrapped Latin noise like OA, etc.).
If garbled, extractPdfViaOcr() runs the following pipeline:
- `pdftoppm` renders every page to PNG: 200 DPI when `HF_TOKEN` available (VLM-optimised), 300 DPI for Tesseract-only mode
- Per page: try VLM-OCR via HF Inference API first (best Arabic accuracy)
  - Primary: `allenai/olmOCR-7B-0225-preview` (Allen Institute, fine-tuned on document OCR, #1 on Arabic KITAB-Bench)
  - Fallback: `Qwen/Qwen2.5-VL-7B-Instruct` (strong VLM with excellent Arabic)
- If VLM fails or rate-limited: Tesseract.js lazy-initialised (only created when needed), Arabic+English models
After OCR, cleanOcrOutput() filters each line: drops lines >80% Latin chars with <4 Arabic chars (decorative page noise, artefacts like Me NY 1, dl pl a gl). Keeps all lines with ≥4 Arabic characters.
No page cap — processes every page in the document.
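The line filter applied after OCR can be sketched as below. This is not the project's `cleanOcrOutput()`: the name `cleanOcrLines` is hypothetical, and two details are assumptions because the source does not specify them. The Latin share is computed over letters only (Latin plus Arabic), which is what makes the `Me NY 1` example drop, and lines with no letters at all are kept.

```typescript
// Arabic block (letters, marks, Arabic punctuation) and basic Latin letters.
const ARABIC_CHARS = /[\u0600-\u06FF]/g;
const LATIN_CHARS = /[A-Za-z]/g;

// Hypothetical filter: drops lines that are mostly Latin noise.
function cleanOcrLines(text: string): string {
  return text
    .split("\n")
    .filter((line) => {
      const arabic = (line.match(ARABIC_CHARS) ?? []).length;
      if (arabic >= 4) return true; // always keep lines with real Arabic content

      const latin = (line.match(LATIN_CHARS) ?? []).length;
      const letters = latin + arabic;
      if (letters === 0) return true; // blank or numeric lines are kept (assumption)

      return latin / letters <= 0.8; // drop only when more than 80% Latin
    })
    .join("\n");
}
```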
Stage 3: AI text correction — full HF model access, 6-model fallback chain
Endpoint priority chain (tried in order, falls back on rate-limit / 5xx):
- Replit AI proxy (`AI_INTEGRATIONS_OPENAI_BASE_URL`): auto on Replit, uses `gpt-4o`
- HF: Qwen/Qwen3-72B, best open-source Arabic model (Apr 2025), thinking disabled via `/no_think`
- HF: Qwen/Qwen3-30B-A3B, MoE, fast and very capable
- HF: Qwen/Qwen2.5-72B-Instruct, proven Arabic quality
- HF: meta-llama/Llama-3.3-70B-Instruct, strong multilingual
- HF: mistralai/Mistral-Nemo-Instruct-2407, fast 12B fallback
Chunk size: 3000 chars. Timeout: 120s per chunk. Temperature: 0.1. Qwen3 <think> blocks stripped from output. On rate-limit, activeEpIdx advances to next model and continues without interruption.
No OPENAI_API_KEY needed. HF_TOKEN secret (already required for HF Spaces) doubles as the AI key.
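The chunking, `<think>` stripping, and endpoint advance described above might look roughly like this. It is a sketch under stated assumptions: `CallFn`, `CallResult`, `chunkText`, `stripThink`, and `correctWithFallback` are hypothetical names, the network call is injected rather than implemented, and keeping the uncorrected chunk when every endpoint is exhausted is a guess at the fallback behaviour. Only the 3000-character chunk size and the `activeEpIdx` name come from the text.

```typescript
// ok: false stands for a rate-limit or 5xx response from that endpoint.
type CallResult = { ok: true; text: string } | { ok: false };
type CallFn = (endpointIndex: number, chunk: string) => Promise<CallResult>;

// Split text into pieces of at most `size` characters.
function chunkText(text: string, size = 3000): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) chunks.push(text.slice(i, i + size));
  return chunks;
}

// Remove Qwen3-style <think>...</think> blocks and the whitespace after them.
function stripThink(reply: string): string {
  return reply.replace(/<think>[\s\S]*?<\/think>\s*/g, "");
}

async function correctWithFallback(
  text: string,
  endpointCount: number,
  call: CallFn,
): Promise<string> {
  let activeEpIdx = 0; // persists across chunks, so a failed endpoint is not retried
  const out: string[] = [];
  for (const chunk of chunkText(text)) {
    let corrected: string | undefined;
    while (corrected === undefined && activeEpIdx < endpointCount) {
      const result = await call(activeEpIdx, chunk);
      if (result.ok) corrected = stripThink(result.text);
      else activeEpIdx += 1; // advance to the next model and retry the same chunk
    }
    // All endpoints exhausted: keep the uncorrected chunk (assumption).
    out.push(corrected ?? chunk);
  }
  return out.join("");
}
```

Because `activeEpIdx` lives outside the chunk loop, later chunks start at whichever endpoint last succeeded instead of probing the rate-limited ones again.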
Architect Engine — 100% Free, No External APIs, No Limits
The "Super Architect" is a fully deterministic rule-based engine (runRuleBasedArchitect in convert.ts).
Zero external API calls. Zero cost. Zero rate limits. Runs entirely on the server.
Arabic document rules:
- Metadata table: detects مادة/زمن/نموذج/تاريخ (even multiple fields inline on one line) → Markdown table
- Document title: promoted to `# heading` when metadata block precedes it
- Section markers: أولاً/ثانياً/ثالثاً/... → `## heading`
- Questions: س1/سؤال 1/Q1 → `**bold**` with blank lines
- Multiple choice: `أ- X ب- Y ج- Z د- W` or `a) X b) Y c) Z d) W` → vertical bullet list
- Keywords: التعليل:/الإجابة:/المطلوب:/الحل: → separate lines
- OCR noise: box-drawing chars, control chars removed
- Lone short lines (surrounded by blanks) → `### subheading`
Latin/English document rules:
- ALL CAPS short lines → `### heading`
- `Q1:`/`Question N:`/`Section`/`Part` → bold / `## heading`
- `a) b) c) d)` inline choices → vertical list
- Existing Markdown kept as-is
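To make the rule-based approach concrete, here is a minimal sketch of two of the Arabic rules listed above: section markers and inline multiple choice. It is not `runRuleBasedArchitect`; the function names, the regular expressions, and the exact bullet format are assumptions, and only the three ordinal markers named in the text are covered.

```typescript
// Ordinal markers named in the rules; the real engine handles more.
const SECTION_MARKER = /^(أولاً|ثانياً|ثالثاً)(?=[\s:،]|$)/;

// Hypothetical helper: splits "أ- X ب- Y" style choices into one bullet each.
function splitInlineChoices(line: string): string[] | null {
  // Break before every choice letter that is followed by a hyphen.
  const parts = line.trim().split(/\s+(?=[أبجد]\s*-)/);
  if (parts.length < 2) return null;
  for (const part of parts) {
    if (!/^[أبجد]\s*-\s*\S/.test(part)) return null; // not a choice line after all
  }
  return parts.map((part) => `- ${part}`);
}

// Applies the two rules line by line and leaves every other line unchanged.
function applyArabicRules(text: string): string {
  const out: string[] = [];
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (SECTION_MARKER.test(line)) {
      out.push(`## ${line}`);
      continue;
    }
    const choices = splitInlineChoices(line);
    if (choices) {
      out.push(...choices);
      continue;
    }
    out.push(raw);
  }
  return out.join("\n");
}
```

Each rule is a pure string transform, which is what lets the engine run deterministically on the server with no model calls.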
Supported File Types for Conversion
ALL file types accepted (500MB max):
- PDF — pdf-parse
- DOCX/DOC — mammoth (HTML extraction)
- PPTX/PPT — jszip (slide text extraction)
- XLSX/XLS/CSV — xlsx (table to markdown)
- Images (PNG, JPG, JPEG, WEBP, GIF, BMP, TIFF) — Tesseract.js OCR (Arabic+English)
- HTML/HTM — custom HTML-to-text
- EPUB — jszip (chapter text extraction)
- TXT/MD — direct read
- Any other — UTF-8 text fallback
Export
- Markdown (.md): direct download
- Word (.docx): real RTL DOCX via the `docx` npm package, with:
  - Traditional Arabic font (30pt body, 44pt headings)
  - 100% RTL bidirectional support
  - Numbered header/footer with "منسق آلياً بواسطة رقيم"
  - Proper bullet/numbered lists, tables, inline bold/italic/code
- PDF: currently exports as markdown (browser-based PDF printing recommended)
Frontend Animations
Framer Motion used throughout (already in devDependencies):
- Upload zone: spring scale on hover/drag, AnimatePresence for state transitions
- File upload progress bar with smooth motion
- Conversion steps: staggered fade-in, animated step indicators
- Live AI message cycling during processing steps
- Success/failure states with spring animations
Hugging Face Spaces Deployment
- `Dockerfile`: multi-stage build, runs as non-root, DB at `/data/raqim.db`
- `deploy-hf.sh`: one-command deploy (requires `HF_TOKEN` + `HF_USERNAME` Replit secrets)
- `scripts/db-hf-sync.mjs`: pull/push SQLite DB to private HF Dataset (`${HF_USERNAME}/raqim-db`) via git
- Space secrets to set after deploy: `JWT_SECRET`, `SESSION_SECRET`
- No `DATABASE_URL` needed: SQLite is fully embedded
See the pnpm-workspace skill for workspace structure, TypeScript setup, and package details.