RAQIM — رقيم

Arabic-first document-to-Markdown conversion platform.

Overview

pnpm workspace monorepo using TypeScript. Full-stack: React+Vite frontend, Express backend, SQLite (node:sqlite built-in) + Drizzle ORM. Deployment target: Hugging Face Spaces (Docker).

Stack

Monorepo tool: pnpm workspaces
Node.js version: 24
Package manager: pnpm
TypeScript version: 5.9
API framework: Express 5
Database: SQLite via node:sqlite (built-in, no native compilation) + drizzle-orm/sqlite-proxy
Validation: Zod (zod/v4), drizzle-zod
API codegen: Orval (from OpenAPI spec in artifacts/api-spec/openapi.yaml)
Build: esbuild (ESM bundle)
Frontend: React 19 + Vite 7 + Tailwind CSS v4 + shadcn/ui

Artifacts

Artifact	Path	Port
`@workspace/raqim`	`/`	$PORT (23567)
`@workspace/api-server`	`/api`	$PORT (8080)

Key Commands

pnpm run typecheck — full typecheck across all packages
pnpm run typecheck:libs — build composite libs (run before API/frontend typecheck)
pnpm --filter @workspace/raqim run typecheck — frontend only
pnpm --filter @workspace/api-server run typecheck — backend only
pnpm --filter @workspace/api-spec run codegen — regenerate API hooks and Zod schemas from OpenAPI spec
node scripts/seed-admin.mjs — seed the admin user (run once after first startup)

Database

Engine: node:sqlite (Node 24 built-in, zero native deps)
ORM: drizzle-orm/sqlite-proxy (async callback-based wrapper)
Dev DB: artifacts/api-server/raqim.db (auto-created on startup)
Docker/HF DB: /data/raqim.db (set via DB_PATH env var)
Migrations: run automatically at server startup from lib/db/drizzle/*.sql
Backup: scripts/db-hf-sync.mjs — git-based push/pull to a private HF Dataset
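
The path selection and startup migration step above can be sketched as follows. This is a minimal illustration, not the project's code: `resolveDbPath` and `runMigrations` are hypothetical names, the `exec` callback stands in for whatever executes SQL against the database, and the sketch assumes the `.sql` files are safe to apply in lexical order on every start (the source does not say how applied migrations are tracked).

```typescript
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Dev default taken from the list above; DB_PATH overrides it (e.g. /data/raqim.db on HF).
const DEV_DB_PATH = "artifacts/api-server/raqim.db";

function resolveDbPath(env: Record<string, string | undefined>): string {
  return env["DB_PATH"] ?? DEV_DB_PATH;
}

// Hypothetical helper: applies every .sql file in `dir` in lexical order.
// Assumption: the files are idempotent or tracked elsewhere; no journal is kept here.
function runMigrations(dir: string, exec: (sql: string) => void): string[] {
  const files = readdirSync(dir)
    .filter((name) => name.endsWith(".sql"))
    .sort();
  for (const file of files) {
    exec(readFileSync(join(dir, file), "utf8"));
  }
  return files; // returned so the caller can log what was applied
}
```

At startup this would be called as `runMigrations("lib/db/drizzle", (sql) => db.exec(sql))`, where `db` is the opened database handle.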

sqlite-proxy callback contract (lib/db/src/index.ts)

method === "run" → stmt.run(...params), return { rows: [] }
method === "get" → stmt.get(...params), return { rows: Object.values(row) } (flat array, NOT wrapped)
method === "all" → stmt.all(...params), return { rows: result.map(r => Object.values(r)) } (array of arrays)
method === "values" → same as all
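
The contract above can be sketched as a single mapping function. This is an illustration under stated assumptions, not the contents of lib/db/src/index.ts: `executeForProxy` is a hypothetical name, `StatementLike` is a local stand-in for a prepared statement with `run`/`get`/`all`, and the handling of a missing row in the `get` branch is a guess, since the source does not describe it.

```typescript
type Row = Record<string, unknown>;

// Local stand-in for a prepared statement, so the sketch compiles without a database.
interface StatementLike {
  run(...params: unknown[]): unknown;
  get(...params: unknown[]): Row | undefined;
  all(...params: unknown[]): Row[];
}

type ProxyMethod = "run" | "get" | "all" | "values";

// Hypothetical helper: maps one statement onto the result shape listed above.
function executeForProxy(
  stmt: StatementLike,
  params: unknown[],
  method: ProxyMethod,
): { rows: unknown[] } {
  switch (method) {
    case "run":
      stmt.run(...params);
      return { rows: [] };
    case "get": {
      const row = stmt.get(...params);
      // Flat array of column values, not wrapped in an outer array.
      // Assumption: a missing row is reported as an empty array.
      return { rows: row ? Object.values(row) : [] };
    }
    case "all":
    case "values":
      // One inner array of column values per row.
      return { rows: stmt.all(...params).map((r) => Object.values(r)) };
  }
}
```

In a real setup the statement would come from `DatabaseSync.prepare(sql)` in node:sqlite, and the function would be called from the async callback passed to `drizzle()` from drizzle-orm/sqlite-proxy.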

Admin seed

node scripts/seed-admin.mjs — inserts admin@raqim.app / Admin1234! (bcrypt hash, role=admin, status=active). Respects DB_PATH, ADMIN_EMAIL, ADMIN_PASSWORD, ADMIN_NAME env vars.

Database Schema (lib/db/src/schema/)

users — id, email, passwordHash, displayName, role (user|admin), status (pending|active|suspended), lastLoginAt
refresh_tokens — id, userId, token (unique), expiresAt
folders — id, name, parentId, ownerId, trashed, trashedAt
files — id, name, folderId, ownerId, originalName, originalType, sizeBytes, status (uploading|queued|processing|done|failed|trashed), markdownContent, originalMarkdown, qualityScore, wordCount, language, trashedAt
conversions — id, fileId, userId, status (queued|analyzing|routing|ocr|layout|scoring|merging|cleanup|done|failed), progress, steps (jsonb), elapsedSeconds, estimatedSeconds, errorMessage
shares — id, fileId, folderId, ownerId, sharedWithId, permission (read|edit|full)

All timestamps stored as INTEGER (milliseconds since epoch). Enums stored as TEXT.
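
As a small illustration of these two storage rules, the sketch below models the `files.status` enum as a TypeScript union with a runtime guard and converts timestamps to and from epoch milliseconds. The helper names are hypothetical and not taken from lib/db/src/schema/.

```typescript
// Enum columns are TEXT, so a union type plus a runtime guard covers them.
const FILE_STATUSES = ["uploading", "queued", "processing", "done", "failed", "trashed"] as const;
type FileStatus = (typeof FILE_STATUSES)[number];

function isFileStatus(value: string): value is FileStatus {
  return (FILE_STATUSES as readonly string[]).includes(value);
}

// Timestamp columns are INTEGER milliseconds since the Unix epoch.
function toDbTimestamp(date: Date): number {
  return date.getTime();
}

function fromDbTimestamp(ms: number): Date {
  return new Date(ms);
}
```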

Auth

JWT accessToken (1h) in memory + refreshToken (30d) in httpOnly cookie
SESSION_SECRET env var used as JWT secret
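refreshToken includes a random jti (UUID), so two tokens issued to the same user at the same moment stay distinct and do not collide on the unique refresh_tokens.token column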
Admin seed: admin@raqim.app / Admin1234!
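
The role of the jti claim can be shown with a dependency-free sketch. Everything here is an assumption made for illustration: the hand-rolled HS256 signing stands in for whatever JWT library the project uses, and `issueRefreshToken`, its parameters, and the claim layout (`sub`, `iat`, `exp`) are hypothetical.

```typescript
import { createHmac, randomUUID } from "node:crypto";

const toBase64Url = (input: string): string => Buffer.from(input, "utf8").toString("base64url");

// Minimal HS256 signing, used only to keep the example self-contained.
function signJwt(payload: Record<string, unknown>, secret: string): string {
  const header = toBase64Url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const body = toBase64Url(JSON.stringify(payload));
  const signature = createHmac("sha256", secret).update(`${header}.${body}`).digest("base64url");
  return `${header}.${body}.${signature}`;
}

const THIRTY_DAYS_SECONDS = 30 * 24 * 60 * 60;

// Hypothetical helper: without `jti`, two tokens for the same user issued in the
// same second would be byte-identical and would collide on a unique column.
function issueRefreshToken(userId: string, secret: string, nowSeconds: number): string {
  return signJwt(
    { sub: userId, iat: nowSeconds, exp: nowSeconds + THIRTY_DAYS_SECONDS, jti: randomUUID() },
    secret,
  );
}
```

Here `secret` would be the value of the SESSION_SECRET env var.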

Frontend Pages

Route	Page	Auth
`/`	LandingPage	public
`/login`	LoginPage	public
`/request-access`	RequestAccessPage	public
`/dashboard`	DashboardPage	protected
`/upload`	UploadPage	protected
`/editor/:fileId`	EditorPage	protected
`/convert/:jobId`	ConversionPage	protected
`/shared`	SharedPage	protected
`/trash`	TrashPage	protected
`/admin/users`	AdminUsersPage	admin
`/admin/requests`	AdminRequestsPage	admin
`/admin/stats`	AdminStatsPage	admin

Design System

Dark sanctuary theme: --background: #0d0d0f, --primary: #f5a623 (amber gold)
Fonts: Fraunces (display) + Noto Naskh Arabic (body)
All pages use dir="rtl" for Arabic-first layout
shadcn/ui components with custom CSS vars

API Routes

POST /api/auth/login — login, returns accessToken + sets refresh cookie
POST /api/auth/refresh — refresh accessToken using cookie
POST /api/auth/logout — clear refresh cookie
GET/POST /api/folders — list/create folders
PATCH/DELETE /api/folders/:id — update/trash folder
GET/POST /api/files — list/upload files
GET/PATCH/DELETE /api/files/:id — get/update/trash file
GET/PUT /api/files/:id/content — get/save markdown content
POST /api/files/:id/restore — restore from trash
POST /api/files/:id/export — export to md/docx/pdf
GET /api/files/:id/download — download file
POST /api/convert — start conversion job
GET /api/convert/:jobId — poll conversion status
GET/POST/PATCH/DELETE /api/shares — share management
GET /api/shares/search-users — search users to share with
GET /api/admin/users — list users
POST /api/admin/users — create user
PATCH/DELETE /api/admin/users/:id — update/delete user
GET /api/admin/join-requests — list pending join requests
PATCH /api/admin/join-requests/:id — approve/reject request
GET /api/admin/stats — platform statistics
GET /api/admin/trash — list all trashed items
DELETE /api/admin/trash/empty — empty trash

Arabic PDF Text Extraction Pipeline

Many Arabic PDFs have broken ToUnicode CMap font tables. For these files pdfjs-dist returns the correct Unicode codepoints but in the wrong order (e.g. امحلد هلل instead of الحمد لله). The pipeline uses three stages, all of them free: no paid API keys and no page limits.

Stage 1: pdfjs-dist text extraction

Custom RTL-aware extractor that buckets text items by Y coordinate, sorts right-to-left, and joins with gap-based space insertion. Fast, zero dependencies.
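
The bucketing, right-to-left sort, and gap-based spacing described above can be sketched as below. This is not RAQIM's extractor: `joinRtlLines` and `PositionedText` are hypothetical names, and both thresholds are illustrative defaults. In pdfjs-dist the coordinates would come from each text item's `transform[4]`, `transform[5]`, and `width`.

```typescript
// Simplified text item: (x, y) is the left end of the baseline in PDF units (y grows upward).
interface PositionedText {
  str: string;
  x: number;
  y: number;
  width: number;
}

// Hypothetical helper; yTolerance and gapThreshold are assumptions, not project values.
function joinRtlLines(items: PositionedText[], yTolerance = 2, gapThreshold = 1.5): string {
  // 1. Bucket items whose baselines fall into the same Y slot.
  const buckets = new Map<number, PositionedText[]>();
  for (const item of items) {
    const key = Math.round(item.y / yTolerance);
    const bucket = buckets.get(key) ?? [];
    bucket.push(item);
    buckets.set(key, bucket);
  }

  // 2. Emit lines top to bottom (largest Y first in PDF space).
  const keys = [...buckets.keys()].sort((a, b) => b - a);
  return keys
    .map((key) => {
      // 3. Within a line, read right to left: largest X first.
      const line = (buckets.get(key) ?? []).sort((a, b) => b.x - a.x);
      let text = "";
      let previous: PositionedText | undefined;
      for (const item of line) {
        // 4. Gap between the previous item's left edge and this item's right edge.
        if (previous && previous.x - (item.x + item.width) > gapThreshold) text += " ";
        text += item.str;
        previous = item;
      }
      return text;
    })
    .join("\n");
}
```

Rounding to a Y slot is the simplest bucketing rule; two items that straddle a slot boundary would land on separate lines, so a production extractor may need a more tolerant grouping.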

Stage 2: Garbling detection → VLM-OCR → Tesseract fallback

isGarbledArabic() checks for telltale CMap transposition patterns (يف for في, امحلد for الحمد, bidi-wrapped Latin noise like ‎OA‏, etc.).
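
A minimal sketch of this kind of check follows. It is not the project's isGarbledArabic(): the function name `looksGarbledArabic`, the token list (limited to the examples quoted in this document), and the 2% threshold are all assumptions.

```typescript
// Transposed tokens quoted in this document: يف (for في), امحلد (for الحمد), هلل (for لله).
const GARBLED_TOKENS = new Set(["يف", "امحلد", "هلل"]);

// Short Latin run wrapped in LRM (U+200E) and RLM (U+200F) marks.
const BIDI_WRAPPED_LATIN = /\u200E[A-Za-z]{1,3}\u200F/;

// Hypothetical heuristic; minRatio is an illustrative default.
function looksGarbledArabic(text: string, minRatio = 0.02): boolean {
  if (BIDI_WRAPPED_LATIN.test(text)) return true;
  // Only tokens that contain Arabic letters count toward the ratio.
  const tokens = text.split(/\s+/).filter((t) => /[\u0600-\u06FF]/.test(t));
  if (tokens.length === 0) return false;
  const hits = tokens.filter((t) => GARBLED_TOKENS.has(t)).length;
  return hits / tokens.length >= minRatio;
}
```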

If garbled, extractPdfViaOcr() runs the following pipeline:

pdftoppm renders every page to PNG: 200 DPI when HF_TOKEN is available (VLM-optimised), 300 DPI in Tesseract-only mode
Per page: try VLM-OCR via HF Inference API first (best Arabic accuracy)
- Primary: allenai/olmOCR-7B-0225-preview — Allen Institute, fine-tuned on document OCR, #1 on Arabic KITAB-Bench
- Fallback: Qwen/Qwen2.5-VL-7B-Instruct — strong VLM with excellent Arabic
If the VLM fails or is rate-limited: Tesseract.js, lazy-initialised (created only when needed), with Arabic+English models

After OCR, cleanOcrOutput() filters each line: it drops lines that are more than 80% Latin characters and contain fewer than 4 Arabic characters (decorative page noise and artefacts such as Me NY 1 or dl pl a gl), and keeps every line with at least 4 Arabic characters.
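
The line filter can be sketched as follows. This is a hypothetical re-implementation (`filterOcrLines` is not the project's function), and it assumes the Latin share is measured against letters only (Latin plus Arabic), so digits and punctuation do not dilute it. The Arabic count uses the U+0600 to U+06FF block, which also includes Arabic punctuation and digits.

```typescript
const countMatches = (line: string, pattern: RegExp): number => (line.match(pattern) ?? []).length;

// Hypothetical helper mirroring the rule above: drop a line only when it is
// more than 80% Latin AND has fewer than 4 Arabic characters.
function filterOcrLines(text: string): string {
  return text
    .split("\n")
    .filter((line) => {
      const arabic = countMatches(line, /[\u0600-\u06FF]/g);
      if (arabic >= 4) return true; // real Arabic content is always kept
      const latin = countMatches(line, /[A-Za-z]/g);
      const letters = latin + arabic;
      const latinShare = letters === 0 ? 0 : latin / letters;
      return latinShare <= 0.8; // mostly-Latin lines are treated as noise
    })
    .join("\n");
}
```

Lines without any letters (page numbers, blank lines) have a Latin share of zero under this definition and pass through unchanged.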

No page cap — processes every page in the document.

Stage 3: AI text correction — full HF model access, 6-model fallback chain

Endpoint priority chain (tried in order, falls back on rate-limit / 5xx):

Replit AI proxy (AI_INTEGRATIONS_OPENAI_BASE_URL) — auto on Replit, uses gpt-4o
HF: Qwen/Qwen3-72B — best open-source Arabic model (Apr 2025), thinking disabled via /no_think
HF: Qwen/Qwen3-30B-A3B — MoE, fast and very capable
HF: Qwen/Qwen2.5-72B-Instruct — proven Arabic quality
HF: meta-llama/Llama-3.3-70B-Instruct — strong multilingual
HF: mistralai/Mistral-Nemo-Instruct-2407 — fast 12B fallback

Chunk size: 3000 chars. Timeout: 120s per chunk. Temperature: 0.1. Qwen3 <think> blocks stripped from output. On rate-limit, activeEpIdx advances to next model and continues without interruption.
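
The chunking, think-block stripping, and endpoint fallback can be sketched together. This is an illustration under assumptions rather than the project's code: all function names are hypothetical, the splitter uses plain fixed-width slicing, the HTTP call (with its 120s timeout and temperature 0.1) is injected as `call`, and a chunk is left uncorrected if every endpoint fails.

```typescript
// Fixed-width slicing; the real splitter may respect paragraph boundaries.
function splitIntoChunks(text: string, size = 3000): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) chunks.push(text.slice(i, i + size));
  return chunks;
}

// Removes Qwen3 reasoning blocks from a model reply.
function stripThinkBlocks(reply: string): string {
  return reply.replace(/<think>[\s\S]*?<\/think>/g, "").trim();
}

// Rate limits (429) and server errors (5xx) move the chain to the next endpoint.
function shouldFallBack(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

interface EndpointReply {
  status: number;
  text: string;
}
type CallEndpoint = (endpoint: string, chunk: string) => Promise<EndpointReply>;

// Hypothetical driver. activeEpIdx persists across chunks, so once an endpoint
// is rate-limited, later chunks start from the next one.
async function correctText(text: string, endpoints: string[], call: CallEndpoint): Promise<string> {
  let activeEpIdx = 0;
  const corrected: string[] = [];
  for (const chunk of splitIntoChunks(text)) {
    let output = chunk; // assumption: keep the original chunk if nothing succeeds
    for (;;) {
      const endpoint = endpoints[activeEpIdx];
      if (endpoint === undefined) break; // chain exhausted
      const reply = await call(endpoint, chunk);
      if (reply.status === 200) {
        output = stripThinkBlocks(reply.text);
        break;
      }
      if (!shouldFallBack(reply.status)) break; // other errors do not advance the chain
      activeEpIdx += 1;
    }
    corrected.push(output);
  }
  return corrected.join("");
}
```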

No OPENAI_API_KEY needed. HF_TOKEN secret (already required for HF Spaces) doubles as the AI key.

Architect Engine — 100% Free, No External APIs, No Limits

The "Super Architect" is a fully deterministic rule-based engine (runRuleBasedArchitect in convert.ts). Zero external API calls. Zero cost. Zero rate limits. Runs entirely on the server.

Arabic document rules:

Metadata table: detects مادة/زمن/نموذج/تاريخ (even multiple fields inline on one line) → Markdown table
Document title: promoted to # heading when metadata block precedes it
Section markers: أولاً/ثانياً/ثالثاً/... → ## heading
Questions: س1/سؤال 1/Q1 → **bold** with blank lines
Multiple choice: أ- X ب- Y ج- Z د- W or a) X b) Y c) Z d) W → vertical bullet list
Keywords: التعليل:/الإجابة:/المطلوب:/الحل: → separate lines
OCR noise: box-drawing chars, control chars removed
Lone short lines (surrounded by blanks) → ### subheading
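
Three of the Arabic rules above (section markers, questions, inline multiple choice) can be sketched as a per-line formatter. This is not runRuleBasedArchitect: `formatArabicLine` is a hypothetical name, and the patterns cover only the markers that this document spells out.

```typescript
// Only the three ordinals spelled out above; the real list is longer.
const SECTION_MARKER = /^(?:أولاً|ثانياً|ثالثاً)/;
// س1, سؤال 1 or Q1, with ASCII or Arabic-Indic digits.
const QUESTION = /^(?:س|سؤال|Q)\s*[\d\u0660-\u0669]+/;

// Hypothetical helper covering a subset of the rules listed above.
function formatArabicLine(line: string): string {
  const trimmed = line.trim();
  if (SECTION_MARKER.test(trimmed)) return `## ${trimmed}`;
  // Questions become bold and are surrounded by blank lines.
  if (QUESTION.test(trimmed)) return `\n**${trimmed}**\n`;
  // Inline choices "أ- X ب- Y ج- Z د- W" become one bullet per choice.
  if (/^أ-\s/.test(trimmed)) {
    return trimmed
      .split(/\s+(?=[أبجد]-\s)/)
      .map((choice) => `- ${choice}`)
      .join("\n");
  }
  return trimmed;
}
```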

Latin/English document rules:

ALL CAPS short lines → ### heading
Q1:/Question N:/Section/Part → bold / ## heading
a) b) c) d) inline choices → vertical list
Existing Markdown kept as-is
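
The Latin rules can be sketched in the same per-line style. Again this is a hypothetical illustration (`formatLatinLine` and the 60-character limit for a "short" line are assumptions), covering the ALL CAPS heading, inline choices, and Markdown pass-through rules.

```typescript
// Hypothetical helper; maxHeadingLength is an illustrative cut-off for "short".
function formatLatinLine(line: string, maxHeadingLength = 60): string {
  const trimmed = line.trim();
  // Existing Markdown (headings, bullets, tables, quotes) passes through untouched.
  if (/^(?:#{1,6}\s|[-*+]\s|\||>)/.test(trimmed)) return line;
  // Inline choices "a) X b) Y c) Z d) W" become one bullet per choice.
  if (/^a\)\s/i.test(trimmed)) {
    return trimmed
      .split(/\s+(?=[b-d]\)\s)/i)
      .map((choice) => `- ${choice}`)
      .join("\n");
  }
  // A short line with letters and no lowercase is treated as a heading.
  const hasLetters = /[A-Z]/.test(trimmed);
  if (hasLetters && trimmed === trimmed.toUpperCase() && trimmed.length <= maxHeadingLength) {
    return `### ${trimmed}`;
  }
  return line;
}
```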

Supported File Types for Conversion

ALL file types accepted (500MB max):

PDF — pdf-parse
DOCX/DOC — mammoth (HTML extraction)
PPTX/PPT — jszip (slide text extraction)
XLSX/XLS/CSV — xlsx (table to markdown)
Images (PNG, JPG, JPEG, WEBP, GIF, BMP, TIFF) — Tesseract.js OCR (Arabic+English)
HTML/HTM — custom HTML-to-text
EPUB — jszip (chapter text extraction)
TXT/MD — direct read
Any other — UTF-8 text fallback
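
The mapping above amounts to a lookup by file extension with a UTF-8 fallback. The sketch below shows one way to express it; `pickExtractor` and the `Extractor` labels are hypothetical and only name the library listed for each type.

```typescript
type Extractor =
  | "pdf-parse"
  | "mammoth"
  | "jszip"
  | "xlsx"
  | "tesseract.js"
  | "html-to-text"
  | "direct-read"
  | "utf8-fallback";

// A Map avoids accidental hits on Object.prototype keys such as "constructor".
const EXTRACTOR_BY_EXTENSION = new Map<string, Extractor>([
  ["pdf", "pdf-parse"],
  ["docx", "mammoth"], ["doc", "mammoth"],
  ["pptx", "jszip"], ["ppt", "jszip"], ["epub", "jszip"],
  ["xlsx", "xlsx"], ["xls", "xlsx"], ["csv", "xlsx"],
  ["png", "tesseract.js"], ["jpg", "tesseract.js"], ["jpeg", "tesseract.js"],
  ["webp", "tesseract.js"], ["gif", "tesseract.js"], ["bmp", "tesseract.js"],
  ["tiff", "tesseract.js"],
  ["html", "html-to-text"], ["htm", "html-to-text"],
  ["txt", "direct-read"], ["md", "direct-read"],
]);

// Hypothetical helper: picks an extractor from the file name alone.
function pickExtractor(fileName: string): Extractor {
  const dot = fileName.lastIndexOf(".");
  const extension = dot === -1 ? "" : fileName.slice(dot + 1).toLowerCase();
  return EXTRACTOR_BY_EXTENSION.get(extension) ?? "utf8-fallback";
}
```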

Export

Markdown (.md) — direct download
Word (.docx) — real RTL DOCX via docx npm package:
- Traditional Arabic font (30pt body, 44pt headings)
- 100% RTL bidirectional support
- Numbered header/footer with "منسق آلياً بواسطة رقيم"
- Proper bullet/numbered lists, tables, inline bold/italic/code
PDF — currently exports as markdown (browser-based PDF printing recommended)

Frontend Animations

Framer Motion used throughout (already in devDependencies):

Upload zone: spring scale on hover/drag, AnimatePresence for state transitions
File upload progress bar with smooth motion
Conversion steps: staggered fade-in, animated step indicators
Live AI message cycling during processing steps
Success/failure states with spring animations

Hugging Face Spaces Deployment

Dockerfile — multi-stage build, runs as non-root, DB at /data/raqim.db
deploy-hf.sh — one-command deploy (requires HF_TOKEN + HF_USERNAME Replit secrets)
scripts/db-hf-sync.mjs — pull/push SQLite DB to private HF Dataset (${HF_USERNAME}/raqim-db) via git
Space secrets to set after deploy: JWT_SECRET, SESSION_SECRET
No DATABASE_URL needed — SQLite is fully embedded

See the pnpm-workspace skill for workspace structure, TypeScript setup, and package details.