| # RAQIM — رقيم | |
| Arabic-first document-to-Markdown conversion platform. | |
| ## Overview | |
| pnpm workspace monorepo using TypeScript. Full-stack: React+Vite frontend, Express backend, SQLite (node:sqlite built-in) + Drizzle ORM. Deployment target: Hugging Face Spaces (Docker). | |
| ## Stack | |
| - **Monorepo tool**: pnpm workspaces | |
| - **Node.js version**: 24 | |
| - **Package manager**: pnpm | |
| - **TypeScript version**: 5.9 | |
| - **API framework**: Express 5 | |
| - **Database**: SQLite via `node:sqlite` (built-in, no native compilation) + `drizzle-orm/sqlite-proxy` | |
| - **Validation**: Zod (`zod/v4`), `drizzle-zod` | |
| - **API codegen**: Orval (from OpenAPI spec in `artifacts/api-spec/openapi.yaml`) | |
| - **Build**: esbuild (ESM bundle) | |
| - **Frontend**: React 19 + Vite 7 + Tailwind CSS v4 + shadcn/ui | |
| ## Artifacts | |
| | Artifact | Path | Port | | |
| |---|---|---| | |
| | `@workspace/raqim` | `/` | $PORT (23567) | | |
| | `@workspace/api-server` | `/api` | $PORT (8080) | | |
| ## Key Commands | |
| - `pnpm run typecheck` — full typecheck across all packages | |
| - `pnpm run typecheck:libs` — build composite libs (run before API/frontend typecheck) | |
| - `pnpm --filter @workspace/raqim run typecheck` — frontend only | |
| - `pnpm --filter @workspace/api-server run typecheck` — backend only | |
| - `pnpm --filter @workspace/api-spec run codegen` — regenerate API hooks and Zod schemas from OpenAPI spec | |
| - `node scripts/seed-admin.mjs` — seed the admin user (run once after first startup) | |
| ## Database | |
| - **Engine**: `node:sqlite` (Node 24 built-in, zero native deps) | |
| - **ORM**: `drizzle-orm/sqlite-proxy` (async callback-based wrapper) | |
| - **Dev DB**: `artifacts/api-server/raqim.db` (auto-created on startup) | |
| - **Docker/HF DB**: `/data/raqim.db` (set via `DB_PATH` env var) | |
| - **Migrations**: run automatically at server startup from `lib/db/drizzle/*.sql` | |
| - **Backup**: `scripts/db-hf-sync.mjs` — git-based push/pull to a private HF Dataset | |
| ### sqlite-proxy callback contract (lib/db/src/index.ts) | |
| - `method === "run"` → `stmt.run(...params)`, return `{ rows: [] }` | |
| - `method === "get"` → `stmt.get(...params)`, return `{ rows: Object.values(row) }` (flat array, NOT wrapped) | |
| - `method === "all"` → `stmt.all(...params)`, return `{ rows: result.map(r => Object.values(r)) }` (array of arrays) | |
| - `method === "values"` → same as `all` | |
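
The contract above can be sketched as a small factory. This is an illustration only, not the code in `lib/db/src/index.ts`: the `Stmt` interface, `makeProxyCallback`, and the injected `prepare` function are hypothetical names, and returning an empty array when `get` finds no row is an assumption.

```typescript
// Minimal statement shape, modelled on what node:sqlite prepared statements expose.
type Row = Record<string, unknown>;
interface Stmt {
  run(...params: unknown[]): unknown;
  get(...params: unknown[]): Row | undefined;
  all(...params: unknown[]): Row[];
}
type Method = "run" | "get" | "all" | "values";
type ProxyResult = { rows: unknown[] };

// Builds the async callback for the sqlite-proxy driver.
// `prepare` is injected so the sketch does not depend on a real database handle.
function makeProxyCallback(prepare: (sql: string) => Stmt) {
  return async (sql: string, params: unknown[], method: Method): Promise<ProxyResult> => {
    const stmt = prepare(sql);
    switch (method) {
      case "run":
        stmt.run(...params);
        return { rows: [] };
      case "get": {
        const row = stmt.get(...params);
        // Flat array of column values; empty when no row matched (assumption).
        return { rows: row ? Object.values(row) : [] };
      }
      case "all":
      case "values":
        // Array of arrays, one inner array per row.
        return { rows: stmt.all(...params).map((r) => Object.values(r)) };
    }
  };
}
```

In a real setup, `prepare` would most likely be `(sql) => db.prepare(sql)` on a `DatabaseSync` instance, with the resulting callback passed to `drizzle()` from `drizzle-orm/sqlite-proxy`.
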
| ### Admin seed | |
| `node scripts/seed-admin.mjs` — inserts `admin@raqim.app` / `Admin1234!` (bcrypt hash, role=admin, status=active). | |
| Respects `DB_PATH`, `ADMIN_EMAIL`, `ADMIN_PASSWORD`, `ADMIN_NAME` env vars. | |
| ## Database Schema (lib/db/src/schema/) | |
| - **users** — id, email, passwordHash, displayName, role (user|admin), status (pending|active|suspended), lastLoginAt | |
| - **refresh_tokens** — id, userId, token (unique), expiresAt | |
| - **folders** — id, name, parentId, ownerId, trashed, trashedAt | |
| - **files** — id, name, folderId, ownerId, originalName, originalType, sizeBytes, status (uploading|queued|processing|done|failed|trashed), markdownContent, originalMarkdown, qualityScore, wordCount, language, trashedAt | |
| - **conversions** — id, fileId, userId, status (queued|analyzing|routing|ocr|layout|scoring|merging|cleanup|done|failed), progress, steps (jsonb), elapsedSeconds, estimatedSeconds, errorMessage | |
| - **shares** — id, fileId, folderId, ownerId, sharedWithId, permission (read|edit|full) | |
| All timestamps stored as `INTEGER` (milliseconds since epoch). Enums stored as `TEXT`. | |
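
Because timestamps are plain integers and enums are plain text, values need converting and validating at the database boundary. A minimal sketch, assuming hypothetical helper names (`parseFileStatus`, `toDbTimestamp`, `fromDbTimestamp`) that are not part of the project; the status list is copied from the `files` table above.

```typescript
// Status values as listed for the `files` table.
const FILE_STATUSES = ["uploading", "queued", "processing", "done", "failed", "trashed"] as const;
type FileStatus = (typeof FILE_STATUSES)[number];

// Narrow a TEXT column value to the FileStatus union, rejecting unknown strings.
function parseFileStatus(value: string): FileStatus {
  const match = FILE_STATUSES.find((s) => s === value);
  if (match === undefined) throw new Error(`Unknown file status: ${value}`);
  return match;
}

// Timestamps cross the DB boundary as integer milliseconds since the Unix epoch.
function toDbTimestamp(date: Date): number {
  return date.getTime();
}

function fromDbTimestamp(ms: number): Date {
  return new Date(ms);
}
```
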
| ## Auth | |
| - JWT accessToken (1h) in memory + refreshToken (30d) in httpOnly cookie | |
| - `SESSION_SECRET` env var used as JWT secret | |
| - refreshToken includes random `jti` UUID to prevent duplicate key collisions | |
| - Admin seed: `admin@raqim.app` / `Admin1234!` | |
| ## Frontend Pages | |
| | Route | Page | Auth | | |
| |---|---|---| | |
| | `/` | LandingPage | public | | |
| | `/login` | LoginPage | public | | |
| | `/request-access` | RequestAccessPage | public | | |
| | `/dashboard` | DashboardPage | protected | | |
| | `/upload` | UploadPage | protected | | |
| | `/editor/:fileId` | EditorPage | protected | | |
| | `/convert/:jobId` | ConversionPage | protected | | |
| | `/shared` | SharedPage | protected | | |
| | `/trash` | TrashPage | protected | | |
| | `/admin/users` | AdminUsersPage | admin | | |
| | `/admin/requests` | AdminRequestsPage | admin | | |
| | `/admin/stats` | AdminStatsPage | admin | | |
| ## Design System | |
| - Dark sanctuary theme: `--background: #0d0d0f`, `--primary: #f5a623` (amber gold) | |
| - Fonts: Fraunces (display) + Noto Naskh Arabic (body) | |
| - All pages use `dir="rtl"` for Arabic-first layout | |
| - shadcn/ui components with custom CSS vars | |
| ## API Routes | |
| - `POST /api/auth/login` — login, returns accessToken + sets refresh cookie | |
| - `POST /api/auth/refresh` — refresh accessToken using cookie | |
| - `POST /api/auth/logout` — clear refresh cookie | |
| - `GET/POST /api/folders` — list/create folders | |
| - `PATCH/DELETE /api/folders/:id` — update/trash folder | |
| - `GET/POST /api/files` — list/upload files | |
| - `GET/PATCH/DELETE /api/files/:id` — get/update/trash file | |
| - `GET/PUT /api/files/:id/content` — get/save markdown content | |
| - `POST /api/files/:id/restore` — restore from trash | |
| - `POST /api/files/:id/export` — export to md/docx/pdf | |
| - `GET /api/files/:id/download` — download file | |
| - `POST /api/convert` — start conversion job | |
| - `GET /api/convert/:jobId` — poll conversion status | |
| - `GET/POST/PATCH/DELETE /api/shares` — share management | |
| - `GET /api/shares/search-users` — search users to share with | |
| - `GET /api/admin/users` — list users | |
| - `POST /api/admin/users` — create user | |
| - `PATCH/DELETE /api/admin/users/:id` — update/delete user | |
| - `GET /api/admin/join-requests` — list pending join requests | |
| - `PATCH /api/admin/join-requests/:id` — approve/reject request | |
| - `GET /api/admin/stats` — platform statistics | |
| - `GET /api/admin/trash` — list all trashed items | |
| - `DELETE /api/admin/trash/empty` — empty trash | |
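
The `POST /api/convert` and `GET /api/convert/:jobId` pair implies a client-side polling loop. The sketch below shows one way to write it. The `ConversionJob` shape is inferred from the `conversions` table, and `pollConversion`, `fetchJob`, the interval, and the attempt budget are all assumptions rather than the project's actual client code.

```typescript
type ConversionStatus =
  | "queued" | "analyzing" | "routing" | "ocr" | "layout"
  | "scoring" | "merging" | "cleanup" | "done" | "failed";

interface ConversionJob {
  status: ConversionStatus;
  progress: number;
  errorMessage?: string | null;
}

// Polls until the job reaches a terminal status ("done" or "failed").
// `fetchJob` is injected so the loop is independent of the HTTP client.
async function pollConversion(
  fetchJob: () => Promise<ConversionJob>,
  intervalMs = 1000,
  maxAttempts = 600,
): Promise<ConversionJob> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await fetchJob();
    if (job.status === "done" || job.status === "failed") return job;
    // Wait before the next poll.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Conversion did not finish within the polling budget");
}
```

In the browser, `fetchJob` would typically wrap a `fetch` call to `/api/convert/:jobId` that sends the in-memory access token; the exact header and response format are not specified here.
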
| ## Arabic PDF Text Extraction Pipeline | |
| Many Arabic PDFs have broken ToUnicode CMap font tables. For these files, pdfjs-dist returns the correct Unicode codepoints but in the wrong order (e.g. `امحلد هلل` instead of `الحمد لله`). The pipeline uses three stages, all completely free: no paid API keys, no page limits. | |
| ### Stage 1: pdfjs-dist text extraction | |
| Custom RTL-aware extractor that buckets text items by Y coordinate, sorts right-to-left, and joins with gap-based space insertion. Fast, zero dependencies. | |
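
The bucketing and joining described above might look like the following. It is a simplified sketch, not the project's extractor: the `TextItem` shape (already reduced to `x`, `y`, `width`), the function name, and both thresholds are assumptions. Rounding Y to a grid is also a shortcut, since two items just either side of a grid boundary land in different buckets.

```typescript
// Simplified text item: position of the left edge plus rendered width.
interface TextItem {
  str: string;
  x: number;
  y: number;
  width: number;
}

function extractRtlLines(items: TextItem[], yTolerance = 2, gapThreshold = 1.5): string[] {
  // Bucket items into lines by snapping Y to a tolerance grid.
  const buckets = new Map<number, TextItem[]>();
  for (const item of items) {
    const key = Math.round(item.y / yTolerance);
    const bucket = buckets.get(key) ?? [];
    bucket.push(item);
    buckets.set(key, bucket);
  }
  // PDF Y grows upward, so larger keys are nearer the top of the page.
  const keys = Array.from(buckets.keys()).sort((a, b) => b - a);
  return keys.map((key) => {
    // Right-to-left reading order: largest X first.
    const line = (buckets.get(key) ?? []).sort((a, b) => b.x - a.x);
    let text = "";
    for (let i = 0; i < line.length; i++) {
      if (i > 0) {
        // Horizontal gap between the previous item's left edge and this item's right edge.
        const gap = line[i - 1].x - (line[i].x + line[i].width);
        if (gap > gapThreshold) text += " ";
      }
      text += line[i].str;
    }
    return text;
  });
}
```
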
| ### Stage 2: Garbling detection → VLM-OCR → Tesseract fallback | |
| `isGarbledArabic()` checks for telltale CMap transposition patterns (` يف ` for `في`, `امحلد` for `الحمد`, bidi-wrapped Latin noise like `OA`, etc.). | |
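
A marker-counting check in the spirit of `isGarbledArabic()` is sketched below. The function name `looksGarbled`, the marker list (only the two transpositions quoted above), and the `minHits` threshold are assumptions; the real detector covers more patterns, including the bidi-wrapped Latin noise.

```typescript
// Known CMap transposition artefacts: " يف " for "في" and "امحلد" for "الحمد".
const GARBLE_MARKERS: string[] = [" يف ", "امحلد"];

// Counts non-overlapping occurrences of `needle` in `text`.
function countOccurrences(text: string, needle: string): number {
  return text.split(needle).length - 1;
}

// Returns true when at least `minHits` garbling markers appear in the text.
function looksGarbled(text: string, minHits = 1): boolean {
  let hits = 0;
  for (const marker of GARBLE_MARKERS) {
    hits += countOccurrences(text, marker);
  }
  return hits >= minHits;
}
```
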
| If garbled, `extractPdfViaOcr()` runs the following pipeline: | |
| 1. `pdftoppm` renders every page to PNG: **200 DPI** when `HF_TOKEN` is available (VLM-optimised), **300 DPI** for Tesseract-only mode | |
| 2. **Per page**: try VLM-OCR via HF Inference API first (best Arabic accuracy) | |
| - Primary: `allenai/olmOCR-7B-0225-preview` — Allen Institute, fine-tuned on document OCR, #1 on Arabic KITAB-Bench | |
| - Fallback: `Qwen/Qwen2.5-VL-7B-Instruct` — strong VLM with excellent Arabic | |
| 3. **If the VLM fails or is rate-limited**: fall back to Tesseract.js, lazily initialised (created only when needed), with Arabic+English models | |
| After OCR, `cleanOcrOutput()` filters each line: drops lines >80% Latin chars with <4 Arabic chars (decorative page noise, artefacts like `Me NY 1`, `dl pl a gl`). Keeps all lines with ≥4 Arabic characters. | |
| **No page cap** — processes every page in the document. | |
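
The line filter applied by `cleanOcrOutput()` can be approximated as follows. This is a sketch under two assumptions: the Latin ratio is computed over letters only (Latin plus Arabic-block characters), and the Arabic count uses the whole U+0600 to U+06FF block, which also contains Arabic digits and punctuation. The name `cleanOcrLines` is hypothetical.

```typescript
const ARABIC = /[\u0600-\u06FF]/g;
const LATIN = /[A-Za-z]/g;

function countMatches(line: string, pattern: RegExp): number {
  return (line.match(pattern) ?? []).length;
}

// Drops lines that are more than 80% Latin and carry fewer than 4 Arabic characters.
function cleanOcrLines(text: string): string {
  return text
    .split("\n")
    .filter((line) => {
      const arabic = countMatches(line, ARABIC);
      if (arabic >= 4) return true; // always keep lines with enough Arabic
      const latin = countMatches(line, LATIN);
      const letters = latin + arabic;
      if (letters === 0) return true; // blank or digit-only lines pass through
      return latin / letters <= 0.8;
    })
    .join("\n");
}
```
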
| ### Stage 3: AI text correction — full HF model access, 6-model fallback chain | |
| Endpoint priority chain (tried in order, falls back on rate-limit / 5xx): | |
| 1. **Replit AI proxy** (`AI_INTEGRATIONS_OPENAI_BASE_URL`) — auto on Replit, uses `gpt-4o` | |
| 2. **HF: Qwen/Qwen3-72B** — best open-source Arabic model (Apr 2025), thinking disabled via `/no_think` | |
| 3. **HF: Qwen/Qwen3-30B-A3B** — MoE, fast and very capable | |
| 4. **HF: Qwen/Qwen2.5-72B-Instruct** — proven Arabic quality | |
| 5. **HF: meta-llama/Llama-3.3-70B-Instruct** — strong multilingual | |
| 6. **HF: mistralai/Mistral-Nemo-Instruct-2407** — fast 12B fallback | |
| Chunk size: 3000 chars. Timeout: 120s per chunk. Temperature: 0.1. Qwen3 `<think>` blocks stripped from output. On rate-limit, `activeEpIdx` advances to next model and continues without interruption. | |
| No `OPENAI_API_KEY` needed. `HF_TOKEN` secret (already required for HF Spaces) doubles as the AI key. | |
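
The chunking and sticky fallback behaviour can be sketched like this. Everything other than `activeEpIdx` and the 3000-character chunk size is an assumption: `callModel` is an injected stand-in for the HTTP call and signals a rate-limit or 5xx by resolving to `null`, and the fixed-width splitter ignores word and paragraph boundaries, which the real code may respect.

```typescript
interface Endpoint {
  name: string;
}

const CHUNK_SIZE = 3000;

// Naive fixed-width splitter.
function splitIntoChunks(text: string, size = CHUNK_SIZE): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) chunks.push(text.slice(i, i + size));
  return chunks;
}

// Removes Qwen3-style <think>...</think> blocks and the whitespace that follows them.
function stripThinkBlocks(output: string): string {
  return output.replace(/<think>[\s\S]*?<\/think>\s*/g, "");
}

async function correctWithFallback(
  text: string,
  endpoints: Endpoint[],
  callModel: (endpoint: Endpoint, chunk: string) => Promise<string | null>,
): Promise<string> {
  let activeEpIdx = 0;
  const corrected: string[] = [];
  for (const chunk of splitIntoChunks(text)) {
    for (;;) {
      if (activeEpIdx >= endpoints.length) throw new Error("All endpoints exhausted");
      const result = await callModel(endpoints[activeEpIdx], chunk);
      if (result !== null) {
        corrected.push(stripThinkBlocks(result));
        break;
      }
      // Rate-limited or 5xx: advance and stay on the next endpoint for later chunks.
      activeEpIdx++;
    }
  }
  return corrected.join("");
}
```

Keeping `activeEpIdx` outside the chunk loop is what makes the fallback sticky: once an endpoint is rate-limited, later chunks do not retry it.
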
| ## Architect Engine — 100% Free, No External APIs, No Limits | |
| The "Super Architect" is a fully deterministic rule-based engine (`runRuleBasedArchitect` in `convert.ts`). | |
| Zero external API calls. Zero cost. Zero rate limits. Runs entirely on the server. | |
| **Arabic document rules:** | |
| - Metadata table: detects مادة/زمن/نموذج/تاريخ (even multiple fields inline on one line) → Markdown table | |
| - Document title: promoted to `# heading` when metadata block precedes it | |
| - Section markers: أولاً/ثانياً/ثالثاً/... → `## heading` | |
| - Questions: س1/سؤال 1/Q1 → `**bold**` with blank lines | |
| - Multiple choice: `أ- X ب- Y ج- Z د- W` or `a) X b) Y c) Z d) W` → vertical bullet list | |
| - Keywords: التعليل:/الإجابة:/المطلوب:/الحل: → separate lines | |
| - OCR noise: box-drawing chars, control chars removed | |
| - Lone short lines (surrounded by blanks) → `### subheading` | |
| **Latin/English document rules:** | |
| - ALL CAPS short lines → `### heading` | |
| - `Q1:`/`Question N:`/`Section`/`Part` → `**bold**` / `## heading` | |
| - `a) b) c) d)` inline choices → vertical list | |
| - Existing Markdown kept as-is | |
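
A few of the rules listed above, written as a line-by-line transform. This is a simplified sketch and not `runRuleBasedArchitect` itself: the function names, the regular expressions, the three-item marker list, and the 40-character cap for ALL CAPS headings are assumptions. Metadata tables, blank-line insertion around questions, and the lone-line subheading rule are left out.

```typescript
const SECTION_MARKERS = ["أولاً", "ثانياً", "ثالثاً"];

// Whitespace that is immediately followed by a choice label such as "b) " or "ب- ".
const CHOICE_SPLIT = /\s+(?=(?:[a-d]\)|[أبجد]-)\s)/;

function applyRules(line: string): string {
  const trimmed = line.trim();
  if (trimmed === "") return "";
  // Section markers → level-2 heading.
  if (SECTION_MARKERS.some((m) => trimmed.startsWith(m))) return `## ${trimmed}`;
  // Questions (س1 / سؤال 1 / Q1) → bold.
  if (/^(?:س\s?\d+|سؤال\s?\d+|Q\d+)/.test(trimmed)) return `**${trimmed}**`;
  // Inline lettered choices → vertical bullet list.
  const choices = trimmed.split(CHOICE_SPLIT);
  if (choices.length >= 2 && /^(?:a\)|أ-)\s/.test(choices[0])) {
    return choices.map((c) => `- ${c}`).join("\n");
  }
  // Short ALL CAPS lines → level-3 heading.
  if (/^[A-Z][A-Z ]{2,39}$/.test(trimmed)) return `### ${trimmed}`;
  return line;
}

function runRulesSketch(text: string): string {
  return text.split("\n").map(applyRules).join("\n");
}
```

Since each rule is a pure function of one line, the pass is deterministic and needs no network access, in line with the engine's zero-API design.
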
| ## Supported File Types for Conversion | |
| ALL file types accepted (500MB max): | |
| - **PDF** — pdf-parse | |
| - **DOCX/DOC** — mammoth (HTML extraction) | |
| - **PPTX/PPT** — jszip (slide text extraction) | |
| - **XLSX/XLS/CSV** — xlsx (table to markdown) | |
| - **Images** (PNG, JPG, JPEG, WEBP, GIF, BMP, TIFF) — Tesseract.js OCR (Arabic+English) | |
| - **HTML/HTM** — custom HTML-to-text | |
| - **EPUB** — jszip (chapter text extraction) | |
| - **TXT/MD** — direct read | |
| - **Any other** — UTF-8 text fallback | |
| ## Export | |
| - **Markdown (.md)** — direct download | |
| - **Word (.docx)** — real RTL DOCX via `docx` npm package: | |
| - Traditional Arabic font (30pt body, 44pt headings) | |
| - 100% RTL bidirectional support | |
| - Numbered header/footer with "منسق آلياً بواسطة رقيم" | |
| - Proper bullet/numbered lists, tables, inline bold/italic/code | |
| - **PDF** — currently exports as markdown (browser-based PDF printing recommended) | |
| ## Frontend Animations | |
| Framer Motion used throughout (already in devDependencies): | |
| - Upload zone: spring scale on hover/drag, AnimatePresence for state transitions | |
| - File upload progress bar with smooth motion | |
| - Conversion steps: staggered fade-in, animated step indicators | |
| - Live AI message cycling during processing steps | |
| - Success/failure states with spring animations | |
| ## Hugging Face Spaces Deployment | |
| - `Dockerfile` — multi-stage build, runs as non-root, DB at `/data/raqim.db` | |
| - `deploy-hf.sh` — one-command deploy (requires `HF_TOKEN` + `HF_USERNAME` Replit secrets) | |
| - `scripts/db-hf-sync.mjs` — pull/push SQLite DB to private HF Dataset (`${HF_USERNAME}/raqim-db`) via git | |
| - Space secrets to set after deploy: `JWT_SECRET`, `SESSION_SECRET` | |
| - No `DATABASE_URL` needed — SQLite is fully embedded | |
| See the `pnpm-workspace` skill for workspace structure, TypeScript setup, and package details. | |