# RAQIM — رقيم
Arabic-first document-to-Markdown conversion platform.
## Overview
pnpm workspace monorepo using TypeScript. Full-stack: React+Vite frontend, Express backend, SQLite (node:sqlite built-in) + Drizzle ORM. Deployment target: Hugging Face Spaces (Docker).
## Stack
- **Monorepo tool**: pnpm workspaces
- **Node.js version**: 24
- **Package manager**: pnpm
- **TypeScript version**: 5.9
- **API framework**: Express 5
- **Database**: SQLite via `node:sqlite` (built-in, no native compilation) + `drizzle-orm/sqlite-proxy`
- **Validation**: Zod (`zod/v4`), `drizzle-zod`
- **API codegen**: Orval (from OpenAPI spec in `artifacts/api-spec/openapi.yaml`)
- **Build**: esbuild (ESM bundle)
- **Frontend**: React 19 + Vite 7 + Tailwind CSS v4 + shadcn/ui
## Artifacts
| Artifact | Path | Port |
|---|---|---|
| `@workspace/raqim` | `/` | $PORT (23567) |
| `@workspace/api-server` | `/api` | $PORT (8080) |
## Key Commands
- `pnpm run typecheck` — full typecheck across all packages
- `pnpm run typecheck:libs` — build composite libs (run before API/frontend typecheck)
- `pnpm --filter @workspace/raqim run typecheck` — frontend only
- `pnpm --filter @workspace/api-server run typecheck` — backend only
- `pnpm --filter @workspace/api-spec run codegen` — regenerate API hooks and Zod schemas from OpenAPI spec
- `node scripts/seed-admin.mjs` — seed the admin user (run once after first startup)
## Database
- **Engine**: `node:sqlite` (Node 24 built-in, zero native deps)
- **ORM**: `drizzle-orm/sqlite-proxy` (async callback-based wrapper)
- **Dev DB**: `artifacts/api-server/raqim.db` (auto-created on startup)
- **Docker/HF DB**: `/data/raqim.db` (set via `DB_PATH` env var)
- **Migrations**: run automatically at server startup from `lib/db/drizzle/*.sql`
- **Backup**: `scripts/db-hf-sync.mjs` — git-based push/pull to a private HF Dataset
### sqlite-proxy callback contract (lib/db/src/index.ts)
- `method === "run"` → `stmt.run(...params)`, return `{ rows: [] }`
- `method === "get"` → `stmt.get(...params)`, return `{ rows: Object.values(row) }` (flat array, NOT wrapped)
- `method === "all"` → `stmt.all(...params)`, return `{ rows: result.map(r => Object.values(r)) }` (array of arrays)
- `method === "values"` → same as `all`
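The contract above can be sketched as a small dispatch function. This is an illustration only, not the code in `lib/db/src/index.ts`: `StatementLike` and `executeForProxy` are hypothetical names, the statement is assumed to expose `run`/`get`/`all` in the style of a `node:sqlite` prepared statement, and returning an empty array when `get` finds no row is an assumption of this sketch.

```typescript
type Row = Record<string, unknown>;
type ProxyMethod = "run" | "get" | "all" | "values";

// Minimal shape of a prepared statement (hypothetical interface for this sketch).
interface StatementLike {
  run(...params: unknown[]): unknown;
  get(...params: unknown[]): Row | undefined;
  all(...params: unknown[]): Row[];
}

// Maps each proxy method to the row shape described in the contract.
function executeForProxy(
  stmt: StatementLike,
  params: unknown[],
  method: ProxyMethod,
): { rows: unknown[] } {
  switch (method) {
    case "run":
      stmt.run(...params);
      return { rows: [] };
    case "get": {
      const row = stmt.get(...params);
      // Flat array of column values; empty when no row matched (assumption).
      return { rows: row ? Object.values(row) : [] };
    }
    case "all":
    case "values":
      // Array of arrays, one inner array per row.
      return { rows: stmt.all(...params).map((r) => Object.values(r)) };
  }
}
```

Keeping the dispatch separate from the database handle makes the row-shape rules easy to check with a fake statement.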
### Admin seed
`node scripts/seed-admin.mjs` — inserts `admin@raqim.app` / `Admin1234!` (bcrypt hash, role=admin, status=active).
Respects `DB_PATH`, `ADMIN_EMAIL`, `ADMIN_PASSWORD`, `ADMIN_NAME` env vars.
## Database Schema (lib/db/src/schema/)
- **users** — id, email, passwordHash, displayName, role (user|admin), status (pending|active|suspended), lastLoginAt
- **refresh_tokens** — id, userId, token (unique), expiresAt
- **folders** — id, name, parentId, ownerId, trashed, trashedAt
- **files** — id, name, folderId, ownerId, originalName, originalType, sizeBytes, status (uploading|queued|processing|done|failed|trashed), markdownContent, originalMarkdown, qualityScore, wordCount, language, trashedAt
- **conversions** — id, fileId, userId, status (queued|analyzing|routing|ocr|layout|scoring|merging|cleanup|done|failed), progress, steps (jsonb), elapsedSeconds, estimatedSeconds, errorMessage
- **shares** — id, fileId, folderId, ownerId, sharedWithId, permission (read|edit|full)
All timestamps stored as `INTEGER` (milliseconds since epoch). Enums stored as `TEXT`.
## Auth
- JWT accessToken (1h) in memory + refreshToken (30d) in httpOnly cookie
- `SESSION_SECRET` env var used as JWT secret
- refreshToken includes random `jti` UUID to prevent duplicate key collisions
- Admin seed: `admin@raqim.app` / `Admin1234!`
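
A minimal sketch of the token pair described above, assuming HS256 signing with the secret from `SESSION_SECRET`. The helper names (`signHs256`, `issueTokens`) and the claim layout are hypothetical. The signing is hand-rolled with `node:crypto` only to keep the example self-contained, and the project may well use a JWT library instead.

```typescript
import { createHmac, randomUUID } from "node:crypto";

const b64url = (input: string): string =>
  Buffer.from(input, "utf8").toString("base64url");

// Hand-rolled HS256 JWT, for illustration only.
function signHs256(payload: Record<string, unknown>, secret: string): string {
  const header = b64url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const body = b64url(JSON.stringify(payload));
  const signature = createHmac("sha256", secret)
    .update(`${header}.${body}`)
    .digest("base64url");
  return `${header}.${body}.${signature}`;
}

const ACCESS_TTL_S = 60 * 60; // 1h
const REFRESH_TTL_S = 30 * 24 * 60 * 60; // 30d

function issueTokens(userId: string, secret: string, nowMs: number = Date.now()) {
  const iat = Math.floor(nowMs / 1000);
  const accessToken = signHs256({ sub: userId, iat, exp: iat + ACCESS_TTL_S }, secret);
  // The random jti keeps two refresh tokens issued in the same second distinct,
  // so the unique `token` column never sees a duplicate value.
  const refreshToken = signHs256(
    { sub: userId, iat, exp: iat + REFRESH_TTL_S, jti: randomUUID() },
    secret,
  );
  return { accessToken, refreshToken };
}
```

Without the `jti`, two logins by the same user within one second would produce byte-identical refresh tokens.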
## Frontend Pages
| Route | Page | Auth |
|---|---|---|
| `/` | LandingPage | public |
| `/login` | LoginPage | public |
| `/request-access` | RequestAccessPage | public |
| `/dashboard` | DashboardPage | protected |
| `/upload` | UploadPage | protected |
| `/editor/:fileId` | EditorPage | protected |
| `/convert/:jobId` | ConversionPage | protected |
| `/shared` | SharedPage | protected |
| `/trash` | TrashPage | protected |
| `/admin/users` | AdminUsersPage | admin |
| `/admin/requests` | AdminRequestsPage | admin |
| `/admin/stats` | AdminStatsPage | admin |
## Design System
- Dark sanctuary theme: `--background: #0d0d0f`, `--primary: #f5a623` (amber gold)
- Fonts: Fraunces (display) + Noto Naskh Arabic (body)
- All pages use `dir="rtl"` for Arabic-first layout
- shadcn/ui components with custom CSS vars
## API Routes
- `POST /api/auth/login` — login, returns accessToken + sets refresh cookie
- `POST /api/auth/refresh` — refresh accessToken using cookie
- `POST /api/auth/logout` — clear refresh cookie
- `GET/POST /api/folders` — list/create folders
- `PATCH/DELETE /api/folders/:id` — update/trash folder
- `GET/POST /api/files` — list/upload files
- `GET/PATCH/DELETE /api/files/:id` — get/update/trash file
- `GET/PUT /api/files/:id/content` — get/save markdown content
- `POST /api/files/:id/restore` — restore from trash
- `POST /api/files/:id/export` — export to md/docx/pdf
- `GET /api/files/:id/download` — download file
- `POST /api/convert` — start conversion job
- `GET /api/convert/:jobId` — poll conversion status
- `GET/POST/PATCH/DELETE /api/shares` — share management
- `GET /api/shares/search-users` — search users to share with
- `GET /api/admin/users` — list users
- `POST /api/admin/users` — create user
- `PATCH/DELETE /api/admin/users/:id` — update/delete user
- `GET /api/admin/join-requests` — list pending join requests
- `PATCH /api/admin/join-requests/:id` — approve/reject request
- `GET /api/admin/stats` — platform statistics
- `GET /api/admin/trash` — list all trashed items
- `DELETE /api/admin/trash/empty` — empty trash
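
As an illustration of how a client might use `POST /api/convert` followed by `GET /api/convert/:jobId`, the sketch below polls until the job reaches a terminal status. The response shape (`status`, `progress`, `errorMessage`) is assumed from the `conversions` table, and `pollConversion` and `FetchJob` are hypothetical names.

```typescript
type ConversionStatus =
  | "queued" | "analyzing" | "routing" | "ocr" | "layout"
  | "scoring" | "merging" | "cleanup" | "done" | "failed";

// Assumed response body for GET /api/convert/:jobId.
interface ConversionJob {
  status: ConversionStatus;
  progress: number;
  errorMessage?: string | null;
}

// The transport is injected so the loop can be exercised without a server.
type FetchJob = (jobId: string) => Promise<ConversionJob>;

async function pollConversion(
  jobId: string,
  fetchJob: FetchJob,
  intervalMs = 1000,
  maxAttempts = 600,
): Promise<ConversionJob> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await fetchJob(jobId);
    // "done" and "failed" are the only terminal statuses.
    if (job.status === "done" || job.status === "failed") return job;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Conversion ${jobId} did not finish after ${maxAttempts} polls`);
}
```

In the browser, `fetchJob` could wrap `fetch` against `/api/convert/${jobId}` with the in-memory accessToken attached. How the token is sent is not specified here.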
## Arabic PDF Text Extraction Pipeline
Many Arabic PDFs have broken ToUnicode CMap font tables. For these, pdfjs-dist returns the correct Unicode codepoints but in the wrong order (e.g. `امحلد هلل` instead of `الحمد لله`). The pipeline uses three stages, all free: no paid API keys and no page limits.
### Stage 1: pdfjs-dist text extraction
Custom RTL-aware extractor that buckets text items by Y coordinate, sorts right-to-left, and joins with gap-based space insertion. Fast, zero dependencies.
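
The bucketing and joining steps can be sketched as follows. This is a simplified illustration, not the project's extractor: the `TextItem` shape is flattened (pdfjs-dist text items carry a transform matrix rather than plain `x`/`y` fields), and the tolerance and gap thresholds are assumed values.

```typescript
// Simplified text item (assumed shape for this sketch).
interface TextItem {
  str: string;
  x: number; // left edge
  y: number; // baseline, PDF coordinates (larger = higher on the page)
  width: number;
}

function extractRtlLines(
  items: TextItem[],
  yTolerance = 2, // assumed: items within 2 units share a line
  gapThreshold = 1.5, // assumed: wider gaps become a space
): string[] {
  // 1. Bucket by Y: sort top-to-bottom, then group items on nearby baselines.
  const sorted = [...items].sort((a, b) => b.y - a.y);
  const buckets: TextItem[][] = [];
  for (const item of sorted) {
    const last = buckets[buckets.length - 1];
    if (last && Math.abs(last[0].y - item.y) <= yTolerance) last.push(item);
    else buckets.push([item]);
  }

  // 2. Within each line, order right-to-left and join with gap-based spaces.
  return buckets.map((line) => {
    const rtl = [...line].sort((a, b) => b.x - a.x);
    let text = "";
    rtl.forEach((item, i) => {
      if (i > 0) {
        const prev = rtl[i - 1];
        // Horizontal gap between the previous (right) item and this one.
        const gap = prev.x - (item.x + item.width);
        if (gap > gapThreshold) text += " ";
      }
      text += item.str;
    });
    return text;
  });
}
```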
### Stage 2: Garbling detection → VLM-OCR → Tesseract fallback
`isGarbledArabic()` checks for telltale CMap transposition patterns (` يف ` for `في`, `امحلد` for `الحمد`, bidi-wrapped Latin noise like `OA`, etc.).
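
A minimal sketch of this kind of marker check, using only the two transposition patterns named above. The function name `looksGarbledArabic`, the marker list, and the hit threshold are assumptions, and the real `isGarbledArabic()` likely checks more patterns.

```typescript
// Known CMap-transposition markers (only the two quoted in this document).
const GARBLE_MARKERS: RegExp[] = [
  / يف /, // transposed form of the standalone word "في"
  /امحلد/, // transposed form of "الحمد"
];

// Returns true when at least `minHits` distinct markers occur in the text.
function looksGarbledArabic(text: string, minHits = 1): boolean {
  let hits = 0;
  for (const marker of GARBLE_MARKERS) {
    if (marker.test(text)) hits++;
  }
  return hits >= minHits;
}
```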
If garbled, `extractPdfViaOcr()` runs the following pipeline:
1. `pdftoppm` renders every page to PNG at **200 DPI** when `HF_TOKEN` is available (VLM-optimised), or at **300 DPI** in Tesseract-only mode
2. **Per page**: try VLM-OCR via HF Inference API first (best Arabic accuracy)
   - Primary: `allenai/olmOCR-7B-0225-preview` — Allen Institute, fine-tuned on document OCR, #1 on Arabic KITAB-Bench
   - Fallback: `Qwen/Qwen2.5-VL-7B-Instruct` — strong VLM with excellent Arabic
3. **If the VLM fails or is rate-limited**: fall back to Tesseract.js, lazily initialised (created only when needed) with Arabic+English models
After OCR, `cleanOcrOutput()` filters each line: drops lines >80% Latin chars with <4 Arabic chars (decorative page noise, artefacts like `Me NY 1`, `dl pl a gl`). Keeps all lines with ≥4 Arabic characters.
**No page cap** — processes every page in the document.
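
The line filter described for `cleanOcrOutput()` might look like the sketch below. The thresholds come from the description above. Measuring the Latin share against letters only (Latin plus Arabic), the character ranges used, and the name `cleanOcrLines` are assumptions of this sketch.

```typescript
const ARABIC_CHARS = /[\u0600-\u06FF]/g;
const LATIN_CHARS = /[A-Za-z]/g;

function cleanOcrLines(text: string): string {
  return text
    .split("\n")
    .filter((line) => {
      const arabic = (line.match(ARABIC_CHARS) ?? []).length;
      // Lines with at least 4 Arabic characters are always kept.
      if (arabic >= 4) return true;
      const latin = (line.match(LATIN_CHARS) ?? []).length;
      const letters = latin + arabic;
      // Blank or digit-only lines carry no letters, so they are kept.
      if (letters === 0) return true;
      // Drop lines that are more than 80% Latin letters.
      return latin / letters <= 0.8;
    })
    .join("\n");
}
```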
### Stage 3: AI text correction — full HF model access, 6-model fallback chain
Endpoint priority chain (tried in order, falls back on rate-limit / 5xx):
1. **Replit AI proxy** (`AI_INTEGRATIONS_OPENAI_BASE_URL`) — auto on Replit, uses `gpt-4o`
2. **HF: Qwen/Qwen3-72B** — best open-source Arabic model (Apr 2025), thinking disabled via `/no_think`
3. **HF: Qwen/Qwen3-30B-A3B** — MoE, fast and very capable
4. **HF: Qwen/Qwen2.5-72B-Instruct** — proven Arabic quality
5. **HF: meta-llama/Llama-3.3-70B-Instruct** — strong multilingual
6. **HF: mistralai/Mistral-Nemo-Instruct-2407** — fast 12B fallback
Chunk size: 3000 chars. Timeout: 120s per chunk. Temperature: 0.1. Qwen3 `<think>` blocks stripped from output. On rate-limit, `activeEpIdx` advances to next model and continues without interruption.
No `OPENAI_API_KEY` needed. `HF_TOKEN` secret (already required for HF Spaces) doubles as the AI key.
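
The chunking and fallback behaviour can be sketched with an injected list of endpoints. This is an illustration under stated assumptions: `Endpoint`, `chunkText`, `stripThinkBlocks`, and `correctText` are hypothetical names, any thrown error is treated as a signal to advance (the real chain advances on rate-limit and 5xx responses), and the 120s timeout and temperature setting are left out.

```typescript
// An endpoint is anything that can correct one chunk of text.
interface Endpoint {
  name: string;
  call: (chunk: string) => Promise<string>;
}

// Fixed-size chunks; a simple slice that may cut mid-word.
function chunkText(text: string, size = 3000): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) chunks.push(text.slice(i, i + size));
  return chunks;
}

// Removes Qwen3-style <think>...</think> blocks and trailing whitespace after them.
function stripThinkBlocks(output: string): string {
  return output.replace(/<think>[\s\S]*?<\/think>\s*/g, "");
}

async function correctText(text: string, endpoints: Endpoint[]): Promise<string> {
  let activeEpIdx = 0; // persists across chunks, so a failed endpoint is not retried
  const corrected: string[] = [];
  for (const chunk of chunkText(text)) {
    let result: string | undefined;
    while (activeEpIdx < endpoints.length) {
      try {
        result = stripThinkBlocks(await endpoints[activeEpIdx].call(chunk));
        break;
      } catch {
        activeEpIdx++; // advance to the next model and retry the same chunk
      }
    }
    // If every endpoint is exhausted, keep the uncorrected chunk.
    corrected.push(result ?? chunk);
  }
  return corrected.join("");
}
```

Because `activeEpIdx` lives outside the chunk loop, later chunks go straight to the first endpoint that last succeeded.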
## Architect Engine — 100% Free, No External APIs, No Limits
The "Super Architect" is a fully deterministic rule-based engine (`runRuleBasedArchitect` in `convert.ts`).
Zero external API calls. Zero cost. Zero rate limits. Runs entirely on the server.
**Arabic document rules:**
- Metadata table: detects مادة/زمن/نموذج/تاريخ (even multiple fields inline on one line) → Markdown table
- Document title: promoted to `# heading` when metadata block precedes it
- Section markers: أولاً/ثانياً/ثالثاً/... → `## heading`
- Questions: س1/سؤال 1/Q1 → `**bold**` with blank lines
- Multiple choice: `أ- X ب- Y ج- Z د- W` or `a) X b) Y c) Z d) W` → vertical bullet list
- Keywords: التعليل:/الإجابة:/المطلوب:/الحل: → separate lines
- OCR noise: box-drawing chars, control chars removed
- Lone short lines (surrounded by blanks) → `### subheading`
**Latin/English document rules:**
- ALL CAPS short lines → `### heading`
- `Q1:`/`Question N:`/`Section`/`Part` → `**bold**` / `## heading`
- `a) b) c) d)` inline choices → vertical list
- Existing Markdown kept as-is
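
To make the rule style concrete, here is a small sketch covering three of the rules above: section markers, questions, and inline Latin choices. It is not `runRuleBasedArchitect`. The regexes, the function name `applyBasicRules`, and the rule ordering are assumptions, and only the three section markers named above are matched.

```typescript
const SECTION_MARKER = /^(أولاً|ثانياً|ثالثاً)/;
const QUESTION_MARKER = /^(س\s*\d+|سؤال\s*\d+|Q\d+)/;
const INLINE_LATIN_CHOICES = /^a\)\s.*\sb\)\s/;

function applyBasicRules(text: string): string {
  const out: string[] = [];
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (SECTION_MARKER.test(line)) {
      // Section markers become level-2 headings.
      out.push(`## ${line}`);
    } else if (QUESTION_MARKER.test(line)) {
      // Questions become bold, set off by blank lines.
      out.push("", `**${line}**`, "");
    } else if (INLINE_LATIN_CHOICES.test(line)) {
      // "a) X b) Y c) Z d) W" becomes one bullet per choice.
      for (const choice of line.split(/\s+(?=[a-d]\)\s)/)) out.push(`- ${choice}`);
    } else {
      out.push(raw); // everything else passes through unchanged
    }
  }
  return out.join("\n");
}
```

Since every rule is a pure string transformation, the same input always yields the same Markdown.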
## Supported File Types for Conversion
ALL file types accepted (500MB max):
- **PDF** — pdf-parse
- **DOCX/DOC** — mammoth (HTML extraction)
- **PPTX/PPT** — jszip (slide text extraction)
- **XLSX/XLS/CSV** — xlsx (table to markdown)
- **Images** (PNG, JPG, JPEG, WEBP, GIF, BMP, TIFF) — Tesseract.js OCR (Arabic+English)
- **HTML/HTM** — custom HTML-to-text
- **EPUB** — jszip (chapter text extraction)
- **TXT/MD** — direct read
- **Any other** — UTF-8 text fallback
## Export
- **Markdown (.md)** — direct download
- **Word (.docx)** — real RTL DOCX via `docx` npm package:
  - Traditional Arabic font (30pt body, 44pt headings)
  - 100% RTL bidirectional support
  - Numbered header/footer with "منسق آلياً بواسطة رقيم"
  - Proper bullet/numbered lists, tables, inline bold/italic/code
- **PDF** — currently exports as markdown (browser-based PDF printing recommended)
## Frontend Animations
Framer Motion used throughout (already in devDependencies):
- Upload zone: spring scale on hover/drag, AnimatePresence for state transitions
- File upload progress bar with smooth motion
- Conversion steps: staggered fade-in, animated step indicators
- Live AI message cycling during processing steps
- Success/failure states with spring animations
## Hugging Face Spaces Deployment
- `Dockerfile` — multi-stage build, runs as non-root, DB at `/data/raqim.db`
- `deploy-hf.sh` — one-command deploy (requires `HF_TOKEN` + `HF_USERNAME` Replit secrets)
- `scripts/db-hf-sync.mjs` — pull/push SQLite DB to private HF Dataset (`${HF_USERNAME}/raqim-db`) via git
- Space secrets to set after deploy: `JWT_SECRET`, `SESSION_SECRET`
- No `DATABASE_URL` needed — SQLite is fully embedded
See the `pnpm-workspace` skill for workspace structure, TypeScript setup, and package details.