# RAQIM — رقيم

Arabic-first document-to-Markdown conversion platform.

## Overview

pnpm workspace monorepo using TypeScript. Full-stack: React+Vite frontend, Express backend, SQLite (node:sqlite built-in) + Drizzle ORM. Deployment target: Hugging Face Spaces (Docker).

## Stack

- **Monorepo tool**: pnpm workspaces
- **Node.js version**: 24
- **Package manager**: pnpm
- **TypeScript version**: 5.9
- **API framework**: Express 5
- **Database**: SQLite via `node:sqlite` (built-in, no native compilation) + `drizzle-orm/sqlite-proxy`
- **Validation**: Zod (`zod/v4`), `drizzle-zod`
- **API codegen**: Orval (from OpenAPI spec in `artifacts/api-spec/openapi.yaml`)
- **Build**: esbuild (ESM bundle)
- **Frontend**: React 19 + Vite 7 + Tailwind CSS v4 + shadcn/ui

## Artifacts

| Artifact | Path | Port |
|---|---|---|
| `@workspace/raqim` | `/` | $PORT (23567) |
| `@workspace/api-server` | `/api` | $PORT (8080) |

## Key Commands

- `pnpm run typecheck` — full typecheck across all packages
- `pnpm run typecheck:libs` — build composite libs (run before API/frontend typecheck)
- `pnpm --filter @workspace/raqim run typecheck` — frontend only
- `pnpm --filter @workspace/api-server run typecheck` — backend only
- `pnpm --filter @workspace/api-spec run codegen` — regenerate API hooks and Zod schemas from OpenAPI spec
- `node scripts/seed-admin.mjs` — seed the admin user (run once after first startup)

## Database

- **Engine**: `node:sqlite` (Node 24 built-in, zero native deps)
- **ORM**: `drizzle-orm/sqlite-proxy` (async callback-based wrapper)
- **Dev DB**: `artifacts/api-server/raqim.db` (auto-created on startup)
- **Docker/HF DB**: `/data/raqim.db` (set via `DB_PATH` env var)
- **Migrations**: run automatically at server startup from `lib/db/drizzle/*.sql`
- **Backup**: `scripts/db-hf-sync.mjs` — git-based push/pull to a private HF Dataset

### sqlite-proxy callback contract (lib/db/src/index.ts)
- `method === "run"` → `stmt.run(...params)`, return `{ rows: [] }`
- `method === "get"` → `stmt.get(...params)`, return `{ rows: Object.values(row) }` (flat array, NOT wrapped)
- `method === "all"` → `stmt.all(...params)`, return `{ rows: result.map(r => Object.values(r)) }` (array of arrays)
- `method === "values"` → same as `all`
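
A minimal sketch of a callback that follows this contract. The `StatementLike` and `DatabaseLike` interfaces and the `makeProxyCallback` name are assumptions made for illustration; they only mirror the `prepare`/`run`/`get`/`all` surface used above and are not copied from `lib/db/src/index.ts`. What to return when `get` finds no row is also an assumption and should be checked against the Drizzle version in use.

```typescript
type Row = Record<string, unknown>;

// Local stand-ins for the statement/database objects (assumed shapes).
interface StatementLike {
  run(...params: unknown[]): unknown;
  get(...params: unknown[]): Row | undefined;
  all(...params: unknown[]): Row[];
}

interface DatabaseLike {
  prepare(sql: string): StatementLike;
}

type ProxyMethod = "run" | "get" | "all" | "values";

// Builds the async callback described by the contract above.
function makeProxyCallback(db: DatabaseLike) {
  return async (
    sql: string,
    params: unknown[],
    method: ProxyMethod,
  ): Promise<{ rows: unknown[] }> => {
    const stmt = db.prepare(sql);
    if (method === "run") {
      stmt.run(...params);
      return { rows: [] };
    }
    if (method === "get") {
      const row = stmt.get(...params);
      // Assumption: a missing row is reported as an empty flat array.
      return { rows: row ? Object.values(row) : [] };
    }
    // "all" and "values" both return an array of arrays.
    return { rows: stmt.all(...params).map((r) => Object.values(r)) };
  };
}
```

The flat array for `get` versus the array of arrays for `all` is the detail most likely to go wrong, which is why the two branches are kept separate here.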

### Admin seed

`node scripts/seed-admin.mjs` — inserts `admin@raqim.app` / `Admin1234!` (bcrypt hash, role=admin, status=active).
Respects `DB_PATH`, `ADMIN_EMAIL`, `ADMIN_PASSWORD`, `ADMIN_NAME` env vars.

## Database Schema (lib/db/src/schema/)

- **users** — id, email, passwordHash, displayName, role (user|admin), status (pending|active|suspended), lastLoginAt
- **refresh_tokens** — id, userId, token (unique), expiresAt
- **folders** — id, name, parentId, ownerId, trashed, trashedAt
- **files** — id, name, folderId, ownerId, originalName, originalType, sizeBytes, status (uploading|queued|processing|done|failed|trashed), markdownContent, originalMarkdown, qualityScore, wordCount, language, trashedAt
- **conversions** — id, fileId, userId, status (queued|analyzing|routing|ocr|layout|scoring|merging|cleanup|done|failed), progress, steps (jsonb), elapsedSeconds, estimatedSeconds, errorMessage
- **shares** — id, fileId, folderId, ownerId, sharedWithId, permission (read|edit|full)

All timestamps stored as `INTEGER` (milliseconds since epoch). Enums stored as `TEXT`.
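
As an illustration of the storage conventions above, the sketch below shows one way application code might guard a `TEXT` enum and convert timestamps. The helper names (`isFileStatus`, `toEpochMs`, `fromEpochMs`) are hypothetical and not taken from `lib/db`; only the status values come from the schema list.

```typescript
// Status values copied from the `files` entry above; stored as TEXT.
const FILE_STATUSES = [
  "uploading",
  "queued",
  "processing",
  "done",
  "failed",
  "trashed",
] as const;
type FileStatus = (typeof FILE_STATUSES)[number];

// Runtime guard for values read back from a TEXT column.
function isFileStatus(value: string): value is FileStatus {
  return (FILE_STATUSES as readonly string[]).indexOf(value) !== -1;
}

// Timestamps are INTEGER milliseconds since the Unix epoch.
function toEpochMs(date: Date): number {
  return date.getTime();
}

function fromEpochMs(ms: number): Date {
  return new Date(ms);
}
```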

## Auth

- JWT accessToken (1h) in memory + refreshToken (30d) in httpOnly cookie
- `SESSION_SECRET` env var used as JWT secret
- refreshToken includes a random `jti` UUID so that otherwise identical payloads issued close together do not collide on the unique `token` column
- Admin seed: `admin@raqim.app` / `Admin1234!`

## Frontend Pages

| Route | Page | Auth |
|---|---|---|
| `/` | LandingPage | public |
| `/login` | LoginPage | public |
| `/request-access` | RequestAccessPage | public |
| `/dashboard` | DashboardPage | protected |
| `/upload` | UploadPage | protected |
| `/editor/:fileId` | EditorPage | protected |
| `/convert/:jobId` | ConversionPage | protected |
| `/shared` | SharedPage | protected |
| `/trash` | TrashPage | protected |
| `/admin/users` | AdminUsersPage | admin |
| `/admin/requests` | AdminRequestsPage | admin |
| `/admin/stats` | AdminStatsPage | admin |

## Design System

- Dark sanctuary theme: `--background: #0d0d0f`, `--primary: #f5a623` (amber gold)
- Fonts: Fraunces (display) + Noto Naskh Arabic (body)
- All pages use `dir="rtl"` for Arabic-first layout
- shadcn/ui components with custom CSS vars

## API Routes

- `POST /api/auth/login` — login, returns accessToken + sets refresh cookie
- `POST /api/auth/refresh` — refresh accessToken using cookie
- `POST /api/auth/logout` — clear refresh cookie
- `GET/POST /api/folders` — list/create folders
- `PATCH/DELETE /api/folders/:id` — update/trash folder
- `GET/POST /api/files` — list/upload files
- `GET/PATCH/DELETE /api/files/:id` — get/update/trash file
- `GET/PUT /api/files/:id/content` — get/save markdown content
- `POST /api/files/:id/restore` — restore from trash
- `POST /api/files/:id/export` — export to md/docx/pdf
- `GET /api/files/:id/download` — download file
- `POST /api/convert` — start conversion job
- `GET /api/convert/:jobId` — poll conversion status
- `GET/POST/PATCH/DELETE /api/shares` — share management
- `GET /api/shares/search-users` — search users to share with
- `GET /api/admin/users` — list users
- `POST /api/admin/users` — create user
- `PATCH/DELETE /api/admin/users/:id` — update/delete user
- `GET /api/admin/join-requests` — list pending join requests
- `PATCH /api/admin/join-requests/:id` — approve/reject request
- `GET /api/admin/stats` — platform statistics
- `GET /api/admin/trash` — list all trashed items
- `DELETE /api/admin/trash/empty` — empty trash

## Arabic PDF Text Extraction Pipeline

Many Arabic PDFs have broken ToUnicode CMap font tables. pdfjs-dist returns the correct Unicode codepoints but in the wrong order (e.g. `امحلد هلل` instead of `الحمد لله`). The pipeline uses three stages, all free: no paid API keys and no page limits.

### Stage 1: pdfjs-dist text extraction
Custom RTL-aware extractor that buckets text items by Y coordinate, sorts right-to-left, and joins with gap-based space insertion. Fast, zero dependencies.
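
The bucketing and joining steps can be sketched as follows. This is a simplified illustration, not the project's extractor: the flat `TextItem` shape, the `extractRtlLines` name, and the `yTolerance` and `gapThreshold` defaults are all assumptions (pdfjs-dist reports position through a transform matrix rather than plain `x`/`y` fields).

```typescript
// Simplified text item (assumed shape for this sketch).
interface TextItem {
  str: string;
  x: number; // left edge of the item
  y: number; // baseline; PDF coordinates grow upward
  width: number;
}

function extractRtlLines(
  items: TextItem[],
  yTolerance = 2,
  gapThreshold = 1.5,
): string[] {
  // 1. Bucket items into lines by quantised Y coordinate.
  const buckets = new Map<number, TextItem[]>();
  for (const item of items) {
    const key = Math.round(item.y / yTolerance);
    const line = buckets.get(key) ?? [];
    line.push(item);
    buckets.set(key, line);
  }

  // 2. Order lines top to bottom (largest Y first).
  const ordered = Array.from(buckets.entries()).sort((a, b) => b[0] - a[0]);

  return ordered.map(([, line]) => {
    // 3. Within a line, read right to left (largest X first).
    const sorted = line.slice().sort((a, b) => b.x - a.x);
    let text = "";
    let prevLeft = Number.POSITIVE_INFINITY;
    for (const item of sorted) {
      // 4. Insert a space only when the horizontal gap is wide enough.
      const gap = prevLeft - (item.x + item.width);
      if (text !== "" && gap > gapThreshold) text += " ";
      text += item.str;
      prevLeft = item.x;
    }
    return text;
  });
}
```

Quantising Y by rounding can split one visual line across two buckets when baselines straddle a boundary; a production extractor would likely need a more tolerant grouping.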

### Stage 2: Garbling detection → VLM-OCR → Tesseract fallback
`isGarbledArabic()` checks for telltale CMap transposition patterns (` يف ` for `في`, `امحلد` for `الحمد`, bidi-wrapped Latin noise like `‎OA‏`, etc.).
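
A rough sketch of this kind of marker-based check. The three patterns are the examples named above; the real marker list, the exact bidi characters, and the `minHits` threshold are assumptions, and the function is named `isGarbledArabicSketch` to avoid implying it matches the project's `isGarbledArabic()`.

```typescript
// Marker patterns taken from the examples in the text (assumed subset).
const GARBLE_MARKERS: RegExp[] = [
  / يف /g, // transposed "في"
  /امحلد/g, // transposed "الحمد"
  /\u200E[A-Za-z]{1,4}\u200F/g, // short Latin run wrapped in LRM ... RLM
];

// Counts how many marker occurrences appear in the text.
function countGarbleMarkers(text: string): number {
  let hits = 0;
  for (const marker of GARBLE_MARKERS) {
    hits += (text.match(marker) ?? []).length;
  }
  return hits;
}

// Flags text as garbled once enough markers are seen (threshold assumed).
function isGarbledArabicSketch(text: string, minHits = 2): boolean {
  return countGarbleMarkers(text) >= minHits;
}
```

Requiring more than one hit is a design choice made here to reduce false positives on short strings; the source does not state a threshold.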

If garbled, `extractPdfViaOcr()` runs the following pipeline:
1. `pdftoppm` renders every page to PNG: **200 DPI** when `HF_TOKEN` is available (VLM-optimised), **300 DPI** in Tesseract-only mode
2. **Per page**: try VLM-OCR via HF Inference API first (best Arabic accuracy)
   - Primary: `allenai/olmOCR-7B-0225-preview` — Allen Institute, fine-tuned on document OCR, #1 on Arabic KITAB-Bench
   - Fallback: `Qwen/Qwen2.5-VL-7B-Instruct` — strong VLM with excellent Arabic
3. **If the VLM fails or is rate-limited**: Tesseract.js, lazy-initialised (created only when needed), with Arabic+English models

After OCR, `cleanOcrOutput()` filters each line: drops lines >80% Latin chars with <4 Arabic chars (decorative page noise, artefacts like `Me NY 1`, `dl pl a gl`). Keeps all lines with ≥4 Arabic characters.
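
The line filter can be sketched like this. It is an approximation under stated assumptions: the Latin share is computed over letters only (Latin plus Arabic), blank lines are kept, and the Arabic range is limited to U+0600 through U+06FF. The source does not specify these details, so the function is named `cleanOcrOutputSketch`.

```typescript
const ARABIC_CHAR = /[\u0600-\u06FF]/g;
const LATIN_CHAR = /[A-Za-z]/g;

function countMatches(text: string, pattern: RegExp): number {
  return (text.match(pattern) ?? []).length;
}

function cleanOcrOutputSketch(text: string): string {
  return text
    .split("\n")
    .filter((line) => {
      const arabic = countMatches(line, ARABIC_CHAR);
      if (arabic >= 4) return true; // always keep real Arabic content
      const latin = countMatches(line, LATIN_CHAR);
      const letters = arabic + latin;
      if (letters === 0) return true; // blank or digit-only lines are kept
      return latin / letters <= 0.8; // drop Latin-dominated noise
    })
    .join("\n");
}
```

With these assumptions, both noise examples from the text (`Me NY 1` and `dl pl a gl`) are dropped, while a mixed line with fewer than four Arabic characters survives if Latin letters do not dominate it.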

**No page cap** — processes every page in the document.

### Stage 3: AI text correction — full HF model access, 6-model fallback chain
Endpoint priority chain (tried in order, falls back on rate-limit / 5xx):
1. **Replit AI proxy** (`AI_INTEGRATIONS_OPENAI_BASE_URL`) — auto on Replit, uses `gpt-4o`
2. **HF: Qwen/Qwen3-72B** — best open-source Arabic model (Apr 2025), thinking disabled via `/no_think`
3. **HF: Qwen/Qwen3-30B-A3B** — MoE, fast and very capable
4. **HF: Qwen/Qwen2.5-72B-Instruct** — proven Arabic quality
5. **HF: meta-llama/Llama-3.3-70B-Instruct** — strong multilingual
6. **HF: mistralai/Mistral-Nemo-Instruct-2407** — fast 12B fallback

Chunk size: 3000 chars. Timeout: 120s per chunk. Temperature: 0.1. Qwen3 `<think>` blocks stripped from output. On rate-limit, `activeEpIdx` advances to next model and continues without interruption.
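
A hedged sketch of the chunking, `<think>` stripping, and endpoint fallback described above. The `Endpoint` interface, the `EndpointResult` type, and the choice to keep a chunk uncorrected when every endpoint fails are assumptions for illustration; the per-chunk timeout and temperature are not modelled.

```typescript
// Splits text into chunks of at most `size` characters.
function chunkText(text: string, size = 3000): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

// Removes Qwen3-style <think>...</think> blocks from model output.
function stripThinkBlocks(output: string): string {
  return output.replace(/<think>[\s\S]*?<\/think>\s*/g, "");
}

// Hypothetical endpoint shape: ok=false stands for a rate limit or 5xx.
type EndpointResult = { ok: true; text: string } | { ok: false };
interface Endpoint {
  name: string;
  correct(chunk: string): Promise<EndpointResult>;
}

async function correctWithFallback(
  text: string,
  endpoints: Endpoint[],
  chunkSize = 3000,
): Promise<string> {
  let activeEpIdx = 0; // stays on the last endpoint that worked
  const out: string[] = [];
  for (const chunk of chunkText(text, chunkSize)) {
    let corrected: string | undefined;
    while (corrected === undefined && activeEpIdx < endpoints.length) {
      const result = await endpoints[activeEpIdx].correct(chunk);
      if (result.ok) {
        corrected = stripThinkBlocks(result.text);
      } else {
        activeEpIdx += 1; // advance to the next model, retry same chunk
      }
    }
    // Assumption: keep the uncorrected chunk if every endpoint failed.
    out.push(corrected ?? chunk);
  }
  return out.join("");
}
```

Because `activeEpIdx` is never reset, an endpoint that has hit its rate limit is not retried for later chunks, which matches the "advances and continues" behaviour described above.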

No `OPENAI_API_KEY` needed. `HF_TOKEN` secret (already required for HF Spaces) doubles as the AI key.

## Architect Engine — 100% Free, No External APIs, No Limits

The "Super Architect" is a fully deterministic rule-based engine (`runRuleBasedArchitect` in `convert.ts`).
Zero external API calls. Zero cost. Zero rate limits. Runs entirely on the server.

**Arabic document rules:**
- Metadata table: detects مادة/زمن/نموذج/تاريخ (even multiple fields inline on one line) → Markdown table
- Document title: promoted to `# heading` when metadata block precedes it
- Section markers: أولاً/ثانياً/ثالثاً/... → `## heading`
- Questions: س1/سؤال 1/Q1 → `**bold**` with blank lines
- Multiple choice: `أ- X  ب- Y  ج- Z  د- W` or `a) X  b) Y  c) Z  d) W` → vertical bullet list
- Keywords: التعليل:/الإجابة:/المطلوب:/الحل: → separate lines
- OCR noise: box-drawing chars, control chars removed
- Lone short lines (surrounded by blanks) → `### subheading`

**Latin/English document rules:**
- ALL CAPS short lines → `### heading`
- `Q1:`/`Question N:` → `**bold**`; `Section`/`Part` → `## heading`
- `a) b) c) d)` inline choices → vertical list
- Existing Markdown kept as-is
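
To make the rule style concrete, here is a small sketch covering three of the Arabic rules (section markers, questions, inline choices). It is not `runRuleBasedArchitect`: the regular expressions, the `applyArabicRulesSketch` name, and the rule order are assumptions, and only the three section markers spelled out above are included.

```typescript
// Patterns are illustrative guesses at how the listed rules might be matched.
const SECTION_MARKER = /^(أولاً|ثانياً|ثالثاً)/;
const QUESTION_MARKER = /^(س\s*\d+|سؤال\s*\d+|Q\d+)/;
const INLINE_CHOICES = /^أ-\s.*\sب-\s/;
const CHOICE_SPLIT = /\s+(?=[أبجد]-\s)/; // split before each choice label

function applyArabicRulesSketch(text: string): string {
  const out: string[] = [];
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (SECTION_MARKER.test(line)) {
      out.push(`## ${line}`); // section marker becomes a heading
    } else if (QUESTION_MARKER.test(line)) {
      out.push("", `**${line}**`, ""); // bold question with blank lines
    } else if (INLINE_CHOICES.test(line)) {
      // One bullet per choice instead of a single inline row.
      for (const choice of line.split(CHOICE_SPLIT)) out.push(`- ${choice}`);
    } else {
      out.push(line);
    }
  }
  return out.join("\n");
}
```

Each rule is a pure string transform on one line, which is what keeps an engine of this kind deterministic and free of external calls.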

## Supported File Types for Conversion

ALL file types accepted (500MB max):
- **PDF** — pdf-parse
- **DOCX/DOC** — mammoth (HTML extraction)
- **PPTX/PPT** — jszip (slide text extraction)
- **XLSX/XLS/CSV** — xlsx (table to markdown)
- **Images** (PNG, JPG, JPEG, WEBP, GIF, BMP, TIFF) — Tesseract.js OCR (Arabic+English)
- **HTML/HTM** — custom HTML-to-text
- **EPUB** — jszip (chapter text extraction)
- **TXT/MD** — direct read
- **Any other** — UTF-8 text fallback

## Export

- **Markdown (.md)** — direct download
- **Word (.docx)** — real RTL DOCX via `docx` npm package:
  - Traditional Arabic font (30pt body, 44pt headings)
  - 100% RTL bidirectional support
  - Numbered header/footer with "منسق آلياً بواسطة رقيم"
  - Proper bullet/numbered lists, tables, inline bold/italic/code
- **PDF** — currently exports as markdown (browser-based PDF printing recommended)

## Frontend Animations

Framer Motion used throughout (already in devDependencies):
- Upload zone: spring scale on hover/drag, AnimatePresence for state transitions
- File upload progress bar with smooth motion
- Conversion steps: staggered fade-in, animated step indicators
- Live AI message cycling during processing steps
- Success/failure states with spring animations

## Hugging Face Spaces Deployment

- `Dockerfile` — multi-stage build, runs as non-root, DB at `/data/raqim.db`
- `deploy-hf.sh` — one-command deploy (requires `HF_TOKEN` + `HF_USERNAME` Replit secrets)
- `scripts/db-hf-sync.mjs` — pull/push SQLite DB to private HF Dataset (`${HF_USERNAME}/raqim-db`) via git
- Space secrets to set after deploy: `JWT_SECRET`, `SESSION_SECRET`
- No `DATABASE_URL` needed — SQLite is fully embedded

See the `pnpm-workspace` skill for workspace structure, TypeScript setup, and package details.