Spaces:
Running
Running
| # GraphoLab — Demo Checklist | |
| Everything you need to run all eight GraphoLab notebooks end-to-end, including which AI models are downloaded automatically and which sample images you must provide. | |
| --- | |
| ## TL;DR | |
| | What | Where | Notes | | |
| |------|-------|-------| | |
| | Python environment | local or Docker | see [NOTEBOOKS_GUIDE.md](NOTEBOOKS_GUIDE.md) | | |
| | `requirements.txt` installed | — | `pip install -r requirements.txt` | | |
| | SigNet weights | `models/signet.pth` | manual download — see Lab 03 section | | |
| | Sample images | `data/samples/` | see per-lab sections below | | |
| | AI models | downloaded automatically | internet connection needed on first run | | |
| --- | |
| ## AI Models — Downloaded Automatically | |
| All Hugging Face models are fetched on first run and cached locally (or in the Docker named volume `grapholab-hf-cache`). | |
| | Model | Downloaded by | Size | Cache location | | |
| |-------|--------------|------|----------------| | |
| | **TrOCR** (`microsoft/trocr-base-handwritten`) | `transformers` | ~400 MB | `~/.cache/huggingface/` | | |
| | **EasyOCR** (Italian + English models) | `easyocr` | ~100 MB | `~/.EasyOCR/` | | |
| | **Conditional DETR signature detector** (`tech4humans/conditional-detr-50-signature-detector`) | `transformers` | ~170 MB | `~/.cache/huggingface/` | | |
| | **WikiNEural NER** (`Babelscape/wikineural-multilingual-ner`) | `transformers` | ~700 MB | `~/.cache/huggingface/` | | |
| | **dots.ocr** (`rednote-hilab/dots.ocr`) | `transformers` | ~3.5 GB (bf16) / ~7 GB (fp32 CPU) | `~/.cache/huggingface/` | | |
| > **Internet connection is required on the first run of Labs 02, 04, 07, and 08.** Subsequent runs use the cached models. | |
| > | |
| > **dots.ocr (Lab 08) also requires a one-time `git clone`** — see the installation cell in the notebook. | |
| ## AI Models — Manual Download Required | |
| | Model | File | Size | Source | | |
| |-------|------|------|--------| | |
| | **SigNet** (GPDS pre-trained) | `models/signet.pth` | ~63 MB | [luizgh/sigver](https://github.com/luizgh/sigver) | | |
| Download `signet.pth` from the sigver repository and place it in the `models/` directory before running Lab 03. | |
| --- | |
| ## Sample Images — What You Need to Provide | |
| Place all images in `data/samples/`. Synthetic placeholder images are generated automatically when real images are missing, so the notebooks always run — but results on synthetic data are not meaningful for real forensic use. | |
| ### Lab 01 — Introduction | |
| **Nothing required.** Markdown-only notebook. | |
| --- | |
| ### Lab 02 — Handwritten Text Recognition (TrOCR) | |
| | File | Description | | |
| |------|-------------| | |
| | `handwritten_text_01.png` | A single line of handwritten text | | |
| | `handwritten_text_02.png` | (optional) A second single-line sample | | |
| | `handwritten_multiline_01.png` | A multi-line handwritten document (for the HTR→NER pipeline demo) | | |
| **Requirements:** | |
| - Clear scan or photo of handwritten text | |
| - Recommended resolution: 300 DPI or higher | |
| - White or light background, dark ink | |
| - TrOCR is a line-level model; multi-line images are split automatically by horizontal projection before inference | |
| **Ground-truth comparison (optional):** if you have a known transcript of the handwritten text, you can compute the Character Error Rate (CER) in the optional section of Lab 02. | |
| --- | |
| ### Lab 03 — Signature Verification (SigNet) | |
| | File | Description | | |
| |------|-------------| | |
| | `genuine_N_1.png` | **Reference signature** — known genuine (writer N, sample 1) | | |
| | `genuine_N_2.png` | Second genuine signature from the same writer | | |
| | `forged_N_M.png` | A forged signature (writer N, forgery M) | | |
| Repeat for each writer you want to demonstrate (e.g. N = 1, 2, 3, …). | |
| **Requirements:** | |
| - Isolated signatures (no surrounding document text) | |
| - White or light background, dark ink | |
| - Consistent scan quality across samples from the same person | |
| - Recommended resolution: 300 DPI or higher | |
| > **Pre-selected demo samples:** The repository includes curated pairs from the **CEDAR** signature database. These pairs have been pre-scanned with SigNet to confirm the model correctly classifies the forgery (cosine distance > 0.35). Writers 1–5 correspond to CEDAR writers 51, 26, 34, 32, and 21 respectively. | |
| > **SigNet weights required:** download `models/signet.pth` from [luizgh/sigver](https://github.com/luizgh/sigver) before running this lab. | |
| --- | |
| ### Lab 04 — Signature Detection in Documents (Conditional DETR) | |
| | File | Description | | |
| |------|-------------| | |
| | `document_with_signature_01.png` | A scanned document page containing at least one signature | | |
| **Optional additional files:** `document_with_signature_02.png`, `document_with_signature_03.png`, … | |
| **Requirements:** | |
| - Full document page image (not a pre-cropped signature) | |
| - The model handles multi-signature pages | |
| - Recommended resolution: 200–300 DPI | |
| - Works on contracts, letters, forms, bank cheques | |
| > **Output:** detected signatures are cropped and saved as `detected_signature_N.png` in `data/samples/`. These crops can be used directly as input to Lab 03. | |
| --- | |
| ### Lab 05 — Writer Identification | |
| Organised in per-writer subdirectories inside `data/samples/`: | |
| ``` | |
| data/samples/ | |
| writer_01/ | |
| sample_01.png | |
| sample_02.png | |
| sample_03.png | |
| sample_04.png | |
| sample_05.png | |
| writer_02/ | |
| sample_01.png | |
| ... | |
| writer_03/ | |
| sample_01.png | |
| ... | |
| ``` | |
| **Requirements:** | |
| - Minimum **3 writers** (more = better accuracy) | |
| - Minimum **5 samples per writer** (the notebook uses leave-one-out cross-validation) | |
| - Each sample: a few lines of continuous handwritten text | |
| - Consistent scan conditions across all samples | |
| - Recommended resolution: 300 DPI | |
| > **Training note:** Lab 05 trains a lightweight SVM classifier on the provided samples each time the notebook runs. No pre-trained writer identification model is used — your own samples are the training data. | |
| --- | |
| ### Lab 06 — Graphological Feature Analysis | |
| Reuses the handwritten text images from Lab 02: | |
| | File | Description | | |
| |------|-------------| | |
| | `handwritten_text_01.png` | Primary sample for feature extraction | | |
| | `handwritten_text_02.png` | (optional) Second sample for side-by-side comparison | | |
| No additional files needed if Lab 02 samples are already in place. | |
| --- | |
| ### Lab 07 — Named Entity Recognition (NER) | |
| **No image files required.** The NER model operates on text strings directly. | |
| - **Demo 1 & 2:** hard-coded Italian and English example texts — no files needed. | |
| - **Demo 3 (HTR→NER pipeline):** loads `handwritten_multiline_01.png` (shared with Lab 02). | |
| The `Babelscape/wikineural-multilingual-ner` model (~700 MB) is downloaded automatically on first run. It supports 9 languages including Italian and English. | |
| --- | |
| ### Lab 08 — dots.ocr (VLM-based OCR) | |
| | File | Description | | |
| | ---- | ----------- | | |
| | `writer_00/sample_000.png` | Single writer_00 sample (shared with Lab 05) | | |
| | `testamento_writer00.png` | Full testamento document — generate with `scripts/create_testamento_writer00.py` | | |
| | `lorella/*.png` | (optional) Real-world handwriting samples | | |
| **Requirements:** | |
| - First run: internet connection for model download (~3.5 GB bf16 or ~7 GB fp32 on CPU) | |
| - On CPU: ~7 GB free RAM; 2–5 min per image | |
| - On GPU: ≥4 GB VRAM recommended | |
| **One-time installation (before first run):** | |
| ```bash | |
| git clone https://github.com/rednote-hilab/dots.ocr.git DotsOCR | |
| pip install -e DotsOCR | |
| pip install qwen_vl_utils accelerate | |
| ``` | |
| --- | |
| ## Naming Convention Summary | |
| ``` | |
| data/samples/ | |
| handwritten_text_01.png # Labs 02, 06 | |
| handwritten_text_02.png # Labs 02, 06 (optional) | |
| handwritten_multiline_01.png # Labs 02, 07 (multi-line HTR + NER pipeline) | |
| genuine_1_1.png # Lab 03 — writer 1, reference | |
| genuine_1_2.png # Lab 03 — writer 1, second genuine sample | |
| forged_1_1.png # Lab 03 — writer 1, forged | |
| genuine_2_1.png # Lab 03 — writer 2, reference | |
| ... | |
| document_with_signature_01.png # Lab 04 | |
| writer_01/sample_01.png # Lab 05 | |
| writer_01/sample_02.png # Lab 05 | |
| ... | |
| ``` | |
| --- | |
| ## Minimum Viable Demo (5 images) | |
| If you want a quick demo covering Labs 02, 03, 04, 06, and 07 with a single minimal set: | |
| 1. `handwritten_text_01.png` — for Labs 02 and 06 | |
| 2. `handwritten_multiline_01.png` — for Lab 07 HTR→NER pipeline | |
| 3. `genuine_1_1.png` — reference signature | |
| 4. `forged_1_1.png` — forged signature | |
| 5. `document_with_signature_01.png` — document page for Lab 04 | |
| Lab 01 needs nothing. Lab 05 needs per-writer subdirectories (not covered by this minimum set). Lab 07 Demos 1 & 2 need no files at all. | |
| --- | |
| ## Quick Checklist Before Running | |
| - [ ] Python environment created and `requirements.txt` installed | |
| - [ ] Internet connection available (first-run model downloads: TrOCR ~400 MB, EasyOCR ~100 MB, WikiNEural NER ~700 MB, Conditional DETR ~170 MB, dots.ocr ~3.5 GB) | |
| - [ ] `models/signet.pth` downloaded from [luizgh/sigver](https://github.com/luizgh/sigver) | |
| - [ ] `data/samples/` directory exists | |
| - [ ] Handwritten text images placed (`handwritten_text_*.png`, `handwritten_multiline_01.png`) | |
| - [ ] Signature images placed (`genuine_N_M.png`, `forged_N_M.png`) | |
| - [ ] Document scan placed (`document_with_signature_*.png`) | |
| - [ ] Writer subdirectories populated (`writer_XX/sample_YY.png`) — for Lab 05 | |
| - [ ] JupyterLab running (`jupyter lab` or `docker compose up jupyter`) | |