feat(sprint-1): complete Sprint 1 implementation of Picarones
Project structure:
- Python package `picarones/` with `core/` and `engines/` submodules
- `pyproject.toml` (setuptools, CLI entry point), `requirements.txt`, `.gitignore`
OCR engine adapters (Module 2):
- `engines/base.py`: abstract `BaseOCREngine` interface + `EngineResult`
- Automatic error handling, execution-time measurement
- `engines/tesseract.py`: Tesseract 5 adapter via pytesseract
- Configurable lang, PSM, OEM, binary path
- `engines/pero_ocr.py`: Pero OCR adapter (pero-ocr optional)
- Plain-text extraction from the structured PAGE XML output
CER/WER metrics (Module 4):
- `core/metrics.py`: computation via jiwer
- Raw CER, NFC CER, caseless CER
- Raw WER, normalized WER, MER, WIL
- Statistical aggregation (mean, median, min, max, stdev)
Corpus management (Module 1):
- `core/corpus.py`: loads a local directory of image / `.gt.txt` pairs
- Automatic image detection (jpg, png, tif, webp…)
- Corpus statistics, error handling
Results and JSON export (Module 6):
- `core/results.py`: `DocumentResult`, `EngineReport`, `BenchmarkResult` models
- Full JSON serialization with ranking by CER
- `to_json()` / `from_json()` for persistence
Orchestrator (Module 6):
- `core/runner.py`: `run_benchmark()` with a tqdm progress bar
Click CLI (Module 6):
- `picarones run`: full benchmark with a `--fail-if-cer-above` option (CI/CD)
- `picarones metrics`: CER/WER between two text files
- `picarones engines`: lists the available engines
- `picarones info`: dependency versions
Unit tests: 58 tests, 100% passing
- `test_metrics.py`, `test_corpus.py`, `test_engines.py`, `test_results.py`
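The CER reported by `core/metrics.py` is, at its core, the character-level edit distance divided by the ground-truth length (the module delegates the actual computation to jiwer). A minimal self-contained sketch of that definition, not the jiwer-backed implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                 # deletion
                cur[j - 1] + 1,              # insertion
                prev[j - 1] + (ca != cb),    # substitution (free if chars match)
            ))
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length (0.0 for two empty strings)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


print(cer("chat", "chut"))  # 1 substitution over 4 chars -> 0.25
```

The empty-reference edge case mirrors the guard in `_cer_from_strings` below.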
https://claude.ai/code/session_017gXea9mxBQqDTAsSQd7aAq
- .gitignore +13 -0
- README.md +119 -0
- picarones/__init__.py +9 -0
- picarones/cli.py +295 -0
- picarones/core/__init__.py +1 -0
- picarones/core/corpus.py +152 -0
- picarones/core/metrics.py +214 -0
- picarones/core/results.py +155 -0
- picarones/core/runner.py +115 -0
- picarones/engines/__init__.py +13 -0
- picarones/engines/base.py +85 -0
- picarones/engines/pero_ocr.py +112 -0
- picarones/engines/tesseract.py +84 -0
- pyproject.toml +44 -0
- requirements.txt +19 -0
- tests/__init__.py +0 -0
- tests/test_corpus.py +96 -0
- tests/test_engines.py +214 -0
- tests/test_metrics.py +117 -0
- tests/test_results.py +124 -0
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,13 @@
+__pycache__/
+*.py[cod]
+*.egg-info/
+*.egg
+dist/
+build/
+.eggs/
+.pytest_cache/
+.coverage
+htmlcov/
+.venv/
+venv/
+*.log
--- /dev/null
+++ b/README.md
@@ -0,0 +1,119 @@
+# Picarones
+
+> **OCR/HTR engine comparison platform for heritage documents**
+> BnF — Département numérique · Apache 2.0
+
+Picarones lets you rigorously evaluate and compare OCR engines (Tesseract, Pero OCR, Kraken, cloud APIs…) as well as OCR+LLM pipelines on corpora of historical documents: manuscripts, early printed books, archives.
+
+---
+
+## Sprint 1: what is implemented
+
+- Complete Python project structure (`picarones/`)
+- **Tesseract 5** adapter (`pytesseract`)
+- **Pero OCR** adapter (requires `pero-ocr`)
+- Abstract `BaseOCREngine` interface for easily adding new engines
+- **CER** and **WER** computation via `jiwer` (raw, NFC, caseless, normalized, MER, WIL)
+- **Corpus** loading from a local directory (image / `.gt.txt` pairs)
+- Structured **JSON export** of results, with ranking
+- `click` **CLI**: `run`, `metrics`, `engines`, `info` commands
+
+---
+
+## Installation
+
+```bash
+pip install -e .
+
+# For Tesseract, also install the system binary:
+# Ubuntu/Debian: sudo apt install tesseract-ocr tesseract-ocr-fra
+# macOS: brew install tesseract
+
+# For Pero OCR (optional):
+pip install pero-ocr
+```
+
+## Quick usage
+
+```bash
+# Run a benchmark on a local corpus
+picarones run --corpus ./mon_corpus/ --engines tesseract --output resultats.json
+
+# Multiple engines
+picarones run --corpus ./corpus/ --engines tesseract,pero_ocr --lang fra
+
+# Compute CER/WER between two files
+picarones metrics --reference gt.txt --hypothesis ocr.txt
+
+# List the available engines
+picarones engines
+
+# Version info
+picarones info
+```
+
+## Project structure
+
+```
+picarones/
+├── __init__.py
+├── cli.py               # Click CLI
+├── core/
+│   ├── corpus.py        # Corpus loading
+│   ├── metrics.py       # CER/WER (jiwer)
+│   ├── results.py       # Data models + JSON export
+│   └── runner.py        # Benchmark orchestrator
+└── engines/
+    ├── base.py          # Abstract BaseOCREngine interface
+    ├── tesseract.py     # Tesseract adapter
+    └── pero_ocr.py      # Pero OCR adapter
+tests/
+├── test_metrics.py
+├── test_corpus.py
+├── test_engines.py
+└── test_results.py
+```
+
+## Corpus format
+
+A local corpus is a directory containing pairs:
+
+```
+corpus/
+├── page_001.jpg
+├── page_001.gt.txt   ← UTF-8 ground truth
+├── page_002.png
+├── page_002.gt.txt
+└── ...
+```
+
+## JSON output format
+
+```json
+{
+  "picarones_version": "0.1.0",
+  "run_date": "2025-03-04T...",
+  "corpus": { "name": "...", "document_count": 50 },
+  "ranking": [
+    { "engine": "tesseract", "mean_cer": 0.043, "mean_wer": 0.112 }
+  ],
+  "engine_reports": [...]
+}
+```
+
+## Running the tests
+
+```bash
+pytest
+```
+
+## Roadmap
+
+| Sprint | Deliverables |
+|--------|--------------|
+| **Sprint 1** ✅ | Structure, Tesseract + Pero OCR adapters, CER/WER, JSON, CLI |
+| Sprint 2 | Interactive HTML report with colorized diff |
+| Sprint 3 | OCR+LLM pipelines (GPT-4o, Claude) |
+| Sprint 4 | Cloud OCR APIs, IIIF import, diplomatic normalization |
+| Sprint 5 | Advanced metrics: Unicode confusion matrix, ligatures |
+| Sprint 6 | FastAPI web interface, HTR-United / HuggingFace import |
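The `ranking` array in the JSON output documented above can be consumed by a CI script with the standard library alone. A sketch using the sample payload from the README (field names as shown there; the values are illustrative):

```python
import json

# Sample payload matching the README's documented output format
payload = json.loads("""
{
  "picarones_version": "0.1.0",
  "corpus": {"name": "demo", "document_count": 50},
  "ranking": [
    {"engine": "tesseract", "mean_cer": 0.043, "mean_wer": 0.112}
  ]
}
""")

# The ranking is ordered by CER, so the first entry is the best engine
best = payload["ranking"][0]
print(f"best engine: {best['engine']} (CER {best['mean_cer'] * 100:.2f}%)")
```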
--- /dev/null
+++ b/picarones/__init__.py
@@ -0,0 +1,9 @@
+"""
+Picarones, an OCR engine comparison platform for heritage documents.
+
+BnF, Département numérique, 2025.
+Apache 2.0 license.
+"""
+
+__version__ = "0.1.0"
+__author__ = "BnF — Département numérique"
--- /dev/null
+++ b/picarones/cli.py
@@ -0,0 +1,295 @@
+"""Picarones command-line interface (Click).
+
+Available commands
+------------------
+    picarones run      Runs a full benchmark
+    picarones metrics  Computes CER/WER between two text files
+    picarones engines  Lists the available engines
+    picarones info     Version information
+
+Usage examples
+--------------
+    picarones run --corpus ./corpus/ --engines tesseract --output results.json
+    picarones metrics --reference gt.txt --hypothesis ocr.txt
+    picarones engines
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+import sys
+from pathlib import Path
+
+import click
+
+from picarones import __version__
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+def _setup_logging(verbose: bool) -> None:
+    level = logging.DEBUG if verbose else logging.INFO
+    logging.basicConfig(
+        format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
+        datefmt="%H:%M:%S",
+        level=level,
+    )
+
+
+def _engine_from_name(engine_name: str, lang: str, psm: int) -> "BaseOCREngine":
+    """Instantiates an engine by name."""
+    from picarones.engines.tesseract import TesseractEngine
+
+    if engine_name in {"tesseract", "tess"}:
+        return TesseractEngine(config={"lang": lang, "psm": psm})
+
+    try:
+        from picarones.engines.pero_ocr import PeroOCREngine
+
+        if engine_name in {"pero_ocr", "pero"}:
+            return PeroOCREngine(config={"name": "pero_ocr"})
+    except ImportError:
+        pass
+
+    raise click.BadParameter(
+        f"Unknown or unavailable engine: '{engine_name}'. "
+        "Supported engines: tesseract, pero_ocr"
+    )
+
+
+# ---------------------------------------------------------------------------
+# Main group
+# ---------------------------------------------------------------------------
+
+@click.group(context_settings={"help_option_names": ["-h", "--help"]})
+@click.version_option(__version__, "-V", "--version", prog_name="picarones")
+def cli() -> None:
+    """Picarones: OCR engine comparison platform for heritage documents.
+
+    Bibliothèque nationale de France, Département numérique.
+    """
+
+
+# ---------------------------------------------------------------------------
+# picarones run
+# ---------------------------------------------------------------------------
+
+@cli.command("run")
+@click.option(
+    "--corpus", "-c",
+    required=True,
+    type=click.Path(exists=True, file_okay=False, resolve_path=True),
+    help="Directory containing the image / .gt.txt pairs",
+)
+@click.option(
+    "--engines", "-e",
+    default="tesseract",
+    show_default=True,
+    help="Comma-separated list of engines (e.g. tesseract,pero_ocr)",
+)
+@click.option(
+    "--output", "-o",
+    default="results.json",
+    show_default=True,
+    type=click.Path(resolve_path=True),
+    help="Output JSON file",
+)
+@click.option(
+    "--lang", "-l",
+    default="fra",
+    show_default=True,
+    help="Tesseract language code (fra, lat, eng…)",
+)
+@click.option("--psm", default=6, show_default=True, help="Tesseract Page Segmentation Mode (0-13)")
+@click.option("--no-progress", is_flag=True, default=False, help="Disables the progress bar")
+@click.option("--verbose", "-v", is_flag=True, default=False, help="Verbose mode")
+@click.option(
+    "--fail-if-cer-above",
+    default=None,
+    type=float,
+    metavar="THRESHOLD",
+    help="Exit with code 1 if mean CER > THRESHOLD (CI/CD usage)",
+)
+def run_cmd(
+    corpus: str,
+    engines: str,
+    output: str,
+    lang: str,
+    psm: int,
+    no_progress: bool,
+    verbose: bool,
+    fail_if_cer_above: float | None,
+) -> None:
+    """Runs an OCR benchmark on a document corpus.
+
+    The corpus must be a directory containing
+    <image>.<ext> + <image>.gt.txt (ground truth) pairs.
+    """
+    _setup_logging(verbose)
+
+    from picarones.core.corpus import load_corpus_from_directory
+    from picarones.core.runner import run_benchmark
+
+    # Corpus loading
+    try:
+        corp = load_corpus_from_directory(corpus)
+    except (FileNotFoundError, ValueError) as exc:
+        click.echo(f"Corpus error: {exc}", err=True)
+        sys.exit(1)
+
+    click.echo(f"Corpus '{corp.name}': {len(corp)} documents loaded.")
+
+    # Engine instantiation
+    engine_names = [e.strip() for e in engines.split(",") if e.strip()]
+    ocr_engines = []
+    for name in engine_names:
+        try:
+            engine = _engine_from_name(name, lang=lang, psm=psm)
+            ocr_engines.append(engine)
+        except click.BadParameter as exc:
+            click.echo(f"Engine error: {exc}", err=True)
+            sys.exit(1)
+
+    if not ocr_engines:
+        click.echo("No valid engine specified.", err=True)
+        sys.exit(1)
+
+    click.echo(f"Engines: {', '.join(e.name for e in ocr_engines)}")
+
+    # Benchmark run
+    result = run_benchmark(
+        corpus=corp,
+        engines=ocr_engines,
+        output_json=output,
+        show_progress=not no_progress,
+    )
+
+    # Ranking display
+    click.echo("\n── Ranking ─────────────────────────────────────")
+    for rank, entry in enumerate(result.ranking(), 1):
+        cer_pct = f"{entry['mean_cer'] * 100:.2f}%" if entry["mean_cer"] is not None else "N/A"
+        wer_pct = f"{entry['mean_wer'] * 100:.2f}%" if entry["mean_wer"] is not None else "N/A"
+        failed = entry["failed"]
+        failed_str = f" ({failed} error(s))" if failed else ""
+        click.echo(f"  {rank}. {entry['engine']:<20} CER={cer_pct:<8} WER={wer_pct}{failed_str}")
+
+    click.echo(f"\nResults written to: {output}")
+
+    # CI/CD mode: non-zero exit code if CER > threshold
+    if fail_if_cer_above is not None:
+        for entry in result.ranking():
+            if entry["mean_cer"] is not None and entry["mean_cer"] * 100 > fail_if_cer_above:
+                click.echo(
+                    f"\nFAILURE: {entry['engine']} CER={entry['mean_cer']*100:.2f}% "
+                    f"> threshold {fail_if_cer_above:.2f}%",
+                    err=True,
+                )
+                sys.exit(1)
+
+
+# ---------------------------------------------------------------------------
+# picarones metrics
+# ---------------------------------------------------------------------------
+
+@cli.command("metrics")
+@click.option(
+    "--reference", "-r",
+    required=True,
+    type=click.Path(exists=True, dir_okay=False),
+    help="Ground-truth file (UTF-8 plain text)",
+)
+@click.option(
+    "--hypothesis", "-H",
+    required=True,
+    type=click.Path(exists=True, dir_okay=False),
+    help="OCR transcription file (UTF-8 plain text)",
+)
+@click.option("--json-output", is_flag=True, default=False, help="JSON output")
+def metrics_cmd(reference: str, hypothesis: str, json_output: bool) -> None:
+    """Computes CER and WER between two text files."""
+    from picarones.core.metrics import compute_metrics
+
+    ref_text = Path(reference).read_text(encoding="utf-8").strip()
+    hyp_text = Path(hypothesis).read_text(encoding="utf-8").strip()
+
+    result = compute_metrics(ref_text, hyp_text)
+
+    if json_output:
+        click.echo(json.dumps(result.as_dict(), ensure_ascii=False, indent=2))
+    else:
+        click.echo(f"CER             : {result.cer_percent:.2f}%")
+        click.echo(f"CER (NFC)       : {result.cer_nfc * 100:.2f}%")
+        click.echo(f"CER (caseless)  : {result.cer_caseless * 100:.2f}%")
+        click.echo(f"WER             : {result.wer_percent:.2f}%")
+        click.echo(f"WER (normalized): {result.wer_normalized * 100:.2f}%")
+        click.echo(f"MER             : {result.mer * 100:.2f}%")
+        click.echo(f"WIL             : {result.wil * 100:.2f}%")
+        click.echo(f"GT length       : {result.reference_length} chars")
+        click.echo(f"OCR length      : {result.hypothesis_length} chars")
+        if result.error:
+            click.echo(f"Error: {result.error}", err=True)
+
+
+# ---------------------------------------------------------------------------
+# picarones engines
+# ---------------------------------------------------------------------------
+
+@cli.command("engines")
+def engines_cmd() -> None:
+    """Lists the available OCR engines and checks their installation."""
+    engines = [
+        ("tesseract", "Tesseract 5 (pytesseract)", "pytesseract"),
+        ("pero_ocr", "Pero OCR", "pero_ocr"),
+    ]
+
+    click.echo("Available OCR engines:\n")
+    for engine_id, label, module in engines:
+        try:
+            __import__(module)
+            status = click.style("✓ available", fg="green")
+        except ImportError:
+            status = click.style("✗ not installed", fg="red")
+        click.echo(f"  {engine_id:<15} {label:<35} {status}")
+
+    click.echo(
+        "\nTo install a missing engine:\n"
+        "  pip install pytesseract\n"
+        "  pip install pero-ocr"
+    )
+
+
+# ---------------------------------------------------------------------------
+# picarones info
+# ---------------------------------------------------------------------------
+
+@cli.command("info")
+def info_cmd() -> None:
+    """Displays version information for Picarones and its dependencies."""
+    click.echo(f"Picarones v{__version__}")
+    click.echo("BnF — Département numérique\n")
+
+    deps = [
+        ("click", "click"),
+        ("jiwer", "jiwer"),
+        ("Pillow", "PIL"),
+        ("pytesseract", "pytesseract"),
+        ("tqdm", "tqdm"),
+        ("numpy", "numpy"),
+        ("pyyaml", "yaml"),
+    ]
+
+    click.echo("Dependencies:")
+    for name, module in deps:
+        try:
+            mod = __import__(module)
+            version = getattr(mod, "__version__", "installed")
+            status = click.style(f"v{version}", fg="green")
+        except ImportError:
+            status = click.style("not installed", fg="red")
+        click.echo(f"  {name:<15} {status}")
+
+
+if __name__ == "__main__":
+    cli()
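Note the unit convention in `run_cmd`'s CI gate: `mean_cer` values are fractions (0.043 means 4.3%), while `--fail-if-cer-above` is given in percent, which is why the comparison multiplies by 100. A standalone sketch of that gate logic (the `cer_gate` helper is illustrative, not part of the package):

```python
def cer_gate(ranking: list[dict], threshold_percent: float) -> int:
    """Exit code implied by --fail-if-cer-above.

    `mean_cer` is a fraction (0.043 == 4.3%); the threshold is expressed
    in percent, so multiply by 100 before comparing. Engines whose CER
    could not be computed (None) are skipped, as in run_cmd.
    """
    for entry in ranking:
        if entry["mean_cer"] is not None and entry["mean_cer"] * 100 > threshold_percent:
            return 1
    return 0


print(cer_gate([{"engine": "tesseract", "mean_cer": 0.043}], 5.0))  # 0: 4.3% <= 5%
print(cer_gate([{"engine": "tesseract", "mean_cer": 0.043}], 4.0))  # 1: 4.3% > 4%
```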
--- /dev/null
+++ b/picarones/core/__init__.py
@@ -0,0 +1 @@
+"""Core modules: corpus, metrics, results, orchestration."""
--- /dev/null
+++ b/picarones/core/corpus.py
@@ -0,0 +1,152 @@
+"""Loading and management of document corpora.
+
+Supported format (Sprint 1): local directory with image / .gt.txt pairs
+
+Convention:
+    mon_document.jpg  ←→  mon_document.gt.txt
+    page_001.png      ←→  page_001.gt.txt
+
+Accepted image extensions: .jpg, .jpeg, .png, .tif, .tiff, .bmp, .webp
+"""
+
+from __future__ import annotations
+
+import logging
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Iterator, Optional
+
+logger = logging.getLogger(__name__)
+
+# Recognized image extensions
+IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".webp"}
+
+
+@dataclass
+class Document:
+    """An (image, ground-truth text) pair."""
+
+    image_path: Path
+    ground_truth: str
+    doc_id: str = ""
+    metadata: dict = field(default_factory=dict)
+
+    def __post_init__(self) -> None:
+        if not self.doc_id:
+            self.doc_id = self.image_path.stem
+
+
+@dataclass
+class Corpus:
+    """Collection of documents with their metadata."""
+
+    name: str
+    documents: list[Document]
+    source_path: Optional[str] = None
+    metadata: dict = field(default_factory=dict)
+
+    def __len__(self) -> int:
+        return len(self.documents)
+
+    def __iter__(self) -> Iterator[Document]:
+        return iter(self.documents)
+
+    def __repr__(self) -> str:
+        return f"Corpus(name={self.name!r}, documents={len(self.documents)})"
+
+    @property
+    def stats(self) -> dict:
+        gt_lengths = [len(doc.ground_truth) for doc in self.documents]
+        if not gt_lengths:
+            return {"document_count": 0}
+        import statistics
+
+        return {
+            "document_count": len(self.documents),
+            "gt_length_mean": round(statistics.mean(gt_lengths), 1),
+            "gt_length_median": round(statistics.median(gt_lengths), 1),
+            "gt_length_min": min(gt_lengths),
+            "gt_length_max": max(gt_lengths),
+        }
+
+
+def load_corpus_from_directory(
+    directory: str | Path,
+    name: Optional[str] = None,
+    gt_suffix: str = ".gt.txt",
+    encoding: str = "utf-8",
+) -> Corpus:
+    """Loads a corpus from a local directory of image / GT pairs.
+
+    Parameters
+    ----------
+    directory:
+        Path to the directory containing the image + GT file pairs.
+    name:
+        Corpus name (default: the directory name).
+    gt_suffix:
+        Suffix of the ground-truth files (default: ``.gt.txt``).
+    encoding:
+        Encoding of the text files (default: utf-8).
+
+    Returns
+    -------
+    Corpus
+        Corpus object ready for use in the pipeline.
+
+    Raises
+    ------
+    FileNotFoundError
+        If the directory does not exist.
+    ValueError
+        If no valid document is found.
+    """
+    directory = Path(directory)
+    if not directory.is_dir():
+        raise FileNotFoundError(f"Directory not found: {directory}")
+
+    corpus_name = name or directory.name
+    documents: list[Document] = []
+    skipped = 0
+
+    # Collect all images
+    image_paths = sorted(
+        p for p in directory.iterdir() if p.suffix.lower() in IMAGE_EXTENSIONS
+    )
+
+    for image_path in image_paths:
+        gt_path = image_path.with_name(image_path.stem + gt_suffix)
+        if not gt_path.exists():
+            logger.debug("No GT file for %s, skipped.", image_path.name)
+            skipped += 1
+            continue
+
+        try:
+            ground_truth = gt_path.read_text(encoding=encoding).strip()
+        except OSError as exc:
+            logger.warning("Cannot read %s: %s, skipped.", gt_path, exc)
+            skipped += 1
+            continue
+
+        documents.append(
+            Document(
+                image_path=image_path,
+                ground_truth=ground_truth,
+            )
+        )
+
+    if not documents:
+        raise ValueError(
+            f"No valid document found in {directory}. "
+            f"Check that the GT files use the suffix '{gt_suffix}'."
+        )
+
+    if skipped:
+        logger.info("%d image(s) skipped for lack of a GT file.", skipped)
+
+    logger.info("Corpus '%s' loaded: %d documents.", corpus_name, len(documents))
+    return Corpus(
+        name=corpus_name,
+        documents=documents,
+        source_path=str(directory),
+    )
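The pairing convention implemented by `load_corpus_from_directory` (look up `stem + ".gt.txt"` next to each image, skip images without ground truth) can be exercised without the package itself. A self-contained sketch of the same lookup on a throwaway directory:

```python
import tempfile
from pathlib import Path

# Same extension set as picarones.core.corpus
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".webp"}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # One complete pair, plus one image without ground truth
    (root / "page_001.jpg").touch()
    (root / "page_001.gt.txt").write_text("Lorem ipsum", encoding="utf-8")
    (root / "page_002.png").touch()

    pairs = []
    for image in sorted(p for p in root.iterdir() if p.suffix.lower() in IMAGE_EXTENSIONS):
        gt = image.with_name(image.stem + ".gt.txt")  # page_001.jpg -> page_001.gt.txt
        if gt.exists():  # images without a GT file are skipped, as in the loader
            pairs.append((image.name, gt.read_text(encoding="utf-8").strip()))

    print(pairs)  # [('page_001.jpg', 'Lorem ipsum')]
```

`page_002.png` is silently skipped, matching the loader's `skipped` counter behaviour.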
--- /dev/null
+++ b/picarones/core/metrics.py
@@ -0,0 +1,214 @@
+"""CER and WER metric computation via jiwer.
+
+Implemented metrics
+-------------------
+- Raw CER: character edit distance / GT length
+- NFC-normalized CER: after Unicode NFC normalization
+- Caseless CER: case-insensitive
+- Raw WER: standard word error rate
+- Normalized WER: after whitespace normalization
+- MER: Match Error Rate (jiwer)
+- WIL: Word Information Lost (jiwer)
+"""
+
+from __future__ import annotations
+
+import unicodedata
+from dataclasses import dataclass
+from typing import Optional
+
+try:
+    import jiwer
+
+    _JIWER_AVAILABLE = True
+except ImportError:
+    _JIWER_AVAILABLE = False
+
+
+# ---------------------------------------------------------------------------
+# Transformations / normalizations
+# ---------------------------------------------------------------------------
+
+def _normalize_nfc(text: str) -> str:
+    return unicodedata.normalize("NFC", text)
+
+
+def _normalize_caseless(text: str) -> str:
+    return unicodedata.normalize("NFC", text).casefold()
+
+
+def _normalize_whitespace(text: str) -> str:
+    return " ".join(text.split())
+
+
+# jiwer transformations for the CER (each char becomes a "word")
+_CHAR_TRANSFORM = jiwer.transforms.Compose([]) if _JIWER_AVAILABLE else None
+
+# jiwer transformations for the WER (light whitespace normalization)
+_WER_TRANSFORM = (
+    jiwer.transforms.Compose(
+        [
+            jiwer.transforms.RemoveMultipleSpaces(),
+            jiwer.transforms.Strip(),
+            jiwer.transforms.ReduceToListOfListOfWords(),
+        ]
+    )
+    if _JIWER_AVAILABLE
+    else None
+)
+
+
+def _cer_from_strings(reference: str, hypothesis: str) -> float:
+    """Raw CER: edit distance over characters."""
+    if not reference:
+        return 0.0 if not hypothesis else 1.0
+    # jiwer.cer treats each character as a token
+    return jiwer.cer(reference, hypothesis)
+
+
+# ---------------------------------------------------------------------------
+# Structured result
+# ---------------------------------------------------------------------------
+
+@dataclass
+class MetricsResult:
+    """All metrics computed for one (reference, hypothesis) pair."""
+
+    cer: float
+    cer_nfc: float
+    cer_caseless: float
+    wer: float
+    wer_normalized: float
+    mer: float
+    wil: float
+    reference_length: int
+    hypothesis_length: int
+    error: Optional[str] = None
+
+    def as_dict(self) -> dict:
+        return {
+            "cer": round(self.cer, 6),
+            "cer_nfc": round(self.cer_nfc, 6),
+            "cer_caseless": round(self.cer_caseless, 6),
+            "wer": round(self.wer, 6),
+            "wer_normalized": round(self.wer_normalized, 6),
+            "mer": round(self.mer, 6),
+            "wil": round(self.wil, 6),
+            "reference_length": self.reference_length,
+            "hypothesis_length": self.hypothesis_length,
+            "error": self.error,
+        }
+
+    @property
+    def cer_percent(self) -> float:
+        return round(self.cer * 100, 2)
+
+    @property
+    def wer_percent(self) -> float:
+        return round(self.wer * 100, 2)
+
+
+def compute_metrics(reference: str, hypothesis: str) -> MetricsResult:
+    """Computes the full set of CER/WER metrics for a pair of texts.
+
+    Parameters
+    ----------
+    reference:
| 117 |
+
Texte de vérité terrain (ground truth).
|
| 118 |
+
hypothesis:
|
| 119 |
+
Texte produit par le moteur OCR.
|
| 120 |
+
|
| 121 |
+
Returns
|
| 122 |
+
-------
|
| 123 |
+
MetricsResult
|
| 124 |
+
Objet contenant toutes les métriques calculées.
|
| 125 |
+
"""
|
| 126 |
+
if not _JIWER_AVAILABLE:
|
| 127 |
+
return MetricsResult(
|
| 128 |
+
cer=0.0, cer_nfc=0.0, cer_caseless=0.0,
|
| 129 |
+
wer=0.0, wer_normalized=0.0, mer=0.0, wil=0.0,
|
| 130 |
+
reference_length=len(reference),
|
| 131 |
+
hypothesis_length=len(hypothesis),
|
| 132 |
+
error="jiwer n'est pas installé (pip install jiwer)",
|
| 133 |
+
)
|
| 134 |
+
|
| 135 |
+
try:
|
| 136 |
+
# CER variants
|
| 137 |
+
cer_raw = _cer_from_strings(reference, hypothesis)
|
| 138 |
+
cer_nfc = _cer_from_strings(
|
| 139 |
+
_normalize_nfc(reference), _normalize_nfc(hypothesis)
|
| 140 |
+
)
|
| 141 |
+
cer_caseless = _cer_from_strings(
|
| 142 |
+
_normalize_caseless(reference), _normalize_caseless(hypothesis)
|
| 143 |
+
)
|
| 144 |
+
|
| 145 |
+
# WER variants
|
| 146 |
+
ref_norm = _normalize_whitespace(reference)
|
| 147 |
+
hyp_norm = _normalize_whitespace(hypothesis)
|
| 148 |
+
|
| 149 |
+
wer_raw = jiwer.wer(reference, hypothesis)
|
| 150 |
+
wer_normalized = jiwer.wer(ref_norm, hyp_norm)
|
| 151 |
+
mer = jiwer.mer(reference, hypothesis)
|
| 152 |
+
wil = jiwer.wil(reference, hypothesis)
|
| 153 |
+
|
| 154 |
+
return MetricsResult(
|
| 155 |
+
cer=cer_raw,
|
| 156 |
+
cer_nfc=cer_nfc,
|
| 157 |
+
cer_caseless=cer_caseless,
|
| 158 |
+
wer=wer_raw,
|
| 159 |
+
wer_normalized=wer_normalized,
|
| 160 |
+
mer=mer,
|
| 161 |
+
wil=wil,
|
| 162 |
+
reference_length=len(reference),
|
| 163 |
+
hypothesis_length=len(hypothesis),
|
| 164 |
+
)
|
| 165 |
+
|
| 166 |
+
except Exception as exc: # noqa: BLE001
|
| 167 |
+
return MetricsResult(
|
| 168 |
+
cer=0.0, cer_nfc=0.0, cer_caseless=0.0,
|
| 169 |
+
wer=0.0, wer_normalized=0.0, mer=0.0, wil=0.0,
|
| 170 |
+
reference_length=len(reference),
|
| 171 |
+
hypothesis_length=len(hypothesis),
|
| 172 |
+
error=str(exc),
|
| 173 |
+
)
|
| 174 |
+
|
| 175 |
+
|
| 176 |
+
def aggregate_metrics(results: list[MetricsResult]) -> dict:
|
| 177 |
+
"""Calcule les statistiques agrégées sur un ensemble de résultats.
|
| 178 |
+
|
| 179 |
+
Parameters
|
| 180 |
+
----------
|
| 181 |
+
results:
|
| 182 |
+
Liste de MetricsResult correspondant à plusieurs documents.
|
| 183 |
+
|
| 184 |
+
Returns
|
| 185 |
+
-------
|
| 186 |
+
dict
|
| 187 |
+
Statistiques : moyenne, médiane, min, max, std pour chaque métrique.
|
| 188 |
+
"""
|
| 189 |
+
import statistics
|
| 190 |
+
|
| 191 |
+
if not results:
|
| 192 |
+
return {}
|
| 193 |
+
|
| 194 |
+
def _stats(values: list[float]) -> dict:
|
| 195 |
+
if not values:
|
| 196 |
+
return {}
|
| 197 |
+
return {
|
| 198 |
+
"mean": round(statistics.mean(values), 6),
|
| 199 |
+
"median": round(statistics.median(values), 6),
|
| 200 |
+
"min": round(min(values), 6),
|
| 201 |
+
"max": round(max(values), 6),
|
| 202 |
+
"stdev": round(statistics.stdev(values), 6) if len(values) > 1 else 0.0,
|
| 203 |
+
}
|
| 204 |
+
|
| 205 |
+
metric_names = ["cer", "cer_nfc", "cer_caseless", "wer", "wer_normalized", "mer", "wil"]
|
| 206 |
+
aggregated: dict = {}
|
| 207 |
+
for metric in metric_names:
|
| 208 |
+
values = [getattr(r, metric) for r in results if r.error is None]
|
| 209 |
+
aggregated[metric] = _stats(values)
|
| 210 |
+
|
| 211 |
+
aggregated["document_count"] = len(results)
|
| 212 |
+
aggregated["failed_count"] = sum(1 for r in results if r.error is not None)
|
| 213 |
+
|
| 214 |
+
return aggregated
|
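The raw CER above delegates to jiwer, but the underlying definition is simply the character-level Levenshtein edit distance divided by the reference length. A minimal dependency-free sketch (illustrative only; jiwer additionally runs its own transform pipeline, and the function names here are ours):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                 # deletion
                cur[j - 1] + 1,              # insertion
                prev[j - 1] + (ca != cb),    # substitution (0 if equal)
            ))
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    # Same empty-reference convention as _cer_from_strings above.
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


print(cer("chat", "chut"))  # 0.25 (one substitution over 4 characters)
```

This also makes the 0-to-1 scale explicit: a CER above 1.0 is possible when the hypothesis is much longer than the reference.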
@@ -0,0 +1,155 @@
# picarones/core/results.py
"""Result data model and JSON export.

Hierarchy
---------
BenchmarkResult
└── EngineReport (one per engine)
    └── DocumentResult (one per document)
"""

from __future__ import annotations

import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

from picarones import __version__
from picarones.core.metrics import MetricsResult, aggregate_metrics


@dataclass
class DocumentResult:
    """Result of one engine on a single document."""

    doc_id: str
    image_path: str
    ground_truth: str
    hypothesis: str
    metrics: MetricsResult
    duration_seconds: float
    engine_error: Optional[str] = None

    def as_dict(self) -> dict:
        return {
            "doc_id": self.doc_id,
            "image_path": self.image_path,
            "ground_truth": self.ground_truth,
            "hypothesis": self.hypothesis,
            "metrics": self.metrics.as_dict(),
            "duration_seconds": self.duration_seconds,
            "engine_error": self.engine_error,
        }


@dataclass
class EngineReport:
    """Full report of one engine over the whole corpus."""

    engine_name: str
    engine_version: str
    engine_config: dict
    document_results: list[DocumentResult]
    aggregated_metrics: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not self.aggregated_metrics and self.document_results:
            self.aggregated_metrics = aggregate_metrics(
                [dr.metrics for dr in self.document_results]
            )

    @property
    def mean_cer(self) -> Optional[float]:
        cer_stats = self.aggregated_metrics.get("cer", {})
        return cer_stats.get("mean")

    @property
    def mean_wer(self) -> Optional[float]:
        wer_stats = self.aggregated_metrics.get("wer", {})
        return wer_stats.get("mean")

    def as_dict(self) -> dict:
        return {
            "engine_name": self.engine_name,
            "engine_version": self.engine_version,
            "engine_config": self.engine_config,
            "aggregated_metrics": self.aggregated_metrics,
            "document_results": [dr.as_dict() for dr in self.document_results],
        }


@dataclass
class BenchmarkResult:
    """Complete result of a multi-engine benchmark over one corpus."""

    corpus_name: str
    corpus_source: Optional[str]
    document_count: int
    engine_reports: list[EngineReport]
    run_date: str = field(default_factory=lambda: datetime.now(tz=timezone.utc).isoformat())
    picarones_version: str = __version__
    metadata: dict = field(default_factory=dict)

    def ranking(self) -> list[dict]:
        """Return the engine ranking sorted by ascending CER."""
        ranked = []
        for report in self.engine_reports:
            ranked.append(
                {
                    "engine": report.engine_name,
                    "mean_cer": report.mean_cer,
                    "mean_wer": report.mean_wer,
                    "documents": len(report.document_results),
                    "failed": report.aggregated_metrics.get("failed_count", 0),
                }
            )
        # Engines without a mean CER sort last. The explicit None check matters:
        # `x["mean_cer"] or float("inf")` would wrongly treat a perfect CER of
        # 0.0 as missing.
        return sorted(
            ranked,
            key=lambda x: (
                x["mean_cer"] is None,
                x["mean_cer"] if x["mean_cer"] is not None else float("inf"),
            ),
        )

    def as_dict(self) -> dict:
        return {
            "picarones_version": self.picarones_version,
            "run_date": self.run_date,
            "corpus": {
                "name": self.corpus_name,
                "source": self.corpus_source,
                "document_count": self.document_count,
            },
            "ranking": self.ranking(),
            "engine_reports": [r.as_dict() for r in self.engine_reports],
            "metadata": self.metadata,
        }

    def to_json(self, path: str | Path, indent: int = 2) -> Path:
        """Serialize the benchmark to JSON and write it to disk.

        Parameters
        ----------
        path:
            Output JSON file path.
        indent:
            JSON indentation (default: 2 spaces).

        Returns
        -------
        Path
            Absolute path of the written file.
        """
        output_path = Path(path)
        output_path.parent.mkdir(parents=True, exist_ok=True)
        with output_path.open("w", encoding="utf-8") as fh:
            json.dump(self.as_dict(), fh, ensure_ascii=False, indent=indent)
        return output_path.resolve()

    @classmethod
    def from_json(cls, path: str | Path) -> dict:
        """Load a raw JSON result from disk (for the HTML report).

        Returns the Python dict; full reconstruction into objects is
        deferred to later sprints.
        """
        with Path(path).open(encoding="utf-8") as fh:
            return json.load(fh)
@@ -0,0 +1,115 @@
# picarones/core/runner.py
"""Benchmark orchestrator: runs each engine over the corpus and aggregates the results."""

from __future__ import annotations

import logging
from pathlib import Path
from typing import Optional

from tqdm import tqdm

from picarones.core.corpus import Corpus
from picarones.core.metrics import MetricsResult, compute_metrics
from picarones.core.results import BenchmarkResult, DocumentResult, EngineReport
from picarones.engines.base import BaseOCREngine

logger = logging.getLogger(__name__)


def run_benchmark(
    corpus: Corpus,
    engines: list[BaseOCREngine],
    output_json: Optional[str | Path] = None,
    show_progress: bool = True,
) -> BenchmarkResult:
    """Run the benchmark of one or more engines over a corpus.

    For each engine, every document is processed sequentially. The
    outputs are scored against the ground truth with the CER and WER
    metrics.

    Parameters
    ----------
    corpus:
        Corpus to evaluate (a ``Corpus`` object with its ``Document``s).
    engines:
        List of engine adapters to compare.
    output_json:
        Optional path for the JSON result. If ``None``, nothing is
        written to disk.
    show_progress:
        Show a tqdm progress bar (default: True).

    Returns
    -------
    BenchmarkResult
        Object holding all results, aggregations, and the ranking.
    """
    engine_reports: list[EngineReport] = []

    for engine in engines:
        logger.info("Starting engine: %s", engine.name)
        document_results: list[DocumentResult] = []

        iterator = tqdm(
            corpus.documents,
            desc=f"[{engine.name}]",
            unit="doc",
            disable=not show_progress,
        )

        for doc in iterator:
            ocr_result = engine.run(doc.image_path)

            if ocr_result.success:
                metrics = compute_metrics(doc.ground_truth, ocr_result.text)
            else:
                # Engine failure: worst-case metrics, with the error recorded
                metrics = MetricsResult(
                    cer=1.0, cer_nfc=1.0, cer_caseless=1.0,
                    wer=1.0, wer_normalized=1.0, mer=1.0, wil=1.0,
                    reference_length=len(doc.ground_truth),
                    hypothesis_length=0,
                    error=ocr_result.error,
                )

            document_results.append(
                DocumentResult(
                    doc_id=doc.doc_id,
                    image_path=str(doc.image_path),
                    ground_truth=doc.ground_truth,
                    hypothesis=ocr_result.text,
                    metrics=metrics,
                    duration_seconds=ocr_result.duration_seconds,
                    engine_error=ocr_result.error,
                )
            )

        engine_version = engine._safe_version()
        report = EngineReport(
            engine_name=engine.name,
            engine_version=engine_version,
            engine_config=engine.config,
            document_results=document_results,
        )
        engine_reports.append(report)
        logger.info(
            "Engine %s finished (mean CER: %.2f%%)",
            engine.name,
            (report.mean_cer or 0) * 100,
        )

    benchmark = BenchmarkResult(
        corpus_name=corpus.name,
        corpus_source=corpus.source_path,
        document_count=len(corpus),
        engine_reports=engine_reports,
    )

    if output_json:
        path = benchmark.to_json(output_json)
        logger.info("Results written to: %s", path)

    return benchmark
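The shape of the inner benchmark loop (OCR each document, score it, fall back to a worst-case score when the engine fails) can be sketched without any of the real classes. Everything below is a hypothetical stand-in, including `fake_ocr`:

```python
def evaluate(docs, ocr, score):
    # docs: list of (ground_truth, image); ocr: image -> (text, error);
    # score: (reference, hypothesis) -> float in [0, 1].
    results = []
    for ground_truth, image in docs:
        text, error = ocr(image)
        # Failed engine run: worst-case score of 1.0, error kept for the report.
        cer = score(ground_truth, text) if error is None else 1.0
        results.append({"cer": cer, "error": error})
    return results


def fake_ocr(image):
    # Hypothetical engine: succeeds on img1, crashes on everything else.
    return ("abc", None) if image == "img1" else ("", "engine crashed")


out = evaluate(
    [("abc", "img1"), ("abc", "img2")],
    fake_ocr,
    lambda ref, hyp: 0.0 if ref == hyp else 1.0,
)
print([r["cer"] for r in out])  # [0.0, 1.0]
```

Scoring failures as 1.0 rather than skipping them keeps unreliable engines from looking artificially good in the aggregated means.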
@@ -0,0 +1,13 @@
# picarones/engines/__init__.py
"""OCR engine adapters."""

from picarones.engines.base import BaseOCREngine, EngineResult
from picarones.engines.tesseract import TesseractEngine

__all__ = ["BaseOCREngine", "EngineResult", "TesseractEngine"]

try:
    from picarones.engines.pero_ocr import PeroOCREngine

    __all__.append("PeroOCREngine")
except ImportError:
    pass
@@ -0,0 +1,85 @@
# picarones/engines/base.py
"""Abstract interface shared by all OCR engine adapters."""

from __future__ import annotations

import hashlib
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class EngineResult:
    """Raw result produced by an OCR engine on one image."""

    engine_name: str
    image_path: str
    text: str
    duration_seconds: float
    error: Optional[str] = None
    metadata: dict = field(default_factory=dict)

    @property
    def success(self) -> bool:
        return self.error is None

    @property
    def image_sha256(self) -> str:
        return hashlib.sha256(Path(self.image_path).read_bytes()).hexdigest()


class BaseOCREngine(ABC):
    """Base class that every OCR adapter inherits from.

    Each adapter must implement:
    - ``name``: unique engine identifier
    - ``version()``: returns the engine version as a string
    - ``_run_ocr(image_path)``: OCR execution logic, returns the raw text
    """

    def __init__(self, config: Optional[dict] = None) -> None:
        self.config: dict = config or {}

    @property
    @abstractmethod
    def name(self) -> str:
        """Unique, stable engine identifier."""

    @abstractmethod
    def version(self) -> str:
        """Return the engine version (e.g. '5.3.0')."""

    @abstractmethod
    def _run_ocr(self, image_path: Path) -> str:
        """Run OCR and return the raw extracted text."""

    def run(self, image_path: str | Path) -> EngineResult:
        """Public entry point: run OCR and measure the execution time."""
        image_path = Path(image_path)
        start = time.perf_counter()
        try:
            text = self._run_ocr(image_path)
            error = None
        except Exception as exc:  # noqa: BLE001
            text = ""
            error = str(exc)
        duration = time.perf_counter() - start
        return EngineResult(
            engine_name=self.name,
            image_path=str(image_path),
            text=text,
            duration_seconds=round(duration, 4),
            error=error,
            metadata={"engine_version": self._safe_version()},
        )

    def _safe_version(self) -> str:
        try:
            return self.version()
        except Exception:  # noqa: BLE001
            return "unknown"

    def __repr__(self) -> str:
        return f"{self.__class__.__name__}(name={self.name!r})"
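The base class applies the template-method pattern: subclasses supply only `_run_ocr`, while `run()` owns timing and error capture, so a crashing engine becomes data instead of an exception. A stripped-down standalone sketch of the same pattern (class and field names here are illustrative, not the picarones API):

```python
import time
from abc import ABC, abstractmethod


class MiniEngine(ABC):
    @abstractmethod
    def _work(self) -> str:
        ...

    def run(self) -> dict:
        # The public entry point never raises: failures are captured as data.
        start = time.perf_counter()
        try:
            text, error = self._work(), None
        except Exception as exc:
            text, error = "", str(exc)
        return {
            "text": text,
            "error": error,
            "duration_seconds": round(time.perf_counter() - start, 4),
        }


class OkEngine(MiniEngine):
    def _work(self) -> str:
        return "hello"


class BrokenEngine(MiniEngine):
    def _work(self) -> str:
        raise RuntimeError("model file missing")


print(OkEngine().run()["text"])       # hello
print(BrokenEngine().run()["error"])  # model file missing
```

This is what lets the orchestrator treat every `EngineResult` uniformly, whether the engine succeeded or not.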
@@ -0,0 +1,112 @@
# picarones/engines/pero_ocr.py
"""Pero OCR adapter.

Pero OCR is an HTR/OCR engine that performs well on historical documents,
developed by the Brno University of Technology.

Dependency: pero-ocr (pip install pero-ocr)
Repository: https://github.com/DCGM/pero-ocr

YAML configuration:
```yaml
name: pero_ocr
engine: pero_ocr
config: /path/to/config.ini  # Pero OCR configuration file
cuda: false                  # use the GPU if available
```
"""

from __future__ import annotations

from pathlib import Path
from typing import Optional

from picarones.engines.base import BaseOCREngine

try:
    import numpy as np
    from PIL import Image

    _PIL_AVAILABLE = True
except ImportError:
    _PIL_AVAILABLE = False

try:
    from pero_ocr.document_ocr.layout import PageLayout
    from pero_ocr.document_ocr.page_parser import PageParser

    _PERO_AVAILABLE = True
except ImportError:
    _PERO_AVAILABLE = False


class PeroOCREngine(BaseOCREngine):
    """Adapter for Pero OCR.

    Pero OCR produces structured output (PAGE XML); this adapter
    extracts the flat text in natural reading order.
    """

    def __init__(self, config: Optional[dict] = None) -> None:
        super().__init__(config)
        self._parser: Optional[object] = None

    @property
    def name(self) -> str:
        return self.config.get("name", "pero_ocr")

    def version(self) -> str:
        if not _PERO_AVAILABLE:
            raise RuntimeError("pero-ocr is not installed.")
        try:
            import pero_ocr

            return getattr(pero_ocr, "__version__", "unknown")
        except Exception:  # noqa: BLE001
            return "unknown"

    def _get_parser(self) -> "PageParser":
        """Instantiate the PageParser (lazily, once per engine)."""
        if self._parser is None:
            if not _PERO_AVAILABLE:
                raise RuntimeError(
                    "pero-ocr is not installed. "
                    "Install it with: pip install pero-ocr"
                )
            config_path = self.config.get("config")
            if not config_path:
                raise ValueError(
                    "The Pero OCR configuration requires a 'config' parameter "
                    "pointing to a valid Pero OCR .ini file."
                )
            import configparser

            parser_config = configparser.ConfigParser()
            parser_config.read(config_path)
            self._parser = PageParser(parser_config)
        return self._parser  # type: ignore[return-value]

    def _run_ocr(self, image_path: Path) -> str:
        if not _PIL_AVAILABLE:
            raise RuntimeError("Pillow is not installed.")

        parser = self._get_parser()

        image = np.array(Image.open(image_path).convert("RGB"))
        page_layout = PageLayout(id=image_path.stem, page_size=(image.shape[0], image.shape[1]))

        # Run the Pero OCR pipeline
        parser.process_page(image, page_layout)

        # Extract flat text in line order
        lines = []
        for region in page_layout.regions:
            for line in region.lines:
                if line.transcription:
                    lines.append(line.transcription.strip())

        return "\n".join(lines)

    @classmethod
    def from_config(cls, config: Optional[dict] = None) -> "PeroOCREngine":
        return cls(config=config or {})
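`_run_ocr` flattens Pero OCR's structured layout (regions containing lines) into plain text. The traversal itself can be sketched with stand-in classes; `FakeLine` and `FakeRegion` below only mimic the two fields the adapter reads, not the real pero-ocr objects:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeLine:
    transcription: Optional[str]


@dataclass
class FakeRegion:
    lines: List[FakeLine]


def flatten_layout(regions: List[FakeRegion]) -> str:
    # Keep only non-empty transcriptions, preserving reading order.
    lines = []
    for region in regions:
        for line in region.lines:
            if line.transcription:
                lines.append(line.transcription.strip())
    return "\n".join(lines)


page = [
    FakeRegion([FakeLine(" Premier "), FakeLine(None)]),  # empty line is dropped
    FakeRegion([FakeLine("Second")]),
]
print(flatten_layout(page))  # prints "Premier" then "Second"
```

Joining with newlines keeps line boundaries, which matters for the whitespace-normalized WER variant computed later.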
@@ -0,0 +1,84 @@
# picarones/engines/tesseract.py
"""Tesseract 5 adapter via pytesseract."""

from __future__ import annotations

from pathlib import Path
from typing import Optional

from picarones.engines.base import BaseOCREngine

try:
    import pytesseract
    from PIL import Image

    _PYTESSERACT_AVAILABLE = True
except ImportError:
    _PYTESSERACT_AVAILABLE = False


# Mapping of the PSM values accepted as YAML/CLI arguments
_PSM_LABELS = {
    0: "Orientation and script detection only",
    1: "Automatic page segmentation with OSD",
    3: "Fully automatic page segmentation (default)",
    4: "Single column of text",
    5: "Single uniform block of vertically aligned text",
    6: "Single uniform block of text",
    7: "Single text line",
    8: "Single word",
    9: "Single word in a circle",
    10: "Single character",
    11: "Sparse text",
    12: "Sparse text with OSD",
    13: "Raw line",
}


class TesseractEngine(BaseOCREngine):
    """Adapter for Tesseract 5 (via pytesseract).

    YAML configuration:
    ```yaml
    name: tesseract
    engine: tesseract
    lang: fra                 # Tesseract language code (fra, lat, eng, ...)
    psm: 6                    # Page Segmentation Mode (0-13)
    oem: 3                    # OCR Engine Mode (0=legacy, 1=LSTM, 3=default)
    tesseract_cmd: tesseract  # path to the executable if non-standard
    ```
    """

    @property
    def name(self) -> str:
        return self.config.get("name", "tesseract")

    def version(self) -> str:
        if not _PYTESSERACT_AVAILABLE:
            raise RuntimeError("pytesseract is not installed.")
        # str() works regardless of the version-object type returned by
        # pytesseract across its releases
        return str(pytesseract.get_tesseract_version())

    def _run_ocr(self, image_path: Path) -> str:
        if not _PYTESSERACT_AVAILABLE:
            raise RuntimeError(
                "pytesseract is not installed. "
                "Install it with: pip install pytesseract"
            )

        # Optional executable path override
        tesseract_cmd = self.config.get("tesseract_cmd")
        if tesseract_cmd:
            pytesseract.pytesseract.tesseract_cmd = tesseract_cmd

        lang = self.config.get("lang", "fra")
        psm = int(self.config.get("psm", 6))
        oem = int(self.config.get("oem", 3))

        custom_config = f"--oem {oem} --psm {psm}"

        image = Image.open(image_path)
        text: str = pytesseract.image_to_string(image, lang=lang, config=custom_config)
        return text.strip()

    @classmethod
    def from_config(cls, config: Optional[dict] = None) -> "TesseractEngine":
        return cls(config=config or {})
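The adapter turns its config dict into the `--oem N --psm N` flag string that pytesseract passes through to the Tesseract binary. The coercion logic can be exercised on its own (the helper name `tesseract_flags` is ours, for illustration):

```python
def tesseract_flags(config: dict) -> str:
    # Same defaults as the adapter: PSM 6 (uniform block of text),
    # OEM 3 (Tesseract's default engine selection). int() accepts
    # values arriving as strings from YAML or the CLI.
    psm = int(config.get("psm", 6))
    oem = int(config.get("oem", 3))
    return f"--oem {oem} --psm {psm}"


print(tesseract_flags({}))                      # --oem 3 --psm 6
print(tesseract_flags({"psm": "7", "oem": 1}))  # --oem 1 --psm 7
```

Casting through `int()` also fails fast on a malformed value (e.g. `psm: auto`), which `run()` then records as an engine error rather than crashing the benchmark.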
@@ -0,0 +1,44 @@
# pyproject.toml
[build-system]
requires = ["setuptools>=68.0", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "picarones"
version = "0.1.0"
description = "OCR engine comparison platform for heritage documents"
readme = "README.md"
requires-python = ">=3.11"
license = { text = "Apache-2.0" }
authors = [{ name = "Bibliothèque nationale de France — Département numérique" }]
keywords = ["ocr", "htr", "patrimoine", "benchmark", "cer", "wer"]
classifiers = [
    "Development Status :: 3 - Alpha",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "License :: OSI Approved :: Apache Software License",
    "Topic :: Scientific/Engineering :: Artificial Intelligence",
]
dependencies = [
    "click>=8.1.0",
    "jiwer>=3.0.0",
    "Pillow>=10.0.0",
    "pyyaml>=6.0.0",
    "pytesseract>=0.3.10",
    "tqdm>=4.66.0",
    "numpy>=1.24.0",
]

[project.optional-dependencies]
dev = ["pytest>=7.4.0", "pytest-cov>=4.1.0"]
pero = ["pero-ocr>=0.1.0"]

[project.scripts]
picarones = "picarones.cli:cli"

[tool.setuptools.packages.find]
where = ["."]
include = ["picarones*"]

[tool.pytest.ini_options]
testpaths = ["tests"]
addopts = "-v --tb=short"
@@ -0,0 +1,19 @@
# requirements.txt
# Core
click>=8.1.0
jiwer>=3.0.0
Pillow>=10.0.0
pyyaml>=6.0.0

# OCR engines
pytesseract>=0.3.10

# pero-ocr (optional, install separately if needed)
# pero-ocr>=0.1.0

# Utilities
tqdm>=4.66.0
numpy>=1.24.0

# Development / testing
pytest>=7.4.0
pytest-cov>=4.1.0
File without changes
@@ -0,0 +1,96 @@
+"""Unit tests for picarones.core.corpus."""
+
+import pytest
+from pathlib import Path
+
+from picarones.core.corpus import load_corpus_from_directory, Corpus, Document
+
+
+@pytest.fixture
+def sample_corpus_dir(tmp_path: Path) -> Path:
+    """Create a temporary mini-corpus with 3 image/GT pairs."""
+    images = [
+        ("page_001.png", "La première page du document médiéval."),
+        ("page_002.png", "Deuxième folio avec des abréviations."),
+        ("page_003.png", "Fin du manuscrit avec colophon."),
+    ]
+    for filename, gt_text in images:
+        # Dummy image (valid 1×1 PNG)
+        image_path = tmp_path / filename
+        image_path.write_bytes(
+            b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x01"
+            b"\x00\x00\x00\x01\x08\x02\x00\x00\x00\x90wS\xde\x00\x00"
+            b"\x00\x0cIDATx\x9cc\xf8\x0f\x00\x00\x01\x01\x00\x05\x18"
+            b"\xd8N\x00\x00\x00\x00IEND\xaeB`\x82"
+        )
+        gt_path = tmp_path / (Path(filename).stem + ".gt.txt")
+        gt_path.write_text(gt_text, encoding="utf-8")
+    return tmp_path
+
+
+class TestLoadCorpusFromDirectory:
+    def test_loads_correct_count(self, sample_corpus_dir):
+        corpus = load_corpus_from_directory(sample_corpus_dir)
+        assert len(corpus) == 3
+
+    def test_corpus_name_defaults_to_dir_name(self, sample_corpus_dir):
+        corpus = load_corpus_from_directory(sample_corpus_dir)
+        assert corpus.name == sample_corpus_dir.name
+
+    def test_corpus_name_can_be_set(self, sample_corpus_dir):
+        corpus = load_corpus_from_directory(sample_corpus_dir, name="Mon corpus test")
+        assert corpus.name == "Mon corpus test"
+
+    def test_document_ids(self, sample_corpus_dir):
+        corpus = load_corpus_from_directory(sample_corpus_dir)
+        ids = {doc.doc_id for doc in corpus}
+        assert "page_001" in ids
+        assert "page_002" in ids
+        assert "page_003" in ids
+
+    def test_ground_truth_content(self, sample_corpus_dir):
+        corpus = load_corpus_from_directory(sample_corpus_dir)
+        doc = next(d for d in corpus if d.doc_id == "page_001")
+        assert "médiéval" in doc.ground_truth
+
+    def test_source_path_set(self, sample_corpus_dir):
+        corpus = load_corpus_from_directory(sample_corpus_dir)
+        assert corpus.source_path == str(sample_corpus_dir)
+
+    def test_nonexistent_directory_raises(self, tmp_path):
+        with pytest.raises(FileNotFoundError):
+            load_corpus_from_directory(tmp_path / "inexistant")
+
+    def test_directory_without_gt_raises(self, tmp_path):
+        (tmp_path / "image.png").write_bytes(b"fake")
+        with pytest.raises(ValueError):
+            load_corpus_from_directory(tmp_path)
+
+    def test_ignores_images_without_gt(self, sample_corpus_dir, tmp_path):
+        # Copy the corpus and add an image with no GT file
+        import shutil
+        dest = tmp_path / "corpus2"
+        shutil.copytree(sample_corpus_dir, dest)
+        (dest / "orphan.png").write_bytes(b"fake")
+        corpus = load_corpus_from_directory(dest)
+        assert len(corpus) == 3  # The orphan image is ignored
+
+    def test_stats_computed(self, sample_corpus_dir):
+        corpus = load_corpus_from_directory(sample_corpus_dir)
+        stats = corpus.stats
+        assert stats["document_count"] == 3
+        assert stats["gt_length_min"] > 0
+
+
+class TestCorpusIteration:
+    def test_iterable(self, sample_corpus_dir):
+        corpus = load_corpus_from_directory(sample_corpus_dir)
+        docs = list(corpus)
+        assert len(docs) == 3
+        assert all(isinstance(d, Document) for d in docs)
+
+    def test_repr(self, sample_corpus_dir):
+        corpus = load_corpus_from_directory(sample_corpus_dir)
+        r = repr(corpus)
+        assert "Corpus" in r
+        assert "3" in r
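The pairing convention these tests rely on (each image sits next to a `<stem>.gt.txt` ground-truth file, orphan images are skipped) can be sketched in isolation. This is a minimal stdlib illustration of the behaviour the tests assert, not the actual `picarones.core.corpus` implementation; the `find_pairs` name and the extension set are assumptions.

```python
from pathlib import Path

# Image extensions the loader is expected to recognise (assumption:
# the real module may accept more, e.g. jp2).
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".webp"}


def find_pairs(directory: Path) -> dict[str, tuple[Path, str]]:
    """Map doc_id -> (image path, ground-truth text) for every image
    that has a sibling <stem>.gt.txt file; orphans are skipped."""
    pairs: dict[str, tuple[Path, str]] = {}
    for image in sorted(directory.iterdir()):
        if image.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        gt = image.with_name(image.stem + ".gt.txt")
        if gt.exists():  # images without ground truth are ignored
            pairs[image.stem] = (image, gt.read_text(encoding="utf-8"))
    return pairs
```

Keying on the file stem is what lets `test_document_ids` expect `page_001` rather than `page_001.png`.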
@@ -0,0 +1,214 @@
+"""Unit tests for the OCR engine adapters.
+
+These tests check the structure and behaviour of the adapters
+without requiring Tesseract or Pero OCR to actually be installed.
+"""
+
+from __future__ import annotations
+
+import pytest
+from pathlib import Path
+from unittest.mock import MagicMock, patch
+
+from picarones.engines.base import BaseOCREngine, EngineResult
+from picarones.engines.tesseract import TesseractEngine
+from picarones.engines.pero_ocr import PeroOCREngine
+
+
+# ---------------------------------------------------------------------------
+# BaseOCREngine tests
+# ---------------------------------------------------------------------------
+
+class ConcreteEngine(BaseOCREngine):
+    """Minimal implementation used to exercise the base class."""
+
+    @property
+    def name(self) -> str:
+        return "test_engine"
+
+    def version(self) -> str:
+        return "1.0.0"
+
+    def _run_ocr(self, image_path: Path) -> str:
+        return "Texte extrait par le moteur de test."
+
+
+class FailingEngine(BaseOCREngine):
+    """Engine that always raises an exception."""
+
+    @property
+    def name(self) -> str:
+        return "failing_engine"
+
+    def version(self) -> str:
+        return "0.0.0"
+
+    def _run_ocr(self, image_path: Path) -> str:
+        raise RuntimeError("OCR échoué intentionnellement.")
+
+
+class TestBaseOCREngine:
+    def test_run_returns_engine_result(self, tmp_path):
+        (tmp_path / "image.png").write_bytes(b"fake_image")
+        engine = ConcreteEngine()
+        result = engine.run(tmp_path / "image.png")
+        assert isinstance(result, EngineResult)
+
+    def test_run_success(self, tmp_path):
+        (tmp_path / "image.png").write_bytes(b"fake_image")
+        engine = ConcreteEngine()
+        result = engine.run(tmp_path / "image.png")
+        assert result.success is True
+        assert result.error is None
+        assert result.text == "Texte extrait par le moteur de test."
+
+    def test_run_captures_exception(self, tmp_path):
+        (tmp_path / "image.png").write_bytes(b"fake_image")
+        engine = FailingEngine()
+        result = engine.run(tmp_path / "image.png")
+        assert result.success is False
+        assert result.error is not None
+        assert "OCR échoué" in result.error
+
+    def test_run_measures_duration(self, tmp_path):
+        (tmp_path / "image.png").write_bytes(b"fake_image")
+        engine = ConcreteEngine()
+        result = engine.run(tmp_path / "image.png")
+        assert result.duration_seconds >= 0.0
+
+    def test_engine_result_engine_name(self, tmp_path):
+        (tmp_path / "image.png").write_bytes(b"fake_image")
+        engine = ConcreteEngine()
+        result = engine.run(tmp_path / "image.png")
+        assert result.engine_name == "test_engine"
+
+    def test_repr(self):
+        engine = ConcreteEngine()
+        assert "ConcreteEngine" in repr(engine)
+        assert "test_engine" in repr(engine)
+
+    def test_image_path_stored(self, tmp_path):
+        img = tmp_path / "image.png"
+        img.write_bytes(b"fake_image")
+        engine = ConcreteEngine()
+        result = engine.run(img)
+        assert result.image_path == str(img)
+
+
+# ---------------------------------------------------------------------------
+# TesseractEngine tests
+# ---------------------------------------------------------------------------
+
+class TestTesseractEngine:
+    def test_name_default(self):
+        engine = TesseractEngine()
+        assert engine.name == "tesseract"
+
+    def test_name_from_config(self):
+        engine = TesseractEngine(config={"name": "tesseract_fra"})
+        assert engine.name == "tesseract_fra"
+
+    def test_from_config_factory(self):
+        engine = TesseractEngine.from_config({"lang": "lat", "psm": 7})
+        assert engine.config["lang"] == "lat"
+        assert engine.config["psm"] == 7
+
+    def test_run_with_pytesseract_mocked(self, tmp_path):
+        """Check that the engine calls pytesseract correctly."""
+        img = tmp_path / "page.png"
+        img.write_bytes(b"fake")
+
+        with (
+            patch("picarones.engines.tesseract._PYTESSERACT_AVAILABLE", True),
+            patch("picarones.engines.tesseract.pytesseract") as mock_tess,
+            patch("picarones.engines.tesseract.Image") as mock_pil,
+        ):
+            mock_tess.image_to_string.return_value = "Résultat OCR mock"
+            mock_pil.open.return_value = MagicMock()
+
+            engine = TesseractEngine(config={"lang": "fra", "psm": 6})
+            result = engine.run(img)
+
+            assert result.success is True
+            assert result.text == "Résultat OCR mock"
+            mock_tess.image_to_string.assert_called_once()
+
+    def test_run_without_pytesseract_raises(self, tmp_path):
+        """Without pytesseract, the engine must return an EngineResult carrying an error."""
+        img = tmp_path / "page.png"
+        img.write_bytes(b"fake")
+
+        with patch("picarones.engines.tesseract._PYTESSERACT_AVAILABLE", False):
+            engine = TesseractEngine()
+            result = engine.run(img)
+
+        assert result.success is False
+        assert "pytesseract" in result.error.lower()
+
+
+# ---------------------------------------------------------------------------
+# PeroOCREngine tests
+# ---------------------------------------------------------------------------
+
+class TestPeroOCREngine:
+    def test_name_default(self):
+        engine = PeroOCREngine()
+        assert engine.name == "pero_ocr"
+
+    def test_name_from_config(self):
+        engine = PeroOCREngine(config={"name": "pero_historique"})
+        assert engine.name == "pero_historique"
+
+    def test_from_config_factory(self):
+        engine = PeroOCREngine.from_config({"config": "/path/to/pero.ini"})
+        assert engine.config["config"] == "/path/to/pero.ini"
+
+    def test_run_without_pero_raises(self, tmp_path):
+        """Without pero-ocr, the engine must return an EngineResult carrying an error."""
+        img = tmp_path / "page.png"
+        img.write_bytes(b"fake")
+
+        with patch("picarones.engines.pero_ocr._PERO_AVAILABLE", False):
+            engine = PeroOCREngine(config={"config": "/fake/config.ini"})
+            result = engine.run(img)
+
+        assert result.success is False
+
+    def test_run_without_config_raises(self, tmp_path):
+        """Without a 'config' parameter, the engine must report a clear error."""
+        img = tmp_path / "page.png"
+        img.write_bytes(b"fake")
+
+        with patch("picarones.engines.pero_ocr._PERO_AVAILABLE", True):
+            engine = PeroOCREngine()
+            result = engine.run(img)
+
+        assert result.success is False
+        assert "config" in result.error.lower()
+
+
+# ---------------------------------------------------------------------------
+# EngineResult tests
+# ---------------------------------------------------------------------------
+
+class TestEngineResult:
+    def test_success_true_when_no_error(self):
+        r = EngineResult(
+            engine_name="test", image_path="/img.png",
+            text="texte", duration_seconds=0.1
+        )
+        assert r.success is True
+
+    def test_success_false_when_error(self):
+        r = EngineResult(
+            engine_name="test", image_path="/img.png",
+            text="", duration_seconds=0.1, error="Erreur"
+        )
+        assert r.success is False
+
+    def test_metadata_default_empty(self):
+        r = EngineResult(
+            engine_name="test", image_path="/img.png",
+            text="", duration_seconds=0.0
+        )
+        assert r.metadata == {}
+
@@ -0,0 +1,117 @@
+"""Unit tests for the picarones.core.metrics module."""
+
+import pytest
+
+from picarones.core.metrics import aggregate_metrics, compute_metrics, MetricsResult
+
+
+class TestComputeMetrics:
+    """Tests of compute_metrics on known cases."""
+
+    def test_perfect_match(self):
+        """CER and WER must be 0 when reference == hypothesis."""
+        result = compute_metrics("Bonjour le monde", "Bonjour le monde")
+        assert result.cer == pytest.approx(0.0)
+        assert result.wer == pytest.approx(0.0)
+        assert result.error is None
+
+    def test_complete_mismatch(self):
+        """CER close to 1 when the texts are completely different."""
+        result = compute_metrics("abc", "xyz")
+        assert result.cer > 0.0
+        assert result.error is None
+
+    def test_empty_reference(self):
+        """Empty reference: CER = 1.0 when the hypothesis is non-empty."""
+        result = compute_metrics("", "quelque chose")
+        assert result.cer == pytest.approx(1.0)
+
+    def test_empty_both(self):
+        """Empty reference and hypothesis: CER = 0.0."""
+        result = compute_metrics("", "")
+        assert result.cer == pytest.approx(0.0)
+
+    def test_single_substitution(self):
+        """A single substitution over 4 chars → CER = 0.25."""
+        result = compute_metrics("abcd", "abce")
+        assert result.cer == pytest.approx(0.25)
+
+    def test_case_insensitive_cer(self):
+        """Caseless CER ignores case differences."""
+        result = compute_metrics("Bonjour", "bonjour")
+        assert result.cer_caseless == pytest.approx(0.0)
+        # Raw CER must be > 0 (B ≠ b)
+        assert result.cer > 0.0
+
+    def test_nfc_normalization(self):
+        """NFC CER normalises equivalent Unicode sequences."""
+        # é can be encoded composed (U+00E9) or decomposed (e + U+0301)
+        composed = "\u00e9"  # é (NFC)
+        decomposed = "e\u0301"  # e + combining accent (NFD)
+        result = compute_metrics(composed, decomposed)
+        # After NFC the two are identical → cer_nfc = 0
+        assert result.cer_nfc == pytest.approx(0.0)
+
+    def test_wer_one_word_wrong(self):
+        """WER = 1/3 for 1 wrong word out of 3."""
+        result = compute_metrics("le chat dort", "le chien dort")
+        assert result.wer == pytest.approx(1 / 3, rel=1e-2)
+
+    def test_result_has_lengths(self):
+        ref = "Texte de référence"
+        result = compute_metrics(ref, "Texte différent")
+        assert result.reference_length == len(ref)
+        assert result.hypothesis_length > 0
+
+    def test_metrics_result_as_dict(self):
+        """as_dict() must return all the expected keys."""
+        result = compute_metrics("abc", "abc")
+        d = result.as_dict()
+        for key in ["cer", "cer_nfc", "cer_caseless", "wer", "wer_normalized", "mer", "wil"]:
+            assert key in d
+
+    def test_cer_percent_property(self):
+        result = compute_metrics("abcd", "abce")
+        assert result.cer_percent == pytest.approx(25.0, rel=1e-2)
+
+
+class TestAggregateMetrics:
+    """Tests of aggregate_metrics."""
+
+    def _make_result(self, cer: float) -> MetricsResult:
+        return MetricsResult(
+            cer=cer, cer_nfc=cer, cer_caseless=cer,
+            wer=cer, wer_normalized=cer, mer=cer, wil=cer,
+            reference_length=100,
+            hypothesis_length=100,
+        )
+
+    def test_empty_list(self):
+        assert aggregate_metrics([]) == {}
+
+    def test_single_result(self):
+        results = [self._make_result(0.1)]
+        agg = aggregate_metrics(results)
+        assert agg["cer"]["mean"] == pytest.approx(0.1)
+        assert agg["cer"]["min"] == pytest.approx(0.1)
+        assert agg["cer"]["max"] == pytest.approx(0.1)
+
+    def test_multiple_results(self):
+        results = [self._make_result(0.1), self._make_result(0.3)]
+        agg = aggregate_metrics(results)
+        assert agg["cer"]["mean"] == pytest.approx(0.2)
+        assert agg["document_count"] == 2
+        assert agg["failed_count"] == 0
+
+    def test_failed_results_excluded(self):
+        ok = self._make_result(0.1)
+        failed = MetricsResult(
+            cer=1.0, cer_nfc=1.0, cer_caseless=1.0,
+            wer=1.0, wer_normalized=1.0, mer=1.0, wil=1.0,
+            reference_length=50, hypothesis_length=0,
+            error="Moteur en erreur",
+        )
+        agg = aggregate_metrics([ok, failed])
+        # Aggregated metrics only include error-free results
+        assert agg["cer"]["mean"] == pytest.approx(0.1)
+        assert agg["failed_count"] == 1
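The CER values these tests expect (0.25 for one substitution over four characters, 1.0 against an empty reference, 0.0 for identical strings) follow from the definition CER = edit distance / reference length. The project computes this through jiwer; the sketch below is a pure-stdlib illustration of the same quantity, with the empty-reference conventions taken from the tests above. `levenshtein` and `cer` are illustrative names, not the project's API.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                 # deletion
                cur[j - 1] + 1,              # insertion
                prev[j - 1] + (ca != cb),    # substitution (free on match)
            ))
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits / reference length.

    Conventions matching the tests: empty vs empty -> 0.0,
    empty reference vs non-empty hypothesis -> 1.0.
    """
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)
```

WER is the same formula computed over word tokens instead of characters, which is why `"le chat dort"` vs `"le chien dort"` yields 1/3.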
@@ -0,0 +1,124 @@
+"""Unit tests for picarones.core.results."""
+
+import json
+import pytest
+from pathlib import Path
+
+from picarones.core.metrics import MetricsResult
+from picarones.core.results import BenchmarkResult, DocumentResult, EngineReport
+
+
+def _make_metrics(cer: float = 0.05) -> MetricsResult:
+    return MetricsResult(
+        cer=cer, cer_nfc=cer, cer_caseless=cer,
+        wer=cer * 2, wer_normalized=cer * 2, mer=cer, wil=cer,
+        reference_length=200, hypothesis_length=195,
+    )
+
+
+def _make_document_result(doc_id: str = "doc1", cer: float = 0.05) -> DocumentResult:
+    return DocumentResult(
+        doc_id=doc_id,
+        image_path=f"/corpus/{doc_id}.png",
+        ground_truth="Texte de référence médiéval.",
+        hypothesis="Texte de référence médiéval.",
+        metrics=_make_metrics(cer),
+        duration_seconds=1.23,
+    )
+
+
+def _make_engine_report(name: str = "tesseract", n_docs: int = 3) -> EngineReport:
+    docs = [_make_document_result(f"doc{i}", cer=0.03 * i) for i in range(1, n_docs + 1)]
+    return EngineReport(
+        engine_name=name,
+        engine_version="5.3.0",
+        engine_config={"lang": "fra"},
+        document_results=docs,
+    )
+
+
+class TestDocumentResult:
+    def test_as_dict_keys(self):
+        dr = _make_document_result()
+        d = dr.as_dict()
+        for key in ["doc_id", "image_path", "ground_truth", "hypothesis", "metrics", "duration_seconds"]:
+            assert key in d
+
+    def test_metrics_serialized(self):
+        dr = _make_document_result(cer=0.1)
+        d = dr.as_dict()
+        assert d["metrics"]["cer"] == pytest.approx(0.1)
+
+
+class TestEngineReport:
+    def test_aggregation_computed(self):
+        report = _make_engine_report(n_docs=3)
+        assert report.aggregated_metrics != {}
+        assert "cer" in report.aggregated_metrics
+
+    def test_mean_cer(self):
+        report = _make_engine_report(n_docs=3)
+        # docs with cer=0.03, 0.06, 0.09 → mean=0.06
+        assert report.mean_cer == pytest.approx(0.06, rel=1e-2)
+
+    def test_as_dict_structure(self):
+        report = _make_engine_report()
+        d = report.as_dict()
+        assert d["engine_name"] == "tesseract"
+        assert len(d["document_results"]) == 3
+
+
+class TestBenchmarkResult:
+    def _make_benchmark(self) -> BenchmarkResult:
+        return BenchmarkResult(
+            corpus_name="Test corpus",
+            corpus_source="/corpus/",
+            document_count=3,
+            engine_reports=[
+                _make_engine_report("tesseract"),
+                _make_engine_report("pero_ocr"),
+            ],
+        )
+
+    def test_ranking_sorted_by_cer(self):
+        bm = self._make_benchmark()
+        ranking = bm.ranking()
+        assert len(ranking) == 2
+        cers = [e["mean_cer"] for e in ranking]
+        assert cers == sorted(cers)
+
+    def test_to_json_writes_file(self, tmp_path):
+        bm = self._make_benchmark()
+        out = tmp_path / "results.json"
+        bm.to_json(out)
+        assert out.exists()
+        with out.open() as f:
+            data = json.load(f)
+        assert data["corpus"]["name"] == "Test corpus"
+
+    def test_to_json_creates_parent_dirs(self, tmp_path):
+        bm = self._make_benchmark()
+        out = tmp_path / "deep" / "nested" / "results.json"
+        bm.to_json(out)
+        assert out.exists()
+
+    def test_from_json_round_trip(self, tmp_path):
+        bm = self._make_benchmark()
+        out = tmp_path / "results.json"
+        bm.to_json(out)
+        loaded = BenchmarkResult.from_json(out)
+        assert loaded["corpus"]["name"] == "Test corpus"
+        assert len(loaded["engine_reports"]) == 2
+
+    def test_as_dict_has_version(self):
+        bm = self._make_benchmark()
+        d = bm.as_dict()
+        assert "picarones_version" in d
+        assert "run_date" in d
+
+    def test_ranking_has_required_fields(self):
+        bm = self._make_benchmark()
+        for entry in bm.ranking():
+            assert "engine" in entry
+            assert "mean_cer" in entry
+            assert "mean_wer" in entry