Spaces: Ma-Ri-Ba-Ku/Picarones

Claude committed on Mar 4

Commit 49cc409 (unverified) · 1 parent: f0641dd

feat(sprint-1): complete implementation of Picarones Sprint 1

Project structure:
- Python package `picarones/` with `core/` and `engines/` subpackages
- `pyproject.toml` (setuptools, CLI entry point), `requirements.txt`, `.gitignore`

OCR engine adapters (Module 2):
- `engines/base.py`: abstract interface `BaseOCREngine` + `EngineResult`
  - Automatic error handling, execution-time measurement
- `engines/tesseract.py`: Tesseract 5 adapter via pytesseract
  - Configurable lang, PSM, OEM, binary path
- `engines/pero_ocr.py`: Pero OCR adapter (pero-ocr is optional)
  - Extracts plain text from the structured PAGE XML output
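
The adapter pattern described above (an abstract engine that times each call and captures errors) can be sketched as follows. This is only an illustration: `engines/base.py` is not shown in this excerpt, so the method names `run` and `_run_ocr`, the `EngineResult` fields, and the toy `EchoEngine` are all assumptions rather than the project's actual API.

```python
from __future__ import annotations

import time
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class EngineResult:
    """Outcome of one OCR call (field names are illustrative assumptions)."""
    text: str
    duration_seconds: float
    error: Optional[str] = None


class BaseOCREngine(ABC):
    """Hypothetical base class: subclasses only implement _run_ocr()."""

    name = "base"

    @abstractmethod
    def _run_ocr(self, image_path: str) -> str:
        """Return the plain text recognized in the image."""

    def run(self, image_path: str) -> EngineResult:
        # Time the call and turn any exception into an error field,
        # so one failing document does not abort a whole benchmark.
        start = time.perf_counter()
        try:
            text, error = self._run_ocr(image_path), None
        except Exception as exc:
            text, error = "", str(exc)
        return EngineResult(text, time.perf_counter() - start, error)


class EchoEngine(BaseOCREngine):
    """Toy engine used only to exercise the wrapper."""

    name = "echo"

    def _run_ocr(self, image_path: str) -> str:
        if not image_path:
            raise ValueError("empty path")
        return f"text from {image_path}"
```

With this shape, a concrete adapter only has to supply the recognition call; timing and error capture stay in one place.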

CER/WER metrics (Module 4):
- `core/metrics.py`: computed via jiwer
  - Raw CER, NFC-normalized CER, caseless CER
  - Raw WER, normalized WER, MER, WIL
  - Statistical aggregation (mean, median, min, max, stdev)
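
To make the three CER variants concrete, here is a minimal pure-Python sketch that does not depend on jiwer. It assumes the usual definition of CER (character-level edit distance divided by reference length); the function names are illustrative and are not taken from `core/metrics.py`.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def cer_nfc(ref: str, hyp: str) -> float:
    """CER after Unicode NFC normalization of both strings."""
    return cer(unicodedata.normalize("NFC", ref),
               unicodedata.normalize("NFC", hyp))


def cer_caseless(ref: str, hyp: str) -> float:
    """CER after NFC normalization and case folding."""
    def fold(s: str) -> str:
        return unicodedata.normalize("NFC", s).casefold()
    return cer(fold(ref), fold(hyp))
```

The NFC variant matters for historical text: a precomposed "é" and an "e" followed by a combining accent look identical but count as errors in the raw CER, while `cer_nfc` treats them as equal.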

Corpus management (Module 1):
- `core/corpus.py`: loads a local folder of image / `.gt.txt` pairs
  - Automatic image detection (jpg, png, tif, webp…)
  - Corpus statistics, error handling

Results and JSON export (Module 6):
- `core/results.py`: `DocumentResult`, `EngineReport`, `BenchmarkResult` models
  - Full JSON serialization with ranking by CER
  - `to_json()` / `from_json()` for persistence

Orchestrator (Module 6):
- `core/runner.py`: `run_benchmark()` with a tqdm progress bar

Click CLI (Module 6):
- `picarones run`: full benchmark, with a `--fail-if-cer-above` option (CI/CD)
- `picarones metrics`: CER/WER between two text files
- `picarones engines`: lists the available engines
- `picarones info`: dependency versions

Unit tests: 58 tests, all passing
- `test_metrics.py`, `test_corpus.py`, `test_engines.py`, `test_results.py`

https://claude.ai/code/session_017gXea9mxBQqDTAsSQd7aAq

Files changed (20)

.gitignore +13 -0
README.md +119 -0
picarones/__init__.py +9 -0
picarones/cli.py +295 -0
picarones/core/__init__.py +1 -0
picarones/core/corpus.py +152 -0
picarones/core/metrics.py +214 -0
picarones/core/results.py +155 -0
picarones/core/runner.py +115 -0
picarones/engines/__init__.py +13 -0
picarones/engines/base.py +85 -0
picarones/engines/pero_ocr.py +112 -0
picarones/engines/tesseract.py +84 -0
pyproject.toml +44 -0
requirements.txt +19 -0
tests/__init__.py +0 -0
tests/test_corpus.py +96 -0
tests/test_engines.py +214 -0
tests/test_metrics.py +117 -0
tests/test_results.py +124 -0

.gitignore ADDED

	@@ -0,0 +1,13 @@

+__pycache__/
+*.py[cod]
+*.egg-info/
+*.egg
+dist/
+build/
+.eggs/
+.pytest_cache/
+.coverage
+htmlcov/
+.venv/
+venv/
+*.log

README.md ADDED

	@@ -0,0 +1,119 @@

+# Picarones
+> **Platform for comparing OCR/HTR engines on heritage documents**
+> BnF — Département numérique · Apache 2.0
+Picarones evaluates and rigorously compares OCR engines (Tesseract, Pero OCR, Kraken, cloud APIs…) as well as OCR+LLM pipelines on corpora of historical documents: manuscripts, early printed books, and archives.
+---
+## Sprint 1: what is implemented
+- Complete Python project structure (`picarones/`)
+- **Tesseract 5** adapter (`pytesseract`)
+- **Pero OCR** adapter (requires `pero-ocr`)
+- Abstract `BaseOCREngine` interface for easily adding new engines
+- **CER** and **WER** computed via `jiwer` (raw, NFC, caseless, normalized, MER, WIL)
+- **Corpus** loading from a local folder (image / `.gt.txt` pairs)
+- Structured **JSON export** of results, with ranking
+- `click` **CLI**: `run`, `metrics`, `engines`, `info` commands
+---
+## Installation
+```bash
+pip install -e .
+# For Tesseract, also install the system binary:
+# Ubuntu/Debian : sudo apt install tesseract-ocr tesseract-ocr-fra
+# macOS         : brew install tesseract
+# For Pero OCR (optional):
+pip install pero-ocr
+```
+## Quick start
+```bash
+# Run a benchmark on a local corpus
+picarones run --corpus ./mon_corpus/ --engines tesseract --output resultats.json
+# Several engines
+picarones run --corpus ./corpus/ --engines tesseract,pero_ocr --lang fra
+# Compute CER/WER between two files
+picarones metrics --reference gt.txt --hypothesis ocr.txt
+# List the available engines
+picarones engines
+# Version information
+picarones info
+```
+## Project structure
+```
+picarones/
+├── __init__.py
+├── cli.py                  # Click CLI
+├── core/
+│   ├── corpus.py           # Corpus loading
+│   ├── metrics.py          # CER/WER (jiwer)
+│   ├── results.py          # Data models + JSON export
+│   └── runner.py           # Benchmark orchestrator
+└── engines/
+    ├── base.py             # Abstract BaseOCREngine interface
+    ├── tesseract.py        # Tesseract adapter
+    └── pero_ocr.py         # Pero OCR adapter
+tests/
+├── test_metrics.py
+├── test_corpus.py
+├── test_engines.py
+└── test_results.py
+```
+## Corpus format
+A local corpus is a folder containing pairs:
+```
+corpus/
+├── page_001.jpg
+├── page_001.gt.txt    ← UTF-8 ground truth
+├── page_002.png
+├── page_002.gt.txt
+└── ...
+```
+## JSON output format
+```json
+{
+  "picarones_version": "0.1.0",
+  "run_date": "2025-03-04T...",
+  "corpus": { "name": "...", "document_count": 50 },
+  "ranking": [
+    { "engine": "tesseract", "mean_cer": 0.043, "mean_wer": 0.112 }
+  ],
+  "engine_reports": [...]
+}
+```
+## Running the tests
+```bash
+pytest
+```
+## Roadmap
+| Sprint | Deliverables |
+|--------|--------------|
+| **Sprint 1** ✅ | Structure, Tesseract + Pero OCR adapters, CER/WER, JSON, CLI |
+| Sprint 2 | Interactive HTML report with colored diff |
+| Sprint 3 | OCR+LLM pipelines (GPT-4o, Claude) |
+| Sprint 4 | Cloud OCR APIs, IIIF import, diplomatic normalization |
+| Sprint 5 | Advanced metrics: Unicode confusion matrix, ligatures |
+| Sprint 6 | FastAPI web interface, HTR-United / HuggingFace import |

picarones/__init__.py ADDED

	@@ -0,0 +1,9 @@

+"""
+Picarones: platform for comparing OCR engines on heritage documents.
+BnF — Département numérique, 2025.
+Apache 2.0 license.
+"""
+__version__ = "0.1.0"
+__author__ = "BnF — Département numérique"

picarones/cli.py ADDED

	@@ -0,0 +1,295 @@

+"""Picarones command-line interface (Click).
+Available commands
+------------------
+picarones run      : run a full benchmark
+picarones metrics  : compute CER/WER between two text files
+picarones engines  : list the available engines
+picarones info     : version information
+Usage examples
+--------------
+    picarones run --corpus ./corpus/ --engines tesseract --output results.json
+    picarones metrics --reference gt.txt --hypothesis ocr.txt
+    picarones engines
+"""
+from __future__ import annotations
+import json
+import logging
+import sys
+from pathlib import Path
+import click
+from picarones import __version__
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+def _setup_logging(verbose: bool) -> None:
+    level = logging.DEBUG if verbose else logging.INFO
+    logging.basicConfig(
+        format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
+        datefmt="%H:%M:%S",
+        level=level,
+    )
+def _engine_from_name(engine_name: str, lang: str, psm: int) -> "BaseOCREngine":
+    """Instantiate an engine from its name."""
+    from picarones.engines.tesseract import TesseractEngine
+    if engine_name in {"tesseract", "tess"}:
+        return TesseractEngine(config={"lang": lang, "psm": psm})
+    try:
+        from picarones.engines.pero_ocr import PeroOCREngine
+        if engine_name in {"pero_ocr", "pero"}:
+            return PeroOCREngine(config={"name": "pero_ocr"})
+    except ImportError:
+        pass
+    raise click.BadParameter(
+        f"Moteur inconnu ou non disponible : '{engine_name}'. "
+        "Moteurs supportés : tesseract, pero_ocr"
+    )
+# ---------------------------------------------------------------------------
+# Main group
+# ---------------------------------------------------------------------------
+@click.group(context_settings={"help_option_names": ["-h", "--help"]})
+@click.version_option(__version__, "-V", "--version", prog_name="picarones")
+def cli() -> None:
+    """Picarones: platform for comparing OCR engines on heritage documents.
+    Bibliothèque nationale de France, Département numérique.
+    """
+# ---------------------------------------------------------------------------
+# picarones run
+# ---------------------------------------------------------------------------
+@cli.command("run")
+@click.option(
+    "--corpus", "-c",
+    required=True,
+    type=click.Path(exists=True, file_okay=False, resolve_path=True),
+    help="Folder containing the image / .gt.txt pairs",
+)
+@click.option(
+    "--engines", "-e",
+    default="tesseract",
+    show_default=True,
+    help="Comma-separated list of engines (e.g. tesseract,pero_ocr)",
+)
+@click.option(
+    "--output", "-o",
+    default="results.json",
+    show_default=True,
+    type=click.Path(resolve_path=True),
+    help="Output JSON file",
+)
+@click.option(
+    "--lang", "-l",
+    default="fra",
+    show_default=True,
+    help="Tesseract language code (fra, lat, eng…)",
+)
+@click.option("--psm", default=6, show_default=True, help="Tesseract Page Segmentation Mode (0-13)")
+@click.option("--no-progress", is_flag=True, default=False, help="Disable the progress bar")
+@click.option("--verbose", "-v", is_flag=True, default=False, help="Verbose mode")
+@click.option(
+    "--fail-if-cer-above",
+    default=None,
+    type=float,
+    metavar="THRESHOLD",
+    help="Exit with code 1 if the mean CER exceeds THRESHOLD, given in percent (for CI/CD)",
+)
+def run_cmd(
+    corpus: str,
+    engines: str,
+    output: str,
+    lang: str,
+    psm: int,
+    no_progress: bool,
+    verbose: bool,
+    fail_if_cer_above: float | None,
+) -> None:
+    """Run an OCR benchmark on a corpus of documents.
+    The corpus must be a folder containing pairs of
+    <image>.<ext> + <image>.gt.txt (ground truth).
+    """
+    _setup_logging(verbose)
+    from picarones.core.corpus import load_corpus_from_directory
+    from picarones.core.runner import run_benchmark
+    # Load the corpus
+    try:
+        corp = load_corpus_from_directory(corpus)
+    except (FileNotFoundError, ValueError) as exc:
+        click.echo(f"Erreur corpus : {exc}", err=True)
+        sys.exit(1)
+    click.echo(f"Corpus '{corp.name}' — {len(corp)} documents chargés.")
+    # Instantiate the engines
+    engine_names = [e.strip() for e in engines.split(",") if e.strip()]
+    ocr_engines = []
+    for name in engine_names:
+        try:
+            engine = _engine_from_name(name, lang=lang, psm=psm)
+            ocr_engines.append(engine)
+        except click.BadParameter as exc:
+            click.echo(f"Erreur moteur : {exc}", err=True)
+            sys.exit(1)
+    if not ocr_engines:
+        click.echo("Aucun moteur valide spécifié.", err=True)
+        sys.exit(1)
+    click.echo(f"Moteurs : {', '.join(e.name for e in ocr_engines)}")
+    # Run the benchmark
+    result = run_benchmark(
+        corpus=corp,
+        engines=ocr_engines,
+        output_json=output,
+        show_progress=not no_progress,
+    )
+    # Print the ranking
+    click.echo("\n── Classement ──────────────────────────────────")
+    for rank, entry in enumerate(result.ranking(), 1):
+        cer_pct = f"{entry['mean_cer'] * 100:.2f}%" if entry["mean_cer"] is not None else "N/A"
+        wer_pct = f"{entry['mean_wer'] * 100:.2f}%" if entry["mean_wer"] is not None else "N/A"
+        failed = entry["failed"]
+        failed_str = f" ({failed} erreur(s))" if failed else ""
+        click.echo(f"  {rank}. {entry['engine']:<20} CER={cer_pct:<8} WER={wer_pct}{failed_str}")
+    click.echo(f"\nRésultats écrits dans : {output}")
+    # CI/CD mode: non-zero exit code if the mean CER (converted to percent) exceeds the threshold
+    if fail_if_cer_above is not None:
+        for entry in result.ranking():
+            if entry["mean_cer"] is not None and entry["mean_cer"] * 100 > fail_if_cer_above:
+                click.echo(
+                    f"\nECHEC : {entry['engine']} CER={entry['mean_cer']*100:.2f}% "
+                    f"> seuil {fail_if_cer_above:.2f}%",
+                    err=True,
+                )
+                sys.exit(1)
+# ---------------------------------------------------------------------------
+# picarones metrics
+# ---------------------------------------------------------------------------
+@cli.command("metrics")
+@click.option(
+    "--reference", "-r",
+    required=True,
+    type=click.Path(exists=True, dir_okay=False),
+    help="Ground-truth file (plain UTF-8 text)",
+)
+@click.option(
+    "--hypothesis", "-H",
+    required=True,
+    type=click.Path(exists=True, dir_okay=False),
+    help="OCR transcription file (plain UTF-8 text)",
+)
+@click.option("--json-output", is_flag=True, default=False, help="Output as JSON")
+def metrics_cmd(reference: str, hypothesis: str, json_output: bool) -> None:
+    """Compute CER and WER between two text files."""
+    from picarones.core.metrics import compute_metrics
+    ref_text = Path(reference).read_text(encoding="utf-8").strip()
+    hyp_text = Path(hypothesis).read_text(encoding="utf-8").strip()
+    result = compute_metrics(ref_text, hyp_text)
+    if json_output:
+        click.echo(json.dumps(result.as_dict(), ensure_ascii=False, indent=2))
+    else:
+        click.echo(f"CER            : {result.cer_percent:.2f}%")
+        click.echo(f"CER (NFC)      : {result.cer_nfc * 100:.2f}%")
+        click.echo(f"CER (caseless) : {result.cer_caseless * 100:.2f}%")
+        click.echo(f"WER            : {result.wer_percent:.2f}%")
+        click.echo(f"WER (normalisé): {result.wer_normalized * 100:.2f}%")
+        click.echo(f"MER            : {result.mer * 100:.2f}%")
+        click.echo(f"WIL            : {result.wil * 100:.2f}%")
+        click.echo(f"Longueur GT    : {result.reference_length} chars")
+        click.echo(f"Longueur OCR   : {result.hypothesis_length} chars")
+        if result.error:
+            click.echo(f"Erreur         : {result.error}", err=True)
+# ---------------------------------------------------------------------------
+# picarones engines
+# ---------------------------------------------------------------------------
+@cli.command("engines")
+def engines_cmd() -> None:
+    """List the available OCR engines and check that they are installed."""
+    engines = [
+        ("tesseract", "Tesseract 5 (pytesseract)", "pytesseract"),
+        ("pero_ocr", "Pero OCR", "pero_ocr"),
+    ]
+    click.echo("Moteurs OCR disponibles :\n")
+    for engine_id, label, module in engines:
+        try:
+            __import__(module)
+            status = click.style("✓ disponible", fg="green")
+        except ImportError:
+            status = click.style("✗ non installé", fg="red")
+        click.echo(f"  {engine_id:<15} {label:<35} {status}")
+    click.echo(
+        "\nPour installer un moteur manquant :\n"
+        "  pip install pytesseract\n"
+        "  pip install pero-ocr"
+    )
+# ---------------------------------------------------------------------------
+# picarones info
+# ---------------------------------------------------------------------------
+@cli.command("info")
+def info_cmd() -> None:
+    """Show version information for Picarones and its dependencies."""
+    click.echo(f"Picarones v{__version__}")
+    click.echo("BnF — Département numérique\n")
+    deps = [
+        ("click", "click"),
+        ("jiwer", "jiwer"),
+        ("Pillow", "PIL"),
+        ("pytesseract", "pytesseract"),
+        ("tqdm", "tqdm"),
+        ("numpy", "numpy"),
+        ("pyyaml", "yaml"),
+    ]
+    click.echo("Dépendances :")
+    for name, module in deps:
+        try:
+            mod = __import__(module)
+            # Only prefix with "v" when the module actually exposes a version string
+            version = getattr(mod, "__version__", None)
+            status = click.style(f"v{version}" if version else "installé", fg="green")
+        except ImportError:
+            status = click.style("non installé", fg="red")
+        click.echo(f"  {name:<15} {status}")
+if __name__ == "__main__":
+    cli()
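
The `--fail-if-cer-above` gate in `run_cmd` compares a CER stored as a fraction (for example `0.043`) with a threshold given in percent. The following is a small standalone sketch of that comparison; the helper names `exceeds_threshold` and `exit_code` are hypothetical and do not exist in `cli.py`.

```python
from typing import Optional


def exceeds_threshold(mean_cer: Optional[float], threshold_percent: float) -> bool:
    """True when a fractional CER (0.043 means 4.3 %) is above a percent threshold."""
    if mean_cer is None:  # the engine produced no usable score
        return False
    return mean_cer * 100 > threshold_percent


def exit_code(ranking: list, threshold_percent: float) -> int:
    """Return 1 if any ranked engine is above the threshold, else 0."""
    return int(any(exceeds_threshold(entry["mean_cer"], threshold_percent)
                   for entry in ranking))


ranking = [
    {"engine": "tesseract", "mean_cer": 0.043},
    {"engine": "pero_ocr", "mean_cer": None},
]
```

Under these assumptions, `exit_code(ranking, 5.0)` passes while `exit_code(ranking, 4.0)` fails. Passing the threshold as a fraction (`0.05`) would also fail, because it is read as 0.05 %.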

picarones/core/__init__.py ADDED

	@@ -0,0 +1 @@


+"""Core modules: corpus, metrics, results, orchestration."""

picarones/core/corpus.py ADDED

	@@ -0,0 +1,152 @@

+"""Loading and management of document corpora.
+Supported format (Sprint 1): local folder of image / .gt.txt pairs
+Convention:
+  mon_document.jpg   ←→   mon_document.gt.txt
+  page_001.png       ←→   page_001.gt.txt
+Accepted image extensions: .jpg, .jpeg, .png, .tif, .tiff, .bmp, .webp
+"""
+from __future__ import annotations
+import logging
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Iterator, Optional
+logger = logging.getLogger(__name__)
+# Recognized image extensions
+IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".webp"}
+@dataclass
+class Document:
+    """An (image, ground-truth text) pair."""
+    image_path: Path
+    ground_truth: str
+    doc_id: str = ""
+    metadata: dict = field(default_factory=dict)
+    def __post_init__(self) -> None:
+        if not self.doc_id:
+            self.doc_id = self.image_path.stem
+@dataclass
+class Corpus:
+    """Collection of documents with their metadata."""
+    name: str
+    documents: list[Document]
+    source_path: Optional[str] = None
+    metadata: dict = field(default_factory=dict)
+    def __len__(self) -> int:
+        return len(self.documents)
+    def __iter__(self) -> Iterator[Document]:
+        return iter(self.documents)
+    def __repr__(self) -> str:
+        return f"Corpus(name={self.name!r}, documents={len(self.documents)})"
+    @property
+    def stats(self) -> dict:
+        gt_lengths = [len(doc.ground_truth) for doc in self.documents]
+        if not gt_lengths:
+            return {"document_count": 0}
+        import statistics
+        return {
+            "document_count": len(self.documents),
+            "gt_length_mean": round(statistics.mean(gt_lengths), 1),
+            "gt_length_median": round(statistics.median(gt_lengths), 1),
+            "gt_length_min": min(gt_lengths),
+            "gt_length_max": max(gt_lengths),
+        }
+def load_corpus_from_directory(
+    directory: str | Path,
+    name: Optional[str] = None,
+    gt_suffix: str = ".gt.txt",
+    encoding: str = "utf-8",
+) -> Corpus:
+    """Load a corpus from a local folder of image / GT pairs.
+    Parameters
+    ----------
+    directory:
+        Path to the folder containing the image + GT file pairs.
+    name:
+        Corpus name (default: the folder name).
+    gt_suffix:
+        Suffix of the ground-truth files (default: ``.gt.txt``).
+    encoding:
+        Encoding of the text files (default: utf-8).
+    Returns
+    -------
+    Corpus
+        Corpus object ready to be used in the pipeline.
+    Raises
+    ------
+    FileNotFoundError
+        If the folder does not exist.
+    ValueError
+        If no valid document is found.
+    """
+    directory = Path(directory)
+    if not directory.is_dir():
+        raise FileNotFoundError(f"Dossier introuvable : {directory}")
+    corpus_name = name or directory.name
+    documents: list[Document] = []
+    skipped = 0
+    # Collect all images
+    image_paths = sorted(
+        p for p in directory.iterdir() if p.suffix.lower() in IMAGE_EXTENSIONS
+    )
+    for image_path in image_paths:
+        gt_path = image_path.with_name(image_path.stem + gt_suffix)
+        if not gt_path.exists():
+            logger.debug("Pas de fichier GT pour %s — ignoré.", image_path.name)
+            skipped += 1
+            continue
+        try:
+            ground_truth = gt_path.read_text(encoding=encoding).strip()
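+        # UnicodeDecodeError is a ValueError, not an OSError, so catch it explicitly
+        except (OSError, UnicodeDecodeError) as exc: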
+            logger.warning("Impossible de lire %s : %s — ignoré.", gt_path, exc)
+            skipped += 1
+            continue
+        documents.append(
+            Document(
+                image_path=image_path,
+                ground_truth=ground_truth,
+            )
+        )
+    if not documents:
+        raise ValueError(
+            f"Aucun document valide trouvé dans {directory}. "
+            f"Vérifiez que les fichiers GT portent le suffixe '{gt_suffix}'."
+        )
+    if skipped:
+        logger.info("%d image(s) ignorée(s) faute de fichier GT.", skipped)
+    logger.info("Corpus '%s' chargé : %d documents.", corpus_name, len(documents))
+    return Corpus(
+        name=corpus_name,
+        documents=documents,
+        source_path=str(directory),
+    )
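
The pairing convention used by `load_corpus_from_directory` (image stem plus `.gt.txt`) can be tried in isolation with a temporary folder. This is a simplified sketch that only pairs paths and does not read the files; `pair_images_with_gt` and `demo` are illustrative names, not part of the module.

```python
import tempfile
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".webp"}


def pair_images_with_gt(directory: Path, gt_suffix: str = ".gt.txt"):
    """Return (paired, skipped): (image, gt) tuples and images without a GT file."""
    paired, skipped = [], []
    for image in sorted(directory.iterdir()):
        if image.suffix.lower() not in IMAGE_EXTENSIONS:
            continue  # not an image (GT files, notes, ...)
        gt = image.with_name(image.stem + gt_suffix)
        if gt.exists():
            paired.append((image, gt))
        else:
            skipped.append(image)
    return paired, skipped


def demo():
    """Build a tiny corpus on disk and report which images were paired."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("page_001.jpg", "page_001.gt.txt", "page_002.PNG", "notes.txt"):
            (root / name).write_text("x", encoding="utf-8")
        paired, skipped = pair_images_with_gt(root)
        return [img.name for img, _ in paired], [img.name for img in skipped]
```

In this example `page_002.PNG` is recognized as an image because the extension check is case-insensitive, but it is skipped since no `page_002.gt.txt` exists.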

picarones/core/metrics.py ADDED

	@@ -0,0 +1,214 @@

+"""CER and WER metrics computed via jiwer.
+Implemented metrics
+-------------------
+- Raw CER                 : character edit distance / GT length
+- NFC-normalized CER      : after Unicode NFC normalization
+- Caseless CER            : insensitive to upper/lower case
+- Raw WER                 : standard word error rate
+- Normalized WER          : after whitespace normalization
+- MER                     : Match Error Rate (jiwer)
+- WIL                     : Word Information Lost (jiwer)
+"""
+from __future__ import annotations
+import unicodedata
+from dataclasses import dataclass
+from typing import Optional
+try:
+    import jiwer
+    _JIWER_AVAILABLE = True
+except ImportError:
+    _JIWER_AVAILABLE = False
+# ---------------------------------------------------------------------------
+# Transformations / normalizations
+# ---------------------------------------------------------------------------
+def _normalize_nfc(text: str) -> str:
+    return unicodedata.normalize("NFC", text)
+def _normalize_caseless(text: str) -> str:
+    return unicodedata.normalize("NFC", text).casefold()
+def _normalize_whitespace(text: str) -> str:
+    return " ".join(text.split())
+# Placeholder jiwer transformation for CER (empty pipeline, not used in this module)
+_CHAR_TRANSFORM = jiwer.transforms.Compose([]) if _JIWER_AVAILABLE else None
+# jiwer transformations for WER (light whitespace normalization, not used in this module)
+_WER_TRANSFORM = (
+    jiwer.transforms.Compose(
+        [
+            jiwer.transforms.RemoveMultipleSpaces(),
+            jiwer.transforms.Strip(),
+            jiwer.transforms.ReduceToListOfListOfWords(),
+        ]
+    )
+    if _JIWER_AVAILABLE
+    else None
+)
+def _cer_from_strings(reference: str, hypothesis: str) -> float:
+    """Raw CER: edit distance over characters."""
+    if not reference:
+        return 0.0 if not hypothesis else 1.0
+    # jiwer.cer treats each character as a token
+    return jiwer.cer(reference, hypothesis)
+# ---------------------------------------------------------------------------
+# Structured result
+# ---------------------------------------------------------------------------
+@dataclass
+class MetricsResult:
+    """Set of metrics computed for one (reference, hypothesis) pair."""
+    cer: float
+    cer_nfc: float
+    cer_caseless: float
+    wer: float
+    wer_normalized: float
+    mer: float
+    wil: float
+    reference_length: int
+    hypothesis_length: int
+    error: Optional[str] = None
+    def as_dict(self) -> dict:
+        return {
+            "cer": round(self.cer, 6),
+            "cer_nfc": round(self.cer_nfc, 6),
+            "cer_caseless": round(self.cer_caseless, 6),
+            "wer": round(self.wer, 6),
+            "wer_normalized": round(self.wer_normalized, 6),
+            "mer": round(self.mer, 6),
+            "wil": round(self.wil, 6),
+            "reference_length": self.reference_length,
+            "hypothesis_length": self.hypothesis_length,
+            "error": self.error,
+        }
+    @property
+    def cer_percent(self) -> float:
+        return round(self.cer * 100, 2)
+    @property
+    def wer_percent(self) -> float:
+        return round(self.wer * 100, 2)
+def compute_metrics(reference: str, hypothesis: str) -> MetricsResult:
+    """Compute all CER/WER metrics for a pair of texts.
+    Parameters
+    ----------
+    reference:
+        Ground-truth text.
+    hypothesis:
+        Text produced by the OCR engine.
+    Returns
+    -------
+    MetricsResult
+        Object containing all computed metrics.
+    """
+    if not _JIWER_AVAILABLE:
+        return MetricsResult(
+            cer=0.0, cer_nfc=0.0, cer_caseless=0.0,
+            wer=0.0, wer_normalized=0.0, mer=0.0, wil=0.0,
+            reference_length=len(reference),
+            hypothesis_length=len(hypothesis),
+            error="jiwer n'est pas installé (pip install jiwer)",
+        )
+    try:
+        # CER variants
+        cer_raw = _cer_from_strings(reference, hypothesis)
+        cer_nfc = _cer_from_strings(
+            _normalize_nfc(reference), _normalize_nfc(hypothesis)
+        )
+        cer_caseless = _cer_from_strings(
+            _normalize_caseless(reference), _normalize_caseless(hypothesis)
+        )
+        # WER variants
+        ref_norm = _normalize_whitespace(reference)
+        hyp_norm = _normalize_whitespace(hypothesis)
+        wer_raw = jiwer.wer(reference, hypothesis)
+        wer_normalized = jiwer.wer(ref_norm, hyp_norm)
+        mer = jiwer.mer(reference, hypothesis)
+        wil = jiwer.wil(reference, hypothesis)
+        return MetricsResult(
+            cer=cer_raw,
+            cer_nfc=cer_nfc,
+            cer_caseless=cer_caseless,
+            wer=wer_raw,
+            wer_normalized=wer_normalized,
+            mer=mer,
+            wil=wil,
+            reference_length=len(reference),
+            hypothesis_length=len(hypothesis),
+        )
+    except Exception as exc:  # noqa: BLE001
+        return MetricsResult(
+            cer=0.0, cer_nfc=0.0, cer_caseless=0.0,
+            wer=0.0, wer_normalized=0.0, mer=0.0, wil=0.0,
+            reference_length=len(reference),
+            hypothesis_length=len(hypothesis),
+            error=str(exc),
+        )
+def aggregate_metrics(results: list[MetricsResult]) -> dict:
+    """Compute aggregate statistics over a set of results.
+    Parameters
+    ----------
+    results:
+        List of MetricsResult objects, one per document.
+    Returns
+    -------
+    dict
+        Statistics: mean, median, min, max, stdev for each metric.
+    """
+    import statistics
+    if not results:
+        return {}
+    def _stats(values: list[float]) -> dict:
+        if not values:
+            return {}
+        return {
+            "mean": round(statistics.mean(values), 6),
+            "median": round(statistics.median(values), 6),
+            "min": round(min(values), 6),
+            "max": round(max(values), 6),
+            "stdev": round(statistics.stdev(values), 6) if len(values) > 1 else 0.0,
+        }
+    metric_names = ["cer", "cer_nfc", "cer_caseless", "wer", "wer_normalized", "mer", "wil"]
+    aggregated: dict = {}
+    for metric in metric_names:
+        values = [getattr(r, metric) for r in results if r.error is None]
+        aggregated[metric] = _stats(values)
+    aggregated["document_count"] = len(results)
+    aggregated["failed_count"] = sum(1 for r in results if r.error is not None)
+    return aggregated
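
As a worked example of the aggregation behavior, the sketch below summarizes CER values while leaving failed documents out of the statistics and counting them separately, as `aggregate_metrics` does. It is a reduced version for a single metric; `summarize` and `aggregate_cer` are hypothetical names used for illustration.

```python
import statistics


def summarize(values):
    """Mean, median, min, max and sample standard deviation of a list."""
    if not values:
        return {}
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        # statistics.stdev needs at least two data points
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


def aggregate_cer(cers, errors):
    """Summarize CER over successful documents; count failures separately."""
    ok = [c for c, err in zip(cers, errors) if err is None]
    return {
        "cer": summarize(ok),
        "document_count": len(cers),
        "failed_count": len(cers) - len(ok),
    }


report = aggregate_cer(
    cers=[0.25, 0.0, 0.75],
    errors=[None, "engine crashed", None],
)
```

Here the failed document's placeholder CER of `0.0` does not pull the mean down: the mean is computed over `[0.25, 0.75]` only, and `failed_count` records the one failure.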

picarones/core/results.py ADDED

	@@ -0,0 +1,155 @@

+"""Data model for results and JSON export.
+Hierarchy
+---------
+BenchmarkResult
+  └── EngineReport          (one per engine)
+        └── DocumentResult  (one per document)
+"""
+from __future__ import annotations
+import json
+from dataclasses import asdict, dataclass, field
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Optional
+from picarones import __version__
+from picarones.core.metrics import MetricsResult, aggregate_metrics
+@dataclass
+class DocumentResult:
+    """Result of one engine on a single document."""
+    doc_id: str
+    image_path: str
+    ground_truth: str
+    hypothesis: str
+    metrics: MetricsResult
+    duration_seconds: float
+    engine_error: Optional[str] = None
+    def as_dict(self) -> dict:
+        return {
+            "doc_id": self.doc_id,
+            "image_path": self.image_path,
+            "ground_truth": self.ground_truth,
+            "hypothesis": self.hypothesis,
+            "metrics": self.metrics.as_dict(),
+            "duration_seconds": self.duration_seconds,
+            "engine_error": self.engine_error,
+        }
+@dataclass
+class EngineReport:
+    """Full report for one engine over the whole corpus."""
+    engine_name: str
+    engine_version: str
+    engine_config: dict
+    document_results: list[DocumentResult]
+    aggregated_metrics: dict = field(default_factory=dict)
+    def __post_init__(self) -> None:
+        if not self.aggregated_metrics and self.document_results:
+            self.aggregated_metrics = aggregate_metrics(
+                [dr.metrics for dr in self.document_results]
+            )
+    @property
+    def mean_cer(self) -> Optional[float]:
+        cer_stats = self.aggregated_metrics.get("cer", {})
+        return cer_stats.get("mean")
+    @property
+    def mean_wer(self) -> Optional[float]:
+        wer_stats = self.aggregated_metrics.get("wer", {})
+        return wer_stats.get("mean")
+    def as_dict(self) -> dict:
+        return {
+            "engine_name": self.engine_name,
+            "engine_version": self.engine_version,
+            "engine_config": self.engine_config,
+            "aggregated_metrics": self.aggregated_metrics,
+            "document_results": [dr.as_dict() for dr in self.document_results],
+        }
+@dataclass
+class BenchmarkResult:
+    """Complete result of a multi-engine benchmark on a corpus."""
+    corpus_name: str
+    corpus_source: Optional[str]
+    document_count: int
+    engine_reports: list[EngineReport]
+    run_date: str = field(default_factory=lambda: datetime.now(tz=timezone.utc).isoformat())
+    picarones_version: str = __version__
+    metadata: dict = field(default_factory=dict)
+    def ranking(self) -> list[dict]:
+        """Return the engine ranking sorted by ascending CER."""
+        ranked = []
+        for report in self.engine_reports:
+            ranked.append(
+                {
+                    "engine": report.engine_name,
+                    "mean_cer": report.mean_cer,
+                    "mean_wer": report.mean_wer,
+                    "documents": len(report.document_results),
+                    "failed": report.aggregated_metrics.get("failed_count", 0),
+                }
+            )
+        return sorted(
+            ranked,
+            key=lambda x: (x["mean_cer"] is None, x["mean_cer"] if x["mean_cer"] is not None else float("inf")),
+        )
+    def as_dict(self) -> dict:
+        return {
+            "picarones_version": self.picarones_version,
+            "run_date": self.run_date,
+            "corpus": {
+                "name": self.corpus_name,
+                "source": self.corpus_source,
+                "document_count": self.document_count,
+            },
+            "ranking": self.ranking(),
+            "engine_reports": [r.as_dict() for r in self.engine_reports],
+            "metadata": self.metadata,
+        }
+    def to_json(self, path: str | Path, indent: int = 2) -> Path:
+        """Serialize the benchmark to JSON and write it to disk.
+        Parameters
+        ----------
+        path:
+            Path of the output JSON file.
+        indent:
+            JSON indentation (default: 2 spaces).
+        Returns
+        -------
+        Path
+            Absolute path of the written file.
+        """
+        output_path = Path(path)
+        output_path.parent.mkdir(parents=True, exist_ok=True)
+        with output_path.open("w", encoding="utf-8") as fh:
+            json.dump(self.as_dict(), fh, ensure_ascii=False, indent=indent)
+        return output_path.resolve()
+    @classmethod
+    def from_json(cls, path: str | Path) -> dict:
+        """Load a raw JSON result from disk (for the HTML report).
+        Returns the Python dict; full reconstruction into objects
+        is deferred to later sprints.
+        """
+        with Path(path).open(encoding="utf-8") as fh:
+            return json.load(fh)
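
A note on the sort key used by `ranking()`: it has to tell a missing mean CER apart from a perfect 0.0, because both are falsy in Python. The sketch below shows one way to do that on plain dicts shaped like the rows `ranking()` builds. The helper name `rank_by_cer` and the sample rows are hypothetical and not part of Picarones.

```python
def rank_by_cer(rows: list[dict]) -> list[dict]:
    """Sort rows by ascending mean CER, placing rows without a CER last."""
    return sorted(
        rows,
        key=lambda r: (
            r["mean_cer"] is None,  # False sorts before True
            r["mean_cer"] if r["mean_cer"] is not None else float("inf"),
        ),
    )


# Hypothetical sample data: one engine failed everywhere, one is perfect.
rows = [
    {"engine": "a", "mean_cer": 0.12},
    {"engine": "b", "mean_cer": None},
    {"engine": "c", "mean_cer": 0.0},
]
ranked_names = [r["engine"] for r in rank_by_cer(rows)]
```

Using `value or float("inf")` instead of the explicit `is not None` test would push the perfect engine behind every engine with a non-zero CER.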

picarones/core/runner.py ADDED

	@@ -0,0 +1,115 @@

+"""Benchmark orchestrator: runs the engines on the corpus and aggregates the results."""
+from __future__ import annotations
+import logging
+from pathlib import Path
+from typing import Optional
+from tqdm import tqdm
+from picarones.core.corpus import Corpus
+from picarones.core.metrics import compute_metrics
+from picarones.core.results import BenchmarkResult, DocumentResult, EngineReport
+from picarones.engines.base import BaseOCREngine
+logger = logging.getLogger(__name__)
+def run_benchmark(
+    corpus: Corpus,
+    engines: list[BaseOCREngine],
+    output_json: Optional[str | Path] = None,
+    show_progress: bool = True,
+) -> BenchmarkResult:
+    """Run the benchmark of one or more engines on a corpus.
+    For each engine, documents are processed sequentially.
+    Outputs are evaluated against the ground truth using the
+    CER and WER metrics.
+    Parameters
+    ----------
+    corpus:
+        Corpus to evaluate (a ``Corpus`` object with its ``Document`` items).
+    engines:
+        List of engine adapters to compare.
+    output_json:
+        Optional path for writing the JSON result. If ``None``, nothing
+        is written to disk.
+    show_progress:
+        Show a tqdm progress bar (default: True).
+    Returns
+    -------
+    BenchmarkResult
+        Object containing all results, aggregations, and the ranking.
+    """
+    engine_reports: list[EngineReport] = []
+    for engine in engines:
+        logger.info("Starting engine: %s", engine.name)
+        document_results: list[DocumentResult] = []
+        iterator = tqdm(
+            corpus.documents,
+            desc=f"[{engine.name}]",
+            unit="doc",
+            disable=not show_progress,
+        )
+        for doc in iterator:
+            ocr_result = engine.run(doc.image_path)
+            if ocr_result.success:
+                metrics = compute_metrics(doc.ground_truth, ocr_result.text)
+            else:
+                # Engine failed: record worst-case metrics and keep the error message
+                from picarones.core.metrics import MetricsResult
+                metrics = MetricsResult(
+                    cer=1.0, cer_nfc=1.0, cer_caseless=1.0,
+                    wer=1.0, wer_normalized=1.0, mer=1.0, wil=1.0,
+                    reference_length=len(doc.ground_truth),
+                    hypothesis_length=0,
+                    error=ocr_result.error,
+                )
+            document_results.append(
+                DocumentResult(
+                    doc_id=doc.doc_id,
+                    image_path=str(doc.image_path),
+                    ground_truth=doc.ground_truth,
+                    hypothesis=ocr_result.text,
+                    metrics=metrics,
+                    duration_seconds=ocr_result.duration_seconds,
+                    engine_error=ocr_result.error,
+                )
+            )
+        engine_version = engine._safe_version()
+        report = EngineReport(
+            engine_name=engine.name,
+            engine_version=engine_version,
+            engine_config=engine.config,
+            document_results=document_results,
+        )
+        engine_reports.append(report)
+        logger.info(
+            "Engine %s finished, mean CER: %.2f%%",
+            engine.name,
+            (report.mean_cer or 0) * 100,
+        )
+    benchmark = BenchmarkResult(
+        corpus_name=corpus.name,
+        corpus_source=corpus.source_path,
+        document_count=len(corpus),
+        engine_reports=engine_reports,
+    )
+    if output_json:
+        path = benchmark.to_json(output_json)
+        logger.info("Results written to: %s", path)
+    return benchmark

picarones/engines/__init__.py ADDED

	@@ -0,0 +1,13 @@

+"""OCR engine adapters."""
+from picarones.engines.base import BaseOCREngine, EngineResult
+from picarones.engines.tesseract import TesseractEngine
+__all__ = ["BaseOCREngine", "EngineResult", "TesseractEngine"]
+try:
+    from picarones.engines.pero_ocr import PeroOCREngine
+    __all__.append("PeroOCREngine")
+except ImportError:
+    pass

picarones/engines/base.py ADDED

	@@ -0,0 +1,85 @@

+"""Common abstract interface for all OCR engine adapters."""
+from __future__ import annotations
+import hashlib
+import time
+from abc import ABC, abstractmethod
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Optional
+@dataclass
+class EngineResult:
+    """Raw result produced by an OCR engine for one image."""
+    engine_name: str
+    image_path: str
+    text: str
+    duration_seconds: float
+    error: Optional[str] = None
+    metadata: dict = field(default_factory=dict)
+    @property
+    def success(self) -> bool:
+        return self.error is None
+    @property
+    def image_sha256(self) -> str:
+        return hashlib.sha256(Path(self.image_path).read_bytes()).hexdigest()
+class BaseOCREngine(ABC):
+    """Base class inherited by all OCR adapters.
+    Each adapter must implement:
+    - ``name``: unique identifier of the engine
+    - ``version()``: returns the engine version as a string
+    - ``_run_ocr(image_path)``: OCR execution logic, returns the raw text
+    """
+    def __init__(self, config: Optional[dict] = None) -> None:
+        self.config: dict = config or {}
+    @property
+    @abstractmethod
+    def name(self) -> str:
+        """Unique, stable identifier of the engine."""
+    @abstractmethod
+    def version(self) -> str:
+        """Return the engine version (e.g. '5.3.0')."""
+    @abstractmethod
+    def _run_ocr(self, image_path: Path) -> str:
+        """Run OCR and return the raw extracted text."""
+    def run(self, image_path: str | Path) -> EngineResult:
+        """Public entry point: run OCR and measure execution time."""
+        image_path = Path(image_path)
+        start = time.perf_counter()
+        try:
+            text = self._run_ocr(image_path)
+            error = None
+        except Exception as exc:  # noqa: BLE001
+            text = ""
+            error = str(exc)
+        duration = time.perf_counter() - start
+        return EngineResult(
+            engine_name=self.name,
+            image_path=str(image_path),
+            text=text,
+            duration_seconds=round(duration, 4),
+            error=error,
+            metadata={"engine_version": self._safe_version()},
+        )
+    def _safe_version(self) -> str:
+        try:
+            return self.version()
+        except Exception:  # noqa: BLE001
+            return "unknown"
+    def __repr__(self) -> str:
+        return f"{self.__class__.__name__}(name={self.name!r})"
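
The `run()` wrapper in `base.py` never lets an engine exception escape: it turns the exception into an error string and still reports the elapsed time. Below is a minimal sketch of the same pattern outside the class hierarchy. The names `timed_call`, `ok_engine`, and `broken_engine` are hypothetical and exist only for illustration.

```python
import time
from typing import Callable, Optional


def timed_call(func: Callable[[], str]) -> tuple[str, float, Optional[str]]:
    """Call func and return (text, duration_seconds, error)."""
    start = time.perf_counter()
    try:
        text, error = func(), None
    except Exception as exc:  # broad on purpose: any engine failure is recorded
        text, error = "", str(exc)
    duration = round(time.perf_counter() - start, 4)
    return text, duration, error


def ok_engine() -> str:
    return "recognized text"


def broken_engine() -> str:
    raise RuntimeError("engine crashed")


ok_text, ok_duration, ok_error = timed_call(ok_engine)
bad_text, bad_duration, bad_error = timed_call(broken_engine)
```

Because the failure is carried in the return value rather than raised, a benchmark loop can keep processing the remaining documents and count the failure afterwards.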

picarones/engines/pero_ocr.py ADDED

	@@ -0,0 +1,112 @@

+"""Pero OCR adapter.
+Pero OCR is an HTR/OCR engine that performs well on historical documents,
+developed at Brno University of Technology.
+Dependency : pero-ocr  (pip install pero-ocr)
+Repository : https://github.com/DCGM/pero-ocr
+YAML configuration:
+```yaml
+name: pero_ocr
+engine: pero_ocr
+config: /path/to/config.ini   # Pero OCR configuration file
+cuda: false                   # use the GPU if available
+```
+"""
+from __future__ import annotations
+import tempfile
+from pathlib import Path
+from typing import Optional
+from picarones.engines.base import BaseOCREngine
+try:
+    import numpy as np
+    from PIL import Image
+    _PIL_AVAILABLE = True
+except ImportError:
+    _PIL_AVAILABLE = False
+try:
+    from pero_ocr.document_ocr.layout import PageLayout
+    from pero_ocr.document_ocr.page_parser import PageParser
+    _PERO_AVAILABLE = True
+except ImportError:
+    _PERO_AVAILABLE = False
+class PeroOCREngine(BaseOCREngine):
+    """Adapter for Pero OCR.
+    Pero OCR produces structured output (PAGE XML); this adapter
+    extracts the plain text from it in natural reading order.
+    """
+    def __init__(self, config: Optional[dict] = None) -> None:
+        super().__init__(config)
+        self._parser: Optional[object] = None
+    @property
+    def name(self) -> str:
+        return self.config.get("name", "pero_ocr")
+    def version(self) -> str:
+        if not _PERO_AVAILABLE:
+            raise RuntimeError("pero-ocr is not installed.")
+        try:
+            import pero_ocr
+            return getattr(pero_ocr, "__version__", "unknown")
+        except Exception:  # noqa: BLE001
+            return "unknown"
+    def _get_parser(self) -> "PageParser":
+        """Instantiate the PageParser lazily, once per engine."""
+        if self._parser is None:
+            if not _PERO_AVAILABLE:
+                raise RuntimeError(
+                    "pero-ocr is not installed. "
+                    "Install it with: pip install pero-ocr"
+                )
+            config_path = self.config.get("config")
+            if not config_path:
+                raise ValueError(
+                    "The Pero OCR configuration requires a 'config' parameter "
+                    "pointing to a valid Pero OCR .ini file."
+                )
+            import configparser
+            parser_config = configparser.ConfigParser()
+            parser_config.read(config_path)
+            self._parser = PageParser(parser_config)
+        return self._parser  # type: ignore[return-value]
+    def _run_ocr(self, image_path: Path) -> str:
+        if not _PIL_AVAILABLE:
+            raise RuntimeError("Pillow or NumPy is not installed.")
+        parser = self._get_parser()
+        image = np.array(Image.open(image_path).convert("RGB"))
+        page_layout = PageLayout(id=image_path.stem, page_size=(image.shape[0], image.shape[1]))
+        # Run the Pero OCR pipeline
+        parser.process_page(image, page_layout)
+        # Extract the plain text in line order
+        lines = []
+        for region in page_layout.regions:
+            for line in region.lines:
+                if line.transcription:
+                    lines.append(line.transcription.strip())
+        return "\n".join(lines)
+    @classmethod
+    def from_config(cls, config: Optional[dict] = None) -> "PeroOCREngine":
+        return cls(config=config or {})
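
The text extraction at the end of `_run_ocr` walks regions, then lines, and keeps only non-empty transcriptions. The sketch below reproduces that traversal on stand-in data classes so it can run without pero-ocr installed. `Line`, `Region`, and `flatten_text` are hypothetical stand-ins that only mirror the attribute names used in the adapter (`regions`, `lines`, `transcription`); they are not the pero-ocr classes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Line:
    transcription: Optional[str] = None


@dataclass
class Region:
    lines: list[Line] = field(default_factory=list)


def flatten_text(regions: list[Region]) -> str:
    """Join the non-empty line transcriptions, one per output line."""
    collected = []
    for region in regions:
        for line in region.lines:
            if line.transcription:  # skips None and empty strings
                collected.append(line.transcription.strip())
    return "\n".join(collected)


regions = [
    Region([Line("  Incipit liber  "), Line(None)]),
    Region([Line(""), Line("Explicit")]),
]
flat = flatten_text(regions)
```

The output order is simply the order in which regions and lines are stored, so any reading-order guarantee depends on what the layout stage produced.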

picarones/engines/tesseract.py ADDED

	@@ -0,0 +1,84 @@

+"""Tesseract 5 adapter via pytesseract."""
+from __future__ import annotations
+from pathlib import Path
+from typing import Optional
+from picarones.engines.base import BaseOCREngine
+try:
+    import pytesseract
+    from PIL import Image
+    _PYTESSERACT_AVAILABLE = True
+except ImportError:
+    _PYTESSERACT_AVAILABLE = False
+# Labels for the PSM values accepted as YAML/CLI arguments
+_PSM_LABELS = {
+    0: "Orientation and script detection only",
+    1: "Automatic page segmentation with OSD",
+    3: "Fully automatic page segmentation (default)",
+    4: "Single column of text",
+    5: "Single uniform block of vertically aligned text",
+    6: "Single uniform block of text",
+    7: "Single text line",
+    8: "Single word",
+    9: "Single word in a circle",
+    10: "Single character",
+    11: "Sparse text",
+    12: "Sparse text with OSD",
+    13: "Raw line",
+}
+class TesseractEngine(BaseOCREngine):
+    """Adapter for Tesseract 5 (via pytesseract).
+    YAML configuration:
+    ```yaml
+    name: tesseract
+    engine: tesseract
+    lang: fra          # Tesseract language code (fra, lat, eng, ...)
+    psm: 6             # Page Segmentation Mode (0-13)
+    oem: 3             # OCR Engine Mode (0=legacy, 1=LSTM, 2=legacy+LSTM, 3=default)
+    tesseract_cmd: tesseract  # path to the executable if non-standard
+    ```
+    """
+    @property
+    def name(self) -> str:
+        return self.config.get("name", "tesseract")
+    def version(self) -> str:
+        if not _PYTESSERACT_AVAILABLE:
+            raise RuntimeError("pytesseract is not installed.")
+        return pytesseract.get_tesseract_version().vstring
+    def _run_ocr(self, image_path: Path) -> str:
+        if not _PYTESSERACT_AVAILABLE:
+            raise RuntimeError(
+                "pytesseract is not installed. "
+                "Install it with: pip install pytesseract"
+            )
+        # Optional override of the executable path
+        tesseract_cmd = self.config.get("tesseract_cmd")
+        if tesseract_cmd:
+            pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
+        lang = self.config.get("lang", "fra")
+        psm = int(self.config.get("psm", 6))
+        oem = int(self.config.get("oem", 3))
+        custom_config = f"--oem {oem} --psm {psm}"
+        image = Image.open(image_path)
+        text: str = pytesseract.image_to_string(image, lang=lang, config=custom_config)
+        return text.strip()
+    @classmethod
+    def from_config(cls, config: Optional[dict] = None) -> "TesseractEngine":
+        return cls(config=config or {})
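
The adapter turns its config dict into a language code plus a Tesseract option string, with defaults `fra`, PSM 6, and OEM 3. The sketch below isolates that step so the defaults and the integer coercion are easy to see. The helper name `build_tesseract_args` is hypothetical; the adapter does this inline in `_run_ocr`.

```python
def build_tesseract_args(config: dict) -> tuple[str, str]:
    """Return (lang, option_string) using the same defaults as the adapter."""
    lang = config.get("lang", "fra")
    psm = int(config.get("psm", 6))  # int() also accepts "7" coming from YAML/CLI
    oem = int(config.get("oem", 3))
    return lang, f"--oem {oem} --psm {psm}"


default_args = build_tesseract_args({})
latin_line_args = build_tesseract_args({"lang": "lat", "psm": "7"})
```

The resulting pair corresponds to the `lang=` and `config=` arguments passed to `pytesseract.image_to_string` in the adapter.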

pyproject.toml ADDED

	@@ -0,0 +1,44 @@

+[build-system]
+requires = ["setuptools>=68.0", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "picarones"
+version = "0.1.0"
+description = "Platform for comparing OCR engines on heritage documents"
+readme = "README.md"
+requires-python = ">=3.11"
+license = { text = "Apache-2.0" }
+authors = [{ name = "Bibliothèque nationale de France — Département numérique" }]
+keywords = ["ocr", "htr", "patrimoine", "benchmark", "cer", "wer"]
+classifiers = [
+    "Development Status :: 3 - Alpha",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "License :: OSI Approved :: Apache Software License",
+    "Topic :: Scientific/Engineering :: Artificial Intelligence",
+]
+dependencies = [
+    "click>=8.1.0",
+    "jiwer>=3.0.0",
+    "Pillow>=10.0.0",
+    "pyyaml>=6.0.0",
+    "pytesseract>=0.3.10",
+    "tqdm>=4.66.0",
+    "numpy>=1.24.0",
+]
+[project.optional-dependencies]
+dev = ["pytest>=7.4.0", "pytest-cov>=4.1.0"]
+pero = ["pero-ocr>=0.1.0"]
+[project.scripts]
+picarones = "picarones.cli:cli"
+[tool.setuptools.packages.find]
+where = ["."]
+include = ["picarones*"]
+[tool.pytest.ini_options]
+testpaths = ["tests"]
+addopts = "-v --tb=short"

requirements.txt ADDED

	@@ -0,0 +1,19 @@

+# Core
+click>=8.1.0
+jiwer>=3.0.0
+Pillow>=10.0.0
+pyyaml>=6.0.0
+# OCR engines
+pytesseract>=0.3.10
+# pero-ocr (optional, install separately if needed)
+# pero-ocr>=0.1.0
+# Utilities
+tqdm>=4.66.0
+numpy>=1.24.0
+# Development / testing
+pytest>=7.4.0
+pytest-cov>=4.1.0

tests/__init__.py ADDED

File without changes

tests/test_corpus.py ADDED

	@@ -0,0 +1,96 @@

+"""Unit tests for picarones.core.corpus."""
+import pytest
+from pathlib import Path
+from picarones.core.corpus import load_corpus_from_directory, Corpus, Document
+@pytest.fixture
+def sample_corpus_dir(tmp_path: Path) -> Path:
+    """Create a temporary mini-corpus with 3 image/GT pairs."""
+    images = [
+        ("page_001.png", "La première page du document médiéval."),
+        ("page_002.png", "Deuxième folio avec des abréviations."),
+        ("page_003.png", "Fin du manuscrit avec colophon."),
+    ]
+    for filename, gt_text in images:
+        # Dummy image (valid 1×1 PNG)
+        image_path = tmp_path / filename
+        image_path.write_bytes(
+            b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x01"
+            b"\x00\x00\x00\x01\x08\x02\x00\x00\x00\x90wS\xde\x00\x00"
+            b"\x00\x0cIDATx\x9cc\xf8\x0f\x00\x00\x01\x01\x00\x05\x18"
+            b"\xd8N\x00\x00\x00\x00IEND\xaeB`\x82"
+        )
+        gt_path = tmp_path / (Path(filename).stem + ".gt.txt")
+        gt_path.write_text(gt_text, encoding="utf-8")
+    return tmp_path
+class TestLoadCorpusFromDirectory:
+    def test_loads_correct_count(self, sample_corpus_dir):
+        corpus = load_corpus_from_directory(sample_corpus_dir)
+        assert len(corpus) == 3
+    def test_corpus_name_defaults_to_dir_name(self, sample_corpus_dir):
+        corpus = load_corpus_from_directory(sample_corpus_dir)
+        assert corpus.name == sample_corpus_dir.name
+    def test_corpus_name_can_be_set(self, sample_corpus_dir):
+        corpus = load_corpus_from_directory(sample_corpus_dir, name="Mon corpus test")
+        assert corpus.name == "Mon corpus test"
+    def test_document_ids(self, sample_corpus_dir):
+        corpus = load_corpus_from_directory(sample_corpus_dir)
+        ids = {doc.doc_id for doc in corpus}
+        assert "page_001" in ids
+        assert "page_002" in ids
+        assert "page_003" in ids
+    def test_ground_truth_content(self, sample_corpus_dir):
+        corpus = load_corpus_from_directory(sample_corpus_dir)
+        doc = next(d for d in corpus if d.doc_id == "page_001")
+        assert "médiéval" in doc.ground_truth
+    def test_source_path_set(self, sample_corpus_dir):
+        corpus = load_corpus_from_directory(sample_corpus_dir)
+        assert corpus.source_path == str(sample_corpus_dir)
+    def test_nonexistent_directory_raises(self, tmp_path):
+        with pytest.raises(FileNotFoundError):
+            load_corpus_from_directory(tmp_path / "inexistant")
+    def test_directory_without_gt_raises(self, tmp_path):
+        (tmp_path / "image.png").write_bytes(b"fake")
+        with pytest.raises(ValueError):
+            load_corpus_from_directory(tmp_path)
+    def test_ignores_images_without_gt(self, sample_corpus_dir, tmp_path):
+        # Copy the corpus and add an image without GT
+        import shutil
+        dest = tmp_path / "corpus2"
+        shutil.copytree(sample_corpus_dir, dest)
+        (dest / "orphan.png").write_bytes(b"fake")
+        corpus = load_corpus_from_directory(dest)
+        assert len(corpus) == 3  # The orphan image is ignored
+    def test_stats_computed(self, sample_corpus_dir):
+        corpus = load_corpus_from_directory(sample_corpus_dir)
+        stats = corpus.stats
+        assert stats["document_count"] == 3
+        assert stats["gt_length_min"] > 0
+class TestCorpusIteration:
+    def test_iterable(self, sample_corpus_dir):
+        corpus = load_corpus_from_directory(sample_corpus_dir)
+        docs = list(corpus)
+        assert len(docs) == 3
+        assert all(isinstance(d, Document) for d in docs)
+    def test_repr(self, sample_corpus_dir):
+        corpus = load_corpus_from_directory(sample_corpus_dir)
+        r = repr(corpus)
+        assert "Corpus" in r
+        assert "3" in r
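
The corpus tests above rely on a simple convention: an image `page_001.png` is paired with `page_001.gt.txt`, and images without a ground-truth file are skipped. The sketch below shows one way such pairing could work. It is not the implementation of `load_corpus_from_directory`, which is not shown in this diff; `pair_images_with_gt` and the `IMAGE_SUFFIXES` set are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Assumed set of image extensions; the real loader may accept others.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".webp"}


def pair_images_with_gt(directory: Path) -> list[tuple[Path, str]]:
    """Return (image_path, ground_truth) for every image that has a .gt.txt file."""
    pairs = []
    for image in sorted(directory.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        gt_path = image.with_name(image.stem + ".gt.txt")
        if gt_path.is_file():
            pairs.append((image, gt_path.read_text(encoding="utf-8")))
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "page_001.png").write_bytes(b"fake")
    (root / "page_001.gt.txt").write_text("Première page", encoding="utf-8")
    (root / "orphan.png").write_bytes(b"fake")  # no ground truth: skipped
    paired_names = [(img.name, text) for img, text in pair_images_with_gt(root)]
```

Sorting the directory listing keeps document order stable across platforms, which makes per-document results easier to compare between runs.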

tests/test_engines.py ADDED

	@@ -0,0 +1,214 @@

+"""Unit tests for the OCR engine adapters.
+The tests check the structure and behavior of the adapters
+without requiring Tesseract or Pero OCR to be installed.
+"""
+from __future__ import annotations
+import pytest
+from pathlib import Path
+from unittest.mock import MagicMock, patch
+from picarones.engines.base import BaseOCREngine, EngineResult
+from picarones.engines.tesseract import TesseractEngine
+from picarones.engines.pero_ocr import PeroOCREngine
+# ---------------------------------------------------------------------------
+# Tests BaseOCREngine
+# ---------------------------------------------------------------------------
+class ConcreteEngine(BaseOCREngine):
+    """Minimal implementation used to test the base class."""
+    @property
+    def name(self) -> str:
+        return "test_engine"
+    def version(self) -> str:
+        return "1.0.0"
+    def _run_ocr(self, image_path: Path) -> str:
+        return "Texte extrait par le moteur de test."
+class FailingEngine(BaseOCREngine):
+    """Engine that always raises an exception."""
+    @property
+    def name(self) -> str:
+        return "failing_engine"
+    def version(self) -> str:
+        return "0.0.0"
+    def _run_ocr(self, image_path: Path) -> str:
+        raise RuntimeError("OCR échoué intentionnellement.")
+class TestBaseOCREngine:
+    def test_run_returns_engine_result(self, tmp_path):
+        (tmp_path / "image.png").write_bytes(b"fake_image")
+        engine = ConcreteEngine()
+        result = engine.run(tmp_path / "image.png")
+        assert isinstance(result, EngineResult)
+    def test_run_success(self, tmp_path):
+        (tmp_path / "image.png").write_bytes(b"fake_image")
+        engine = ConcreteEngine()
+        result = engine.run(tmp_path / "image.png")
+        assert result.success is True
+        assert result.error is None
+        assert result.text == "Texte extrait par le moteur de test."
+    def test_run_captures_exception(self, tmp_path):
+        (tmp_path / "image.png").write_bytes(b"fake_image")
+        engine = FailingEngine()
+        result = engine.run(tmp_path / "image.png")
+        assert result.success is False
+        assert result.error is not None
+        assert "OCR échoué" in result.error
+    def test_run_measures_duration(self, tmp_path):
+        (tmp_path / "image.png").write_bytes(b"fake_image")
+        engine = ConcreteEngine()
+        result = engine.run(tmp_path / "image.png")
+        assert result.duration_seconds >= 0.0
+    def test_engine_result_engine_name(self, tmp_path):
+        (tmp_path / "image.png").write_bytes(b"fake_image")
+        engine = ConcreteEngine()
+        result = engine.run(tmp_path / "image.png")
+        assert result.engine_name == "test_engine"
+    def test_repr(self):
+        engine = ConcreteEngine()
+        assert "ConcreteEngine" in repr(engine)
+        assert "test_engine" in repr(engine)
+    def test_image_path_stored(self, tmp_path):
+        img = tmp_path / "image.png"
+        img.write_bytes(b"fake_image")
+        engine = ConcreteEngine()
+        result = engine.run(img)
+        assert result.image_path == str(img)
+# ---------------------------------------------------------------------------
+# Tests TesseractEngine
+# ---------------------------------------------------------------------------
+class TestTesseractEngine:
+    def test_name_default(self):
+        engine = TesseractEngine()
+        assert engine.name == "tesseract"
+    def test_name_from_config(self):
+        engine = TesseractEngine(config={"name": "tesseract_fra"})
+        assert engine.name == "tesseract_fra"
+    def test_from_config_factory(self):
+        engine = TesseractEngine.from_config({"lang": "lat", "psm": 7})
+        assert engine.config["lang"] == "lat"
+        assert engine.config["psm"] == 7
+    def test_run_with_pytesseract_mocked(self, tmp_path):
+        """Check that the engine calls pytesseract correctly."""
+        img = tmp_path / "page.png"
+        img.write_bytes(b"fake")
+        with (
+            patch("picarones.engines.tesseract._PYTESSERACT_AVAILABLE", True),
+            patch("picarones.engines.tesseract.pytesseract") as mock_tess,
+            patch("picarones.engines.tesseract.Image") as mock_pil,
+        ):
+            mock_tess.image_to_string.return_value = "Résultat OCR mock"
+            mock_pil.open.return_value = MagicMock()
+            engine = TesseractEngine(config={"lang": "fra", "psm": 6})
+            result = engine.run(img)
+        assert result.success is True
+        assert result.text == "Résultat OCR mock"
+        mock_tess.image_to_string.assert_called_once()
+    def test_run_without_pytesseract_raises(self, tmp_path):
+        """Without pytesseract, the engine must return an EngineResult with an error."""
+        img = tmp_path / "page.png"
+        img.write_bytes(b"fake")
+        with patch("picarones.engines.tesseract._PYTESSERACT_AVAILABLE", False):
+            engine = TesseractEngine()
+            result = engine.run(img)
+        assert result.success is False
+        assert "pytesseract" in result.error.lower()
+# ---------------------------------------------------------------------------
+# Tests PeroOCREngine
+# ---------------------------------------------------------------------------
+class TestPeroOCREngine:
+    def test_name_default(self):
+        engine = PeroOCREngine()
+        assert engine.name == "pero_ocr"
+    def test_name_from_config(self):
+        engine = PeroOCREngine(config={"name": "pero_historique"})
+        assert engine.name == "pero_historique"
+    def test_from_config_factory(self):
+        engine = PeroOCREngine.from_config({"config": "/path/to/pero.ini"})
+        assert engine.config["config"] == "/path/to/pero.ini"
+    def test_run_without_pero_raises(self, tmp_path):
+        """Without pero-ocr, the engine must return an EngineResult with an error."""
+        img = tmp_path / "page.png"
+        img.write_bytes(b"fake")
+        with patch("picarones.engines.pero_ocr._PERO_AVAILABLE", False):
+            engine = PeroOCREngine(config={"config": "/fake/config.ini"})
+            result = engine.run(img)
+        assert result.success is False
+    def test_run_without_config_raises(self, tmp_path):
+        """Without the 'config' parameter, the engine must report a clear error."""
+        img = tmp_path / "page.png"
+        img.write_bytes(b"fake")
+        with patch("picarones.engines.pero_ocr._PERO_AVAILABLE", True):
+            engine = PeroOCREngine()
+            result = engine.run(img)
+        assert result.success is False
+        assert "config" in result.error.lower()
+# ---------------------------------------------------------------------------
+# Tests EngineResult
+# ---------------------------------------------------------------------------
+class TestEngineResult:
+    def test_success_true_when_no_error(self):
+        r = EngineResult(
+            engine_name="test", image_path="/img.png",
+            text="texte", duration_seconds=0.1
+        )
+        assert r.success is True
+    def test_success_false_when_error(self):
+        r = EngineResult(
+            engine_name="test", image_path="/img.png",
+            text="", duration_seconds=0.1, error="Erreur"
+        )
+        assert r.success is False
+    def test_metadata_default_empty(self):
+        r = EngineResult(
+            engine_name="test", image_path="/img.png",
+            text="", duration_seconds=0.0
+        )
+        assert r.metadata == {}

tests/test_metrics.py ADDED

	@@ -0,0 +1,117 @@

+"""Unit tests for the picarones.core.metrics module."""
+import pytest
+from picarones.core.metrics import aggregate_metrics, compute_metrics, MetricsResult
+class TestComputeMetrics:
+    """Tests of compute_metrics on known cases."""
+    def test_perfect_match(self):
+        """CER and WER must be 0 when reference == hypothesis."""
+        result = compute_metrics("Bonjour le monde", "Bonjour le monde")
+        assert result.cer == pytest.approx(0.0)
+        assert result.wer == pytest.approx(0.0)
+        assert result.error is None
+    def test_complete_mismatch(self):
+        """CER is strictly positive (expected close to 1) when the texts are completely different."""
+        result = compute_metrics("abc", "xyz")
+        assert result.cer > 0.0
+        assert result.error is None
+    def test_empty_reference(self):
+        """Empty reference: CER = 1.0 if the hypothesis is non-empty."""
+        result = compute_metrics("", "quelque chose")
+        assert result.cer == pytest.approx(1.0)
+    def test_empty_both(self):
+        """Empty reference and hypothesis: CER = 0.0."""
+        result = compute_metrics("", "")
+        assert result.cer == pytest.approx(0.0)
+    def test_single_substitution(self):
+        """A single substitution over 4 characters → CER = 0.25."""
+        result = compute_metrics("abcd", "abce")
+        assert result.cer == pytest.approx(0.25)
+    def test_case_insensitive_cer(self):
+        """Caseless CER ignores case differences."""
+        result = compute_metrics("Bonjour", "bonjour")
+        assert result.cer_caseless == pytest.approx(0.0)
+        # Raw CER must be > 0 (B ≠ b)
+        assert result.cer > 0.0
+    def test_nfc_normalization(self):
+        """NFC CER normalizes equivalent Unicode sequences."""
+        # é can be encoded in composed form (U+00E9) or decomposed form (e + U+0301)
+        composed = "\u00e9"       # é (NFC)
+        decomposed = "e\u0301"    # e + combining accent (NFD)
+        result = compute_metrics(composed, decomposed)
+        # After NFC, both are identical → cer_nfc = 0
+        assert result.cer_nfc == pytest.approx(0.0)
+    def test_wer_one_word_wrong(self):
+        """WER = 1/3 for 1 wrong word out of 3."""
+        result = compute_metrics("le chat dort", "le chien dort")
+        assert result.wer == pytest.approx(1 / 3, rel=1e-2)
+    def test_result_has_lengths(self):
+        ref = "Texte de référence"
+        result = compute_metrics(ref, "Texte différent")
+        assert result.reference_length == len(ref)
+        assert result.hypothesis_length > 0
+    def test_metrics_result_as_dict(self):
+        """as_dict() must return all expected keys."""
+        result = compute_metrics("abc", "abc")
+        d = result.as_dict()
+        for key in ["cer", "cer_nfc", "cer_caseless", "wer", "wer_normalized", "mer", "wil"]:
+            assert key in d
+    def test_cer_percent_property(self):
+        result = compute_metrics("abcd", "abce")
+        assert result.cer_percent == pytest.approx(25.0, rel=1e-2)
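
The expected values in these tests follow from the usual definition of CER: edit distance between reference and hypothesis, divided by the reference length. Picarones computes its metrics through jiwer; the sketch below is a plain dynamic-programming stand-in, written only to make the arithmetic visible. The function names and the empty-reference convention (1.0 for a non-empty hypothesis, 0.0 when both are empty) are assumptions chosen to match what the tests above assert.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; empty references follow the convention stated above."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def cer_nfc(reference: str, hypothesis: str) -> float:
    """CER after normalizing both strings to Unicode NFC."""
    return cer(
        unicodedata.normalize("NFC", reference),
        unicodedata.normalize("NFC", hypothesis),
    )


raw_cer = cer("abcd", "abce")               # one substitution over four characters
raw_accent = cer("\u00e9", "e\u0301")       # composed vs decomposed é, no normalization
nfc_accent = cer_nfc("\u00e9", "e\u0301")   # identical once both are NFC
```

The accent case shows why a separate NFC variant is useful: without normalization, two visually identical strings are counted as errors.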
+class TestAggregateMetrics:
+    """Tests of aggregate_metrics."""
+    def _make_result(self, cer: float) -> MetricsResult:
+        return MetricsResult(
+            cer=cer, cer_nfc=cer, cer_caseless=cer,
+            wer=cer, wer_normalized=cer, mer=cer, wil=cer,
+            reference_length=100,
+            hypothesis_length=100,
+        )
+    def test_empty_list(self):
+        assert aggregate_metrics([]) == {}
+    def test_single_result(self):
+        results = [self._make_result(0.1)]
+        agg = aggregate_metrics(results)
+        assert agg["cer"]["mean"] == pytest.approx(0.1)
+        assert agg["cer"]["min"] == pytest.approx(0.1)
+        assert agg["cer"]["max"] == pytest.approx(0.1)
+    def test_multiple_results(self):
+        results = [self._make_result(0.1), self._make_result(0.3)]
+        agg = aggregate_metrics(results)
+        assert agg["cer"]["mean"] == pytest.approx(0.2)
+        assert agg["document_count"] == 2
+        assert agg["failed_count"] == 0
+    def test_failed_results_excluded(self):
+        ok = self._make_result(0.1)
+        failed = MetricsResult(
+            cer=1.0, cer_nfc=1.0, cer_caseless=1.0,
+            wer=1.0, wer_normalized=1.0, mer=1.0, wil=1.0,
+            reference_length=50, hypothesis_length=0,
+            error="Moteur en erreur",
+        )
+        agg = aggregate_metrics([ok, failed])
+        # Aggregated metrics only include results without errors
+        assert agg["cer"]["mean"] == pytest.approx(0.1)
+        assert agg["failed_count"] == 1
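
The aggregation tests describe a summary in which failed documents are counted but left out of the statistics. The sketch below shows one possible shape for such a function, reduced to the CER alone. It is not the code of `aggregate_metrics`, which is outside this diff; `aggregate_cer`, the `(cer, error)` tuple input, and the choice to count failed documents in `document_count` are assumptions.

```python
import statistics
from typing import Optional


def aggregate_cer(results: list[tuple[float, Optional[str]]]) -> dict:
    """Summarize (cer, error) pairs, excluding failed entries from the statistics."""
    if not results:
        return {}
    ok = [cer for cer, error in results if error is None]
    summary: dict = {
        "document_count": len(results),  # assumption: failures are included here
        "failed_count": len(results) - len(ok),
    }
    if ok:
        summary["cer"] = {
            "mean": statistics.mean(ok),
            "median": statistics.median(ok),
            "min": min(ok),
            "max": max(ok),
            # stdev needs at least two samples
            "stdev": statistics.stdev(ok) if len(ok) > 1 else 0.0,
        }
    return summary


agg = aggregate_cer([(0.1, None), (0.3, None), (1.0, "engine error")])
```

Excluding failures from the mean means an engine's average reflects only the documents it actually processed, so the failure count should be read alongside it.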

tests/test_results.py ADDED

	@@ -0,0 +1,124 @@

+"""Unit tests for picarones.core.results."""
+import json
+import pytest
+from pathlib import Path
+from picarones.core.metrics import MetricsResult
+from picarones.core.results import BenchmarkResult, DocumentResult, EngineReport
+def _make_metrics(cer: float = 0.05) -> MetricsResult:
+    return MetricsResult(
+        cer=cer, cer_nfc=cer, cer_caseless=cer,
+        wer=cer * 2, wer_normalized=cer * 2, mer=cer, wil=cer,
+        reference_length=200, hypothesis_length=195,
+    )
+def _make_document_result(doc_id: str = "doc1", cer: float = 0.05) -> DocumentResult:
+    return DocumentResult(
+        doc_id=doc_id,
+        image_path=f"/corpus/{doc_id}.png",
+        ground_truth="Texte de référence médiéval.",
+        hypothesis="Texte de référence médiéval.",
+        metrics=_make_metrics(cer),
+        duration_seconds=1.23,
+    )
+def _make_engine_report(name: str = "tesseract", n_docs: int = 3) -> EngineReport:
+    docs = [_make_document_result(f"doc{i}", cer=0.03 * i) for i in range(1, n_docs + 1)]
+    return EngineReport(
+        engine_name=name,
+        engine_version="5.3.0",
+        engine_config={"lang": "fra"},
+        document_results=docs,
+    )
+class TestDocumentResult:
+    def test_as_dict_keys(self):
+        dr = _make_document_result()
+        d = dr.as_dict()
+        for key in ["doc_id", "image_path", "ground_truth", "hypothesis", "metrics", "duration_seconds"]:
+            assert key in d
+    def test_metrics_serialized(self):
+        dr = _make_document_result(cer=0.1)
+        d = dr.as_dict()
+        assert d["metrics"]["cer"] == pytest.approx(0.1)
+class TestEngineReport:
+    def test_aggregation_computed(self):
+        report = _make_engine_report(n_docs=3)
+        assert report.aggregated_metrics != {}
+        assert "cer" in report.aggregated_metrics
+    def test_mean_cer(self):
+        report = _make_engine_report(n_docs=3)
+        # docs with cer=0.03, 0.06, 0.09 → mean=0.06
+        assert report.mean_cer == pytest.approx(0.06, rel=1e-2)
+    def test_as_dict_structure(self):
+        report = _make_engine_report()
+        d = report.as_dict()
+        assert d["engine_name"] == "tesseract"
+        assert len(d["document_results"]) == 3
+class TestBenchmarkResult:
+    def _make_benchmark(self) -> BenchmarkResult:
+        return BenchmarkResult(
+            corpus_name="Test corpus",
+            corpus_source="/corpus/",
+            document_count=3,
+            engine_reports=[
+                _make_engine_report("tesseract"),
+                _make_engine_report("pero_ocr"),
+            ],
+        )
+    def test_ranking_sorted_by_cer(self):
+        bm = self._make_benchmark()
+        ranking = bm.ranking()
+        assert len(ranking) == 2
+        cers = [e["mean_cer"] for e in ranking]
+        assert cers == sorted(cers)
+    def test_to_json_writes_file(self, tmp_path):
+        bm = self._make_benchmark()
+        out = tmp_path / "results.json"
+        bm.to_json(out)
+        assert out.exists()
+        with out.open() as f:
+            data = json.load(f)
+        assert data["corpus"]["name"] == "Test corpus"
+    def test_to_json_creates_parent_dirs(self, tmp_path):
+        bm = self._make_benchmark()
+        out = tmp_path / "deep" / "nested" / "results.json"
+        bm.to_json(out)
+        assert out.exists()
+    def test_from_json_round_trip(self, tmp_path):
+        bm = self._make_benchmark()
+        out = tmp_path / "results.json"
+        bm.to_json(out)
+        loaded = BenchmarkResult.from_json(out)
+        assert loaded["corpus"]["name"] == "Test corpus"
+        assert len(loaded["engine_reports"]) == 2
+    def test_as_dict_has_version(self):
+        bm = self._make_benchmark()
+        d = bm.as_dict()
+        assert "picarones_version" in d
+        assert "run_date" in d
+    def test_ranking_has_required_fields(self):
+        bm = self._make_benchmark()
+        for entry in bm.ranking():
+            assert "engine" in entry
+            assert "mean_cer" in entry
+            assert "mean_wer" in entry
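In `_make_benchmark`, both engine reports are built by the same helper with the same CER values, so the ordering assertion in `test_ranking_sorted_by_cer` holds regardless of the sort. The sketch below shows the ordering on distinct values. `rank_by_mean_cer` is a hypothetical helper, and the entry shape (`engine`, `mean_cer`, `mean_wer`) is taken only from the keys the tests check, not from the real `BenchmarkResult.ranking()` code.

```python
from typing import Dict, List


def rank_by_mean_cer(entries: List[Dict]) -> List[Dict]:
    """Return the entries ordered from lowest to highest mean CER."""
    return sorted(entries, key=lambda entry: entry["mean_cer"])


# Hypothetical per-engine summaries with distinct CER values.
entries = [
    {"engine": "pero_ocr", "mean_cer": 0.08, "mean_wer": 0.16},
    {"engine": "tesseract", "mean_cer": 0.06, "mean_wer": 0.12},
]
ranked = rank_by_mean_cer(entries)
print([e["engine"] for e in ranked])  # ['tesseract', 'pero_ocr']
```

A fixture with differing CERs per engine, as in this example, would make the sorted-order assertion meaningful.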