Claude committed 19 days ago

Commit

bd5c812

unverified

1 Parent(s): 2c2bc0f

feat(audit): partial Phase 3, wiring up unfinished features (S2, S4, S6)

Three genuinely unfinished features, surfaced by the code-quality
audit, are now unblocked. Rather than deleting "unused" code, we
wire up what has real product value.

**3.4 (S4): LLM over-normalization aggregated corpus-wide**

``aggregate_over_normalization`` existed, but:
- it had no ``@register_corpus_aggregator``, so the hooks never ran it
- the module was not imported by ``evaluation/metrics/__init__.py``
(only mentioned in a docstring)
- ``synthetic.py`` reimplemented the aggregation by hand

Clean wiring:
- Added the decorated hook ``_aggregate_over_normalization_hook``
(profiles ``philological``, ``diagnostics``, ``full``), which extracts
the data from ``DocumentResult.pipeline_metadata["over_normalization"]``
and delegates to the pure function (backward compatibility preserved).
- New field ``EngineReport.aggregated_over_normalization`` plus a
JSON round-trip through ``as_dict``/``from_dict``.
- Helper ``_from_metadata_dict`` rebuilds
``OverNormalizationResult`` from the stored dict and handles
type errors with ``logger.warning("[over_normalization]...")``.
- Module added to ``evaluation/metrics/__init__.py`` to trigger
self-registration on import.

Tests: ``test_over_normalization_hook.py`` (8 tests) covering registry,
profiles, pure function, hook, malformed dict, JSON round-trip.

**3.5 (S6): live tesseract test, marker guard**

The audit had flagged ``test_tesseract_live.py`` as an "unconditional
top-level skip". On inspection, the skip is in fact **conditional**
(``if shutil.which("tesseract") is None``) and the
``@pytest.mark.live`` marker is present. There is no bug; the audit
was wrong. We still add a safeguard so that a new test function in
``tests/integration/live/`` cannot forget the marker (a missing marker
makes the test run in standard CI, where it breaks without an API key).

Tests: ``test_live_test_markers.py`` (2 tests) doing an AST scan of the
top-level ``test_*`` functions in ``tests/integration/live/``; it fails
if ``@pytest.mark.live`` is missing.
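
To make the AST scan concrete, here is a small self-contained sketch of the same idea applied to an in-memory source string. The helper name ``unmarked_tests`` and the ``SAMPLE`` module are invented for the illustration; the committed test works on files under ``tests/integration/live/`` instead.

```python
import ast


def unmarked_tests(source: str, marker: str = "live") -> list[str]:
    """Return names of top-level test_* functions lacking @pytest.mark.<marker>."""
    missing = []
    for node in ast.parse(source).body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if not node.name.startswith("test_"):
            continue
        found = False
        for deco in node.decorator_list:
            # Unwrap @pytest.mark.live(...) to its callee before checking.
            target = deco.func if isinstance(deco, ast.Call) else deco
            if isinstance(target, ast.Attribute) and target.attr == marker:
                found = True
        if not found:
            missing.append(node.name)
    return missing


SAMPLE = '''
import pytest

@pytest.mark.live
def test_marked():
    pass

def test_forgotten():
    pass

def helper():
    pass
'''

print(unmarked_tests(SAMPLE))  # prints ['test_forgotten']
```

Like the committed test, this only inspects decorators on top-level functions, so a module-level ``pytestmark`` or tests defined inside classes would not be recognized.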

**3.2 (S2): importer fallback log**

The narrative detector ``IMPORTER_FALLBACK_TRIGGERED`` was already
written (history.py:280) and expected ``benchmark_data["importer_fallbacks"]``,
but the intermediate wiring was missing:

1. ``HTRUnitedCatalogue.from_remote``, on a DNS/network failure,
logged a warning but did not call ``record_fallback`` (whereas
HuggingFace and ``_parse_yml_catalogue`` did). Added the call
plus ``extra={"url": _CATALOGUE_URL, "fallback_used": "demo"}``.
2. ``app/services/benchmark_runner.py``: 2 sites that build a
``BenchmarkResult`` (``_run_benchmark_unified``,
``_run_benchmark_with_partial``), neither of which consumed the log.
Added ``consume_fallback_log()`` at the end of the run and stored
the result in ``BenchmarkResult.metadata["importer_fallbacks"]``.
3. ``reports/html/data.build_report_data`` did not propagate
``metadata.importer_fallbacks`` into ``report_data``. Added
the ``importer_fallbacks`` key (empty list if there is nothing).

Result: the pipeline now works end to end. An HTR-United fallback
to demo mode now appears in the narrative synthesis of the HTML
report, with traceability (remote URL + failure reason).

Tests: ``test_importer_fallback_wiring.py`` (8 E2E tests), from
``record_fallback`` through to the ``Fact`` rendered in the prose of
``build_synthesis``. Regression covered: if
``HTRUnitedCatalogue.from_remote`` forgets to call ``record_fallback``
in its except block, the test ``test_htr_united_fallback_records_entry``
fails.
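
The ``_fallback_log`` module itself is not part of this diff. As a rough sketch of how such a journal could behave, assuming a module-level list guarded by a lock and errors serialized with ``repr`` (both are guesses; only the function names and keyword arguments come from the tests in this commit):

```python
import threading
from typing import Any, Optional

_LOCK = threading.Lock()
_ENTRIES: list[dict[str, Any]] = []


def record_fallback(
    importer: str,
    operation: str,
    error: Optional[BaseException] = None,
    extra: Optional[dict] = None,
) -> None:
    """Append one incident to the process-wide journal."""
    entry = {
        "importer": importer,
        "operation": operation,
        # Assumption: the error is stored as a string so the entry stays JSON-friendly.
        "error": repr(error) if error is not None else None,
        "extra": dict(extra or {}),
    }
    with _LOCK:
        _ENTRIES.append(entry)


def consume_fallback_log() -> list[dict[str, Any]]:
    """Return all recorded incidents and empty the journal."""
    with _LOCK:
        drained = list(_ENTRIES)
        _ENTRIES.clear()
    return drained


# Importer side: a degraded fetch records the incident.
record_fallback(
    importer="htr_united",
    operation="catalogue_remote_fetch",
    error=RuntimeError("DNS timeout"),
    extra={"fallback_used": "demo"},
)

# Runner side: drain the journal once, at the end of the run.
fallbacks = consume_fallback_log()
metadata = {"importer_fallbacks": fallbacks} if fallbacks else {}
```

Draining rather than peeking is what keeps one run's incidents from leaking into the next run's report.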

**Summary**

Suite: 4,750 passed, 16 skipped, 8 deselected, 2 xfailed.
+18 tests vs Phase 2 (8+2+8). Ruff clean, sync-counters CI green,
auto-incremented to 4,750 (consistent with the CLAUDE.md/README.md prose).

Phase 3 is partial: sub-phases 3.1 (backend pure-Python robustness)
and 3.3 (exposing NormalizationProfile.from_yaml via CLI/API) remain.

Files changed (9)

picarones/adapters/corpus/htr_united.py +10 -1
picarones/app/services/benchmark_runner.py +22 -0
picarones/evaluation/benchmark_result.py +16 -0
picarones/evaluation/metrics/__init__.py +2 -0
picarones/evaluation/metrics/over_normalization.py +80 -1
picarones/reports/html/data/__init__.py +5 -0
tests/architecture/test_live_test_markers.py +83 -0
tests/evaluation/metrics/test_over_normalization_hook.py +217 -0
tests/integration/test_importer_fallback_wiring.py +196 -0

picarones/adapters/corpus/htr_united.py CHANGED

@@ -259,12 +259,21 @@ class HTRUnitedCatalogue:
             entries = _parse_yml_catalogue(raw)
             return cls(entries, source="remote")
         except (urllib.error.URLError, Exception) as exc:
-            # Demo fallback with a warning
             logger.warning(
                 "[HTR-United] impossible de charger le catalogue distant (%s) : %s. "
                 "Utilisation des données de démonstration.",
                 _CATALOGUE_URL, exc,
             )
             return cls.from_demo()
     def search(

             entries = _parse_yml_catalogue(raw)
             return cls(entries, source="remote")
         except (urllib.error.URLError, Exception) as exc:
+            # Demo fallback with a warning.  Phase 3.2 of the
+            # code-quality audit: record the incident for the
+            # narrative detector ``IMPORTER_FALLBACK_TRIGGERED``.
             logger.warning(
                 "[HTR-United] impossible de charger le catalogue distant (%s) : %s. "
                 "Utilisation des données de démonstration.",
                 _CATALOGUE_URL, exc,
             )
+            from picarones.adapters.corpus._fallback_log import record_fallback
+            record_fallback(
+                importer="htr_united",
+                operation="catalogue_remote_fetch",
+                error=exc,
+                extra={"url": _CATALOGUE_URL, "fallback_used": "demo"},
+            )
             return cls.from_demo()
     def search(

picarones/app/services/benchmark_runner.py CHANGED

@@ -372,11 +372,25 @@ def run_result_to_benchmark_result(
             ),
         )
     return BenchmarkResult(
         corpus_name=corpus.name,
         corpus_source=str(corpus.source_path) if corpus.source_path else None,
         document_count=len(documents),
         engine_reports=engine_reports,
     )
@@ -1532,11 +1546,19 @@ def _run_benchmark_with_partial(
         # ``all_doc_results``.
         _delete_partial(partial_path)
     return BenchmarkResult(
         corpus_name=corpus.name,
         corpus_source=str(corpus.source_path) if corpus.source_path else None,
         document_count=len(corpus.documents),
         engine_reports=engine_reports,
     )

             ),
         )
+    # Phase 3.2 of the code-quality audit: consume the importer
+    # fallback log (HTR-United, HuggingFace, etc.).  The list is
+    # emptied at the end of the benchmark so that the next run does
+    # not inherit the previous run's incidents.  The narrative detector
+    # ``IMPORTER_FALLBACK_TRIGGERED`` (history.py:280) reads
+    # ``benchmark_data["importer_fallbacks"]``, propagated by
+    # ``build_report_data``.
+    from picarones.adapters.corpus._fallback_log import consume_fallback_log
+    fallbacks = consume_fallback_log()
+    metadata: dict[str, Any] = {}
+    if fallbacks:
+        metadata["importer_fallbacks"] = fallbacks
     return BenchmarkResult(
         corpus_name=corpus.name,
         corpus_source=str(corpus.source_path) if corpus.source_path else None,
         document_count=len(documents),
         engine_reports=engine_reports,
+        metadata=metadata,
     )
         # ``all_doc_results``.
         _delete_partial(partial_path)
+    # Phase 3.2 of the code-quality audit: see _run_benchmark_unified.
+    from picarones.adapters.corpus._fallback_log import consume_fallback_log
+    fallbacks = consume_fallback_log()
+    metadata: dict[str, Any] = {}
+    if fallbacks:
+        metadata["importer_fallbacks"] = fallbacks
     return BenchmarkResult(
         corpus_name=corpus.name,
         corpus_source=str(corpus.source_path) if corpus.source_path else None,
         document_count=len(corpus.documents),
         engine_reports=engine_reports,
+        metadata=metadata,
     )

picarones/evaluation/benchmark_result.py CHANGED

@@ -364,6 +364,17 @@ class EngineReport:
     delta_median, delta_min, delta_max, n_over_normalized,
     n_under_normalized, over_normalized_rate}``.  ``None`` if
     no document had ``readability_metrics``."""
     def __post_init__(self) -> None:
         if not self.aggregated_metrics and self.document_results:
@@ -450,6 +461,8 @@ class EngineReport:
             )
         if self.aggregated_readability is not None:
             d["aggregated_readability"] = self.aggregated_readability
         return d
     @classmethod
@@ -487,6 +500,9 @@ class EngineReport:
                 "aggregated_numerical_sequences",
             ),
             aggregated_readability=data.get("aggregated_readability"),
         )

     delta_median, delta_min, delta_max, n_over_normalized,
     n_under_normalized, over_normalized_rate}``.  ``None`` if
     no document had ``readability_metrics``."""
+    # Phase 3.4 of the code-quality audit (2026-05): wiring of
+    # ``aggregate_over_normalization`` (class 10 of the taxonomy).
+    aggregated_over_normalization: Optional[dict] = None
+    """LLM over-normalization aggregated corpus-wide.
+    Format ``{score, total_correct_ocr_words, over_normalized_count,
+    document_count}``, produced by
+    :func:`picarones.evaluation.metrics.over_normalization.aggregate_over_normalization`.
+    ``None`` if no document carried a
+    ``pipeline_metadata["over_normalization"]`` entry (the case of an
+    OCR-only benchmark with no LLM step)."""
     def __post_init__(self) -> None:
         if not self.aggregated_metrics and self.document_results:
             )
         if self.aggregated_readability is not None:
             d["aggregated_readability"] = self.aggregated_readability
+        if self.aggregated_over_normalization is not None:
+            d["aggregated_over_normalization"] = self.aggregated_over_normalization
         return d
     @classmethod
                 "aggregated_numerical_sequences",
             ),
             aggregated_readability=data.get("aggregated_readability"),
+            aggregated_over_normalization=data.get(
+                "aggregated_over_normalization",
+            ),
         )
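
The serialization pattern in this hunk (emit the key only when the field is set, read it back with ``dict.get``) can be tried in isolation. ``MiniReport`` below is a hypothetical two-field stand-in for ``EngineReport``, which has many more fields than shown here.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class MiniReport:
    engine_name: str
    aggregated_over_normalization: Optional[dict] = None

    def as_dict(self) -> dict:
        d: dict = {"engine_name": self.engine_name}
        # Emit the key only when there is data, as the hunk above does.
        if self.aggregated_over_normalization is not None:
            d["aggregated_over_normalization"] = self.aggregated_over_normalization
        return d

    @classmethod
    def from_dict(cls, data: dict) -> "MiniReport":
        return cls(
            engine_name=data["engine_name"],
            # .get() yields None for payloads written before the field existed.
            aggregated_over_normalization=data.get("aggregated_over_normalization"),
        )


report = MiniReport("tesseract+ministral", {"score": 0.15, "document_count": 5})
rebuilt = MiniReport.from_dict(json.loads(json.dumps(report.as_dict())))

# A payload without the key still loads, with the field left at None.
legacy = MiniReport.from_dict({"engine_name": "tesseract"})
```

With this shape, reports serialized before the field was introduced remain loadable, and OCR-only reports do not gain an empty key.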

picarones/evaluation/metrics/__init__.py CHANGED

@@ -57,6 +57,7 @@ from picarones.evaluation.metrics import (  # noqa: F401
     longitudinal,
     marginal_cost,
     module_policy,
     pricing,
     rare_tokens,
     robustness_projection,
@@ -83,6 +84,7 @@ __all__ = [
     "longitudinal",
     "marginal_cost",
     "module_policy",
     "pricing",
     "rare_tokens",
     "robustness_projection",

     longitudinal,
     marginal_cost,
     module_policy,
+    over_normalization,
     pricing,
     rare_tokens,
     robustness_projection,
     "longitudinal",
     "marginal_cost",
     "module_policy",
+    "over_normalization",
     "pricing",
     "rare_tokens",
     "robustness_projection",

picarones/evaluation/metrics/over_normalization.py CHANGED

@@ -22,9 +22,19 @@ la graphie originale.
 from __future__ import annotations
 from dataclasses import dataclass, field
 from typing import Optional
 @dataclass
 class OverNormalizationResult:
@@ -111,7 +121,16 @@ def detect_over_normalization(
 def aggregate_over_normalization(results: list[Optional[OverNormalizationResult]]) -> dict:
-    """Aggregates over-normalization results across a set of documents."""
     valid = [r for r in results if r is not None]
     if not valid:
         return {"score": None, "total_correct_ocr_words": 0, "over_normalized_count": 0}
@@ -126,3 +145,63 @@ def aggregate_over_normalization(results: list[Optional[OverNormalizationResult]
         "over_normalized_count": total_over,
         "document_count": len(valid),
     }

 from __future__ import annotations
+import logging
 from dataclasses import dataclass, field
 from typing import Optional
+from picarones.evaluation.metric_hooks import (
+    PROFILE_DIAGNOSTICS,
+    PROFILE_FULL,
+    PROFILE_PHILOLOGICAL,
+    register_corpus_aggregator,
+)
+logger = logging.getLogger(__name__)
 @dataclass
 class OverNormalizationResult:
 def aggregate_over_normalization(results: list[Optional[OverNormalizationResult]]) -> dict:
+    """Aggregates over-normalization results across a set of documents.
+    Pure utility function: it directly receives a list of
+    :class:`OverNormalizationResult` (typically the return value of
+    :func:`detect_over_normalization`).  To aggregate from a list of
+    :class:`DocumentResult` produced by a benchmark, the decorated
+    hook :func:`_aggregate_over_normalization_hook` (self-registered)
+    extracts the data from
+    ``dr.pipeline_metadata["over_normalization"]``.
+    """
     valid = [r for r in results if r is not None]
     if not valid:
         return {"score": None, "total_correct_ocr_words": 0, "over_normalized_count": 0}
         "over_normalized_count": total_over,
         "document_count": len(valid),
     }
+# ---------------------------------------------------------------------------
+# Corpus-level aggregation hook (Phase 3.4 of the code-quality audit, 2026-05)
+# ---------------------------------------------------------------------------
+#
+# The ``detect_over_normalization`` computation is wired in upstream
+# (synthetic + real OCR+LLM pipelines) and stores its result in
+# ``dr.pipeline_metadata["over_normalization"]`` (already as a dict,
+# via ``OverNormalizationResult.as_dict()``).  The hook below extracts
+# it and calls the pure aggregator; the returned value feeds the
+# ``EngineReport.aggregated_over_normalization`` attribute.
+#
+# Profiles: available for ``philological`` (fine-grained LLM analysis),
+# ``diagnostics`` (pipeline audit) and ``full``.
+def _from_metadata_dict(meta: Optional[dict]) -> Optional[OverNormalizationResult]:
+    """Rebuilds an :class:`OverNormalizationResult` from the dict
+    stored in ``pipeline_metadata`` (``as_dict()`` form)."""
+    if not isinstance(meta, dict):
+        return None
+    try:
+        return OverNormalizationResult(
+            total_correct_ocr_words=int(meta.get("total_correct_ocr_words", 0)),
+            over_normalized_count=int(meta.get("over_normalized_count", 0)),
+            over_normalized_passages=list(meta.get("over_normalized_passages", []) or []),
+        )
+    except (TypeError, ValueError) as exc:
+        logger.warning(
+            "[over_normalization] dict metadata mal formé, ignoré : %s", exc,
+        )
+        return None
+@register_corpus_aggregator(
+    name="over_normalization",
+    attribute="aggregated_over_normalization",
+    profiles=(PROFILE_PHILOLOGICAL, PROFILE_DIAGNOSTICS, PROFILE_FULL),
+)
+def _aggregate_over_normalization_hook(doc_results: list) -> Optional[dict]:
+    """Corpus-level aggregator, self-registered.
+    Extracts ``pipeline_metadata["over_normalization"]`` from each
+    document, rebuilds an :class:`OverNormalizationResult`, and
+    delegates to :func:`aggregate_over_normalization` (pure logic).
+    Returns ``None`` if no document had any data, in which case no
+    attribute is added to the :class:`EngineReport`.
+    """
+    extracted = [
+        _from_metadata_dict(
+            getattr(dr, "pipeline_metadata", {}).get("over_normalization")
+            if hasattr(dr, "pipeline_metadata")
+            else None
+        )
+        for dr in doc_results
+    ]
+    if not any(r is not None for r in extracted):
+        return None
+    return aggregate_over_normalization(extracted)
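
A worked example of the aggregation arithmetic may help here. The tests in this commit expect the corpus score to be the summed over-normalized count divided by the summed correct-OCR word count (20 / 150 for the two documents below), which differs from averaging the per-document scores. The function below is a standalone sketch operating on plain dicts; its name and its handling of a zero denominator are assumptions, since the middle of ``aggregate_over_normalization`` is not shown in the diff.

```python
from typing import Optional


def pooled_over_normalization(docs: list[Optional[dict]]) -> dict:
    """Pool raw counters across documents, then compute a single ratio."""
    valid = [d for d in docs if d is not None]
    if not valid:
        return {"score": None, "total_correct_ocr_words": 0, "over_normalized_count": 0}
    total_correct = sum(d["total_correct_ocr_words"] for d in valid)
    total_over = sum(d["over_normalized_count"] for d in valid)
    return {
        # Assumption: no score when there are no correct OCR words to normalize.
        "score": total_over / total_correct if total_correct else None,
        "total_correct_ocr_words": total_correct,
        "over_normalized_count": total_over,
        "document_count": len(valid),
    }


docs = [
    {"total_correct_ocr_words": 100, "over_normalized_count": 10},  # per-doc score 0.1
    {"total_correct_ocr_words": 50, "over_normalized_count": 10},   # per-doc score 0.2
    None,  # document without an LLM step, ignored
]

pooled = pooled_over_normalization(docs)["score"]  # 20 / 150, about 0.133
mean_of_scores = (10 / 100 + 10 / 50) / 2          # about 0.15
```

Pooling weights each document by its number of correct OCR words, so a short document with a high rate does not dominate the corpus figure the way it would in a plain mean of scores.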

picarones/reports/html/data/__init__.py CHANGED

@@ -126,6 +126,11 @@ def build_report_data(
         "taxonomy_intra_doc": compute_taxonomy_intra_doc_section(benchmark),
         # Sprint 91 (A.II.6): marginal cost matrix between pairs of engines.
         "marginal_cost": compute_marginal_cost_section(engines_summary),
     }

         "taxonomy_intra_doc": compute_taxonomy_intra_doc_section(benchmark),
         # Sprint 91 (A.II.6): marginal cost matrix between pairs of engines.
         "marginal_cost": compute_marginal_cost_section(engines_summary),
+        # Phase 3.2 of the code-quality audit: importer incidents
+        # (HTR-United demo-mode fallback, HuggingFace search fallback,
+        # etc.) propagated to the narrative detector
+        # ``IMPORTER_FALLBACK_TRIGGERED``.  Empty list if no fallback occurred.
+        "importer_fallbacks": (benchmark.metadata or {}).get("importer_fallbacks", []),
     }

tests/architecture/test_live_test_markers.py ADDED

	@@ -0,0 +1,83 @@

+"""Phase 3.5 of the code-quality audit: tests in
+``tests/integration/live/`` must carry the ``@pytest.mark.live``
+marker on **each** of their test functions.
+Context: ``pyproject.toml`` declares the ``live`` marker as
+"integration tests against a real API/binary (Tesseract,
+Anthropic, OpenAI, Mistral); excluded by default, opt-in via
+``pytest -m live``".  The filter ``addopts = '-m "not live and not
+network"'`` deselects them in the default runner.
+If a function in ``tests/integration/live/`` forgets the marker,
+it runs during the standard ``pytest tests/`` and:
+- fails on runners without the cloud dependency (false CI failure);
+- consumes API quota (a key in CI means a surprise bill);
+- introduces an undocumented network dependency.
+The audit agent had flagged ``test_tesseract_live.py`` as an
+"unconditional top-level skip".  On inspection, the skip is in
+fact **conditional** (``if shutil.which("tesseract") is None``),
+which is legitimate: a live test that can only run if the binary
+is present.  The safeguard below nevertheless prevents a new test
+function from forgetting the marker.
+"""
+from __future__ import annotations
+import ast
+from pathlib import Path
+import pytest
+LIVE_DIR = Path(__file__).resolve().parents[1] / "integration" / "live"
+def _test_functions(path: Path) -> list[tuple[str, ast.FunctionDef | ast.AsyncFunctionDef]]:
+    """Lists the top-level ``test_*`` functions of a file."""
+    tree = ast.parse(path.read_text(encoding="utf-8"))
+    out: list[tuple[str, ast.FunctionDef | ast.AsyncFunctionDef]] = []
+    for node in tree.body:
+        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name.startswith("test_"):
+            out.append((node.name, node))
+    return out
+def _has_live_marker(fn: ast.FunctionDef | ast.AsyncFunctionDef) -> bool:
+    for deco in fn.decorator_list:
+        # ``@pytest.mark.live`` or ``@pytest.mark.live(reason=...)``
+        if isinstance(deco, ast.Attribute) and deco.attr == "live":
+            return True
+        if isinstance(deco, ast.Call) and isinstance(deco.func, ast.Attribute) and deco.func.attr == "live":
+            return True
+    return False
+def _live_test_files() -> list[Path]:
+    if not LIVE_DIR.exists():
+        return []
+    return [
+        p for p in sorted(LIVE_DIR.glob("test_*.py"))
+        if p.name != "__init__.py" and p.name != "conftest.py"
+    ]
+@pytest.mark.parametrize("path", _live_test_files(), ids=lambda p: p.name)
+def test_every_function_in_live_dir_has_live_marker(path: Path) -> None:
+    """Every ``test_*`` in ``tests/integration/live/`` carries ``@pytest.mark.live``.
+    Otherwise the test may run in standard CI and break because an
+    API key or external binary is missing.
+    """
+    missing: list[str] = []
+    for name, fn in _test_functions(path):
+        if not _has_live_marker(fn):
+            missing.append(f"  {path.name}:{fn.lineno} :: {name}")
+    assert not missing, (
+        f"Fonctions dans {LIVE_DIR.name}/ sans ``@pytest.mark.live`` :\n"
+        + "\n".join(missing)
+        + "\n\nAjouter ``@pytest.mark.live`` au-dessus de chaque test "
+        "qui hit une API/un binaire externe — sinon le test "
+        "s'exécute sans opt-in et peut faire échouer le CI standard."
+    )

tests/evaluation/metrics/test_over_normalization_hook.py ADDED

	@@ -0,0 +1,217 @@

+"""Phase 3.4 of the code-quality audit: LLM over-normalization is
+now aggregated automatically through the
+:mod:`picarones.evaluation.metric_hooks` registry.
+Before Phase 3.4, ``aggregate_over_normalization`` existed in
+``picarones/evaluation/metrics/over_normalization.py`` but:
+- had no ``@register_corpus_aggregator``;
+- the module was not even imported by ``evaluation/metrics/__init__.py``
+  (mentioned in a docstring only);
+- ``synthetic.py`` reimplemented the aggregation manually
+  (silent duplication).
+The ``_aggregate_over_normalization_hook`` hook (self-registered)
+now extracts the data from
+``DocumentResult.pipeline_metadata["over_normalization"]`` and
+feeds ``EngineReport.aggregated_over_normalization`` for the
+``philological``, ``diagnostics`` and ``full`` profiles.
+"""
+from __future__ import annotations
+from picarones.evaluation.benchmark_result import DocumentResult, EngineReport
+from picarones.evaluation.metric_hooks import (
+    PROFILE_DIAGNOSTICS,
+    PROFILE_FULL,
+    PROFILE_MINIMAL,
+    PROFILE_PHILOLOGICAL,
+    PROFILE_STANDARD,
+    _all_corpus_aggregator_names,
+    run_corpus_aggregators,
+    select_corpus_aggregators,
+)
+from picarones.evaluation.metric_result import MetricsResult
+from picarones.evaluation.metrics.over_normalization import (
+    OverNormalizationResult,
+    aggregate_over_normalization,
+)
+# --------------------------------------------------------------------------
+# Self-registration
+# --------------------------------------------------------------------------
+def test_over_normalization_aggregator_is_registered() -> None:
+    """Importing ``picarones.evaluation.metrics`` must trigger the
+    registration of the ``over_normalization`` aggregator."""
+    import picarones.evaluation.metrics  # noqa: F401  # triggers registration
+    assert "over_normalization" in _all_corpus_aggregator_names(), (
+        "Le hook ``_aggregate_over_normalization_hook`` n'est pas "
+        "enregistré.  Vérifier que ``over_normalization`` est dans "
+        "``picarones/evaluation/metrics/__init__.py`` (Phase 3.4)."
+    )
+def test_aggregator_in_correct_profiles() -> None:
+    """The aggregator must be active for ``philological``,
+    ``diagnostics`` and ``full``, but not for ``minimal`` or ``standard``."""
+    import picarones.evaluation.metrics  # noqa: F401
+    for profile in (PROFILE_PHILOLOGICAL, PROFILE_DIAGNOSTICS, PROFILE_FULL):
+        names = [a.name for a in select_corpus_aggregators(profile)]
+        assert "over_normalization" in names, (
+            f"Profil ``{profile}`` n'inclut pas l'agrégateur over_normalization."
+        )
+    for profile in (PROFILE_MINIMAL, PROFILE_STANDARD):
+        names = [a.name for a in select_corpus_aggregators(profile)]
+        assert "over_normalization" not in names, (
+            f"Profil ``{profile}`` ne devrait pas inclure over_normalization."
+        )
+# --------------------------------------------------------------------------
+# Pure function aggregate_over_normalization (backward compatibility)
+# --------------------------------------------------------------------------
+def test_pure_aggregate_empty_list_returns_zero() -> None:
+    """No documents: score is None and counters are zero (backward
+    compatibility of the pure utility function)."""
+    out = aggregate_over_normalization([])
+    assert out == {
+        "score": None,
+        "total_correct_ocr_words": 0,
+        "over_normalized_count": 0,
+    }
+def test_pure_aggregate_sums_counts() -> None:
+    """Aggregation sums the raw counters, then recomputes the score."""
+    r1 = OverNormalizationResult(
+        total_correct_ocr_words=100,
+        over_normalized_count=10,
+    )
+    r2 = OverNormalizationResult(
+        total_correct_ocr_words=50,
+        over_normalized_count=5,
+    )
+    out = aggregate_over_normalization([r1, r2, None])  # None is ignored
+    assert out == {
+        "score": 0.1,  # 15 / 150
+        "total_correct_ocr_words": 150,
+        "over_normalized_count": 15,
+        "document_count": 2,
+    }
+# --------------------------------------------------------------------------
+# Decorated hook: extraction from DocumentResult.pipeline_metadata
+# --------------------------------------------------------------------------
+def _make_dr(
+    doc_id: str,
+    over_norm_dict: dict | None,
+) -> DocumentResult:
+    return DocumentResult(
+        doc_id=doc_id,
+        image_path=f"/tmp/{doc_id}.png",
+        ground_truth="fait",
+        hypothesis="fait",
+        metrics=MetricsResult(cer=0.0, wer=0.0),
+        duration_seconds=1.0,
+        ocr_intermediate="faict",
+        pipeline_metadata=(
+            {"over_normalization": over_norm_dict}
+            if over_norm_dict is not None
+            else {}
+        ),
+    )
+def test_hook_returns_none_when_no_pipeline_metadata() -> None:
+    """OCR-only benchmark (no LLM): no ``pipeline_metadata``, so the
+    hook returns ``None`` and ``aggregated_over_normalization``
+    stays ``None``."""
+    import picarones.evaluation.metrics  # noqa: F401
+    docs = [_make_dr("d1", None), _make_dr("d2", None)]
+    out = run_corpus_aggregators(PROFILE_FULL, docs)
+    assert "aggregated_over_normalization" not in out
+def test_hook_aggregates_from_pipeline_metadata() -> None:
+    """OCR+LLM pipeline: ``pipeline_metadata["over_normalization"]``
+    is extracted and aggregated."""
+    import picarones.evaluation.metrics  # noqa: F401
+    docs = [
+        _make_dr("d1", {
+            "score": 0.1,
+            "total_correct_ocr_words": 100,
+            "over_normalized_count": 10,
+            "over_normalized_passages": [],
+        }),
+        _make_dr("d2", {
+            "score": 0.2,
+            "total_correct_ocr_words": 50,
+            "over_normalized_count": 10,
+            "over_normalized_passages": [],
+        }),
+    ]
+    out = run_corpus_aggregators(PROFILE_PHILOLOGICAL, docs)
+    assert "aggregated_over_normalization" in out
+    result = out["aggregated_over_normalization"]
+    # 20 over-normalized / 150 correct OCR = 0.1333
+    assert result["over_normalized_count"] == 20
+    assert result["total_correct_ocr_words"] == 150
+    assert result["document_count"] == 2
+    assert 0.13 < result["score"] < 0.14
+def test_hook_resilient_to_malformed_dict() -> None:
+    """If a document has a malformed
+    ``pipeline_metadata["over_normalization"]`` (missing field,
+    non-castable value), it is skipped with a warning and the
+    aggregator does not fail."""
+    import picarones.evaluation.metrics  # noqa: F401
+    docs = [
+        _make_dr("d1", {"total_correct_ocr_words": 100, "over_normalized_count": 5}),
+        _make_dr("d2", {"total_correct_ocr_words": "garbage", "over_normalized_count": 0}),
+        _make_dr("d3", None),
+    ]
+    out = run_corpus_aggregators(PROFILE_FULL, docs)
+    # d1 is valid, so the aggregator returns a dict even though d2 is ignored
+    assert "aggregated_over_normalization" in out
+    assert out["aggregated_over_normalization"]["over_normalized_count"] == 5
+# --------------------------------------------------------------------------
+# EngineReport serialization
+# --------------------------------------------------------------------------
+def test_engine_report_round_trip_with_over_normalization() -> None:
+    """The ``aggregated_over_normalization`` field is preserved by
+    ``as_dict`` / ``from_dict``."""
+    er = EngineReport(
+        engine_name="tesseract+ministral",
+        engine_version="5.3.0",
+        engine_config={},
+        document_results=[],
+        aggregated_over_normalization={
+            "score": 0.15,
+            "total_correct_ocr_words": 200,
+            "over_normalized_count": 30,
+            "document_count": 5,
+        },
+    )
+    d = er.as_dict()
+    assert d["aggregated_over_normalization"]["score"] == 0.15
+    rebuilt = EngineReport.from_dict(d)
+    assert rebuilt.aggregated_over_normalization == er.aggregated_over_normalization

tests/integration/test_importer_fallback_wiring.py ADDED

	@@ -0,0 +1,196 @@

+"""Phase 3.2 of the code-quality audit: end-to-end test of the fallback log.
+Checks that the whole chain works:
+1. An importer (HTR-United) degrades to demo mode, which triggers
+   ``record_fallback`` on the importer side.
+2. The runner consumes the log via ``consume_fallback_log()`` and stores it in
+   ``BenchmarkResult.metadata["importer_fallbacks"]``.
+3. ``build_report_data`` propagates the list into
+   ``report_data["importer_fallbacks"]``.
+4. The narrative detector ``detect_importer_fallback`` (history.py:280)
+   produces a ``Fact(FactType.IMPORTER_FALLBACK_TRIGGERED, ...)``.
+5. ``build_synthesis`` renders a sentence that mentions the incident.
+Before Phase 3.2, steps 2-3 were missing: the detector never
+received any data even though the ``_fallback_log`` API was wired
+on the importer side.
+"""
+from __future__ import annotations
+import pytest
+from picarones.adapters.corpus._fallback_log import (
+    consume_fallback_log,
+    peek_fallback_log,
+    record_fallback,
+    reset_fallback_log,
+)
+from picarones.domain.facts import FactType
+from picarones.evaluation.benchmark_result import BenchmarkResult
+from picarones.reports.html.data import build_report_data
+from picarones.reports.narrative import build_synthesis
+from picarones.reports.narrative.detectors.history import detect_importer_fallback
+@pytest.fixture(autouse=True)
+def _clean_fallback_log() -> None:
+    """The log is a thread-safe singleton; we empty it before and
+    after each test to avoid cross-contamination."""
+    reset_fallback_log()
+    yield
+    reset_fallback_log()
+# --------------------------------------------------------------------------
+# Step 1: record_fallback is callable and serializes correctly
+# --------------------------------------------------------------------------
+def test_record_fallback_appends_entry() -> None:
+    record_fallback(
+        importer="htr_united",
+        operation="catalogue_remote_fetch",
+        error=RuntimeError("DNS timeout"),
+        extra={"url": "https://example.org/cat.yml"},
+    )
+    entries = peek_fallback_log()
+    assert len(entries) == 1
+    assert entries[0]["importer"] == "htr_united"
+    assert entries[0]["operation"] == "catalogue_remote_fetch"
+    assert "DNS timeout" in entries[0]["error"]
+    assert entries[0]["extra"]["url"] == "https://example.org/cat.yml"
+def test_htr_united_fallback_records_entry(monkeypatch: pytest.MonkeyPatch) -> None:
+    """``HTRUnitedCatalogue.from_remote`` must call ``record_fallback``
+    when the network fails (regression: before Phase 3.2 the warning
+    log was there, but the record was missing)."""
+    import urllib.error
+    from picarones.adapters.corpus.htr_united import HTRUnitedCatalogue
+    def _boom(*_a, **_kw):
+        raise urllib.error.URLError("simulated DNS failure")
+    monkeypatch.setattr(
+        "picarones.adapters.corpus.htr_united.urllib.request.urlopen",
+        _boom,
+    )
+    cat = HTRUnitedCatalogue.from_remote(timeout=1)
+    assert cat.source == "demo"  # fallback took effect
+    entries = peek_fallback_log()
+    assert len(entries) == 1
+    assert entries[0]["importer"] == "htr_united"
+    assert entries[0]["operation"] == "catalogue_remote_fetch"
+    assert entries[0]["extra"]["fallback_used"] == "demo"
+# --------------------------------------------------------------------------
+# Step 4: the narrative detector emits a Fact from the list
+# --------------------------------------------------------------------------
+def test_detector_emits_fact_from_benchmark_data() -> None:
+    benchmark_data = {
+        "importer_fallbacks": [
+            {
+                "importer": "htr_united",
+                "operation": "catalogue_remote_fetch",
+                "error": "URLError(...)",
+                "extra": {"fallback_used": "demo"},
+            },
+        ],
+    }
+    facts = detect_importer_fallback(benchmark_data)
+    assert len(facts) == 1
+    assert facts[0].type is FactType.IMPORTER_FALLBACK_TRIGGERED
+    assert facts[0].payload["importer"] == "htr_united"
+def test_detector_silent_when_no_fallback() -> None:
+    """No key, no Fact."""
+    assert detect_importer_fallback({}) == []
+    assert detect_importer_fallback({"importer_fallbacks": []}) == []
+# --------------------------------------------------------------------------
+# Step 3: build_report_data propagates metadata.importer_fallbacks
+# --------------------------------------------------------------------------
+def _empty_benchmark_with_metadata(metadata: dict) -> BenchmarkResult:
+    """Benchmark with no engine (enough to test the propagation
+    of ``metadata.importer_fallbacks`` to ``report_data``)."""
+    return BenchmarkResult(
+        corpus_name="t",
+        corpus_source=None,
+        document_count=0,
+        engine_reports=[],
+        metadata=metadata,
+    )
+def test_build_report_data_propagates_fallbacks() -> None:
+    bench = _empty_benchmark_with_metadata({
+        "importer_fallbacks": [
+            {"importer": "htr_united", "operation": "catalogue_remote_fetch",
+             "error": "URLError(timeout)"},
+        ],
+    })
+    data = build_report_data(bench, images_b64={})
+    assert "importer_fallbacks" in data
+    assert len(data["importer_fallbacks"]) == 1
+    assert data["importer_fallbacks"][0]["importer"] == "htr_united"
+def test_build_report_data_empty_when_no_fallback() -> None:
+    bench = _empty_benchmark_with_metadata({})
+    data = build_report_data(bench, images_b64={})
+    assert data["importer_fallbacks"] == []
+# --------------------------------------------------------------------------
+# Step 5: build_synthesis surfaces the incident in the prose
+# --------------------------------------------------------------------------
+def test_build_synthesis_mentions_fallback_in_french() -> None:
+    """The French synthesis must produce at least one text
+    fragment that mentions the importer in degraded mode."""
+    data = {
+        "engines": [],
+        "ranking": [],
+        "importer_fallbacks": [
+            {
+                "importer": "htr_united",
+                "operation": "catalogue_remote_fetch",
+                "error": "URLError(timeout)",
+                "extra": {"fallback_used": "demo"},
+            },
+        ],
+    }
+    out = build_synthesis(data, lang="fr", max_facts=5)
+    # The rendered text must contain at least the importer name.
+    rendered = " ".join(out.get("paragraphs", []) or []) + " " + str(out)
+    assert "htr_united" in rendered.lower() or "htr-united" in rendered.lower(), (
+        f"La synthèse FR ne mentionne pas l'importer HTR-United malgré "
+        f"un fallback enregistré.  Sortie : {out!r}"
+    )
+# --------------------------------------------------------------------------
+# Step 2: consume does empty the list (prevents cross-run contamination)
+# --------------------------------------------------------------------------
+def test_consume_clears_the_log() -> None:
+    record_fallback(importer="a", operation="x")
+    record_fallback(importer="b", operation="y")
+    first = consume_fallback_log()
+    assert len(first) == 2
+    second = consume_fallback_log()
+    assert second == []  # emptied by the first consume