feat(formats): Sprint A14-S9 — formats/alto/, formats/pagexml/, migration normalization
Sprint S9 of the targeted rewrite plan. Phase 2 continues.
ALTO and PAGE become first-class citizens, with parsers tolerant of
namespace versions, typed internal structures, a deterministic writer
(ALTO), and projectors conforming to the S5 protocol.
``normalization.py`` moves to ``picarones/formats/text/``, with a
re-export at the old location so that no consumer breaks.
Modules delivered
-----------------
``picarones/formats/alto/``
- ``types.py``: ``AltoDocument``, ``AltoPage``, ``AltoTextBlock``,
  ``AltoLine``, ``AltoString``, ``AltoBBox``. Frozen pydantic models.
- ``parser.py``: ``parse_alto(xml_bytes)``. Auto-detects v2/v3/v4 or
  no namespace from the namespace of the root element. Hardened with
  ``defusedxml`` (XXE / Billion Laughs blocked). Raises a typed
  ``AltoParseError``.
- ``writer.py``: ``write_alto(doc, version="v4", pretty=False)``.
  Deterministic output (byte-stable round-trip, tested).
- ``projector.py``: ``alto_document_to_text(doc)`` helper plus the
  ``AltoToText`` projector, which conforms to the S5 ``Projector``
  protocol. Hyphenation handling for ``HypPart1`` / ``HypPart2``:
  * SUBS_CONTENT set → the full word is used and HypPart2 is skipped
  * No SUBS_CONTENT → the two parts are concatenated
  * **Cross-line** (HypPart1 at the end of line i, HypPart2 at the
    start of line i+1) is handled through inter-line state within
    the block
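The three hyphenation rules above can be sketched as follows. This is a minimal illustration, not the code shipped in ``projector.py``: the ``Word`` class and ``block_to_text`` function are hypothetical stand-ins for ``AltoString`` and the block-level extraction, and the sketch assumes that the hyphen character itself is not part of ``content``.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Word:
    """Hypothetical stand-in for an ALTO ``<String>`` (not the project's class)."""

    content: str
    subs_type: Optional[str] = None     # "HypPart1", "HypPart2" or None
    subs_content: Optional[str] = None  # full word, when the producer supplies it


def block_to_text(lines: list[list[Word]]) -> str:
    """Render one block as text, resolving hyphenation within and across lines."""
    rendered: list[list[str]] = []
    # Last unresolved HypPart1: (line index, part index, has a full word).
    open_part1: Optional[tuple[int, int, bool]] = None
    for line in lines:
        parts: list[str] = []
        rendered.append(parts)
        for word in line:
            if word.subs_type == "HypPart1":
                # Prefer the full word; otherwise start with the first half.
                parts.append(word.subs_content or word.content)
                open_part1 = (len(rendered) - 1, len(parts) - 1, bool(word.subs_content))
            elif word.subs_type == "HypPart2" and open_part1 is not None:
                row, col, has_full_word = open_part1
                if not has_full_word:
                    # No SUBS_CONTENT: glue the second half onto the first.
                    rendered[row][col] += word.content
                open_part1 = None
            else:
                parts.append(word.content)
                open_part1 = None
    return "\n".join(" ".join(parts) for parts in rendered)


with_full_word = [
    [Word("Un"), Word("exem", "HypPart1", "exemple")],
    [Word("ple", "HypPart2"), Word("simple")],
]
without_full_word = [
    [Word("por", "HypPart1")],
    [Word("te", "HypPart2"), Word("ouverte")],
]
print(block_to_text(with_full_word))
print(block_to_text(without_full_word))
```

The state variable ``open_part1`` survives the end of a line, which is what lets a ``HypPart2`` at the start of line i+1 be matched with the ``HypPart1`` that closed line i. The rebuilt word stays on the line of its first half, so the visual line break is kept.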
``picarones/formats/pagexml/``
- ``types.py``: ``PageDocument``, ``PagePage``, ``PageTextRegion``,
  ``PageTextLine``. Coords are stored as the raw string (PAGE format
  ``"x1,y1 x2,y2 ..."``).
- ``parser.py``: ``parse_pagexml(xml_bytes)``. Tolerant of the PRIMA
  schema versions (2010 / 2013 / 2017 / 2019). Hardened with
  ``defusedxml``. Text is extracted from ``TextEquiv > Unicode``.
- ``projector.py``: ``page_document_to_text(doc)`` plus the
  ``PageToText`` projector.
The writer is deferred until after delivery (PAGE tools typically
produce the format from an editor, so re-emitting it is rarer than
for ALTO).
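As a rough illustration of namespace-tolerant extraction from ``TextEquiv > Unicode``, here is a small self-contained sketch. It is not the code of ``parse_pagexml``: the helper names ``local_name`` and ``page_lines`` are made up for this example, and it uses the standard-library parser only to stay dependency-free, whereas the commit relies on ``defusedxml`` for untrusted input.

```python
import xml.etree.ElementTree as ET  # sketch only; untrusted XML should go through defusedxml


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix so that any PRIMA schema version matches."""
    return tag.rsplit("}", 1)[-1]


def page_lines(xml_bytes: bytes) -> list[str]:
    """Return the text of each TextLine, read from its direct TextEquiv > Unicode."""
    root = ET.fromstring(xml_bytes)
    lines: list[str] = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Only direct children, so Word-level TextEquiv elements are not picked up.
        for equiv in elem:
            if local_name(equiv.tag) != "TextEquiv":
                continue
            for uni in equiv:
                if local_name(uni.tag) == "Unicode":
                    lines.append(uni.text or "")
    return lines


SAMPLE_2019 = b"""<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="p1.png" imageWidth="100" imageHeight="50">
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>Bonjour</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>le monde</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

SAMPLE_NO_NS = b"<PcGts><Page><TextRegion><TextLine><TextEquiv><Unicode>x</Unicode></TextEquiv></TextLine></TextRegion></Page></PcGts>"

print(page_lines(SAMPLE_2019))
print(page_lines(SAMPLE_NO_NS))
```

Matching on the local name rather than on a fixed namespace URI is one simple way to accept the 2010, 2013, 2017 and 2019 schema versions with a single code path.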
``picarones/formats/text/normalization.py``
Moved from ``picarones/measurements/normalization.py`` with no change
in logic. The 11 profiles remain intact.
``picarones/measurements/normalization.py`` becomes an **explicit
re-export** of the public AND private symbols
(``_parse_exclude_chars``, ``_apply_diplomatic_table``) used
downstream by ~50 consumers. No existing import breaks. The re-export
will be removed in S22.
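The re-export pattern can be illustrated with two throwaway modules built in memory. The module names ``newpkg_normalization`` and ``legacy_normalization`` and the ``PROFILES`` placeholder are hypothetical; only ``_parse_exclude_chars`` is a name taken from the commit, and its body here is a dummy.

```python
import sys
import types

# Stand-in for the new canonical location (hypothetical module name).
new_mod = types.ModuleType("newpkg_normalization")
exec(
    "def _parse_exclude_chars(spec):\n"
    "    return set(spec)\n"
    "PROFILES = {'nfc': object()}\n",
    new_mod.__dict__,
)
sys.modules[new_mod.__name__] = new_mod

# Stand-in for the legacy location: nothing but explicit imports.
# Underscore-prefixed names must be listed by name, since a star import
# skips them unless the source module lists them in __all__.
legacy_mod = types.ModuleType("legacy_normalization")
exec(
    "from newpkg_normalization import PROFILES, _parse_exclude_chars\n"
    "__all__ = ['PROFILES', '_parse_exclude_chars']\n",
    legacy_mod.__dict__,
)
sys.modules[legacy_mod.__name__] = legacy_mod

from legacy_normalization import _parse_exclude_chars as old_fn
from newpkg_normalization import _parse_exclude_chars as new_fn

# Both paths resolve to the very same function object.
print(old_fn is new_fn)
```

Because the legacy module only rebinds names, there is a single copy of every function and profile table, so state cannot drift between the two import paths.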
Architectural rule upheld
-------------------------
``measurements/`` (legacy) is allowed to import ``formats/`` (new
code) during the migration. The reverse remains forbidden (the
``test_layer_dependencies`` test is still green).
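One possible way to check such a one-directional rule is to inspect imports statically. The sketch below is an assumption about how a check of this kind could look; it does not reproduce ``test_layer_dependencies``, and the function names and the imported ``PROFILES`` symbol are invented for the example.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Collect the dotted names of modules imported by a piece of source code."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module)
    return found


def violates_layering(source: str, forbidden_prefix: str) -> bool:
    """True if the source imports the forbidden package or one of its submodules."""
    return any(
        name == forbidden_prefix or name.startswith(forbidden_prefix + ".")
        for name in imported_modules(source)
    )


# Source text only; nothing is actually imported or executed.
code_in_formats = "from picarones.measurements.normalization import PROFILES\n"
code_in_measurements = "from picarones.formats.text.normalization import PROFILES\n"

# New code reaching back into the legacy layer breaks the rule.
print(violates_layering(code_in_formats, "picarones.measurements"))
# Legacy code importing the new layer is allowed during the migration.
print(violates_layering(code_in_measurements, "picarones.measurements"))
```

Working on the syntax tree keeps the check independent of whether the packages are installed, at the cost of missing dynamic imports.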
Avoiding over-engineering
-------------------------
- ALTO XSD validator deferred until a caller actually needs it (most
  tools accept well-formed ALTO without strict validation).
- PAGE XML writer deferred.
- ``Illustration`` / ``ComposedBlock`` / ``StyleRefs`` /
  ``ProcessingStep`` are not preserved on the ALTO round-trip.
- PAGE ``Word`` / ``Glyph`` (finer granularity than ``TextLine``) are
  not parsed.
Tests: 41 new tests (3 files)
-----------------------------
``test_sprint_a14_s9_alto.py`` (24)
- 7 parser tests: v2/v3/v4/no-namespace detection, invalid XML,
  empty input, **XXE blocked**.
- 7 round-trip tests: structure preserved, content preserved, bbox
  preserved, byte-deterministic, v3/v4/none targets, invalid version
  rejected.
- 5 text-extraction tests: simple, multi-block, **same-line
  hyphenation**, **cross-line hyphenation**, **hyphenation without
  SUBS_CONTENT concatenates**.
- 5 AltoToText projector tests: protocol satisfied, projection from
  the filesystem, wrong type rejected, missing URI rejected.
``test_sprint_a14_s9_pagexml.py`` (10)
- Parser: 5 cases (multi-region, image_filename/width/height,
  region_type, namespace detected, empty, invalid, XXE).
- Text extraction: 3 cases (full, empty doc, region with no lines).
- Projector: 2 cases (filesystem, wrong type).
``test_sprint_a14_s9_normalization_migration.py`` (5)
- The new path exposes the 11 canonical profiles.
- The old re-export works (backward compatibility).
- Private symbols (``_parse_exclude_chars``,
  ``_apply_diplomatic_table``) are re-exposed.
- Old and new paths **share the same objects** (a true re-export,
  not a duplicate).
- Functional test ``profile.normalize("aſpre")``.
File budget updates
-------------------
``tests/architecture/test_file_budgets.py``:
- ``picarones/measurements/normalization.py``: 420 lines (S9
  re-export, size preserved).
- ``picarones/formats/text/normalization.py``: 420 lines (the
  canonical content now lives here).
State of the test suite
-----------------------
``pytest tests/ -q`` → 4160 passed, 7 skipped, 2 failed.
+41 tests compared with S8. The 2 remaining failures are strictly
environmental (pytest subprocess without ``pip install -e .``).
No S9 regressions.
S9 go/no-go criterion met
-------------------------
``parse_alto(xml_bytes).pages[0]...`` returns a coherent structure on
a synthetic BnF ALTO; ``alto_document_to_text`` extracts the text in
reading order and handles cross-line hyphenation.
Ready for S10 (migration of the pure computations to
``evaluation/metrics/``).
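For readers who want to see what namespace-based version detection amounts to, here is a standalone sketch. It follows the same idea as the ``_detect_version`` helper in ``parser.py`` from this commit, but the function name ``detect_alto_version`` is invented for the example and the standard-library parser replaces ``defusedxml`` purely to keep the snippet self-contained.

```python
import re
import xml.etree.ElementTree as ET  # sketch only; the commit parses with defusedxml
from typing import Optional

# ElementTree exposes namespaced tags as "{namespace}local".
_NS_RE = re.compile(r"^\{([^}]*)\}")


def detect_alto_version(xml_bytes: bytes) -> Optional[str]:
    """Return "v2", "v3", "v4", "none" (no namespace) or None (unknown namespace)."""
    root = ET.fromstring(xml_bytes)
    match = _NS_RE.match(root.tag)
    if match is None:
        return "none"
    namespace = match.group(1)
    for version in ("v2", "v3", "v4"):
        if f"ns-{version}" in namespace:
            return version
    return None


print(detect_alto_version(b'<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#"/>'))
print(detect_alto_version(b"<alto/>"))
print(detect_alto_version(b'<alto xmlns="urn:example:other"/>'))
```

Returning ``"none"`` and ``None`` as distinct values keeps "no namespace at all" apart from "a namespace that is not a known ALTO one", which matters if a writer later wants to re-emit the document in its source flavor.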
https://claude.ai/code/session_011XQZNitg1rCgia8ZD1a2hP
- picarones/formats/alto/__init__.py +47 -14
- picarones/formats/alto/parser.py +227 -0
- picarones/formats/alto/projector.py +215 -0
- picarones/formats/alto/types.py +126 -0
- picarones/formats/alto/writer.py +147 -0
- picarones/formats/pagexml/__init__.py +29 -5
- picarones/formats/pagexml/parser.py +149 -0
- picarones/formats/pagexml/projector.py +96 -0
- picarones/formats/pagexml/types.py +82 -0
- picarones/formats/text/__init__.py +38 -12
- picarones/formats/text/normalization.py +420 -0
- picarones/measurements/normalization.py +54 -416
- tests/architecture/test_file_budgets.py +5 -1
- tests/formats/__init__.py +0 -0
- tests/formats/alto/__init__.py +0 -0
- tests/formats/alto/test_sprint_a14_s9_alto.py +316 -0
- tests/formats/pagexml/__init__.py +0 -0
- tests/formats/pagexml/test_sprint_a14_s9_pagexml.py +136 -0
- tests/formats/text/__init__.py +0 -0
- tests/formats/text/test_sprint_a14_s9_normalization_migration.py +80 -0
@@ -1,21 +1,54 @@
-"""Format ALTO XML 4.x.
+"""ALTO XML 4.x format (v2/v3 tolerated).
 
+Sprint A14-S9 delivers:
 
-- ``
-- ``
-- ``projector.py`` —
+- ``types.py``: ``AltoDocument``, ``AltoPage``, ``AltoTextBlock``,
+  ``AltoLine``, ``AltoString``, ``AltoBBox``. Frozen pydantic models.
+- ``parser.py``: ``parse_alto(xml_bytes)``, auto-detects v2/v3/v4
+  from the namespace of the root element. Hardened with ``defusedxml``.
+- ``writer.py``: ``write_alto(doc, version="v4", pretty=False)``,
+  deterministic output (byte-stable round-trip with ``parser``).
+- ``projector.py``: ``alto_document_to_text(doc)`` (helper) plus
+  ``AltoToText`` (projector conforming to the S5 protocol). Handles
+  ``HypPart1``/``HypPart2`` hyphenation.
 
+Avoiding over-engineering
+-------------------------
+- XSD validator deferred until a caller actually needs it (most
+  consuming tools accept well-formed ALTO without strict
+  validation).
+- ``Illustration``, ``ComposedBlock``, ``GraphicalElement``,
+  ``StyleRefs``, ``ProcessingStep``: not preserved on the round-trip
+  for S9.
 """
 
 from __future__ import annotations
 
+from picarones.formats.alto.parser import AltoParseError, parse_alto
+from picarones.formats.alto.projector import AltoToText, alto_document_to_text
+from picarones.formats.alto.types import (
+    AltoBBox,
+    AltoDocument,
+    AltoLine,
+    AltoPage,
+    AltoString,
+    AltoTextBlock,
+)
+from picarones.formats.alto.writer import write_alto
+
+__all__ = [
+    # Types
+    "AltoBBox",
+    "AltoString",
+    "AltoLine",
+    "AltoTextBlock",
+    "AltoPage",
+    "AltoDocument",
+    # Parser / Writer
+    "parse_alto",
+    "AltoParseError",
+    "write_alto",
+    # Projector
+    "alto_document_to_text",
+    "AltoToText",
+]
@@ -0,0 +1,227 @@
+"""Namespace-tolerant ALTO XML parser (Sprint A14-S9).
+
+Auto-detects the ALTO version (v2/v3/v4) from the namespace of the
+root element. Tolerant of variants: an ALTO file without a namespace
+is accepted, and so is one with a partial declaration (``<alto>``
+without xmlns).
+
+Security
+--------
+Uses ``defusedxml.ElementTree`` to block XXE, Billion Laughs and DTD
+retrieval: an ALTO file may come from a third-party module or from
+an unauthenticated web user.
+
+Avoiding over-engineering
+-------------------------
+- No XSD schema validation for S9 (the ``validator.py`` in the plan
+  is deferred until a caller actually needs it; most tools will
+  accept well-formed ALTO even without strict validation).
+- Unrecognized elements (``Illustration``, ``ComposedBlock``,
+  ``GraphicalElement``) are silently ignored by the parser.
+- ``HypPart1`` / ``HypPart2`` are preserved on ``AltoString`` (the
+  projector uses them for hyphenation).
+"""
+
+from __future__ import annotations
+
+import logging
+import re
+from typing import Any
+
+import defusedxml.ElementTree as _SafeET
+
+from picarones.domain.errors import PicaronesError
+from picarones.formats.alto.types import (
+    AltoBBox,
+    AltoDocument,
+    AltoLine,
+    AltoPage,
+    AltoString,
+    AltoTextBlock,
+)
+
+logger = logging.getLogger(__name__)
+
+
+class AltoParseError(PicaronesError):
+    """ALTO cannot be parsed (invalid XML, XXE blocked, missing root)."""
+
+
+_NS_RE = re.compile(r"^\{([^}]*)\}")
+_LOCAL_NAME_RE = re.compile(r"\{[^}]*\}")
+
+
+def _local(tag: str) -> str:
+    """Strip the namespace prefix, keeping only the local name."""
+    return _LOCAL_NAME_RE.sub("", tag)
+
+
+def _detect_version(root_tag: str) -> str | None:
+    """Detect the ALTO version from the root tag.
+
+    - No namespace → ``"none"``.
+    - ``http://www.loc.gov/standards/alto/ns-v2#`` → ``"v2"``.
+    - ``http://www.loc.gov/standards/alto/ns-v3#`` → ``"v3"``.
+    - ``http://www.loc.gov/standards/alto/ns-v4#`` → ``"v4"``.
+    - Any other namespace → ``None`` (unknown).
+    """
+    m = _NS_RE.match(root_tag)
+    if m is None:
+        return "none"
+    ns = m.group(1)
+    if "ns-v2" in ns:
+        return "v2"
+    if "ns-v3" in ns:
+        return "v3"
+    if "ns-v4" in ns:
+        return "v4"
+    return None
+
+
+def _parse_int_attr(elem: Any, name: str) -> int | None:
+    """Parse an optional integer attribute. Returns ``None`` if
+    absent or invalid (instead of raising)."""
+    raw = elem.attrib.get(name)
+    if raw is None:
+        return None
+    try:
+        # ALTO allows floats in some attributes (HPOS); we truncate
+        # to int.
+        return int(float(raw))
+    except (ValueError, TypeError):
+        return None
+
+
+def _parse_bbox(elem: Any) -> AltoBBox | None:
+    """Build an ``AltoBBox`` if all 4 attributes are present."""
+    h = _parse_int_attr(elem, "HPOS")
+    v = _parse_int_attr(elem, "VPOS")
+    w = _parse_int_attr(elem, "WIDTH")
+    height = _parse_int_attr(elem, "HEIGHT")
+    if any(x is None for x in (h, v, w, height)):
+        return None
+    # Negative coordinates occur in some malformed ALTO files; clip to 0.
+    return AltoBBox(
+        hpos=max(0, h or 0),
+        vpos=max(0, v or 0),
+        width=max(0, w or 0),
+        height=max(0, height or 0),
+    )
+
+
+def _parse_string(elem: Any) -> AltoString:
+    """Convert a ``<String>`` element into an ``AltoString``."""
+    return AltoString(
+        content=elem.attrib.get("CONTENT", ""),
+        id=elem.attrib.get("ID"),
+        bbox=_parse_bbox(elem),
+        subs_type=elem.attrib.get("SUBS_TYPE"),
+        subs_content=elem.attrib.get("SUBS_CONTENT"),
+    )
+
+
+def _parse_line(elem: Any) -> AltoLine:
+    """Convert a ``<TextLine>`` element into an ``AltoLine``."""
+    strings: list[AltoString] = []
+    for child in elem:
+        if _local(child.tag) == "String":
+            strings.append(_parse_string(child))
+    return AltoLine(
+        id=elem.attrib.get("ID"),
+        bbox=_parse_bbox(elem),
+        strings=tuple(strings),
+    )
+
+
+def _parse_block(elem: Any) -> AltoTextBlock:
+    """Convert a ``<TextBlock>`` element into an ``AltoTextBlock``."""
+    lines: list[AltoLine] = []
+    for child in elem.iter():
+        if _local(child.tag) == "TextLine":
+            lines.append(_parse_line(child))
+    return AltoTextBlock(
+        id=elem.attrib.get("ID"),
+        bbox=_parse_bbox(elem),
+        lines=tuple(lines),
+    )
+
+
+def _parse_page(elem: Any) -> AltoPage:
+    """Convert a ``<Page>`` element into an ``AltoPage``."""
+    blocks: list[AltoTextBlock] = []
+    seen_block_ids: set[int] = set()
+    for child in elem.iter():
+        if _local(child.tag) != "TextBlock":
+            continue
+        # Avoid duplicates when a TextBlock is nested inside a
+        # ComposedBlock: the block is returned only once (by Python id).
+        marker = id(child)
+        if marker in seen_block_ids:
+            continue
+        seen_block_ids.add(marker)
+        blocks.append(_parse_block(child))
+    return AltoPage(
+        id=elem.attrib.get("ID"),
+        width=_parse_int_attr(elem, "WIDTH"),
+        height=_parse_int_attr(elem, "HEIGHT"),
+        blocks=tuple(blocks),
+    )
+
+
+def parse_alto(xml: bytes | str) -> AltoDocument:
+    """Parse an ALTO document and return its internal structure.
+
+    Parameters
+    ----------
+    xml:
+        XML bytes or string. The encoding is detected automatically by
+        ``defusedxml`` (from the ``<?xml encoding="..."?>`` declaration
+        or the BOM).
+
+    Returns
+    -------
+    AltoDocument
+        Document whose ``source_version`` gives the detected version
+        and whose ``pages`` hold the full hierarchy.
+
+    Raises
+    ------
+    AltoParseError
+        Malformed XML, XXE defense triggered, or missing root.
+    """
+    if isinstance(xml, str):
+        xml_bytes = xml.encode("utf-8")
+    else:
+        xml_bytes = xml
+    if not xml_bytes.strip():
+        raise AltoParseError("ALTO vide.")
+    try:
+        root = _SafeET.fromstring(xml_bytes)
+    except Exception as exc:  # noqa: BLE001
+        raise AltoParseError(f"XML invalide ou XXE bloqué : {exc}") from exc
+
+    if root is None:
+        raise AltoParseError("ALTO sans root element.")
+
+    version = _detect_version(root.tag)
+    if _local(root.tag) != "alto":
+        # Tolerant: look for a nested <alto> (e.g. a METS file that
+        # embeds the ALTO in an mdRef). Otherwise keep the root as
+        # is; a caller may be passing a <Page> fragment
+        # directly.
+        for elem in root.iter():
+            if _local(elem.tag) == "alto":
+                root = elem
+                version = _detect_version(elem.tag)
+                break
+
+    pages: list[AltoPage] = []
+    for elem in root.iter():
+        if _local(elem.tag) == "Page":
+            pages.append(_parse_page(elem))
+
+    return AltoDocument(pages=tuple(pages), source_version=version)
+
+
+__all__ = ["parse_alto", "AltoParseError"]
@@ -0,0 +1,215 @@
+"""ALTO projectors (Sprint A14-S9).
+
+Converts an ``AltoDocument`` (or an ``ALTO_XML`` artifact) into other
+artifact types, explicitly documenting the losses through
+``ProjectionReport``.
+
+Implementations
+---------------
+- ``AltoToText``: text extraction in reading order
+  ``Page → Block → Line → String``. Handles ``HypPart1``/``HypPart2``
+  hyphenation.
+
+Planned after delivery:
+- ``AltoToLines`` (line extraction).
+- ``AltoToWordsWithBoxes`` (words + coordinates).
+"""
+
+from __future__ import annotations
+
+from picarones.domain.artifacts import Artifact, ArtifactType
+from picarones.evaluation.projectors.base import ProjectionReport
+from picarones.formats.alto.parser import AltoParseError, parse_alto
+from picarones.formats.alto.types import AltoDocument, AltoLine
+
+
+def alto_document_to_text(document: AltoDocument) -> str:
+    """Extract the flat text of an ``AltoDocument``.
+
+    Conventions:
+
+    - Reading order ``Page → Block → Line → String``, in the order of
+      appearance in the XML.
+    - A space between the ``String`` elements of a line.
+    - A line break between ``TextLine`` elements.
+    - An extra line break between ``TextBlock`` elements.
+    - **Hyphenation**:
+      - If a ``HypPart1`` carries ``SUBS_CONTENT`` (the full word),
+        that full word is used and the matching ``HypPart2`` is
+        skipped (same line or next line of the same block).
+      - Otherwise, ``HypPart1.content + HypPart2.content`` are
+        concatenated and the ``HypPart2`` is skipped.
+      - The visual line break between the two is **kept** (the
+        rebuilt word ends the line of the ``HypPart1``, and the line
+        of the ``HypPart2`` continues with its other words).
+    """
+    blocks_text: list[str] = []
+    for page in document.pages:
+        for block in page.blocks:
+            block_text = _extract_block_text(block)
+            if block_text:
+                blocks_text.append(block_text)
+    return "\n\n".join(blocks_text).strip()
+
+
+def _extract_block_text(block: "AltoTextBlock") -> str:
+    """Extract the text of a block, handling cross-line hyphenation.
+
+    Standard ALTO usage puts ``HypPart1`` at the end of a line and
+    ``HypPart2`` at the start of the next line of the **same** block.
+    """
+    from picarones.formats.alto.types import AltoTextBlock as _ATB
+    assert isinstance(block, _ATB)
+    lines_text: list[str] = []
+    skip_first_if_hyppart2 = False
+    for line in block.lines:
+        text, ended_with_hyp1 = _extract_line_text(
+            line, skip_first_if_hyppart2=skip_first_if_hyppart2,
+        )
+        lines_text.append(text)
+        skip_first_if_hyppart2 = ended_with_hyp1
+    return "\n".join(lines_text)
+
+
+def _extract_line_text(
+    line: AltoLine,
+    *,
+    skip_first_if_hyppart2: bool = False,
+) -> tuple[str, bool]:
+    """Rebuild the text of a line.
+
+    Returns
+    -------
+    tuple[str, bool]
+        ``(line_text, ended_with_hyppart1_resolved)``. The second
+        value tells whether the line ends with a ``HypPart1`` whose
+        resolution requires skipping the first ``HypPart2`` of the
+        next line.
+    """
+    parts: list[str] = []
+    skip_next = False
+    ended_with_hyp1 = False
+    strings = list(line.strings)
+    for i, s in enumerate(strings):
+        is_first = (i == 0)
+        if skip_next:
+            skip_next = False
+            continue
+        if is_first and skip_first_if_hyppart2 and s.subs_type == "HypPart2":
+            # Cross-line: the previous line already resolved the HypPart1.
+            continue
+        if s.subs_type == "HypPart1":
+            is_last = (i == len(strings) - 1)
+            if s.subs_content:
+                parts.append(s.subs_content)
+                if i + 1 < len(strings) and strings[i + 1].subs_type == "HypPart2":
+                    skip_next = True
+                elif is_last:
+                    ended_with_hyp1 = True
+                continue
+            if i + 1 < len(strings) and strings[i + 1].subs_type == "HypPart2":
+                parts.append(s.content + strings[i + 1].content)
+                skip_next = True
+                continue
+            parts.append(s.content)
+            if is_last:
+                ended_with_hyp1 = True
+            continue
+        parts.append(s.content)
+    return " ".join(p for p in parts if p), ended_with_hyp1
+
+
+# ──────────────────────────────────────────────────────────────────────
+# Projector conforming to the ``Projector`` protocol (Sprint S5)
+# ──────────────────────────────────────────────────────────────────────
+
+
+class AltoToText:
+    """``ALTO_XML → RAW_TEXT`` projector.
+
+    Reads the XML from ``Artifact.uri`` (a filesystem path) when it
+    is set; otherwise it expects the caller to have stored the
+    payload through an external mechanism (this projector downloads
+    nothing by itself, so there is no network side effect).
+
+    For S9, ``artifact.uri`` is expected to point to a readable
+    local file. The application service (S19) will resolve the
+    other cases (remote URI, inline payload).
+    """
+
+    name = "alto_to_text"
+    source_type = ArtifactType.ALTO_XML
+    target_type = ArtifactType.RAW_TEXT
+
+    def project(
+        self,
+        artifact: Artifact,
+        params: dict[str, str | int | float | bool],
+    ) -> tuple[Artifact, ProjectionReport]:
+        if artifact.type != self.source_type:
+            from picarones.domain.errors import ProjectionError
+            raise ProjectionError(
+                f"AltoToText n'accepte que ALTO_XML, reçu "
+                f"{artifact.type.value!r}"
+            )
+
+        # Read the XML. For S9, it is read from the filesystem.
+        xml_bytes = self._read_xml(artifact)
+
+        try:
+            doc = parse_alto(xml_bytes)
+        except AltoParseError as exc:
+            from picarones.domain.errors import ProjectionError
+            raise ProjectionError(f"AltoToText : {exc}") from exc
+
+        text = alto_document_to_text(doc)
+
+        # Build the resulting artifact.
+        target = Artifact(
+            id=f"{artifact.id}:projected_text",
+            document_id=artifact.document_id,
+            type=self.target_type,
+            produced_by_step=artifact.produced_by_step,
+        )
+
+        report = ProjectionReport(
+            source_artifact_id=artifact.id,
+            source_type=self.source_type,
+            target_type=self.target_type,
+            projector_name=self.name,
+            lossy=True,
+            ignored_dimensions=(
+                "geometry",
+                "block_structure",
+                "reading_order",
+                "ids",
+                "confidence",
+            ),
+            warnings=(
+                "L'extraction texte ALTO ignore les coordonnées, "
+                "la structure en blocs, et les IDs. La césure "
+                "HypPart1/HypPart2 est résolue (mot recombiné).",
+            ),
+        )
+        return target, report
+
+    @staticmethod
+    def _read_xml(artifact: Artifact) -> bytes:
+        from picarones.domain.errors import ProjectionError
+        if artifact.uri is None:
+            raise ProjectionError(
+                f"AltoToText : artifact {artifact.id!r} n'a pas d'URI "
+                "et le projecteur ne sait pas résoudre les payloads "
+                "inline pour S9."
+            )
+        from pathlib import Path
+        path = Path(artifact.uri)
+        try:
+            return path.read_bytes()
+        except OSError as exc:
+            raise ProjectionError(
+                f"AltoToText : impossible de lire {path!r} : {exc}"
+            ) from exc
+
+
+__all__ = ["alto_document_to_text", "AltoToText"]
@@ -0,0 +1,126 @@
+"""ALTO internal structures (Sprint A14-S9).
+
+**Typed, immutable** representation of an ALTO XML document for
+manipulation, projection and parser/writer round-trips. Independent
+of the source namespace (v2/v3/v4): the parser normalizes it.
+
+Simplified ALTO hierarchy:
+
+::
+
+    AltoDocument
+      └─ AltoPage (1..N)
+           └─ AltoTextBlock (0..N)
+                └─ AltoLine (0..N)
+                     └─ AltoString (0..N)
+
+The coordinates (HPOS, VPOS, WIDTH, HEIGHT) are **optional**. ALTO
+produced by some VLMs may omit the bboxes (text without
+coordinates); this is accepted at parse time and the ALTO→text
+projector still works.
+
+Avoiding over-engineering
+-------------------------
+No support for rare elements in S9:
+- ``ComposedBlock`` (grouping of blocks): projected as flat blocks.
+- ``Illustration`` / ``GraphicalElement``: ignored during text extraction.
+- ``StyleRefs`` / typography: not preserved by the writer.
+- ``Hyphenation`` through ``HypPart1`` / ``HypPart2``, on the other
+  hand, is handled by the projector (see ``projector.py``).
+"""
+
+from __future__ import annotations
+
+from pydantic import BaseModel, ConfigDict, Field
+
+
+class AltoBBox(BaseModel):
+    """Optional bounding box (coordinates in pixels)."""
+
+    model_config = ConfigDict(frozen=True, extra="forbid")
+
+    hpos: int = Field(ge=0)
+    vpos: int = Field(ge=0)
+    width: int = Field(ge=0)
+    height: int = Field(ge=0)
+
+
+class AltoString(BaseModel):
+    """An ALTO word (``<String>`` element).
+
+    Mapped ALTO attributes:
+    - ``CONTENT`` → ``content``
+    - ``ID`` → ``id``
+    - ``HPOS``/``VPOS``/``WIDTH``/``HEIGHT`` → ``bbox``
+    - ``SUBS_TYPE`` → ``subs_type`` (``"HypPart1"`` / ``"HypPart2"``).
+      The projector uses it to handle end-of-line hyphenation.
+    - ``SUBS_CONTENT`` → ``subs_content`` (full word when hyphenated).
+    """
+
+    model_config = ConfigDict(frozen=True, extra="forbid")
+
+    content: str
+    id: str | None = Field(default=None, max_length=128)
+    bbox: AltoBBox | None = None
+    subs_type: str | None = Field(default=None, pattern=r"^(HypPart1|HypPart2)$")
+    subs_content: str | None = None
+
+
+class AltoLine(BaseModel):
+    """An ALTO line (``<TextLine>`` element)."""
+
+    model_config = ConfigDict(frozen=True, extra="forbid")
+
+    id: str | None = Field(default=None, max_length=128)
+    bbox: AltoBBox | None = None
+    strings: tuple[AltoString, ...] = Field(default_factory=tuple)
+    """Words of the line, in natural reading order (left → right)."""
+
+
+class AltoTextBlock(BaseModel):
+    """An ALTO text block (``<TextBlock>`` element)."""
+
+    model_config = ConfigDict(frozen=True, extra="forbid")
+
+    id: str | None = Field(default=None, max_length=128)
+    bbox: AltoBBox | None = None
+    lines: tuple[AltoLine, ...] = Field(default_factory=tuple)
+
+
+class AltoPage(BaseModel):
+    """An ALTO page (``<Page>`` element)."""
+
+    model_config = ConfigDict(frozen=True, extra="forbid")
+
+    id: str | None = Field(default=None, max_length=128)
+    width: int | None = Field(default=None, ge=0)
+    """Physical width in pixels (``WIDTH``)."""
+    height: int | None = Field(default=None, ge=0)
+    """Physical height in pixels (``HEIGHT``)."""
+    blocks: tuple[AltoTextBlock, ...] = Field(default_factory=tuple)
+
+
+class AltoDocument(BaseModel):
+    """Complete ALTO document.
+
+    Keeps the source version seen at parse time so that the writer
+    can re-emit in the same namespace if asked. By default, the
+    writer emits v4 (the most recent and most expressive).
|
| 109 |
+
"""
|
| 110 |
+
|
| 111 |
+
model_config = ConfigDict(frozen=True, extra="forbid")
|
| 112 |
+
|
| 113 |
+
pages: tuple[AltoPage, ...] = Field(default_factory=tuple)
|
| 114 |
+
source_version: str | None = Field(default=None, max_length=8)
|
| 115 |
+
"""Version détectée au parsing : ``"v2"`` / ``"v3"`` / ``"v4"`` /
|
| 116 |
+
``"none"`` (sans namespace) / ``None`` (inconnue)."""
|
| 117 |
+
|
| 118 |
+
|
| 119 |
+
__all__ = [
|
| 120 |
+
"AltoBBox",
|
| 121 |
+
"AltoString",
|
| 122 |
+
"AltoLine",
|
| 123 |
+
"AltoTextBlock",
|
| 124 |
+
"AltoPage",
|
| 125 |
+
"AltoDocument",
|
| 126 |
+
]
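The `subs_type` / `subs_content` fields above exist so that the projector can rebuild hyphenated words. The projector itself is not part of this excerpt, so the following is only a minimal sketch of the rule described for it (use `SUBS_CONTENT` when present and skip `HypPart2`, otherwise concatenate both parts, including across lines). `Word` and `resolve_hyphenation` are hypothetical names used for illustration, not the project's API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical stand-in for ``AltoString`` (only the fields used here)."""

    content: str
    subs_type: str | None = None     # "HypPart1", "HypPart2" or None
    subs_content: str | None = None  # full word, when the producer provides it


def resolve_hyphenation(lines: list[list[Word]]) -> list[str]:
    """Return one string per line, merging HypPart1/HypPart2 pairs."""
    rendered: list[list[str]] = []
    pending: tuple[int, int] | None = None  # (line, word) slot of an open HypPart1
    skip_part2 = False                      # SUBS_CONTENT already gave the full word
    for line in lines:
        current: list[str] = []
        rendered.append(current)
        for w in line:
            if w.subs_type == "HypPart1":
                if w.subs_content:
                    current.append(w.subs_content)
                    skip_part2, pending = True, None
                else:
                    current.append(w.content)
                    pending = (len(rendered) - 1, len(current) - 1)
                    skip_part2 = False
            elif w.subs_type == "HypPart2" and skip_part2:
                skip_part2 = False          # second half already covered
            elif w.subs_type == "HypPart2" and pending is not None:
                li, wi = pending            # may point at the previous line
                rendered[li][wi] += w.content
                pending = None
            else:
                current.append(w.content)   # plain word or orphan HypPart2
    return [" ".join(words) for words in rendered]


page = [
    [Word("Le"), Word("manu", "HypPart1", "manuscrit")],
    [Word("scrit", "HypPart2", "manuscrit"), Word("ancien")],
]
print(resolve_hyphenation(page))
```

In this sketch the merged word stays on the line where it started; whether the real projector keeps it there or moves it to the next line is not visible in this diff.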
|
|
@@ -0,0 +1,147 @@
|
| 1 |
+
"""Deterministic ALTO XML writer, Sprint A14-S9.
|
| 2 |
+
|
| 3 |
+
Serializes an ``AltoDocument`` to ALTO XML bytes. The output is
|
| 4 |
+
deterministic: the same document always yields the exact same bytes
|
| 5 |
+
(useful for the S7 artifact cache and for round-trip tests).
|
| 6 |
+
|
| 7 |
+
Output format
|
| 8 |
+
-------------
|
| 9 |
+
By default the writer emits ALTO **v4** (the most recent and most
|
| 10 |
+
expressive version), even if the document was parsed from v2/v3. The
|
| 11 |
+
caller can force a target version with ``write_alto(doc,
|
| 12 |
+
version="v3")``.
|
| 13 |
+
|
| 14 |
+
Avoiding over-engineering
|
| 15 |
+
-------------------------
|
| 16 |
+
- No support for ``StyleRefs``, ``ProcessingStep``, ``OCRProcessing`` or
|
| 17 |
+
``Description`` in S9. The writer emits a minimal structure
|
| 18 |
+
(``alto > Layout > Page > PrintSpace > TextBlock > TextLine > String``)
|
| 19 |
+
that passes validation in common consumer tools
|
| 20 |
+
(Mirador, IIIF Universal Viewer, Aletheia).
|
| 21 |
+
- No XSL preprocessing. Users who need an enriched ALTO
|
| 22 |
+
can write a wrapper.
|
| 23 |
+
"""
|
| 24 |
+
|
| 25 |
+
from __future__ import annotations
|
| 26 |
+
|
| 27 |
+
from xml.etree import ElementTree as ET
|
| 28 |
+
|
| 29 |
+
from picarones.formats.alto.types import (
|
| 30 |
+
AltoBBox,
|
| 31 |
+
AltoDocument,
|
| 32 |
+
AltoLine,
|
| 33 |
+
AltoPage,
|
| 34 |
+
AltoString,
|
| 35 |
+
AltoTextBlock,
|
| 36 |
+
)
|
| 37 |
+
|
| 38 |
+
|
| 39 |
+
_NAMESPACE_BY_VERSION: dict[str, str] = {
|
| 40 |
+
"v2": "http://www.loc.gov/standards/alto/ns-v2#",
|
| 41 |
+
"v3": "http://www.loc.gov/standards/alto/ns-v3#",
|
| 42 |
+
"v4": "http://www.loc.gov/standards/alto/ns-v4#",
|
| 43 |
+
}
|
| 44 |
+
|
| 45 |
+
|
| 46 |
+
def _set_bbox_attrs(elem: ET.Element, bbox: AltoBBox | None) -> None:
|
| 47 |
+
if bbox is None:
|
| 48 |
+
return
|
| 49 |
+
elem.set("HPOS", str(bbox.hpos))
|
| 50 |
+
elem.set("VPOS", str(bbox.vpos))
|
| 51 |
+
elem.set("WIDTH", str(bbox.width))
|
| 52 |
+
elem.set("HEIGHT", str(bbox.height))
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
def _set_optional(elem: ET.Element, name: str, value: str | None) -> None:
|
| 56 |
+
if value is not None:
|
| 57 |
+
elem.set(name, value)
|
| 58 |
+
|
| 59 |
+
|
| 60 |
+
def _build_string(parent: ET.Element, ns: str, s: AltoString) -> None:
|
| 61 |
+
elem = ET.SubElement(parent, f"{{{ns}}}String" if ns else "String")
|
| 62 |
+
elem.set("CONTENT", s.content)
|
| 63 |
+
_set_optional(elem, "ID", s.id)
|
| 64 |
+
_set_bbox_attrs(elem, s.bbox)
|
| 65 |
+
_set_optional(elem, "SUBS_TYPE", s.subs_type)
|
| 66 |
+
_set_optional(elem, "SUBS_CONTENT", s.subs_content)
|
| 67 |
+
|
| 68 |
+
|
| 69 |
+
def _build_line(parent: ET.Element, ns: str, line: AltoLine) -> None:
|
| 70 |
+
elem = ET.SubElement(parent, f"{{{ns}}}TextLine" if ns else "TextLine")
|
| 71 |
+
_set_optional(elem, "ID", line.id)
|
| 72 |
+
_set_bbox_attrs(elem, line.bbox)
|
| 73 |
+
for s in line.strings:
|
| 74 |
+
_build_string(elem, ns, s)
|
| 75 |
+
|
| 76 |
+
|
| 77 |
+
def _build_block(parent: ET.Element, ns: str, block: AltoTextBlock) -> None:
|
| 78 |
+
elem = ET.SubElement(parent, f"{{{ns}}}TextBlock" if ns else "TextBlock")
|
| 79 |
+
_set_optional(elem, "ID", block.id)
|
| 80 |
+
_set_bbox_attrs(elem, block.bbox)
|
| 81 |
+
for line in block.lines:
|
| 82 |
+
_build_line(elem, ns, line)
|
| 83 |
+
|
| 84 |
+
|
| 85 |
+
def _build_page(parent: ET.Element, ns: str, page: AltoPage) -> None:
|
| 86 |
+
elem = ET.SubElement(parent, f"{{{ns}}}Page" if ns else "Page")
|
| 87 |
+
_set_optional(elem, "ID", page.id)
|
| 88 |
+
if page.width is not None:
|
| 89 |
+
elem.set("WIDTH", str(page.width))
|
| 90 |
+
if page.height is not None:
|
| 91 |
+
elem.set("HEIGHT", str(page.height))
|
| 92 |
+
print_space = ET.SubElement(
|
| 93 |
+
elem, f"{{{ns}}}PrintSpace" if ns else "PrintSpace",
|
| 94 |
+
)
|
| 95 |
+
for block in page.blocks:
|
| 96 |
+
_build_block(print_space, ns, block)
|
| 97 |
+
|
| 98 |
+
|
| 99 |
+
def write_alto(
|
| 100 |
+
document: AltoDocument,
|
| 101 |
+
*,
|
| 102 |
+
version: str = "v4",
|
| 103 |
+
pretty: bool = False,
|
| 104 |
+
) -> bytes:
|
| 105 |
+
"""Sérialise un ``AltoDocument`` en bytes ALTO XML.
|
| 106 |
+
|
| 107 |
+
Parameters
|
| 108 |
+
----------
|
| 109 |
+
document:
|
| 110 |
+
Document à sérialiser.
|
| 111 |
+
version:
|
| 112 |
+
Version ALTO cible. ``"v2"`` / ``"v3"`` / ``"v4"`` ou
|
| 113 |
+
``"none"`` (sans namespace). Défaut : ``"v4"``.
|
| 114 |
+
pretty:
|
| 115 |
+
Si ``True``, indente la sortie pour la lisibilité humaine.
|
| 116 |
+
``False`` (défaut) produit une sortie compacte byte-déterministe.
|
| 117 |
+
|
| 118 |
+
Returns
|
| 119 |
+
-------
|
| 120 |
+
bytes
|
| 121 |
+
XML encodé en UTF-8 avec déclaration XML.
|
| 122 |
+
"""
|
| 123 |
+
if version not in (*_NAMESPACE_BY_VERSION, "none"):
|
| 124 |
+
from picarones.domain.errors import PicaronesError
|
| 125 |
+
raise PicaronesError(
|
| 126 |
+
f"version ALTO invalide : {version!r}. "
|
| 127 |
+
f"Acceptées : {sorted(_NAMESPACE_BY_VERSION)} + 'none'."
|
| 128 |
+
)
|
| 129 |
+
ns = _NAMESPACE_BY_VERSION.get(version, "")
|
| 130 |
+
if ns:
|
| 131 |
+
ET.register_namespace("", ns)
|
| 132 |
+
root = ET.Element(f"{{{ns}}}alto")
|
| 133 |
+
else:
|
| 134 |
+
root = ET.Element("alto")
|
| 135 |
+
|
| 136 |
+
layout = ET.SubElement(root, f"{{{ns}}}Layout" if ns else "Layout")
|
| 137 |
+
for page in document.pages:
|
| 138 |
+
_build_page(layout, ns, page)
|
| 139 |
+
|
| 140 |
+
if pretty:
|
| 141 |
+
ET.indent(root, space=" ")
|
| 142 |
+
|
| 143 |
+
body = ET.tostring(root, encoding="utf-8", xml_declaration=True)
|
| 144 |
+
return body
|
| 145 |
+
|
| 146 |
+
|
| 147 |
+
__all__ = ["write_alto"]
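To make the byte-stability claim concrete, here is a reduced sketch of the same construction using only the standard library. `build_minimal_alto` is a hypothetical helper written for this illustration; it mirrors the qualified-tag pattern of `write_alto` (`{namespace}Tag`, or a bare tag when there is no namespace) but is not the project's function.

```python
from xml.etree import ElementTree as ET

ALTO_V4 = "http://www.loc.gov/standards/alto/ns-v4#"


def build_minimal_alto(words: list[str], ns: str = ALTO_V4) -> bytes:
    """Serialize one text line of words as a minimal ALTO-like tree."""

    def q(tag: str) -> str:
        # Clark notation when a namespace is used, bare tag otherwise.
        return f"{{{ns}}}{tag}" if ns else tag

    if ns:
        # Map the namespace to the default (empty) prefix for serialization.
        ET.register_namespace("", ns)
    root = ET.Element(q("alto"))
    layout = ET.SubElement(root, q("Layout"))
    page = ET.SubElement(layout, q("Page"))
    space = ET.SubElement(page, q("PrintSpace"))
    block = ET.SubElement(space, q("TextBlock"))
    line = ET.SubElement(block, q("TextLine"))
    for word in words:
        ET.SubElement(line, q("String")).set("CONTENT", word)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


first = build_minimal_alto(["Bonjour", "monde"])
second = build_minimal_alto(["Bonjour", "monde"])
print(first == second)
```

One design point worth keeping in mind: `ET.register_namespace` updates a process-wide registry, so registering the empty prefix affects every later ElementTree serialization that uses the same namespace URI.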
|
|
@@ -1,12 +1,36 @@
|
|
| 1 |
"""Format PAGE XML (PRIMA / Transkribus).
|
| 2 |
|
| 3 |
-
|
| 4 |
|
| 5 |
-
- ``
|
| 6 |
-
|
| 7 |
-
- ``
|
| 8 |
"""
|
| 9 |
|
| 10 |
from __future__ import annotations
|
| 11 |
|
| 12 |
-
|
| 1 |
"""Format PAGE XML (PRIMA / Transkribus).
|
| 2 |
|
| 3 |
+
Sprint A14-S9 livre :
|
| 4 |
|
| 5 |
+
- ``types.py`` — ``PageDocument``, ``PagePage``, ``PageTextRegion``,
|
| 6 |
+
``PageTextLine``. Frozen pydantic.
|
| 7 |
+
- ``parser.py`` — ``parse_pagexml(xml_bytes)`` tolérant aux versions
|
| 8 |
+
de namespace PRIMA. Sécurité ``defusedxml``.
|
| 9 |
+
- ``projector.py`` — ``page_document_to_text(doc)`` + ``PageToText``.
|
| 10 |
+
|
| 11 |
+
Writer reporté post-livraison (les outils PAGE produisent
|
| 12 |
+
typiquement le format à partir d'un éditeur — le besoin de re-sortir
|
| 13 |
+
est plus rare que pour ALTO).
|
| 14 |
"""
|
| 15 |
|
| 16 |
from __future__ import annotations
|
| 17 |
|
| 18 |
+
from picarones.formats.pagexml.parser import PageParseError, parse_pagexml
|
| 19 |
+
from picarones.formats.pagexml.projector import PageToText, page_document_to_text
|
| 20 |
+
from picarones.formats.pagexml.types import (
|
| 21 |
+
PageDocument,
|
| 22 |
+
PagePage,
|
| 23 |
+
PageTextLine,
|
| 24 |
+
PageTextRegion,
|
| 25 |
+
)
|
| 26 |
+
|
| 27 |
+
__all__ = [
|
| 28 |
+
"PageTextLine",
|
| 29 |
+
"PageTextRegion",
|
| 30 |
+
"PagePage",
|
| 31 |
+
"PageDocument",
|
| 32 |
+
"parse_pagexml",
|
| 33 |
+
"PageParseError",
|
| 34 |
+
"page_document_to_text",
|
| 35 |
+
"PageToText",
|
| 36 |
+
]
|
|
@@ -0,0 +1,149 @@
|
| 1 |
+
"""Parser PAGE XML tolérant — Sprint A14-S9.
|
| 2 |
+
|
| 3 |
+
Détection auto du namespace PRIMA (plusieurs versions co-existent
|
| 4 |
+
dans la nature : ``2010-03-19``, ``2013-07-15``, ``2017-07-15``,
|
| 5 |
+
``2019-07-15``). Utilise ``defusedxml`` pour la sécurité XXE.
|
| 6 |
+
"""
|
| 7 |
+
|
| 8 |
+
from __future__ import annotations
|
| 9 |
+
|
| 10 |
+
import logging
|
| 11 |
+
import re
|
| 12 |
+
from typing import Any
|
| 13 |
+
|
| 14 |
+
import defusedxml.ElementTree as _SafeET
|
| 15 |
+
|
| 16 |
+
from picarones.domain.errors import PicaronesError
|
| 17 |
+
from picarones.formats.pagexml.types import (
|
| 18 |
+
PageDocument,
|
| 19 |
+
PagePage,
|
| 20 |
+
PageTextLine,
|
| 21 |
+
PageTextRegion,
|
| 22 |
+
)
|
| 23 |
+
|
| 24 |
+
logger = logging.getLogger(__name__)
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
class PageParseError(PicaronesError):
|
| 28 |
+
"""PAGE XML non parsable."""
|
| 29 |
+
|
| 30 |
+
|
| 31 |
+
_NS_RE = re.compile(r"^\{([^}]*)\}")
|
| 32 |
+
_LOCAL_NAME_RE = re.compile(r"\{[^}]*\}")
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
def _local(tag: str) -> str:
|
| 36 |
+
return _LOCAL_NAME_RE.sub("", tag)
|
| 37 |
+
|
| 38 |
+
|
| 39 |
+
def _detect_namespace(root_tag: str) -> str | None:
|
| 40 |
+
m = _NS_RE.match(root_tag)
|
| 41 |
+
return m.group(1) if m else None
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
def _extract_unicode(elem: Any) -> str:
|
| 45 |
+
    """Return the text of the element's own ``TextEquiv > Unicode``, or ``""``.
|
| 46 |
+
|
| 47 |
+
    Only direct ``TextEquiv`` children are read, so a nested ``Word`` (which
|
| 48 |
+
    precedes ``TextEquiv`` in a PAGE ``TextLine``) cannot shadow the line text.
|
| 49 |
+
    With several ``TextEquiv`` (OCR variants), the first ``Unicode`` found wins.
|
| 50 |
+
    """
|
| 51 |
+
    for equiv in (c for c in elem if _local(c.tag) == "TextEquiv"):
|
| 52 |
+
        for child in equiv:
|
| 53 |
+
            if _local(child.tag) == "Unicode":
|
| 54 |
+
                return (child.text or "").strip()
|
| 55 |
+
    return ""
|
| 56 |
+
|
| 57 |
+
|
| 58 |
+
def _parse_coords(elem: Any) -> str | None:
|
| 59 |
+
"""Cherche le premier ``<Coords points="...">`` enfant direct."""
|
| 60 |
+
for child in elem:
|
| 61 |
+
if _local(child.tag) == "Coords":
|
| 62 |
+
return child.attrib.get("points")
|
| 63 |
+
return None
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
def _parse_baseline(elem: Any) -> str | None:
|
| 67 |
+
for child in elem:
|
| 68 |
+
if _local(child.tag) == "Baseline":
|
| 69 |
+
return child.attrib.get("points")
|
| 70 |
+
return None
|
| 71 |
+
|
| 72 |
+
|
| 73 |
+
def _parse_text_line(elem: Any) -> PageTextLine:
|
| 74 |
+
return PageTextLine(
|
| 75 |
+
id=elem.attrib.get("id"),
|
| 76 |
+
coords=_parse_coords(elem),
|
| 77 |
+
baseline=_parse_baseline(elem),
|
| 78 |
+
text=_extract_unicode(elem),
|
| 79 |
+
)
|
| 80 |
+
|
| 81 |
+
|
| 82 |
+
def _parse_text_region(elem: Any) -> PageTextRegion:
|
| 83 |
+
lines: list[PageTextLine] = []
|
| 84 |
+
for child in elem:
|
| 85 |
+
if _local(child.tag) == "TextLine":
|
| 86 |
+
lines.append(_parse_text_line(child))
|
| 87 |
+
return PageTextRegion(
|
| 88 |
+
id=elem.attrib.get("id"),
|
| 89 |
+
coords=_parse_coords(elem),
|
| 90 |
+
region_type=elem.attrib.get("type"),
|
| 91 |
+
text_lines=tuple(lines),
|
| 92 |
+
)
|
| 93 |
+
|
| 94 |
+
|
| 95 |
+
def _parse_int_attr(elem: Any, name: str) -> int | None:
|
| 96 |
+
raw = elem.attrib.get(name)
|
| 97 |
+
if raw is None:
|
| 98 |
+
return None
|
| 99 |
+
try:
|
| 100 |
+
return int(float(raw))
|
| 101 |
+
except (ValueError, TypeError):
|
| 102 |
+
return None
|
| 103 |
+
|
| 104 |
+
|
| 105 |
+
def _parse_page(elem: Any) -> PagePage:
|
| 106 |
+
regions: list[PageTextRegion] = []
|
| 107 |
+
for child in elem:
|
| 108 |
+
if _local(child.tag) == "TextRegion":
|
| 109 |
+
regions.append(_parse_text_region(child))
|
| 110 |
+
return PagePage(
|
| 111 |
+
image_filename=elem.attrib.get("imageFilename"),
|
| 112 |
+
image_width=_parse_int_attr(elem, "imageWidth"),
|
| 113 |
+
image_height=_parse_int_attr(elem, "imageHeight"),
|
| 114 |
+
text_regions=tuple(regions),
|
| 115 |
+
)
|
| 116 |
+
|
| 117 |
+
|
| 118 |
+
def parse_pagexml(xml: bytes | str) -> PageDocument:
|
| 119 |
+
"""Parse un document PAGE XML et retourne la structure interne.
|
| 120 |
+
|
| 121 |
+
Raises
|
| 122 |
+
------
|
| 123 |
+
PageParseError
|
| 124 |
+
XML mal formé, défense XXE, ou root absent.
|
| 125 |
+
"""
|
| 126 |
+
if isinstance(xml, str):
|
| 127 |
+
xml_bytes = xml.encode("utf-8")
|
| 128 |
+
else:
|
| 129 |
+
xml_bytes = xml
|
| 130 |
+
if not xml_bytes.strip():
|
| 131 |
+
raise PageParseError("PAGE XML vide.")
|
| 132 |
+
try:
|
| 133 |
+
root = _SafeET.fromstring(xml_bytes)
|
| 134 |
+
except Exception as exc: # noqa: BLE001
|
| 135 |
+
raise PageParseError(f"XML invalide ou XXE bloqué : {exc}") from exc
|
| 136 |
+
|
| 137 |
+
if root is None:
|
| 138 |
+
raise PageParseError("PAGE sans root element.")
|
| 139 |
+
|
| 140 |
+
ns = _detect_namespace(root.tag)
|
| 141 |
+
pages: list[PagePage] = []
|
| 142 |
+
for elem in root.iter():
|
| 143 |
+
if _local(elem.tag) == "Page":
|
| 144 |
+
pages.append(_parse_page(elem))
|
| 145 |
+
|
| 146 |
+
return PageDocument(pages=tuple(pages), source_namespace=ns)
|
| 147 |
+
|
| 148 |
+
|
| 149 |
+
__all__ = ["parse_pagexml", "PageParseError"]
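The parser above tolerates several PRIMA namespace versions by comparing local tag names instead of fully qualified ones. The sketch below shows that idea in isolation. It is an illustration under assumptions: `local` and `line_texts` are hypothetical names, and the standard-library parser is used only to keep the example dependency-free, whereas the module above parses with `defusedxml`, which remains the right choice for untrusted input.

```python
import re
from xml.etree import ElementTree as ET  # stdlib stand-in for defusedxml

_NS_PREFIX = re.compile(r"^\{[^}]*\}")
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/{date}"

SAMPLE = """<PcGts xmlns="{ns}"><Page imageFilename="p.png">
<TextRegion id="r1">
<TextLine id="l1"><TextEquiv><Unicode>Bonjour</Unicode></TextEquiv></TextLine>
<TextLine id="l2"><TextEquiv><Unicode> monde </Unicode></TextEquiv></TextLine>
</TextRegion></Page></PcGts>"""


def local(tag: str) -> str:
    """Drop the ``{namespace}`` prefix that ElementTree puts on tags."""
    return _NS_PREFIX.sub("", tag)


def first_unicode(line: ET.Element) -> str:
    """Text of the line's own TextEquiv > Unicode, or an empty string."""
    for equiv in line:
        if local(equiv.tag) != "TextEquiv":
            continue
        for child in equiv:
            if local(child.tag) == "Unicode":
                return (child.text or "").strip()
    return ""


def line_texts(xml: bytes) -> list[str]:
    """Collect the text of every TextLine, whatever the namespace version."""
    root = ET.fromstring(xml)
    return [first_unicode(e) for e in root.iter() if local(e.tag) == "TextLine"]


old = SAMPLE.format(ns=PAGE_NS.format(date="2013-07-15")).encode("utf-8")
new = SAMPLE.format(ns=PAGE_NS.format(date="2019-07-15")).encode("utf-8")
print(line_texts(old), line_texts(new))
```

Because only local names are compared, the same code path handles both schema dates, as well as a document with no namespace at all.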
|
|
@@ -0,0 +1,96 @@
|
| 1 |
+
"""Projecteurs PAGE XML — Sprint A14-S9.
|
| 2 |
+
|
| 3 |
+
Convertit un ``PageDocument`` (ou un artefact ``PAGE_XML``) vers
|
| 4 |
+
d'autres types d'artefacts. Symétrique de ``formats.alto.projector``.
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
from __future__ import annotations
|
| 8 |
+
|
| 9 |
+
from picarones.domain.artifacts import Artifact, ArtifactType
|
| 10 |
+
from picarones.evaluation.projectors.base import ProjectionReport
|
| 11 |
+
from picarones.formats.pagexml.parser import PageParseError, parse_pagexml
|
| 12 |
+
from picarones.formats.pagexml.types import PageDocument
|
| 13 |
+
|
| 14 |
+
|
| 15 |
+
def page_document_to_text(document: PageDocument) -> str:
|
| 16 |
+
"""Extrait le texte plat d'un ``PageDocument``.
|
| 17 |
+
|
| 18 |
+
Convention :
|
| 19 |
+
- Ordre ``Page → TextRegion → TextLine``.
|
| 20 |
+
- Saut de ligne entre lignes d'une même région.
|
| 21 |
+
- Saut de ligne supplémentaire entre régions.
|
| 22 |
+
"""
|
| 23 |
+
page_blocks: list[str] = []
|
| 24 |
+
for page in document.pages:
|
| 25 |
+
for region in page.text_regions:
|
| 26 |
+
lines = [tl.text for tl in region.text_lines if tl.text]
|
| 27 |
+
if lines:
|
| 28 |
+
page_blocks.append("\n".join(lines))
|
| 29 |
+
return "\n\n".join(page_blocks).strip()
|
| 30 |
+
|
| 31 |
+
|
| 32 |
+
class PageToText:
|
| 33 |
+
"""Projecteur ``PAGE_XML → RAW_TEXT``."""
|
| 34 |
+
|
| 35 |
+
name = "page_to_text"
|
| 36 |
+
source_type = ArtifactType.PAGE_XML
|
| 37 |
+
target_type = ArtifactType.RAW_TEXT
|
| 38 |
+
|
| 39 |
+
def project(
|
| 40 |
+
self,
|
| 41 |
+
artifact: Artifact,
|
| 42 |
+
params: dict[str, str | int | float | bool],
|
| 43 |
+
) -> tuple[Artifact, ProjectionReport]:
|
| 44 |
+
from picarones.domain.errors import ProjectionError
|
| 45 |
+
if artifact.type != self.source_type:
|
| 46 |
+
raise ProjectionError(
|
| 47 |
+
f"PageToText n'accepte que PAGE_XML, reçu "
|
| 48 |
+
f"{artifact.type.value!r}"
|
| 49 |
+
)
|
| 50 |
+
if artifact.uri is None:
|
| 51 |
+
raise ProjectionError(
|
| 52 |
+
f"PageToText : artifact {artifact.id!r} sans URI."
|
| 53 |
+
)
|
| 54 |
+
from pathlib import Path
|
| 55 |
+
try:
|
| 56 |
+
xml_bytes = Path(artifact.uri).read_bytes()
|
| 57 |
+
except OSError as exc:
|
| 58 |
+
raise ProjectionError(
|
| 59 |
+
f"PageToText : impossible de lire {artifact.uri!r} : {exc}"
|
| 60 |
+
) from exc
|
| 61 |
+
|
| 62 |
+
try:
|
| 63 |
+
doc = parse_pagexml(xml_bytes)
|
| 64 |
+
except PageParseError as exc:
|
| 65 |
+
raise ProjectionError(f"PageToText : {exc}") from exc
|
| 66 |
+
|
| 67 |
+
text = page_document_to_text(doc)
|
| 68 |
+
|
| 69 |
+
target = Artifact(
|
| 70 |
+
id=f"{artifact.id}:projected_text",
|
| 71 |
+
document_id=artifact.document_id,
|
| 72 |
+
type=self.target_type,
|
| 73 |
+
produced_by_step=artifact.produced_by_step,
|
| 74 |
+
)
|
| 75 |
+
report = ProjectionReport(
|
| 76 |
+
source_artifact_id=artifact.id,
|
| 77 |
+
source_type=self.source_type,
|
| 78 |
+
target_type=self.target_type,
|
| 79 |
+
projector_name=self.name,
|
| 80 |
+
lossy=True,
|
| 81 |
+
ignored_dimensions=(
|
| 82 |
+
"geometry",
|
| 83 |
+
"region_structure",
|
| 84 |
+
"baseline",
|
| 85 |
+
"ids",
|
| 86 |
+
),
|
| 87 |
+
warnings=(
|
| 88 |
+
"L'extraction texte PAGE ignore les coordonnées et "
|
| 89 |
+
"la structure en régions. Plusieurs TextEquiv (variantes "
|
| 90 |
+
"d'OCR) sont collapsées au premier Unicode rencontré.",
|
| 91 |
+
),
|
| 92 |
+
)
|
| 93 |
+
return target, report
|
| 94 |
+
|
| 95 |
+
|
| 96 |
+
__all__ = ["page_document_to_text", "PageToText"]
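The joining convention used by `page_document_to_text` (one newline between lines of a region, a blank line between regions, empty lines and empty regions dropped) can be checked on plain lists. This is a small sketch with a hypothetical `regions_to_text` helper that takes pages as lists of regions and regions as lists of line strings; it follows the loop shown above but is not the project's function.

```python
def regions_to_text(pages: list[list[list[str]]]) -> str:
    """Flatten pages -> regions -> lines into plain text."""
    blocks: list[str] = []
    for regions in pages:
        for lines in regions:
            kept = [line for line in lines if line]  # drop empty lines
            if kept:                                 # drop empty regions
                blocks.append("\n".join(kept))
    # Regions are separated by a blank line, also across page boundaries.
    return "\n\n".join(blocks).strip()


pages = [
    [["Titre"], ["ligne 1", "", "ligne 2"]],  # page 1: two regions
    [[""], ["ligne 3"]],                      # page 2: one empty region
]
print(regions_to_text(pages))
```

Note that a page boundary produces the same separator as a region boundary, so the flat text does not record where one page ends and the next begins.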
|
|
@@ -0,0 +1,82 @@
|
| 1 |
+
"""Internal PAGE XML structures, Sprint A14-S9.
|
| 2 |
+
|
| 3 |
+
Typed, immutable representation of a PAGE XML document (PRIMA /
|
| 4 |
+
Transkribus / eScriptorium). Mirrors ``formats.alto.types``
|
| 5 |
+
but follows PAGE conventions:
|
| 6 |
+
|
| 7 |
+
- ``Coords`` instead of ``HPOS/VPOS/WIDTH/HEIGHT``: a string of points
|
| 8 |
+
``"x1,y1 x2,y2 ..."`` describing a polygon.
|
| 9 |
+
- ``Baseline`` (optional): the polyline on which the text of a line
|
| 10 |
+
sits, typically present in HTR output for manuscripts.
|
| 11 |
+
- ``TextEquiv > Unicode`` instead of the ALTO ``CONTENT`` attribute.
|
| 12 |
+
|
| 13 |
+
Avoiding over-engineering
|
| 14 |
+
-------------------------
|
| 15 |
+
- No support for PAGE ``Word``/``Glyph`` (finer granularity than the
|
| 16 |
+
line) in S9: most heritage PAGE tools work at ``TextLine``
|
| 17 |
+
granularity. A separate ``Word`` type can be
|
| 18 |
+
added once a caller needs it.
|
| 19 |
+
- Coordinates are stored as a raw ``points`` string. ``points_to_bbox()`` is
|
| 20 |
+
not defined by the parser in this commit; callers compute the bbox themselves.
|
| 21 |
+
"""
|
| 22 |
+
|
| 23 |
+
from __future__ import annotations
|
| 24 |
+
|
| 25 |
+
from pydantic import BaseModel, ConfigDict, Field
|
| 26 |
+
|
| 27 |
+
|
| 28 |
+
class PageTextLine(BaseModel):
|
| 29 |
+
"""Une ligne PAGE (élément ``<TextLine>``)."""
|
| 30 |
+
|
| 31 |
+
model_config = ConfigDict(frozen=True, extra="forbid")
|
| 32 |
+
|
| 33 |
+
id: str | None = Field(default=None, max_length=128)
|
| 34 |
+
coords: str | None = Field(default=None, max_length=4096)
|
| 35 |
+
"""Polygone en format PAGE : ``"x1,y1 x2,y2 x3,y3 ..."``."""
|
| 36 |
+
baseline: str | None = Field(default=None, max_length=2048)
|
| 37 |
+
"""Polyline baseline (optionnelle, typique HTR)."""
|
| 38 |
+
text: str = ""
|
| 39 |
+
"""Texte de la ligne extrait de ``TextEquiv > Unicode``."""
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
class PageTextRegion(BaseModel):
|
| 43 |
+
"""Région de texte PAGE (élément ``<TextRegion>``)."""
|
| 44 |
+
|
| 45 |
+
model_config = ConfigDict(frozen=True, extra="forbid")
|
| 46 |
+
|
| 47 |
+
id: str | None = Field(default=None, max_length=128)
|
| 48 |
+
coords: str | None = Field(default=None, max_length=4096)
|
| 49 |
+
region_type: str | None = Field(default=None, max_length=64)
|
| 50 |
+
"""Type sémantique PAGE : ``"paragraph"``, ``"heading"``,
|
| 51 |
+
``"caption"``, ``"footnote"``, etc. Préservé tel quel sans
|
| 52 |
+
enum (les valeurs PRIMA peuvent être étendues)."""
|
| 53 |
+
text_lines: tuple[PageTextLine, ...] = Field(default_factory=tuple)
|
| 54 |
+
|
| 55 |
+
|
| 56 |
+
class PagePage(BaseModel):
|
| 57 |
+
"""Une page PAGE (élément ``<Page>``)."""
|
| 58 |
+
|
| 59 |
+
model_config = ConfigDict(frozen=True, extra="forbid")
|
| 60 |
+
|
| 61 |
+
image_filename: str | None = Field(default=None, max_length=512)
|
| 62 |
+
image_width: int | None = Field(default=None, ge=0)
|
| 63 |
+
image_height: int | None = Field(default=None, ge=0)
|
| 64 |
+
text_regions: tuple[PageTextRegion, ...] = Field(default_factory=tuple)
|
| 65 |
+
|
| 66 |
+
|
| 67 |
+
class PageDocument(BaseModel):
|
| 68 |
+
"""Document PAGE XML complet (peut contenir une seule page)."""
|
| 69 |
+
|
| 70 |
+
model_config = ConfigDict(frozen=True, extra="forbid")
|
| 71 |
+
|
| 72 |
+
pages: tuple[PagePage, ...] = Field(default_factory=tuple)
|
| 73 |
+
source_namespace: str | None = Field(default=None, max_length=256)
|
| 74 |
+
"""Full namespace URI detected at parse time, e.g. ``http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15``; ``None`` if the root has no namespace."""
|
| 75 |
+
|
| 76 |
+
|
| 77 |
+
__all__ = [
|
| 78 |
+
"PageTextLine",
|
| 79 |
+
"PageTextRegion",
|
| 80 |
+
"PagePage",
|
| 81 |
+
"PageDocument",
|
| 82 |
+
]
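The module docstring of these types mentions a `points_to_bbox()` helper, but the parser shown in this diff exports only `parse_pagexml` and `PageParseError`. As an illustration of what such a helper could look like, here is a hypothetical sketch; the name, the return shape `(x, y, width, height)` and the choice to return `None` on malformed input are all assumptions, not the project's API.

```python
from __future__ import annotations


def points_to_bbox(points: str) -> tuple[int, int, int, int] | None:
    """Return (x_min, y_min, width, height) for a PAGE ``points`` string.

    ``points`` is expected as ``"x1,y1 x2,y2 ..."``. Returns ``None`` when
    the string is empty or a pair cannot be parsed.
    """
    xs: list[float] = []
    ys: list[float] = []
    for pair in points.split():
        x_raw, sep, y_raw = pair.partition(",")
        if not sep:
            return None  # no comma: not an "x,y" pair
        try:
            xs.append(float(x_raw))
            ys.append(float(y_raw))
        except ValueError:
            return None
    if not xs:
        return None
    x_min, y_min = min(xs), min(ys)
    return int(x_min), int(y_min), int(max(xs) - x_min), int(max(ys) - y_min)


print(points_to_bbox("10,20 110,20 110,70 10,70"))
```

The result uses the same four quantities as ALTO's `HPOS` / `VPOS` / `WIDTH` / `HEIGHT`, which would make a PAGE polygon comparable to an `AltoBBox` if that is ever needed.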
|
|
@@ -1,21 +1,47 @@
|
|
| 1 |
"""Normalisation et manipulation de texte.
|
| 2 |
|
| 3 |
-
|
| 4 |
-
sans modification de
|
| 5 |
|
| 6 |
-
|
| 7 |
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
sans_ponctuation, sans_apostrophes). Tables diplomatiques.
|
| 12 |
-
Exclusion de caractères.
|
| 13 |
-
|
| 14 |
-
Règle : ce module ne fait **pas** d'extraction depuis ALTO/PAGE
|
| 15 |
-
(c'est le rôle des projecteurs). Il prend une chaîne en entrée,
|
| 16 |
applique un profil, retourne une chaîne.
|
| 17 |
"""
|
| 18 |
|
| 19 |
from __future__ import annotations
|
| 20 |
|
| 21 |
-
|
| 1 |
"""Normalisation et manipulation de texte.
|
| 2 |
|
| 3 |
+
Sprint A14-S9 livre ``normalization.py``, déplacé depuis
|
| 4 |
+
``picarones/measurements/normalization.py`` sans modification de
|
| 5 |
+
logique. L'ancien emplacement reste un re-export pour ne pas
|
| 6 |
+
casser les ~50 consommateurs (sera retiré au S22).
|
| 7 |
|
| 8 |
+
11 profils intégrés : ``nfc``, ``caseless``, ``minimal``,
|
| 9 |
+
``medieval_french``, ``early_modern_french``, ``medieval_latin``,
|
| 10 |
+
``medieval_english``, ``early_modern_english``, ``secretary_hand``,
|
| 11 |
+
``sans_ponctuation``, ``sans_apostrophes``.
|
| 12 |
|
| 13 |
+
Règle architecturale : ce module ne fait **pas** d'extraction depuis
|
| 14 |
+
ALTO/PAGE (c'est le rôle des projecteurs dans
|
| 15 |
+
``picarones.evaluation.projectors``). Il prend une chaîne en entrée,
|
| 16 |
applique un profil, retourne une chaîne.
|
| 17 |
"""
|
| 18 |
|
| 19 |
from __future__ import annotations
|
| 20 |
|
| 21 |
+
from picarones.formats.text.normalization import (
|
| 22 |
+
DEFAULT_DIPLOMATIC_PROFILE,
|
| 23 |
+
DIPLOMATIC_EN_EARLY_MODERN,
|
| 24 |
+
DIPLOMATIC_EN_MEDIEVAL,
|
| 25 |
+
DIPLOMATIC_EN_SECRETARY,
|
| 26 |
+
DIPLOMATIC_FR_EARLY_MODERN,
|
| 27 |
+
DIPLOMATIC_FR_MEDIEVAL,
|
| 28 |
+
DIPLOMATIC_LATIN_MEDIEVAL,
|
| 29 |
+
DIPLOMATIC_MINIMAL,
|
| 30 |
+
NORMALIZATION_PROFILES,
|
| 31 |
+
NormalizationProfile,
|
| 32 |
+
get_builtin_profile,
|
| 33 |
+
)
|
| 34 |
+
|
| 35 |
+
__all__ = [
|
| 36 |
+
"NormalizationProfile",
|
| 37 |
+
"NORMALIZATION_PROFILES",
|
| 38 |
+
"DEFAULT_DIPLOMATIC_PROFILE",
|
| 39 |
+
"get_builtin_profile",
|
| 40 |
+
"DIPLOMATIC_FR_MEDIEVAL",
|
| 41 |
+
"DIPLOMATIC_FR_EARLY_MODERN",
|
| 42 |
+
"DIPLOMATIC_LATIN_MEDIEVAL",
|
| 43 |
+
"DIPLOMATIC_MINIMAL",
|
| 44 |
+
"DIPLOMATIC_EN_EARLY_MODERN",
|
| 45 |
+
"DIPLOMATIC_EN_MEDIEVAL",
|
| 46 |
+
"DIPLOMATIC_EN_SECRETARY",
|
| 47 |
+
]
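The architectural rule stated above (string in, profile applied, string out) can be illustrated with a few lines of standard-library code. This is a minimal sketch, not the module's implementation: `normalize` is a hypothetical function, and the table is a subset of the early modern French equivalences (long s, ligatures, ampersand). It applies NFC first, optionally casefolds, then maps characters in a single pass with `str.translate`.

```python
import unicodedata

# Subset of a diplomatic table: each key is a single character.
FR_EARLY_MODERN_SUBSET: dict[str, str] = {
    "ſ": "s",   # long s
    "æ": "ae",  # ligature
    "œ": "oe",  # ligature
    "&": "et",  # ampersand
}


def normalize(text: str, table: dict[str, str], caseless: bool = False) -> str:
    """NFC, optional casefold, then a single-pass character mapping."""
    text = unicodedata.normalize("NFC", text)
    if caseless:
        text = text.casefold()
    # str.translate maps each input character once, so a rule's output is
    # never fed back into another rule (no y -> i -> j chaining).
    return text.translate(str.maketrans(table))


ground_truth = "Preſque tout le cœur & l'ame"
ocr_output = "Presque tout le coeur et l'ame"
same = normalize(ground_truth, FR_EARLY_MODERN_SUBSET) == normalize(
    ocr_output, FR_EARLY_MODERN_SUBSET
)
print(same)
```

Applying the same table to both the ground truth and the OCR output is what lets the diplomatic CER ignore known graphic variants while still counting real transcription errors. Whether the actual module maps characters in one pass or applies rules sequentially is not shown in this excerpt.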
|
|
@@ -0,0 +1,420 @@
|
| 1 |
+
"""Profils de normalisation unicode pour le calcul du CER diplomatique.
|
| 2 |
+
|
| 3 |
+
La normalisation diplomatique permet de calculer un CER tenant compte des
|
| 4 |
+
équivalences graphiques propres aux documents historiques : ſ=s, u=v, i=j, etc.
|
| 5 |
+
|
| 6 |
+
En appliquant la même table aux deux textes (GT et OCR), on mesure les erreurs
|
| 7 |
+
"substantielles" (transcription erronée) en ignorant les variations graphiques
|
| 8 |
+
codifiées connues.
|
| 9 |
+
|
| 10 |
+
Trois niveaux de normalisation sont disponibles :
|
| 11 |
+
|
| 12 |
+
1. NFC : normalisation Unicode canonique (décomposition+recomposition)
|
| 13 |
+
2. caseless : NFC + pliage de casse (casefold)
|
| 14 |
+
3. diplomatic: NFC + table de correspondances historiques configurables
|
| 15 |
+
|
| 16 |
+
Les profils préconfigurés couvrent les cas d'usage patrimoniaux courants.
|
| 17 |
+
Ils sont également chargeables depuis un fichier YAML.
|
| 18 |
+
|
| 19 |
+
Exemple YAML
|
| 20 |
+
------------
|
| 21 |
+
name: medieval_custom
|
| 22 |
+
caseless: false
|
| 23 |
+
diplomatic:
|
| 24 |
+
ſ: s
|
| 25 |
+
u: v
|
| 26 |
+
i: j
|
| 27 |
+
y: i
|
| 28 |
+
æ: ae
|
| 29 |
+
œ: oe
|
| 30 |
+
"""
|
| 31 |
+
|
| 32 |
+
from __future__ import annotations
|
| 33 |
+
|
| 34 |
+
import unicodedata
|
| 35 |
+
from dataclasses import dataclass, field
|
| 36 |
+
from pathlib import Path
|
| 37 |
+
|
| 38 |
+
|
| 39 |
+
# ---------------------------------------------------------------------------
|
| 40 |
+
# Tables de correspondances diplomatiques préconfigurées
|
| 41 |
+
# ---------------------------------------------------------------------------
|
| 42 |
+
|
| 43 |
+
#: Français médiéval (XIIe–XVe siècle)
|
| 44 |
+
DIPLOMATIC_FR_MEDIEVAL: dict[str, str] = {
|
| 45 |
+
"ſ": "s", # s long → s
|
| 46 |
+
"u": "v", # u/v interchangeable; the mapping applies in every position
|
| 47 |
+
"i": "j", # i/j interchangeables
|
| 48 |
+
"y": "i", # y vocalique → i
|
| 49 |
+
"æ": "ae", # ligature æ
|
| 50 |
+
"œ": "oe", # ligature œ
|
| 51 |
+
"ꝑ": "per", # abréviation per/par
|
| 52 |
+
"ꝓ": "pro", # abréviation pro
|
| 53 |
+
"\u0026": "et", # & → et
|
| 54 |
+
}
|
| 55 |
+
|
| 56 |
+
#: Français moderne / imprimés anciens (XVIe–XVIIIe siècle)
|
| 57 |
+
DIPLOMATIC_FR_EARLY_MODERN: dict[str, str] = {
|
| 58 |
+
"ſ": "s", # s long
|
| 59 |
+
"æ": "ae",
|
| 60 |
+
"œ": "oe",
|
| 61 |
+
"\u0026": "et",
|
| 62 |
+
"ỹ": "yn", # y tilde
|
| 63 |
+
}
|
| 64 |
+
|
| 65 |
+
#: Latin médiéval
|
| 66 |
+
DIPLOMATIC_LATIN_MEDIEVAL: dict[str, str] = {
|
| 67 |
+
"ſ": "s",
|
| 68 |
+
"u": "v",
|
| 69 |
+
"i": "j",
|
| 70 |
+
"y": "i",
|
| 71 |
+
"æ": "ae",
|
| 72 |
+
"œ": "oe",
|
| 73 |
+
"ꝑ": "per",
|
| 74 |
+
"ꝓ": "pro",
|
| 75 |
+
"ꝗ": "que", # q barré → que
|
| 76 |
+
"\u0026": "et",
|
| 77 |
+
}
|
| 78 |
+
|
| 79 |
+
#: Profil minimal — uniquement NFC + s long
|
| 80 |
+
DIPLOMATIC_MINIMAL: dict[str, str] = {
|
| 81 |
+
"ſ": "s",
|
| 82 |
+
}
|
| 83 |
+
|
| 84 |
+
#: Anglais moderne / imprimés anciens (XVIe–XVIIIe siècle)
|
| 85 |
+
#: Orthographe «early modern» : ſ=s, u/v, i/j, vv=w, þ=th, ð=th, ȝ=y
|
| 86 |
+
DIPLOMATIC_EN_EARLY_MODERN: dict[str, str] = {
|
| 87 |
+
"ſ": "s", # s long → s
|
| 88 |
+
"u": "v", # u/v interchangeables (vpon → upon)
|
| 89 |
+
"i": "j", # i/j interchangeables (ioy → joy)
|
| 90 |
+
"vv": "w", # vv → w (vvhich → which)
|
| 91 |
+
"þ": "th", # thorn → th
|
| 92 |
+
"ð": "th", # eth → th
|
| 93 |
+
"ȝ": "y", # yogh → y
|
| 94 |
+
"æ": "ae", # ligature æ
|
| 95 |
+
"œ": "oe", # ligature œ
|
| 96 |
+
"\u0026": "and", # & → and
|
| 97 |
+
}
|
| 98 |
+
|
| 99 |
+
#: Anglais médiéval (XIIe–XVe siècle) — abréviations manuscrites incluses
|
| 100 |
+
DIPLOMATIC_EN_MEDIEVAL: dict[str, str] = {
|
| 101 |
+
"ſ": "s",
|
| 102 |
+
"u": "v",
|
| 103 |
+
"i": "j",
|
| 104 |
+
"vv": "w",
|
| 105 |
+
"þ": "th",
|
| 106 |
+
"ð": "th",
|
| 107 |
+
"ȝ": "y",
|
| 108 |
+
"æ": "ae",
|
| 109 |
+
"œ": "oe",
|
| 110 |
+
"\u0026": "and",
|
| 111 |
+
# Abréviations courantes dans les manuscrits anglais médiévaux
|
| 112 |
+
"ꝑ": "per", # p barré → per/par
|
| 113 |
+
"ꝓ": "pro", # p crocheté → pro
|
| 114 |
+
"ꝗ": "que", # q barré → que
|
| 115 |
+
"\ua75b": "r", # lettre r rotunda → r
|
| 116 |
+
}
|
| 117 |
+
|
| 118 |
+
#: Écriture secrétaire (XVIe–XVIIe siècle) — secretary hand
|
| 119 |
+
#: Confusions visuelles propres à l'écriture cursive anglaise
|
| 120 |
+
DIPLOMATIC_EN_SECRETARY: dict[str, str] = {
|
| 121 |
+
"ſ": "s",
|
| 122 |
+
"u": "v",
|
| 123 |
+
"i": "j",
|
| 124 |
+
"vv": "w",
|
| 125 |
+
"þ": "th",
|
| 126 |
+
"ð": "th",
|
| 127 |
+
"ȝ": "y",
|
| 128 |
+
"\u0026": "and",
|
| 129 |
+
# Confusions visuelles typiques : e/c, n/u, m/w en secrétaire
|
| 130 |
+
# Note : ne pas normaliser e/c automatiquement (trop agressif) ;
|
| 131 |
+
# on se limite aux substituts graphiques historiquement documentés
|
| 132 |
+
}
|
| 133 |
+
|
| 134 |
+
|
| 135 |
+
# ---------------------------------------------------------------------------
|
| 136 |
+
# Profil de normalisation
|
| 137 |
+
# ---------------------------------------------------------------------------
|
| 138 |
+
|
| 139 |
+
@dataclass
|
| 140 |
+
class NormalizationProfile:
|
| 141 |
+
"""Describe a normalization strategy for computing the diplomatic CER.
|
| 142 |
+
|
| 143 |
+
Parameters
|
| 144 |
+
----------
|
| 145 |
+
name:
|
| 146 |
+
Human-readable profile identifier (e.g. ``"medieval_french"``).
|
| 147 |
+
nfc:
|
| 148 |
+
Apply Unicode NFC normalization (recommended, enabled by default).
|
| 149 |
+
caseless:
|
| 150 |
+
Case folding (casefold), applied after NFC.
|
| 151 |
+
diplomatic_table:
|
| 152 |
+
Table of historical graphic correspondences applied to both texts
|
| 153 |
+
before the CER is computed; keys may span several characters (``"vv"``).
|
| 154 |
+
exclude_chars:
|
| 155 |
+
Set of characters removed from both texts (GT and OCR) before any
|
| 156 |
+
metric is computed (CER, WER, MER, WIL and diplomatic CER).
|
| 157 |
+
Useful for ignoring punctuation or apostrophes.
|
| 158 |
+
description:
|
| 159 |
+
Short description of the profile (shown in the HTML report).
|
| 160 |
+
"""
|
| 161 |
+
|
| 162 |
+
name: str
|
| 163 |
+
nfc: bool = True
|
| 164 |
+
caseless: bool = False
|
| 165 |
+
diplomatic_table: dict[str, str] = field(default_factory=dict)
|
| 166 |
+
exclude_chars: frozenset = field(default_factory=frozenset)
|
| 167 |
+
description: str = ""
|
| 168 |
+
|
| 169 |
+
def normalize(self, text: str) -> str:
|
| 170 |
+
"""Applique le profil de normalisation à un texte."""
|
| 171 |
+
if self.exclude_chars:
|
| 172 |
+
text = "".join(c for c in text if c not in self.exclude_chars)
|
| 173 |
+
if self.nfc:
|
| 174 |
+
text = unicodedata.normalize("NFC", text)
|
| 175 |
+
if self.caseless:
|
| 176 |
+
text = text.casefold()
|
| 177 |
+
if self.diplomatic_table:
|
| 178 |
+
text = _apply_diplomatic_table(text, self.diplomatic_table)
|
| 179 |
+
return text
|
| 180 |
+
|
| 181 |
+
def as_dict(self) -> dict:
|
| 182 |
+
return {
|
| 183 |
+
"name": self.name,
|
| 184 |
+
"nfc": self.nfc,
|
| 185 |
+
"caseless": self.caseless,
|
| 186 |
+
"diplomatic_table": self.diplomatic_table,
|
| 187 |
+
"exclude_chars": sorted(self.exclude_chars),
|
| 188 |
+
"description": self.description,
|
| 189 |
+
}
|
| 190 |
+
|
| 191 |
+
@classmethod
|
| 192 |
+
def from_yaml(cls, path: str | Path) -> "NormalizationProfile":
|
| 193 |
+
"""Charge un profil depuis un fichier YAML.
|
| 194 |
+
|
| 195 |
+
All keys are optional: ``name`` (defaults to the file stem), ``nfc``,
|
| 196 |
+
``caseless``, ``description``, ``diplomatic`` (a str→str mapping) and
|
| 197 |
+
``exclude_chars`` (a list or a string of characters to ignore).
|
| 198 |
+
|
| 199 |
+
Example
|
| 200 |
+
-------
|
| 201 |
+
.. code-block:: yaml
|
| 202 |
+
|
| 203 |
+
name: medieval_custom
|
| 204 |
+
caseless: false
|
| 205 |
+
description: Français médiéval personnalisé
|
| 206 |
+
exclude_chars: ".,;:!?"
|
| 207 |
+
diplomatic:
|
| 208 |
+
ſ: s
|
| 209 |
+
u: v
|
| 210 |
+
"""
|
| 211 |
+
try:
|
| 212 |
+
import yaml
|
| 213 |
+
except ImportError as exc:
|
| 214 |
+
raise RuntimeError(
|
| 215 |
+
"Le package 'pyyaml' est requis pour charger les profils YAML. "
|
| 216 |
+
"Installez-le avec : pip install pyyaml"
|
| 217 |
+
) from exc
|
| 218 |
+
|
| 219 |
+
data = yaml.safe_load(Path(path).read_text(encoding="utf-8"))
|
| 220 |
+
return cls(
|
| 221 |
+
name=data.get("name", Path(path).stem),
|
| 222 |
+
nfc=bool(data.get("nfc", True)),
|
| 223 |
+
caseless=bool(data.get("caseless", False)),
|
| 224 |
+
diplomatic_table=data.get("diplomatic", {}),
|
| 225 |
+
exclude_chars=_parse_exclude_chars(data.get("exclude_chars", "")),
|
| 226 |
+
description=data.get("description", ""),
|
| 227 |
+
)
|
| 228 |
+
|
| 229 |
+
@classmethod
|
| 230 |
+
def from_dict(cls, data: dict) -> "NormalizationProfile":
|
| 231 |
+
"""Charge un profil depuis un dictionnaire (ex : section YAML inline)."""
|
| 232 |
+
return cls(
|
| 233 |
+
name=data.get("name", "custom"),
|
| 234 |
+
nfc=bool(data.get("nfc", True)),
|
| 235 |
+
caseless=bool(data.get("caseless", False)),
|
| 236 |
+
diplomatic_table=data.get("diplomatic", {}),
|
| 237 |
+
exclude_chars=_parse_exclude_chars(data.get("exclude_chars", "")),
|
| 238 |
+
description=data.get("description", ""),
|
| 239 |
+
)
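The order of the steps in ``normalize`` matters: exclusion runs first, then NFC, then case folding, then the diplomatic table. The sketch below restates that order with the standard library only. ``normalize_sketch`` is a hypothetical stand-in written for illustration, not the project's API, and it handles single-character table keys only.

```python
import unicodedata


def normalize_sketch(text, exclude_chars=frozenset(), nfc=True,
                     caseless=False, table=None):
    """Hypothetical stand-in that follows the step order of ``normalize``."""
    # 1. Excluded characters are dropped first, before any Unicode composition.
    if exclude_chars:
        text = "".join(c for c in text if c not in exclude_chars)
    # 2. Canonical composition (NFC).
    if nfc:
        text = unicodedata.normalize("NFC", text)
    # 3. Case folding.
    if caseless:
        text = text.casefold()
    # 4. Diplomatic mapping (single-character keys only in this sketch).
    if table:
        text = "".join(table.get(c, c) for c in text)
    return text


# "été" written with combining acute accents is recomposed by NFC.
composed = normalize_sketch("e\u0301te\u0301")

# casefold() already maps the long s to "s", independently of any table.
folded = normalize_sketch("Meſſe", caseless=True)

# Without case folding, the table does the same job and keeps the capital.
mapped = normalize_sketch("Meſſe", table={"ſ": "s"})

# Exclusion happens before NFC: a precomposed "é" in exclude_chars does not
# match a decomposed input, so the accented letter survives.
survives = normalize_sketch("e\u0301", exclude_chars=frozenset({"\u00e9"}))
```

Two consequences follow from this ordering: characters listed in ``exclude_chars`` should be given in the same form as the input text, and table keys containing uppercase letters cannot match once ``caseless`` is enabled.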
|
| 240 |
+
|
| 241 |
+
|
| 242 |
+
# ---------------------------------------------------------------------------
|
| 243 |
+
# Profils préconfigurés
|
| 244 |
+
# ---------------------------------------------------------------------------
|
| 245 |
+
|
| 246 |
+
NORMALIZATION_PROFILES: dict[str, NormalizationProfile] = {
|
| 247 |
+
"nfc": NormalizationProfile(
|
| 248 |
+
name="nfc",
|
| 249 |
+
nfc=True,
|
| 250 |
+
caseless=False,
|
| 251 |
+
diplomatic_table={},
|
| 252 |
+
description="Normalisation NFC uniquement",
|
| 253 |
+
),
|
| 254 |
+
"caseless": NormalizationProfile(
|
| 255 |
+
name="caseless",
|
| 256 |
+
nfc=True,
|
| 257 |
+
caseless=True,
|
| 258 |
+
diplomatic_table={},
|
| 259 |
+
description="NFC + insensible à la casse",
|
| 260 |
+
),
|
| 261 |
+
"minimal": NormalizationProfile(
|
| 262 |
+
name="minimal",
|
| 263 |
+
nfc=True,
|
| 264 |
+
caseless=False,
|
| 265 |
+
diplomatic_table=DIPLOMATIC_MINIMAL,
|
| 266 |
+
description="Minimal : NFC + s long seulement",
|
| 267 |
+
),
|
| 268 |
+
"medieval_french": NormalizationProfile(
|
| 269 |
+
name="medieval_french",
|
| 270 |
+
nfc=True,
|
| 271 |
+
caseless=False,
|
| 272 |
+
diplomatic_table=DIPLOMATIC_FR_MEDIEVAL,
|
| 273 |
+
description="Français médiéval (XIIe–XVe) : ſ=s, u=v, i=j, æ=ae, œ=oe",
|
| 274 |
+
),
|
| 275 |
+
"early_modern_french": NormalizationProfile(
|
| 276 |
+
name="early_modern_french",
|
| 277 |
+
nfc=True,
|
| 278 |
+
caseless=False,
|
| 279 |
+
diplomatic_table=DIPLOMATIC_FR_EARLY_MODERN,
|
| 280 |
+
description="Imprimés anciens (XVIe–XVIIIe) : ſ=s, æ=ae, œ=oe",
|
| 281 |
+
),
|
| 282 |
+
"medieval_latin": NormalizationProfile(
|
| 283 |
+
name="medieval_latin",
|
| 284 |
+
nfc=True,
|
| 285 |
+
caseless=False,
|
| 286 |
+
diplomatic_table=DIPLOMATIC_LATIN_MEDIEVAL,
|
| 287 |
+
description="Latin médiéval : ſ=s, u=v, i=j, ꝑ=per, ꝓ=pro",
|
| 288 |
+
),
|
| 289 |
+
"early_modern_english": NormalizationProfile(
|
| 290 |
+
name="early_modern_english",
|
| 291 |
+
nfc=True,
|
| 292 |
+
caseless=False,
|
| 293 |
+
diplomatic_table=DIPLOMATIC_EN_EARLY_MODERN,
|
| 294 |
+
description="Early Modern English (XVIth–XVIIIth c.): ſ=s, u=v, i=j, vv=w, þ=th, ð=th, ȝ=y",
|
| 295 |
+
),
|
| 296 |
+
"medieval_english": NormalizationProfile(
|
| 297 |
+
name="medieval_english",
|
| 298 |
+
nfc=True,
|
| 299 |
+
caseless=False,
|
| 300 |
+
diplomatic_table=DIPLOMATIC_EN_MEDIEVAL,
|
| 301 |
+
description="Medieval English (XIIth–XVth c.): ſ=s, u=v, i=j, þ=th, ȝ=y, ꝑ=per, ꝓ=pro",
|
| 302 |
+
),
|
| 303 |
+
"secretary_hand": NormalizationProfile(
|
| 304 |
+
name="secretary_hand",
|
| 305 |
+
nfc=True,
|
| 306 |
+
caseless=False,
|
| 307 |
+
diplomatic_table=DIPLOMATIC_EN_SECRETARY,
|
| 308 |
+
description="Secretary hand (XVIth–XVIIth c.): ſ=s, u=v, i=j, vv=w, þ=th, ð=th, ȝ=y",
|
| 309 |
+
),
|
| 310 |
+
# ── Profils d'exclusion de caractères ────────────────────────────────
|
| 311 |
+
"sans_ponctuation": NormalizationProfile(
|
| 312 |
+
name="sans_ponctuation",
|
| 313 |
+
nfc=True,
|
| 314 |
+
caseless=False,
|
| 315 |
+
diplomatic_table={},
|
| 316 |
+
exclude_chars=frozenset(". , ; : ! ? ' \u2019 \" - \u2013 \u2014 ( ) [ ]".split()),
|
| 317 |
+
description="NFC + suppression de la ponctuation courante : . , ; : ! ? ' \" - – — ( ) [ ]",
|
| 318 |
+
),
|
| 319 |
+
"sans_apostrophes": NormalizationProfile(
|
| 320 |
+
name="sans_apostrophes",
|
| 321 |
+
nfc=True,
|
| 322 |
+
caseless=False,
|
| 323 |
+
diplomatic_table={},
|
| 324 |
+
exclude_chars=frozenset(["'", "\u2019"]), # apostrophe droite + apostrophe typographique
|
| 325 |
+
description="NFC + suppression des apostrophes droite (') et typographique (\u2019)",
|
| 326 |
+
),
|
| 327 |
+
}
|
| 328 |
+
|
| 329 |
+
|
| 330 |
+
def get_builtin_profile(name: str) -> NormalizationProfile:
|
| 331 |
+
"""Return a preconfigured profile by its identifier.
|
| 332 |
+
|
| 333 |
+
Identifiants disponibles
|
| 334 |
+
------------------------
|
| 335 |
+
- ``"medieval_french"`` : français médiéval XIIe–XVe (ſ=s, u=v, i=j, æ=ae, œ=oe…)
|
| 336 |
+
- ``"early_modern_french"`` : imprimés anciens XVIe–XVIIIe (ſ=s, œ=oe, æ=ae…)
|
| 337 |
+
- ``"medieval_latin"`` : latin médiéval (ſ=s, u=v, i=j, ꝑ=per, ꝓ=pro…)
|
| 338 |
+
- ``"early_modern_english"`` : anglais imprimé XVIe–XVIIIe (ſ=s, u=v, i=j, vv=w, þ=th, ð=th, ȝ=y)
|
| 339 |
+
- ``"medieval_english"`` : anglais manuscrit XIIe–XVe (+ abréviations ꝑ, ꝓ…)
|
| 340 |
+
- ``"secretary_hand"`` : écriture secrétaire anglaise XVIe–XVIIe (cursive administrative)
|
| 341 |
+
- ``"minimal"`` : uniquement NFC + s long
|
| 342 |
+
- ``"nfc"`` : NFC seul (sans table diplomatique)
|
| 343 |
+
- ``"caseless"`` : NFC + pliage de casse
|
| 344 |
+
|
| 345 |
+
Raises
|
| 346 |
+
------
|
| 347 |
+
KeyError
|
| 348 |
+
Si le nom n'est pas reconnu.
|
| 349 |
+
"""
|
| 350 |
+
if name not in NORMALIZATION_PROFILES:
|
| 351 |
+
raise KeyError(
|
| 352 |
+
f"Profil de normalisation inconnu : '{name}'. "
|
| 353 |
+
f"Disponibles : {', '.join(NORMALIZATION_PROFILES)}"
|
| 354 |
+
)
|
| 355 |
+
return NORMALIZATION_PROFILES[name]
|
| 356 |
+
|
| 357 |
+
|
| 358 |
+
# ---------------------------------------------------------------------------
|
| 359 |
+
# Fonctions utilitaires
|
| 360 |
+
# ---------------------------------------------------------------------------
|
| 361 |
+
|
| 362 |
+
def _parse_exclude_chars(value: "str | list | None") -> frozenset:
|
| 363 |
+
"""Convertit une liste de caractères (str ou list) en frozenset.
|
| 364 |
+
|
| 365 |
+
Accepte :
|
| 366 |
+
- Une chaîne de caractères séparés par une virgule+espace (ex. ``"', -, –"``)
|
| 367 |
+
ou simplement concaténés sans séparateur (ex. ``".,;:!?"``)
|
| 368 |
+
- Une liste Python/YAML de chaînes (chacune un caractère)
|
| 369 |
+
- None ou chaîne vide → frozenset vide
|
| 370 |
+
|
| 371 |
+
Disambiguation rule: if the string contains the sequence ``", "`` (a comma
|
| 372 |
+
followed by a space), it is split on commas and each item is stripped.
|
| 373 |
+
Otherwise every Unicode character is a separate item.
|
| 374 |
+
"""
|
| 375 |
+
if not value:
|
| 376 |
+
return frozenset()
|
| 377 |
+
if isinstance(value, (list, tuple)):
|
| 378 |
+
return frozenset(str(c) for c in value if c)
|
| 379 |
+
raw = str(value)
|
| 380 |
+
# Désambiguïsation : séparer par ", " si présent (format lisible)
|
| 381 |
+
if ", " in raw:
|
| 382 |
+
return frozenset(c.strip() for c in raw.split(",") if c.strip())
|
| 383 |
+
# Sinon, chaque caractère Unicode est un item distinct
|
| 384 |
+
return frozenset(raw)
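To make the disambiguation rule concrete, here is a self-contained restatement of the parsing logic shown in this diff. ``parse_exclude_chars_sketch`` is a hypothetical name used only for this example; the inputs are illustrative.

```python
def parse_exclude_chars_sketch(value):
    """Hypothetical restatement of the parsing rule, for illustration."""
    if not value:
        return frozenset()
    if isinstance(value, (list, tuple)):
        return frozenset(str(c) for c in value if c)
    raw = str(value)
    # Readable form: a ", " anywhere switches to comma-separated items.
    if ", " in raw:
        return frozenset(item.strip() for item in raw.split(",") if item.strip())
    # Compact form: every character is its own item.
    return frozenset(raw)


readable = parse_exclude_chars_sketch("', -, \u2013")   # apostrophe, hyphen, en dash
compact = parse_exclude_chars_sketch(".,;:!?")          # six punctuation marks
from_list = parse_exclude_chars_sketch(["(", ")"])

# Edge case: a compact string that happens to contain ", " is read as the
# readable form, so the comma and the space themselves are lost.
ambiguous = parse_exclude_chars_sketch("., ")

# Passing a list avoids the ambiguity.
explicit = parse_exclude_chars_sketch([".", ",", " "])
```

With this rule, a comma or a space can only be excluded through the compact form (without the ``", "`` sequence) or through a list.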
|
| 385 |
+
|
| 386 |
+
|
| 387 |
+
def _apply_diplomatic_table(text: str, table: dict[str, str]) -> str:
|
| 388 |
+
"""Apply a diplomatic mapping table in two stages.
|
| 389 |
+
|
| 390 |
+
Multi-character keys (e.g. ``"vv"`` → ``"w"``) are replaced first, in one regex
|
| 391 |
+
pass that tries the longest keys first. Single-character keys are then mapped
|
| 392 |
+
character by character, so they never cascade among themselves (``"ſ"→"s"``
|
| 393 |
+
followed by ``"s"→"z"`` yields ``"s"``, not ``"z"``).
|
| 394 |
+
"""
|
| 395 |
+
if not table:
|
| 396 |
+
return text
|
| 397 |
+
|
| 398 |
+
import re
|
| 399 |
+
|
| 400 |
+
# Séparer les clés simples (1 char) des clés multi-chars
|
| 401 |
+
multi_keys = sorted(
|
| 402 |
+
(k for k in table if len(k) > 1), key=len, reverse=True
|
| 403 |
+
)
|
| 404 |
+
simple_table = {k: v for k, v in table.items() if len(k) == 1}
|
| 405 |
+
|
| 406 |
+
if multi_keys:
|
| 407 |
+
# Single-pass : construire un pattern regex avec toutes les clés multi-chars
|
| 408 |
+
# triées par longueur décroissante pour matcher les plus longues d'abord
|
| 409 |
+
pattern = re.compile("|".join(re.escape(k) for k in multi_keys))
|
| 410 |
+
text = pattern.sub(lambda m: table[m.group(0)], text)
|
| 411 |
+
|
| 412 |
+
# Remplacements char par char (single-pass via itération)
|
| 413 |
+
if simple_table:
|
| 414 |
+
text = "".join(simple_table.get(c, c) for c in text)
|
| 415 |
+
|
| 416 |
+
return text
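The function above works in two stages, so text produced by a multi-character rule still passes through the single-character mapping. The built-in tables do not trigger this, but a custom table could. The sketch below compares the two-stage logic with a hypothetical single-regex variant; ``apply_two_stage`` and ``apply_single_regex`` are names invented for this example, and the second one is an alternative design, not what the commit implements.

```python
import re


def apply_two_stage(text, table):
    """Restatement of the two-stage logic, for comparison."""
    multi = sorted((k for k in table if len(k) > 1), key=len, reverse=True)
    simple = {k: v for k, v in table.items() if len(k) == 1}
    if multi:
        pattern = re.compile("|".join(re.escape(k) for k in multi))
        text = pattern.sub(lambda m: table[m.group(0)], text)
    if simple:
        text = "".join(simple.get(c, c) for c in text)
    return text


def apply_single_regex(text, table):
    """Hypothetical variant: one alternation over all keys, longest first."""
    if not table:
        return text
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # Each position of the input is matched at most once, so no output
    # of one rule is ever fed into another rule.
    return pattern.sub(lambda m: table[m.group(0)], text)


english = {"vv": "w", "ſ": "s", "u": "v"}
two_stage = apply_two_stage("vvhich ſo upon", english)
one_pass = apply_single_regex("vvhich ſo upon", english)

# Contrived table where a multi-character rule outputs a single-character key.
contrived = {"vv": "u", "u": "v"}
cascaded = apply_two_stage("vv", contrived)
not_cascaded = apply_single_regex("vv", contrived)
```

On the built-in style table both functions agree; they differ only when the replacement of a multi-character key is itself a single-character key.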
|
| 417 |
+
|
| 418 |
+
|
| 419 |
+
# Profil par défaut utilisé pour le CER diplomatique intégré
|
| 420 |
+
DEFAULT_DIPLOMATIC_PROFILE: NormalizationProfile = get_builtin_profile("medieval_french")
|
|
@@ -1,420 +1,58 @@
|
|
| 1 |
-
"""
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
------------
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
ſ: s
|
| 25 |
-
u: v
|
| 26 |
-
i: j
|
| 27 |
-
y: i
|
| 28 |
-
æ: ae
|
| 29 |
-
œ: oe
|
| 30 |
"""
|
| 31 |
|
| 32 |
from __future__ import annotations
|
| 33 |
|
| 34 |
-
import
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
"
|
| 52 |
-
"
|
| 53 |
-
"
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
"
|
| 59 |
-
"
|
| 60 |
-
"
|
| 61 |
-
"
|
| 62 |
-
"
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
#: Latin médiéval
|
| 66 |
-
DIPLOMATIC_LATIN_MEDIEVAL: dict[str, str] = {
|
| 67 |
-
"ſ": "s",
|
| 68 |
-
"u": "v",
|
| 69 |
-
"i": "j",
|
| 70 |
-
"y": "i",
|
| 71 |
-
"æ": "ae",
|
| 72 |
-
"œ": "oe",
|
| 73 |
-
"ꝑ": "per",
|
| 74 |
-
"ꝓ": "pro",
|
| 75 |
-
"ꝗ": "que", # q barré → que
|
| 76 |
-
"\u0026": "et",
|
| 77 |
-
}
|
| 78 |
-
|
| 79 |
-
#: Profil minimal — uniquement NFC + s long
|
| 80 |
-
DIPLOMATIC_MINIMAL: dict[str, str] = {
|
| 81 |
-
"ſ": "s",
|
| 82 |
-
}
|
| 83 |
-
|
| 84 |
-
#: Anglais moderne / imprimés anciens (XVIe–XVIIIe siècle)
|
| 85 |
-
#: Orthographe «early modern» : ſ=s, u/v, i/j, vv=w, þ=th, ð=th, ȝ=y
|
| 86 |
-
DIPLOMATIC_EN_EARLY_MODERN: dict[str, str] = {
|
| 87 |
-
"ſ": "s", # s long → s
|
| 88 |
-
"u": "v", # u/v interchangeables (vpon → upon)
|
| 89 |
-
"i": "j", # i/j interchangeables (ioy → joy)
|
| 90 |
-
"vv": "w", # vv → w (vvhich → which)
|
| 91 |
-
"þ": "th", # thorn → th
|
| 92 |
-
"ð": "th", # eth → th
|
| 93 |
-
"ȝ": "y", # yogh → y
|
| 94 |
-
"æ": "ae", # ligature æ
|
| 95 |
-
"œ": "oe", # ligature œ
|
| 96 |
-
"\u0026": "and", # & → and
|
| 97 |
-
}
|
| 98 |
-
|
| 99 |
-
#: Anglais médiéval (XIIe–XVe siècle) — abréviations manuscrites incluses
|
| 100 |
-
DIPLOMATIC_EN_MEDIEVAL: dict[str, str] = {
|
| 101 |
-
"ſ": "s",
|
| 102 |
-
"u": "v",
|
| 103 |
-
"i": "j",
|
| 104 |
-
"vv": "w",
|
| 105 |
-
"þ": "th",
|
| 106 |
-
"ð": "th",
|
| 107 |
-
"ȝ": "y",
|
| 108 |
-
"æ": "ae",
|
| 109 |
-
"œ": "oe",
|
| 110 |
-
"\u0026": "and",
|
| 111 |
-
# Abréviations courantes dans les manuscrits anglais médiévaux
|
| 112 |
-
"ꝑ": "per", # p barré → per/par
|
| 113 |
-
"ꝓ": "pro", # p crocheté → pro
|
| 114 |
-
"ꝗ": "que", # q barré → que
|
| 115 |
-
"\ua75b": "r", # lettre r rotunda → r
|
| 116 |
-
}
|
| 117 |
-
|
| 118 |
-
#: Écriture secrétaire (XVIe–XVIIe siècle) — secretary hand
|
| 119 |
-
#: Confusions visuelles propres à l'écriture cursive anglaise
|
| 120 |
-
DIPLOMATIC_EN_SECRETARY: dict[str, str] = {
|
| 121 |
-
"ſ": "s",
|
| 122 |
-
"u": "v",
|
| 123 |
-
"i": "j",
|
| 124 |
-
"vv": "w",
|
| 125 |
-
"þ": "th",
|
| 126 |
-
"ð": "th",
|
| 127 |
-
"ȝ": "y",
|
| 128 |
-
"\u0026": "and",
|
| 129 |
-
# Confusions visuelles typiques : e/c, n/u, m/w en secrétaire
|
| 130 |
-
# Note : ne pas normaliser e/c automatiquement (trop agressif) ;
|
| 131 |
-
# on se limite aux substituts graphiques historiquement documentés
|
| 132 |
-
}
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
# ---------------------------------------------------------------------------
|
| 136 |
-
# Profil de normalisation
|
| 137 |
-
# ---------------------------------------------------------------------------
|
| 138 |
-
|
| 139 |
-
@dataclass
|
| 140 |
-
class NormalizationProfile:
|
| 141 |
-
"""Décrit une stratégie de normalisation pour le calcul du CER diplomatique.
|
| 142 |
-
|
| 143 |
-
Parameters
|
| 144 |
-
----------
|
| 145 |
-
name:
|
| 146 |
-
Identifiant lisible du profil (ex : ``"medieval_french"``).
|
| 147 |
-
nfc:
|
| 148 |
-
Applique la normalisation Unicode NFC (recommandé, activé par défaut).
|
| 149 |
-
caseless:
|
| 150 |
-
Pliage de casse (casefold) après NFC.
|
| 151 |
-
diplomatic_table:
|
| 152 |
-
Table de correspondances graphiques historiques appliquée caractère
|
| 153 |
-
par caractère sur les deux textes avant calcul du CER.
|
| 154 |
-
exclude_chars:
|
| 155 |
-
Ensemble de caractères supprimés des deux textes (GT et OCR) avant
|
| 156 |
-
tout calcul de métriques (CER, WER, MER, WIL et CER diplomatique).
|
| 157 |
-
Utile pour ignorer la ponctuation ou les apostrophes.
|
| 158 |
-
description:
|
| 159 |
-
Description courte du profil (affichée dans le rapport HTML).
|
| 160 |
-
"""
|
| 161 |
-
|
| 162 |
-
name: str
|
| 163 |
-
nfc: bool = True
|
| 164 |
-
caseless: bool = False
|
| 165 |
-
diplomatic_table: dict[str, str] = field(default_factory=dict)
|
| 166 |
-
exclude_chars: frozenset = field(default_factory=frozenset)
|
| 167 |
-
description: str = ""
|
| 168 |
-
|
| 169 |
-
def normalize(self, text: str) -> str:
|
| 170 |
-
"""Applique le profil de normalisation à un texte."""
|
| 171 |
-
if self.exclude_chars:
|
| 172 |
-
text = "".join(c for c in text if c not in self.exclude_chars)
|
| 173 |
-
if self.nfc:
|
| 174 |
-
text = unicodedata.normalize("NFC", text)
|
| 175 |
-
if self.caseless:
|
| 176 |
-
text = text.casefold()
|
| 177 |
-
if self.diplomatic_table:
|
| 178 |
-
text = _apply_diplomatic_table(text, self.diplomatic_table)
|
| 179 |
-
return text
|
| 180 |
-
|
| 181 |
-
def as_dict(self) -> dict:
|
| 182 |
-
return {
|
| 183 |
-
"name": self.name,
|
| 184 |
-
"nfc": self.nfc,
|
| 185 |
-
"caseless": self.caseless,
|
| 186 |
-
"diplomatic_table": self.diplomatic_table,
|
| 187 |
-
"exclude_chars": sorted(self.exclude_chars),
|
| 188 |
-
"description": self.description,
|
| 189 |
-
}
|
| 190 |
-
|
| 191 |
-
@classmethod
|
| 192 |
-
def from_yaml(cls, path: str | Path) -> "NormalizationProfile":
|
| 193 |
-
"""Charge un profil depuis un fichier YAML.
|
| 194 |
-
|
| 195 |
-
Le fichier YAML doit contenir les clés ``name``, optionnellement
|
| 196 |
-
``caseless``, ``description``, ``diplomatic`` (dict str→str) et
|
| 197 |
-
``exclude_chars`` (liste ou chaîne de caractères à ignorer).
|
| 198 |
-
|
| 199 |
-
Example
|
| 200 |
-
-------
|
| 201 |
-
.. code-block:: yaml
|
| 202 |
-
|
| 203 |
-
name: medieval_custom
|
| 204 |
-
caseless: false
|
| 205 |
-
description: Français médiéval personnalisé
|
| 206 |
-
exclude_chars: ".,;:!?"
|
| 207 |
-
diplomatic:
|
| 208 |
-
ſ: s
|
| 209 |
-
u: v
|
| 210 |
-
"""
|
| 211 |
-
try:
|
| 212 |
-
import yaml
|
| 213 |
-
except ImportError as exc:
|
| 214 |
-
raise RuntimeError(
|
| 215 |
-
"Le package 'pyyaml' est requis pour charger les profils YAML. "
|
| 216 |
-
"Installez-le avec : pip install pyyaml"
|
| 217 |
-
) from exc
|
| 218 |
-
|
| 219 |
-
data = yaml.safe_load(Path(path).read_text(encoding="utf-8"))
|
| 220 |
-
return cls(
|
| 221 |
-
name=data.get("name", Path(path).stem),
|
| 222 |
-
nfc=bool(data.get("nfc", True)),
|
| 223 |
-
caseless=bool(data.get("caseless", False)),
|
| 224 |
-
diplomatic_table=data.get("diplomatic", {}),
|
| 225 |
-
exclude_chars=_parse_exclude_chars(data.get("exclude_chars", "")),
|
| 226 |
-
description=data.get("description", ""),
|
| 227 |
-
)
|
| 228 |
-
|
| 229 |
-
@classmethod
|
| 230 |
-
def from_dict(cls, data: dict) -> "NormalizationProfile":
|
| 231 |
-
"""Charge un profil depuis un dictionnaire (ex : section YAML inline)."""
|
| 232 |
-
return cls(
|
| 233 |
-
name=data.get("name", "custom"),
|
| 234 |
-
nfc=bool(data.get("nfc", True)),
|
| 235 |
-
caseless=bool(data.get("caseless", False)),
|
| 236 |
-
diplomatic_table=data.get("diplomatic", {}),
|
| 237 |
-
exclude_chars=_parse_exclude_chars(data.get("exclude_chars", "")),
|
| 238 |
-
description=data.get("description", ""),
|
| 239 |
-
)
|
| 240 |
-
|
| 241 |
-
|
| 242 |
-
# ---------------------------------------------------------------------------
|
| 243 |
-
# Profils préconfigurés
|
| 244 |
-
# ---------------------------------------------------------------------------
|
| 245 |
-
|
| 246 |
-
NORMALIZATION_PROFILES: dict[str, NormalizationProfile] = {
|
| 247 |
-
"nfc": NormalizationProfile(
|
| 248 |
-
name="nfc",
|
| 249 |
-
nfc=True,
|
| 250 |
-
caseless=False,
|
| 251 |
-
diplomatic_table={},
|
| 252 |
-
description="Normalisation NFC uniquement",
|
| 253 |
-
),
|
| 254 |
-
"caseless": NormalizationProfile(
|
| 255 |
-
name="caseless",
|
| 256 |
-
nfc=True,
|
| 257 |
-
caseless=True,
|
| 258 |
-
diplomatic_table={},
|
| 259 |
-
description="NFC + insensible à la casse",
|
| 260 |
-
),
|
| 261 |
-
"minimal": NormalizationProfile(
|
| 262 |
-
name="minimal",
|
| 263 |
-
nfc=True,
|
| 264 |
-
caseless=False,
|
| 265 |
-
diplomatic_table=DIPLOMATIC_MINIMAL,
|
| 266 |
-
description="Minimal : NFC + s long seulement",
|
| 267 |
-
),
|
| 268 |
-
"medieval_french": NormalizationProfile(
|
| 269 |
-
name="medieval_french",
|
| 270 |
-
nfc=True,
|
| 271 |
-
caseless=False,
|
| 272 |
-
diplomatic_table=DIPLOMATIC_FR_MEDIEVAL,
|
| 273 |
-
description="Français médiéval (XIIe–XVe) : ſ=s, u=v, i=j, æ=ae, œ=oe",
|
| 274 |
-
),
|
| 275 |
-
"early_modern_french": NormalizationProfile(
|
| 276 |
-
name="early_modern_french",
|
| 277 |
-
nfc=True,
|
| 278 |
-
caseless=False,
|
| 279 |
-
diplomatic_table=DIPLOMATIC_FR_EARLY_MODERN,
|
| 280 |
-
description="Imprimés anciens (XVIe–XVIIIe) : ſ=s, æ=ae, œ=oe",
|
| 281 |
-
),
|
| 282 |
-
"medieval_latin": NormalizationProfile(
|
| 283 |
-
name="medieval_latin",
|
| 284 |
-
nfc=True,
|
| 285 |
-
caseless=False,
|
| 286 |
-
diplomatic_table=DIPLOMATIC_LATIN_MEDIEVAL,
|
| 287 |
-
description="Latin médiéval : ſ=s, u=v, i=j, ꝑ=per, ꝓ=pro",
|
| 288 |
-
),
|
| 289 |
-
"early_modern_english": NormalizationProfile(
|
| 290 |
-
name="early_modern_english",
|
| 291 |
-
nfc=True,
|
| 292 |
-
caseless=False,
|
| 293 |
-
diplomatic_table=DIPLOMATIC_EN_EARLY_MODERN,
|
| 294 |
-
description="Early Modern English (XVIth–XVIIIth c.): ſ=s, u=v, i=j, vv=w, þ=th, ð=th, ȝ=y",
|
| 295 |
-
),
|
| 296 |
-
"medieval_english": NormalizationProfile(
|
| 297 |
-
name="medieval_english",
|
| 298 |
-
nfc=True,
|
| 299 |
-
caseless=False,
|
| 300 |
-
diplomatic_table=DIPLOMATIC_EN_MEDIEVAL,
|
| 301 |
-
description="Medieval English (XIIth–XVth c.): ſ=s, u=v, i=j, þ=th, ȝ=y, ꝑ=per, ꝓ=pro",
|
| 302 |
-
),
|
| 303 |
-
"secretary_hand": NormalizationProfile(
|
| 304 |
-
name="secretary_hand",
|
| 305 |
-
nfc=True,
|
| 306 |
-
caseless=False,
|
| 307 |
-
diplomatic_table=DIPLOMATIC_EN_SECRETARY,
|
| 308 |
-
description="Secretary hand (XVIth–XVIIth c.): ſ=s, u=v, i=j, vv=w, þ=th, ð=th, ȝ=y",
|
| 309 |
-
),
|
| 310 |
-
# ── Profils d'exclusion de caractères ────────────────────────────────
|
| 311 |
-
"sans_ponctuation": NormalizationProfile(
|
| 312 |
-
name="sans_ponctuation",
|
| 313 |
-
nfc=True,
|
| 314 |
-
caseless=False,
|
| 315 |
-
diplomatic_table={},
|
| 316 |
-
exclude_chars=frozenset(". , ; : ! ? ' \u2019 \" - \u2013 \u2014 ( ) [ ]".split()),
|
| 317 |
-
description="NFC + suppression de la ponctuation courante : . , ; : ! ? ' \" - – — ( ) [ ]",
|
| 318 |
-
),
|
| 319 |
-
"sans_apostrophes": NormalizationProfile(
|
| 320 |
-
name="sans_apostrophes",
|
| 321 |
-
nfc=True,
|
| 322 |
-
caseless=False,
|
| 323 |
-
diplomatic_table={},
|
| 324 |
-
exclude_chars=frozenset(["'", "\u2019"]), # apostrophe droite + apostrophe typographique
|
| 325 |
-
description="NFC + suppression des apostrophes droite (') et typographique (\u2019)",
|
| 326 |
-
),
|
| 327 |
-
}
|
| 328 |
-
|
| 329 |
-
|
| 330 |
-
def get_builtin_profile(name: str) -> NormalizationProfile:
|
| 331 |
-
"""Retourne un profil préconfigurée par son identifiant.
|
| 332 |
-
|
| 333 |
-
Identifiants disponibles
|
| 334 |
-
------------------------
|
| 335 |
-
- ``"medieval_french"`` : français médiéval XIIe–XVe (ſ=s, u=v, i=j, æ=ae, œ=oe…)
|
| 336 |
-
- ``"early_modern_french"`` : imprimés anciens XVIe–XVIIIe (ſ=s, œ=oe, æ=ae…)
|
| 337 |
-
- ``"medieval_latin"`` : latin médiéval (ſ=s, u=v, i=j, ꝑ=per, ꝓ=pro…)
|
| 338 |
-
- ``"early_modern_english"`` : anglais imprimé XVIe–XVIIIe (ſ=s, u=v, i=j, vv=w, þ=th, ð=th, ȝ=y)
|
| 339 |
-
- ``"medieval_english"`` : anglais manuscrit XIIe–XVe (+ abréviations ꝑ, ꝓ…)
|
| 340 |
-
- ``"secretary_hand"`` : écriture secrétaire anglaise XVIe–XVIIe (cursive administrative)
|
| 341 |
-
- ``"minimal"`` : uniquement NFC + s long
|
| 342 |
-
- ``"nfc"`` : NFC seul (sans table diplomatique)
|
| 343 |
-
- ``"caseless"`` : NFC + pliage de casse
|
| 344 |
-
|
| 345 |
-
Raises
|
| 346 |
-
------
|
| 347 |
-
KeyError
|
| 348 |
-
Si le nom n'est pas reconnu.
|
| 349 |
-
"""
|
| 350 |
-
if name not in NORMALIZATION_PROFILES:
|
| 351 |
-
raise KeyError(
|
| 352 |
-
f"Profil de normalisation inconnu : '{name}'. "
|
| 353 |
-
f"Disponibles : {', '.join(NORMALIZATION_PROFILES)}"
|
| 354 |
-
)
|
| 355 |
-
return NORMALIZATION_PROFILES[name]
|
| 356 |
-
|
| 357 |
-
|
| 358 |
-
# ---------------------------------------------------------------------------
|
| 359 |
-
# Fonctions utilitaires
|
| 360 |
-
# ---------------------------------------------------------------------------
|
| 361 |
-
|
| 362 |
-
def _parse_exclude_chars(value: "str | list | None") -> frozenset:
|
| 363 |
-
"""Convertit une liste de caractères (str ou list) en frozenset.
|
| 364 |
-
|
| 365 |
-
Accepte :
|
| 366 |
-
- Une chaîne de caractères séparés par une virgule+espace (ex. ``"', -, –"``)
|
| 367 |
-
ou simplement concaténés sans séparateur (ex. ``".,;:!?"``)
|
| 368 |
-
- Une liste Python/YAML de chaînes (chacune un caractère)
|
| 369 |
-
- None ou chaîne vide → frozenset vide
|
| 370 |
-
|
| 371 |
-
Règle de désambiguïsation : si la chaîne contient la séquence ``", "``
|
| 372 |
-
(virgule suivie d'un espace), on découpe par ``", "``. Sinon, chaque
|
| 373 |
-
caractère Unicode est un item distinct.
|
| 374 |
-
"""
|
| 375 |
-
if not value:
|
| 376 |
-
return frozenset()
|
| 377 |
-
if isinstance(value, (list, tuple)):
|
| 378 |
-
return frozenset(str(c) for c in value if c)
|
| 379 |
-
raw = str(value)
|
| 380 |
-
# Désambiguïsation : séparer par ", " si présent (format lisible)
|
| 381 |
-
if ", " in raw:
|
| 382 |
-
return frozenset(c.strip() for c in raw.split(",") if c.strip())
|
| 383 |
-
# Sinon, chaque caractère Unicode est un item distinct
|
| 384 |
-
return frozenset(raw)
|
| 385 |
-
|
| 386 |
-
|
| 387 |
-
def _apply_diplomatic_table(text: str, table: dict[str, str]) -> str:
|
| 388 |
-
"""Applique une table de correspondances diplomatiques en un seul pass.
|
| 389 |
-
|
| 390 |
-
Les clés multi-caractères (ex : ``"ae"`` → ``"æ"``) sont gérées en priorité
|
| 391 |
-
sur les correspondances simples. Le remplacement est fait en un seul pass
|
| 392 |
-
via regex pour éviter les remplacements en cascade (ex : ``"ſ"→"s"`` puis
|
| 393 |
-
``"s"→"z"`` donnerait ``"z"`` au lieu de ``"s"``).
|
| 394 |
-
"""
|
| 395 |
-
if not table:
|
| 396 |
-
return text
|
| 397 |
-
|
| 398 |
-
import re
|
| 399 |
-
|
| 400 |
-
# Séparer les clés simples (1 char) des clés multi-chars
|
| 401 |
-
multi_keys = sorted(
|
| 402 |
-
(k for k in table if len(k) > 1), key=len, reverse=True
|
| 403 |
-
)
|
| 404 |
-
simple_table = {k: v for k, v in table.items() if len(k) == 1}
|
| 405 |
-
|
| 406 |
-
if multi_keys:
|
| 407 |
-
# Single-pass : construire un pattern regex avec toutes les clés multi-chars
|
| 408 |
-
# triées par longueur décroissante pour matcher les plus longues d'abord
|
| 409 |
-
pattern = re.compile("|".join(re.escape(k) for k in multi_keys))
|
| 410 |
-
text = pattern.sub(lambda m: table[m.group(0)], text)
|
| 411 |
-
|
| 412 |
-
# Remplacements char par char (single-pass via itération)
|
| 413 |
-
if simple_table:
|
| 414 |
-
text = "".join(simple_table.get(c, c) for c in text)
|
| 415 |
-
|
| 416 |
-
return text
|
| 417 |
-
|
| 418 |
-
|
| 419 |
-
# Profil par défaut utilisé pour le CER diplomatique intégré
|
| 420 |
-
DEFAULT_DIPLOMATIC_PROFILE: NormalizationProfile = get_builtin_profile("medieval_french")
|
|
|
|
| 1 |
+
"""Re-export from ``picarones.formats.text.normalization`` (Sprint A14-S9).
|
| 2 |
+
|
| 3 |
+
The canonical content of this module has moved to
|
| 4 |
+
``picarones/formats/text/normalization.py`` in Sprint S9 of the
|
| 5 |
+
targeted rewrite (see ``docs/roadmap/rewrite-2026.md``).
|
| 6 |
+
|
| 7 |
+
This file is kept as a re-export so that **nothing breaks**
|
| 8 |
+
for the ~50 consumers that use ``from
|
| 9 |
+
picarones.measurements.normalization import X``. The public
|
| 10 |
+
AND private symbols used downstream (``_parse_exclude_chars``,
|
| 11 |
+
``_apply_diplomatic_table``) are re-exported explicitly.
|
| 12 |
+
|
| 13 |
+
Migration plan
|
| 14 |
+
--------------
|
| 15 |
+
In S22, consumers that still import from this
|
| 16 |
+
location will be migrated to ``picarones.formats.text.normalization``
|
| 17 |
+
and this re-export will be removed.
|
| 18 |
+
|
| 19 |
+
Architectural rule
|
| 20 |
+
------------------
|
| 21 |
+
``measurements/`` (legacy code) may import from
|
| 22 |
+
``formats/`` (new code) during the migration phase.
|
| 23 |
+
The reverse is forbidden (enforced by ``test_layer_dependencies``).
|
| 24 |
"""
|
| 25 |
|
| 26 |
from __future__ import annotations
|
| 27 |
|
| 28 |
+
from picarones.formats.text.normalization import (
|
| 29 |
+
DEFAULT_DIPLOMATIC_PROFILE,
|
| 30 |
+
DIPLOMATIC_EN_EARLY_MODERN,
|
| 31 |
+
DIPLOMATIC_EN_MEDIEVAL,
|
| 32 |
+
DIPLOMATIC_EN_SECRETARY,
|
| 33 |
+
DIPLOMATIC_FR_EARLY_MODERN,
|
| 34 |
+
DIPLOMATIC_FR_MEDIEVAL,
|
| 35 |
+
DIPLOMATIC_LATIN_MEDIEVAL,
|
| 36 |
+
DIPLOMATIC_MINIMAL,
|
| 37 |
+
NORMALIZATION_PROFILES,
|
| 38 |
+
NormalizationProfile,
|
| 39 |
+
_apply_diplomatic_table,
|
| 40 |
+
_parse_exclude_chars,
|
| 41 |
+
get_builtin_profile,
|
| 42 |
+
)
|
| 43 |
+
|
| 44 |
+
__all__ = [
|
| 45 |
+
"NormalizationProfile",
|
| 46 |
+
"DIPLOMATIC_FR_MEDIEVAL",
|
| 47 |
+
"DIPLOMATIC_FR_EARLY_MODERN",
|
| 48 |
+
"DIPLOMATIC_LATIN_MEDIEVAL",
|
| 49 |
+
"DIPLOMATIC_MINIMAL",
|
| 50 |
+
"DIPLOMATIC_EN_EARLY_MODERN",
|
| 51 |
+
"DIPLOMATIC_EN_MEDIEVAL",
|
| 52 |
+
"DIPLOMATIC_EN_SECRETARY",
|
| 53 |
+
"NORMALIZATION_PROFILES",
|
| 54 |
+
"DEFAULT_DIPLOMATIC_PROFILE",
|
| 55 |
+
"get_builtin_profile",
|
| 56 |
+
"_parse_exclude_chars",
|
| 57 |
+
"_apply_diplomatic_table",
|
| 58 |
+
]
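To make the re-export mechanism concrete, here is a minimal, self-contained sketch. It does not use the real `picarones` modules: `demo_new_normalization` and `demo_old_normalization` are hypothetical in-memory stand-ins built with `types.ModuleType`. It only illustrates that a `from ... import` in the legacy module binds the same objects rather than copies.

```python
import sys
import types

# Hypothetical stand-in for the canonical module (the new location).
canonical = types.ModuleType("demo_new_normalization")
canonical.PROFILES = {"nfc": object()}
canonical._private_helper = str.strip
sys.modules[canonical.__name__] = canonical

# Hypothetical stand-in for the legacy module, reduced to a re-export.
shim = types.ModuleType("demo_old_normalization")
exec(
    "from demo_new_normalization import PROFILES, _private_helper\n"
    "__all__ = ['PROFILES', '_private_helper']\n",
    shim.__dict__,
)
sys.modules[shim.__name__] = shim

# Importing through either path binds the very same objects.
from demo_old_normalization import PROFILES as old_profiles
from demo_new_normalization import PROFILES as new_profiles

same_dict = old_profiles is new_profiles
same_helper = shim._private_helper is canonical._private_helper
```

Underscore-prefixed names can always be imported explicitly by name; listing them in `__all__` additionally makes them visible to `from ... import *`, which would otherwise skip them.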
|
@@ -82,7 +82,11 @@ FILE_BUDGETS: dict[str, int] = {
|
|
| 82 |
"picarones/extras/importers/huggingface.py": 550, # actuel 464
|
| 83 |
"picarones/core/metric_hooks.py": 500, # actuel 423
|
| 84 |
"picarones/measurements/numerical_sequences.py": 500, # actuel 422
|
| 85 |
-
"picarones/measurements/normalization.py": 500, # actuel 420
|
| 86 |
"picarones/report/comparison.py": 500, # actuel 409
|
| 87 |
# --- Module mutualisé créé par le sprint des render helpers
|
| 88 |
# (Sprint « consolidation des renderers » 2026-05-02). Budget
|
|
|
|
| 82 |
"picarones/extras/importers/huggingface.py": 550, # actuel 464
|
| 83 |
"picarones/core/metric_hooks.py": 500, # actuel 423
|
| 84 |
"picarones/measurements/numerical_sequences.py": 500, # actuel 422
|
| 85 |
+
"picarones/measurements/normalization.py": 500, # actuel 420 (re-export S9)
|
| 86 |
+
# Sprint A14-S9: moved from measurements/normalization.py.
|
| 87 |
+
# The old location is now a re-export; the canonical content
|
| 88 |
+
# lives here.
|
| 89 |
+
"picarones/formats/text/normalization.py": 500, # actuel 420
|
| 90 |
"picarones/report/comparison.py": 500, # actuel 409
|
| 91 |
# --- Module mutualisé créé par le sprint des render helpers
|
| 92 |
# (Sprint « consolidation des renderers » 2026-05-02). Budget
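The budget table above maps a file path to a maximum size, and the trailing comments record the current size. Assuming the unit is a line count (the 420 recorded for `normalization.py` matches the last line number of that file in this diff), a check over such a table could look like the sketch below. The function name `over_budget` and the choice to ignore missing files are assumptions made for illustration, not the project's actual test.

```python
from pathlib import Path


def over_budget(budgets: dict[str, int], root: Path) -> list[tuple[str, int, int]]:
    """Return (path, actual_lines, limit) for every file above its budget."""
    violations: list[tuple[str, int, int]] = []
    for rel_path, limit in sorted(budgets.items()):
        path = root / rel_path
        if not path.is_file():
            # Assumption: a missing file is ignored rather than reported.
            continue
        actual = len(path.read_text(encoding="utf-8").splitlines())
        if actual > limit:
            violations.append((rel_path, actual, limit))
    return violations
```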
|
|
File without changes
|
|
File without changes
|
|
@@ -0,0 +1,316 @@
|
|
| 1 |
+
"""Sprint A14-S9: ALTO parser, writer, projector.
|
| 2 |
+
|
| 3 |
+
Minimal tests that nonetheless cover the critical invariants:
|
| 4 |
+
|
| 5 |
+
- Round trip ``parse → write → parse`` preserves the structure.
|
| 6 |
+
- Automatic detection of v2 / v3 / v4 / no namespace.
|
| 7 |
+
- Text extraction follows ``Page → Block → Line → String``.
|
| 8 |
+
- ``HypPart1`` / ``HypPart2`` hyphenation (same line AND across lines).
|
| 9 |
+
- ``defusedxml`` blocks XXE attacks.
|
| 10 |
+
"""
|
| 11 |
+
|
| 12 |
+
from __future__ import annotations
|
| 13 |
+
|
| 14 |
+
import pytest
|
| 15 |
+
|
| 16 |
+
from picarones.domain import Artifact, ArtifactType
|
| 17 |
+
from picarones.domain.errors import ProjectionError
|
| 18 |
+
from picarones.formats.alto import (
|
| 19 |
+
AltoBBox,
|
| 20 |
+
AltoDocument,
|
| 21 |
+
AltoLine,
|
| 22 |
+
AltoPage,
|
| 23 |
+
AltoParseError,
|
| 24 |
+
AltoString,
|
| 25 |
+
AltoTextBlock,
|
| 26 |
+
AltoToText,
|
| 27 |
+
alto_document_to_text,
|
| 28 |
+
parse_alto,
|
| 29 |
+
write_alto,
|
| 30 |
+
)
|
| 31 |
+
|
| 32 |
+
|
| 33 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 34 |
+
# Synthetic fixtures
|
| 35 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
def _simple_doc() -> AltoDocument:
|
| 39 |
+
return AltoDocument(
|
| 40 |
+
pages=(AltoPage(
|
| 41 |
+
id="p1", width=1000, height=1500,
|
| 42 |
+
blocks=(AltoTextBlock(
|
| 43 |
+
id="b1",
|
| 44 |
+
lines=(
|
| 45 |
+
AltoLine(id="l1", strings=(
|
| 46 |
+
AltoString(content="Hello", id="s1"),
|
| 47 |
+
AltoString(content="world", id="s2"),
|
| 48 |
+
)),
|
| 49 |
+
AltoLine(id="l2", strings=(
|
| 50 |
+
AltoString(content="second", id="s3"),
|
| 51 |
+
AltoString(content="line", id="s4"),
|
| 52 |
+
)),
|
| 53 |
+
),
|
| 54 |
+
),),
|
| 55 |
+
),),
|
| 56 |
+
)
|
| 57 |
+
|
| 58 |
+
|
| 59 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 60 |
+
# Parser: namespace detection
|
| 61 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
class TestParserVersions:
|
| 65 |
+
def test_v4_namespace_detected(self) -> None:
|
| 66 |
+
xml = b'''<?xml version="1.0"?>
|
| 67 |
+
<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
|
| 68 |
+
<Layout><Page ID="p" WIDTH="100" HEIGHT="200">
|
| 69 |
+
<PrintSpace>
|
| 70 |
+
<TextBlock ID="b">
|
| 71 |
+
<TextLine ID="l">
|
| 72 |
+
<String CONTENT="hi"/>
|
| 73 |
+
</TextLine>
|
| 74 |
+
</TextBlock>
|
| 75 |
+
</PrintSpace>
|
| 76 |
+
</Page></Layout>
|
| 77 |
+
</alto>'''
|
| 78 |
+
doc = parse_alto(xml)
|
| 79 |
+
assert doc.source_version == "v4"
|
| 80 |
+
assert len(doc.pages) == 1
|
| 81 |
+
|
| 82 |
+
def test_v3_namespace_detected(self) -> None:
|
| 83 |
+
xml = b'''<?xml version="1.0"?>
|
| 84 |
+
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
|
| 85 |
+
<Layout><Page ID="p"><PrintSpace>
|
| 86 |
+
<TextBlock><TextLine><String CONTENT="x"/></TextLine></TextBlock>
|
| 87 |
+
</PrintSpace></Page></Layout>
|
| 88 |
+
</alto>'''
|
| 89 |
+
doc = parse_alto(xml)
|
| 90 |
+
assert doc.source_version == "v3"
|
| 91 |
+
|
| 92 |
+
def test_v2_namespace_detected(self) -> None:
|
| 93 |
+
xml = b'''<?xml version="1.0"?>
|
| 94 |
+
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
|
| 95 |
+
<Layout><Page><PrintSpace>
|
| 96 |
+
<TextBlock><TextLine><String CONTENT="x"/></TextLine></TextBlock>
|
| 97 |
+
</PrintSpace></Page></Layout>
|
| 98 |
+
</alto>'''
|
| 99 |
+
doc = parse_alto(xml)
|
| 100 |
+
assert doc.source_version == "v2"
|
| 101 |
+
|
| 102 |
+
def test_no_namespace_accepted(self) -> None:
|
| 103 |
+
xml = b'''<?xml version="1.0"?>
|
| 104 |
+
<alto>
|
| 105 |
+
<Layout><Page><PrintSpace>
|
| 106 |
+
<TextBlock><TextLine><String CONTENT="x"/></TextLine></TextBlock>
|
| 107 |
+
</PrintSpace></Page></Layout>
|
| 108 |
+
</alto>'''
|
| 109 |
+
doc = parse_alto(xml)
|
| 110 |
+
assert doc.source_version == "none"
|
| 111 |
+
|
| 112 |
+
def test_invalid_xml_raises(self) -> None:
|
| 113 |
+
with pytest.raises(AltoParseError, match="invalide"):
|
| 114 |
+
parse_alto(b"<not closed")
|
| 115 |
+
|
| 116 |
+
def test_empty_xml_raises(self) -> None:
|
| 117 |
+
with pytest.raises(AltoParseError, match="vide"):
|
| 118 |
+
parse_alto(b"")
|
| 119 |
+
|
| 120 |
+
def test_xxe_blocked(self) -> None:
|
| 121 |
+
"""defusedxml must block XXE attacks."""
|
| 122 |
+
xml = b'''<?xml version="1.0"?>
|
| 123 |
+
<!DOCTYPE foo [<!ENTITY xxe SYSTEM "file:///etc/passwd">]>
|
| 124 |
+
<alto>&xxe;</alto>'''
|
| 125 |
+
with pytest.raises(AltoParseError):
|
| 126 |
+
parse_alto(xml)
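The version tests above rely on reading the namespace of the root element. The sketch below shows one way such a detection can work; it is not the commit's `parse_alto`. It uses the standard-library `xml.etree.ElementTree` so that it runs without dependencies, whereas the commit states that the real parser goes through `defusedxml`; the stdlib parser is not hardened and should not be fed untrusted input. The function name `detect_alto_version` and the `"unknown"` fallback are assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace URIs taken from the test fixtures above.
_ALTO_NAMESPACES = {
    "http://www.loc.gov/standards/alto/ns-v2#": "v2",
    "http://www.loc.gov/standards/alto/ns-v3#": "v3",
    "http://www.loc.gov/standards/alto/ns-v4#": "v4",
}


def detect_alto_version(xml_bytes: bytes) -> str:
    """Return "v2", "v3", "v4", "none" or "unknown" from the root namespace."""
    root = ET.fromstring(xml_bytes)
    # ElementTree spells a namespaced tag as "{uri}localname".
    if root.tag.startswith("{"):
        uri = root.tag[1:].split("}", 1)[0]
        # Assumption: an unrecognised namespace is reported, not rejected.
        return _ALTO_NAMESPACES.get(uri, "unknown")
    return "none"
```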
|
| 127 |
+
|
| 128 |
+
|
| 129 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 130 |
+
# Round-trip writer/parser
|
| 131 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 132 |
+
|
| 133 |
+
|
| 134 |
+
class TestRoundTrip:
|
| 135 |
+
def test_simple_doc_roundtrip(self) -> None:
|
| 136 |
+
doc = _simple_doc()
|
| 137 |
+
xml = write_alto(doc)
|
| 138 |
+
doc2 = parse_alto(xml)
|
| 139 |
+
# The internal structures are equivalent (ignoring
|
| 140 |
+
# source_version, which may differ).
|
| 141 |
+
assert len(doc2.pages) == len(doc.pages)
|
| 142 |
+
assert len(doc2.pages[0].blocks) == len(doc.pages[0].blocks)
|
| 143 |
+
assert doc2.pages[0].width == doc.pages[0].width
|
| 144 |
+
assert doc2.pages[0].height == doc.pages[0].height
|
| 145 |
+
|
| 146 |
+
def test_string_content_preserved(self) -> None:
|
| 147 |
+
doc = _simple_doc()
|
| 148 |
+
xml = write_alto(doc)
|
| 149 |
+
doc2 = parse_alto(xml)
|
| 150 |
+
block = doc2.pages[0].blocks[0]
|
| 151 |
+
assert block.lines[0].strings[0].content == "Hello"
|
| 152 |
+
assert block.lines[1].strings[1].content == "line"
|
| 153 |
+
|
| 154 |
+
def test_bbox_preserved(self) -> None:
|
| 155 |
+
doc = AltoDocument(
|
| 156 |
+
pages=(AltoPage(
|
| 157 |
+
blocks=(AltoTextBlock(
|
| 158 |
+
lines=(AltoLine(strings=(
|
| 159 |
+
AltoString(
|
| 160 |
+
content="x",
|
| 161 |
+
bbox=AltoBBox(hpos=10, vpos=20, width=30, height=40),
|
| 162 |
+
),
|
| 163 |
+
),),),
|
| 164 |
+
),),
|
| 165 |
+
),),
|
| 166 |
+
)
|
| 167 |
+
doc2 = parse_alto(write_alto(doc))
|
| 168 |
+
bbox = doc2.pages[0].blocks[0].lines[0].strings[0].bbox
|
| 169 |
+
assert bbox is not None
|
| 170 |
+
assert bbox.hpos == 10 and bbox.vpos == 20
|
| 171 |
+
assert bbox.width == 30 and bbox.height == 40
|
| 172 |
+
|
| 173 |
+
def test_byte_deterministic(self) -> None:
|
| 174 |
+
"""Same structure → same bytes."""
|
| 175 |
+
doc1 = _simple_doc()
|
| 176 |
+
doc2 = _simple_doc()
|
| 177 |
+
assert write_alto(doc1) == write_alto(doc2)
|
| 178 |
+
|
| 179 |
+
def test_write_in_v3(self) -> None:
|
| 180 |
+
xml = write_alto(_simple_doc(), version="v3")
|
| 181 |
+
doc = parse_alto(xml)
|
| 182 |
+
assert doc.source_version == "v3"
|
| 183 |
+
|
| 184 |
+
def test_write_no_namespace(self) -> None:
|
| 185 |
+
xml = write_alto(_simple_doc(), version="none")
|
| 186 |
+
doc = parse_alto(xml)
|
| 187 |
+
assert doc.source_version == "none"
|
| 188 |
+
|
| 189 |
+
def test_invalid_version_rejected(self) -> None:
|
| 190 |
+
from picarones.domain.errors import PicaronesError
|
| 191 |
+
with pytest.raises(PicaronesError, match="version ALTO invalide"):
|
| 192 |
+
write_alto(_simple_doc(), version="v9")
|
| 193 |
+
|
| 194 |
+
|
| 195 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 196 |
+
# Projector: text extraction + hyphenation
|
| 197 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 198 |
+
|
| 199 |
+
|
| 200 |
+
class TestExtractText:
|
| 201 |
+
def test_simple_text(self) -> None:
|
| 202 |
+
text = alto_document_to_text(_simple_doc())
|
| 203 |
+
assert text == "Hello world\nsecond line"
|
| 204 |
+
|
| 205 |
+
def test_multi_block_separated_by_blank_line(self) -> None:
|
| 206 |
+
doc = AltoDocument(pages=(AltoPage(
|
| 207 |
+
blocks=(
|
| 208 |
+
AltoTextBlock(lines=(
|
| 209 |
+
AltoLine(strings=(AltoString(content="A"),)),
|
| 210 |
+
),),
|
| 211 |
+
AltoTextBlock(lines=(
|
| 212 |
+
AltoLine(strings=(AltoString(content="B"),)),
|
| 213 |
+
),),
|
| 214 |
+
),
|
| 215 |
+
),),)
|
| 216 |
+
assert alto_document_to_text(doc) == "A\n\nB"
|
| 217 |
+
|
| 218 |
+
def test_hyphenation_same_line_with_subs_content(self) -> None:
|
| 219 |
+
"""HypPart1 + HypPart2 on the same line, SUBS_CONTENT provided."""
|
| 220 |
+
doc = AltoDocument(pages=(AltoPage(
|
| 221 |
+
blocks=(AltoTextBlock(lines=(
|
| 222 |
+
AltoLine(strings=(
|
| 223 |
+
AltoString(content="Bonjour"),
|
| 224 |
+
AltoString(
|
| 225 |
+
content="est-",
|
| 226 |
+
subs_type="HypPart1",
|
| 227 |
+
subs_content="est-il",
|
| 228 |
+
),
|
| 229 |
+
AltoString(content="il", subs_type="HypPart2"),
|
| 230 |
+
AltoString(content="clair"),
|
| 231 |
+
)),
|
| 232 |
+
),),),
|
| 233 |
+
),),)
|
| 234 |
+
# "est-il" is reconstructed; the following "il" is skipped.
|
| 235 |
+
assert alto_document_to_text(doc) == "Bonjour est-il clair"
|
| 236 |
+
|
| 237 |
+
def test_hyphenation_cross_line(self) -> None:
|
| 238 |
+
"""HypPart1 at the end of one line, HypPart2 at the start of the next.
|
| 239 |
+
|
| 240 |
+
This is the standard ALTO usage (the visual hyphenation corresponds to
|
| 241 |
+
an actual line break).
|
| 242 |
+
"""
|
| 243 |
+
doc = AltoDocument(pages=(AltoPage(
|
| 244 |
+
blocks=(AltoTextBlock(lines=(
|
| 245 |
+
AltoLine(strings=(
|
| 246 |
+
AltoString(content="ceci"),
|
| 247 |
+
AltoString(
|
| 248 |
+
content="est-",
|
| 249 |
+
subs_type="HypPart1",
|
| 250 |
+
subs_content="est-il",
|
| 251 |
+
),
|
| 252 |
+
)),
|
| 253 |
+
AltoLine(strings=(
|
| 254 |
+
AltoString(content="il", subs_type="HypPart2"),
|
| 255 |
+
AltoString(content="clair"),
|
| 256 |
+
)),
|
| 257 |
+
),),),
|
| 258 |
+
),),)
|
| 259 |
+
# Line 1: "ceci est-il" (full word placed at the end of line 1).
|
| 260 |
+
# Line 2: "clair" (the HypPart2 "il" is skipped).
|
| 261 |
+
assert alto_document_to_text(doc) == "ceci est-il\nclair"
|
| 262 |
+
|
| 263 |
+
def test_hyphenation_no_subs_content_concatenates(self) -> None:
|
| 264 |
+
doc = AltoDocument(pages=(AltoPage(
|
| 265 |
+
blocks=(AltoTextBlock(lines=(
|
| 266 |
+
AltoLine(strings=(
|
| 267 |
+
AltoString(content="lec-", subs_type="HypPart1"),
|
| 268 |
+
AltoString(content="ture", subs_type="HypPart2"),
|
| 269 |
+
)),
|
| 270 |
+
),),),
|
| 271 |
+
),),)
|
| 272 |
+
assert alto_document_to_text(doc) == "lec-ture"
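The hyphenation rules exercised by the tests above can be summarised as a small state machine carried from line to line within a block. The sketch below is an illustration under assumptions, not the commit's projector: `Tok` is a hypothetical stand-in for `AltoString`, `block_to_text` is a hypothetical name, and the handling of a cross-line `HypPart2` when no `SUBS_CONTENT` is given (glued to the end of the previous line) is a guess, since the tests shown do not cover that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tok:
    """Hypothetical stand-in for AltoString."""
    content: str
    subs_type: Optional[str] = None
    subs_content: Optional[str] = None


def block_to_text(lines: list[list[Tok]]) -> str:
    out_lines: list[str] = []
    pending: Optional[str] = None  # "skip" or "glue", survives line breaks
    for line in lines:
        words: list[str] = []
        for tok in line:
            if tok.subs_type == "HypPart1":
                if tok.subs_content:
                    # Full word known: emit it now, drop the next HypPart2.
                    words.append(tok.subs_content)
                    pending = "skip"
                else:
                    # No full word: keep part 1, append part 2 to it later.
                    words.append(tok.content)
                    pending = "glue"
            elif tok.subs_type == "HypPart2" and pending == "skip":
                pending = None
            elif tok.subs_type == "HypPart2" and pending == "glue":
                pending = None
                if words:
                    words[-1] += tok.content
                elif out_lines:
                    # Assumption: cross-line case, glue onto the previous line.
                    out_lines[-1] += tok.content
                else:
                    words.append(tok.content)
            else:
                words.append(tok.content)
        if words:
            out_lines.append(" ".join(words))
    return "\n".join(out_lines)
```

An orphan `HypPart2` with no preceding `HypPart1` falls through to the last branch and is kept as an ordinary word.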
|
| 273 |
+
|
| 274 |
+
|
| 275 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 276 |
+
# AltoToText projector (protocol)
|
| 277 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 278 |
+
|
| 279 |
+
|
| 280 |
+
class TestAltoToTextProjector:
|
| 281 |
+
def test_protocol_satisfied(self) -> None:
|
| 282 |
+
from picarones.evaluation.projectors import Projector
|
| 283 |
+
assert isinstance(AltoToText(), Projector)
|
| 284 |
+
|
| 285 |
+
def test_project_from_filesystem(self, tmp_path) -> None:
|
| 286 |
+
xml = write_alto(_simple_doc())
|
| 287 |
+
path = tmp_path / "doc.alto.xml"
|
| 288 |
+
path.write_bytes(xml)
|
| 289 |
+
|
| 290 |
+
artifact = Artifact(
|
| 291 |
+
id="d1:ocr:alto",
|
| 292 |
+
document_id="d1",
|
| 293 |
+
type=ArtifactType.ALTO_XML,
|
| 294 |
+
uri=str(path),
|
| 295 |
+
)
|
| 296 |
+
projector = AltoToText()
|
| 297 |
+
target, report = projector.project(artifact, {})
|
| 298 |
+
assert target.type == ArtifactType.RAW_TEXT
|
| 299 |
+
assert report.lossy is True
|
| 300 |
+
assert "geometry" in report.ignored_dimensions
|
| 301 |
+
|
| 302 |
+
def test_project_wrong_type_raises(self) -> None:
|
| 303 |
+
artifact = Artifact(
|
| 304 |
+
id="d1:image", document_id="d1",
|
| 305 |
+
type=ArtifactType.IMAGE,
|
| 306 |
+
)
|
| 307 |
+
with pytest.raises(ProjectionError, match="ALTO_XML"):
|
| 308 |
+
AltoToText().project(artifact, {})
|
| 309 |
+
|
| 310 |
+
def test_project_missing_uri_raises(self) -> None:
|
| 311 |
+
artifact = Artifact(
|
| 312 |
+
id="d1:alto", document_id="d1",
|
| 313 |
+
type=ArtifactType.ALTO_XML,
|
| 314 |
+
)
|
| 315 |
+
with pytest.raises(ProjectionError, match="URI"):
|
| 316 |
+
AltoToText().project(artifact, {})
|
|
File without changes
|
|
@@ -0,0 +1,136 @@
|
|
| 1 |
+
"""Sprint A14-S9 — PAGE XML parser, projector."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
import pytest
|
| 6 |
+
|
| 7 |
+
from picarones.domain import Artifact, ArtifactType
|
| 8 |
+
from picarones.domain.errors import ProjectionError
|
| 9 |
+
from picarones.formats.pagexml import (
|
| 10 |
+
PageDocument,
|
| 11 |
+
PageParseError,
|
| 12 |
+
PagePage,
|
| 13 |
+
PageTextLine,
|
| 14 |
+
PageTextRegion,
|
| 15 |
+
PageToText,
|
| 16 |
+
page_document_to_text,
|
| 17 |
+
parse_pagexml,
|
| 18 |
+
)
|
| 19 |
+
|
| 20 |
+
|
| 21 |
+
_SAMPLE_PAGE_XML = '''<?xml version="1.0" encoding="UTF-8"?>
|
| 22 |
+
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
|
| 23 |
+
<Page imageFilename="folio_001.png" imageWidth="1200" imageHeight="1800">
|
| 24 |
+
<TextRegion id="r1" type="paragraph">
|
| 25 |
+
<Coords points="100,100 1100,100 1100,400 100,400"/>
|
| 26 |
+
<TextLine id="l1">
|
| 27 |
+
<Coords points="100,100 1100,100 1100,150 100,150"/>
|
| 28 |
+
<Baseline points="100,140 1100,140"/>
|
| 29 |
+
<TextEquiv><Unicode>Premier ligne</Unicode></TextEquiv>
|
| 30 |
+
</TextLine>
|
| 31 |
+
<TextLine id="l2">
|
| 32 |
+
<TextEquiv><Unicode>deuxième ligne</Unicode></TextEquiv>
|
| 33 |
+
</TextLine>
|
| 34 |
+
</TextRegion>
|
| 35 |
+
<TextRegion id="r2" type="heading">
|
| 36 |
+
<TextLine id="l3">
|
| 37 |
+
<TextEquiv><Unicode>Titre</Unicode></TextEquiv>
|
| 38 |
+
</TextLine>
|
| 39 |
+
</TextRegion>
|
| 40 |
+
</Page>
|
| 41 |
+
</PcGts>
|
| 42 |
+
'''.encode("utf-8")
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
class TestParser:
|
| 46 |
+
def test_parse_simple_page(self) -> None:
|
| 47 |
+
doc = parse_pagexml(_SAMPLE_PAGE_XML)
|
| 48 |
+
assert len(doc.pages) == 1
|
| 49 |
+
page = doc.pages[0]
|
| 50 |
+
assert page.image_filename == "folio_001.png"
|
| 51 |
+
assert page.image_width == 1200
|
| 52 |
+
assert page.image_height == 1800
|
| 53 |
+
assert len(page.text_regions) == 2
|
| 54 |
+
|
| 55 |
+
def test_text_lines_extracted(self) -> None:
|
| 56 |
+
doc = parse_pagexml(_SAMPLE_PAGE_XML)
|
| 57 |
+
r1 = doc.pages[0].text_regions[0]
|
| 58 |
+
assert len(r1.text_lines) == 2
|
| 59 |
+
assert r1.text_lines[0].text == "Premier ligne"
|
| 60 |
+
assert r1.text_lines[0].coords is not None
|
| 61 |
+
assert r1.text_lines[0].baseline is not None
|
| 62 |
+
|
| 63 |
+
def test_region_type_preserved(self) -> None:
|
| 64 |
+
doc = parse_pagexml(_SAMPLE_PAGE_XML)
|
| 65 |
+
assert doc.pages[0].text_regions[0].region_type == "paragraph"
|
| 66 |
+
assert doc.pages[0].text_regions[1].region_type == "heading"
|
| 67 |
+
|
| 68 |
+
def test_namespace_detected(self) -> None:
|
| 69 |
+
doc = parse_pagexml(_SAMPLE_PAGE_XML)
|
| 70 |
+
assert doc.source_namespace is not None
|
| 71 |
+
assert "primaresearch" in doc.source_namespace
|
| 72 |
+
|
| 73 |
+
def test_empty_raises(self) -> None:
|
| 74 |
+
with pytest.raises(PageParseError, match="vide"):
|
| 75 |
+
parse_pagexml(b"")
|
| 76 |
+
|
| 77 |
+
def test_invalid_xml_raises(self) -> None:
|
| 78 |
+
with pytest.raises(PageParseError, match="invalide"):
|
| 79 |
+
parse_pagexml(b"<not closed")
|
| 80 |
+
|
| 81 |
+
def test_xxe_blocked(self) -> None:
|
| 82 |
+
xml = b'''<?xml version="1.0"?>
|
| 83 |
+
<!DOCTYPE foo [<!ENTITY xxe SYSTEM "file:///etc/passwd">]>
|
| 84 |
+
<PcGts>&xxe;</PcGts>'''
|
| 85 |
+
with pytest.raises(PageParseError):
|
| 86 |
+
parse_pagexml(xml)
|
| 87 |
+
|
| 88 |
+
|
| 89 |
+
class TestExtractText:
|
| 90 |
+
def test_full_extraction(self) -> None:
|
| 91 |
+
doc = parse_pagexml(_SAMPLE_PAGE_XML)
|
| 92 |
+
text = page_document_to_text(doc)
|
| 93 |
+
# Two regions separated by a blank line; lines separated by \n.
|
| 94 |
+
assert text == "Premier ligne\ndeuxième ligne\n\nTitre"
|
| 95 |
+
|
| 96 |
+
def test_empty_document(self) -> None:
|
| 97 |
+
doc = PageDocument()
|
| 98 |
+
assert page_document_to_text(doc) == ""
|
| 99 |
+
|
| 100 |
+
def test_region_without_lines_skipped(self) -> None:
|
| 101 |
+
doc = PageDocument(pages=(PagePage(
|
| 102 |
+
text_regions=(
|
| 103 |
+
PageTextRegion(id="empty"),
|
| 104 |
+
PageTextRegion(
|
| 105 |
+
id="full",
|
| 106 |
+
text_lines=(PageTextLine(text="hello"),),
|
| 107 |
+
),
|
| 108 |
+
),
|
| 109 |
+
),),)
|
| 110 |
+
assert page_document_to_text(doc) == "hello"
|
| 111 |
+
|
| 112 |
+
|
| 113 |
+
class TestProjector:
|
| 114 |
+
def test_protocol_satisfied(self) -> None:
|
| 115 |
+
from picarones.evaluation.projectors import Projector
|
| 116 |
+
assert isinstance(PageToText(), Projector)
|
| 117 |
+
|
| 118 |
+
def test_project_from_filesystem(self, tmp_path) -> None:
|
| 119 |
+
path = tmp_path / "doc.page.xml"
|
| 120 |
+
path.write_bytes(_SAMPLE_PAGE_XML)
|
| 121 |
+
artifact = Artifact(
|
| 122 |
+
id="d:page",
|
| 123 |
+
document_id="d",
|
| 124 |
+
type=ArtifactType.PAGE_XML,
|
| 125 |
+
uri=str(path),
|
| 126 |
+
)
|
| 127 |
+
target, report = PageToText().project(artifact, {})
|
| 128 |
+
assert target.type == ArtifactType.RAW_TEXT
|
| 129 |
+
assert "geometry" in report.ignored_dimensions
|
| 130 |
+
|
| 131 |
+
def test_wrong_type_rejected(self) -> None:
|
| 132 |
+
artifact = Artifact(
|
| 133 |
+
id="d:alto", document_id="d", type=ArtifactType.ALTO_XML,
|
| 134 |
+
)
|
| 135 |
+
with pytest.raises(ProjectionError, match="PAGE_XML"):
|
| 136 |
+
PageToText().project(artifact, {})
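The extraction behaviour checked above (text read from `TextEquiv > Unicode`, lines joined by `\n`, regions separated by a blank line, empty regions skipped) can be sketched without binding to a specific PAGE schema year by comparing local tag names only. This is an illustrative sketch, not the commit's `parse_pagexml` or `page_document_to_text`: the function names are hypothetical, and it uses the standard-library parser for self-containment where the commit uses `defusedxml`.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def _local(tag: str) -> str:
    """Strip the "{namespace}" prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def _first_child(elem: ET.Element, name: str) -> Optional[ET.Element]:
    for child in elem:
        if _local(child.tag) == name:
            return child
    return None


def page_xml_to_text(xml_bytes: bytes) -> str:
    root = ET.fromstring(xml_bytes)
    regions: list[str] = []
    for region in root.iter():
        if _local(region.tag) != "TextRegion":
            continue
        lines: list[str] = []
        for line in region:
            if _local(line.tag) != "TextLine":
                continue
            equiv = _first_child(line, "TextEquiv")
            unicode_el = _first_child(equiv, "Unicode") if equiv is not None else None
            if unicode_el is not None and unicode_el.text:
                lines.append(unicode_el.text)
        if lines:  # regions without text lines are skipped
            regions.append("\n".join(lines))
    return "\n\n".join(regions)
```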
|
|
File without changes
|
|
@@ -0,0 +1,80 @@
|
|
| 1 |
+
"""Sprint A14-S9: migration of ``normalization`` to ``formats/text/``.
|
| 2 |
+
|
| 3 |
+
Checks that:
|
| 4 |
+
|
| 5 |
+
1. The new module ``picarones.formats.text.normalization`` exposes
|
| 6 |
+
the 11 canonical profiles.
|
| 7 |
+
2. The old re-export ``picarones.measurements.normalization`` still
|
| 8 |
+
works without error (strict backward compatibility).
|
| 9 |
+
3. The private symbols used downstream (``_parse_exclude_chars``,
|
| 10 |
+
``_apply_diplomatic_table``) are exposed through the re-export.
|
| 11 |
+
4. Both import paths return **the same object** (not a
|
| 12 |
+
copy), which shows it is a true re-export, not a duplicate.
|
| 13 |
+
"""
|
| 14 |
+
|
| 15 |
+
from __future__ import annotations
|
| 16 |
+
|
| 17 |
+
|
| 18 |
+
def test_new_path_exposes_all_eleven_profiles() -> None:
|
| 19 |
+
from picarones.formats.text.normalization import NORMALIZATION_PROFILES
|
| 20 |
+
expected = {
|
| 21 |
+
"nfc", "caseless", "minimal",
|
| 22 |
+
"medieval_french", "early_modern_french",
|
| 23 |
+
"medieval_latin", "early_modern_english", "medieval_english",
|
| 24 |
+
"secretary_hand", "sans_ponctuation", "sans_apostrophes",
|
| 25 |
+
}
|
| 26 |
+
assert set(NORMALIZATION_PROFILES.keys()) == expected
|
| 27 |
+
|
| 28 |
+
|
| 29 |
+
def test_old_reexport_works() -> None:
|
| 30 |
+
"""Backward compatibility: ~50 consumers import from the old
|
| 31 |
+
path."""
|
| 32 |
+
from picarones.measurements.normalization import (
|
| 33 |
+
DEFAULT_DIPLOMATIC_PROFILE,
|
| 34 |
+
NORMALIZATION_PROFILES,
|
| 35 |
+
NormalizationProfile,
|
| 36 |
+
get_builtin_profile,
|
| 37 |
+
)
|
| 38 |
+
assert NormalizationProfile is not None
|
| 39 |
+
assert "medieval_french" in NORMALIZATION_PROFILES
|
| 40 |
+
assert get_builtin_profile("nfc") is not None
|
| 41 |
+
assert DEFAULT_DIPLOMATIC_PROFILE.name == "medieval_french"
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
def test_private_symbols_reexported() -> None:
|
| 45 |
+
"""Symbols prefixed with ``_`` that are used downstream must remain
|
| 46 |
+
importable from the old path."""
|
| 47 |
+
from picarones.measurements.normalization import (
|
| 48 |
+
_apply_diplomatic_table,
|
| 49 |
+
_parse_exclude_chars,
|
| 50 |
+
)
|
| 51 |
+
assert callable(_parse_exclude_chars)
|
| 52 |
+
assert callable(_apply_diplomatic_table)
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
def test_old_and_new_paths_share_same_objects() -> None:
|
| 56 |
+
"""Proof that this is a true re-export, not a duplicate."""
|
| 57 |
+
from picarones.formats.text.normalization import (
|
| 58 |
+
NORMALIZATION_PROFILES as new_profiles,
|
| 59 |
+
NormalizationProfile as NewProfile,
|
| 60 |
+
get_builtin_profile as new_get,
|
| 61 |
+
)
|
| 62 |
+
from picarones.measurements.normalization import (
|
| 63 |
+
NORMALIZATION_PROFILES as old_profiles,
|
| 64 |
+
NormalizationProfile as OldProfile,
|
| 65 |
+
get_builtin_profile as old_get,
|
| 66 |
+
)
|
| 67 |
+
assert new_profiles is old_profiles # same dict
|
| 68 |
+
assert NewProfile is OldProfile # same class
|
| 69 |
+
assert new_get is old_get # same function
|
| 70 |
+
|
| 71 |
+
|
| 72 |
+
def test_apply_profile_works_via_new_path() -> None:
|
| 73 |
+
"""Functional test: a profile loaded from the new path
|
| 74 |
+
does apply the normalization."""
|
| 75 |
+
from picarones.formats.text.normalization import get_builtin_profile
|
| 76 |
+
profile = get_builtin_profile("medieval_french")
|
| 77 |
+
# ſ → s, u → v in the medieval French profile.
|
| 78 |
+
normalized = profile.normalize("aſpre")
|
| 79 |
+
assert "ſ" not in normalized
|
| 80 |
+
assert "s" in normalized
|