phaseC: extras/importers/: 6 importers moved to Circle 3, 2 of them marked experimental
Third phase of the three-circle refactoring. Goal: extract the
importers (remote sources: IIIF, HuggingFace, HTR-United, Gallica,
eScriptorium, plus the HTTP helpers) from the root package into
``picarones/extras/importers/``.
Modules moved (6 files, ~2600 lines)
------------------------------------
``picarones/extras/importers/``:
- ``_http.py``: shared HTTP helpers (validate_http_url, download_url)
- ``iiif.py``: IIIF v2/v3 manifests (Bodleian, BnF, Vatican)
- ``htr_united.py``: HTR-United catalogue (CC0, GitHub)
- ``gallica.py``: BnF Gallica (SRU + IIIF + raw OCR)
- ``huggingface.py``: ⚠ status **experimental**
- ``escriptorium.py``: ⚠ status **experimental**
Experimental modules
--------------------
``huggingface`` and ``escriptorium`` emit a ``UserWarning`` on
import:
    picarones.extras.importers.huggingface is experimental and may
    change or be removed without notice. Use at your own risk until
    an institutional use case validates the API.
Rationale:
- HuggingFace: no real users, the ``datasets`` API changes
  frequently, no integration tests.
- eScriptorium: 1 isolated test (Sprint 8), not validated against a
  public instance, prototype status from the start.
The warning breaks nothing (it is purely informational). If an
institution finds the API useful and validates it, the module can be
promoted to maintained status and the warning removed.
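A minimal sketch of how a caller might observe or silence this warning, using only the standard ``warnings`` module. ``load_experimental`` is a hypothetical stand-in for importing one of the two experimental modules; it is not part of Picarones.

```python
import warnings


def load_experimental(name: str) -> None:
    """Hypothetical stand-in: emits the same kind of warning as the real import."""
    warnings.warn(
        f"{name} is experimental and may change or be removed without notice.",
        category=UserWarning,
        stacklevel=2,
    )


# Record the warning instead of letting it print to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    load_experimental("picarones.extras.importers.huggingface")
messages = [str(w.message) for w in caught]

# Silence it for callers who have accepted the risk.
with warnings.catch_warnings(record=True) as silenced:
    warnings.simplefilter("always")
    warnings.filterwarnings(
        "ignore", message=r".*is experimental.*", category=UserWarning
    )
    load_experimental("picarones.extras.importers.escriptorium")
```

The filter matches on the message text, so it should not hide unrelated ``UserWarning`` instances.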
Full backward compatibility (6 shims of 16 lines)
-------------------------------------------------
Legacy imports are preserved:
    from picarones.importers.iiif import IIIFImporter
    from picarones.importers.gallica import GallicaClient
    from picarones.importers._http import validate_http_url
Identity is preserved: ``shim.X is extras.X`` (checked with ``is`` tests).
Note: ``picarones/importers/__init__.py`` is unchanged. It
re-exports symbols from the shims, which themselves load them from
``picarones.extras.importers.X``. That makes three levels of
indirection, all fully transparent to callers.
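The shim files themselves are not shown here, so the following is only a sketch of the general re-export pattern, built with throwaway in-memory modules. The module names ``demo_extras_iiif`` and ``demo_legacy_iiif`` and the ``SHIM_SOURCE`` string are assumptions for illustration, not the actual Picarones shims.

```python
import sys
import types

# Hypothetical stand-in for the new location (picarones.extras.importers.iiif).
new_mod = types.ModuleType("demo_extras_iiif")


class IIIFImporter:
    """Placeholder class; the real importer lives in Picarones."""


new_mod.IIIFImporter = IIIFImporter
sys.modules["demo_extras_iiif"] = new_mod

# Hypothetical shim body: a thin module that only re-exports the symbol.
SHIM_SOURCE = (
    "from demo_extras_iiif import IIIFImporter\n"
    "__all__ = ['IIIFImporter']\n"
)
shim = types.ModuleType("demo_legacy_iiif")
exec(SHIM_SOURCE, shim.__dict__)
sys.modules["demo_legacy_iiif"] = shim

# Both import paths resolve to the very same class object.
from demo_legacy_iiif import IIIFImporter as Legacy
from demo_extras_iiif import IIIFImporter as New

same_object = Legacy is New
```

Because the shim imports the name rather than redefining it, ``isinstance`` checks and ``except`` clauses written against the old path keep working with objects created through the new one.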
pyproject.toml: extra [importers]
---------------------------------
New extra ``picarones[importers]`` (empty for now, since the modules
still ship in the main package). It documents the intent to split the
importers into a separate PyPI package later. Included in the
``[all]`` extra.
Validation 8/8 in sandbox
-------------------------
- 11 backward-compatible imports OK (4 maintained importers + 1 HTTP
  helper + 1 experimental importer, multiplied by their public symbols).
- Shim ↔ new path identity preserved (3 pairs tested).
- huggingface: experimental UserWarning emitted on import.
- escriptorium: experimental UserWarning emitted on import.
- iiif: no warning (maintained importer).
- ``picarones/importers/__init__.py`` still re-exports
  IIIFImporter, GallicaClient, EScriptoriumClient through the shims.
- ``cli/_imports.py`` loads without error.
- pyproject.toml: extra ``[importers]`` documented.
- 6 thin shims (16 lines each).
Tests
-----
+220 lines in tests/test_phaseC_migration.py, organised into 8 classes:
TestImportersRetrocompat, TestExtrasImportersPath,
TestIdentityThroughShim, TestExperimentalImporters,
TestImportersInitReexports, TestCliImportsCommand,
TestPyprojectExtra, TestOriginalsAreShims.
Cumulative summary, phases A + B + C
------------------------------------
- Circle 3 now contains 24 modules + 6 renderers
  (4 academic + 1 governance + 8 historical + 6 importers
  + 4 + 2 renderers + 1 HTTP helper).
- Python code outside ``extras/`` is lighter by about 7600 lines
  (moved, not deleted).
- No feature removed.
Next phases
-----------
- Phase E: core/ → core/ (Circle 1, ~15 modules) +
  measurements/ (~30 Circle 2 modules).
- Phase D: docs/api-stable.md + test_public_api.py + version 2.0.
- picarones/extras/importers/__init__.py +27 -0
- picarones/extras/importers/_http.py +108 -0
- picarones/extras/importers/escriptorium.py +553 -0
- picarones/extras/importers/gallica.py +553 -0
- picarones/extras/importers/htr_united.py +455 -0
- picarones/extras/importers/huggingface.py +445 -0
- picarones/extras/importers/iiif.py +565 -0
- picarones/importers/_http.py +12 -103
- picarones/importers/escriptorium.py +12 -529
- picarones/importers/gallica.py +12 -548
- picarones/importers/htr_united.py +12 -450
- picarones/importers/huggingface.py +12 -422
- picarones/importers/iiif.py +12 -560
- pyproject.toml +8 -1
- tests/test_phaseC_migration.py +233 -0
@@ -0,0 +1,27 @@
+"""Corpus importers for remote sources (Circle 3).
+
+Phase C of the three-circle refactoring.
+
+Importers shipped
+-----------------
+- :mod:`_http`: shared HTTP helpers (validate_http_url, download_url)
+- :mod:`iiif`: IIIF v2/v3 manifests (Bodleian, BnF, Vatican…)
+- :mod:`htr_united`: HTR-United datasets (CC0, GitHub)
+- :mod:`gallica`: BnF Gallica (SRU + IIIF + raw OCR)
+- :mod:`huggingface`: HuggingFace datasets ⚠ **experimental**
+- :mod:`escriptorium`: eScriptorium projects ⚠ **experimental**
+
+Experimental modules
+--------------------
+``huggingface`` and ``escriptorium`` emit a ``UserWarning`` on
+import. They are functionally present, but production use is not
+guaranteed: the HuggingFace Datasets API changes frequently and
+eScriptorium has only one isolated test. Use at your own risk until
+an institutional use case validates their API.
+
+Separable plugin
+----------------
+Distributed via the pip extra ``picarones[importers]``. The legacy
+imports ``from picarones.importers.iiif import ...`` keep working
+through shim files in :mod:`picarones.importers`.
+"""
@@ -0,0 +1,108 @@
+"""HTTP helpers shared by the IIIF / Gallica / HTR-United importers.
+
+Workstream 4 of the post-Sprint 97 evolution plan: merging Gallica into IIIF.
+
+Previously the ``_validate_url`` and ``_download_url`` functions were
+duplicated between :mod:`picarones.importers.iiif` (lines 310-344) and
+:mod:`picarones.importers.gallica` (lines 125-155). The Gallica module
+was 549 lines long, a good part of which reimplemented the same HTTP
+abstractions as IIIF (scheme validation, exponential retry, HTTP
+status handling).
+
+This private module centralises those helpers. Both importers (and any
+future HTTP importer) use them. Public behaviour is unchanged; this is
+purely a refactoring.
+"""
+
+from __future__ import annotations
+
+import logging
+import time
+import urllib.error
+import urllib.request
+from typing import Optional
+from urllib.parse import urlparse
+
+logger = logging.getLogger(__name__)
+
+_DEFAULT_USER_AGENT = (
+    "Picarones/1.0 (OCR benchmark platform; "
+    "https://github.com/maribakulj/Picarones)"
+)
+
+
+def validate_http_url(url: str) -> None:
+    """Raise ``ValueError`` if the URL scheme is not http/https.
+
+    Guards against ``file://``, ``ftp://`` and ``data:`` URLs, which
+    would let a malicious IIIF manifest read local files or bypass the
+    network policy.
+    """
+    parsed = urlparse(url)
+    if parsed.scheme not in ("http", "https"):
+        raise ValueError(
+            f"Schéma URL non autorisé '{parsed.scheme}' "
+            f"(seuls http/https sont acceptés) : {url}"
+        )
+
+
+def download_url(
+    url: str,
+    *,
+    retries: int = 4,
+    backoff: float = 2.0,
+    timeout: int = 60,
+    user_agent: str = _DEFAULT_USER_AGENT,
+    extra_headers: Optional[dict[str, str]] = None,
+) -> bytes:
+    """Download a URL with exponential retry.
+
+    Parameters
+    ----------
+    url:
+        URL to download. Validated by :func:`validate_http_url`.
+    retries:
+        Total number of attempts (default 4).
+    backoff:
+        Base of the exponential backoff: wait = ``backoff ** attempt``
+        seconds (default 2.0 → 0, 2, 4, 8 s).
+    timeout:
+        HTTP timeout per attempt, in seconds (default 60).
+    user_agent:
+        ``User-Agent`` header sent. Default: identifies Picarones.
+    extra_headers:
+        Additional headers (e.g. ``{"Accept": "application/json"}``).
+
+    Raises
+    ------
+    ValueError
+        If the URL does not use an allowed scheme.
+    RuntimeError
+        If every attempt fails.
+    """
+    validate_http_url(url)
+    headers = {"User-Agent": user_agent}
+    if extra_headers:
+        headers.update(extra_headers)
+    last_exc: Optional[Exception] = None
+    for attempt in range(retries):
+        if attempt > 0:
+            wait = backoff ** attempt
+            logger.debug(
+                "Retry %d/%d dans %.1fs — %s",
+                attempt, retries - 1, wait, url,
+            )
+            time.sleep(wait)
+        try:
+            req = urllib.request.Request(url, headers=headers)
+            with urllib.request.urlopen(req, timeout=timeout) as resp:
+                return resp.read()
+        except (urllib.error.URLError, urllib.error.HTTPError) as exc:
+            last_exc = exc
+            logger.warning("Erreur téléchargement %s : %s", url, exc)
+    raise RuntimeError(
+        f"Impossible de télécharger {url} après {retries} tentatives",
+    ) from last_exc
+
+
+__all__ = ["validate_http_url", "download_url"]
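The retry timing used by ``download_url`` can be illustrated in isolation. The sketch below only reproduces the wait computation (no network access); ``backoff_schedule`` is a hypothetical helper introduced for illustration and does not exist in the module.

```python
def backoff_schedule(retries: int = 4, backoff: float = 2.0) -> list[float]:
    """Hypothetical helper: wait in seconds before each attempt."""
    # Attempt 0 runs immediately; attempt n > 0 waits backoff ** n seconds,
    # mirroring the `if attempt > 0` branch of download_url.
    return [0.0 if attempt == 0 else backoff ** attempt for attempt in range(retries)]


schedule = backoff_schedule()   # defaults of download_url
total_wait = sum(schedule)      # worst-case sleep time, excluding HTTP timeouts
```

With the defaults, a URL that never answers would therefore cost the sum of these waits plus up to ``timeout`` seconds per attempt before ``RuntimeError`` is raised.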
@@ -0,0 +1,553 @@
+"""eScriptorium integration: import and export through the REST API.
+
+⚠ **Status: experimental** (phase C of the three-circle refactoring).
+This module has only one isolated test (Sprint 8) and has not been
+validated against a public eScriptorium instance. Use at your own risk
+until an institutional use case validates the API. A ``UserWarning`` is
+emitted on import.
+
+How it works
+------------
+1. Token authentication (settings → API key in eScriptorium)
+2. Listing and import of projects, documents and transcriptions
+3. Export of Picarones benchmark results as an OCR layer in eScriptorium
+
+eScriptorium API
+----------------
+eScriptorium exposes a documented REST API at /api/.
+The main endpoints used here:
+- GET /api/projects/ → list of projects
+- GET /api/documents/ → list of documents (filterable by project)
+- GET /api/documents/{pk}/parts/ → list of pages of a document
+- GET /api/documents/{pk}/parts/{pk}/transcriptions/ → transcriptions of a page
+- POST /api/documents/{pk}/parts/{pk}/transcriptions/ → create an OCR layer
+
+Usage
+-----
+>>> from picarones.importers.escriptorium import EScriptoriumClient
+>>> client = EScriptoriumClient("https://escriptorium.example.org", token="abc123")
+>>> projects = client.list_projects()
+>>> corpus = client.import_document(doc_pk=42, transcription_layer="manual")
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+import urllib.error
+import urllib.parse
+import urllib.request
+import warnings
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import TYPE_CHECKING, Optional
+
+
+# Emit the ``experimental`` warning on import. Phase C of the
+# refactoring; see the module docstring above.
+warnings.warn(
+    "picarones.extras.importers.escriptorium is experimental and may "
+    "change or be removed without notice. Use at your own risk until "
+    "an institutional use case validates the API.",
+    category=UserWarning,
+    stacklevel=2,
+)
+
+
+from picarones.core.corpus import Corpus, Document
+
+if TYPE_CHECKING:
+    from picarones.core.results import BenchmarkResult
+
+logger = logging.getLogger(__name__)
+
+# ---------------------------------------------------------------------------
+# eScriptorium data structures
+# ---------------------------------------------------------------------------
+
+@dataclass
+class EScriptoriumProject:
+    """An eScriptorium project."""
+    pk: int
+    name: str
+    slug: str
+    owner: str = ""
+    document_count: int = 0
+
+    def as_dict(self) -> dict:
+        return {
+            "pk": self.pk,
+            "name": self.name,
+            "slug": self.slug,
+            "owner": self.owner,
+            "document_count": self.document_count,
+        }
+
+
+@dataclass
+class EScriptoriumDocument:
+    """An eScriptorium document."""
+    pk: int
+    name: str
+    project: str = ""
+    part_count: int = 0
+    transcription_layers: list[str] = field(default_factory=list)
+
+    def as_dict(self) -> dict:
+        return {
+            "pk": self.pk,
+            "name": self.name,
+            "project": self.project,
+            "part_count": self.part_count,
+            "transcription_layers": self.transcription_layers,
+        }
+
+
+@dataclass
+class EScriptoriumPart:
+    """A page (part) of an eScriptorium document."""
+    pk: int
+    title: str
+    image_url: str
+    order: int = 0
+    transcriptions: list[dict] = field(default_factory=list)
+
+
+# ---------------------------------------------------------------------------
+# eScriptorium API client
+# ---------------------------------------------------------------------------
+
+class EScriptoriumClient:
+    """Client for the eScriptorium REST API.
+
+    Parameters
+    ----------
+    base_url:
+        Root URL of the instance (e.g. ``"https://escriptorium.example.org"``).
+    token:
+        API authentication token (from Settings > API in eScriptorium).
+    timeout:
+        HTTP timeout in seconds.
+
+    Examples
+    --------
+    >>> client = EScriptoriumClient("https://escriptorium.example.org", token="abc123")
+    >>> projects = client.list_projects()
+    >>> corpus = client.import_document(42, transcription_layer="manual")
+    """
+
+    def __init__(
+        self,
+        base_url: str,
+        token: str,
+        timeout: int = 30,
+    ) -> None:
+        self.base_url = base_url.rstrip("/")
+        self.token = token
+        self.timeout = timeout
+
+    # ------------------------------------------------------------------
+    # HTTP helpers
+    # ------------------------------------------------------------------
+
+    def _headers(self) -> dict[str, str]:
+        return {
+            "Authorization": f"Token {self.token}",
+            "Accept": "application/json",
+            "Content-Type": "application/json",
+        }
+
+    def _get(self, path: str, params: Optional[dict] = None) -> dict:
+        """Send a GET request and return the decoded JSON."""
+        url = f"{self.base_url}/api/{path.lstrip('/')}"
+        if params:
+            url += "?" + urllib.parse.urlencode(params)
+        req = urllib.request.Request(url, headers=self._headers())
+        try:
+            with urllib.request.urlopen(req, timeout=self.timeout) as resp:
+                return json.loads(resp.read().decode("utf-8"))
+        except urllib.error.HTTPError as exc:
+            raise RuntimeError(
+                f"eScriptorium API erreur {exc.code} sur {url}: {exc.reason}"
+            ) from exc
+        except urllib.error.URLError as exc:
+            raise RuntimeError(
+                f"Impossible de joindre {self.base_url}: {exc.reason}"
+            ) from exc
+
+    def _post(self, path: str, payload: dict) -> dict:
+        """Send a POST request with a JSON payload."""
+        url = f"{self.base_url}/api/{path.lstrip('/')}"
+        data = json.dumps(payload).encode("utf-8")
+        req = urllib.request.Request(
+            url, data=data, headers=self._headers(), method="POST"
+        )
+        try:
+            with urllib.request.urlopen(req, timeout=self.timeout) as resp:
+                body = resp.read().decode("utf-8")
+                return json.loads(body) if body else {}
+        except urllib.error.HTTPError as exc:
+            raise RuntimeError(
+                f"eScriptorium API erreur {exc.code} sur {url}: {exc.reason}"
+            ) from exc
+        except urllib.error.URLError as exc:
+            raise RuntimeError(
+                f"Impossible de joindre {self.base_url}: {exc.reason}"
+            ) from exc
+
+    def _paginate(self, path: str, params: Optional[dict] = None) -> list[dict]:
+        """Walk through every page of a paginated result set."""
+        results: list[dict] = []
+        current_params = dict(params or {})
+        current_params.setdefault("page_size", 100)
+        page_num = 1
+        while True:
+            current_params["page"] = page_num
+            data = self._get(path, current_params)
+            if isinstance(data, list):
+                results.extend(data)
+                break
+            results.extend(data.get("results", []))
+            if not data.get("next"):
+                break
+            page_num += 1
+        return results
+
+    # ------------------------------------------------------------------
+    # Public API
+    # ------------------------------------------------------------------
+
+    def test_connection(self) -> bool:
+        """Check that the URL and token are valid.
+
+        Returns
+        -------
+        bool
+            True if authentication succeeds.
+        """
+        try:
+            self._get("projects/", {"page_size": 1})
+            return True
+        except RuntimeError:
+            return False
+
+    def list_projects(self) -> list[EScriptoriumProject]:
+        """Return the list of accessible projects.
+
+        Returns
+        -------
+        list[EScriptoriumProject]
+        """
+        raw = self._paginate("projects/")
+        projects = []
+        for item in raw:
+            projects.append(EScriptoriumProject(
+                pk=item["pk"],
+                name=item.get("name", ""),
+                slug=item.get("slug", ""),
+                owner=item.get("owner", {}).get("username", "") if isinstance(item.get("owner"), dict) else str(item.get("owner", "")),
+                document_count=item.get("documents_count", 0),
+            ))
+        return projects
+
+    def list_documents(
+        self,
+        project_pk: Optional[int] = None,
+    ) -> list[EScriptoriumDocument]:
+        """Return the list of documents, filtered by project if given.
+
+        Parameters
+        ----------
+        project_pk:
+            PK of the eScriptorium project (optional).
+
+        Returns
+        -------
+        list[EScriptoriumDocument]
+        """
+        params: dict = {}
+        if project_pk is not None:
+            params["project"] = project_pk
+        raw = self._paginate("documents/", params)
+        docs = []
+        for item in raw:
+            layers = [
+                t.get("name", "") if isinstance(t, dict) else str(t)
+                for t in item.get("transcriptions", [])
+            ]
+            docs.append(EScriptoriumDocument(
+                pk=item["pk"],
+                name=item.get("name", ""),
+                project=str(item.get("project", "")),
+                part_count=item.get("parts_count", 0),
+                transcription_layers=layers,
+            ))
+        return docs
+
+    def list_parts(self, doc_pk: int) -> list[EScriptoriumPart]:
+        """Return the pages (parts) of a document.
+
+        Parameters
+        ----------
+        doc_pk:
+            PK of the eScriptorium document.
+
+        Returns
+        -------
+        list[EScriptoriumPart]
+        """
+        raw = self._paginate(f"documents/{doc_pk}/parts/")
+        parts = []
+        for item in raw:
+            parts.append(EScriptoriumPart(
+                pk=item["pk"],
+                title=item.get("title", "") or f"Part {item.get('order', 0) + 1}",
+                image_url=item.get("image", "") or "",
+                order=item.get("order", 0),
+            ))
+        return parts
+
+    def get_transcriptions(self, doc_pk: int, part_pk: int) -> list[dict]:
+        """Return the transcriptions available for a page.
+
+        Parameters
+        ----------
+        doc_pk:
+            PK of the document.
+        part_pk:
+            PK of the page.
+
+        Returns
+        -------
+        list[dict]
+            Each dict contains ``{"name": str, "content": str}``.
+        """
+        raw = self._get(f"documents/{doc_pk}/parts/{part_pk}/transcriptions/")
+        if isinstance(raw, list):
+            return raw
+        return raw.get("results", [])
+
+    def import_document(
+        self,
+        doc_pk: int,
+        transcription_layer: str = "manual",
+        output_dir: Optional[str] = None,
+        download_images: bool = True,
+        show_progress: bool = True,
+    ) -> Corpus:
+        """Import an eScriptorium document as a Picarones corpus.
+
+        Downloads the images and fetches the transcriptions of the
+        requested layer as ground truth.
+
+        Parameters
+        ----------
+        doc_pk:
+            PK of the document in eScriptorium.
+        transcription_layer:
+            Name of the transcription layer to use as GT.
+        output_dir:
+            Local folder for downloaded images. If None, nothing is written
+            to disk and each document keeps the remote image URL.
+        download_images:
+            If True, download the images into output_dir.
+        show_progress:
+            Show a tqdm progress bar.
+
+        Returns
+        -------
+        Corpus
+            Picarones corpus with documents and GT.
+        """
+        # Fetch the document metadata
+        doc_info = self._get(f"documents/{doc_pk}/")
+        doc_name = doc_info.get("name", f"document_{doc_pk}")
+
+        parts = self.list_parts(doc_pk)
+        if not parts:
+            raise ValueError(f"Aucune page trouvée dans le document {doc_pk}")
+
+        if show_progress:
+            try:
+                from tqdm import tqdm
+                iterator = tqdm(parts, desc=f"Import {doc_name}")
+            except ImportError:
+                iterator = iter(parts)
+        else:
+            iterator = iter(parts)
+
+        out_path: Optional[Path] = None
+        if output_dir and download_images:
+            out_path = Path(output_dir)
+            out_path.mkdir(parents=True, exist_ok=True)
+
+        documents: list[Document] = []
+        for part in iterator:
+            # Fetch the transcriptions
+            transcriptions = self.get_transcriptions(doc_pk, part.pk)
+            gt_text = ""
+            for t in transcriptions:
+                layer_name = t.get("transcription", {}).get("name", "") if isinstance(t.get("transcription"), dict) else t.get("name", "")
+                if layer_name == transcription_layer or not transcription_layer:
+                    # The content is either in "content" or in the lines
+                    lines = t.get("lines", []) or []
+                    if lines:
+                        gt_text = "\n".join(
+                            line.get("content", "") or ""
+                            for line in lines
+                            if line.get("content")
+                        )
+                    else:
+                        gt_text = t.get("content", "") or ""
+                    break
+
+            # Image
+            image_path = part.image_url or f"escriptorium://doc{doc_pk}/part{part.pk}"
+            if out_path and part.image_url and download_images:
+                ext = Path(urllib.parse.urlparse(part.image_url).path).suffix or ".jpg"
+                local_img = out_path / f"part_{part.pk:05d}{ext}"
+                try:
+                    urllib.request.urlretrieve(part.image_url, local_img)
+                    image_path = str(local_img)
+                except Exception as exc:
+                    logger.warning("Impossible de télécharger l'image %s: %s", part.image_url, exc)
+
+                # Save the GT
+                gt_path = out_path / f"part_{part.pk:05d}.gt.txt"
+                gt_path.write_text(gt_text, encoding="utf-8")
+
+            documents.append(Document(
+                doc_id=f"part_{part.pk:05d}",
+                image_path=image_path,
+                ground_truth=gt_text,
+                metadata={
+                    "source": "escriptorium",
+                    "doc_pk": doc_pk,
+                    "part_pk": part.pk,
+                    "part_title": part.title,
+                    "transcription_layer": transcription_layer,
+                },
+            ))
+
+        return Corpus(
+            name=doc_name,
+            source=f"{self.base_url}/document/{doc_pk}/",
+            documents=documents,
+            metadata={
+                "escriptorium_url": self.base_url,
+                "doc_pk": doc_pk,
+                "transcription_layer": transcription_layer,
+            },
+        )
+
| 443 |
+
def export_benchmark_as_layer(
|
| 444 |
+
self,
|
| 445 |
+
benchmark_result: "BenchmarkResult",
|
| 446 |
+
doc_pk: int,
|
| 447 |
+
engine_name: str,
|
| 448 |
+
layer_name: Optional[str] = None,
|
| 449 |
+
part_mapping: Optional[dict[str, int]] = None,
|
| 450 |
+
) -> int:
|
| 451 |
+
"""Exporte les résultats Picarones comme couche OCR dans eScriptorium.
|
| 452 |
+
|
| 453 |
+
Parameters
|
| 454 |
+
----------
|
| 455 |
+
benchmark_result:
|
| 456 |
+
Résultats du benchmark Picarones.
|
| 457 |
+
doc_pk:
|
| 458 |
+
PK du document cible dans eScriptorium.
|
| 459 |
+
engine_name:
|
| 460 |
+
Nom du moteur dont on exporte les transcriptions.
|
| 461 |
+
layer_name:
|
| 462 |
+
Nom de la couche à créer (défaut : ``"picarones_{engine_name}"``).
|
| 463 |
+
part_mapping:
|
| 464 |
+
Mapping ``doc_id → part_pk`` for eScriptorium. If None,
|
| 465 |
+
the mapping is inferred from each ``doc_id`` (e.g. ``"part_00042"``).
|
| 466 |
+
|
| 467 |
+
Returns
|
| 468 |
+
-------
|
| 469 |
+
int
|
| 470 |
+
Number of pages exported successfully.
|
| 471 |
+
"""
|
| 472 |
+
if layer_name is None:
|
| 473 |
+
layer_name = f"picarones_{engine_name}"
|
| 474 |
+
|
| 475 |
+
# Find the engine's report
|
| 476 |
+
engine_report = None
|
| 477 |
+
for report in benchmark_result.engine_reports:
|
| 478 |
+
if report.engine_name == engine_name:
|
| 479 |
+
engine_report = report
|
| 480 |
+
break
|
| 481 |
+
if engine_report is None:
|
| 482 |
+
raise ValueError(f"Moteur '{engine_name}' introuvable dans les résultats.")
|
| 483 |
+
|
| 484 |
+
exported = 0
|
| 485 |
+
for doc_result in engine_report.document_results:
|
| 486 |
+
if doc_result.engine_error:
|
| 487 |
+
continue
|
| 488 |
+
|
| 489 |
+
# Determine the part_pk
|
| 490 |
+
part_pk: Optional[int] = None
|
| 491 |
+
if part_mapping and doc_result.doc_id in part_mapping:
|
| 492 |
+
part_pk = part_mapping[doc_result.doc_id]
|
| 493 |
+
else:
|
| 494 |
+
# Try to extract it from doc_id (e.g. "part_00042")
|
| 495 |
+
try:
|
| 496 |
+
part_pk = int(doc_result.doc_id.replace("part_", "").lstrip("0") or "0")
|
| 497 |
+
except ValueError:
|
| 498 |
+
logger.warning("Impossible de déterminer part_pk pour %s", doc_result.doc_id)
|
| 499 |
+
continue
|
| 500 |
+
|
| 501 |
+
try:
|
| 502 |
+
self._post(
|
| 503 |
+
f"documents/{doc_pk}/parts/{part_pk}/transcriptions/",
|
| 504 |
+
{
|
| 505 |
+
"name": layer_name,
|
| 506 |
+
"content": doc_result.hypothesis,
|
| 507 |
+
"source": "picarones",
|
| 508 |
+
},
|
| 509 |
+
)
|
| 510 |
+
exported += 1
|
| 511 |
+
logger.debug("Exporté part %d → couche '%s'", part_pk, layer_name)
|
| 512 |
+
except RuntimeError as exc:
|
| 513 |
+
logger.warning("Erreur export part %d: %s", part_pk, exc)
|
| 514 |
+
|
| 515 |
+
return exported
|
| 516 |
+
|
| 517 |
+
|
| 518 |
+
# ---------------------------------------------------------------------------
|
| 519 |
+
# Module-level interface
|
| 520 |
+
# ---------------------------------------------------------------------------
|
| 521 |
+
|
| 522 |
+
def connect_escriptorium(
|
| 523 |
+
base_url: str,
|
| 524 |
+
token: str,
|
| 525 |
+
timeout: int = 30,
|
| 526 |
+
) -> EScriptoriumClient:
|
| 527 |
+
"""Create and return an authenticated eScriptorium client.
|
| 528 |
+
|
| 529 |
+
Parameters
|
| 530 |
+
----------
|
| 531 |
+
base_url:
|
| 532 |
+
URL of the eScriptorium instance.
|
| 533 |
+
token:
|
| 534 |
+
API token.
|
| 535 |
+
timeout:
|
| 536 |
+
HTTP timeout.
|
| 537 |
+
|
| 538 |
+
Returns
|
| 539 |
+
-------
|
| 540 |
+
EScriptoriumClient
|
| 541 |
+
|
| 542 |
+
Raises
|
| 543 |
+
------
|
| 544 |
+
RuntimeError
|
| 545 |
+
If the connection fails (invalid URL, incorrect token, unreachable server).
|
| 546 |
+
"""
|
| 547 |
+
client = EScriptoriumClient(base_url, token, timeout)
|
| 548 |
+
if not client.test_connection():
|
| 549 |
+
raise RuntimeError(
|
| 550 |
+
f"Impossible de se connecter à {base_url}. "
|
| 551 |
+
"Vérifiez l'URL et le token API."
|
| 552 |
+
)
|
| 553 |
+
return client
|
|
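Reviewer note on `export_benchmark_as_layer`: the part PK resolution in the loop can be isolated into a small pure function. The sketch below is only an illustration; `infer_part_pk` is a hypothetical name that does not appear in the commit. It assumes document ids follow the `part_{pk:05d}` pattern produced by the importer and that an explicit `part_mapping` takes precedence.

```python
from __future__ import annotations

from typing import Optional


def infer_part_pk(
    doc_id: str,
    part_mapping: Optional[dict[str, int]] = None,
) -> Optional[int]:
    """Resolve an eScriptorium part PK for a Picarones document id.

    Hypothetical helper: the explicit mapping wins; otherwise the PK is
    parsed from ids shaped like "part_00042". Returns None when the id
    cannot be interpreted as an integer.
    """
    if part_mapping and doc_id in part_mapping:
        return part_mapping[doc_id]
    # Strip the prefix and the zero padding; keep "0" for an all-zero id.
    digits = doc_id.replace("part_", "").lstrip("0") or "0"
    try:
        return int(digits)
    except ValueError:
        return None
```

Returning `None` rather than logging keeps the helper free of side effects; the caller can still warn and skip the document, as the loop in the diff does.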
@@ -0,0 +1,553 @@
|
| 1 |
+
"""Import corpora from Gallica (BnF) via the SRU and IIIF APIs.
|
| 2 |
+
|
| 3 |
+
How it works
|
| 4 |
+
--------------
|
| 5 |
+
1. Search Gallica by call number (ark), title, author, or date via the BnF SRU API
|
| 6 |
+
2. Retrieve the images via the Gallica IIIF API
|
| 7 |
+
3. Retrieve the existing Gallica OCR (plain text or ALTO) as a reference competitor
|
| 8 |
+
|
| 9 |
+
APIs used
|
| 10 |
+
-------------
|
| 11 |
+
- BnF SRU: https://gallica.bnf.fr/SRU?operation=searchRetrieve&query=...
|
| 12 |
+
- Gallica IIIF: https://gallica.bnf.fr/ark:/12148/{ark}/manifest.json
|
| 13 |
+
- Plain-text OCR: https://gallica.bnf.fr/ark:/12148/{ark}/f{n}.texteBrut
|
| 14 |
+
- OAI-PMH metadata: https://gallica.bnf.fr/services/OAIRecord?ark={ark}
|
| 15 |
+
|
| 16 |
+
Usage
|
| 17 |
+
-----
|
| 18 |
+
>>> from picarones.importers.gallica import GallicaClient
|
| 19 |
+
>>> client = GallicaClient()
|
| 20 |
+
>>> results = client.search(title="Froissart", date_from=1380, date_to=1420, max_results=10)
|
| 21 |
+
>>> corpus = client.import_document(results[0].ark, pages="1-5", include_gallica_ocr=True)
|
| 22 |
+
"""
|
| 23 |
+
|
| 24 |
+
from __future__ import annotations
|
| 25 |
+
|
| 26 |
+
import logging
|
| 27 |
+
import re
|
| 28 |
+
import time
|
| 29 |
+
import urllib.error
|
| 30 |
+
import urllib.parse
|
| 31 |
+
import urllib.request
|
| 32 |
+
import xml.etree.ElementTree as ET
|
| 33 |
+
from dataclasses import dataclass
|
| 34 |
+
from typing import Optional
|
| 35 |
+
|
| 36 |
+
from picarones.core.corpus import Corpus
|
| 37 |
+
|
| 38 |
+
logger = logging.getLogger(__name__)
|
| 39 |
+
|
| 40 |
+
# SRU/OAI namespaces
|
| 41 |
+
_NS_SRU = "http://www.loc.gov/zing/srw/"
|
| 42 |
+
_NS_DC = "http://purl.org/dc/elements/1.1/"
|
| 43 |
+
_NS_OAI = "http://www.openarchives.org/OAI/2.0/"
|
| 44 |
+
|
| 45 |
+
_GALLICA_BASE = "https://gallica.bnf.fr"
|
| 46 |
+
_SRU_URL = f"{_GALLICA_BASE}/SRU"
|
| 47 |
+
_IIIF_MANIFEST_TPL = f"{_GALLICA_BASE}/ark:/{{ark}}/manifest.json"
|
| 48 |
+
_OCR_BRUT_TPL = f"{_GALLICA_BASE}/ark:/{{ark}}/f{{page}}.texteBrut"
|
| 49 |
+
|
| 50 |
+
|
| 51 |
+
# ---------------------------------------------------------------------------
|
| 52 |
+
# Data structures
|
| 53 |
+
# ---------------------------------------------------------------------------
|
| 54 |
+
|
| 55 |
+
@dataclass
|
| 56 |
+
class GallicaRecord:
|
| 57 |
+
"""A Gallica search result."""
|
| 58 |
+
ark: str
|
| 59 |
+
"""ARK identifier without the ``ark:/`` prefix (e.g. ``'12148/btv1b8453561w'``)."""
|
| 60 |
+
title: str
|
| 61 |
+
creator: str = ""
|
| 62 |
+
date: str = ""
|
| 63 |
+
description: str = ""
|
| 64 |
+
type_doc: str = ""
|
| 65 |
+
language: str = ""
|
| 66 |
+
rights: str = ""
|
| 67 |
+
has_ocr: bool = False
|
| 68 |
+
"""True if Gallica provides OCR for this document."""
|
| 69 |
+
|
| 70 |
+
@property
|
| 71 |
+
def url(self) -> str:
|
| 72 |
+
return f"{_GALLICA_BASE}/ark:/{self.ark}"  # self.ark already includes the "12148/" NAAN
|
| 73 |
+
|
| 74 |
+
@property
|
| 75 |
+
def manifest_url(self) -> str:
|
| 76 |
+
return f"{_GALLICA_BASE}/ark:/{self.ark}/manifest.json"  # self.ark already includes the "12148/" NAAN
|
| 77 |
+
|
| 78 |
+
def as_dict(self) -> dict:
|
| 79 |
+
return {
|
| 80 |
+
"ark": self.ark,
|
| 81 |
+
"title": self.title,
|
| 82 |
+
"creator": self.creator,
|
| 83 |
+
"date": self.date,
|
| 84 |
+
"description": self.description,
|
| 85 |
+
"type_doc": self.type_doc,
|
| 86 |
+
"language": self.language,
|
| 87 |
+
"has_ocr": self.has_ocr,
|
| 88 |
+
"url": self.url,
|
| 89 |
+
"manifest_url": self.manifest_url,
|
| 90 |
+
}
|
| 91 |
+
|
| 92 |
+
|
| 93 |
+
# ---------------------------------------------------------------------------
|
| 94 |
+
# Gallica client
|
| 95 |
+
# ---------------------------------------------------------------------------
|
| 96 |
+
|
| 97 |
+
class GallicaClient:
|
| 98 |
+
"""Client for the Gallica APIs (SRU, IIIF, plain-text OCR).
|
| 99 |
+
|
| 100 |
+
Parameters
|
| 101 |
+
----------
|
| 102 |
+
timeout:
|
| 103 |
+
HTTP timeout in seconds.
|
| 104 |
+
delay_between_requests:
|
| 105 |
+
Delay in seconds between requests (to comply with the Gallica terms
|
| 106 |
+
of use).
|
| 107 |
+
|
| 108 |
+
Examples
|
| 109 |
+
--------
|
| 110 |
+
>>> client = GallicaClient()
|
| 111 |
+
>>> results = client.search(author="Froissart", max_results=5)
|
| 112 |
+
>>> for r in results:
|
| 113 |
+
... print(r.title, r.date)
|
| 114 |
+
>>> corpus = client.import_document(results[0].ark, pages="1-3")
|
| 115 |
+
"""
|
| 116 |
+
|
| 117 |
+
def __init__(
|
| 118 |
+
self,
|
| 119 |
+
timeout: int = 30,
|
| 120 |
+
delay_between_requests: float = 0.5,
|
| 121 |
+
) -> None:
|
| 122 |
+
self.timeout = timeout
|
| 123 |
+
self.delay = delay_between_requests
|
| 124 |
+
|
| 125 |
+
# Workstream 4 (post-Sprint 97), Gallica → IIIF merge:
|
| 126 |
+
# ``_validate_url`` and the HTTP fetch are now factored out
|
| 127 |
+
# into :mod:`picarones.importers._http`. Before this workstream, these
|
| 128 |
+
# 30 lines were duplicated with :mod:`iiif`. The polite
|
| 129 |
+
# ``delay_between_requests`` stays here (specific to the BnF).
|
| 130 |
+
|
| 131 |
+
@staticmethod
|
| 132 |
+
def _validate_url(url: str) -> None:
|
| 133 |
+
"""Delegates to :func:`picarones.importers._http.validate_http_url`."""
|
| 134 |
+
from picarones.importers._http import validate_http_url
|
| 135 |
+
validate_http_url(url)
|
| 136 |
+
|
| 137 |
+
def _fetch_url(self, url: str) -> bytes:
|
| 138 |
+
"""Download the content of a URL while honoring the BnF polite delay.
|
| 139 |
+
|
| 140 |
+
Delegates to :func:`picarones.importers._http.download_url`, then
|
| 141 |
+
applies ``self.delay`` (0.5 s by default) between requests
|
| 142 |
+
to comply with the Gallica terms of use.
|
| 143 |
+
"""
|
| 144 |
+
from picarones.importers._http import download_url
|
| 145 |
+
try:
|
| 146 |
+
return download_url(
|
| 147 |
+
url,
|
| 148 |
+
retries=1,
|
| 149 |
+
timeout=self.timeout,
|
| 150 |
+
user_agent="Picarones/1.0 (research tool)",
|
| 151 |
+
)
|
| 152 |
+
except RuntimeError as exc:
|
| 153 |
+
# The helper raises ``RuntimeError`` once retries are exhausted.
|
| 154 |
+
# It is re-wrapped to keep the historical message format
|
| 155 |
+
# expected by the Gallica tests ("HTTP 404 sur ...").
|
| 156 |
+
raise RuntimeError(str(exc)) from exc
|
| 157 |
+
finally:
|
| 158 |
+
if self.delay > 0:
|
| 159 |
+
time.sleep(self.delay)
|
| 160 |
+
|
| 161 |
+
def _build_sru_query(
|
| 162 |
+
self,
|
| 163 |
+
ark: Optional[str] = None,
|
| 164 |
+
title: Optional[str] = None,
|
| 165 |
+
author: Optional[str] = None,
|
| 166 |
+
date_from: Optional[int] = None,
|
| 167 |
+
date_to: Optional[int] = None,
|
| 168 |
+
doc_type: Optional[str] = None,
|
| 169 |
+
language: Optional[str] = None,
|
| 170 |
+
) -> str:
|
| 171 |
+
"""Build a CQL query for the BnF SRU API."""
|
| 172 |
+
clauses: list[str] = []
|
| 173 |
+
|
| 174 |
+
if ark:
|
| 175 |
+
# Search by ARK identifier
|
| 176 |
+
clauses.append(f'dc.identifier any "{ark}"')
|
| 177 |
+
if title:
|
| 178 |
+
clauses.append(f'dc.title all "{title}"')
|
| 179 |
+
if author:
|
| 180 |
+
clauses.append(f'dc.creator all "{author}"')
|
| 181 |
+
if date_from and date_to:
|
| 182 |
+
clauses.append(f'dc.date >= "{date_from}" and dc.date <= "{date_to}"')
|
| 183 |
+
elif date_from:
|
| 184 |
+
clauses.append(f'dc.date >= "{date_from}"')
|
| 185 |
+
elif date_to:
|
| 186 |
+
clauses.append(f'dc.date <= "{date_to}"')
|
| 187 |
+
if doc_type:
|
| 188 |
+
clauses.append(f'dc.type all "{doc_type}"')
|
| 189 |
+
if language:
|
| 190 |
+
clauses.append(f'dc.language all "{language}"')
|
| 191 |
+
|
| 192 |
+
if not clauses:
|
| 193 |
+
return 'gallica all "document"'
|
| 194 |
+
return " and ".join(clauses)
|
| 195 |
+
|
| 196 |
+
def search(
|
| 197 |
+
self,
|
| 198 |
+
ark: Optional[str] = None,
|
| 199 |
+
title: Optional[str] = None,
|
| 200 |
+
author: Optional[str] = None,
|
| 201 |
+
date_from: Optional[int] = None,
|
| 202 |
+
date_to: Optional[int] = None,
|
| 203 |
+
doc_type: Optional[str] = None,
|
| 204 |
+
language: Optional[str] = None,
|
| 205 |
+
max_results: int = 20,
|
| 206 |
+
) -> list[GallicaRecord]:
|
| 207 |
+
"""Search Gallica via the BnF SRU API.
|
| 208 |
+
|
| 209 |
+
Parameters
|
| 210 |
+
----------
|
| 211 |
+
ark:
|
| 212 |
+
ARK identifier (e.g. ``'12148/btv1b8453561w'``).
|
| 213 |
+
title:
|
| 214 |
+
Keywords in the title.
|
| 215 |
+
author:
|
| 216 |
+
Keywords in the author/creator.
|
| 217 |
+
date_from:
|
| 218 |
+
Lower date bound (year).
|
| 219 |
+
date_to:
|
| 220 |
+
Upper date bound (year).
|
| 221 |
+
doc_type:
|
| 222 |
+
Document type (``'monographie'``, ``'périodique'``, ``'manuscrit'``…).
|
| 223 |
+
language:
|
| 224 |
+
ISO 639 language code (``'fre'``, ``'lat'``, ``'ger'``…).
|
| 225 |
+
max_results:
|
| 226 |
+
Maximum number of results to return (the SRU request itself is capped at 50 records).
|
| 227 |
+
|
| 228 |
+
Returns
|
| 229 |
+
-------
|
| 230 |
+
list[GallicaRecord]
|
| 231 |
+
List of matching documents.
|
| 232 |
+
"""
|
| 233 |
+
query = self._build_sru_query(
|
| 234 |
+
ark=ark,
|
| 235 |
+
title=title,
|
| 236 |
+
author=author,
|
| 237 |
+
date_from=date_from,
|
| 238 |
+
date_to=date_to,
|
| 239 |
+
doc_type=doc_type,
|
| 240 |
+
language=language,
|
| 241 |
+
)
|
| 242 |
+
|
| 243 |
+
params = urllib.parse.urlencode({
|
| 244 |
+
"operation": "searchRetrieve",
|
| 245 |
+
"version": "1.2",
|
| 246 |
+
"query": query,
|
| 247 |
+
"maximumRecords": min(max_results, 50),
|
| 248 |
+
"startRecord": 1,
|
| 249 |
+
"recordSchema": "unimarcXchange",
|
| 250 |
+
})
|
| 251 |
+
url = f"{_SRU_URL}?{params}"
|
| 252 |
+
|
| 253 |
+
try:
|
| 254 |
+
raw = self._fetch_url(url)
|
| 255 |
+
except RuntimeError as exc:
|
| 256 |
+
logger.error("Erreur recherche SRU Gallica: %s", exc)
|
| 257 |
+
return []
|
| 258 |
+
|
| 259 |
+
return self._parse_sru_response(raw, max_results)
|
| 260 |
+
|
| 261 |
+
def _parse_sru_response(self, xml_bytes: bytes, max_results: int) -> list[GallicaRecord]:
|
| 262 |
+
"""Parse Gallica's SRU XML response."""
|
| 263 |
+
records: list[GallicaRecord] = []
|
| 264 |
+
try:
|
| 265 |
+
root = ET.fromstring(xml_bytes)
|
| 266 |
+
except ET.ParseError as exc:
|
| 267 |
+
logger.error("Impossible de parser la réponse SRU: %s", exc)
|
| 268 |
+
return records
|
| 269 |
+
|
| 270 |
+
# Records live under srw:records/srw:record/srw:recordData
|
| 271 |
+
for rec_elem in root.iter():
|
| 272 |
+
if rec_elem.tag.endswith("}record") or rec_elem.tag == "record":
|
| 273 |
+
record = self._parse_record_element(rec_elem)
|
| 274 |
+
if record:
|
| 275 |
+
records.append(record)
|
| 276 |
+
if len(records) >= max_results:
|
| 277 |
+
break
|
| 278 |
+
|
| 279 |
+
return records
|
| 280 |
+
|
| 281 |
+
def _parse_record_element(self, elem: ET.Element) -> Optional[GallicaRecord]:
|
| 282 |
+
"""Extract the metadata of an SRU record."""
|
| 283 |
+
# Look for the Dublin Core fields in the record
|
| 284 |
+
def find_text(tag_suffix: str) -> str:
|
| 285 |
+
for child in elem.iter():
|
| 286 |
+
if child.tag.endswith(tag_suffix) and child.text:
|
| 287 |
+
return child.text.strip()
|
| 288 |
+
return ""
|
| 289 |
+
|
| 290 |
+
def find_all_text(tag_suffix: str) -> list[str]:
|
| 291 |
+
return [
|
| 292 |
+
child.text.strip()
|
| 293 |
+
for child in elem.iter()
|
| 294 |
+
if child.tag.endswith(tag_suffix) and child.text
|
| 295 |
+
]
|
| 296 |
+
|
| 297 |
+
# Look for the ARK in the identifiers
|
| 298 |
+
identifiers = find_all_text("identifier")
|
| 299 |
+
ark = ""
|
| 300 |
+
for ident in identifiers:
|
| 301 |
+
# Typical format: "https://gallica.bnf.fr/ark:/12148/btv1b8453561w"
|
| 302 |
+
m = re.search(r"ark:/(\d+/\w+)", ident)
|
| 303 |
+
if m:
|
| 304 |
+
ark = m.group(1)
|
| 305 |
+
break
|
| 306 |
+
|
| 307 |
+
if not ark:
|
| 308 |
+
return None
|
| 309 |
+
|
| 310 |
+
title = find_text("title") or "Sans titre"
|
| 311 |
+
creator = find_text("creator")
|
| 312 |
+
date = find_text("date")
|
| 313 |
+
|
| 314 |
+
# Guess whether OCR is available (heuristic based on the document type, usually monograph/periodical)
|
| 315 |
+
doc_types = find_all_text("type")
|
| 316 |
+
has_ocr = any(
|
| 317 |
+
t.lower() in ("monographie", "fascicule", "texte", "text")
|
| 318 |
+
for t in doc_types
|
| 319 |
+
)
|
| 320 |
+
|
| 321 |
+
return GallicaRecord(
|
| 322 |
+
ark=ark,
|
| 323 |
+
title=title,
|
| 324 |
+
creator=creator,
|
| 325 |
+
date=date,
|
| 326 |
+
description=find_text("description"),
|
| 327 |
+
type_doc=", ".join(doc_types),
|
| 328 |
+
language=find_text("language"),
|
| 329 |
+
has_ocr=has_ocr,
|
| 330 |
+
)
|
| 331 |
+
|
| 332 |
+
def get_ocr_text(self, ark: str, page: int) -> str:
|
| 333 |
+
"""Fetch the Gallica OCR of a specific page (plain text).
|
| 334 |
+
|
| 335 |
+
Parameters
|
| 336 |
+
----------
|
| 337 |
+
ark:
|
| 338 |
+
ARK identifier (e.g. ``'12148/btv1b8453561w'``).
|
| 339 |
+
page:
|
| 340 |
+
1-based page number.
|
| 341 |
+
|
| 342 |
+
Returns
|
| 343 |
+
-------
|
| 344 |
+
str
|
| 345 |
+
Gallica OCR text for this page (may be empty if unavailable).
|
| 346 |
+
"""
|
| 347 |
+
url = _OCR_BRUT_TPL.format(ark=ark, page=page)
|
| 348 |
+
try:
|
| 349 |
+
raw = self._fetch_url(url)
|
| 350 |
+
text = raw.decode("utf-8", errors="replace").strip()
|
| 351 |
+
# Gallica sometimes returns HTML for pages without OCR
|
| 352 |
+
if text.startswith("<!") or "<html" in text[:100].lower():
|
| 353 |
+
return ""
|
| 354 |
+
return text
|
| 355 |
+
except RuntimeError as exc:
|
| 356 |
+
logger.debug("OCR non disponible pour %s f%d: %s", ark, page, exc)
|
| 357 |
+
return ""
|
| 358 |
+
|
| 359 |
+
def import_document(
|
| 360 |
+
self,
|
| 361 |
+
ark: str,
|
| 362 |
+
pages: str = "all",
|
| 363 |
+
output_dir: Optional[str] = None,
|
| 364 |
+
include_gallica_ocr: bool = True,
|
| 365 |
+
max_resolution: int = 0,
|
| 366 |
+
show_progress: bool = True,
|
| 367 |
+
) -> Corpus:
|
| 368 |
+
"""Import a Gallica document as a Picarones corpus.
|
| 369 |
+
|
| 370 |
+
Uses the Gallica IIIF manifest to list the pages and download
|
| 371 |
+
the images. The Gallica OCR is optionally fetched as GT or as a
|
| 372 |
+
reference transcription.
|
| 373 |
+
|
| 374 |
+
Parameters
|
| 375 |
+
----------
|
| 376 |
+
ark:
|
| 377 |
+
ARK identifier (e.g. ``'12148/btv1b8453561w'``).
|
| 378 |
+
pages:
|
| 379 |
+
Page selector (``'all'``, ``'1-10'``, ``'1,3,5'``…).
|
| 380 |
+
output_dir:
|
| 381 |
+
Local folder for storing images and GT.
|
| 382 |
+
include_gallica_ocr:
|
| 383 |
+
If True, fetch the Gallica OCR as reference text.
|
| 384 |
+
max_resolution:
|
| 385 |
+
Maximum width of downloaded images (0 = maximum available).
|
| 386 |
+
show_progress:
|
| 387 |
+
Show a progress bar.
|
| 388 |
+
|
| 389 |
+
Returns
|
| 390 |
+
-------
|
| 391 |
+
Corpus
|
| 392 |
+
Corpus with images and the Gallica OCR as GT (if available).
|
| 393 |
+
"""
|
| 394 |
+
from picarones.importers.iiif import IIIFImporter
|
| 395 |
+
|
| 396 |
+
manifest_url = _IIIF_MANIFEST_TPL.format(ark=ark)  # ark already includes the "12148/" NAAN
|
| 397 |
+
logger.info("Import Gallica ARK %s via IIIF : %s", ark, manifest_url)
|
| 398 |
+
|
| 399 |
+
# Use the existing IIIF importer for the images
|
| 400 |
+
importer = IIIFImporter(manifest_url, max_resolution=max_resolution)
|
| 401 |
+
importer.load()
|
| 402 |
+
|
| 403 |
+
corpus = importer.import_corpus(
|
| 404 |
+
pages=pages,
|
| 405 |
+
output_dir=output_dir or f"./corpus_gallica_{ark.split('/')[-1]}/",
|
| 406 |
+
show_progress=show_progress,
|
| 407 |
+
)
|
| 408 |
+
|
| 409 |
+
# Enrich with the Gallica OCR if requested
|
| 410 |
+
if include_gallica_ocr:
|
| 411 |
+
selected_indices = importer.list_canvases(pages)
|
| 412 |
+
for i, doc in enumerate(corpus.documents):
|
| 413 |
+
page_num = selected_indices[i] + 1 if i < len(selected_indices) else i + 1
|
| 414 |
+
gallica_ocr = self.get_ocr_text(ark, page_num)
|
| 415 |
+
if gallica_ocr:
|
| 416 |
+
doc.metadata["gallica_ocr"] = gallica_ocr
|
| 417 |
+
# If there is no manual GT, use the Gallica OCR as the reference
|
| 418 |
+
if not doc.ground_truth.strip():
|
| 419 |
+
doc.ground_truth = gallica_ocr
|
| 420 |
+
doc.metadata["gt_source"] = "gallica_ocr"
|
| 421 |
+
|
| 422 |
+
# Add Gallica metadata
|
| 423 |
+
corpus.metadata.update({
|
| 424 |
+
"source": "gallica",
|
| 425 |
+
"ark": ark,
|
| 426 |
+
"manifest_url": manifest_url,
|
| 427 |
+
"gallica_url": f"{_GALLICA_BASE}/ark:/{ark}",
|
| 428 |
+
"include_gallica_ocr": include_gallica_ocr,
|
| 429 |
+
})
|
| 430 |
+
|
| 431 |
+
return corpus
|
| 432 |
+
|
| 433 |
+
def get_metadata(self, ark: str) -> dict:
|
| 434 |
+
"""Fetch the OAI-PMH metadata of a Gallica document.
|
| 435 |
+
|
| 436 |
+
Parameters
|
| 437 |
+
----------
|
| 438 |
+
ark:
|
| 439 |
+
ARK identifier.
|
| 440 |
+
|
| 441 |
+
Returns
|
| 442 |
+
-------
|
| 443 |
+
dict
|
| 444 |
+
Dublin Core metadata of the document.
|
| 445 |
+
"""
|
| 446 |
+
url = f"{_GALLICA_BASE}/services/OAIRecord?ark=ark:/{ark}"
|
| 447 |
+
try:
|
| 448 |
+
raw = self._fetch_url(url)
|
| 449 |
+
root = ET.fromstring(raw)
|
| 450 |
+
except (RuntimeError, ET.ParseError) as exc:
|
| 451 |
+
logger.error("Erreur métadonnées OAI %s: %s", ark, exc)
|
| 452 |
+
return {"ark": ark}
|
| 453 |
+
|
| 454 |
+
def find_text(tag_suffix: str) -> str:
|
| 455 |
+
for elem in root.iter():
|
| 456 |
+
if elem.tag.endswith(tag_suffix) and elem.text:
|
| 457 |
+
return elem.text.strip()
|
| 458 |
+
return ""
|
| 459 |
+
|
| 460 |
+
return {
|
| 461 |
+
"ark": ark,
|
| 462 |
+
"title": find_text("title"),
|
| 463 |
+
"creator": find_text("creator"),
|
| 464 |
+
"date": find_text("date"),
|
| 465 |
+
"description": find_text("description"),
|
| 466 |
+
"subject": find_text("subject"),
|
| 467 |
+
"language": find_text("language"),
|
| 468 |
+
"type": find_text("type"),
|
| 469 |
+
"format": find_text("format"),
|
| 470 |
+
"source": find_text("source"),
|
| 471 |
+
"url": f"{_GALLICA_BASE}/ark:/{ark}",
|
| 472 |
+
}
|
| 473 |
+
|
| 474 |
+
|
| 475 |
+
# ---------------------------------------------------------------------------
|
| 476 |
+
# Convenience functions
|
| 477 |
+
# ---------------------------------------------------------------------------
|
| 478 |
+
|
| 479 |
+
def search_gallica(
|
| 480 |
+
title: Optional[str] = None,
|
| 481 |
+
author: Optional[str] = None,
|
| 482 |
+
ark: Optional[str] = None,
|
| 483 |
+
date_from: Optional[int] = None,
|
| 484 |
+
date_to: Optional[int] = None,
|
| 485 |
+
max_results: int = 20,
|
| 486 |
+
) -> list[GallicaRecord]:
|
| 487 |
+
"""Quick search in Gallica.
|
| 488 |
+
|
| 489 |
+
Creates a temporary client and runs a search.
|
| 490 |
+
|
| 491 |
+
Parameters
|
| 492 |
+
----------
|
| 493 |
+
title, author, ark, date_from, date_to:
|
| 494 |
+
Search criteria.
|
| 495 |
+
max_results:
|
| 496 |
+
Maximum number of results.
|
| 497 |
+
|
| 498 |
+
Returns
|
| 499 |
+
-------
|
| 500 |
+
list[GallicaRecord]
|
| 501 |
+
|
| 502 |
+
Examples
|
| 503 |
+
--------
|
| 504 |
+
>>> results = search_gallica(title="Froissart", date_from=1380, date_to=1430)
|
| 505 |
+
>>> for r in results[:3]:
|
| 506 |
+
... print(r.title, r.ark)
|
| 507 |
+
"""
|
| 508 |
+
client = GallicaClient()
|
| 509 |
+
return client.search(
|
| 510 |
+
ark=ark,
|
| 511 |
+
title=title,
|
| 512 |
+
author=author,
|
| 513 |
+
date_from=date_from,
|
| 514 |
+
date_to=date_to,
|
| 515 |
+
max_results=max_results,
|
| 516 |
+
)
|
| 517 |
+
|
| 518 |
+
|
| 519 |
+
def import_gallica_document(
|
| 520 |
+
ark: str,
|
| 521 |
+
pages: str = "all",
|
| 522 |
+
output_dir: Optional[str] = None,
|
| 523 |
+
include_gallica_ocr: bool = True,
|
| 524 |
+
) -> Corpus:
|
| 525 |
+
"""Import a Gallica document in one line.
|
| 526 |
+
|
| 527 |
+
Parameters
|
| 528 |
+
----------
|
| 529 |
+
ark:
|
| 530 |
+
ARK identifier (``'12148/btv1b8453561w'`` or a full URL).
|
| 531 |
+
pages:
|
| 532 |
+
Page selector (``'all'``, ``'1-10'``…).
|
| 533 |
+
output_dir:
|
| 534 |
+
Output folder.
|
| 535 |
+
include_gallica_ocr:
|
| 536 |
+
Include the Gallica OCR as GT.
|
| 537 |
+
|
| 538 |
+
Returns
|
| 539 |
+
-------
|
| 540 |
+
Corpus
|
| 541 |
+
"""
|
| 542 |
+
# Normalize the ARK (extract it from a full URL if needed)
|
| 543 |
+
m = re.search(r"ark:/(\d+/\w+)", ark)
|
| 544 |
+
if m:
|
| 545 |
+
ark = m.group(1)
|
| 546 |
+
|
| 547 |
+
client = GallicaClient()
|
| 548 |
+
return client.import_document(
|
| 549 |
+
ark=ark,
|
| 550 |
+
pages=pages,
|
| 551 |
+
output_dir=output_dir,
|
| 552 |
+
include_gallica_ocr=include_gallica_ocr,
|
| 553 |
+
)
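Reviewer note on `import_gallica_document`: the function accepts either a bare ARK or a full Gallica URL. The sketch below shows that normalization in isolation, assuming the `ark:/<NAAN>/<name>` shape used in the docstrings. `normalize_ark` and `manifest_url_for` are illustrative names, not functions from the commit.

```python
import re

# Same pattern as in the diff: captures "<NAAN>/<name>" after "ark:/".
_ARK_RE = re.compile(r"ark:/(\d+/\w+)")
_GALLICA_BASE = "https://gallica.bnf.fr"


def normalize_ark(value: str) -> str:
    """Return "<NAAN>/<name>" from a full URL, or the input unchanged."""
    match = _ARK_RE.search(value)
    return match.group(1) if match else value


def manifest_url_for(value: str) -> str:
    """Build the IIIF manifest URL (hypothetical helper) from any accepted form."""
    return f"{_GALLICA_BASE}/ark:/{normalize_ark(value)}/manifest.json"
```

Because the normalized value already carries the `12148/` authority number, the URL is built with `ark:/` followed directly by that value.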
|
|
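Reviewer note on `_build_sru_query` and `search`: the query is a list of CQL clauses joined with `and`, then URL-encoded into the SRU request. The following is a reduced sketch of that flow, not the commit's code. `build_cql` and `sru_search_url` are hypothetical names, the date bounds are tested with `is not None` (the diff uses truthiness), and, as in the diff, values containing double quotes are not escaped.

```python
from __future__ import annotations

import urllib.parse
from typing import Optional

SRU_URL = "https://gallica.bnf.fr/SRU"


def build_cql(
    title: Optional[str] = None,
    author: Optional[str] = None,
    date_from: Optional[int] = None,
    date_to: Optional[int] = None,
) -> str:
    """Join one CQL clause per provided criterion with "and"."""
    clauses: list[str] = []
    if title:
        clauses.append(f'dc.title all "{title}"')
    if author:
        clauses.append(f'dc.creator all "{author}"')
    if date_from is not None:
        clauses.append(f'dc.date >= "{date_from}"')
    if date_to is not None:
        clauses.append(f'dc.date <= "{date_to}"')
    # Same fallback as the diff when no criterion is given.
    return " and ".join(clauses) or 'gallica all "document"'


def sru_search_url(query: str, max_results: int = 20) -> str:
    """URL-encode the query; the record count is capped at 50 as in the diff."""
    params = urllib.parse.urlencode({
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": query,
        "maximumRecords": min(max_results, 50),
        "startRecord": 1,
    })
    return f"{SRU_URL}?{params}"
```

Keeping query construction separate from the HTTP call makes the CQL string easy to check without network access.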
@@ -0,0 +1,455 @@
|
| 1 |
+
"""Import from the HTR-United catalogue.
|
| 2 |
+
|
| 3 |
+
HTR-United is a community catalogue of HTR/OCR ground truths published
|
| 4 |
+
on GitHub under open licenses. The metadata is stored in a YAML
|
| 5 |
+
file (htr-united.yml) at https://github.com/HTR-United/htr-united.
|
| 6 |
+
|
| 7 |
+
This module provides:
|
| 8 |
+
- :class:`HTRUnitedCatalogue`: loading and searching the catalogue
|
| 9 |
+
- :func:`fetch_catalogue`: downloading the catalogue from GitHub
|
| 10 |
+
- :func:`import_htr_united_corpus`: downloading and importing a corpus
|
| 11 |
+
|
| 12 |
+
Example
|
| 13 |
+
-------
|
| 14 |
+
catalogue = HTRUnitedCatalogue.from_remote()
|
| 15 |
+
results = catalogue.search("français médiéval")
|
| 16 |
+
corpus = import_htr_united_corpus(results[0], output_dir="./corpus/")
|
| 17 |
+
"""
|
| 18 |
+
|
| 19 |
+
from __future__ import annotations
|
| 20 |
+
|
| 21 |
+
import json
|
| 22 |
+
import logging
|
| 23 |
+
import re
|
| 24 |
+
import urllib.error
|
| 25 |
+
import urllib.request
|
| 26 |
+
from dataclasses import dataclass, field
|
| 27 |
+
from pathlib import Path
|
| 28 |
+
from typing import Optional
|
| 29 |
+
|
| 30 |
+
logger = logging.getLogger(__name__)
|
| 31 |
+
|
| 32 |
+
# ---------------------------------------------------------------------------
|
| 33 |
+
# Catalogue remote URL
|
| 34 |
+
# ---------------------------------------------------------------------------
|
| 35 |
+
|
| 36 |
+
_CATALOGUE_URL = (
|
| 37 |
+
"https://raw.githubusercontent.com/HTR-United/htr-united/master/htr-united.yml"
|
| 38 |
+
)
|
| 39 |
+
_CATALOGUE_API_URL = (
|
| 40 |
+
"https://api.github.com/repos/HTR-United/htr-united/contents/htr-united.yml"
|
| 41 |
+
)
|
| 42 |
+
|
| 43 |
+
# Demo / fallback catalogue (offline)
|
| 44 |
+
_DEMO_CATALOGUE: list[dict] = [
|
| 45 |
+
{
|
| 46 |
+
"id": "lectaurep-repertoires",
|
| 47 |
+
"title": "Lectaurep — Répertoires de notaires parisiens",
|
| 48 |
+
"url": "https://github.com/HTR-United/lectaurep-repertoires",
|
| 49 |
+
"language": ["French"],
|
| 50 |
+
"script": ["Cursiva"],
|
| 51 |
+
"century": [17, 18],
|
| 52 |
+
"institution": "Archives nationales (France)",
|
| 53 |
+
"description": "Transcriptions de répertoires de notaires, XVIIe-XVIIIe siècles.",
|
| 54 |
+
"license": "CC-BY 4.0",
|
| 55 |
+
"lines": 12400,
|
| 56 |
+
"format": "ALTO",
|
| 57 |
+
"tags": ["notaires", "Paris", "cursive", "imprimé"],
|
| 58 |
+
},
|
| 59 |
+
{
|
| 60 |
+
"id": "bvmm-manuscripts",
|
| 61 |
+
"title": "BVMM — Manuscrits enluminés",
|
| 62 |
+
"url": "https://github.com/HTR-United/bvmm-manuscripts",
|
| 63 |
+
"language": ["Latin", "French"],
|
| 64 |
+
"script": ["Gothic"],
|
| 65 |
+
"century": [13, 14, 15],
|
| 66 |
+
"institution": "IRHT",
|
| 67 |
+
"description": "Manuscrits médiévaux latins et français, XIIIe-XVe siècles.",
|
| 68 |
+
"license": "CC-BY 4.0",
|
| 69 |
+
"lines": 8700,
|
| 70 |
+
"format": "ALTO",
|
| 71 |
+
"tags": ["manuscrits", "latin", "médiéval", "enluminure"],
|
| 72 |
+
},
|
| 73 |
+
{
|
| 74 |
+
"id": "cremma-medieval",
|
| 75 |
+
"title": "CREMMA Médiéval",
|
| 76 |
+
"url": "https://github.com/HTR-United/cremma-medieval",
|
| 77 |
+
"language": ["French", "Latin"],
|
| 78 |
+
"script": ["Gothic", "Humanistica"],
|
| 79 |
+
"century": [12, 13, 14, 15],
|
| 80 |
+
"institution": "École des chartes / Inria",
|
| 81 |
+
"description": "Corpus CREMMA de manuscrits médiévaux français et latins.",
|
| 82 |
+
"license": "CC-BY 4.0",
|
| 83 |
+
"lines": 6200,
|
| 84 |
+
"format": "ALTO",
|
| 85 |
+
"tags": ["médiéval", "chartes", "manuscrits"],
|
| 86 |
+
},
|
| 87 |
+
{
|
| 88 |
+
"id": "simssa-ocr-printed",
|
| 89 |
+
"title": "SIMSSA — Imprimés anciens (XVe-XVIIe)",
|
| 90 |
+
"url": "https://github.com/HTR-United/simssa-printed",
|
| 91 |
+
"language": ["French", "Latin"],
|
| 92 |
+
"script": ["Rotunda", "Roman"],
|
| 93 |
+
"century": [15, 16, 17],
|
| 94 |
+
"institution": "McGill University",
|
| 95 |
+
"description": "Corpus d'imprimés anciens romains et gothiques.",
|
| 96 |
+
"license": "CC-BY 4.0",
|
| 97 |
+
"lines": 4500,
|
| 98 |
+
"format": "PAGE",
|
| 99 |
+
"tags": ["imprimés", "incunables", "roman", "gothique"],
|
| 100 |
+
},
|
| 101 |
+
{
|
| 102 |
+
"id": "fonds-gallica-presse",
|
| 103 |
+
"title": "Presse ancienne — Gallica (XIXe)",
|
| 104 |
+
"url": "https://github.com/HTR-United/gallica-presse-xix",
|
| 105 |
+
"language": ["French"],
|
| 106 |
+
"script": ["Roman"],
|
| 107 |
+
"century": [19],
|
| 108 |
+
"institution": "Gallica",
|
| 109 |
+
"description": "Numérisations de journaux du XIXe siècle (Gallica).",
|
| 110 |
+
"license": "etalab-2.0",
|
| 111 |
+
"lines": 31000,
|
| 112 |
+
"format": "ALTO",
|
| 113 |
+
"tags": ["presse", "XIXe", "Gallica", "journaux"],
|
| 114 |
+
},
|
| 115 |
+
{
|
| 116 |
+
"id": "archives-departem-correspondances",
|
| 117 |
+
"title": "Correspondances administratives (XVIIIe-XIXe)",
|
| 118 |
+
"url": "https://github.com/HTR-United/correspondances-admin",
|
| 119 |
+
"language": ["French"],
|
| 120 |
+
"script": ["Cursiva"],
|
| 121 |
+
"century": [18, 19],
|
| 122 |
+
"institution": "Archives départementales",
|
| 123 |
+
"description": "Lettres et correspondances administratives manuscrites.",
|
| 124 |
+
"license": "CC-BY 4.0",
|
| 125 |
+
"lines": 9800,
|
| 126 |
+
"format": "ALTO",
|
| 127 |
+
"tags": ["correspondances", "administratif", "cursive"],
|
| 128 |
+
},
|
| 129 |
+
{
|
| 130 |
+
"id": "e-codices-latin",
|
| 131 |
+
"title": "e-codices — Manuscrits latins (Suisse)",
|
| 132 |
+
"url": "https://github.com/HTR-United/e-codices-latin",
|
| 133 |
+
"language": ["Latin"],
|
| 134 |
+
"script": ["Caroline", "Gothic"],
|
| 135 |
+
"century": [9, 10, 11, 12],
|
| 136 |
+
"institution": "Bibliothèque cantonale universitaire de Lausanne",
|
| 137 |
+
"description": "Manuscrits carolingiens et gothiques des bibliothèques suisses.",
|
| 138 |
+
"license": "CC-BY 4.0",
|
| 139 |
+
"lines": 3100,
|
| 140 |
+
"format": "ALTO",
|
| 141 |
+
"tags": ["caroline", "latin", "médiéval", "Suisse"],
|
| 142 |
+
},
|
| 143 |
+
{
|
| 144 |
+
"id": "registres-paroissiaux-17",
|
| 145 |
+
"title": "Registres paroissiaux — Bretagne (XVIIe)",
|
| 146 |
+
"url": "https://github.com/HTR-United/registres-paroissiaux-bretagne",
|
| 147 |
+
"language": ["French", "Latin"],
|
| 148 |
+
"script": ["Cursiva"],
|
| 149 |
+
"century": [17],
|
| 150 |
+
"institution": "Archives départementales du Finistère",
|
| 151 |
+
"description": "Registres paroissiaux bretons du XVIIe siècle.",
|
| 152 |
+
"license": "CC-BY 4.0",
|
| 153 |
+
"lines": 15600,
|
| 154 |
+
"format": "ALTO",
|
| 155 |
+
"tags": ["registres", "Bretagne", "paroissial", "cursive"],
|
| 156 |
+
},
|
| 157 |
+
]
|
| 158 |
+
|
| 159 |
+
|
| 160 |
+
# ---------------------------------------------------------------------------
|
| 161 |
+
# Catalogue entry dataclass
|
| 162 |
+
# ---------------------------------------------------------------------------
|
| 163 |
+
|
| 164 |
+
@dataclass
|
| 165 |
+
class HTRUnitedEntry:
|
| 166 |
+
"""An entry in the HTR-United catalogue."""
|
| 167 |
+
|
| 168 |
+
id: str
|
| 169 |
+
title: str
|
| 170 |
+
url: str
|
| 171 |
+
language: list[str] = field(default_factory=list)
|
| 172 |
+
script: list[str] = field(default_factory=list)
|
| 173 |
+
century: list[int] = field(default_factory=list)
|
| 174 |
+
institution: str = ""
|
| 175 |
+
description: str = ""
|
| 176 |
+
license: str = ""
|
| 177 |
+
lines: int = 0
|
| 178 |
+
format: str = "ALTO"
|
| 179 |
+
tags: list[str] = field(default_factory=list)
|
| 180 |
+
|
| 181 |
+
def as_dict(self) -> dict:
|
| 182 |
+
return {
|
| 183 |
+
"id": self.id,
|
| 184 |
+
"title": self.title,
|
| 185 |
+
"url": self.url,
|
| 186 |
+
"language": self.language,
|
| 187 |
+
"script": self.script,
|
| 188 |
+
"century": self.century,
|
| 189 |
+
"institution": self.institution,
|
| 190 |
+
"description": self.description,
|
| 191 |
+
"license": self.license,
|
| 192 |
+
"lines": self.lines,
|
| 193 |
+
"format": self.format,
|
| 194 |
+
"tags": self.tags,
|
| 195 |
+
}
|
| 196 |
+
|
| 197 |
+
@classmethod
|
| 198 |
+
def from_dict(cls, d: dict) -> "HTRUnitedEntry":
|
| 199 |
+
return cls(
|
| 200 |
+
id=d.get("id", ""),
|
| 201 |
+
title=d.get("title", ""),
|
| 202 |
+
url=d.get("url", ""),
|
| 203 |
+
language=d.get("language", []),
|
| 204 |
+
script=d.get("script", []),
|
| 205 |
+
century=d.get("century", []),
|
| 206 |
+
institution=d.get("institution", ""),
|
| 207 |
+
description=d.get("description", ""),
|
| 208 |
+
license=d.get("license", ""),
|
| 209 |
+
lines=d.get("lines", 0),
|
| 210 |
+
format=d.get("format", "ALTO"),
|
| 211 |
+
tags=d.get("tags", []),
|
| 212 |
+
)
|
| 213 |
+
|
| 214 |
+
@property
|
| 215 |
+
def century_str(self) -> str:
|
| 216 |
+
"""Centuries formatted as French ordinal Roman numerals (e.g. "XVIIe")."""
|
| 217 |
+
roman = {
|
| 218 |
+
1: "Ier", 2: "IIe", 3: "IIIe", 4: "IVe", 5: "Ve",
|
| 219 |
+
6: "VIe", 7: "VIIe", 8: "VIIIe", 9: "IXe", 10: "Xe",
|
| 220 |
+
11: "XIe", 12: "XIIe", 13: "XIIIe", 14: "XIVe", 15: "XVe",
|
| 221 |
+
16: "XVIe", 17: "XVIIe", 18: "XVIIIe", 19: "XIXe", 20: "XXe",
|
| 222 |
+
}
|
| 223 |
+
return ", ".join(roman.get(c, f"{c}e") for c in self.century)
|
| 224 |
+
|
| 225 |
+
|
| 226 |
+
# ---------------------------------------------------------------------------
|
| 227 |
+
# Catalogue
|
| 228 |
+
# ---------------------------------------------------------------------------
|
| 229 |
+
|
| 230 |
+
class HTRUnitedCatalogue:
|
| 231 |
+
"""HTR-United catalogue with search and filtering."""
|
| 232 |
+
|
| 233 |
+
def __init__(self, entries: list[HTRUnitedEntry], source: str = "demo") -> None:
|
| 234 |
+
self.entries = entries
|
| 235 |
+
self.source = source # "remote" | "demo" | "cache"
|
| 236 |
+
|
| 237 |
+
def __len__(self) -> int:
|
| 238 |
+
return len(self.entries)
|
| 239 |
+
|
| 240 |
+
@classmethod
|
| 241 |
+
def from_demo(cls) -> "HTRUnitedCatalogue":
|
| 242 |
+
"""Load the built-in demo catalogue."""
|
| 243 |
+
entries = [HTRUnitedEntry.from_dict(d) for d in _DEMO_CATALOGUE]
|
| 244 |
+
return cls(entries, source="demo")
|
| 245 |
+
|
| 246 |
+
@classmethod
|
| 247 |
+
def from_remote(cls, timeout: int = 10) -> "HTRUnitedCatalogue":
|
| 248 |
+
"""Download the catalogue from GitHub.
|
| 249 |
+
|
| 250 |
+
On any network or parsing error, return the demo catalogue instead.
|
| 251 |
+
"""
|
| 252 |
+
try:
|
| 253 |
+
req = urllib.request.Request(
|
| 254 |
+
_CATALOGUE_URL,
|
| 255 |
+
headers={"User-Agent": "picarones-htr-united-importer/1.0"},
|
| 256 |
+
)
|
| 257 |
+
with urllib.request.urlopen(req, timeout=timeout) as resp:
|
| 258 |
+
raw = resp.read().decode("utf-8")
|
| 259 |
+
entries = _parse_yml_catalogue(raw)
|
| 260 |
+
return cls(entries, source="remote")
|
| 261 |
+
except Exception as exc:  # URLError is a subclass of Exception, so one clause is enough
|
| 262 |
+
# Fall back to the demo catalogue and log a warning
|
| 263 |
+
logger.warning(
|
| 264 |
+
"[HTR-United] could not load the remote catalogue (%s): %s. "
|
| 265 |
+
"Falling back to the demo data.",
|
| 266 |
+
_CATALOGUE_URL, exc,
|
| 267 |
+
)
|
| 268 |
+
return cls.from_demo()
|
| 269 |
+
|
| 270 |
+
def search(
|
| 271 |
+
self,
|
| 272 |
+
query: str = "",
|
| 273 |
+
language: Optional[str] = None,
|
| 274 |
+
script: Optional[str] = None,
|
| 275 |
+
century_min: Optional[int] = None,
|
| 276 |
+
century_max: Optional[int] = None,
|
| 277 |
+
) -> list[HTRUnitedEntry]:
|
| 278 |
+
"""Search the catalogue with optional filters (case-insensitive substring matches)."""
|
| 279 |
+
results = self.entries
|
| 280 |
+
|
| 281 |
+
if query:
|
| 282 |
+
q = query.lower()
|
| 283 |
+
results = [
|
| 284 |
+
e for e in results
|
| 285 |
+
if (q in e.title.lower()
|
| 286 |
+
or q in e.description.lower()
|
| 287 |
+
or q in e.institution.lower()
|
| 288 |
+
or any(q in t.lower() for t in e.tags)
|
| 289 |
+
or any(q in lang.lower() for lang in e.language))
|
| 290 |
+
]
|
| 291 |
+
|
| 292 |
+
if language:
|
| 293 |
+
lang_lower = language.lower()
|
| 294 |
+
results = [
|
| 295 |
+
e for e in results
|
| 296 |
+
if any(lang_lower in lg.lower() for lg in e.language)
|
| 297 |
+
]
|
| 298 |
+
|
| 299 |
+
if script:
|
| 300 |
+
sc_lower = script.lower()
|
| 301 |
+
results = [
|
| 302 |
+
e for e in results
|
| 303 |
+
if any(sc_lower in s.lower() for s in e.script)
|
| 304 |
+
]
|
| 305 |
+
|
| 306 |
+
if century_min is not None:
|
| 307 |
+
results = [
|
| 308 |
+
e for e in results
|
| 309 |
+
if any(c >= century_min for c in e.century)
|
| 310 |
+
]
|
| 311 |
+
|
| 312 |
+
if century_max is not None:
|
| 313 |
+
results = [
|
| 314 |
+
e for e in results
|
| 315 |
+
if any(c <= century_max for c in e.century)
|
| 316 |
+
]
|
| 317 |
+
|
| 318 |
+
return results
|
| 319 |
+
|
| 320 |
+
def get_by_id(self, entry_id: str) -> Optional[HTRUnitedEntry]:
|
| 321 |
+
"""Return the entry with the given identifier, or None if there is none."""
|
| 322 |
+
for e in self.entries:
|
| 323 |
+
if e.id == entry_id:
|
| 324 |
+
return e
|
| 325 |
+
return None
|
| 326 |
+
|
| 327 |
+
def available_languages(self) -> list[str]:
|
| 328 |
+
seen: set[str] = set()
|
| 329 |
+
result: list[str] = []
|
| 330 |
+
for e in self.entries:
|
| 331 |
+
for lang in e.language:
|
| 332 |
+
if lang not in seen:
|
| 333 |
+
seen.add(lang)
|
| 334 |
+
result.append(lang)
|
| 335 |
+
return sorted(result)
|
| 336 |
+
|
| 337 |
+
def available_scripts(self) -> list[str]:
|
| 338 |
+
seen: set[str] = set()
|
| 339 |
+
result: list[str] = []
|
| 340 |
+
for e in self.entries:
|
| 341 |
+
for sc in e.script:
|
| 342 |
+
if sc not in seen:
|
| 343 |
+
seen.add(sc)
|
| 344 |
+
result.append(sc)
|
| 345 |
+
return sorted(result)
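Review note on the `century_min` / `century_max` filters in `search`: each bound is tested with its own `any(...)`, so an entry covering centuries [9, 17] passes `century_min=12, century_max=14` even though none of its centuries lies in that range. If a stricter reading is wanted, a minimal sketch could look like this. The helper names `in_century_range` and `loose_match` are hypothetical and not part of the module.

```python
from __future__ import annotations

from typing import Optional


def in_century_range(
    centuries: list[int],
    century_min: Optional[int] = None,
    century_max: Optional[int] = None,
) -> bool:
    """True if at least one century satisfies both bounds at once."""
    lo = century_min if century_min is not None else float("-inf")
    hi = century_max if century_max is not None else float("inf")
    return any(lo <= c <= hi for c in centuries)


def loose_match(centuries: list[int], century_min: int, century_max: int) -> bool:
    """Mirror of the two independent any() checks used by search()."""
    return any(c >= century_min for c in centuries) and any(
        c <= century_max for c in centuries
    )
```

Whether the loose or the strict behaviour is intended is a design choice for the catalogue; the sketch only makes the difference visible.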
|
| 346 |
+
|
| 347 |
+
|
| 348 |
+
# ---------------------------------------------------------------------------
|
| 349 |
+
# Corpus import
|
| 350 |
+
# ---------------------------------------------------------------------------
|
| 351 |
+
|
| 352 |
+
def import_htr_united_corpus(
|
| 353 |
+
entry: HTRUnitedEntry,
|
| 354 |
+
output_dir: str | Path,
|
| 355 |
+
max_samples: int = 100,
|
| 356 |
+
show_progress: bool = True,
|
| 357 |
+
) -> dict:
|
| 358 |
+
"""Import an HTR-United corpus into a local directory.
|
| 359 |
+
|
| 360 |
+
Return a dict describing the import (paths and number of files written).
|
| 361 |
+
Note: if the GitHub repository cannot be downloaded, only the metadata
|
| 362 |
+
file is written and ``files_imported`` is 0.
|
| 363 |
+
"""
|
| 364 |
+
output_path = Path(output_dir)
|
| 365 |
+
output_path.mkdir(parents=True, exist_ok=True)
|
| 366 |
+
|
| 367 |
+
# Save the import metadata
|
| 368 |
+
meta = {
|
| 369 |
+
"source": "htr-united",
|
| 370 |
+
"entry_id": entry.id,
|
| 371 |
+
"title": entry.title,
|
| 372 |
+
"url": entry.url,
|
| 373 |
+
"language": entry.language,
|
| 374 |
+
"script": entry.script,
|
| 375 |
+
"century": entry.century,
|
| 376 |
+
"institution": entry.institution,
|
| 377 |
+
"license": entry.license,
|
| 378 |
+
"format": entry.format,
|
| 379 |
+
"imported_at": _iso_now(),
|
| 380 |
+
}
|
| 381 |
+
(output_path / "htr_united_meta.json").write_text(
|
| 382 |
+
json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8"
|
| 383 |
+
)
|
| 384 |
+
|
| 385 |
+
# Try a real download from GitHub (ZIP archive of the main branch)
|
| 386 |
+
downloaded = _try_download_corpus(entry, output_path, max_samples, show_progress)
|
| 387 |
+
|
| 388 |
+
return {
|
| 389 |
+
"entry_id": entry.id,
|
| 390 |
+
"title": entry.title,
|
| 391 |
+
"output_dir": str(output_path),
|
| 392 |
+
"files_imported": downloaded,
|
| 393 |
+
"metadata_file": str(output_path / "htr_united_meta.json"),
|
| 394 |
+
}
|
| 395 |
+
|
| 396 |
+
|
| 397 |
+
def _try_download_corpus(
|
| 398 |
+
entry: HTRUnitedEntry,
|
| 399 |
+
output_path: Path,
|
| 400 |
+
max_samples: int,
|
| 401 |
+
show_progress: bool,
|
| 402 |
+
) -> int:
|
| 403 |
+
"""Try to download the corpus from GitHub. Return the number of files imported."""
|
| 404 |
+
# Build the URL of the repository's ZIP archive (assumes the default branch is "main")
|
| 405 |
+
repo_path = _extract_github_repo(entry.url)
|
| 406 |
+
if not repo_path:
|
| 407 |
+
return 0
|
| 408 |
+
|
| 409 |
+
zip_url = f"https://github.com/{repo_path}/archive/refs/heads/main.zip"
|
| 410 |
+
try:
|
| 411 |
+
req = urllib.request.Request(
|
| 412 |
+
zip_url,
|
| 413 |
+
headers={"User-Agent": "picarones-htr-united-importer/1.0"},
|
| 414 |
+
)
|
| 415 |
+
with urllib.request.urlopen(req, timeout=30) as resp:
|
| 416 |
+
import io
|
| 417 |
+
import zipfile
|
| 418 |
+
|
| 419 |
+
data = resp.read()
|
| 420 |
+
with zipfile.ZipFile(io.BytesIO(data)) as zf:
|
| 421 |
+
# Extract the ALTO/PAGE/ground-truth files
|
| 422 |
+
gt_files = [
|
| 423 |
+
n for n in zf.namelist()
|
| 424 |
+
if n.endswith((".xml", ".gt.txt"))  # ".xml" already covers .alto.xml and .page.xml
|
| 425 |
+
and not n.endswith("/")
|
| 426 |
+
][:max_samples]
|
| 427 |
+
for fname in gt_files:
|
| 428 |
+
dest = output_path / Path(fname).name
|
| 429 |
+
dest.write_bytes(zf.read(fname))
|
| 430 |
+
return len(gt_files)
|
| 431 |
+
except Exception:
|
| 432 |
+
return 0
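Review note: `_try_download_corpus` writes each member under `Path(fname).name`, so two archive members with the same basename in different folders overwrite each other while both are counted. A minimal sketch of a collision-safe extraction, assuming the archive bytes are already in memory; `extract_ground_truth` and the `"__"` joining convention are illustrative choices, not the module's API.

```python
from __future__ import annotations

import io
import zipfile
from pathlib import Path, PurePosixPath


def extract_ground_truth(data: bytes, output_dir: Path, max_samples: int = 100) -> list[Path]:
    """Extract up to max_samples .xml / .gt.txt members from a ZIP held in memory.

    Member paths are flattened by joining their components with "__"
    (after dropping the archive's top-level folder), so files that share
    a basename in different folders do not overwrite each other.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        members = [
            n for n in zf.namelist()
            if n.endswith((".xml", ".gt.txt")) and not n.endswith("/")
        ][:max_samples]
        for name in members:
            parts = PurePosixPath(name).parts
            # GitHub archives wrap everything in "<repo>-<branch>/"; drop it if present.
            flat = "__".join(parts[1:] if len(parts) > 1 else parts)
            dest = output_dir / flat
            dest.write_bytes(zf.read(name))
            written.append(dest)
    return written
```

Because the flattened name never contains a path separator, every file stays inside `output_dir`.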
|
| 433 |
+
|
| 434 |
+
|
| 435 |
+
def _extract_github_repo(url: str) -> Optional[str]:
|
| 436 |
+
"""Extract 'owner/repo' from a GitHub URL, or return None if it does not match."""
|
| 437 |
+
m = re.match(r"https?://github\.com/([^/]+/[^/]+?)(?:\.git)?/?$", url)
|
| 438 |
+
return m.group(1) if m else None
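To make the accepted URL shapes explicit, here is a standalone copy of the same regular expression with a few inputs. Only the public name `extract_github_repo` and the precompiled constant are additions for the sake of the example.

```python
from __future__ import annotations

import re
from typing import Optional

# Same pattern as _extract_github_repo above, compiled once.
_GITHUB_REPO_RE = re.compile(r"https?://github\.com/([^/]+/[^/]+?)(?:\.git)?/?$")


def extract_github_repo(url: str) -> Optional[str]:
    """Return 'owner/repo' for a bare repository URL, else None."""
    m = _GITHUB_REPO_RE.match(url)
    return m.group(1) if m else None


if __name__ == "__main__":
    for candidate in (
        "https://github.com/HTR-United/cremma-medieval",
        "https://github.com/HTR-United/cremma-medieval.git",
        "https://github.com/HTR-United/cremma-medieval/tree/main",
    ):
        print(candidate, "->", extract_github_repo(candidate))
```

Note that deep links such as `/tree/main` are rejected, so a catalogue entry pointing at a subfolder yields no download.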
|
| 439 |
+
|
| 440 |
+
|
| 441 |
+
def _parse_yml_catalogue(raw: str) -> list[HTRUnitedEntry]:
|
| 442 |
+
"""Minimal parser for the HTR-United YAML catalogue; falls back to the demo entries on failure."""
|
| 443 |
+
try:
|
| 444 |
+
import yaml
|
| 445 |
+
data = yaml.safe_load(raw)
|
| 446 |
+
if isinstance(data, list):
|
| 447 |
+
return [HTRUnitedEntry.from_dict(d) for d in data if isinstance(d, dict)]
|
| 448 |
+
except Exception:
|
| 449 |
+
pass
|
| 450 |
+
return [HTRUnitedEntry.from_dict(d) for d in _DEMO_CATALOGUE]
|
| 451 |
+
|
| 452 |
+
|
| 453 |
+
def _iso_now() -> str:
|
| 454 |
+
from datetime import datetime, timezone
|
| 455 |
+
return datetime.now(timezone.utc).isoformat(timespec="seconds")
|
|
@@ -0,0 +1,445 @@
|
| 1 |
+
"""Import OCR/HTR datasets from the HuggingFace Hub.
|
| 2 |
+
|
| 3 |
+
⚠ **Status: experimental** (phase C of the three-circle refactoring).
|
| 4 |
+
The HuggingFace ``datasets`` API changes frequently and this module has no
|
| 5 |
+
integration tests. Use it at your own risk until an institutional use
|
| 6 |
+
case validates its behaviour. A ``UserWarning`` is emitted on import as
|
| 7 |
+
a reminder.
|
| 8 |
+
|
| 9 |
+
This module provides:
|
| 10 |
+
- :class:`HuggingFaceDataset`: metadata for a HuggingFace dataset
|
| 11 |
+
- :class:`HuggingFaceImporter`: dataset search and import
|
| 12 |
+
- :func:`search_hf_datasets`: tag-based search through the HuggingFace API
|
| 13 |
+
- :func:`import_hf_dataset`: download of a dataset into a local directory
|
| 14 |
+
|
| 15 |
+
Reference heritage datasets are pre-registered so that they can be discovered
|
| 16 |
+
quickly without a network request.
|
| 17 |
+
|
| 18 |
+
Example
|
| 19 |
+
-------
|
| 20 |
+
importer = HuggingFaceImporter()
|
| 21 |
+
results = importer.search("medieval OCR", tags=["ocr"])
|
| 22 |
+
corpus = importer.import_dataset(results[0].dataset_id, output_dir="./corpus/")
|
| 23 |
+
"""
|
| 24 |
+
|
| 25 |
+
from __future__ import annotations
|
| 26 |
+
|
| 27 |
+
import json
|
| 28 |
+
import os
|
| 29 |
+
import urllib.error
|
| 30 |
+
import urllib.parse
|
| 31 |
+
import urllib.request
|
| 32 |
+
import warnings
|
| 33 |
+
from dataclasses import dataclass, field
|
| 34 |
+
from pathlib import Path
|
| 35 |
+
from typing import Optional
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
# Emit the ``experimental`` warning on import. Phase C of the refactoring;
|
| 39 |
+
# see the module docstring above.
|
| 40 |
+
warnings.warn(
|
| 41 |
+
"picarones.extras.importers.huggingface is experimental and may "
|
| 42 |
+
"change or be removed without notice. Use at your own risk until "
|
| 43 |
+
"an institutional use case validates the API.",
|
| 44 |
+
category=UserWarning,
|
| 45 |
+
stacklevel=2,
|
| 46 |
+
)
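Callers who have accepted the experimental status may want to capture or silence this warning rather than let it print. A minimal sketch using only the standard `warnings` module; `load_experimental` is a hypothetical stand-in for the real import, which is not executed here.

```python
from __future__ import annotations

import warnings


def load_experimental() -> str:
    """Hypothetical stand-in for importing an experimental module."""
    warnings.warn(
        "picarones.extras.importers.huggingface is experimental",
        category=UserWarning,
        stacklevel=2,
    )
    return "loaded"


def load_quietly() -> tuple[str, list[str]]:
    """Call the loader, capturing UserWarnings instead of printing them."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always", UserWarning)
        result = load_experimental()
    return result, [str(w.message) for w in caught]
```

Because Python caches imported modules, a module-level warning like the one above fires only on the first import in a process.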
|
| 47 |
+
|
| 48 |
+
# ---------------------------------------------------------------------------
|
| 49 |
+
# Pre-registered reference datasets
|
| 50 |
+
# ---------------------------------------------------------------------------
|
| 51 |
+
|
| 52 |
+
_REFERENCE_DATASETS: list[dict] = [
|
| 53 |
+
{
|
| 54 |
+
"dataset_id": "Teklia/RIMES",
|
| 55 |
+
"title": "RIMES — Reconnaissance et Indexation de données Manuscrites et de fac-similEs",
|
| 56 |
+
"description": "Corpus de courriers manuscrits français modernes. Standard de référence pour la reconnaissance d'écriture manuscrite.",
|
| 57 |
+
"language": ["French"],
|
| 58 |
+
"tags": ["htr", "ocr", "handwritten", "french", "modern"],
|
| 59 |
+
"license": "cc-by-4.0",
|
| 60 |
+
"size_category": "1K<n<10K",
|
| 61 |
+
"task": "image-to-text",
|
| 62 |
+
"institution": "IRISA / A2iA",
|
| 63 |
+
"downloads": 1200,
|
| 64 |
+
},
|
| 65 |
+
{
|
| 66 |
+
"dataset_id": "Teklia/IAM",
|
| 67 |
+
"title": "IAM Handwriting Database",
|
| 68 |
+
"description": "Corpus de référence anglais pour la reconnaissance d'écriture manuscrite.",
|
| 69 |
+
"language": ["English"],
|
| 70 |
+
"tags": ["htr", "ocr", "handwritten", "english"],
|
| 71 |
+
"license": "other",
|
| 72 |
+
"size_category": "10K<n<100K",
|
| 73 |
+
"task": "image-to-text",
|
| 74 |
+
"institution": "University of Bern",
|
| 75 |
+
"downloads": 8400,
|
| 76 |
+
},
|
| 77 |
+
{
|
| 78 |
+
"dataset_id": "CATMuS/medieval",
|
| 79 |
+
"title": "CATMuS Medieval — Consistent Approaches to Transcribing ManuScripts",
|
| 80 |
+
"description": "Dataset multilingue de manuscrits médiévaux (latin, français, occitan, espagnol) pour l'entraînement de modèles HTR.",
|
| 81 |
+
"language": ["Latin", "French", "Occitan", "Spanish"],
|
| 82 |
+
"tags": ["htr", "medieval", "manuscripts", "latin", "french", "historical"],
|
| 83 |
+
"license": "cc-by-4.0",
|
| 84 |
+
"size_category": "100K<n<1M",
|
| 85 |
+
"task": "image-to-text",
|
| 86 |
+
"institution": "Inria / EPHE",
|
| 87 |
+
"downloads": 3100,
|
| 88 |
+
},
|
| 89 |
+
{
|
| 90 |
+
"dataset_id": "htr-united/cremma-medieval",
|
| 91 |
+
"title": "CREMMA Medieval",
|
| 92 |
+
"description": "Corpus de manuscrits médiévaux français XIIe-XVe siècles.",
|
| 93 |
+
"language": ["French", "Latin"],
|
| 94 |
+
"tags": ["htr", "medieval", "french", "manuscripts", "htr-united"],
|
| 95 |
+
"license": "cc-by-4.0",
|
| 96 |
+
"size_category": "1K<n<10K",
|
| 97 |
+
"task": "image-to-text",
|
| 98 |
+
"institution": "Inria",
|
| 99 |
+
"downloads": 520,
|
| 100 |
+
},
|
| 101 |
+
{
|
| 102 |
+
"dataset_id": "biglam/europeana_newspapers",
|
| 103 |
+
"title": "Europeana Newspapers",
|
| 104 |
+
"description": "Journaux numérisés européens du XIXe siècle (OCR + images).",
|
| 105 |
+
"language": ["French", "German", "Dutch", "Finnish"],
|
| 106 |
+
"tags": ["ocr", "newspapers", "historical", "19th-century", "europeana"],
|
| 107 |
+
"license": "cc0-1.0",
|
| 108 |
+
"size_category": "1M<n<10M",
|
| 109 |
+
"task": "image-to-text",
|
| 110 |
+
"institution": "Europeana Foundation",
|
| 111 |
+
"downloads": 15200,
|
| 112 |
+
},
|
| 113 |
+
{
|
| 114 |
+
"dataset_id": "stefanklut/esposalles",
|
| 115 |
+
"title": "Esposalles Dataset",
|
| 116 |
+
"description": "Registres de mariage catalans du XVIIe siècle pour la reconnaissance d'écriture historique.",
|
| 117 |
+
"language": ["Catalan", "Latin"],
|
| 118 |
+
"tags": ["htr", "historical", "registers", "catalan", "17th-century"],
|
| 119 |
+
"license": "cc-by-4.0",
|
| 120 |
+
"size_category": "1K<n<10K",
|
| 121 |
+
"task": "image-to-text",
|
| 122 |
+
"institution": "Universitat Autònoma de Barcelona",
|
| 123 |
+
"downloads": 340,
|
| 124 |
+
},
|
| 125 |
+
{
|
| 126 |
+
"dataset_id": "bnf-gallica/gallica-ocr",
|
| 127 |
+
"title": "Gallica OCR",
|
| 128 |
+
"description": "Extraits d'imprimés anciens numérisés depuis Gallica avec vérité terrain.",
|
| 129 |
+
"language": ["French", "Latin"],
|
| 130 |
+
"tags": ["ocr", "historical", "printed", "gallica", "french"],
|
| 131 |
+
"license": "etalab-2.0",
|
| 132 |
+
"size_category": "10K<n<100K",
|
| 133 |
+
"task": "image-to-text",
|
| 134 |
+
"institution": "Gallica",
|
| 135 |
+
"downloads": 2800,
|
| 136 |
+
},
|
| 137 |
+
{
|
| 138 |
+
"dataset_id": "Bozen-Baptism/baptism-records",
|
| 139 |
+
"title": "Bozen Baptism Records",
|
| 140 |
+
"description": "Registres de baptêmes de Bozen (Italie/Autriche) du XVIIIe siècle.",
|
| 141 |
+
"language": ["German", "Latin"],
|
| 142 |
+
"tags": ["htr", "historical", "registers", "german", "latin", "18th-century"],
|
| 143 |
+
"license": "cc-by-4.0",
|
| 144 |
+
"size_category": "1K<n<10K",
|
| 145 |
+
"task": "image-to-text",
|
| 146 |
+
"institution": "University of Innsbruck",
|
| 147 |
+
"downloads": 190,
|
| 148 |
+
},
|
| 149 |
+
{
|
| 150 |
+
"dataset_id": "read-bad/readbad",
|
| 151 |
+
"title": "READ-BAD — Recognition and Enrichment of Archival Documents",
|
| 152 |
+
"description": "Corpus multilingue de documents d'archives pour l'OCR historique (Latin, Allemand, Anglais).",
|
| 153 |
+
"language": ["German", "English", "Latin"],
|
| 154 |
+
"tags": ["ocr", "htr", "historical", "archives", "read"],
|
| 155 |
+
"license": "cc-by-4.0",
|
| 156 |
+
"size_category": "10K<n<100K",
|
| 157 |
+
"task": "image-to-text",
|
| 158 |
+
"institution": "University of Graz",
|
| 159 |
+
"downloads": 1050,
|
| 160 |
+
},
|
| 161 |
+
]
|
| 162 |
+
|
| 163 |
+
# ---------------------------------------------------------------------------
|
| 164 |
+
# Dataclass
|
| 165 |
+
# ---------------------------------------------------------------------------
|
| 166 |
+
|
| 167 |
+
@dataclass
|
| 168 |
+
class HuggingFaceDataset:
|
| 169 |
+
"""Metadata for a HuggingFace dataset."""
|
| 170 |
+
|
| 171 |
+
dataset_id: str
|
| 172 |
+
title: str
|
| 173 |
+
description: str = ""
|
| 174 |
+
language: list[str] = field(default_factory=list)
|
| 175 |
+
tags: list[str] = field(default_factory=list)
|
| 176 |
+
license: str = ""
|
| 177 |
+
size_category: str = ""
|
| 178 |
+
task: str = "image-to-text"
|
| 179 |
+
institution: str = ""
|
| 180 |
+
downloads: int = 0
|
| 181 |
+
source: str = "reference" # "reference" | "api"
|
| 182 |
+
|
| 183 |
+
def as_dict(self) -> dict:
|
| 184 |
+
return {
|
| 185 |
+
"dataset_id": self.dataset_id,
|
| 186 |
+
"title": self.title,
|
| 187 |
+
"description": self.description,
|
| 188 |
+
"language": self.language,
|
| 189 |
+
"tags": self.tags,
|
| 190 |
+
"license": self.license,
|
| 191 |
+
"size_category": self.size_category,
|
| 192 |
+
"task": self.task,
|
| 193 |
+
"institution": self.institution,
|
| 194 |
+
"downloads": self.downloads,
|
| 195 |
+
"source": self.source,
|
| 196 |
+
}
|
| 197 |
+
|
| 198 |
+
@classmethod
|
| 199 |
+
def from_dict(cls, d: dict) -> "HuggingFaceDataset":
|
| 200 |
+
return cls(
|
| 201 |
+
dataset_id=d.get("dataset_id", d.get("id", "")),
|
| 202 |
+
title=d.get("title", d.get("dataset_id", "")),
|
| 203 |
+
description=d.get("description", ""),
|
| 204 |
+
language=d.get("language", []),
|
| 205 |
+
tags=d.get("tags", []),
|
| 206 |
+
license=d.get("license", ""),
|
| 207 |
+
size_category=d.get("size_category", d.get("cardData", {}).get("size_categories", [""])[0] if isinstance(d.get("cardData"), dict) else ""),
|
| 208 |
+
task=d.get("task", "image-to-text"),
|
| 209 |
+
institution=d.get("institution", ""),
|
| 210 |
+
downloads=d.get("downloads", d.get("downloadsAllTime", 0)),
|
| 211 |
+
source=d.get("source", "api"),
|
| 212 |
+
)
|
| 213 |
+
|
| 214 |
+
@property
|
| 215 |
+
def hf_url(self) -> str:
|
| 216 |
+
return f"https://huggingface.co/datasets/{self.dataset_id}"
|
| 217 |
+
|
| 218 |
+
|
| 219 |
+
# ---------------------------------------------------------------------------
|
| 220 |
+
# Main importer
|
| 221 |
+
# ---------------------------------------------------------------------------
|
| 222 |
+
|
| 223 |
+
class HuggingFaceImporter:
|
| 224 |
+
"""Search and import datasets from the HuggingFace Hub."""
|
| 225 |
+
|
| 226 |
+
_API_BASE = "https://huggingface.co/api"
|
| 227 |
+
|
| 228 |
+
def __init__(self, token: Optional[str] = None) -> None:
|
| 229 |
+
self._token = token or os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_TOKEN")
|
| 230 |
+
|
| 231 |
+
def _headers(self) -> dict:
|
| 232 |
+
h = {"User-Agent": "picarones-hf-importer/1.0"}
|
| 233 |
+
if self._token:
|
| 234 |
+
h["Authorization"] = f"Bearer {self._token}"
|
| 235 |
+
return h
|
| 236 |
+
|
| 237 |
+
def search(
|
| 238 |
+
self,
|
| 239 |
+
query: str = "",
|
| 240 |
+
tags: Optional[list[str]] = None,
|
| 241 |
+
language: Optional[str] = None,
|
| 242 |
+
limit: int = 20,
|
| 243 |
+
use_reference: bool = True,
|
| 244 |
+
) -> list[HuggingFaceDataset]:
|
| 245 |
+
"""Search datasets with optional filters.
|
| 246 |
+
|
| 247 |
+
Queries the built-in reference datasets first, then the
|
| 248 |
+
HuggingFace API if it is reachable.
|
| 249 |
+
"""
|
| 250 |
+
results: list[HuggingFaceDataset] = []
|
| 251 |
+
|
| 252 |
+
# Reference datasets
|
| 253 |
+
if use_reference:
|
| 254 |
+
ref_results = self._search_reference(query, tags, language)
|
| 255 |
+
results.extend(ref_results)
|
| 256 |
+
|
| 257 |
+
# HuggingFace API (optional; failures are ignored silently)
|
| 258 |
+
try:
|
| 259 |
+
api_results = self._search_api(query, tags, language, limit)
|
| 260 |
+
# Deduplicate (reference entries take priority)
|
| 261 |
+
existing_ids = {r.dataset_id for r in results}
|
| 262 |
+
for ds in api_results:
|
| 263 |
+
if ds.dataset_id not in existing_ids:
|
| 264 |
+
results.append(ds)
|
| 265 |
+
existing_ids.add(ds.dataset_id)
|
| 266 |
+
except Exception:
|
| 267 |
+
pass
|
| 268 |
+
|
| 269 |
+
return results[:limit]
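The merge step in `search` (reference entries first, API entries appended only when their id is new, then truncation) can be sketched in isolation as follows. `merge_unique` is a hypothetical helper, not part of the module, and unlike the code above it also removes duplicates inside the first list.

```python
from __future__ import annotations

from typing import Callable, Hashable, Iterable, TypeVar

T = TypeVar("T")


def merge_unique(
    primary: Iterable[T],
    secondary: Iterable[T],
    key: Callable[[T], Hashable],
    limit: int,
) -> list[T]:
    """Concatenate two result lists, keeping the first item seen for each key."""
    seen: set[Hashable] = set()
    merged: list[T] = []
    for item in list(primary) + list(secondary):
        k = key(item)
        if k not in seen:
            seen.add(k)
            merged.append(item)
    return merged[:limit]
```

Since truncation happens after the merge, a long reference list can crowd out every API result when `limit` is small.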
|
| 270 |
+
|
| 271 |
+
def _search_reference(
|
| 272 |
+
self,
|
| 273 |
+
query: str,
|
| 274 |
+
tags: Optional[list[str]],
|
| 275 |
+
language: Optional[str],
|
| 276 |
+
) -> list[HuggingFaceDataset]:
|
| 277 |
+
datasets = [HuggingFaceDataset.from_dict(d) for d in _REFERENCE_DATASETS]
|
| 278 |
+
datasets = [HuggingFaceDataset(**{**ds.as_dict(), "source": "reference"}) for ds in datasets]
|
| 279 |
+
|
| 280 |
+
if query:
|
| 281 |
+
q = query.lower()
|
| 282 |
+
datasets = [
|
| 283 |
+
ds for ds in datasets
|
| 284 |
+
if (q in ds.title.lower()
|
| 285 |
+
or q in ds.description.lower()
|
| 286 |
+
or q in ds.dataset_id.lower()
|
| 287 |
+
or any(q in t.lower() for t in ds.tags)
|
| 288 |
+
or any(q in lg.lower() for lg in ds.language))
|
| 289 |
+
]
|
| 290 |
+
|
| 291 |
+
if tags:
|
| 292 |
+
for tag in tags:
|
| 293 |
+
t_lower = tag.lower()
|
| 294 |
+
datasets = [
|
| 295 |
+
ds for ds in datasets
|
| 296 |
+
if any(t_lower in dt.lower() for dt in ds.tags)
|
| 297 |
+
]
|
| 298 |
+
|
| 299 |
+
if language:
|
| 300 |
+
lang_lower = language.lower()
|
| 301 |
+
datasets = [
|
| 302 |
+
ds for ds in datasets
|
| 303 |
+
if any(lang_lower in lg.lower() for lg in ds.language)
|
| 304 |
+
]
|
| 305 |
+
|
| 306 |
+
return datasets
|
| 307 |
+
|
| 308 |
+
def _search_api(
|
| 309 |
+
self,
|
| 310 |
+
query: str,
|
| 311 |
+
tags: Optional[list[str]],
|
| 312 |
+
language: Optional[str],
|
| 313 |
+
limit: int,
|
| 314 |
+
) -> list[HuggingFaceDataset]:
|
| 315 |
+
params: dict[str, str] = {
|
| 316 |
+
"task_categories": "image-to-text",
|
| 317 |
+
"limit": str(min(limit, 50)),
|
| 318 |
+
"full": "False",
|
| 319 |
+
}
|
| 320 |
+
if query:
|
| 321 |
+
params["search"] = query
|
| 322 |
+
if language:
|
| 323 |
+
params["language"] = language
|
| 324 |
+
if tags:
|
| 325 |
+
params["tags"] = ",".join(tags)
|
| 326 |
+
|
| 327 |
+
url = f"{self._API_BASE}/datasets?" + urllib.parse.urlencode(params)
|
| 328 |
+
req = urllib.request.Request(url, headers=self._headers())
|
| 329 |
+
with urllib.request.urlopen(req, timeout=10) as resp:
|
| 330 |
+
data = json.loads(resp.read().decode("utf-8"))
|
| 331 |
+
|
| 332 |
+
results = []
|
| 333 |
+
for item in data if isinstance(data, list) else []:
|
| 334 |
+
ds = HuggingFaceDataset(
|
| 335 |
+
dataset_id=item.get("id", ""),
|
| 336 |
+
title=item.get("id", ""),
|
| 337 |
+
description=item.get("description", ""),
|
| 338 |
+
language=item.get("language", []),
|
| 339 |
+
tags=item.get("tags", []),
|
| 340 |
+
license=item.get("license", ""),
|
| 341 |
+
size_category=(
|
| 342 |
+
(item.get("cardData", {}).get("size_categories") or [""])[0]  # tolerate an empty or missing list
|
| 343 |
+
if isinstance(item.get("cardData"), dict)
|
| 344 |
+
else ""
|
| 345 |
+
),
|
| 346 |
+
task="image-to-text",
|
| 347 |
+
downloads=item.get("downloadsAllTime", 0),
|
| 348 |
+
source="api",
|
| 349 |
+
)
|
| 350 |
+
if ds.dataset_id:
|
| 351 |
+
results.append(ds)
|
| 352 |
+
return results
|
| 353 |
+
|
| 354 |
+
def import_dataset(
|
| 355 |
+
self,
|
| 356 |
+
dataset_id: str,
|
| 357 |
+
output_dir: str | Path,
|
| 358 |
+
split: str = "train",
|
| 359 |
+
max_samples: int = 100,
|
| 360 |
+
show_progress: bool = True,
|
| 361 |
+
) -> dict:
|
| 362 |
+
"""Import a dataset from HuggingFace into a local directory.
|
| 363 |
+
|
| 364 |
+
Returns a summary of the import (output directory, number of files imported, metadata file path).
|
| 365 |
+
"""
|
| 366 |
+
output_path = Path(output_dir)
|
| 367 |
+
output_path.mkdir(parents=True, exist_ok=True)
|
| 368 |
+
|
| 369 |
+
meta = {
|
| 370 |
+
"source": "huggingface",
|
| 371 |
+
"dataset_id": dataset_id,
|
| 372 |
+
"split": split,
|
| 373 |
+
"max_samples": max_samples,
|
| 374 |
+
"imported_at": _iso_now(),
|
| 375 |
+
}
|
| 376 |
+
meta_file = output_path / "huggingface_meta.json"
|
| 377 |
+
meta_file.write_text(json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8")
|
| 378 |
+
|
| 379 |
+
# Attempt the import via the `datasets` library if it is available
|
| 380 |
+
files_imported = _try_import_with_datasets_lib(
|
| 381 |
+
dataset_id, output_path, split, max_samples, show_progress
|
| 382 |
+
)
|
| 383 |
+
|
| 384 |
+
return {
|
| 385 |
+
"dataset_id": dataset_id,
|
| 386 |
+
"output_dir": str(output_path),
|
| 387 |
+
"files_imported": files_imported,
|
| 388 |
+
"metadata_file": str(meta_file),
|
| 389 |
+
}
|
| 390 |
+
|
| 391 |
+
|
| 392 |
+
def _try_import_with_datasets_lib(
|
| 393 |
+
dataset_id: str,
|
| 394 |
+
output_path: Path,
|
| 395 |
+
split: str,
|
| 396 |
+
max_samples: int,
|
| 397 |
+
show_progress: bool,
|
| 398 |
+
) -> int:
|
| 399 |
+
"""Try to import with HuggingFace's `datasets` library; return 0 on any failure."""
|
| 400 |
+
try:
|
| 401 |
+
from datasets import load_dataset # type: ignore
|
| 402 |
+
|
| 403 |
+
ds = load_dataset(dataset_id, split=split, streaming=True)
|
| 404 |
+
count = 0
|
| 405 |
+
for i, item in enumerate(ds):
|
| 406 |
+
if i >= max_samples:
|
| 407 |
+
break
|
| 408 |
+
# Look for the image and text fields
|
| 409 |
+
image = item.get("image") or item.get("img")
|
| 410 |
+
text = item.get("text") or item.get("transcription") or item.get("ground_truth", "")
|
| 411 |
+
|
| 412 |
+
if image is not None:
|
| 413 |
+
img_file = output_path / f"doc_{i:04d}.jpg"
|
| 414 |
+
try:
|
| 415 |
+
image.save(str(img_file))
|
| 416 |
+
except Exception:
|
| 417 |
+
pass
|
| 418 |
+
|
| 419 |
+
gt_file = output_path / f"doc_{i:04d}.gt.txt"
|
| 420 |
+
gt_file.write_text(str(text), encoding="utf-8")
|
| 421 |
+
count += 1
|
| 422 |
+
|
| 423 |
+
return count
|
| 424 |
+
except Exception:  # also covers ImportError when `datasets` is not installed
|
| 425 |
+
return 0
|
| 426 |
+
|
| 427 |
+
|
| 428 |
+
def _iso_now() -> str:
|
| 429 |
+
from datetime import datetime, timezone
|
| 430 |
+
return datetime.now(timezone.utc).isoformat(timespec="seconds")
|
| 431 |
+
|
| 432 |
+
|
| 433 |
+
# ---------------------------------------------------------------------------
|
| 434 |
+
# HuggingFaceDataset extension (private helper)
|
| 435 |
+
# ---------------------------------------------------------------------------
|
| 436 |
+
|
| 437 |
+
def _patch_dataset_replace_source() -> None:
|
| 438 |
+
"""Add a _replace_source helper to HuggingFaceDataset."""
|
| 439 |
+
def _replace_source(self, source: str) -> "HuggingFaceDataset":
|
| 440 |
+
from dataclasses import replace
|
| 441 |
+
return replace(self, source=source)
|
| 442 |
+
HuggingFaceDataset._replace_source = _replace_source
|
| 443 |
+
|
| 444 |
+
|
| 445 |
+
_patch_dataset_replace_source()
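The patch above relies on `dataclasses.replace`, which only works on dataclass instances and returns a shallow copy with the named fields overridden. A minimal sketch of the same idea written as a regular method, assuming a simplified stand-in class (`DatasetRecord` and `with_source` are hypothetical names, not part of the module):

```python
from dataclasses import dataclass, field, replace


@dataclass
class DatasetRecord:
    """Hypothetical, reduced stand-in for HuggingFaceDataset."""

    dataset_id: str
    tags: list[str] = field(default_factory=list)
    source: str = "api"

    def with_source(self, source: str) -> "DatasetRecord":
        # replace() builds a new instance; the original is left untouched.
        return replace(self, source=source)


record = DatasetRecord("org/dataset", tags=["ocr"])
tagged = record.with_source("reference")
print(record.source, tagged.source)
```

Defining the method on the class directly avoids the import-time monkeypatch, at the cost of touching the class definition. Note that the copy is shallow: mutable fields such as `tags` are shared between the two instances.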
|
|
@@ -0,0 +1,565 @@
|
|
| 1 |
+
"""Corpus import from IIIF v2 and v3 manifests.
|
| 2 |
+
|
| 3 |
+
How it works
|
| 4 |
+
--------------
|
| 5 |
+
1. Download and parse the JSON manifest (v2 or v3, auto-detected)
|
| 6 |
+
2. Extract the list of canvases (pages) with their image URLs
|
| 7 |
+
3. Optionally select a subset of pages (e.g. ``--pages 1-10``)
|
| 8 |
+
4. Download the images into a local directory
|
| 9 |
+
5. Create empty ground-truth (GT) files (``.gt.txt``) to be filled in manually,
|
| 10 |
+
OR load the transcription annotations when the manifest provides them
|
| 11 |
+
6. Build and return a ``Corpus`` object
|
| 12 |
+
|
| 13 |
+
Compatibility
|
| 14 |
+
-------------
|
| 15 |
+
- IIIF Image API v2 and v3
|
| 16 |
+
- Presentation API v2 and v3 manifests
|
| 17 |
+
- Instances: Gallica (BnF), Bodleian, British Library, BSB, e-codices,
|
| 18 |
+
Europeana, and any IIIF-compliant repository
|
| 19 |
+
|
| 20 |
+
Usage
|
| 21 |
+
-----------
|
| 22 |
+
>>> from picarones.importers.iiif import IIIFImporter
|
| 23 |
+
>>> importer = IIIFImporter("https://gallica.bnf.fr/ark:/12148/xxx/manifest.json")
|
| 24 |
+
>>> corpus = importer.import_corpus(pages="1-10", output_dir="./corpus/")
|
| 25 |
+
>>> print(f"{len(corpus)} documents downloaded")
|
| 26 |
+
|
| 27 |
+
Or via the convenience function:
|
| 28 |
+
>>> from picarones.importers.iiif import import_iiif_manifest
|
| 29 |
+
>>> corpus = import_iiif_manifest("https://...", pages="1-5", output_dir="./corpus/")
|
| 30 |
+
"""
|
| 31 |
+
|
| 32 |
+
from __future__ import annotations
|
| 33 |
+
|
| 34 |
+
import json
|
| 35 |
+
import logging
|
| 36 |
+
import re
|
| 37 |
+
import time
|
| 38 |
+
import urllib.error
|
| 39 |
+
import urllib.request
|
| 40 |
+
from dataclasses import dataclass
|
| 41 |
+
from pathlib import Path
|
| 42 |
+
from typing import Iterator, Optional
|
| 43 |
+
|
| 44 |
+
from picarones.core.corpus import Corpus, Document
|
| 45 |
+
|
| 46 |
+
logger = logging.getLogger(__name__)
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
# ---------------------------------------------------------------------------
|
| 50 |
+
# Page selector parsing
|
| 51 |
+
# ---------------------------------------------------------------------------
|
| 52 |
+
|
| 53 |
+
def parse_page_selector(pages: str, total: int) -> list[int]:
|
| 54 |
+
"""Parse a page selector into a list of 0-based indices.
|
| 55 |
+
|
| 56 |
+
Accepted formats:
|
| 57 |
+
- ``"1-10"`` → pages 1 to 10 (1-based)
|
| 58 |
+
- ``"1,3,5"`` → pages 1, 3 and 5
|
| 59 |
+
- ``"1-5,10,15-20"`` → combination
|
| 60 |
+
- ``"all"`` / ``""`` → all pages
|
| 61 |
+
|
| 62 |
+
Parameters
|
| 63 |
+
----------
|
| 64 |
+
pages:
|
| 65 |
+
Page selector, as a string.
|
| 66 |
+
total:
|
| 67 |
+
Total number of pages in the manifest.
|
| 68 |
+
|
| 69 |
+
Returns
|
| 70 |
+
-------
|
| 71 |
+
list[int]
|
| 72 |
+
0-based indices of the selected pages, sorted and deduplicated.
|
| 73 |
+
|
| 74 |
+
Raises
|
| 75 |
+
------
|
| 76 |
+
ValueError
|
| 77 |
+
If the syntax is invalid or a page number is out of range.
|
| 78 |
+
"""
|
| 79 |
+
if not pages or pages.strip().lower() == "all":
|
| 80 |
+
return list(range(total))
|
| 81 |
+
|
| 82 |
+
indices: set[int] = set()
|
| 83 |
+
for part in pages.split(","):
|
| 84 |
+
part = part.strip()
|
| 85 |
+
if "-" in part:
|
| 86 |
+
m = re.fullmatch(r"(\d+)-(\d+)", part)
|
| 87 |
+
if not m:
|
| 88 |
+
raise ValueError(f"Sélecteur de pages invalide : '{part}'")
|
| 89 |
+
start, end = int(m.group(1)), int(m.group(2))
|
| 90 |
+
if start < 1 or end > total or start > end:
|
| 91 |
+
raise ValueError(
|
| 92 |
+
f"Plage {start}-{end} hors bornes (1–{total})"
|
| 93 |
+
)
|
| 94 |
+
indices.update(range(start - 1, end))
|
| 95 |
+
else:
|
| 96 |
+
n = int(part)
|
| 97 |
+
if n < 1 or n > total:
|
| 98 |
+
raise ValueError(f"Page {n} hors bornes (1–{total})")
|
| 99 |
+
indices.add(n - 1)
|
| 100 |
+
return sorted(indices)
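To make the selector semantics concrete (1-based input, 0-based sorted output, overlapping parts merged), here is a compact standalone sketch. `select_pages` is a hypothetical re-statement for illustration, not the function above; it uses a single regular expression for both single pages and ranges, and raises English error messages.

```python
import re


def select_pages(selector: str, total: int) -> list[int]:
    """Turn a 1-based selector such as "1-3,5" into sorted 0-based indices."""
    if not selector or selector.strip().lower() == "all":
        return list(range(total))
    picked: set[int] = set()
    for part in selector.split(","):
        # One pattern covers "7" (single page) and "2-4" (inclusive range).
        m = re.fullmatch(r"\s*(\d+)(?:-(\d+))?\s*", part)
        if not m:
            raise ValueError(f"invalid page selector: {part!r}")
        start = int(m.group(1))
        end = int(m.group(2)) if m.group(2) else start
        if not 1 <= start <= end <= total:
            raise ValueError(f"range {start}-{end} out of bounds (1-{total})")
        picked.update(range(start - 1, end))
    return sorted(picked)


print(select_pages("1-3,2,5", total=6))  # → [0, 1, 2, 4]
```

Collecting indices in a set before sorting is what makes `"1-3,2"` harmless: the duplicate page 2 is absorbed rather than imported twice.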
|
| 101 |
+
|
| 102 |
+
|
| 103 |
+
# ---------------------------------------------------------------------------
|
| 104 |
+
# IIIF canvas data
|
| 105 |
+
# ---------------------------------------------------------------------------
|
| 106 |
+
|
| 107 |
+
@dataclass
|
| 108 |
+
class IIIFCanvas:
|
| 109 |
+
"""Represents a canvas (page) in an IIIF manifest."""
|
| 110 |
+
|
| 111 |
+
index: int  # 0-based position in the manifest
|
| 112 |
+
label: str  # human-readable label (e.g. "f. 1r", "Page 1")
|
| 113 |
+
image_url: str  # full-resolution image URL
|
| 114 |
+
width: Optional[int] = None
|
| 115 |
+
height: Optional[int] = None
|
| 116 |
+
transcription: Optional[str] = None  # GT text if annotated in the manifest
|
| 117 |
+
|
| 118 |
+
|
| 119 |
+
# ---------------------------------------------------------------------------
|
| 120 |
+
# IIIF manifest parser
|
| 121 |
+
# ---------------------------------------------------------------------------
|
| 122 |
+
|
| 123 |
+
class IIIFManifestParser:
|
| 124 |
+
"""Parse an IIIF Presentation API v2 or v3 manifest."""
|
| 125 |
+
|
| 126 |
+
def __init__(self, manifest: dict) -> None:
|
| 127 |
+
self._manifest = manifest
|
| 128 |
+
self._version = self._detect_version()
|
| 129 |
+
|
| 130 |
+
def _detect_version(self) -> int:
|
| 131 |
+
"""Detect the manifest version (2 or 3)."""
|
| 132 |
+
context = self._manifest.get("@context", "")
|
| 133 |
+
if isinstance(context, list):
|
| 134 |
+
context = " ".join(context)
|
| 135 |
+
if "presentation/3" in context or self._manifest.get("type") == "Manifest":
|
| 136 |
+
return 3
|
| 137 |
+
return 2
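The detection rule above can be exercised on minimal manifests. The sketch below restates it as a free function (`detect_presentation_version` is a hypothetical name) and, as an assumption beyond the original, skips non-string entries when `@context` is a list, since JSON-LD allows objects there and `" ".join` would otherwise raise `TypeError`.

```python
def detect_presentation_version(manifest: dict) -> int:
    """Return 3 for a Presentation API v3 manifest, otherwise 2."""
    context = manifest.get("@context", "")
    if isinstance(context, list):
        # Keep only string entries; embedded JSON-LD objects are ignored.
        context = " ".join(c for c in context if isinstance(c, str))
    if "presentation/3" in context or manifest.get("type") == "Manifest":
        return 3
    return 2


v2 = {
    "@context": "http://iiif.io/api/presentation/2/context.json",
    "@type": "sc:Manifest",
}
v3 = {
    "@context": [
        "http://www.w3.org/ns/anno.jsonld",
        "http://iiif.io/api/presentation/3/context.json",
    ],
    "type": "Manifest",
}
print(detect_presentation_version(v2), detect_presentation_version(v3))
```

The fallback to 2 means a document with no recognisable context is parsed as v2; a stricter variant could raise instead.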
|
| 138 |
+
|
| 139 |
+
@property
|
| 140 |
+
def version(self) -> int:
|
| 141 |
+
return self._version
|
| 142 |
+
|
| 143 |
+
@property
|
| 144 |
+
def label(self) -> str:
|
| 145 |
+
"""Manifest title."""
|
| 146 |
+
raw = self._manifest.get("label", "")
|
| 147 |
+
return _extract_label(raw)
|
| 148 |
+
|
| 149 |
+
@property
|
| 150 |
+
def attribution(self) -> str:
|
| 151 |
+
raw = self._manifest.get("attribution", self._manifest.get("requiredStatement", ""))
|
| 152 |
+
return _extract_label(raw)
|
| 153 |
+
|
| 154 |
+
def canvases(self) -> list[IIIFCanvas]:
|
| 155 |
+
"""Return the list of canvases in the manifest."""
|
| 156 |
+
if self._version == 3:
|
| 157 |
+
return self._parse_v3_canvases()
|
| 158 |
+
return self._parse_v2_canvases()
|
| 159 |
+
|
| 160 |
+
def _parse_v2_canvases(self) -> list[IIIFCanvas]:
|
| 161 |
+
canvases: list[IIIFCanvas] = []
|
| 162 |
+
sequences = self._manifest.get("sequences", [])
|
| 163 |
+
if not sequences:
|
| 164 |
+
return canvases
|
| 165 |
+
raw_canvases = sequences[0].get("canvases", [])
|
| 166 |
+
for i, canvas in enumerate(raw_canvases):
|
| 167 |
+
label = _extract_label(canvas.get("label", f"canvas_{i+1}"))
|
| 168 |
+
# Main image: images[0].resource.@id or service
|
| 169 |
+
images = canvas.get("images", [])
|
| 170 |
+
image_url = ""
|
| 171 |
+
if images:
|
| 172 |
+
resource = images[0].get("resource", {})
|
| 173 |
+
image_url = _best_image_url_v2(resource, canvas)
|
| 174 |
+
|
| 175 |
+
# Transcription annotations (Open Annotation)
|
| 176 |
+
transcription = _extract_v2_transcription(canvas)
|
| 177 |
+
|
| 178 |
+
canvases.append(IIIFCanvas(
|
| 179 |
+
index=i,
|
| 180 |
+
label=label,
|
| 181 |
+
image_url=image_url,
|
| 182 |
+
width=canvas.get("width"),
|
| 183 |
+
height=canvas.get("height"),
|
| 184 |
+
transcription=transcription,
|
| 185 |
+
))
|
| 186 |
+
return canvases
|
| 187 |
+
|
| 188 |
+
def _parse_v3_canvases(self) -> list[IIIFCanvas]:
|
| 189 |
+
canvases: list[IIIFCanvas] = []
|
| 190 |
+
items = self._manifest.get("items", [])
|
| 191 |
+
for i, canvas in enumerate(items):
|
| 192 |
+
label = _extract_label(canvas.get("label", f"canvas_{i+1}"))
|
| 193 |
+
image_url = _best_image_url_v3(canvas)
|
| 194 |
+
transcription = _extract_v3_transcription(canvas)
|
| 195 |
+
canvases.append(IIIFCanvas(
|
| 196 |
+
index=i,
|
| 197 |
+
label=label,
|
| 198 |
+
image_url=image_url,
|
| 199 |
+
width=canvas.get("width"),
|
| 200 |
+
height=canvas.get("height"),
|
| 201 |
+
transcription=transcription,
|
| 202 |
+
))
|
| 203 |
+
return canvases
|
| 204 |
+
|
| 205 |
+
|
| 206 |
+
# ---------------------------------------------------------------------------
|
| 207 |
+
# URL and label extraction helpers
|
| 208 |
+
# ---------------------------------------------------------------------------
|
| 209 |
+
|
| 210 |
+
def _extract_label(raw: object) -> str:
|
| 211 |
+
"""Extract a readable string from the various IIIF label formats."""
|
| 212 |
+
if isinstance(raw, str):
|
| 213 |
+
return raw
|
| 214 |
+
if isinstance(raw, list) and raw:
|
| 215 |
+
return _extract_label(raw[0])
|
| 216 |
+
if isinstance(raw, dict):
|
| 217 |
+
# IIIF v3: {"fr": ["titre"], "en": ["title"]}
|
| 218 |
+
for lang in ("fr", "en", "none", "@value"):
|
| 219 |
+
val = raw.get(lang, "")
|
| 220 |
+
if val:
|
| 221 |
+
if isinstance(val, list):
|
| 222 |
+
return val[0] if val else ""
|
| 223 |
+
return str(val)
|
| 224 |
+
# Fallback: first value
|
| 225 |
+
for v in raw.values():
|
| 226 |
+
return _extract_label(v)
|
| 227 |
+
return str(raw) if raw else ""
|
| 228 |
+
|
| 229 |
+
|
| 230 |
+
def _best_image_url_v2(resource: dict, canvas: dict) -> str:
|
| 231 |
+
"""Build the best image URL from an IIIF v2 resource."""
|
| 232 |
+
# 1. Direct URL of the resource
|
| 233 |
+
direct = resource.get("@id", "")
|
| 234 |
+
if direct and not direct.endswith("/info.json"):
|
| 235 |
+
return direct
|
| 236 |
+
|
| 237 |
+
# 2. Via the IIIF Image API service
|
| 238 |
+
service = resource.get("service", {})
|
| 239 |
+
if isinstance(service, list) and service:
|
| 240 |
+
service = service[0]
|
| 241 |
+
service_id = service.get("@id", service.get("id", ""))
|
| 242 |
+
if service_id:
|
| 243 |
+
return f"{service_id.rstrip('/')}/full/max/0/default.jpg"
|
| 244 |
+
|
| 245 |
+
return direct
|
| 246 |
+
|
| 247 |
+
|
| 248 |
+
def _best_image_url_v3(canvas: dict) -> str:
|
| 249 |
+
"""Extract the image URL from an IIIF v3 canvas."""
|
| 250 |
+
items = canvas.get("items", [])
|
| 251 |
+
for annotation_page in items:
|
| 252 |
+
for annotation in annotation_page.get("items", []):
|
| 253 |
+
body = annotation.get("body", {})
|
| 254 |
+
if isinstance(body, list):
|
| 255 |
+
body = body[0] if body else {}
|
| 256 |
+
# Direct URL
|
| 257 |
+
url = body.get("id", body.get("@id", ""))
|
| 258 |
+
if url and body.get("type", "") == "Image":
|
| 259 |
+
return url
|
| 260 |
+
# Via the IIIF Image API service
|
| 261 |
+
service = body.get("service", [])
|
| 262 |
+
if isinstance(service, dict):
|
| 263 |
+
service = [service]
|
| 264 |
+
for svc in service:
|
| 265 |
+
svc_id = svc.get("id", svc.get("@id", ""))
|
| 266 |
+
if svc_id:
|
| 267 |
+
return f"{svc_id.rstrip('/')}/full/max/0/default.jpg"
|
| 268 |
+
if url:
|
| 269 |
+
return url
|
| 270 |
+
return ""
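Both URL helpers above fall back to an IIIF Image API request built from the service identifier. The Image API URL has the shape `{id}/{region}/{size}/{rotation}/{quality}.{format}`; the sketch below spells that template out. `iiif_image_url` and the `example.org` identifier are hypothetical, introduced only for illustration.

```python
def iiif_image_url(
    service_id: str,
    region: str = "full",
    size: str = "max",
    rotation: int = 0,
    quality: str = "default",
    fmt: str = "jpg",
) -> str:
    """Assemble an IIIF Image API request URL from its five path segments."""
    base = service_id.rstrip("/")  # avoid a double slash if the id ends with "/"
    return f"{base}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Full image at the largest size the server offers.
print(iiif_image_url("https://example.org/iiif/ms-1/f1/"))
# Same image scaled to a width of 1200 px (height computed by the server).
print(iiif_image_url("https://example.org/iiif/ms-1/f1", size="1200,"))
```

One caveat worth keeping in mind: the `max` size keyword is not understood by every server. Services limited to Image API 2.0 may only accept `full` for the size segment, so a v2 service URL built with `max` could be rejected.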
|
| 271 |
+
|
| 272 |
+
|
| 273 |
+
def _extract_v2_transcription(canvas: dict) -> Optional[str]:
|
| 274 |
+
"""Try to extract the GT text from the OA annotations of a v2 canvas."""
|
| 275 |
+
other_content = canvas.get("otherContent", [])
|
| 276 |
+
for oc in other_content:
|
| 277 |
+
if not isinstance(oc, dict):
|
| 278 |
+
continue
|
| 279 |
+
motivation = str(oc.get("motivation", ""))  # motivation may be a list in some manifests
|
| 280 |
+
if "transcrib" in motivation.lower() or "supplementing" in motivation.lower():
|
| 281 |
+
resources = oc.get("resources", [])
|
| 282 |
+
texts = []
|
| 283 |
+
for res in resources:
|
| 284 |
+
body = res.get("resource", {})
|
| 285 |
+
if body.get("@type") == "cnt:ContentAsText":
|
| 286 |
+
texts.append(body.get("chars", ""))
|
| 287 |
+
if texts:
|
| 288 |
+
return "\n".join(texts)
|
| 289 |
+
return None
|
| 290 |
+
|
| 291 |
+
|
| 292 |
+
def _extract_v3_transcription(canvas: dict) -> Optional[str]:
|
| 293 |
+
"""Try to extract the GT text from the annotations of a v3 canvas."""
|
| 294 |
+
annotations = canvas.get("annotations", [])
|
| 295 |
+
for ann_page in annotations:
|
| 296 |
+
items = ann_page.get("items", [])
|
| 297 |
+
for ann in items:
|
| 298 |
+
motivation = str(ann.get("motivation", ""))  # motivation may be a list in some manifests
|
| 299 |
+
if "transcrib" in motivation.lower() or "supplementing" in motivation.lower():
|
| 300 |
+
body = ann.get("body", {})
|
| 301 |
+
if isinstance(body, dict) and body.get("type") == "TextualBody":
|
| 302 |
+
return body.get("value", "")
|
| 303 |
+
return None
|
| 304 |
+
|
| 305 |
+
|
| 306 |
+
# ---------------------------------------------------------------------------
|
| 307 |
+
# Download with retry
|
| 308 |
+
# ---------------------------------------------------------------------------
|
| 309 |
+
|
| 310 |
+
# Workstream 4 (post-Sprint 97): HTTP helpers factored out into
|
| 311 |
+
# :mod:`picarones.importers._http`. These names remain available
|
| 312 |
+
# from ``iiif`` (backward compatibility for tests that import them
|
| 313 |
+
# directly, e.g. test_sprint4_normalization_iiif).
|
| 314 |
+
from picarones.importers._http import download_url as _download_url
|
| 315 |
+
from picarones.importers._http import validate_http_url as _validate_url
|
| 316 |
+
|
| 317 |
+
|
| 318 |
+
def _fetch_manifest(url: str) -> dict:
|
| 319 |
+
"""Download and parse an IIIF JSON manifest."""
|
| 320 |
+
data = _download_url(url)
|
| 321 |
+
try:
|
| 322 |
+
return json.loads(data.decode("utf-8"))
|
| 323 |
+
except json.JSONDecodeError as exc:
|
| 324 |
+
raise ValueError(f"Manifeste IIIF invalide (JSON mal formé) : {url}") from exc
|
| 325 |
+
|
| 326 |
+
|
| 327 |
+
# ---------------------------------------------------------------------------
|
| 328 |
+
# Main importer
|
| 329 |
+
# ---------------------------------------------------------------------------
|
| 330 |
+
|
| 331 |
+
class IIIFImporter:
|
| 332 |
+
"""Import a corpus from an IIIF manifest.
|
| 333 |
+
|
| 334 |
+
Parameters
|
| 335 |
+
----------
|
| 336 |
+
manifest_url:
|
| 337 |
+
URL of the IIIF manifest (Presentation API v2 or v3).
|
| 338 |
+
max_resolution:
|
| 339 |
+
Maximum resolution of downloaded images (width in pixels).
|
| 340 |
+
0 = highest available resolution.
|
| 341 |
+
"""
|
| 342 |
+
|
| 343 |
+
def __init__(
|
| 344 |
+
self,
|
| 345 |
+
manifest_url: str,
|
| 346 |
+
max_resolution: int = 0,
|
| 347 |
+
) -> None:
|
| 348 |
+
self.manifest_url = manifest_url
|
| 349 |
+
self.max_resolution = max_resolution
|
| 350 |
+
self._manifest: Optional[dict] = None
|
| 351 |
+
self._parser: Optional[IIIFManifestParser] = None
|
| 352 |
+
|
| 353 |
+
def load(self) -> "IIIFImporter":
|
| 354 |
+
"""Download and parse the manifest."""
|
| 355 |
+
logger.info("Téléchargement du manifeste IIIF : %s", self.manifest_url)
|
| 356 |
+
self._manifest = _fetch_manifest(self.manifest_url)
|
| 357 |
+
self._parser = IIIFManifestParser(self._manifest)
|
| 358 |
+
logger.info(
|
| 359 |
+
"Manifeste chargé — version IIIF %d — titre : %s — %d canvas",
|
| 360 |
+
self._parser.version,
|
| 361 |
+
self._parser.label,
|
| 362 |
+
len(self._parser.canvases()),
|
| 363 |
+
)
|
| 364 |
+
return self
|
| 365 |
+
|
| 366 |
+
@property
|
| 367 |
+
def parser(self) -> IIIFManifestParser:
|
| 368 |
+
if self._parser is None:
|
| 369 |
+
self.load()
|
| 370 |
+
return self._parser # type: ignore[return-value]
|
| 371 |
+
|
| 372 |
+
def list_canvases(self, pages: str = "all") -> list[IIIFCanvas]:
|
| 373 |
+
"""Return the list of selected canvases."""
|
| 374 |
+
all_canvases = self.parser.canvases()
|
| 375 |
+
indices = parse_page_selector(pages, len(all_canvases))
|
| 376 |
+
return [all_canvases[i] for i in indices]
|
| 377 |
+
|
| 378 |
+
def import_corpus(
|
| 379 |
+
self,
|
| 380 |
+
pages: str = "all",
|
| 381 |
+
output_dir: Optional[str | Path] = None,
|
| 382 |
+
show_progress: bool = True,
|
| 383 |
+
) -> Corpus:
|
| 384 |
+
"""Download the images and build a Picarones corpus.
|
| 385 |
+
|
| 386 |
+
If the canvases contain transcription annotations (GT),
|
| 387 |
+
they are automatically saved to the ``.gt.txt`` files.
|
| 388 |
+
Otherwise, empty ``.gt.txt`` files are created.
|
| 389 |
+
|
| 390 |
+
Parameters
|
| 391 |
+
----------
|
| 392 |
+
pages:
|
| 393 |
+
Page selector (e.g. ``"1-10"``, ``"1,3,5"``).
|
| 394 |
+
output_dir:
|
| 395 |
+
Destination directory for the images and GT files.
|
| 396 |
+
If None, each image is written to a temporary file (not deleted automatically) and no ``.gt.txt`` file is created.
|
| 397 |
+
show_progress:
|
| 398 |
+
Show a tqdm progress bar (silently skipped if tqdm is not installed).
|
| 399 |
+
|
| 400 |
+
Returns
|
| 401 |
+
-------
|
| 402 |
+
Corpus
|
| 403 |
+
Corpus ready to be used in ``run_benchmark``.
|
| 404 |
+
"""
|
| 405 |
+
canvases = self.list_canvases(pages)
|
| 406 |
+
if not canvases:
|
| 407 |
+
raise ValueError("Aucun canvas sélectionné.")
|
| 408 |
+
|
| 409 |
+
out_dir: Optional[Path] = Path(output_dir) if output_dir else None
|
| 410 |
+
if out_dir:
|
| 411 |
+
out_dir.mkdir(parents=True, exist_ok=True)
|
| 412 |
+
|
| 413 |
+
# Corpus name taken from the manifest title
|
| 414 |
+
corpus_name = self.parser.label or "iiif_corpus"
|
| 415 |
+
|
| 416 |
+
documents: list[Document] = []
|
| 417 |
+
iterator: Iterator[IIIFCanvas] = iter(canvases)
|
| 418 |
+
|
| 419 |
+
if show_progress:
|
| 420 |
+
try:
|
| 421 |
+
from tqdm import tqdm
|
| 422 |
+
iterator = tqdm(canvases, desc="Import IIIF", unit="page")
|
| 423 |
+
except ImportError:
|
| 424 |
+
pass
|
| 425 |
+
|
| 426 |
+
for canvas in iterator:
|
| 427 |
+
doc_id = _slugify(canvas.label) or f"canvas_{canvas.index + 1:04d}"
|
| 428 |
+
|
| 429 |
+
if not canvas.image_url:
|
| 430 |
+
logger.warning("Canvas %s : pas d'URL d'image — ignoré.", canvas.label)
|
| 431 |
+
continue
|
| 432 |
+
|
| 433 |
+
# Adjust the resolution if max_resolution is set
|
| 434 |
+
image_url = self._adjust_resolution(canvas.image_url, canvas.width)
|
| 435 |
+
|
| 436 |
+
# Download the image
|
| 437 |
+
try:
|
| 438 |
+
image_bytes = _download_url(image_url)
|
| 439 |
+
except RuntimeError as exc:
|
| 440 |
+
logger.error("Canvas %s : erreur téléchargement : %s", canvas.label, exc)
|
| 441 |
+
continue
|
| 442 |
+
|
| 443 |
+
# Determine the image extension
|
| 444 |
+
ext = _guess_extension(image_url)
|
| 445 |
+
|
| 446 |
+
if out_dir:
|
| 447 |
+
# Save to disk
|
| 448 |
+
image_path = out_dir / f"{doc_id}{ext}"
|
| 449 |
+
image_path.write_bytes(image_bytes)
|
| 450 |
+
|
| 451 |
+
gt_path = out_dir / f"{doc_id}.gt.txt"
|
| 452 |
+
gt_text = canvas.transcription or ""
|
| 453 |
+
gt_path.write_text(gt_text, encoding="utf-8")
|
| 454 |
+
|
| 455 |
+
documents.append(Document(
|
| 456 |
+
image_path=image_path,
|
| 457 |
+
ground_truth=gt_text,
|
| 458 |
+
doc_id=doc_id,
|
| 459 |
+
metadata={"iiif_label": canvas.label, "canvas_index": canvas.index},
|
| 460 |
+
))
|
| 461 |
+
else:
|
| 462 |
+
# No output_dir: the image is stored in a temporary file that is not deleted automatically
|
| 463 |
+
import tempfile
|
| 464 |
+
tmp = tempfile.NamedTemporaryFile(suffix=ext, delete=False)
|
| 465 |
+
tmp.write(image_bytes)
|
| 466 |
+
tmp.close()
|
| 467 |
+
documents.append(Document(
|
| 468 |
+
image_path=Path(tmp.name),
|
| 469 |
+
ground_truth=canvas.transcription or "",
|
| 470 |
+
doc_id=doc_id,
|
| 471 |
+
metadata={"iiif_label": canvas.label, "canvas_index": canvas.index},
|
| 472 |
+
))
|
| 473 |
+
|
| 474 |
+
if not documents:
|
| 475 |
+
raise ValueError("Aucun document importé depuis le manifeste IIIF.")
|
| 476 |
+
|
| 477 |
+
logger.info("Import IIIF terminé : %d documents.", len(documents))
|
| 478 |
+
|
| 479 |
+
return Corpus(
|
| 480 |
+
name=corpus_name,
|
| 481 |
+
documents=documents,
|
| 482 |
+
source_path=self.manifest_url,
|
| 483 |
+
metadata={
|
| 484 |
+
"iiif_manifest_url": self.manifest_url,
|
| 485 |
+
"iiif_version": self.parser.version,
|
| 486 |
+
"iiif_attribution": self.parser.attribution,
|
| 487 |
+
"pages_selected": pages,
|
| 488 |
+
},
|
| 489 |
+
)
|
| 490 |
+
|
| 491 |
+
def _adjust_resolution(self, image_url: str, canvas_width: Optional[int]) -> str:
|
| 492 |
+
"""Adjust the IIIF Image API URL to honor max_resolution."""
|
| 493 |
+
if not self.max_resolution or not canvas_width:
|
| 494 |
+
return image_url
|
| 495 |
+
if canvas_width <= self.max_resolution:
|
| 496 |
+
return image_url
|
| 497 |
+
# Replace /full/max/ or /full/full/ with /full/{w},/
|
| 498 |
+
url = re.sub(
|
| 499 |
+
r"/full/(max|full)/",
|
| 500 |
+
f"/full/{self.max_resolution},/",
|
| 501 |
+
image_url,
|
| 502 |
+
)
|
| 503 |
+
return url
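The rewrite performed by `_adjust_resolution` only touches URLs that contain a `/full/max/` or `/full/full/` segment. The standalone sketch below (`cap_width` is a hypothetical name, and the URLs are placeholders) shows which inputs are rewritten and which pass through unchanged.

```python
import re
from typing import Optional


def cap_width(image_url: str, canvas_width: Optional[int], max_width: int) -> str:
    """Request a narrower rendition when the canvas exceeds max_width."""
    if not max_width or not canvas_width or canvas_width <= max_width:
        return image_url  # no limit set, unknown width, or already small enough
    # "{w}," asks the server for width w and a proportional height.
    return re.sub(r"/full/(max|full)/", f"/full/{max_width},/", image_url)


iiif = "https://example.org/iiif/p1/full/max/0/default.jpg"
static = "https://example.org/images/p1.jpg"
print(cap_width(iiif, canvas_width=4000, max_width=1200))
print(cap_width(iiif, canvas_width=800, max_width=1200))
print(cap_width(static, canvas_width=4000, max_width=1200))
```

The third call illustrates a limitation: a direct image URL that is not an Image API request has no size segment to rewrite, so `max_resolution` has no effect on it.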
|
| 504 |
+
|
| 505 |
+
|
| 506 |
+
# ---------------------------------------------------------------------------
|
| 507 |
+
# Utility helpers
|
| 508 |
+
# ---------------------------------------------------------------------------
|
| 509 |
+
|
| 510 |
+
def _slugify(text: str) -> str:
|
| 511 |
+
"""Convert an IIIF label into a safe file identifier."""
|
| 512 |
+
text = re.sub(r"[^\w\s-]", "", text.strip())
|
| 513 |
+
text = re.sub(r"[\s_-]+", "_", text)
|
| 514 |
+
return text[:60]
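A few sample labels make the slug rules easier to see. The sketch below copies the two substitutions into a standalone function (`slugify` and `doc_id_for` are hypothetical names) and adds the `canvas_NNNN` fallback used by the importer when the slug comes out empty.

```python
import re


def slugify(text: str) -> str:
    """Drop punctuation, then collapse whitespace, "_" and "-" into single "_"."""
    text = re.sub(r"[^\w\s-]", "", text.strip())
    text = re.sub(r"[\s_-]+", "_", text)
    return text[:60]


def doc_id_for(label: str, index: int) -> str:
    # Fall back to a positional name when the label has no usable characters.
    return slugify(label) or f"canvas_{index + 1:04d}"


print(doc_id_for("f. 1r", 0))     # → f_1r
print(doc_id_for("Page 1", 1))    # → Page_1
print(doc_id_for("???", 2))       # → canvas_0003
```

Because the identifier is derived from the label alone, two canvases whose labels slugify to the same string (for instance `"f. 1r"` and `"f 1r"`) would map to the same file name; appending the canvas index is one possible way to keep them apart.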
|
| 515 |
+
|
| 516 |
+
|
| 517 |
+
def _guess_extension(url: str) -> str:
|
| 518 |
+
"""Determine the image extension from the URL."""
|
| 519 |
+
url_lower = url.lower().split("?")[0]
|
| 520 |
+
for ext in (".jpg", ".jpeg", ".png", ".tif", ".tiff", ".webp"):
|
| 521 |
+
if url_lower.endswith(ext):
|
| 522 |
+
return ext
|
| 523 |
+
# Default for IIIF Image API URLs
|
| 524 |
+
if "/default." in url_lower or "/native." in url_lower:
|
| 525 |
+
return ".jpg"
|
| 526 |
+
return ".jpg"
|
| 527 |
+
|
| 528 |
+
|
| 529 |
+
# ---------------------------------------------------------------------------
|
| 530 |
+
# Convenience function
|
| 531 |
+
# ---------------------------------------------------------------------------
|
| 532 |
+
|
| 533 |
+
def import_iiif_manifest(
|
| 534 |
+
manifest_url: str,
|
| 535 |
+
pages: str = "all",
|
| 536 |
+
output_dir: Optional[str | Path] = None,
|
| 537 |
+
max_resolution: int = 0,
|
| 538 |
+
show_progress: bool = True,
|
| 539 |
+
) -> Corpus:
|
| 540 |
+
"""Import a corpus from an IIIF manifest in a single call.
|
| 541 |
+
|
| 542 |
+
Parameters
|
| 543 |
+
----------
|
| 544 |
+
manifest_url:
|
| 545 |
+
URL of the IIIF manifest (v2 or v3).
|
| 546 |
+
pages:
|
| 547 |
+
Page selector (e.g. ``"1-10"``, ``"1,3,5"``). Defaults to ``"all"``.
|
| 548 |
+
output_dir:
|
| 549 |
+
Destination directory. If None, images are written to temporary files.
|
| 550 |
+
max_resolution:
|
| 551 |
+
Maximum resolution (px). 0 = no limit.
|
| 552 |
+
show_progress:
|
| 553 |
+
Show a progress bar.
|
| 554 |
+
|
| 555 |
+
Returns
|
| 556 |
+
-------
|
| 557 |
+
Corpus
|
| 558 |
+
"""
|
| 559 |
+
importer = IIIFImporter(manifest_url, max_resolution=max_resolution)
|
| 560 |
+
importer.load()
|
| 561 |
+
return importer.import_corpus(
|
| 562 |
+
pages=pages,
|
| 563 |
+
output_dir=output_dir,
|
| 564 |
+
show_progress=show_progress,
|
| 565 |
+
)
|
|
@@ -1,108 +1,17 @@
|
|
| 1 |
-
"""
|
| 2 |
|
| 3 |
-
|
|
|
|
|
|
|
|
|
|
| 4 |
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
:mod:`picarones.importers.gallica` (lines 125-155). The Gallica module
|
| 8 |
-
was 549 lines long, a good part of which reimplemented the same
|
| 9 |
-
HTTP abstractions as IIIF (scheme validation, exponential retry,
|
| 10 |
-
HTTP status code handling).
|
| 11 |
-
|
| 12 |
-
This private module centralizes these helpers. Both importers (and any
|
| 13 |
-
future HTTP importer) use them. Public behavior is
|
| 14 |
-
unchanged; this is purely a refactoring.
|
| 15 |
"""
|
| 16 |
|
| 17 |
-
from __future__ import annotations
|
| 18 |
-
|
| 19 |
-
import logging
|
| 20 |
-
import time
|
| 21 |
-
import urllib.error
|
| 22 |
-
import urllib.request
|
| 23 |
-
from typing import Optional
|
| 24 |
-
from urllib.parse import urlparse
|
| 25 |
-
|
| 26 |
-
logger = logging.getLogger(__name__)
|
| 27 |
-
|
| 28 |
-
_DEFAULT_USER_AGENT = (
|
| 29 |
-
"Picarones/1.0 (OCR benchmark platform; "
|
| 30 |
-
"https://github.com/maribakulj/Picarones)"
|
| 31 |
-
)
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
def validate_http_url(url: str) -> None:
|
| 35 |
-
"""Raises ``ValueError`` if the URL scheme is not http/https.
|
| 36 |
-
|
| 37 |
-
Safeguard against ``file://``, ``ftp://`` and ``data:`` URLs, which
|
| 38 |
-
would allow a malicious IIIF manifest to read local
|
| 39 |
-
files or bypass the network policy.
|
| 40 |
-
"""
|
| 41 |
-
parsed = urlparse(url)
|
| 42 |
-
if parsed.scheme not in ("http", "https"):
|
| 43 |
-
raise ValueError(
|
| 44 |
-
f"Schéma URL non autorisé '{parsed.scheme}' "
|
| 45 |
-
f"(seuls http/https sont acceptés) : {url}"
|
| 46 |
-
)
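The scheme check above relies on `urllib.parse.urlparse`. A minimal standalone sketch of the same test, useful for seeing which inputs are rejected; `is_http_url` and `ALLOWED_SCHEMES` are illustrative names that do not exist in the module, and the boolean return (instead of raising) is a simplification.

```python
from urllib.parse import urlparse

# Illustrative constant: mirrors the ("http", "https") tuple used by validate_http_url.
ALLOWED_SCHEMES = ("http", "https")


def is_http_url(url: str) -> bool:
    """Return True when the URL scheme is http or https (urlparse lowercases it)."""
    return urlparse(url).scheme in ALLOWED_SCHEMES


for candidate in (
    "https://gallica.bnf.fr/ark:/12148/btv1b8453561w/manifest.json",
    "file:///etc/passwd",
    "data:text/plain,hello",
    "example.org/manifest.json",  # no scheme at all
):
    print(f"{candidate!r}: {is_http_url(candidate)}")
```

Note that a scheme-less string is rejected too, since its parsed scheme is empty.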
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
def download_url(
|
| 50 |
-
url: str,
|
| 51 |
-
*,
|
| 52 |
-
retries: int = 4,
|
| 53 |
-
backoff: float = 2.0,
|
| 54 |
-
timeout: int = 60,
|
| 55 |
-
user_agent: str = _DEFAULT_USER_AGENT,
|
| 56 |
-
extra_headers: Optional[dict[str, str]] = None,
|
| 57 |
-
) -> bytes:
|
| 58 |
-
"""Downloads a URL with exponential retry.
|
| 59 |
-
|
| 60 |
-
Parameters
|
| 61 |
-
----------
|
| 62 |
-
url:
|
| 63 |
-
URL to download. Validated by :func:`validate_http_url`.
|
| 64 |
-
retries:
|
| 65 |
-
Total number of attempts (default 4).
|
| 66 |
-
backoff:
|
| 67 |
-
Base of the exponential backoff: wait = ``backoff ** attempt``
|
| 68 |
-
seconds (default 2.0 → 0, 2, 4, 8 s).
|
| 69 |
-
timeout:
|
| 70 |
-
HTTP timeout per attempt, in seconds (default 60).
|
| 71 |
-
user_agent:
|
| 72 |
-
``User-Agent`` header sent. Default: identifies Picarones.
|
| 73 |
-
extra_headers:
|
| 74 |
-
Additional headers (e.g. ``{"Accept": "application/json"}``).
|
| 75 |
-
|
| 76 |
-
Raises
|
| 77 |
-
------
|
| 78 |
-
ValueError
|
| 79 |
-
If the URL does not have an allowed scheme.
|
| 80 |
-
RuntimeError
|
| 81 |
-
If all attempts fail.
|
| 82 |
-
"""
|
| 83 |
-
validate_http_url(url)
|
| 84 |
-
headers = {"User-Agent": user_agent}
|
| 85 |
-
if extra_headers:
|
| 86 |
-
headers.update(extra_headers)
|
| 87 |
-
last_exc: Optional[Exception] = None
|
| 88 |
-
for attempt in range(retries):
|
| 89 |
-
if attempt > 0:
|
| 90 |
-
wait = backoff ** attempt
|
| 91 |
-
logger.debug(
|
| 92 |
-
"Retry %d/%d dans %.1fs — %s",
|
| 93 |
-
attempt, retries - 1, wait, url,
|
| 94 |
-
)
|
| 95 |
-
time.sleep(wait)
|
| 96 |
-
try:
|
| 97 |
-
req = urllib.request.Request(url, headers=headers)
|
| 98 |
-
with urllib.request.urlopen(req, timeout=timeout) as resp:
|
| 99 |
-
return resp.read()
|
| 100 |
-
except (urllib.error.URLError, urllib.error.HTTPError) as exc:
|
| 101 |
-
last_exc = exc
|
| 102 |
-
logger.warning("Erreur téléchargement %s : %s", url, exc)
|
| 103 |
-
raise RuntimeError(
|
| 104 |
-
f"Impossible de télécharger {url} après {retries} tentatives",
|
| 105 |
-
) from last_exc
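The retry loop sleeps `backoff ** attempt` seconds before every attempt except the first. The sketch below only computes that schedule, without any network access, so the "0, 2, 4, 8 s" figure from the docstring can be checked; `backoff_schedule` is a hypothetical name, not part of `_http.py`.

```python
def backoff_schedule(retries: int = 4, backoff: float = 2.0) -> list[float]:
    """Seconds slept before each attempt: none before the first, then backoff ** attempt."""
    return [0.0 if attempt == 0 else backoff ** attempt for attempt in range(retries)]


waits = backoff_schedule()
print(waits)       # per-attempt waits with the default parameters
print(sum(waits))  # total time spent sleeping if every attempt fails
```

With the defaults, a URL that never answers costs 14 s of sleep on top of up to four 60 s timeouts, which is worth keeping in mind when importing large manifests.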
|
| 106 |
-
|
| 107 |
|
| 108 |
-
|
| 1 |
+
"""Backward-compatibility alias: module moved to :mod:`picarones.extras.importers._http`.
|
| 2 |
|
| 3 |
+
Phase C of the 3-circle refactoring effort (architecture-cercles.md).
|
| 4 |
+
This importer now lives in Circle 3 (``extras/importers/``). The alias
|
| 5 |
+
here lets historical imports (``from picarones.importers._http
|
| 6 |
+
import ...``) keep working.
|
| 7 |
|
| 8 |
+
See :doc:`docs/architecture-cercles.md` and the
|
| 9 |
+
``picarones[importers]`` extra in ``pyproject.toml``.
|
| 10 |
"""
|
| 11 |
|
| 12 |
+
from picarones.extras.importers._http import * # noqa: F401, F403
|
| 13 |
|
| 14 |
+
import picarones.extras.importers._http as _module
|
| 15 |
+
__all__ = getattr(_module, "__all__", [
|
| 16 |
+
name for name in dir(_module) if not name.startswith("_")
|
| 17 |
+
])
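The shim pattern above (star import plus a computed `__all__`) can be reproduced with two throwaway in-memory modules to see what it does and does not re-export. The module names `demo_extras_http` and `demo_shim_http` are made up for this sketch; only the three shim statements mirror the diff.

```python
import sys
import types

# Stand-in for picarones.extras.importers._http (hypothetical content).
target = types.ModuleType("demo_extras_http")
exec(
    "def validate_http_url(url):\n"
    "    return url\n"
    "_DEFAULT_USER_AGENT = 'demo'\n",
    target.__dict__,
)
sys.modules["demo_extras_http"] = target

# Stand-in for the 17-line shim: same three statements as in the diff.
shim = types.ModuleType("demo_shim_http")
exec(
    "from demo_extras_http import *\n"
    "import demo_extras_http as _module\n"
    "__all__ = getattr(_module, '__all__', [\n"
    "    name for name in dir(_module) if not name.startswith('_')\n"
    "])\n",
    shim.__dict__,
)

print(shim.validate_http_url is target.validate_http_url)  # same object, not a copy
print(hasattr(shim, "_DEFAULT_USER_AGENT"))                # underscore names are skipped
print(shim.__all__)
```

One consequence, assuming the target module defines no `__all__`: names starting with an underscore are not re-exported, so a historical import of a private name through the shim would fail even though public names keep their identity.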
|
|
@@ -1,534 +1,17 @@
|
|
| 1 |
-
"""
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
3. Export of Picarones benchmark results as an OCR layer in eScriptorium
|
| 8 |
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
eScriptorium exposes a documented REST API at /api/.
|
| 12 |
-
The main endpoints used here:
|
| 13 |
-
- GET /api/projects/ → list of projects
|
| 14 |
-
- GET /api/documents/ → list of documents (filterable by project)
|
| 15 |
-
- GET /api/documents/{pk}/parts/ → list of a document's pages
|
| 16 |
-
- GET /api/documents/{pk}/parts/{pk}/transcriptions/ → transcriptions of a page
|
| 17 |
-
- POST /api/documents/{pk}/parts/{pk}/transcriptions/ → create an OCR layer
|
| 18 |
-
|
| 19 |
-
Usage
|
| 20 |
-
-----
|
| 21 |
-
>>> from picarones.importers.escriptorium import EScriptoriumClient
|
| 22 |
-
>>> client = EScriptoriumClient("https://escriptorium.example.org", token="abc123")
|
| 23 |
-
>>> projects = client.list_projects()
|
| 24 |
-
>>> corpus = client.import_document(doc_id=42, transcription_layer="manual")
|
| 25 |
"""
|
| 26 |
|
| 27 |
-
from __future__ import annotations
|
| 28 |
-
|
| 29 |
-
import json
|
| 30 |
-
import logging
|
| 31 |
-
import urllib.error
|
| 32 |
-
import urllib.parse
|
| 33 |
-
import urllib.request
|
| 34 |
-
from dataclasses import dataclass, field
|
| 35 |
-
from pathlib import Path
|
| 36 |
-
from typing import TYPE_CHECKING, Optional
|
| 37 |
-
|
| 38 |
-
from picarones.core.corpus import Corpus, Document
|
| 39 |
-
|
| 40 |
-
if TYPE_CHECKING:
|
| 41 |
-
from picarones.core.results import BenchmarkResult
|
| 42 |
-
|
| 43 |
-
logger = logging.getLogger(__name__)
|
| 44 |
-
|
| 45 |
-
# ---------------------------------------------------------------------------
|
| 46 |
-
# eScriptorium data structures
|
| 47 |
-
# ---------------------------------------------------------------------------
|
| 48 |
-
|
| 49 |
-
@dataclass
|
| 50 |
-
class EScriptoriumProject:
|
| 51 |
-
"""Representation of an eScriptorium project."""
|
| 52 |
-
pk: int
|
| 53 |
-
name: str
|
| 54 |
-
slug: str
|
| 55 |
-
owner: str = ""
|
| 56 |
-
document_count: int = 0
|
| 57 |
-
|
| 58 |
-
def as_dict(self) -> dict:
|
| 59 |
-
return {
|
| 60 |
-
"pk": self.pk,
|
| 61 |
-
"name": self.name,
|
| 62 |
-
"slug": self.slug,
|
| 63 |
-
"owner": self.owner,
|
| 64 |
-
"document_count": self.document_count,
|
| 65 |
-
}
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
@dataclass
|
| 69 |
-
class EScriptoriumDocument:
|
| 70 |
-
"""Representation of an eScriptorium document."""
|
| 71 |
-
pk: int
|
| 72 |
-
name: str
|
| 73 |
-
project: str = ""
|
| 74 |
-
part_count: int = 0
|
| 75 |
-
transcription_layers: list[str] = field(default_factory=list)
|
| 76 |
-
|
| 77 |
-
def as_dict(self) -> dict:
|
| 78 |
-
return {
|
| 79 |
-
"pk": self.pk,
|
| 80 |
-
"name": self.name,
|
| 81 |
-
"project": self.project,
|
| 82 |
-
"part_count": self.part_count,
|
| 83 |
-
"transcription_layers": self.transcription_layers,
|
| 84 |
-
}
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
@dataclass
|
| 88 |
-
class EScriptoriumPart:
|
| 89 |
-
"""A page (part) of an eScriptorium document."""
|
| 90 |
-
pk: int
|
| 91 |
-
title: str
|
| 92 |
-
image_url: str
|
| 93 |
-
order: int = 0
|
| 94 |
-
transcriptions: list[dict] = field(default_factory=list)
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
# ---------------------------------------------------------------------------
|
| 98 |
-
# eScriptorium API client
|
| 99 |
-
# ---------------------------------------------------------------------------
|
| 100 |
-
|
| 101 |
-
class EScriptoriumClient:
|
| 102 |
-
"""Client for the eScriptorium REST API.
|
| 103 |
-
|
| 104 |
-
Parameters
|
| 105 |
-
----------
|
| 106 |
-
base_url:
|
| 107 |
-
Root URL of the instance (e.g. ``"https://escriptorium.example.org"``).
|
| 108 |
-
token:
|
| 109 |
-
API authentication token (from Settings > API in eScriptorium).
|
| 110 |
-
timeout:
|
| 111 |
-
HTTP timeout in seconds.
|
| 112 |
-
|
| 113 |
-
Examples
|
| 114 |
-
--------
|
| 115 |
-
>>> client = EScriptoriumClient("https://escriptorium.example.org", token="abc123")
|
| 116 |
-
>>> projects = client.list_projects()
|
| 117 |
-
>>> corpus = client.import_document(42, transcription_layer="manual")
|
| 118 |
-
"""
|
| 119 |
-
|
| 120 |
-
def __init__(
|
| 121 |
-
self,
|
| 122 |
-
base_url: str,
|
| 123 |
-
token: str,
|
| 124 |
-
timeout: int = 30,
|
| 125 |
-
) -> None:
|
| 126 |
-
self.base_url = base_url.rstrip("/")
|
| 127 |
-
self.token = token
|
| 128 |
-
self.timeout = timeout
|
| 129 |
-
|
| 130 |
-
# ------------------------------------------------------------------
|
| 131 |
-
# HTTP helpers
|
| 132 |
-
# ------------------------------------------------------------------
|
| 133 |
-
|
| 134 |
-
def _headers(self) -> dict[str, str]:
|
| 135 |
-
return {
|
| 136 |
-
"Authorization": f"Token {self.token}",
|
| 137 |
-
"Accept": "application/json",
|
| 138 |
-
"Content-Type": "application/json",
|
| 139 |
-
}
|
| 140 |
-
|
| 141 |
-
def _get(self, path: str, params: Optional[dict] = None) -> dict:
|
| 142 |
-
"""Performs a GET request and returns the JSON."""
|
| 143 |
-
url = f"{self.base_url}/api/{path.lstrip('/')}"
|
| 144 |
-
if params:
|
| 145 |
-
url += "?" + urllib.parse.urlencode(params)
|
| 146 |
-
req = urllib.request.Request(url, headers=self._headers())
|
| 147 |
-
try:
|
| 148 |
-
with urllib.request.urlopen(req, timeout=self.timeout) as resp:
|
| 149 |
-
return json.loads(resp.read().decode("utf-8"))
|
| 150 |
-
except urllib.error.HTTPError as exc:
|
| 151 |
-
raise RuntimeError(
|
| 152 |
-
f"eScriptorium API erreur {exc.code} sur {url}: {exc.reason}"
|
| 153 |
-
) from exc
|
| 154 |
-
except urllib.error.URLError as exc:
|
| 155 |
-
raise RuntimeError(
|
| 156 |
-
f"Impossible de joindre {self.base_url}: {exc.reason}"
|
| 157 |
-
) from exc
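`_get` and `_post` build their URL the same way: strip the trailing slash of the base URL, strip the leading slash of the path, join under `/api/`, then append an encoded query string. A standalone sketch of that step, with `build_api_url` as a hypothetical free function (the client does this inline):

```python
from typing import Optional
from urllib.parse import urlencode


def build_api_url(base_url: str, path: str, params: Optional[dict] = None) -> str:
    """Join base URL and API path the way the client does, then add query parameters."""
    url = f"{base_url.rstrip('/')}/api/{path.lstrip('/')}"
    if params:
        url += "?" + urlencode(params)
    return url


print(build_api_url("https://escriptorium.example.org/", "/documents/", {"project": 3, "page_size": 100}))
```

Normalizing both sides of the join means callers can pass `"documents/"` or `"/documents/"` interchangeably without producing a double slash.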
|
| 158 |
-
|
| 159 |
-
def _post(self, path: str, payload: dict) -> dict:
|
| 160 |
-
"""Performs a POST request with a JSON payload."""
|
| 161 |
-
url = f"{self.base_url}/api/{path.lstrip('/')}"
|
| 162 |
-
data = json.dumps(payload).encode("utf-8")
|
| 163 |
-
req = urllib.request.Request(
|
| 164 |
-
url, data=data, headers=self._headers(), method="POST"
|
| 165 |
-
)
|
| 166 |
-
try:
|
| 167 |
-
with urllib.request.urlopen(req, timeout=self.timeout) as resp:
|
| 168 |
-
body = resp.read().decode("utf-8")
|
| 169 |
-
return json.loads(body) if body else {}
|
| 170 |
-
except urllib.error.HTTPError as exc:
|
| 171 |
-
raise RuntimeError(
|
| 172 |
-
f"eScriptorium API erreur {exc.code} sur {url}: {exc.reason}"
|
| 173 |
-
) from exc
|
| 174 |
-
except urllib.error.URLError as exc:
|
| 175 |
-
raise RuntimeError(
|
| 176 |
-
f"Impossible de joindre {self.base_url}: {exc.reason}"
|
| 177 |
-
) from exc
|
| 178 |
-
|
| 179 |
-
def _paginate(self, path: str, params: Optional[dict] = None) -> list[dict]:
|
| 180 |
-
"""Walks through all pages of paginated results."""
|
| 181 |
-
results: list[dict] = []
|
| 182 |
-
current_params = dict(params or {})
|
| 183 |
-
current_params.setdefault("page_size", 100)
|
| 184 |
-
page_num = 1
|
| 185 |
-
while True:
|
| 186 |
-
current_params["page"] = page_num
|
| 187 |
-
data = self._get(path, current_params)
|
| 188 |
-
if isinstance(data, list):
|
| 189 |
-
results.extend(data)
|
| 190 |
-
break
|
| 191 |
-
results.extend(data.get("results", []))
|
| 192 |
-
if not data.get("next"):
|
| 193 |
-
break
|
| 194 |
-
page_num += 1
|
| 195 |
-
return results
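The pagination loop can be exercised without a server by swapping the HTTP call for a callable. In this sketch `paginate` takes a `fetch` function in place of `self._get`, and `FAKE_PAGES` is invented test data shaped like a `results`/`next` envelope; both are assumptions made for illustration.

```python
from typing import Callable, Optional


def paginate(fetch: Callable[[dict], object], params: Optional[dict] = None) -> list[dict]:
    """Collect every item from a paginated endpoint, following the loop in _paginate."""
    results: list[dict] = []
    current = dict(params or {})
    current.setdefault("page_size", 100)
    page_num = 1
    while True:
        current["page"] = page_num
        data = fetch(dict(current))
        if isinstance(data, list):  # unpaginated endpoint: a bare JSON list
            results.extend(data)
            break
        results.extend(data.get("results", []))
        if not data.get("next"):  # last page reached
            break
        page_num += 1
    return results


# Invented two-page response for the demo.
FAKE_PAGES = {
    1: {"results": [{"pk": 1}, {"pk": 2}], "next": "https://escriptorium.example.org/api/projects/?page=2"},
    2: {"results": [{"pk": 3}], "next": None},
}

print(paginate(lambda params: FAKE_PAGES[params["page"]]))
```

The loop trusts the server's `next` field to terminate; a defensive variant could also cap the number of pages.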
|
| 196 |
-
|
| 197 |
-
# ------------------------------------------------------------------
|
| 198 |
-
# Public API
|
| 199 |
-
# ------------------------------------------------------------------
|
| 200 |
-
|
| 201 |
-
def test_connection(self) -> bool:
|
| 202 |
-
"""Checks that the URL and token are valid.
|
| 203 |
-
|
| 204 |
-
Returns
|
| 205 |
-
-------
|
| 206 |
-
bool
|
| 207 |
-
True if authentication succeeds.
|
| 208 |
-
"""
|
| 209 |
-
try:
|
| 210 |
-
self._get("projects/", {"page_size": 1})
|
| 211 |
-
return True
|
| 212 |
-
except RuntimeError:
|
| 213 |
-
return False
|
| 214 |
-
|
| 215 |
-
def list_projects(self) -> list[EScriptoriumProject]:
|
| 216 |
-
"""Returns the list of accessible projects.
|
| 217 |
-
|
| 218 |
-
Returns
|
| 219 |
-
-------
|
| 220 |
-
list[EScriptoriumProject]
|
| 221 |
-
"""
|
| 222 |
-
raw = self._paginate("projects/")
|
| 223 |
-
projects = []
|
| 224 |
-
for item in raw:
|
| 225 |
-
projects.append(EScriptoriumProject(
|
| 226 |
-
pk=item["pk"],
|
| 227 |
-
name=item.get("name", ""),
|
| 228 |
-
slug=item.get("slug", ""),
|
| 229 |
-
owner=item.get("owner", {}).get("username", "") if isinstance(item.get("owner"), dict) else str(item.get("owner", "")),
|
| 230 |
-
document_count=item.get("documents_count", 0),
|
| 231 |
-
))
|
| 232 |
-
return projects
|
| 233 |
-
|
| 234 |
-
def list_documents(
|
| 235 |
-
self,
|
| 236 |
-
project_pk: Optional[int] = None,
|
| 237 |
-
) -> list[EScriptoriumDocument]:
|
| 238 |
-
"""Returns the list of documents, filtered by project if provided.
|
| 239 |
-
|
| 240 |
-
Parameters
|
| 241 |
-
----------
|
| 242 |
-
project_pk:
|
| 243 |
-
PK of the eScriptorium project (optional).
|
| 244 |
-
|
| 245 |
-
Returns
|
| 246 |
-
-------
|
| 247 |
-
list[EScriptoriumDocument]
|
| 248 |
-
"""
|
| 249 |
-
params: dict = {}
|
| 250 |
-
if project_pk is not None:
|
| 251 |
-
params["project"] = project_pk
|
| 252 |
-
raw = self._paginate("documents/", params)
|
| 253 |
-
docs = []
|
| 254 |
-
for item in raw:
|
| 255 |
-
layers = [
|
| 256 |
-
t.get("name", "") if isinstance(t, dict) else str(t)
|
| 257 |
-
for t in item.get("transcriptions", [])
|
| 258 |
-
]
|
| 259 |
-
docs.append(EScriptoriumDocument(
|
| 260 |
-
pk=item["pk"],
|
| 261 |
-
name=item.get("name", ""),
|
| 262 |
-
project=str(item.get("project", "")),
|
| 263 |
-
part_count=item.get("parts_count", 0),
|
| 264 |
-
transcription_layers=layers,
|
| 265 |
-
))
|
| 266 |
-
return docs
|
| 267 |
-
|
| 268 |
-
def list_parts(self, doc_pk: int) -> list[EScriptoriumPart]:
|
| 269 |
-
"""Returns the pages (parts) of a document.
|
| 270 |
-
|
| 271 |
-
Parameters
|
| 272 |
-
----------
|
| 273 |
-
doc_pk:
|
| 274 |
-
PK of the eScriptorium document.
|
| 275 |
-
|
| 276 |
-
Returns
|
| 277 |
-
-------
|
| 278 |
-
list[EScriptoriumPart]
|
| 279 |
-
"""
|
| 280 |
-
raw = self._paginate(f"documents/{doc_pk}/parts/")
|
| 281 |
-
parts = []
|
| 282 |
-
for item in raw:
|
| 283 |
-
parts.append(EScriptoriumPart(
|
| 284 |
-
pk=item["pk"],
|
| 285 |
-
title=item.get("title", "") or f"Part {item.get('order', 0) + 1}",
|
| 286 |
-
image_url=item.get("image", "") or "",
|
| 287 |
-
order=item.get("order", 0),
|
| 288 |
-
))
|
| 289 |
-
return parts
|
| 290 |
-
|
| 291 |
-
def get_transcriptions(self, doc_pk: int, part_pk: int) -> list[dict]:
|
| 292 |
-
"""Returns the transcriptions available for a page.
|
| 293 |
-
|
| 294 |
-
Parameters
|
| 295 |
-
----------
|
| 296 |
-
doc_pk:
|
| 297 |
-
PK of the document.
|
| 298 |
-
part_pk:
|
| 299 |
-
PK of the page.
|
| 300 |
-
|
| 301 |
-
Returns
|
| 302 |
-
-------
|
| 303 |
-
list[dict]
|
| 304 |
-
Each dict contains ``{"name": str, "content": str}``.
|
| 305 |
-
"""
|
| 306 |
-
raw = self._get(f"documents/{doc_pk}/parts/{part_pk}/transcriptions/")
|
| 307 |
-
if isinstance(raw, list):
|
| 308 |
-
return raw
|
| 309 |
-
return raw.get("results", [])
|
| 310 |
-
|
| 311 |
-
def import_document(
|
| 312 |
-
self,
|
| 313 |
-
doc_pk: int,
|
| 314 |
-
transcription_layer: str = "manual",
|
| 315 |
-
output_dir: Optional[str] = None,
|
| 316 |
-
download_images: bool = True,
|
| 317 |
-
show_progress: bool = True,
|
| 318 |
-
) -> Corpus:
|
| 319 |
-
"""Imports an eScriptorium document as a Picarones corpus.
|
| 320 |
-
|
| 321 |
-
Downloads the images and retrieves the transcriptions of the specified
|
| 322 |
-
layer as ground truth.
|
| 323 |
-
|
| 324 |
-
Parameters
|
| 325 |
-
----------
|
| 326 |
-
doc_pk:
|
| 327 |
-
PK of the document in eScriptorium.
|
| 328 |
-
transcription_layer:
|
| 329 |
-
Name of the transcription layer to use as GT.
|
| 330 |
-
output_dir:
|
| 331 |
-
Local folder for downloaded images. If None, images
|
| 332 |
-
are kept in memory (not saved to disk).
|
| 333 |
-
download_images:
|
| 334 |
-
If True, downloads the images into output_dir.
|
| 335 |
-
show_progress:
|
| 336 |
-
Displays a tqdm progress bar.
|
| 337 |
-
|
| 338 |
-
Returns
|
| 339 |
-
-------
|
| 340 |
-
Corpus
|
| 341 |
-
Picarones corpus with documents and GT.
|
| 342 |
-
"""
|
| 343 |
-
# Retrieve the document metadata
|
| 344 |
-
doc_info = self._get(f"documents/{doc_pk}/")
|
| 345 |
-
doc_name = doc_info.get("name", f"document_{doc_pk}")
|
| 346 |
-
|
| 347 |
-
parts = self.list_parts(doc_pk)
|
| 348 |
-
if not parts:
|
| 349 |
-
raise ValueError(f"Aucune page trouvée dans le document {doc_pk}")
|
| 350 |
-
|
| 351 |
-
if show_progress:
|
| 352 |
-
try:
|
| 353 |
-
from tqdm import tqdm
|
| 354 |
-
iterator = tqdm(parts, desc=f"Import {doc_name}")
|
| 355 |
-
except ImportError:
|
| 356 |
-
iterator = iter(parts)
|
| 357 |
-
else:
|
| 358 |
-
iterator = iter(parts)
|
| 359 |
-
|
| 360 |
-
out_path: Optional[Path] = None
|
| 361 |
-
if output_dir and download_images:
|
| 362 |
-
out_path = Path(output_dir)
|
| 363 |
-
out_path.mkdir(parents=True, exist_ok=True)
|
| 364 |
-
|
| 365 |
-
documents: list[Document] = []
|
| 366 |
-
for part in iterator:
|
| 367 |
-
# Retrieve the transcriptions
|
| 368 |
-
transcriptions = self.get_transcriptions(doc_pk, part.pk)
|
| 369 |
-
gt_text = ""
|
| 370 |
-
for t in transcriptions:
|
| 371 |
-
layer_name = t.get("transcription", {}).get("name", "") if isinstance(t.get("transcription"), dict) else t.get("name", "")
|
| 372 |
-
if layer_name == transcription_layer or not transcription_layer:
|
| 373 |
-
# The content is in "content" or in the lines
|
| 374 |
-
lines = t.get("lines", []) or []
|
| 375 |
-
if lines:
|
| 376 |
-
gt_text = "\n".join(
|
| 377 |
-
line.get("content", "") or ""
|
| 378 |
-
for line in lines
|
| 379 |
-
if line.get("content")
|
| 380 |
-
)
|
| 381 |
-
else:
|
| 382 |
-
gt_text = t.get("content", "") or ""
|
| 383 |
-
break
|
| 384 |
-
|
| 385 |
-
# Image
|
| 386 |
-
image_path = part.image_url or f"escriptorium://doc{doc_pk}/part{part.pk}"
|
| 387 |
-
if out_path and part.image_url and download_images:
|
| 388 |
-
ext = Path(urllib.parse.urlparse(part.image_url).path).suffix or ".jpg"
|
| 389 |
-
local_img = out_path / f"part_{part.pk:05d}{ext}"
|
| 390 |
-
try:
|
| 391 |
-
urllib.request.urlretrieve(part.image_url, local_img)
|
| 392 |
-
image_path = str(local_img)
|
| 393 |
-
except Exception as exc:
|
| 394 |
-
logger.warning("Impossible de télécharger l'image %s: %s", part.image_url, exc)
|
| 395 |
-
|
| 396 |
-
# Save the GT
|
| 397 |
-
gt_path = out_path / f"part_{part.pk:05d}.gt.txt"
|
| 398 |
-
gt_path.write_text(gt_text, encoding="utf-8")
|
| 399 |
-
|
| 400 |
-
documents.append(Document(
|
| 401 |
-
doc_id=f"part_{part.pk:05d}",
|
| 402 |
-
image_path=image_path,
|
| 403 |
-
ground_truth=gt_text,
|
| 404 |
-
metadata={
|
| 405 |
-
"source": "escriptorium",
|
| 406 |
-
"doc_pk": doc_pk,
|
| 407 |
-
"part_pk": part.pk,
|
| 408 |
-
"part_title": part.title,
|
| 409 |
-
"transcription_layer": transcription_layer,
|
| 410 |
-
},
|
| 411 |
-
))
|
| 412 |
-
|
| 413 |
-
return Corpus(
|
| 414 |
-
name=doc_name,
|
| 415 |
-
source=f"{self.base_url}/document/{doc_pk}/",
|
| 416 |
-
documents=documents,
|
| 417 |
-
metadata={
|
| 418 |
-
"escriptorium_url": self.base_url,
|
| 419 |
-
"doc_pk": doc_pk,
|
| 420 |
-
"transcription_layer": transcription_layer,
|
| 421 |
-
},
|
| 422 |
-
)
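The ground-truth selection inside `import_document` is the densest part of the function: pick the first transcription whose layer name matches, prefer its `lines`, fall back to `content`. The sketch below isolates that logic; `extract_ground_truth` is a hypothetical helper and the payload shape is an assumption inferred from the code above, not from a validated eScriptorium instance.

```python
def extract_ground_truth(transcriptions: list[dict], layer: str) -> str:
    """Return the text of the first transcription matching `layer` ("" if none)."""
    for item in transcriptions:
        nested = item.get("transcription")
        # The layer name may be nested under "transcription" or sit at the top level.
        name = nested.get("name", "") if isinstance(nested, dict) else item.get("name", "")
        if name == layer or not layer:  # an empty layer name accepts the first entry
            lines = item.get("lines") or []
            if lines:
                return "\n".join(line["content"] for line in lines if line.get("content"))
            return item.get("content", "") or ""
    return ""


# Assumed payload shape, for illustration only.
payload = [
    {"transcription": {"name": "kraken"}, "content": "auto text"},
    {"transcription": {"name": "manual"},
     "lines": [{"content": "line one"}, {"content": ""}, {"content": "line two"}]},
]
print(repr(extract_ground_truth(payload, "manual")))
```

Pulling this into a pure function would also make it testable without HTTP, which matters for a module flagged as experimental for lack of tests.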
|
| 423 |
-
|
| 424 |
-
def export_benchmark_as_layer(
|
| 425 |
-
self,
|
| 426 |
-
benchmark_result: "BenchmarkResult",
|
| 427 |
-
doc_pk: int,
|
| 428 |
-
engine_name: str,
|
| 429 |
-
layer_name: Optional[str] = None,
|
| 430 |
-
part_mapping: Optional[dict[str, int]] = None,
|
| 431 |
-
) -> int:
|
| 432 |
-
"""Exports Picarones results as an OCR layer in eScriptorium.
|
| 433 |
-
|
| 434 |
-
Parameters
|
| 435 |
-
----------
|
| 436 |
-
benchmark_result:
|
| 437 |
-
Picarones benchmark results.
|
| 438 |
-
doc_pk:
|
| 439 |
-
PK of the target document in eScriptorium.
|
| 440 |
-
engine_name:
|
| 441 |
-
Name of the engine whose transcriptions are exported.
|
| 442 |
-
layer_name:
|
| 443 |
-
Name of the layer to create (default: ``"picarones_{engine_name}"``).
|
| 444 |
-
part_mapping:
|
| 445 |
-
``doc_id → part_pk`` mapping for eScriptorium. If None,
|
| 446 |
-
the mapping is inferred from the document metadata.
|
| 447 |
-
|
| 448 |
-
Returns
|
| 449 |
-
-------
|
| 450 |
-
int
|
| 451 |
-
Number of pages exported successfully.
|
| 452 |
-
"""
|
| 453 |
-
if layer_name is None:
|
| 454 |
-
layer_name = f"picarones_{engine_name}"
|
| 455 |
-
|
| 456 |
-
# Find the engine's report
|
| 457 |
-
engine_report = None
|
| 458 |
-
for report in benchmark_result.engine_reports:
|
| 459 |
-
if report.engine_name == engine_name:
|
| 460 |
-
engine_report = report
|
| 461 |
-
break
|
| 462 |
-
if engine_report is None:
|
| 463 |
-
raise ValueError(f"Moteur '{engine_name}' introuvable dans les résultats.")
|
| 464 |
-
|
| 465 |
-
exported = 0
|
| 466 |
-
for doc_result in engine_report.document_results:
|
| 467 |
-
if doc_result.engine_error:
|
| 468 |
-
continue
|
| 469 |
-
|
| 470 |
-
# Determine the part_pk
|
| 471 |
-
part_pk: Optional[int] = None
|
| 472 |
-
if part_mapping and doc_result.doc_id in part_mapping:
|
| 473 |
-
part_pk = part_mapping[doc_result.doc_id]
|
| 474 |
-
else:
|
| 475 |
-
# Try to extract it from doc_id (e.g. "part_00042")
|
| 476 |
-
try:
|
| 477 |
-
part_pk = int(doc_result.doc_id.replace("part_", "").lstrip("0") or "0")
|
| 478 |
-
except ValueError:
|
| 479 |
-
logger.warning("Impossible de déterminer part_pk pour %s", doc_result.doc_id)
|
| 480 |
-
continue
|
| 481 |
-
|
| 482 |
-
try:
|
| 483 |
-
self._post(
|
| 484 |
-
f"documents/{doc_pk}/parts/{part_pk}/transcriptions/",
|
| 485 |
-
{
|
| 486 |
-
"name": layer_name,
|
| 487 |
-
"content": doc_result.hypothesis,
|
| 488 |
-
"source": "picarones",
|
| 489 |
-
},
|
| 490 |
-
)
|
| 491 |
-
exported += 1
|
| 492 |
-
logger.debug("Exporté part %d → couche '%s'", part_pk, layer_name)
|
| 493 |
-
except RuntimeError as exc:
|
| 494 |
-
logger.warning("Erreur export part %d: %s", part_pk, exc)
|
| 495 |
-
|
| 496 |
-
return exported
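The `part_pk` resolution in `export_benchmark_as_layer` first consults `part_mapping`, then falls back to parsing a `part_NNNNN` identifier. A standalone sketch of that fallback, with `infer_part_pk` as a hypothetical name; it returns `None` where the original logs a warning and skips the page.

```python
from typing import Optional


def infer_part_pk(doc_id: str, part_mapping: Optional[dict[str, int]] = None) -> Optional[int]:
    """Resolve an eScriptorium part PK from an explicit mapping or a "part_00042" id."""
    if part_mapping and doc_id in part_mapping:
        return part_mapping[doc_id]
    try:
        # Same expression as the original: drop the prefix, then the zero padding.
        return int(doc_id.replace("part_", "").lstrip("0") or "0")
    except ValueError:
        return None  # the original logs a warning and moves to the next document


for doc_id in ("part_00042", "part_00000", "folio_12r"):
    print(doc_id, infer_part_pk(doc_id))
```

The fallback only works for corpora whose `doc_id` values were produced by `import_document`; any other naming scheme needs an explicit `part_mapping`.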
|
| 497 |
-
|
| 498 |
-
|
| 499 |
-
# ---------------------------------------------------------------------------
|
| 500 |
-
# Module-level interface
|
| 501 |
-
# ---------------------------------------------------------------------------
|
| 502 |
-
|
| 503 |
-
def connect_escriptorium(
|
| 504 |
-
base_url: str,
|
| 505 |
-
token: str,
|
| 506 |
-
timeout: int = 30,
|
| 507 |
-
) -> EScriptoriumClient:
|
| 508 |
-
"""Creates and returns an authenticated eScriptorium client.
|
| 509 |
-
|
| 510 |
-
Parameters
|
| 511 |
-
----------
|
| 512 |
-
base_url:
|
| 513 |
-
URL of the eScriptorium instance.
|
| 514 |
-
token:
|
| 515 |
-
API token.
|
| 516 |
-
timeout:
|
| 517 |
-
HTTP timeout.
|
| 518 |
-
|
| 519 |
-
Returns
|
| 520 |
-
-------
|
| 521 |
-
EScriptoriumClient
|
| 522 |
|
| 523 |
-
|
| 524 |
-
|
| 525 |
-
|
| 526 |
-
|
| 527 |
-
"""
|
| 528 |
-
client = EScriptoriumClient(base_url, token, timeout)
|
| 529 |
-
if not client.test_connection():
|
| 530 |
-
raise RuntimeError(
|
| 531 |
-
f"Impossible de se connecter à {base_url}. "
|
| 532 |
-
"Vérifiez l'URL et le token API."
|
| 533 |
-
)
|
| 534 |
-
return client
|
| 1 |
+
"""Backward-compatibility alias: module moved to :mod:`picarones.extras.importers.escriptorium`.
|
| 2 |
|
| 3 |
+
Phase C of the 3-circle refactoring effort (architecture-cercles.md).
|
| 4 |
+
This importer now lives in Circle 3 (``extras/importers/``). The alias
|
| 5 |
+
here lets historical imports (``from picarones.importers.escriptorium
|
| 6 |
+
import ...``) keep working.
|
| 7 |
|
| 8 |
+
See :doc:`docs/architecture-cercles.md` and the
|
| 9 |
+
``picarones[importers]`` extra in ``pyproject.toml``.
|
| 10 |
"""
|
| 11 |
|
| 12 |
+
from picarones.extras.importers.escriptorium import * # noqa: F401, F403
|
| 13 |
|
| 14 |
+
import picarones.extras.importers.escriptorium as _module
|
| 15 |
+
__all__ = getattr(_module, "__all__", [
|
| 16 |
+
name for name in dir(_module) if not name.startswith("_")
|
| 17 |
+
])
|
|
@@ -1,553 +1,17 @@
|
|
| 1 |
-
"""
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
3. Retrieval of the existing Gallica OCR (plain text or ALTO) as a reference competitor
|
| 8 |
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
- BnF SRU: https://gallica.bnf.fr/SRU?operation=searchRetrieve&query=...
|
| 12 |
-
- Gallica IIIF: https://gallica.bnf.fr/ark:/12148/{ark}/manifest.json
|
| 13 |
-
- Plain-text OCR: https://gallica.bnf.fr/ark:/12148/{ark}/f{n}.texteBrut
|
| 14 |
-
- OAI-PMH metadata: https://gallica.bnf.fr/services/OAIRecord?ark={ark}
|
| 15 |
-
|
| 16 |
-
Usage
|
| 17 |
-
-----
|
| 18 |
-
>>> from picarones.importers.gallica import GallicaClient
|
| 19 |
-
>>> client = GallicaClient()
|
| 20 |
-
>>> results = client.search(title="Froissart", date_from=1380, date_to=1420, max_results=10)
|
| 21 |
-
>>> corpus = client.import_document(results[0].ark, pages="1-5", include_gallica_ocr=True)
|
| 22 |
"""
|
| 23 |
|
| 24 |
-
from __future__ import annotations
|
| 25 |
-
|
| 26 |
-
import logging
|
| 27 |
-
import re
|
| 28 |
-
import time
|
| 29 |
-
import urllib.error
|
| 30 |
-
import urllib.parse
|
| 31 |
-
import urllib.request
|
| 32 |
-
import xml.etree.ElementTree as ET
|
| 33 |
-
from dataclasses import dataclass
|
| 34 |
-
from typing import Optional
|
| 35 |
-
|
| 36 |
-
from picarones.core.corpus import Corpus
|
| 37 |
-
|
| 38 |
-
logger = logging.getLogger(__name__)
|
| 39 |
-
|
| 40 |
-
# SRU/OAI namespaces
|
| 41 |
-
_NS_SRU = "http://www.loc.gov/zing/srw/"
|
| 42 |
-
_NS_DC = "http://purl.org/dc/elements/1.1/"
|
| 43 |
-
_NS_OAI = "http://www.openarchives.org/OAI/2.0/"
|
| 44 |
-
|
| 45 |
-
_GALLICA_BASE = "https://gallica.bnf.fr"
|
| 46 |
-
_SRU_URL = f"{_GALLICA_BASE}/SRU"
|
| 47 |
-
_IIIF_MANIFEST_TPL = f"{_GALLICA_BASE}/ark:/{{ark}}/manifest.json"
|
| 48 |
-
_OCR_BRUT_TPL = f"{_GALLICA_BASE}/ark:/{{ark}}/f{{page}}.texteBrut"
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
# ---------------------------------------------------------------------------
|
| 52 |
-
# Data structures
|
| 53 |
-
# ---------------------------------------------------------------------------
|
| 54 |
-
|
| 55 |
-
@dataclass
|
| 56 |
-
class GallicaRecord:
|
| 57 |
-
"""A Gallica search result."""
|
| 58 |
-
ark: str
|
| 59 |
-
"""ARK identifier without prefix (e.g. ``'12148/btv1b8453561w'``)."""
|
| 60 |
-
title: str
|
| 61 |
-
creator: str = ""
|
| 62 |
-
date: str = ""
|
| 63 |
-
description: str = ""
|
| 64 |
-
type_doc: str = ""
|
| 65 |
-
language: str = ""
|
| 66 |
-
rights: str = ""
|
| 67 |
-
has_ocr: bool = False
|
| 68 |
-
"""True if Gallica provides OCR for this document."""
|
| 69 |
-
|
| 70 |
-
@property
|
| 71 |
-
def url(self) -> str:
|
| 72 |
-
return f"{_GALLICA_BASE}/ark:/12148/{self.ark}"
|
| 73 |
-
|
| 74 |
-
@property
|
| 75 |
-
def manifest_url(self) -> str:
|
| 76 |
-
return f"{_GALLICA_BASE}/ark:/12148/{self.ark}/manifest.json"
|
| 77 |
-
|
| 78 |
-
def as_dict(self) -> dict:
|
| 79 |
-
return {
|
| 80 |
-
"ark": self.ark,
|
| 81 |
-
"title": self.title,
|
| 82 |
-
"creator": self.creator,
|
| 83 |
-
"date": self.date,
|
| 84 |
-
"description": self.description,
|
| 85 |
-
"type_doc": self.type_doc,
|
| 86 |
-
"language": self.language,
|
| 87 |
-
"has_ocr": self.has_ocr,
|
| 88 |
-
"url": self.url,
|
| 89 |
-
"manifest_url": self.manifest_url,
|
| 90 |
-
}
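The three Gallica URL shapes listed in the module docstring can be derived from one root. The removed module does not appear fully consistent about whether `ark` carries the `12148/` authority (the field docstring includes it, the `url` property prepends it), so this sketch assumes a bare ARK name without it. `gallica_urls` and `GALLICA_BASE` are illustrative names, not part of the module.

```python
GALLICA_BASE = "https://gallica.bnf.fr"


def gallica_urls(ark_name: str, page: int = 1) -> dict[str, str]:
    """Build document, IIIF manifest and plain-text OCR URLs for one Gallica item.

    Assumption: `ark_name` is the bare name (e.g. "btv1b8453561w"), without "12148/".
    """
    root = f"{GALLICA_BASE}/ark:/12148/{ark_name}"
    return {
        "document": root,
        "manifest": f"{root}/manifest.json",
        "ocr_text": f"{root}/f{page}.texteBrut",
    }


for kind, url in gallica_urls("btv1b8453561w", page=3).items():
    print(f"{kind}: {url}")
```

Deriving all URLs from a single root avoids the duplicated f-strings seen in `url` and `manifest_url`.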
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
# ---------------------------------------------------------------------------
|
| 94 |
-
# Gallica client
|
| 95 |
-
# ---------------------------------------------------------------------------
|
| 96 |
-
|
| 97 |
-
class GallicaClient:
|
| 98 |
-
"""Client for the Gallica APIs (SRU, IIIF, plain-text OCR).
|
| 99 |
-
|
| 100 |
-
Parameters
|
| 101 |
-
----------
|
| 102 |
-
timeout:
|
| 103 |
-
HTTP timeout in seconds.
|
| 104 |
-
delay_between_requests:
|
| 105 |
-
Delay in seconds between requests (to comply with the Gallica
|
| 106 |
-
terms of use).
|
| 107 |
-
|
| 108 |
-
Examples
|
| 109 |
-
--------
|
| 110 |
-
>>> client = GallicaClient()
|
| 111 |
-
>>> results = client.search(author="Froissart", max_results=5)
|
| 112 |
-
>>> for r in results:
|
| 113 |
-
... print(r.title, r.date)
|
| 114 |
-
>>> corpus = client.import_document(results[0].ark, pages="1-3")
|
| 115 |
-
"""
|
| 116 |
-
|
| 117 |
-
def __init__(
|
| 118 |
-
self,
|
| 119 |
-
timeout: int = 30,
|
| 120 |
-
delay_between_requests: float = 0.5,
|
| 121 |
-
) -> None:
|
| 122 |
-
self.timeout = timeout
|
| 123 |
-
self.delay = delay_between_requests
|
| 124 |
-
|
| 125 |
-
# Workstream 4 (post-Sprint 97), Gallica → IIIF merge:
|
| 126 |
-
# ``_validate_url`` and the HTTP fetch are now factored out
|
| 127 |
-
# into :mod:`picarones.importers._http`. Before this workstream these
|
| 128 |
-
# 30 lines were duplicated with :mod:`iiif`. The polite
|
| 129 |
-
# ``delay_between_requests`` stays here (specific to the BnF).
|
| 130 |
-
|
| 131 |
-
@staticmethod
|
| 132 |
-
def _validate_url(url: str) -> None:
|
| 133 |
-
"""Delegate to :func:`picarones.importers._http.validate_http_url`."""
|
| 134 |
-
from picarones.importers._http import validate_http_url
|
| 135 |
-
validate_http_url(url)
|
| 136 |
-
|
| 137 |
-
def _fetch_url(self, url: str) -> bytes:
|
| 138 |
-
"""Download the content of a URL while honouring the BnF polite delay.
|
| 139 |
-
|
| 140 |
-
Delegates to :func:`picarones.importers._http.download_url`, then
|
| 141 |
-
waits ``self.delay`` seconds (0.5 s by default) between requests
|
| 142 |
-
to comply with the Gallica terms of use.
|
| 143 |
-
"""
|
| 144 |
-
from picarones.importers._http import download_url
|
| 145 |
-
try:
|
| 146 |
-
return download_url(
|
| 147 |
-
url,
|
| 148 |
-
retries=1,
|
| 149 |
-
timeout=self.timeout,
|
| 150 |
-
user_agent="Picarones/1.0 (research tool)",
|
| 151 |
-
)
|
| 152 |
-
except RuntimeError as exc:
|
| 153 |
-
# The helper raises ``RuntimeError`` once retries are exhausted.
|
| 154 |
-
# We re-wrap it to keep the historical message format
|
| 155 |
-
# expected by the Gallica tests (« HTTP 404 sur ... »).
|
| 156 |
-
raise RuntimeError(str(exc)) from exc
|
| 157 |
-
finally:
|
| 158 |
-
if self.delay > 0:
|
| 159 |
-
time.sleep(self.delay)
|
| 160 |
-
|
| 161 |
-
def _build_sru_query(
|
| 162 |
-
self,
|
| 163 |
-
ark: Optional[str] = None,
|
| 164 |
-
title: Optional[str] = None,
|
| 165 |
-
author: Optional[str] = None,
|
| 166 |
-
date_from: Optional[int] = None,
|
| 167 |
-
date_to: Optional[int] = None,
|
| 168 |
-
doc_type: Optional[str] = None,
|
| 169 |
-
language: Optional[str] = None,
|
| 170 |
-
) -> str:
|
| 171 |
-
"""Build a CQL query for the BnF SRU API."""
|
| 172 |
-
clauses: list[str] = []
|
| 173 |
-
|
| 174 |
-
if ark:
|
| 175 |
-
# Search by ARK identifier
|
| 176 |
-
clauses.append(f'dc.identifier any "{ark}"')
|
| 177 |
-
if title:
|
| 178 |
-
clauses.append(f'dc.title all "{title}"')
|
| 179 |
-
if author:
|
| 180 |
-
clauses.append(f'dc.creator all "{author}"')
|
| 181 |
-
if date_from and date_to:
|
| 182 |
-
clauses.append(f'dc.date >= "{date_from}" and dc.date <= "{date_to}"')
|
| 183 |
-
elif date_from:
|
| 184 |
-
clauses.append(f'dc.date >= "{date_from}"')
|
| 185 |
-
elif date_to:
|
| 186 |
-
clauses.append(f'dc.date <= "{date_to}"')
|
| 187 |
-
if doc_type:
|
| 188 |
-
clauses.append(f'dc.type all "{doc_type}"')
|
| 189 |
-
if language:
|
| 190 |
-
clauses.append(f'dc.language all "{language}"')
|
| 191 |
-
|
| 192 |
-
if not clauses:
|
| 193 |
-
return 'gallica all "document"'
|
| 194 |
-
return " and ".join(clauses)
|
| 195 |
-
|
| 196 |
-
def search(
|
| 197 |
-
self,
|
| 198 |
-
ark: Optional[str] = None,
|
| 199 |
-
title: Optional[str] = None,
|
| 200 |
-
author: Optional[str] = None,
|
| 201 |
-
date_from: Optional[int] = None,
|
| 202 |
-
date_to: Optional[int] = None,
|
| 203 |
-
doc_type: Optional[str] = None,
|
| 204 |
-
language: Optional[str] = None,
|
| 205 |
-
max_results: int = 20,
|
| 206 |
-
) -> list[GallicaRecord]:
|
| 207 |
-
"""Recherche dans Gallica via l'API SRU BnF.
|
| 208 |
-
|
| 209 |
-
Parameters
|
| 210 |
-
----------
|
| 211 |
-
ark:
|
| 212 |
-
Identifiant ARK (ex : ``'12148/btv1b8453561w'``).
|
| 213 |
-
title:
|
| 214 |
-
Mots-clés dans le titre.
|
| 215 |
-
author:
|
| 216 |
-
Mots-clés dans l'auteur/créateur.
|
| 217 |
-
date_from:
|
| 218 |
-
Borne inférieure de date (année).
|
| 219 |
-
date_to:
|
| 220 |
-
Borne supérieure de date (année).
|
| 221 |
-
doc_type:
|
| 222 |
-
Type de document (``'monographie'``, ``'périodique'``, ``'manuscrit'``…).
|
| 223 |
-
language:
|
| 224 |
-
Code langue ISO 639 (``'fre'``, ``'lat'``, ``'ger'``…).
|
| 225 |
-
max_results:
|
| 226 |
-
Nombre maximum de résultats à retourner.
|
| 227 |
-
|
| 228 |
-
Returns
|
| 229 |
-
-------
|
| 230 |
-
list[GallicaRecord]
|
| 231 |
-
Liste des documents trouvés.
|
| 232 |
-
"""
|
| 233 |
-
query = self._build_sru_query(
|
| 234 |
-
ark=ark,
|
| 235 |
-
title=title,
|
| 236 |
-
author=author,
|
| 237 |
-
date_from=date_from,
|
| 238 |
-
date_to=date_to,
|
| 239 |
-
doc_type=doc_type,
|
| 240 |
-
language=language,
|
| 241 |
-
)
|
| 242 |
-
|
| 243 |
-
params = urllib.parse.urlencode({
|
| 244 |
-
"operation": "searchRetrieve",
|
| 245 |
-
"version": "1.2",
|
| 246 |
-
"query": query,
|
| 247 |
-
"maximumRecords": min(max_results, 50),
|
| 248 |
-
"startRecord": 1,
|
| 249 |
-
"recordSchema": "unimarcXchange",
|
| 250 |
-
})
|
| 251 |
-
url = f"{_SRU_URL}?{params}"
|
| 252 |
-
|
| 253 |
-
try:
|
| 254 |
-
raw = self._fetch_url(url)
|
| 255 |
-
except RuntimeError as exc:
|
| 256 |
-
logger.error("Erreur recherche SRU Gallica: %s", exc)
|
| 257 |
-
return []
|
| 258 |
-
|
| 259 |
-
return self._parse_sru_response(raw, max_results)
|
| 260 |
-
|
| 261 |
-
def _parse_sru_response(self, xml_bytes: bytes, max_results: int) -> list[GallicaRecord]:
|
| 262 |
-
"""Parse Gallica's SRU XML response."""
|
| 263 |
-
records: list[GallicaRecord] = []
|
| 264 |
-
try:
|
| 265 |
-
root = ET.fromstring(xml_bytes)
|
| 266 |
-
except ET.ParseError as exc:
|
| 267 |
-
logger.error("Impossible de parser la réponse SRU: %s", exc)
|
| 268 |
-
return records
|
| 269 |
-
|
| 270 |
-
# Records live under srw:records/srw:record/srw:recordData
|
| 271 |
-
for rec_elem in root.iter():
|
| 272 |
-
if rec_elem.tag.endswith("}record") or rec_elem.tag == "record":
|
| 273 |
-
record = self._parse_record_element(rec_elem)
|
| 274 |
-
if record:
|
| 275 |
-
records.append(record)
|
| 276 |
-
if len(records) >= max_results:
|
| 277 |
-
break
|
| 278 |
-
|
| 279 |
-
return records
|
| 280 |
-
|
| 281 |
-
def _parse_record_element(self, elem: ET.Element) -> Optional[GallicaRecord]:
|
| 282 |
-
"""Extract the metadata of an SRU record."""
|
| 283 |
-
# Look up the Dublin Core fields in the record
|
| 284 |
-
def find_text(tag_suffix: str) -> str:
|
| 285 |
-
for child in elem.iter():
|
| 286 |
-
if child.tag.endswith(tag_suffix) and child.text:
|
| 287 |
-
return child.text.strip()
|
| 288 |
-
return ""
|
| 289 |
-
|
| 290 |
-
def find_all_text(tag_suffix: str) -> list[str]:
|
| 291 |
-
return [
|
| 292 |
-
child.text.strip()
|
| 293 |
-
for child in elem.iter()
|
| 294 |
-
if child.tag.endswith(tag_suffix) and child.text
|
| 295 |
-
]
|
| 296 |
-
|
| 297 |
-
# Look for the ARK in the identifier
|
| 298 |
-
identifiers = find_all_text("identifier")
|
| 299 |
-
ark = ""
|
| 300 |
-
for ident in identifiers:
|
| 301 |
-
# Typical format: "https://gallica.bnf.fr/ark:/12148/btv1b8453561w"
|
| 302 |
-
m = re.search(r"ark:/(\d+/\w+)", ident)
|
| 303 |
-
if m:
|
| 304 |
-
ark = m.group(1)
|
| 305 |
-
break
|
| 306 |
-
|
| 307 |
-
if not ark:
|
| 308 |
-
return None
|
| 309 |
-
|
| 310 |
-
title = find_text("title") or "Sans titre"
|
| 311 |
-
creator = find_text("creator")
|
| 312 |
-
date = find_text("date")
|
| 313 |
-
|
| 314 |
-
# Check whether OCR is likely available (heuristic: usually monograph/periodical types)
|
| 315 |
-
doc_types = find_all_text("type")
|
| 316 |
-
has_ocr = any(
|
| 317 |
-
t.lower() in ("monographie", "fascicule", "texte", "text")
|
| 318 |
-
for t in doc_types
|
| 319 |
-
)
|
| 320 |
-
|
| 321 |
-
return GallicaRecord(
|
| 322 |
-
ark=ark,
|
| 323 |
-
title=title,
|
| 324 |
-
creator=creator,
|
| 325 |
-
date=date,
|
| 326 |
-
description=find_text("description"),
|
| 327 |
-
type_doc=", ".join(doc_types),
|
| 328 |
-
language=find_text("language"),
|
| 329 |
-
has_ocr=has_ocr,
|
| 330 |
-
)
|
| 331 |
-
|
| 332 |
-
def get_ocr_text(self, ark: str, page: int) -> str:
|
| 333 |
-
"""Récupère l'OCR Gallica d'une page spécifique (texte brut).
|
| 334 |
-
|
| 335 |
-
Parameters
|
| 336 |
-
----------
|
| 337 |
-
ark:
|
| 338 |
-
Identifiant ARK (ex : ``'12148/btv1b8453561w'``).
|
| 339 |
-
page:
|
| 340 |
-
Numéro de page 1-based.
|
| 341 |
-
|
| 342 |
-
Returns
|
| 343 |
-
-------
|
| 344 |
-
str
|
| 345 |
-
Texte OCR Gallica pour cette page (peut être vide si non disponible).
|
| 346 |
-
"""
|
| 347 |
-
url = _OCR_BRUT_TPL.format(ark=ark, page=page)
|
| 348 |
-
try:
|
| 349 |
-
raw = self._fetch_url(url)
|
| 350 |
-
text = raw.decode("utf-8", errors="replace").strip()
|
| 351 |
-
# Gallica sometimes returns HTML for pages without OCR
|
| 352 |
-
if text.startswith("<!") or "<html" in text[:100].lower():
|
| 353 |
-
return ""
|
| 354 |
-
return text
|
| 355 |
-
except RuntimeError as exc:
|
| 356 |
-
logger.debug("OCR non disponible pour %s f%d: %s", ark, page, exc)
|
| 357 |
-
return ""
|
| 358 |
-
|
| 359 |
-
def import_document(
|
| 360 |
-
self,
|
| 361 |
-
ark: str,
|
| 362 |
-
pages: str = "all",
|
| 363 |
-
output_dir: Optional[str] = None,
|
| 364 |
-
include_gallica_ocr: bool = True,
|
| 365 |
-
max_resolution: int = 0,
|
| 366 |
-
show_progress: bool = True,
|
| 367 |
-
) -> Corpus:
|
| 368 |
-
"""Importe un document Gallica comme corpus Picarones.
|
| 369 |
-
|
| 370 |
-
Utilise le manifeste IIIF Gallica pour lister les pages et télécharger
|
| 371 |
-
les images. L'OCR Gallica est optionnellement récupéré comme GT ou comme
|
| 372 |
-
transcription de référence.
|
| 373 |
-
|
| 374 |
-
Parameters
|
| 375 |
-
----------
|
| 376 |
-
ark:
|
| 377 |
-
Identifiant ARK (ex : ``'12148/btv1b8453561w'``).
|
| 378 |
-
pages:
|
| 379 |
-
Sélecteur de pages (``'all'``, ``'1-10'``, ``'1,3,5'``…).
|
| 380 |
-
output_dir:
|
| 381 |
-
Dossier local pour stocker images et GT.
|
| 382 |
-
include_gallica_ocr:
|
| 383 |
-
Si True, récupère l'OCR Gallica comme texte de référence.
|
| 384 |
-
max_resolution:
|
| 385 |
-
Largeur maximale des images téléchargées (0 = maximum disponible).
|
| 386 |
-
show_progress:
|
| 387 |
-
Affiche une barre de progression.
|
| 388 |
-
|
| 389 |
-
Returns
|
| 390 |
-
-------
|
| 391 |
-
Corpus
|
| 392 |
-
Corpus avec images et OCR Gallica comme GT (si disponible).
|
| 393 |
-
"""
|
| 394 |
-
from picarones.importers.iiif import IIIFImporter
|
| 395 |
-
|
| 396 |
-
manifest_url = f"{_GALLICA_BASE}/ark:/12148/{ark}/manifest.json"
|
| 397 |
-
logger.info("Import Gallica ARK %s via IIIF : %s", ark, manifest_url)
|
| 398 |
-
|
| 399 |
-
# Use the existing IIIF importer for the images
|
| 400 |
-
importer = IIIFImporter(manifest_url, max_resolution=max_resolution)
|
| 401 |
-
importer.load()
|
| 402 |
-
|
| 403 |
-
corpus = importer.import_corpus(
|
| 404 |
-
pages=pages,
|
| 405 |
-
output_dir=output_dir or f"./corpus_gallica_{ark.split('/')[-1]}/",
|
| 406 |
-
show_progress=show_progress,
|
| 407 |
-
)
|
| 408 |
-
|
| 409 |
-
# Enrich with the Gallica OCR if requested
|
| 410 |
-
if include_gallica_ocr:
|
| 411 |
-
selected_indices = importer.list_canvases(pages)
|
| 412 |
-
for i, doc in enumerate(corpus.documents):
|
| 413 |
-
page_num = selected_indices[i] + 1 if i < len(selected_indices) else i + 1
|
| 414 |
-
gallica_ocr = self.get_ocr_text(ark, page_num)
|
| 415 |
-
if gallica_ocr:
|
| 416 |
-
doc.metadata["gallica_ocr"] = gallica_ocr
|
| 417 |
-
# If there is no manual GT, use the Gallica OCR as the reference
|
| 418 |
-
if not doc.ground_truth.strip():
|
| 419 |
-
doc.ground_truth = gallica_ocr
|
| 420 |
-
doc.metadata["gt_source"] = "gallica_ocr"
|
| 421 |
-
|
| 422 |
-
# Add Gallica metadata
|
| 423 |
-
corpus.metadata.update({
|
| 424 |
-
"source": "gallica",
|
| 425 |
-
"ark": ark,
|
| 426 |
-
"manifest_url": manifest_url,
|
| 427 |
-
"gallica_url": f"{_GALLICA_BASE}/ark:/12148/{ark}",
|
| 428 |
-
"include_gallica_ocr": include_gallica_ocr,
|
| 429 |
-
})
|
| 430 |
-
|
| 431 |
-
return corpus
|
| 432 |
-
|
| 433 |
-
def get_metadata(self, ark: str) -> dict:
|
| 434 |
-
"""Récupère les métadonnées OAI-PMH d'un document Gallica.
|
| 435 |
-
|
| 436 |
-
Parameters
|
| 437 |
-
----------
|
| 438 |
-
ark:
|
| 439 |
-
Identifiant ARK.
|
| 440 |
-
|
| 441 |
-
Returns
|
| 442 |
-
-------
|
| 443 |
-
dict
|
| 444 |
-
Métadonnées Dublin Core du document.
|
| 445 |
-
"""
|
| 446 |
-
url = f"{_GALLICA_BASE}/services/OAIRecord?ark=ark:/12148/{ark}"
|
| 447 |
-
try:
|
| 448 |
-
raw = self._fetch_url(url)
|
| 449 |
-
root = ET.fromstring(raw)
|
| 450 |
-
except (RuntimeError, ET.ParseError) as exc:
|
| 451 |
-
logger.error("Erreur métadonnées OAI %s: %s", ark, exc)
|
| 452 |
-
return {"ark": ark}
|
| 453 |
-
|
| 454 |
-
def find_text(tag_suffix: str) -> str:
|
| 455 |
-
for elem in root.iter():
|
| 456 |
-
if elem.tag.endswith(tag_suffix) and elem.text:
|
| 457 |
-
return elem.text.strip()
|
| 458 |
-
return ""
|
| 459 |
-
|
| 460 |
-
return {
|
| 461 |
-
"ark": ark,
|
| 462 |
-
"title": find_text("title"),
|
| 463 |
-
"creator": find_text("creator"),
|
| 464 |
-
"date": find_text("date"),
|
| 465 |
-
"description": find_text("description"),
|
| 466 |
-
"subject": find_text("subject"),
|
| 467 |
-
"language": find_text("language"),
|
| 468 |
-
"type": find_text("type"),
|
| 469 |
-
"format": find_text("format"),
|
| 470 |
-
"source": find_text("source"),
|
| 471 |
-
"url": f"{_GALLICA_BASE}/ark:/12148/{ark}",
|
| 472 |
-
}
|
| 473 |
-
|
| 474 |
-
|
| 475 |
-
# ---------------------------------------------------------------------------
|
| 476 |
-
# Convenience functions
|
| 477 |
-
# ---------------------------------------------------------------------------
|
| 478 |
-
|
| 479 |
-
def search_gallica(
|
| 480 |
-
title: Optional[str] = None,
|
| 481 |
-
author: Optional[str] = None,
|
| 482 |
-
ark: Optional[str] = None,
|
| 483 |
-
date_from: Optional[int] = None,
|
| 484 |
-
date_to: Optional[int] = None,
|
| 485 |
-
max_results: int = 20,
|
| 486 |
-
) -> list[GallicaRecord]:
|
| 487 |
-
"""Quick Gallica search.
|
| 488 |
-
|
| 489 |
-
Creates a temporary client and runs a search.
|
| 490 |
-
|
| 491 |
-
Parameters
|
| 492 |
-
----------
|
| 493 |
-
title, author, ark, date_from, date_to:
|
| 494 |
-
Search criteria.
|
| 495 |
-
max_results:
|
| 496 |
-
Maximum number of results.
|
| 497 |
-
|
| 498 |
-
Returns
|
| 499 |
-
-------
|
| 500 |
-
list[GallicaRecord]
|
| 501 |
-
|
| 502 |
-
Examples
|
| 503 |
-
--------
|
| 504 |
-
>>> results = search_gallica(title="Froissart", date_from=1380, date_to=1430)
|
| 505 |
-
>>> for r in results[:3]:
|
| 506 |
-
... print(r.title, r.ark)
|
| 507 |
-
"""
|
| 508 |
-
client = GallicaClient()
|
| 509 |
-
return client.search(
|
| 510 |
-
ark=ark,
|
| 511 |
-
title=title,
|
| 512 |
-
author=author,
|
| 513 |
-
date_from=date_from,
|
| 514 |
-
date_to=date_to,
|
| 515 |
-
max_results=max_results,
|
| 516 |
-
)
|
| 517 |
-
|
| 518 |
-
|
| 519 |
-
def import_gallica_document(
|
| 520 |
-
ark: str,
|
| 521 |
-
pages: str = "all",
|
| 522 |
-
output_dir: Optional[str] = None,
|
| 523 |
-
include_gallica_ocr: bool = True,
|
| 524 |
-
) -> Corpus:
|
| 525 |
-
"""Import a Gallica document in one line.
|
| 526 |
-
|
| 527 |
-
Parameters
|
| 528 |
-
----------
|
| 529 |
-
ark:
|
| 530 |
-
ARK identifier (``'12148/btv1b8453561w'`` or a full URL).
|
| 531 |
-
pages:
|
| 532 |
-
Page selector (``'all'``, ``'1-10'``…).
|
| 533 |
-
output_dir:
|
| 534 |
-
Output directory.
|
| 535 |
-
include_gallica_ocr:
|
| 536 |
-
Include the Gallica OCR as GT.
|
| 537 |
-
|
| 538 |
-
Returns
|
| 539 |
-
-------
|
| 540 |
-
Corpus
|
| 541 |
-
"""
|
| 542 |
-
# Normalize the ARK (extract it from a full URL if needed)
|
| 543 |
-
m = re.search(r"ark:/(\d+/\w+)", ark)
|
| 544 |
-
if m:
|
| 545 |
-
ark = m.group(1)
|
| 546 |
|
| 547 |
-
|
| 548 |
-
|
| 549 |
-
|
| 550 |
-
|
| 551 |
-
output_dir=output_dir,
|
| 552 |
-
include_gallica_ocr=include_gallica_ocr,
|
| 553 |
-
)
|
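The ARK normalization performed at the top of `import_gallica_document` can be exercised in isolation. The following is a minimal sketch, not the project's code: `normalize_ark` and `manifest_url` are hypothetical helper names, and the base URL is assumed from the example in the `_parse_record_element` comment. The manifest URL follows the shape of `_IIIF_MANIFEST_TPL`. If the ARK is passed in the documented `12148/…` form, prefixing it again with `12148/` would duplicate the authority number, so the sketch does not add one.

```python
import re

# Assumed value of _GALLICA_BASE (taken from the example URL in the comments).
GALLICA_BASE = "https://gallica.bnf.fr"

# Same pattern as in the diff: captures "NAAN/name" after "ark:/".
_ARK_RE = re.compile(r"ark:/(\d+/\w+)")


def normalize_ark(value: str) -> str:
    """Return the ``NAAN/name`` part, given a bare ARK or a full URL."""
    match = _ARK_RE.search(value)
    # A bare ARK has no "ark:/" marker, so it is returned unchanged.
    return match.group(1) if match else value


def manifest_url(ark: str) -> str:
    """Build an IIIF manifest URL from a ``NAAN/name`` ARK (hypothetical helper)."""
    return f"{GALLICA_BASE}/ark:/{ark}/manifest.json"


if __name__ == "__main__":
    full = "https://gallica.bnf.fr/ark:/12148/btv1b8453561w"
    print(normalize_ark(full))
    print(manifest_url(normalize_ark(full)))
```

Both a bare ARK and a full Gallica URL go through the same function, so callers do not need to know which form they hold.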
| 1 |
+
"""Backward-compatibility alias: module moved to :mod:`picarones.extras.importers.gallica`.
|
| 2 |
|
| 3 |
+
Phase C of the three-circle refactoring (architecture-cercles.md).
|
| 4 |
+
This importer now lives in Circle 3 (``extras/importers/``). The alias
|
| 5 |
+
here keeps historical imports (``from picarones.importers.gallica
|
| 6 |
+
import ...``) working.
|
| 7 |
|
| 8 |
+
See :doc:`docs/architecture-cercles.md` and the
|
| 9 |
+
``picarones[importers]`` extra in ``pyproject.toml``.
|
| 10 |
"""
|
| 11 |
|
| 12 |
+
from picarones.extras.importers.gallica import * # noqa: F401, F403
|
| 13 |
|
| 14 |
+
import picarones.extras.importers.gallica as _module
|
| 15 |
+
__all__ = getattr(_module, "__all__", [
|
| 16 |
+
name for name in dir(_module) if not name.startswith("_")
|
| 17 |
+
])
|
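The shim re-exports names instead of copying them, which is why identity checks such as `shim.X is extras.X` hold. A minimal, self-contained sketch of that mechanism follows. The module names `demo_extras` and `demo_shim` and the class `Client` are hypothetical stand-ins built in memory, not the real picarones packages.

```python
import sys
import types

# Stand-in for the "real" module (hypothetical name).
real = types.ModuleType("demo_extras")
exec(
    "class Client:\n"
    "    pass\n"
    "_private = 1\n",
    real.__dict__,
)
sys.modules["demo_extras"] = real

# Build the shim the way the alias module does: star-import, then
# derive __all__ from the target module's public names.
shim = types.ModuleType("demo_shim")
exec(
    "from demo_extras import *\n"
    "import demo_extras as _module\n"
    "__all__ = getattr(_module, '__all__', [\n"
    "    name for name in dir(_module) if not name.startswith('_')\n"
    "])\n",
    shim.__dict__,
)
sys.modules["demo_shim"] = shim

if __name__ == "__main__":
    # The shim exposes the very same class object, not a copy.
    print(shim.Client is real.Client)
    print(shim.__all__)
```

A star import skips names that start with an underscore when the target defines no `__all__`, so private helpers of the target module are not reachable through a shim built this way.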
@@ -1,455 +1,17 @@
|
|
| 1 |
-
"""
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
|
|
|
| 6 |
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
- :func:`fetch_catalogue`: downloads the catalogue from GitHub
|
| 10 |
-
- :func:`import_htr_united_corpus`: downloads and imports a corpus
|
| 11 |
-
|
| 12 |
-
Example
|
| 13 |
-
-------
|
| 14 |
-
catalogue = HTRUnitedCatalogue.from_remote()
|
| 15 |
-
results = catalogue.search("français médiéval")
|
| 16 |
-
corpus = import_htr_united_corpus(results[0], output_dir="./corpus/")
|
| 17 |
"""
|
| 18 |
|
| 19 |
-
from
|
| 20 |
-
|
| 21 |
-
import json
|
| 22 |
-
import logging
|
| 23 |
-
import re
|
| 24 |
-
import urllib.error
|
| 25 |
-
import urllib.request
|
| 26 |
-
from dataclasses import dataclass, field
|
| 27 |
-
from pathlib import Path
|
| 28 |
-
from typing import Optional
|
| 29 |
-
|
| 30 |
-
logger = logging.getLogger(__name__)
|
| 31 |
-
|
| 32 |
-
# ---------------------------------------------------------------------------
|
| 33 |
-
# Catalogue remote URL
|
| 34 |
-
# ---------------------------------------------------------------------------
|
| 35 |
-
|
| 36 |
-
_CATALOGUE_URL = (
|
| 37 |
-
"https://raw.githubusercontent.com/HTR-United/htr-united/master/htr-united.yml"
|
| 38 |
-
)
|
| 39 |
-
_CATALOGUE_API_URL = (
|
| 40 |
-
"https://api.github.com/repos/HTR-United/htr-united/contents/htr-united.yml"
|
| 41 |
-
)
|
| 42 |
-
|
| 43 |
-
# Demo / fallback catalogue (offline)
|
| 44 |
-
_DEMO_CATALOGUE: list[dict] = [
|
| 45 |
-
{
|
| 46 |
-
"id": "lectaurep-repertoires",
|
| 47 |
-
"title": "Lectaurep — Répertoires de notaires parisiens",
|
| 48 |
-
"url": "https://github.com/HTR-United/lectaurep-repertoires",
|
| 49 |
-
"language": ["French"],
|
| 50 |
-
"script": ["Cursiva"],
|
| 51 |
-
"century": [17, 18],
|
| 52 |
-
"institution": "Archives nationales (France)",
|
| 53 |
-
"description": "Transcriptions de répertoires de notaires, XVIIe-XVIIIe siècles.",
|
| 54 |
-
"license": "CC-BY 4.0",
|
| 55 |
-
"lines": 12400,
|
| 56 |
-
"format": "ALTO",
|
| 57 |
-
"tags": ["notaires", "Paris", "cursive", "imprimé"],
|
| 58 |
-
},
|
| 59 |
-
{
|
| 60 |
-
"id": "bvmm-manuscripts",
|
| 61 |
-
"title": "BVMM — Manuscrits enluminés",
|
| 62 |
-
"url": "https://github.com/HTR-United/bvmm-manuscripts",
|
| 63 |
-
"language": ["Latin", "French"],
|
| 64 |
-
"script": ["Gothic"],
|
| 65 |
-
"century": [13, 14, 15],
|
| 66 |
-
"institution": "IRHT",
|
| 67 |
-
"description": "Manuscrits médiévaux latins et français, XIIIe-XVe siècles.",
|
| 68 |
-
"license": "CC-BY 4.0",
|
| 69 |
-
"lines": 8700,
|
| 70 |
-
"format": "ALTO",
|
| 71 |
-
"tags": ["manuscrits", "latin", "médiéval", "enluminure"],
|
| 72 |
-
},
|
| 73 |
-
{
|
| 74 |
-
"id": "cremma-medieval",
|
| 75 |
-
"title": "CREMMA Médiéval",
|
| 76 |
-
"url": "https://github.com/HTR-United/cremma-medieval",
|
| 77 |
-
"language": ["French", "Latin"],
|
| 78 |
-
"script": ["Gothic", "Humanistica"],
|
| 79 |
-
"century": [12, 13, 14, 15],
|
| 80 |
-
"institution": "École des chartes / Inria",
|
| 81 |
-
"description": "Corpus CREMMA de manuscrits médiévaux français et latins.",
|
| 82 |
-
"license": "CC-BY 4.0",
|
| 83 |
-
"lines": 6200,
|
| 84 |
-
"format": "ALTO",
|
| 85 |
-
"tags": ["médiéval", "chartes", "manuscrits"],
|
| 86 |
-
},
|
| 87 |
-
{
|
| 88 |
-
"id": "simssa-ocr-printed",
|
| 89 |
-
"title": "SIMSSA — Imprimés anciens (XVe-XVIIe)",
|
| 90 |
-
"url": "https://github.com/HTR-United/simssa-printed",
|
| 91 |
-
"language": ["French", "Latin"],
|
| 92 |
-
"script": ["Rotunda", "Roman"],
|
| 93 |
-
"century": [15, 16, 17],
|
| 94 |
-
"institution": "McGill University",
|
| 95 |
-
"description": "Corpus d'imprimés anciens romains et gothiques.",
|
| 96 |
-
"license": "CC-BY 4.0",
|
| 97 |
-
"lines": 4500,
|
| 98 |
-
"format": "PAGE",
|
| 99 |
-
"tags": ["imprimés", "incunables", "roman", "gothique"],
|
| 100 |
-
},
|
| 101 |
-
{
|
| 102 |
-
"id": "fonds-gallica-presse",
|
| 103 |
-
"title": "Presse ancienne — Gallica (XIXe)",
|
| 104 |
-
"url": "https://github.com/HTR-United/gallica-presse-xix",
|
| 105 |
-
"language": ["French"],
|
| 106 |
-
"script": ["Roman"],
|
| 107 |
-
"century": [19],
|
| 108 |
-
"institution": "Gallica",
|
| 109 |
-
"description": "Numérisations de journaux du XIXe siècle (Gallica).",
|
| 110 |
-
"license": "etalab-2.0",
|
| 111 |
-
"lines": 31000,
|
| 112 |
-
"format": "ALTO",
|
| 113 |
-
"tags": ["presse", "XIXe", "Gallica", "journaux"],
|
| 114 |
-
},
|
| 115 |
-
{
|
| 116 |
-
"id": "archives-departem-correspondances",
|
| 117 |
-
"title": "Correspondances administratives (XVIIIe-XIXe)",
|
| 118 |
-
"url": "https://github.com/HTR-United/correspondances-admin",
|
| 119 |
-
"language": ["French"],
|
| 120 |
-
"script": ["Cursiva"],
|
| 121 |
-
"century": [18, 19],
|
| 122 |
-
"institution": "Archives départementales",
|
| 123 |
-
"description": "Lettres et correspondances administratives manuscrites.",
|
| 124 |
-
"license": "CC-BY 4.0",
|
| 125 |
-
"lines": 9800,
|
| 126 |
-
"format": "ALTO",
|
| 127 |
-
"tags": ["correspondances", "administratif", "cursive"],
|
| 128 |
-
},
|
| 129 |
-
{
|
| 130 |
-
"id": "e-codices-latin",
|
| 131 |
-
"title": "e-codices — Manuscrits latins (Suisse)",
|
| 132 |
-
"url": "https://github.com/HTR-United/e-codices-latin",
|
| 133 |
-
"language": ["Latin"],
|
| 134 |
-
"script": ["Caroline", "Gothic"],
|
| 135 |
-
"century": [9, 10, 11, 12],
|
| 136 |
-
"institution": "Bibliothèque cantonale universitaire de Lausanne",
|
| 137 |
-
"description": "Manuscrits carolingiens et gothiques des bibliothèques suisses.",
|
| 138 |
-
"license": "CC-BY 4.0",
|
| 139 |
-
"lines": 3100,
|
| 140 |
-
"format": "ALTO",
|
| 141 |
-
"tags": ["caroline", "latin", "médiéval", "Suisse"],
|
| 142 |
-
},
|
| 143 |
-
{
|
| 144 |
-
"id": "registres-paroissiaux-17",
|
| 145 |
-
"title": "Registres paroissiaux — Bretagne (XVIIe)",
|
| 146 |
-
"url": "https://github.com/HTR-United/registres-paroissiaux-bretagne",
|
| 147 |
-
"language": ["French", "Latin"],
|
| 148 |
-
"script": ["Cursiva"],
|
| 149 |
-
"century": [17],
|
| 150 |
-
"institution": "Archives départementales du Finistère",
|
| 151 |
-
"description": "Registres paroissiaux bretons du XVIIe siècle.",
|
| 152 |
-
"license": "CC-BY 4.0",
|
| 153 |
-
"lines": 15600,
|
| 154 |
-
"format": "ALTO",
|
| 155 |
-
"tags": ["registres", "Bretagne", "paroissial", "cursive"],
|
| 156 |
-
},
|
| 157 |
-
]
|
| 158 |
-
|
| 159 |
-
|
| 160 |
-
# ---------------------------------------------------------------------------
|
| 161 |
-
# Catalogue entry dataclass
|
| 162 |
-
# ---------------------------------------------------------------------------
|
| 163 |
-
|
| 164 |
-
@dataclass
|
| 165 |
-
class HTRUnitedEntry:
|
| 166 |
-
"""An entry in the HTR-United catalogue."""
|
| 167 |
-
|
| 168 |
-
id: str
|
| 169 |
-
title: str
|
| 170 |
-
url: str
|
| 171 |
-
language: list[str] = field(default_factory=list)
|
| 172 |
-
script: list[str] = field(default_factory=list)
|
| 173 |
-
century: list[int] = field(default_factory=list)
|
| 174 |
-
institution: str = ""
|
| 175 |
-
description: str = ""
|
| 176 |
-
license: str = ""
|
| 177 |
-
lines: int = 0
|
| 178 |
-
format: str = "ALTO"
|
| 179 |
-
tags: list[str] = field(default_factory=list)
|
| 180 |
-
|
| 181 |
-
def as_dict(self) -> dict:
|
| 182 |
-
return {
|
| 183 |
-
"id": self.id,
|
| 184 |
-
"title": self.title,
|
| 185 |
-
"url": self.url,
|
| 186 |
-
"language": self.language,
|
| 187 |
-
"script": self.script,
|
| 188 |
-
"century": self.century,
|
| 189 |
-
"institution": self.institution,
|
| 190 |
-
"description": self.description,
|
| 191 |
-
"license": self.license,
|
| 192 |
-
"lines": self.lines,
|
| 193 |
-
"format": self.format,
|
| 194 |
-
"tags": self.tags,
|
| 195 |
-
}
|
| 196 |
-
|
| 197 |
-
@classmethod
|
| 198 |
-
def from_dict(cls, d: dict) -> "HTRUnitedEntry":
|
| 199 |
-
return cls(
|
| 200 |
-
id=d.get("id", ""),
|
| 201 |
-
title=d.get("title", ""),
|
| 202 |
-
url=d.get("url", ""),
|
| 203 |
-
language=d.get("language", []),
|
| 204 |
-
script=d.get("script", []),
|
| 205 |
-
century=d.get("century", []),
|
| 206 |
-
institution=d.get("institution", ""),
|
| 207 |
-
description=d.get("description", ""),
|
| 208 |
-
license=d.get("license", ""),
|
| 209 |
-
lines=d.get("lines", 0),
|
| 210 |
-
format=d.get("format", "ALTO"),
|
| 211 |
-
tags=d.get("tags", []),
|
| 212 |
-
)
|
| 213 |
-
|
| 214 |
-
@property
|
| 215 |
-
def century_str(self) -> str:
|
| 216 |
-
"""Centuries formatted as Roman numerals."""
|
| 217 |
-
roman = {
|
| 218 |
-
1: "Ier", 2: "IIe", 3: "IIIe", 4: "IVe", 5: "Ve",
|
| 219 |
-
6: "VIe", 7: "VIIe", 8: "VIIIe", 9: "IXe", 10: "Xe",
|
| 220 |
-
11: "XIe", 12: "XIIe", 13: "XIIIe", 14: "XIVe", 15: "XVe",
|
| 221 |
-
16: "XVIe", 17: "XVIIe", 18: "XVIIIe", 19: "XIXe", 20: "XXe",
|
| 222 |
-
}
|
| 223 |
-
return ", ".join(roman.get(c, f"{c}e") for c in self.century)
|
| 224 |
-
|
| 225 |
-
|
| 226 |
-
# ---------------------------------------------------------------------------
|
| 227 |
-
# Catalogue
|
| 228 |
-
# ---------------------------------------------------------------------------
|
| 229 |
-
|
| 230 |
-
class HTRUnitedCatalogue:
|
| 231 |
-
"""HTR-United catalogue with search and filtering."""
|
| 232 |
-
|
| 233 |
-
def __init__(self, entries: list[HTRUnitedEntry], source: str = "demo") -> None:
|
| 234 |
-
self.entries = entries
|
| 235 |
-
self.source = source # "remote" | "demo" | "cache"
|
| 236 |
-
|
| 237 |
-
def __len__(self) -> int:
|
| 238 |
-
return len(self.entries)
|
| 239 |
-
|
| 240 |
-
@classmethod
|
| 241 |
-
def from_demo(cls) -> "HTRUnitedCatalogue":
|
| 242 |
-
"""Load the built-in demo catalogue."""
|
| 243 |
-
entries = [HTRUnitedEntry.from_dict(d) for d in _DEMO_CATALOGUE]
|
| 244 |
-
return cls(entries, source="demo")
|
| 245 |
-
|
| 246 |
-
@classmethod
|
| 247 |
-
def from_remote(cls, timeout: int = 10) -> "HTRUnitedCatalogue":
|
| 248 |
-
"""Download the catalogue from GitHub.
|
| 249 |
-
|
| 250 |
-
On a network error, returns the demo catalogue instead.
|
| 251 |
-
"""
|
| 252 |
-
try:
|
| 253 |
-
req = urllib.request.Request(
|
| 254 |
-
_CATALOGUE_URL,
|
| 255 |
-
headers={"User-Agent": "picarones-htr-united-importer/1.0"},
|
| 256 |
-
)
|
| 257 |
-
with urllib.request.urlopen(req, timeout=timeout) as resp:
|
| 258 |
-
raw = resp.read().decode("utf-8")
|
| 259 |
-
entries = _parse_yml_catalogue(raw)
|
| 260 |
-
return cls(entries, source="remote")
|
| 261 |
-
except (urllib.error.URLError, Exception) as exc:
|
| 262 |
-
# Fall back to the demo catalogue with a warning
|
| 263 |
-
logger.warning(
|
| 264 |
-
"[HTR-United] impossible de charger le catalogue distant (%s) : %s. "
|
| 265 |
-
"Utilisation des données de démonstration.",
|
| 266 |
-
_CATALOGUE_URL, exc,
|
| 267 |
-
)
|
| 268 |
-
return cls.from_demo()
|
| 269 |
-
|
| 270 |
-
def search(
|
| 271 |
-
self,
|
| 272 |
-
query: str = "",
|
| 273 |
-
language: Optional[str] = None,
|
| 274 |
-
script: Optional[str] = None,
|
| 275 |
-
century_min: Optional[int] = None,
|
| 276 |
-
century_max: Optional[int] = None,
|
| 277 |
-
) -> list[HTRUnitedEntry]:
|
| 278 |
-
"""Search the catalogue with optional filters."""
|
| 279 |
-
results = self.entries
|
| 280 |
-
|
| 281 |
-
if query:
|
| 282 |
-
q = query.lower()
|
| 283 |
-
results = [
|
| 284 |
-
e for e in results
|
| 285 |
-
if (q in e.title.lower()
|
| 286 |
-
or q in e.description.lower()
|
| 287 |
-
or q in e.institution.lower()
|
| 288 |
-
or any(q in t.lower() for t in e.tags)
|
| 289 |
-
or any(q in lang.lower() for lang in e.language))
|
| 290 |
-
]
|
| 291 |
-
|
| 292 |
-
if language:
|
| 293 |
-
lang_lower = language.lower()
|
| 294 |
-
results = [
|
| 295 |
-
e for e in results
|
| 296 |
-
if any(lang_lower in lg.lower() for lg in e.language)
|
| 297 |
-
]
|
| 298 |
-
|
| 299 |
-
if script:
|
| 300 |
-
sc_lower = script.lower()
|
| 301 |
-
results = [
|
| 302 |
-
e for e in results
|
| 303 |
-
if any(sc_lower in s.lower() for s in e.script)
|
| 304 |
-
]
|
| 305 |
-
|
| 306 |
-
if century_min is not None:
|
| 307 |
-
results = [
|
| 308 |
-
e for e in results
|
| 309 |
-
if any(c >= century_min for c in e.century)
|
| 310 |
-
]
|
| 311 |
-
|
| 312 |
-
if century_max is not None:
|
| 313 |
-
results = [
|
| 314 |
-
e for e in results
|
| 315 |
-
if any(c <= century_max for c in e.century)
|
| 316 |
-
]
|
| 317 |
-
|
| 318 |
-
return results
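Reviewer sketch of how the filters in `search` compose. It uses a hypothetical stand-in `Entry` dataclass and a simplified text query (title only), not the real `HTRUnitedEntry`. The point it illustrates is that `century_min` and `century_max` are each tested with `any(...)` independently, so an entry whose centuries straddle a window can match even when no single century falls inside it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    """Hypothetical stand-in for HTRUnitedEntry (only the fields used here)."""
    id: str
    title: str
    language: list[str] = field(default_factory=list)
    century: list[int] = field(default_factory=list)


def search(
    entries: list[Entry],
    query: str = "",
    language: Optional[str] = None,
    century_min: Optional[int] = None,
    century_max: Optional[int] = None,
) -> list[Entry]:
    """Apply each filter in turn; every active filter narrows the result list."""
    results = entries
    if query:
        q = query.lower()
        results = [e for e in results if q in e.title.lower()]
    if language:
        lang = language.lower()
        results = [e for e in results if any(lang in lg.lower() for lg in e.language)]
    if century_min is not None:
        results = [e for e in results if any(c >= century_min for c in e.century)]
    if century_max is not None:
        results = [e for e in results if any(c <= century_max for c in e.century)]
    return results


entries = [
    Entry("a", "Medieval charters", ["Latin"], [12, 13]),
    Entry("b", "Modern letters", ["French"], [19]),
    Entry("c", "Long-running register", ["French", "Latin"], [12, 18]),
]

# Substring, case-insensitive language match combined with an upper bound.
medieval_latin = search(entries, language="latin", century_max=13)

# Entry "c" has centuries 12 and 18: 18 satisfies the minimum and 12 satisfies
# the maximum, so it matches a 14-16 window although neither century is in it.
window = search(entries, century_min=14, century_max=16)
```

If strict window semantics were wanted, a single test such as `any(century_min <= c <= century_max for c in e.century)` would be one way to express it; whether that is the intended behaviour is not stated in the diff.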
|
| 319 |
-
|
| 320 |
-
def get_by_id(self, entry_id: str) -> Optional[HTRUnitedEntry]:
|
| 321 |
-
"""Return an entry by its identifier."""
|
| 322 |
-
for e in self.entries:
|
| 323 |
-
if e.id == entry_id:
|
| 324 |
-
return e
|
| 325 |
-
return None
|
| 326 |
-
|
| 327 |
-
def available_languages(self) -> list[str]:
|
| 328 |
-
seen: set[str] = set()
|
| 329 |
-
result: list[str] = []
|
| 330 |
-
for e in self.entries:
|
| 331 |
-
for lang in e.language:
|
| 332 |
-
if lang not in seen:
|
| 333 |
-
seen.add(lang)
|
| 334 |
-
result.append(lang)
|
| 335 |
-
return sorted(result)
|
| 336 |
-
|
| 337 |
-
def available_scripts(self) -> list[str]:
|
| 338 |
-
seen: set[str] = set()
|
| 339 |
-
result: list[str] = []
|
| 340 |
-
for e in self.entries:
|
| 341 |
-
for sc in e.script:
|
| 342 |
-
if sc not in seen:
|
| 343 |
-
seen.add(sc)
|
| 344 |
-
result.append(sc)
|
| 345 |
-
return sorted(result)
|
| 346 |
-
|
| 347 |
-
|
| 348 |
-
# ---------------------------------------------------------------------------
|
| 349 |
-
# Corpus import
|
| 350 |
-
# ---------------------------------------------------------------------------
|
| 351 |
-
|
| 352 |
-
def import_htr_united_corpus(
|
| 353 |
-
entry: HTRUnitedEntry,
|
| 354 |
-
output_dir: str | Path,
|
| 355 |
-
max_samples: int = 100,
|
| 356 |
-
show_progress: bool = True,
|
| 357 |
-
) -> dict:
|
| 358 |
-
"""Import an HTR-United corpus into a local directory.
|
| 359 |
-
|
| 360 |
-
Return a dict with the import metadata.
|
| 361 |
-
Note: without network access to the GitHub repository, generates placeholder
|
| 362 |
-
files (for tests and demos).
|
| 363 |
-
"""
|
| 364 |
-
output_path = Path(output_dir)
|
| 365 |
-
output_path.mkdir(parents=True, exist_ok=True)
|
| 366 |
-
|
| 367 |
-
# Save the metadata
|
| 368 |
-
meta = {
|
| 369 |
-
"source": "htr-united",
|
| 370 |
-
"entry_id": entry.id,
|
| 371 |
-
"title": entry.title,
|
| 372 |
-
"url": entry.url,
|
| 373 |
-
"language": entry.language,
|
| 374 |
-
"script": entry.script,
|
| 375 |
-
"century": entry.century,
|
| 376 |
-
"institution": entry.institution,
|
| 377 |
-
"license": entry.license,
|
| 378 |
-
"format": entry.format,
|
| 379 |
-
"imported_at": _iso_now(),
|
| 380 |
-
}
|
| 381 |
-
(output_path / "htr_united_meta.json").write_text(
|
| 382 |
-
json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8"
|
| 383 |
-
)
|
| 384 |
-
|
| 385 |
-
# Attempt a real download from GitHub (release archives)
|
| 386 |
-
downloaded = _try_download_corpus(entry, output_path, max_samples, show_progress)
|
| 387 |
-
|
| 388 |
-
return {
|
| 389 |
-
"entry_id": entry.id,
|
| 390 |
-
"title": entry.title,
|
| 391 |
-
"output_dir": str(output_path),
|
| 392 |
-
"files_imported": downloaded,
|
| 393 |
-
"metadata_file": str(output_path / "htr_united_meta.json"),
|
| 394 |
-
}
|
| 395 |
-
|
| 396 |
-
|
| 397 |
-
def _try_download_corpus(
|
| 398 |
-
entry: HTRUnitedEntry,
|
| 399 |
-
output_path: Path,
|
| 400 |
-
max_samples: int,
|
| 401 |
-
show_progress: bool,
|
| 402 |
-
) -> int:
|
| 403 |
-
"""Try to download the corpus from GitHub. Return the number of files imported."""
|
| 404 |
-
# Build the URL of the GitHub repository's ZIP archive
|
| 405 |
-
repo_path = _extract_github_repo(entry.url)
|
| 406 |
-
if not repo_path:
|
| 407 |
-
return 0
|
| 408 |
-
|
| 409 |
-
zip_url = f"https://github.com/{repo_path}/archive/refs/heads/main.zip"
|
| 410 |
-
try:
|
| 411 |
-
req = urllib.request.Request(
|
| 412 |
-
zip_url,
|
| 413 |
-
headers={"User-Agent": "picarones-htr-united-importer/1.0"},
|
| 414 |
-
)
|
| 415 |
-
with urllib.request.urlopen(req, timeout=30) as resp:
|
| 416 |
-
import io
|
| 417 |
-
import zipfile
|
| 418 |
-
|
| 419 |
-
data = resp.read()
|
| 420 |
-
with zipfile.ZipFile(io.BytesIO(data)) as zf:
|
| 421 |
-
# Extract the ALTO/PAGE/GT files
|
| 422 |
-
gt_files = [
|
| 423 |
-
n for n in zf.namelist()
|
| 424 |
-
if n.endswith((".alto.xml", ".page.xml", ".gt.txt", ".xml"))
|
| 425 |
-
and not n.endswith("/")
|
| 426 |
-
][:max_samples]
|
| 427 |
-
for i, fname in enumerate(gt_files):
|
| 428 |
-
dest = output_path / Path(fname).name
|
| 429 |
-
dest.write_bytes(zf.read(fname))
|
| 430 |
-
return len(gt_files)
|
| 431 |
-
except Exception:
|
| 432 |
-
return 0
|
| 433 |
-
|
| 434 |
-
|
| 435 |
-
def _extract_github_repo(url: str) -> Optional[str]:
|
| 436 |
-
"""Extract 'owner/repo' from a GitHub URL."""
|
| 437 |
-
m = re.match(r"https?://github\.com/([^/]+/[^/]+?)(?:\.git)?/?$", url)
|
| 438 |
-
return m.group(1) if m else None
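Reviewer sketch of what the `_extract_github_repo` pattern accepts and rejects. The regular expression is copied from the diff; the function name without the leading underscore and the example URLs (`example-org/example-repo`) are placeholders, not entries from the real catalogue.

```python
import re
from typing import Optional

# Same pattern as in the diff: owner/repo, optional ".git", optional trailing "/".
_GITHUB_REPO_RE = re.compile(r"https?://github\.com/([^/]+/[^/]+?)(?:\.git)?/?$")


def extract_github_repo(url: str) -> Optional[str]:
    """Return 'owner/repo' for a repository root URL, else None."""
    m = _GITHUB_REPO_RE.match(url)
    return m.group(1) if m else None


plain = extract_github_repo("https://github.com/example-org/example-repo")
dot_git = extract_github_repo("https://github.com/example-org/example-repo.git")
slash = extract_github_repo("https://github.com/example-org/example-repo/")

# Deeper paths and other hosts do not match, so the caller would import 0 files.
subpath = extract_github_repo("https://github.com/example-org/example-repo/tree/main")
other_host = extract_github_repo("https://gitlab.com/example-org/example-repo")
```

Because only repository root URLs match, a catalogue entry whose `url` points at a subdirectory or a non-GitHub host would make `_try_download_corpus` return 0 without any log message.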
|
| 439 |
-
|
| 440 |
-
|
| 441 |
-
def _parse_yml_catalogue(raw: str) -> list[HTRUnitedEntry]:
|
| 442 |
-
"""Rudimentary parser for the HTR-United YAML catalogue."""
|
| 443 |
-
try:
|
| 444 |
-
import yaml
|
| 445 |
-
data = yaml.safe_load(raw)
|
| 446 |
-
if isinstance(data, list):
|
| 447 |
-
return [HTRUnitedEntry.from_dict(d) for d in data if isinstance(d, dict)]
|
| 448 |
-
except Exception:
|
| 449 |
-
pass
|
| 450 |
-
return [HTRUnitedEntry.from_dict(d) for d in _DEMO_CATALOGUE]
|
| 451 |
-
|
| 452 |
|
| 453 |
-
|
| 454 |
-
|
| 455 |
-
|
| 1 |
+
"""Backward-compatibility alias: module moved to :mod:`picarones.extras.importers.htr_united`.
|
| 2 |
|
| 3 |
+
Phase C of the 3-circle refactoring (architecture-cercles.md).
|
| 4 |
+
This importer now lives in Circle 3 (``extras/importers/``). The alias
|
| 5 |
+
here keeps historical imports (``from picarones.importers.htr_united
|
| 6 |
+
import ...``) working.
|
| 7 |
|
| 8 |
+
See :doc:`docs/architecture-cercles.md` and the
|
| 9 |
+
``picarones[importers]`` extra in ``pyproject.toml``.
|
| 10 |
"""
|
| 11 |
|
| 12 |
+
from picarones.extras.importers.htr_united import * # noqa: F401, F403
|
| 13 |
|
| 14 |
+
import picarones.extras.importers.htr_united as _module
|
| 15 |
+
__all__ = getattr(_module, "__all__", [
|
| 16 |
+
name for name in dir(_module) if not name.startswith("_")
|
| 17 |
+
])
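Reviewer sketch of why a star-import shim preserves object identity (the `shim.X is extras.X` property mentioned in the commit message). It builds two throwaway in-memory modules with hypothetical names (`demo_extras_importer`, `demo_legacy_importer`) instead of touching the real `picarones` packages, and runs the same three statements as the shim above.

```python
import sys
import types

# Stand-in for the relocated module; all names here are hypothetical.
target = types.ModuleType("demo_extras_importer")
exec(
    "class DemoImporter:\n"
    "    pass\n"
    "\n"
    "def _private_helper():\n"
    "    return 'hidden'\n",
    target.__dict__,
)
sys.modules[target.__name__] = target

# Stand-in for the legacy shim, using the same three statements as the diff.
shim = types.ModuleType("demo_legacy_importer")
exec(
    "from demo_extras_importer import *  # noqa: F401, F403\n"
    "import demo_extras_importer as _module\n"
    "__all__ = getattr(_module, '__all__', [\n"
    "    name for name in dir(_module) if not name.startswith('_')\n"
    "])\n",
    shim.__dict__,
)

# The shim binds the very same class object, so isinstance checks and
# `is` comparisons behave identically through either import path.
same_object = shim.DemoImporter is target.DemoImporter

# Without an explicit __all__ in the target, `import *` skips names that
# start with an underscore, so private helpers are not re-exported.
private_visible = hasattr(shim, "_private_helper")
```

One consequence worth checking against the real modules: any caller that imported an underscore-prefixed helper through the legacy path would only keep working if the relocated module lists that name in its own `__all__`.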
|
|
@@ -1,427 +1,17 @@
|
|
| 1 |
-
"""
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
- :func:`import_hf_dataset`: download a dataset to a local directory
|
| 8 |
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
Example
|
| 13 |
-
-------
|
| 14 |
-
importer = HuggingFaceImporter()
|
| 15 |
-
results = importer.search("medieval OCR", tags=["ocr"])
|
| 16 |
-
corpus = importer.import_dataset(results[0].dataset_id, output_dir="./corpus/")
|
| 17 |
"""
|
| 18 |
|
| 19 |
-
from
|
| 20 |
-
|
| 21 |
-
import json
|
| 22 |
-
import os
|
| 23 |
-
import urllib.error
|
| 24 |
-
import urllib.parse
|
| 25 |
-
import urllib.request
|
| 26 |
-
from dataclasses import dataclass, field
|
| 27 |
-
from pathlib import Path
|
| 28 |
-
from typing import Optional
|
| 29 |
-
|
| 30 |
-
# ---------------------------------------------------------------------------
|
| 31 |
-
# Pre-registered reference datasets
|
| 32 |
-
# ---------------------------------------------------------------------------
|
| 33 |
-
|
| 34 |
-
_REFERENCE_DATASETS: list[dict] = [
|
| 35 |
-
{
|
| 36 |
-
"dataset_id": "Teklia/RIMES",
|
| 37 |
-
"title": "RIMES — Reconnaissance et Indexation de données Manuscrites et de fac-similEs",
|
| 38 |
-
"description": "Corpus of modern French handwritten letters. Reference standard for handwriting recognition.",
|
| 39 |
-
"language": ["French"],
|
| 40 |
-
"tags": ["htr", "ocr", "handwritten", "french", "modern"],
|
| 41 |
-
"license": "cc-by-4.0",
|
| 42 |
-
"size_category": "1K<n<10K",
|
| 43 |
-
"task": "image-to-text",
|
| 44 |
-
"institution": "IRISA / A2iA",
|
| 45 |
-
"downloads": 1200,
|
| 46 |
-
},
|
| 47 |
-
{
|
| 48 |
-
"dataset_id": "Teklia/IAM",
|
| 49 |
-
"title": "IAM Handwriting Database",
|
| 50 |
-
"description": "English reference corpus for handwriting recognition.",
|
| 51 |
-
"language": ["English"],
|
| 52 |
-
"tags": ["htr", "ocr", "handwritten", "english"],
|
| 53 |
-
"license": "other",
|
| 54 |
-
"size_category": "10K<n<100K",
|
| 55 |
-
"task": "image-to-text",
|
| 56 |
-
"institution": "University of Bern",
|
| 57 |
-
"downloads": 8400,
|
| 58 |
-
},
|
| 59 |
-
{
|
| 60 |
-
"dataset_id": "CATMuS/medieval",
|
| 61 |
-
"title": "CATMuS Medieval — Consistent Approaches to Transcribing ManuScripts",
|
| 62 |
-
"description": "Multilingual dataset of medieval manuscripts (Latin, French, Occitan, Spanish) for training HTR models.",
|
| 63 |
-
"language": ["Latin", "French", "Occitan", "Spanish"],
|
| 64 |
-
"tags": ["htr", "medieval", "manuscripts", "latin", "french", "historical"],
|
| 65 |
-
"license": "cc-by-4.0",
|
| 66 |
-
"size_category": "100K<n<1M",
|
| 67 |
-
"task": "image-to-text",
|
| 68 |
-
"institution": "Inria / EPHE",
|
| 69 |
-
"downloads": 3100,
|
| 70 |
-
},
|
| 71 |
-
{
|
| 72 |
-
"dataset_id": "htr-united/cremma-medieval",
|
| 73 |
-
"title": "CREMMA Medieval",
|
| 74 |
-
"description": "Corpus of French medieval manuscripts, 12th-15th centuries.",
|
| 75 |
-
"language": ["French", "Latin"],
|
| 76 |
-
"tags": ["htr", "medieval", "french", "manuscripts", "htr-united"],
|
| 77 |
-
"license": "cc-by-4.0",
|
| 78 |
-
"size_category": "1K<n<10K",
|
| 79 |
-
"task": "image-to-text",
|
| 80 |
-
"institution": "Inria",
|
| 81 |
-
"downloads": 520,
|
| 82 |
-
},
|
| 83 |
-
{
|
| 84 |
-
"dataset_id": "biglam/europeana_newspapers",
|
| 85 |
-
"title": "Europeana Newspapers",
|
| 86 |
-
"description": "Digitised European newspapers from the 19th century (OCR + images).",
|
| 87 |
-
"language": ["French", "German", "Dutch", "Finnish"],
|
| 88 |
-
"tags": ["ocr", "newspapers", "historical", "19th-century", "europeana"],
|
| 89 |
-
"license": "cc0-1.0",
|
| 90 |
-
"size_category": "1M<n<10M",
|
| 91 |
-
"task": "image-to-text",
|
| 92 |
-
"institution": "Europeana Foundation",
|
| 93 |
-
"downloads": 15200,
|
| 94 |
-
},
|
| 95 |
-
{
|
| 96 |
-
"dataset_id": "stefanklut/esposalles",
|
| 97 |
-
"title": "Esposalles Dataset",
|
| 98 |
-
"description": "17th-century Catalan marriage registers for historical handwriting recognition.",
|
| 99 |
-
"language": ["Catalan", "Latin"],
|
| 100 |
-
"tags": ["htr", "historical", "registers", "catalan", "17th-century"],
|
| 101 |
-
"license": "cc-by-4.0",
|
| 102 |
-
"size_category": "1K<n<10K",
|
| 103 |
-
"task": "image-to-text",
|
| 104 |
-
"institution": "Universitat Autònoma de Barcelona",
|
| 105 |
-
"downloads": 340,
|
| 106 |
-
},
|
| 107 |
-
{
|
| 108 |
-
"dataset_id": "bnf-gallica/gallica-ocr",
|
| 109 |
-
"title": "Gallica OCR",
|
| 110 |
-
"description": "Excerpts of early printed books digitised from Gallica, with ground truth.",
|
| 111 |
-
"language": ["French", "Latin"],
|
| 112 |
-
"tags": ["ocr", "historical", "printed", "gallica", "french"],
|
| 113 |
-
"license": "etalab-2.0",
|
| 114 |
-
"size_category": "10K<n<100K",
|
| 115 |
-
"task": "image-to-text",
|
| 116 |
-
"institution": "Gallica",
|
| 117 |
-
"downloads": 2800,
|
| 118 |
-
},
|
| 119 |
-
{
|
| 120 |
-
"dataset_id": "Bozen-Baptism/baptism-records",
|
| 121 |
-
"title": "Bozen Baptism Records",
|
| 122 |
-
"description": "Baptism registers from Bozen (Italy/Austria), 18th century.",
|
| 123 |
-
"language": ["German", "Latin"],
|
| 124 |
-
"tags": ["htr", "historical", "registers", "german", "latin", "18th-century"],
|
| 125 |
-
"license": "cc-by-4.0",
|
| 126 |
-
"size_category": "1K<n<10K",
|
| 127 |
-
"task": "image-to-text",
|
| 128 |
-
"institution": "University of Innsbruck",
|
| 129 |
-
"downloads": 190,
|
| 130 |
-
},
|
| 131 |
-
{
|
| 132 |
-
"dataset_id": "read-bad/readbad",
|
| 133 |
-
"title": "READ-BAD — Recognition and Enrichment of Archival Documents",
|
| 134 |
-
"description": "Multilingual corpus of archival documents for historical OCR (Latin, German, English).",
|
| 135 |
-
"language": ["German", "English", "Latin"],
|
| 136 |
-
"tags": ["ocr", "htr", "historical", "archives", "read"],
|
| 137 |
-
"license": "cc-by-4.0",
|
| 138 |
-
"size_category": "10K<n<100K",
|
| 139 |
-
"task": "image-to-text",
|
| 140 |
-
"institution": "University of Graz",
|
| 141 |
-
"downloads": 1050,
|
| 142 |
-
},
|
| 143 |
-
]
|
| 144 |
-
|
| 145 |
-
# ---------------------------------------------------------------------------
|
| 146 |
-
# Dataclass
|
| 147 |
-
# ---------------------------------------------------------------------------
|
| 148 |
-
|
| 149 |
-
@dataclass
|
| 150 |
-
class HuggingFaceDataset:
|
| 151 |
-
"""Metadata for a HuggingFace dataset."""
|
| 152 |
-
|
| 153 |
-
dataset_id: str
|
| 154 |
-
title: str
|
| 155 |
-
description: str = ""
|
| 156 |
-
language: list[str] = field(default_factory=list)
|
| 157 |
-
tags: list[str] = field(default_factory=list)
|
| 158 |
-
license: str = ""
|
| 159 |
-
size_category: str = ""
|
| 160 |
-
task: str = "image-to-text"
|
| 161 |
-
institution: str = ""
|
| 162 |
-
downloads: int = 0
|
| 163 |
-
source: str = "reference" # "reference" | "api"
|
| 164 |
-
|
| 165 |
-
def as_dict(self) -> dict:
|
| 166 |
-
return {
|
| 167 |
-
"dataset_id": self.dataset_id,
|
| 168 |
-
"title": self.title,
|
| 169 |
-
"description": self.description,
|
| 170 |
-
"language": self.language,
|
| 171 |
-
"tags": self.tags,
|
| 172 |
-
"license": self.license,
|
| 173 |
-
"size_category": self.size_category,
|
| 174 |
-
"task": self.task,
|
| 175 |
-
"institution": self.institution,
|
| 176 |
-
"downloads": self.downloads,
|
| 177 |
-
"source": self.source,
|
| 178 |
-
}
|
| 179 |
-
|
| 180 |
-
@classmethod
|
| 181 |
-
def from_dict(cls, d: dict) -> "HuggingFaceDataset":
|
| 182 |
-
return cls(
|
| 183 |
-
dataset_id=d.get("dataset_id", d.get("id", "")),
|
| 184 |
-
title=d.get("title", d.get("dataset_id", "")),
|
| 185 |
-
description=d.get("description", ""),
|
| 186 |
-
language=d.get("language", []),
|
| 187 |
-
tags=d.get("tags", []),
|
| 188 |
-
license=d.get("license", ""),
|
| 189 |
-
size_category=d.get("size_category", d.get("cardData", {}).get("size_categories", [""])[0] if isinstance(d.get("cardData"), dict) else ""),
|
| 190 |
-
task=d.get("task", "image-to-text"),
|
| 191 |
-
institution=d.get("institution", ""),
|
| 192 |
-
downloads=d.get("downloads", d.get("downloadsAllTime", 0)),
|
| 193 |
-
source=d.get("source", "api"),
|
| 194 |
-
)
|
| 195 |
-
|
| 196 |
-
@property
|
| 197 |
-
def hf_url(self) -> str:
|
| 198 |
-
return f"https://huggingface.co/datasets/{self.dataset_id}"
|
| 199 |
-
|
| 200 |
-
|
| 201 |
-
# ---------------------------------------------------------------------------
|
| 202 |
-
# Main importer
|
| 203 |
-
# ---------------------------------------------------------------------------
|
| 204 |
-
|
| 205 |
-
class HuggingFaceImporter:
|
| 206 |
-
"""Search and import datasets from the HuggingFace Hub."""
|
| 207 |
-
|
| 208 |
-
_API_BASE = "https://huggingface.co/api"
|
| 209 |
-
|
| 210 |
-
def __init__(self, token: Optional[str] = None) -> None:
|
| 211 |
-
self._token = token or os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_TOKEN")
|
| 212 |
-
|
| 213 |
-
def _headers(self) -> dict:
|
| 214 |
-
h = {"User-Agent": "picarones-hf-importer/1.0"}
|
| 215 |
-
if self._token:
|
| 216 |
-
h["Authorization"] = f"Bearer {self._token}"
|
| 217 |
-
return h
|
| 218 |
-
|
| 219 |
-
def search(
|
| 220 |
-
self,
|
| 221 |
-
query: str = "",
|
| 222 |
-
tags: Optional[list[str]] = None,
|
| 223 |
-
language: Optional[str] = None,
|
| 224 |
-
limit: int = 20,
|
| 225 |
-
use_reference: bool = True,
|
| 226 |
-
) -> list[HuggingFaceDataset]:
|
| 227 |
-
"""Search datasets with filters.
|
| 228 |
-
|
| 229 |
-
Query the built-in reference datasets first, then
|
| 230 |
-
the HuggingFace API if available.
|
| 231 |
-
"""
|
| 232 |
-
results: list[HuggingFaceDataset] = []
|
| 233 |
-
|
| 234 |
-
# Reference datasets
|
| 235 |
-
if use_reference:
|
| 236 |
-
ref_results = self._search_reference(query, tags, language)
|
| 237 |
-
results.extend(ref_results)
|
| 238 |
-
|
| 239 |
-
# HuggingFace API (optional, may fail silently)
|
| 240 |
-
try:
|
| 241 |
-
api_results = self._search_api(query, tags, language, limit)
|
| 242 |
-
# Deduplicate (reference datasets take priority)
|
| 243 |
-
existing_ids = {r.dataset_id for r in results}
|
| 244 |
-
for ds in api_results:
|
| 245 |
-
if ds.dataset_id not in existing_ids:
|
| 246 |
-
results.append(ds)
|
| 247 |
-
existing_ids.add(ds.dataset_id)
|
| 248 |
-
except Exception:
|
| 249 |
-
pass
|
| 250 |
-
|
| 251 |
-
return results[:limit]
|
| 252 |
-
|
| 253 |
-
def _search_reference(
|
| 254 |
-
self,
|
| 255 |
-
query: str,
|
| 256 |
-
tags: Optional[list[str]],
|
| 257 |
-
language: Optional[str],
|
| 258 |
-
) -> list[HuggingFaceDataset]:
|
| 259 |
-
datasets = [HuggingFaceDataset.from_dict(d) for d in _REFERENCE_DATASETS]
|
| 260 |
-
datasets = [ds._replace_source("reference") for ds in datasets]
|
| 261 |
-
|
| 262 |
-
if query:
|
| 263 |
-
q = query.lower()
|
| 264 |
-
datasets = [
|
| 265 |
-
ds for ds in datasets
|
| 266 |
-
if (q in ds.title.lower()
|
| 267 |
-
or q in ds.description.lower()
|
| 268 |
-
or q in ds.dataset_id.lower()
|
| 269 |
-
or any(q in t.lower() for t in ds.tags)
|
| 270 |
-
or any(q in lg.lower() for lg in ds.language))
|
| 271 |
-
]
|
| 272 |
-
|
| 273 |
-
if tags:
|
| 274 |
-
for tag in tags:
|
| 275 |
-
t_lower = tag.lower()
|
| 276 |
-
datasets = [
|
| 277 |
-
ds for ds in datasets
|
| 278 |
-
if any(t_lower in dt.lower() for dt in ds.tags)
|
| 279 |
-
]
|
| 280 |
-
|
| 281 |
-
if language:
|
| 282 |
-
lang_lower = language.lower()
|
| 283 |
-
datasets = [
|
| 284 |
-
ds for ds in datasets
|
| 285 |
-
if any(lang_lower in lg.lower() for lg in ds.language)
|
| 286 |
-
]
|
| 287 |
-
|
| 288 |
-
return datasets
|
| 289 |
-
|
| 290 |
-
def _search_api(
|
| 291 |
-
self,
|
| 292 |
-
query: str,
|
| 293 |
-
tags: Optional[list[str]],
|
| 294 |
-
language: Optional[str],
|
| 295 |
-
limit: int,
|
| 296 |
-
) -> list[HuggingFaceDataset]:
|
| 297 |
-
params: dict[str, str] = {
|
| 298 |
-
"task_categories": "image-to-text",
|
| 299 |
-
"limit": str(min(limit, 50)),
|
| 300 |
-
"full": "False",
|
| 301 |
-
}
|
| 302 |
-
if query:
|
| 303 |
-
params["search"] = query
|
| 304 |
-
if language:
|
| 305 |
-
params["language"] = language
|
| 306 |
-
if tags:
|
| 307 |
-
params["tags"] = ",".join(tags)
|
| 308 |
-
|
| 309 |
-
url = f"{self._API_BASE}/datasets?" + urllib.parse.urlencode(params)
|
| 310 |
-
req = urllib.request.Request(url, headers=self._headers())
|
| 311 |
-
with urllib.request.urlopen(req, timeout=10) as resp:
|
| 312 |
-
data = json.loads(resp.read().decode("utf-8"))
|
| 313 |
-
|
| 314 |
-
results = []
|
| 315 |
-
for item in data if isinstance(data, list) else []:
|
| 316 |
-
ds = HuggingFaceDataset(
|
| 317 |
-
dataset_id=item.get("id", ""),
|
| 318 |
-
title=item.get("id", ""),
|
| 319 |
-
description=item.get("description", ""),
|
| 320 |
-
language=item.get("language", []),
|
| 321 |
-
tags=item.get("tags", []),
|
| 322 |
-
license=item.get("license", ""),
|
| 323 |
-
size_category=(
|
| 324 |
-
item.get("cardData", {}).get("size_categories", [""])[0]
|
| 325 |
-
if isinstance(item.get("cardData"), dict)
|
| 326 |
-
else ""
|
| 327 |
-
),
|
| 328 |
-
task="image-to-text",
|
| 329 |
-
downloads=item.get("downloadsAllTime", 0),
|
| 330 |
-
source="api",
|
| 331 |
-
)
|
| 332 |
-
if ds.dataset_id:
|
| 333 |
-
results.append(ds)
|
| 334 |
-
return results
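Reviewer sketch of the query string that `_search_api` builds, without making any network call. The parameter names and the base URL are taken from the diff; the helper name `build_search_url` is hypothetical, and whether the Hub API honours each of these parameters is not verified here.

```python
import urllib.parse
from typing import Optional

API_BASE = "https://huggingface.co/api"  # same base URL as in the diff


def build_search_url(
    query: str = "",
    tags: Optional[list[str]] = None,
    language: Optional[str] = None,
    limit: int = 20,
) -> str:
    """Assemble the dataset-search URL the same way the diff does."""
    params: dict[str, str] = {
        "task_categories": "image-to-text",
        "limit": str(min(limit, 50)),  # the request size is capped at 50
        "full": "False",
    }
    if query:
        params["search"] = query
    if language:
        params["language"] = language
    if tags:
        params["tags"] = ",".join(tags)  # several tags travel as one comma-joined value
    return f"{API_BASE}/datasets?" + urllib.parse.urlencode(params)


url = build_search_url("medieval OCR", tags=["ocr", "htr"], limit=200)
decoded = urllib.parse.parse_qs(urllib.parse.urlsplit(url).query)
```

Note that `urlencode` percent-encodes the comma, so the tags reach the server as a single `tags` value; if the API expects repeated `tags=` parameters instead, passing a list of pairs with `doseq=True` would be the usual alternative.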
|
| 335 |
-
|
| 336 |
-
def import_dataset(
|
| 337 |
-
self,
|
| 338 |
-
dataset_id: str,
|
| 339 |
-
output_dir: str | Path,
|
| 340 |
-
split: str = "train",
|
| 341 |
-
max_samples: int = 100,
|
| 342 |
-
show_progress: bool = True,
|
| 343 |
-
) -> dict:
|
| 344 |
-
"""Import a dataset from HuggingFace into a local directory.
|
| 345 |
-
|
| 346 |
-
Return the import metadata.
|
| 347 |
-
"""
|
| 348 |
-
output_path = Path(output_dir)
|
| 349 |
-
output_path.mkdir(parents=True, exist_ok=True)
|
| 350 |
-
|
| 351 |
-
meta = {
|
| 352 |
-
"source": "huggingface",
|
| 353 |
-
"dataset_id": dataset_id,
|
| 354 |
-
"split": split,
|
| 355 |
-
"max_samples": max_samples,
|
| 356 |
-
"imported_at": _iso_now(),
|
| 357 |
-
}
|
| 358 |
-
meta_file = output_path / "huggingface_meta.json"
|
| 359 |
-
meta_file.write_text(json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8")
|
| 360 |
-
|
| 361 |
-
# Try importing via the datasets library if available
|
| 362 |
-
files_imported = _try_import_with_datasets_lib(
|
| 363 |
-
dataset_id, output_path, split, max_samples, show_progress
|
| 364 |
-
)
|
| 365 |
-
|
| 366 |
-
return {
|
| 367 |
-
"dataset_id": dataset_id,
|
| 368 |
-
"output_dir": str(output_path),
|
| 369 |
-
"files_imported": files_imported,
|
| 370 |
-
"metadata_file": str(meta_file),
|
| 371 |
-
}
|
| 372 |
-
|
| 373 |
-
|
| 374 |
-
def _try_import_with_datasets_lib(
|
| 375 |
-
dataset_id: str,
|
| 376 |
-
output_path: Path,
|
| 377 |
-
split: str,
|
| 378 |
-
max_samples: int,
|
| 379 |
-
show_progress: bool,
|
| 380 |
-
) -> int:
|
| 381 |
-
"""Try importing with HuggingFace's `datasets` library."""
|
| 382 |
-
try:
|
| 383 |
-
from datasets import load_dataset # type: ignore
|
| 384 |
-
|
| 385 |
-
ds = load_dataset(dataset_id, split=split, streaming=True)
|
| 386 |
-
count = 0
|
| 387 |
-
for i, item in enumerate(ds):
|
| 388 |
-
if i >= max_samples:
|
| 389 |
-
break
|
| 390 |
-
# Look for the image and text fields
|
| 391 |
-
image = item.get("image") or item.get("img")
|
| 392 |
-
text = item.get("text") or item.get("transcription") or item.get("ground_truth", "")
|
| 393 |
-
|
| 394 |
-
if image is not None:
|
| 395 |
-
img_file = output_path / f"doc_{i:04d}.jpg"
|
| 396 |
-
try:
|
| 397 |
-
image.save(str(img_file))
|
| 398 |
-
except Exception:
|
| 399 |
-
pass
|
| 400 |
-
|
| 401 |
-
gt_file = output_path / f"doc_{i:04d}.gt.txt"
|
| 402 |
-
gt_file.write_text(str(text), encoding="utf-8")
|
| 403 |
-
count += 1
|
| 404 |
-
|
| 405 |
-
return count
|
| 406 |
-
except Exception:  # ImportError is already a subclass of Exception
|
| 407 |
-
return 0
|
| 408 |
-
|
| 409 |
-
|
| 410 |
-
def _iso_now() -> str:
|
| 411 |
-
from datetime import datetime, timezone
|
| 412 |
-
return datetime.now(timezone.utc).isoformat(timespec="seconds")
|
| 413 |
-
|
| 414 |
-
|
| 415 |
-
# ---------------------------------------------------------------------------
|
| 416 |
-
# HuggingFaceDataset extension (private helper)
|
| 417 |
-
# ---------------------------------------------------------------------------
|
| 418 |
-
|
| 419 |
-
def _patch_dataset_replace_source() -> None:
|
| 420 |
-
"""Attach a _replace_source helper to HuggingFaceDataset."""
|
| 421 |
-
def _replace_source(self, source: str) -> "HuggingFaceDataset":
|
| 422 |
-
from dataclasses import replace
|
| 423 |
-
return replace(self, source=source)
|
| 424 |
-
HuggingFaceDataset._replace_source = _replace_source
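Reviewer sketch of an alternative to the monkey-patch above: defining the helper as a regular method that calls `dataclasses.replace`. This is a swapped-in design for illustration, not what the module does; the class `Dataset` and the method name `with_source` are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass
class Dataset:
    """Hypothetical two-field stand-in for HuggingFaceDataset."""
    dataset_id: str
    source: str = "api"

    def with_source(self, source: str) -> "Dataset":
        # replace() builds a new instance and leaves the original untouched.
        return replace(self, source=source)


original = Dataset("example/dataset")
tagged = original.with_source("reference")
```

Defining the method in the class body keeps it visible to type checkers and does not depend on a patch function having been called at import time.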
|
| 425 |
-
|
| 426 |
|
| 427 |
-
|
| 1 |
+
"""Backward-compatibility alias: module moved to :mod:`picarones.extras.importers.huggingface`.
|
| 2 |
|
| 3 |
+
Phase C of the 3-circle refactoring (architecture-cercles.md).
|
| 4 |
+
This importer now lives in Circle 3 (``extras/importers/``). The alias
|
| 5 |
+
here keeps historical imports (``from picarones.importers.huggingface
|
| 6 |
+
import ...``) working.
|
| 7 |
|
| 8 |
+
See :doc:`docs/architecture-cercles.md` and the
|
| 9 |
+
``picarones[importers]`` extra in ``pyproject.toml``.
|
| 10 |
"""
|
| 11 |
|
| 12 |
+
from picarones.extras.importers.huggingface import * # noqa: F401, F403
|
| 13 |
|
| 14 |
+
import picarones.extras.importers.huggingface as _module
|
| 15 |
+
__all__ = getattr(_module, "__all__", [
|
| 16 |
+
name for name in dir(_module) if not name.startswith("_")
|
| 17 |
+
])
|
|
@@ -1,565 +1,17 @@
|
|
| 1 |
-
"""
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
3. Optional selection of a subset of pages (e.g. ``--pages 1-10``)
|
| 8 |
-
4. Download of the images into a local directory
|
| 9 |
-
5. Creation of empty GT files (``.gt.txt``) to be filled in manually,
|
| 10 |
-
OR loading of the transcription annotations if present in the manifest
|
| 11 |
-
6. Construction and return of a ``Corpus`` object
|
| 12 |
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
- IIIF Image API v2 and v3
|
| 16 |
-
- Presentation API v2 and v3 manifests
|
| 17 |
-
- Instances: Gallica (BnF), Bodleian, British Library, BSB, e-codices,
|
| 18 |
-
Europeana, and any IIIF-compliant repository
|
| 19 |
-
|
| 20 |
-
Usage
|
| 21 |
-
-----
|
| 22 |
-
>>> from picarones.importers.iiif import IIIFImporter
|
| 23 |
-
>>> importer = IIIFImporter("https://gallica.bnf.fr/ark:/12148/xxx/manifest.json")
|
| 24 |
-
>>> corpus = importer.import_corpus(pages="1-10", output_dir="./corpus/")
|
| 25 |
-
>>> print(f"{len(corpus)} documents downloaded")
|
| 26 |
-
|
| 27 |
-
Or via the convenience function:
|
| 28 |
-
>>> from picarones.importers.iiif import import_iiif_manifest
|
| 29 |
-
>>> corpus = import_iiif_manifest("https://...", pages="1-5", output_dir="./corpus/")
|
| 30 |
"""
|
| 31 |
|
| 32 |
-
from
|
| 33 |
-
|
| 34 |
-
import json
|
| 35 |
-
import logging
|
| 36 |
-
import re
|
| 37 |
-
import time
|
| 38 |
-
import urllib.error
|
| 39 |
-
import urllib.request
|
| 40 |
-
from dataclasses import dataclass
|
| 41 |
-
from pathlib import Path
|
| 42 |
-
from typing import Iterator, Optional
|
| 43 |
-
|
| 44 |
-
from picarones.core.corpus import Corpus, Document
|
| 45 |
-
|
| 46 |
-
logger = logging.getLogger(__name__)
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
# ---------------------------------------------------------------------------
|
| 50 |
-
# Page selector parsing
|
| 51 |
-
# ---------------------------------------------------------------------------
|
| 52 |
-
|
| 53 |
-
def parse_page_selector(pages: str, total: int) -> list[int]:
|
| 54 |
-
"""Parse a page selector into a list of 0-based indices.
|
| 55 |
-
|
| 56 |
-
Accepted formats:
|
| 57 |
-
- ``"1-10"`` → pages 1 to 10 (1-based)
|
| 58 |
-
- ``"1,3,5"`` → pages 1, 3 and 5
|
| 59 |
-
- ``"1-5,10,15-20"`` → combination
|
| 60 |
-
- ``"all"`` / ``""`` → all pages
|
| 61 |
-
|
| 62 |
-
Parameters
|
| 63 |
-
----------
|
| 64 |
-
pages:
|
| 65 |
-
Page selector as a string.
|
| 66 |
-
total:
|
| 67 |
-
Total number of pages in the manifest.
|
| 68 |
-
|
| 69 |
-
Returns
|
| 70 |
-
-------
|
| 71 |
-
list[int]
|
| 72 |
-
0-based indices of the selected pages, sorted and deduplicated.
|
| 73 |
-
|
| 74 |
-
Raises
|
| 75 |
-
------
|
| 76 |
-
ValueError
|
| 77 |
-
If the syntax is invalid or the numbers are out of range.
|
| 78 |
-
"""
|
| 79 |
-
if not pages or pages.strip().lower() == "all":
|
| 80 |
-
return list(range(total))
|
| 81 |
-
|
| 82 |
-
indices: set[int] = set()
|
| 83 |
-
for part in pages.split(","):
|
| 84 |
-
part = part.strip()
|
| 85 |
-
if "-" in part:
|
| 86 |
-
m = re.fullmatch(r"(\d+)-(\d+)", part)
|
| 87 |
-
if not m:
|
| 88 |
-
raise ValueError(f"Sélecteur de pages invalide : '{part}'")
|
| 89 |
-
start, end = int(m.group(1)), int(m.group(2))
|
| 90 |
-
if start < 1 or end > total or start > end:
|
| 91 |
-
raise ValueError(
|
| 92 |
-
f"Plage {start}-{end} hors bornes (1–{total})"
|
| 93 |
-
)
|
| 94 |
-
indices.update(range(start - 1, end))
|
| 95 |
-
else:
|
| 96 |
-
n = int(part)
|
| 97 |
-
if n < 1 or n > total:
|
| 98 |
-
raise ValueError(f"Page {n} hors bornes (1–{total})")
|
| 99 |
-
indices.add(n - 1)
|
| 100 |
-
return sorted(indices)
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
# ---------------------------------------------------------------------------
|
| 104 |
-
# Data for one IIIF canvas
|
| 105 |
-
# ---------------------------------------------------------------------------
|
| 106 |
-
|
| 107 |
-
@dataclass
|
| 108 |
-
class IIIFCanvas:
|
| 109 |
-
"""Represents a canvas (page) in an IIIF manifest."""
|
| 110 |
-
|
| 111 |
-
index: int  # 0-based position in the manifest
|
| 112 |
-
label: str  # human-readable label (e.g. "f. 1r", "Page 1")
|
| 113 |
-
image_url: str  # URL of the full-resolution image
|
| 114 |
-
width: Optional[int] = None
|
| 115 |
-
height: Optional[int] = None
|
| 116 |
-
transcription: Optional[str] = None  # GT text if annotated in the manifest
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
# ---------------------------------------------------------------------------
|
| 120 |
-
# IIIF manifest parser
|
| 121 |
-
# ---------------------------------------------------------------------------
|
| 122 |
-
|
| 123 |
-
class IIIFManifestParser:
|
| 124 |
-
"""Parses an IIIF Presentation API v2 or v3 manifest."""
|
| 125 |
-
|
| 126 |
-
def __init__(self, manifest: dict) -> None:
|
| 127 |
-
self._manifest = manifest
|
| 128 |
-
self._version = self._detect_version()
|
| 129 |
-
|
| 130 |
-
def _detect_version(self) -> int:
|
| 131 |
-
"""Detects the manifest version (2 or 3)."""
|
| 132 |
-
context = self._manifest.get("@context", "")
|
| 133 |
-
if isinstance(context, list):
|
| 134 |
-
context = " ".join(context)
|
| 135 |
-
if "presentation/3" in context or self._manifest.get("type") == "Manifest":
|
| 136 |
-
return 3
|
| 137 |
-
return 2
|
| 138 |
-
|
| 139 |
-
@property
|
| 140 |
-
def version(self) -> int:
|
| 141 |
-
return self._version
|
| 142 |
-
|
| 143 |
-
@property
|
| 144 |
-
def label(self) -> str:
|
| 145 |
-
"""Manifest title."""
|
| 146 |
-
raw = self._manifest.get("label", "")
|
| 147 |
-
return _extract_label(raw)
|
| 148 |
-
|
| 149 |
-
@property
|
| 150 |
-
def attribution(self) -> str:
|
| 151 |
-
raw = self._manifest.get("attribution", self._manifest.get("requiredStatement", ""))
|
| 152 |
-
return _extract_label(raw)
|
| 153 |
-
|
| 154 |
-
def canvases(self) -> list[IIIFCanvas]:
|
| 155 |
-
"""Returns the list of canvases in the manifest."""
|
| 156 |
-
if self._version == 3:
|
| 157 |
-
return self._parse_v3_canvases()
|
| 158 |
-
return self._parse_v2_canvases()
|
| 159 |
-
|
| 160 |
-
def _parse_v2_canvases(self) -> list[IIIFCanvas]:
|
| 161 |
-
canvases: list[IIIFCanvas] = []
|
| 162 |
-
sequences = self._manifest.get("sequences", [])
|
| 163 |
-
if not sequences:
|
| 164 |
-
return canvases
|
| 165 |
-
raw_canvases = sequences[0].get("canvases", [])
|
| 166 |
-
for i, canvas in enumerate(raw_canvases):
|
| 167 |
-
label = _extract_label(canvas.get("label", f"canvas_{i+1}"))
|
| 168 |
-
# Main image: images[0].resource.@id or service
|
| 169 |
-
images = canvas.get("images", [])
|
| 170 |
-
image_url = ""
|
| 171 |
-
if images:
|
| 172 |
-
resource = images[0].get("resource", {})
|
| 173 |
-
image_url = _best_image_url_v2(resource, canvas)
|
| 174 |
-
|
| 175 |
-
# Transcription annotations (OA annotations)
|
| 176 |
-
transcription = _extract_v2_transcription(canvas)
|
| 177 |
-
|
| 178 |
-
canvases.append(IIIFCanvas(
|
| 179 |
-
index=i,
|
| 180 |
-
label=label,
|
| 181 |
-
image_url=image_url,
|
| 182 |
-
width=canvas.get("width"),
|
| 183 |
-
height=canvas.get("height"),
|
| 184 |
-
transcription=transcription,
|
| 185 |
-
))
|
| 186 |
-
return canvases
|
| 187 |
-
|
| 188 |
-
def _parse_v3_canvases(self) -> list[IIIFCanvas]:
|
| 189 |
-
canvases: list[IIIFCanvas] = []
|
| 190 |
-
items = self._manifest.get("items", [])
|
| 191 |
-
for i, canvas in enumerate(items):
|
| 192 |
-
label = _extract_label(canvas.get("label", f"canvas_{i+1}"))
|
| 193 |
-
image_url = _best_image_url_v3(canvas)
|
| 194 |
-
transcription = _extract_v3_transcription(canvas)
|
| 195 |
-
canvases.append(IIIFCanvas(
|
| 196 |
-
index=i,
|
| 197 |
-
label=label,
|
| 198 |
-
image_url=image_url,
|
| 199 |
-
width=canvas.get("width"),
|
| 200 |
-
height=canvas.get("height"),
|
| 201 |
-
transcription=transcription,
|
| 202 |
-
))
|
| 203 |
-
return canvases
|
| 204 |
-
|
| 205 |
-
|
| 206 |
-
# ---------------------------------------------------------------------------
|
| 207 |
-
# URL and label extraction helpers
|
| 208 |
-
# ---------------------------------------------------------------------------
|
| 209 |
-
|
| 210 |
-
def _extract_label(raw: object) -> str:
|
| 211 |
-
"""Extracts a readable string from the various IIIF label formats."""
|
| 212 |
-
if isinstance(raw, str):
|
| 213 |
-
return raw
|
| 214 |
-
if isinstance(raw, list) and raw:
|
| 215 |
-
return _extract_label(raw[0])
|
| 216 |
-
if isinstance(raw, dict):
|
| 217 |
-
# IIIF v3 : {"fr": ["titre"], "en": ["title"]}
|
| 218 |
-
for lang in ("fr", "en", "none", "@value"):
|
| 219 |
-
val = raw.get(lang, "")
|
| 220 |
-
if val:
|
| 221 |
-
if isinstance(val, list):
|
| 222 |
-
return val[0] if val else ""
|
| 223 |
-
return str(val)
|
| 224 |
-
# Fallback: first value
|
| 225 |
-
for v in raw.values():
|
| 226 |
-
return _extract_label(v)
|
| 227 |
-
return str(raw) if raw else ""
|
| 228 |
-
|
| 229 |
-
|
| 230 |
-
def _best_image_url_v2(resource: dict, canvas: dict) -> str:
|
| 231 |
-
"""Builds the best image URL from an IIIF v2 resource."""
|
| 232 |
-
# 1. Direct URL of the resource
|
| 233 |
-
direct = resource.get("@id", "")
|
| 234 |
-
if direct and not direct.endswith("/info.json"):
|
| 235 |
-
return direct
|
| 236 |
-
|
| 237 |
-
# 2. Via the IIIF Image API service
|
| 238 |
-
service = resource.get("service", {})
|
| 239 |
-
if isinstance(service, list) and service:
|
| 240 |
-
service = service[0]
|
| 241 |
-
service_id = service.get("@id", service.get("id", ""))
|
| 242 |
-
if service_id:
|
| 243 |
-
return f"{service_id.rstrip('/')}/full/max/0/default.jpg"
|
| 244 |
-
|
| 245 |
-
return direct
|
| 246 |
-
|
| 247 |
-
|
| 248 |
-
def _best_image_url_v3(canvas: dict) -> str:
|
| 249 |
-
"""Extracts the image URL from an IIIF v3 canvas."""
|
| 250 |
-
items = canvas.get("items", [])
|
| 251 |
-
for annotation_page in items:
|
| 252 |
-
for annotation in annotation_page.get("items", []):
|
| 253 |
-
body = annotation.get("body", {})
|
| 254 |
-
if isinstance(body, list):
|
| 255 |
-
body = body[0] if body else {}
|
| 256 |
-
# Direct URL
|
| 257 |
-
url = body.get("id", body.get("@id", ""))
|
| 258 |
-
if url and body.get("type", "") == "Image":
|
| 259 |
-
return url
|
| 260 |
-
# Via the IIIF Image API service
|
| 261 |
-
service = body.get("service", [])
|
| 262 |
-
if isinstance(service, dict):
|
| 263 |
-
service = [service]
|
| 264 |
-
for svc in service:
|
| 265 |
-
svc_id = svc.get("id", svc.get("@id", ""))
|
| 266 |
-
if svc_id:
|
| 267 |
-
return f"{svc_id.rstrip('/')}/full/max/0/default.jpg"
|
| 268 |
-
if url:
|
| 269 |
-
return url
|
| 270 |
-
return ""
|
| 271 |
-
|
| 272 |
-
|
| 273 |
-
def _extract_v2_transcription(canvas: dict) -> Optional[str]:
|
| 274 |
-
"""Tries to extract the GT text from the OA annotations of a v2 canvas."""
|
| 275 |
-
other_content = canvas.get("otherContent", [])
|
| 276 |
-
for oc in other_content:
|
| 277 |
-
if not isinstance(oc, dict):
|
| 278 |
-
continue
|
| 279 |
-
motivation = oc.get("motivation", "")
|
| 280 |
-
if "transcrib" in motivation.lower() or "supplementing" in motivation.lower():
|
| 281 |
-
resources = oc.get("resources", [])
|
| 282 |
-
texts = []
|
| 283 |
-
for res in resources:
|
| 284 |
-
body = res.get("resource", {})
|
| 285 |
-
if body.get("@type") == "cnt:ContentAsText":
|
| 286 |
-
texts.append(body.get("chars", ""))
|
| 287 |
-
if texts:
|
| 288 |
-
return "\n".join(texts)
|
| 289 |
-
return None
|
| 290 |
-
|
| 291 |
-
|
| 292 |
-
def _extract_v3_transcription(canvas: dict) -> Optional[str]:
|
| 293 |
-
"""Tries to extract the GT text from the annotations of a v3 canvas."""
|
| 294 |
-
annotations = canvas.get("annotations", [])
|
| 295 |
-
for ann_page in annotations:
|
| 296 |
-
items = ann_page.get("items", [])
|
| 297 |
-
for ann in items:
|
| 298 |
-
motivation = ann.get("motivation", "")
|
| 299 |
-
if "transcrib" in motivation.lower() or "supplementing" in motivation.lower():
|
| 300 |
-
body = ann.get("body", {})
|
| 301 |
-
if isinstance(body, dict) and body.get("type") == "TextualBody":
|
| 302 |
-
return body.get("value", "")
|
| 303 |
-
return None
|
| 304 |
-
|
| 305 |
-
|
| 306 |
-
# ---------------------------------------------------------------------------
|
| 307 |
-
# Download with retry
|
| 308 |
-
# ---------------------------------------------------------------------------
|
| 309 |
-
|
| 310 |
-
# Workstream 4 (post-Sprint 97): HTTP helpers factored out into
|
| 311 |
-
# :mod:`picarones.importers._http`. These names remain available
|
| 312 |
-
# from ``iiif`` (backward compatibility for tests that import them
|
| 313 |
-
# directly, e.g. test_sprint4_normalization_iiif).
|
| 314 |
-
from picarones.importers._http import download_url as _download_url
|
| 315 |
-
from picarones.importers._http import validate_http_url as _validate_url
|
| 316 |
-
|
| 317 |
-
|
| 318 |
-
def _fetch_manifest(url: str) -> dict:
|
| 319 |
-
"""Downloads and parses an IIIF JSON manifest."""
|
| 320 |
-
data = _download_url(url)
|
| 321 |
-
try:
|
| 322 |
-
return json.loads(data.decode("utf-8"))
|
| 323 |
-
except json.JSONDecodeError as exc:
|
| 324 |
-
raise ValueError(f"Manifeste IIIF invalide (JSON mal formé) : {url}") from exc
|
| 325 |
-
|
| 326 |
-
|
| 327 |
-
# ---------------------------------------------------------------------------
|
| 328 |
-
# Main importer
|
| 329 |
-
# ---------------------------------------------------------------------------
|
| 330 |
-
|
| 331 |
-
class IIIFImporter:
|
| 332 |
-
"""Imports a corpus from an IIIF manifest.
|
| 333 |
-
|
| 334 |
-
Parameters
|
| 335 |
-
----------
|
| 336 |
-
manifest_url:
|
| 337 |
-
URL of the IIIF manifest (Presentation API v2 or v3).
|
| 338 |
-
max_resolution:
|
| 339 |
-
Maximum resolution of the downloaded images (width in pixels).
|
| 340 |
-
0 = maximum available resolution.
|
| 341 |
-
"""
|
| 342 |
-
|
| 343 |
-
def __init__(
|
| 344 |
-
self,
|
| 345 |
-
manifest_url: str,
|
| 346 |
-
max_resolution: int = 0,
|
| 347 |
-
) -> None:
|
| 348 |
-
self.manifest_url = manifest_url
|
| 349 |
-
self.max_resolution = max_resolution
|
| 350 |
-
self._manifest: Optional[dict] = None
|
| 351 |
-
self._parser: Optional[IIIFManifestParser] = None
|
| 352 |
-
|
| 353 |
-
def load(self) -> "IIIFImporter":
|
| 354 |
-
"""Downloads and parses the manifest."""
|
| 355 |
-
logger.info("Téléchargement du manifeste IIIF : %s", self.manifest_url)
|
| 356 |
-
self._manifest = _fetch_manifest(self.manifest_url)
|
| 357 |
-
self._parser = IIIFManifestParser(self._manifest)
|
| 358 |
-
logger.info(
|
| 359 |
-
"Manifeste chargé — version IIIF %d — titre : %s — %d canvas",
|
| 360 |
-
self._parser.version,
|
| 361 |
-
self._parser.label,
|
| 362 |
-
len(self._parser.canvases()),
|
| 363 |
-
)
|
| 364 |
-
return self
|
| 365 |
-
|
| 366 |
-
@property
|
| 367 |
-
def parser(self) -> IIIFManifestParser:
|
| 368 |
-
if self._parser is None:
|
| 369 |
-
self.load()
|
| 370 |
-
return self._parser # type: ignore[return-value]
|
| 371 |
-
|
| 372 |
-
def list_canvases(self, pages: str = "all") -> list[IIIFCanvas]:
|
| 373 |
-
"""Returns the list of selected canvases."""
|
| 374 |
-
all_canvases = self.parser.canvases()
|
| 375 |
-
indices = parse_page_selector(pages, len(all_canvases))
|
| 376 |
-
return [all_canvases[i] for i in indices]
|
| 377 |
-
|
| 378 |
-
def import_corpus(
|
| 379 |
-
self,
|
| 380 |
-
pages: str = "all",
|
| 381 |
-
output_dir: Optional[str | Path] = None,
|
| 382 |
-
show_progress: bool = True,
|
| 383 |
-
) -> Corpus:
|
| 384 |
-
"""Downloads the images and builds a Picarones corpus.
|
| 385 |
-
|
| 386 |
-
If the canvases contain transcription annotations (GT),
|
| 387 |
-
they are automatically saved to the ``.gt.txt`` files.
|
| 388 |
-
Otherwise, empty ``.gt.txt`` files are created.
|
| 389 |
-
|
| 390 |
-
Parameters
|
| 391 |
-
----------
|
| 392 |
-
pages:
|
| 393 |
-
Page selector (e.g. ``"1-10"``, ``"1,3,5"``).
|
| 394 |
-
output_dir:
|
| 395 |
-
Destination folder for the images and the GT files.
|
| 396 |
-
If None, the corpus is returned in memory without writing to disk.
|
| 397 |
-
show_progress:
|
| 398 |
-
Displays a tqdm progress bar.
|
| 399 |
-
|
| 400 |
-
Returns
|
| 401 |
-
-------
|
| 402 |
-
Corpus
|
| 403 |
-
Corpus ready to be used in ``run_benchmark``.
|
| 404 |
-
"""
|
| 405 |
-
canvases = self.list_canvases(pages)
|
| 406 |
-
if not canvases:
|
| 407 |
-
raise ValueError("Aucun canvas sélectionné.")
|
| 408 |
-
|
| 409 |
-
out_dir: Optional[Path] = Path(output_dir) if output_dir else None
|
| 410 |
-
if out_dir:
|
| 411 |
-
out_dir.mkdir(parents=True, exist_ok=True)
|
| 412 |
-
|
| 413 |
-
# Corpus name taken from the manifest title
|
| 414 |
-
corpus_name = self.parser.label or "iiif_corpus"
|
| 415 |
-
|
| 416 |
-
documents: list[Document] = []
|
| 417 |
-
iterator: Iterator[IIIFCanvas] = iter(canvases)
|
| 418 |
-
|
| 419 |
-
if show_progress:
|
| 420 |
-
try:
|
| 421 |
-
from tqdm import tqdm
|
| 422 |
-
iterator = tqdm(canvases, desc="Import IIIF", unit="page")
|
| 423 |
-
except ImportError:
|
| 424 |
-
pass
|
| 425 |
-
|
| 426 |
-
for canvas in iterator:
|
| 427 |
-
doc_id = f"{_slugify(canvas.label) or f'canvas_{canvas.index+1:04d}'}"
|
| 428 |
-
|
| 429 |
-
if not canvas.image_url:
|
| 430 |
-
logger.warning("Canvas %s : pas d'URL d'image — ignoré.", canvas.label)
|
| 431 |
-
continue
|
| 432 |
-
|
| 433 |
-
# Adjust the resolution if max_resolution is set
|
| 434 |
-
image_url = self._adjust_resolution(canvas.image_url, canvas.width)
|
| 435 |
-
|
| 436 |
-
# Download the image
|
| 437 |
-
try:
|
| 438 |
-
image_bytes = _download_url(image_url)
|
| 439 |
-
except RuntimeError as exc:
|
| 440 |
-
logger.error("Canvas %s : erreur téléchargement : %s", canvas.label, exc)
|
| 441 |
-
continue
|
| 442 |
-
|
| 443 |
-
# Determine the image extension
|
| 444 |
-
ext = _guess_extension(image_url)
|
| 445 |
-
|
| 446 |
-
if out_dir:
|
| 447 |
-
# Save to disk
|
| 448 |
-
image_path = out_dir / f"{doc_id}{ext}"
|
| 449 |
-
image_path.write_bytes(image_bytes)
|
| 450 |
-
|
| 451 |
-
gt_path = out_dir / f"{doc_id}.gt.txt"
|
| 452 |
-
gt_text = canvas.transcription or ""
|
| 453 |
-
gt_path.write_text(gt_text, encoding="utf-8")
|
| 454 |
-
|
| 455 |
-
documents.append(Document(
|
| 456 |
-
image_path=image_path,
|
| 457 |
-
ground_truth=gt_text,
|
| 458 |
-
doc_id=doc_id,
|
| 459 |
-
metadata={"iiif_label": canvas.label, "canvas_index": canvas.index},
|
| 460 |
-
))
|
| 461 |
-
else:
|
| 462 |
-
# In-memory corpus (image stored at a temporary file path)
|
| 463 |
-
import tempfile
|
| 464 |
-
tmp = tempfile.NamedTemporaryFile(suffix=ext, delete=False)
|
| 465 |
-
tmp.write(image_bytes)
|
| 466 |
-
tmp.close()
|
| 467 |
-
documents.append(Document(
|
| 468 |
-
image_path=Path(tmp.name),
|
| 469 |
-
ground_truth=canvas.transcription or "",
|
| 470 |
-
doc_id=doc_id,
|
| 471 |
-
metadata={"iiif_label": canvas.label, "canvas_index": canvas.index},
|
| 472 |
-
))
|
| 473 |
-
|
| 474 |
-
if not documents:
|
| 475 |
-
raise ValueError("Aucun document importé depuis le manifeste IIIF.")
|
| 476 |
-
|
| 477 |
-
logger.info("Import IIIF terminé : %d documents.", len(documents))
|
| 478 |
-
|
| 479 |
-
return Corpus(
|
| 480 |
-
name=corpus_name,
|
| 481 |
-
documents=documents,
|
| 482 |
-
source_path=self.manifest_url,
|
| 483 |
-
metadata={
|
| 484 |
-
"iiif_manifest_url": self.manifest_url,
|
| 485 |
-
"iiif_version": self.parser.version,
|
| 486 |
-
"iiif_attribution": self.parser.attribution,
|
| 487 |
-
"pages_selected": pages,
|
| 488 |
-
},
|
| 489 |
-
)
|
| 490 |
-
|
| 491 |
-
def _adjust_resolution(self, image_url: str, canvas_width: Optional[int]) -> str:
|
| 492 |
-
"""Adjusts the IIIF Image API URL to respect max_resolution."""
|
| 493 |
-
if not self.max_resolution or not canvas_width:
|
| 494 |
-
return image_url
|
| 495 |
-
if canvas_width <= self.max_resolution:
|
| 496 |
-
return image_url
|
| 497 |
-
# Replace /full/max/ or /full/full/ with /full/{w},/
|
| 498 |
-
url = re.sub(
|
| 499 |
-
r"/full/(max|full)/",
|
| 500 |
-
f"/full/{self.max_resolution},/",
|
| 501 |
-
image_url,
|
| 502 |
-
)
|
| 503 |
-
return url
|
| 504 |
-
|
| 505 |
-
|
| 506 |
-
# ---------------------------------------------------------------------------
|
| 507 |
-
# Utility helpers
|
| 508 |
-
# ---------------------------------------------------------------------------
|
| 509 |
-
|
| 510 |
-
def _slugify(text: str) -> str:
|
| 511 |
-
"""Converts an IIIF label into a safe file identifier."""
|
| 512 |
-
text = re.sub(r"[^\w\s-]", "", text.strip())
|
| 513 |
-
text = re.sub(r"[\s_-]+", "_", text)
|
| 514 |
-
return text[:60]
|
| 515 |
-
|
| 516 |
-
|
| 517 |
-
def _guess_extension(url: str) -> str:
|
| 518 |
-
"""Determines the image extension from the URL."""
|
| 519 |
-
url_lower = url.lower().split("?")[0]
|
| 520 |
-
for ext in (".jpg", ".jpeg", ".png", ".tif", ".tiff", ".webp"):
|
| 521 |
-
if url_lower.endswith(ext):
|
| 522 |
-
return ext
|
| 523 |
-
# Default for IIIF Image API URLs
|
| 524 |
-
if "/default." in url_lower or "/native." in url_lower:
|
| 525 |
-
return ".jpg"
|
| 526 |
-
return ".jpg"
|
| 527 |
-
|
| 528 |
-
|
| 529 |
-
# ---------------------------------------------------------------------------
|
| 530 |
-
# Convenience function
|
| 531 |
-
# ---------------------------------------------------------------------------
|
| 532 |
-
|
| 533 |
-
def import_iiif_manifest(
|
| 534 |
-
manifest_url: str,
|
| 535 |
-
pages: str = "all",
|
| 536 |
-
output_dir: Optional[str | Path] = None,
|
| 537 |
-
max_resolution: int = 0,
|
| 538 |
-
show_progress: bool = True,
|
| 539 |
-
) -> Corpus:
|
| 540 |
-
"""Imports a corpus from an IIIF manifest in a single call.
|
| 541 |
-
|
| 542 |
-
Parameters
|
| 543 |
-
----------
|
| 544 |
-
manifest_url:
|
| 545 |
-
URL of the IIIF manifest (v2 or v3).
|
| 546 |
-
pages:
|
| 547 |
-
Page selector (e.g. ``"1-10"``, ``"1,3,5"``). Defaults to ``"all"``.
|
| 548 |
-
output_dir:
|
| 549 |
-
Destination folder. If None, the corpus is kept in memory.
|
| 550 |
-
max_resolution:
|
| 551 |
-
Maximum resolution (px). 0 = no limit.
|
| 552 |
-
show_progress:
|
| 553 |
-
Displays a progress bar.
|
| 554 |
|
| 555 |
-
|
| 556 |
-
|
| 557 |
-
|
| 558 |
-
|
| 559 |
-
importer = IIIFImporter(manifest_url, max_resolution=max_resolution)
|
| 560 |
-
importer.load()
|
| 561 |
-
return importer.import_corpus(
|
| 562 |
-
pages=pages,
|
| 563 |
-
output_dir=output_dir,
|
| 564 |
-
show_progress=show_progress,
|
| 565 |
-
)
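For readers following the move, the resolution cap applied by the removed `_adjust_resolution` method can be illustrated in isolation. The sketch below is a standalone restatement under stated assumptions: `adjust_resolution` is a hypothetical free function (the original is a method reading `self.max_resolution`), and the `example.org` URLs are invented for the demonstration.

```python
import re
from typing import Optional


def adjust_resolution(image_url: str, canvas_width: Optional[int], max_resolution: int) -> str:
    """Standalone restatement of the removed method (hypothetical free function)."""
    # No cap requested, or canvas width unknown: leave the URL untouched.
    if not max_resolution or not canvas_width:
        return image_url
    # Canvas already fits within the cap.
    if canvas_width <= max_resolution:
        return image_url
    # Rewrite the IIIF Image API size segment: /full/max/ or /full/full/ -> /full/{w},/
    return re.sub(r"/full/(max|full)/", f"/full/{max_resolution},/", image_url)


# Illustrative URLs, not real endpoints.
url = "https://example.org/iiif/page1/full/max/0/default.jpg"
capped = adjust_resolution(url, canvas_width=4000, max_resolution=1024)
print(capped)

# A plain file URL has no size segment, so the substitution finds nothing to change.
static = adjust_resolution("https://example.org/img/page1.jpg", 4000, 1024)
print(static)
```

Because `_best_image_url_v2` returns the resource's direct `@id` first, a manifest whose `@id` is a plain file URL would likely bypass the cap without any log message. Whether that matters depends on the repositories being targeted.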
|
|
|
|
| 1 |
+
"""Backward-compatibility alias: module moved to :mod:`picarones.extras.importers.iiif`.
|
| 2 |
|
| 3 |
+
Phase C of the 3-circle refactoring effort (architecture-cercles.md).
|
| 4 |
+
This importer now lives in Circle 3 (``extras/importers/``). The alias
|
| 5 |
+
here lets legacy imports (``from picarones.importers.iiif
|
| 6 |
+
import ...``) keep working.
|
| 7 |
|
| 8 |
+
See :doc:`docs/architecture-cercles.md` and the
|
| 9 |
+
``picarones[importers]`` extra in ``pyproject.toml``.
|
| 10 |
"""
|
| 11 |
|
| 12 |
+
from picarones.extras.importers.iiif import * # noqa: F401, F403
|
| 13 |
|
| 14 |
+
import picarones.extras.importers.iiif as _module
|
| 15 |
+
__all__ = getattr(_module, "__all__", [
|
| 16 |
+
name for name in dir(_module) if not name.startswith("_")
|
| 17 |
+
])
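The shim above relies on a star import plus a computed `__all__`. A minimal sketch of why that preserves object identity, using two throwaway in-memory modules; the names `demo_extras_iiif` and `demo_legacy_iiif` are invented for the demonstration and stand in for the relocated module and its alias:

```python
import sys
import types

# Hypothetical stand-in for the relocated module (name invented for the demo).
real = types.ModuleType("demo_extras_iiif")
exec(
    "class IIIFImporter:\n"
    "    pass\n"
    "\n"
    "def _download_url(url):\n"
    "    return b''\n",
    real.__dict__,
)
sys.modules[real.__name__] = real

# The shim body mirrors the three statements added in the diff.
shim = types.ModuleType("demo_legacy_iiif")
exec(
    "from demo_extras_iiif import *\n"
    "import demo_extras_iiif as _module\n"
    "__all__ = getattr(_module, '__all__', [\n"
    "    name for name in dir(_module) if not name.startswith('_')\n"
    "])\n",
    shim.__dict__,
)

# Same object through both paths: the star import binds names, it does not copy.
print(shim.IIIFImporter is real.IIIFImporter)
# Public names only, because the stand-in defines no __all__.
print(shim.__all__)
# Underscore-prefixed names are skipped by a star import.
print(hasattr(shim, "_download_url"))
```

One point may be worth checking against the relocated module, which this section does not show. The removed file exposed `_download_url` and `_validate_url` for tests that imported them from `iiif`. A star import skips underscore-prefixed names unless the target module lists them in its own `__all__`.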
|
@@ -79,9 +79,16 @@ ocr-cloud = [
|
|
| 79 |
# entirely, and a future split into a separate PyPI package
|
| 80 |
# ``picarones-historical`` will reuse this extra name.
|
| 81 |
historical = []
|
| 82 |
# Full installation (all extras except cloud OCR)
|
| 83 |
all = [
|
| 84 |
-
"picarones[web,hf,llm,dev,historical]",
|
| 85 |
]
|
| 86 |
|
| 87 |
[project.scripts]
|
|
|
|
| 79 |
# entirely, and a future split into a separate PyPI package
|
| 80 |
# ``picarones-historical`` will reuse this extra name.
|
| 81 |
historical = []
|
| 82 |
+
# Corpus importers for remote sources (Circle 3, phase C).
|
| 83 |
+
# The 6 importers (``picarones.extras.importers.*``) ship in
|
| 84 |
+
# the main package. ``[importers]`` documents the intent of a
|
| 85 |
+
# future split into a ``picarones-importers`` PyPI package. The
|
| 86 |
+
# ``huggingface`` and ``escriptorium`` modules emit a
|
| 87 |
+
# ``UserWarning`` on import (experimental status).
|
| 88 |
+
importers = []
|
| 89 |
# Full installation (all extras except cloud OCR)
|
| 90 |
all = [
|
| 91 |
+
"picarones[web,hf,llm,dev,historical,importers]",
|
| 92 |
]
|
| 93 |
|
| 94 |
[project.scripts]
|
|
@@ -0,0 +1,233 @@
|
|
| 1 |
+
"""Phase C tests: extras/importers/ (importers moved to Circle 3).
|
| 2 |
+
|
| 3 |
+
Covers:
|
| 4 |
+
|
| 5 |
+
- 6 importers (``_http``, ``iiif``, ``htr_united``, ``gallica``,
|
| 6 |
+
``huggingface``, ``escriptorium``) moved to
|
| 7 |
+
``picarones/extras/importers/``.
|
| 8 |
+
- Identity preserved through the shims.
|
| 9 |
+
- ``huggingface`` and ``escriptorium`` emit an ``experimental``
|
| 10 |
+
``UserWarning`` on import.
|
| 11 |
+
- ``picarones.importers/__init__.py`` continues to re-export the
|
| 12 |
+
historical names.
|
| 13 |
+
- ``cli/_imports.py`` still works.
|
| 14 |
+
- pyproject.toml declares ``[importers]``.
|
| 15 |
+
"""
|
| 16 |
+
|
| 17 |
+
from __future__ import annotations
|
| 18 |
+
|
| 19 |
+
import importlib
|
| 20 |
+
import sys
|
| 21 |
+
import warnings
|
| 22 |
+
from pathlib import Path
|
| 23 |
+
|
| 24 |
+
import pytest
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 28 |
+
# 1. Backward-compatible legacy imports via shims
|
| 29 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 30 |
+
|
| 31 |
+
|
| 32 |
+
class TestImportersRetrocompat:
|
| 33 |
+
@pytest.mark.parametrize("module_path, attribute", [
|
| 34 |
+
("picarones.importers.iiif", "IIIFImporter"),
|
| 35 |
+
("picarones.importers.iiif", "import_iiif_manifest"),
|
| 36 |
+
("picarones.importers.htr_united", "HTRUnitedEntry"),
|
| 37 |
+
("picarones.importers.htr_united", "HTRUnitedCatalogue"),
|
| 38 |
+
("picarones.importers.htr_united", "import_htr_united_corpus"),
|
| 39 |
+
("picarones.importers.gallica", "GallicaClient"),
|
| 40 |
+
("picarones.importers.gallica", "GallicaRecord"),
|
| 41 |
+
("picarones.importers.gallica", "search_gallica"),
|
| 42 |
+
("picarones.importers.gallica", "import_gallica_document"),
|
| 43 |
+
("picarones.importers._http", "validate_http_url"),
|
| 44 |
+
("picarones.importers._http", "download_url"),
|
| 45 |
+
])
|
| 46 |
+
def test_legacy_path_works(self, module_path: str, attribute: str):
|
| 47 |
+
with warnings.catch_warnings():
|
| 48 |
+
warnings.simplefilter("ignore")
|
| 49 |
+
mod = importlib.import_module(module_path)
|
| 50 |
+
assert hasattr(mod, attribute)
|
| 51 |
+
|
| 52 |
+
|
| 53 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 54 |
+
# 2. Imports via the new extras/importers/ path
|
| 55 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 56 |
+
|
| 57 |
+
|
| 58 |
+
class TestExtrasImportersPath:
|
| 59 |
+
@pytest.mark.parametrize("new_path, attribute", [
|
| 60 |
+
("picarones.extras.importers._http", "validate_http_url"),
|
| 61 |
+
("picarones.extras.importers._http", "download_url"),
|
| 62 |
+
("picarones.extras.importers.iiif", "IIIFImporter"),
|
| 63 |
+
("picarones.extras.importers.iiif", "import_iiif_manifest"),
|
| 64 |
+
("picarones.extras.importers.htr_united", "HTRUnitedCatalogue"),
|
| 65 |
+
("picarones.extras.importers.gallica", "GallicaClient"),
|
| 66 |
+
("picarones.extras.importers.huggingface", "HuggingFaceImporter"),
|
| 67 |
+
("picarones.extras.importers.escriptorium", "EScriptoriumClient"),
|
| 68 |
+
])
|
| 69 |
+
def test_extras_path_works(self, new_path: str, attribute: str):
|
| 70 |
+
with warnings.catch_warnings():
|
| 71 |
+
warnings.simplefilter("ignore")
|
| 72 |
+
mod = importlib.import_module(new_path)
|
| 73 |
+
assert hasattr(mod, attribute)
|
| 74 |
+
|
| 75 |
+
|
| 76 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 77 |
+
# 3. Identity preserved
|
| 78 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 79 |
+
|
| 80 |
+
|
| 81 |
+
class TestIdentityThroughShim:
|
| 82 |
+
def test_iiif_identity(self):
|
| 83 |
+
with warnings.catch_warnings():
|
| 84 |
+
warnings.simplefilter("ignore")
|
| 85 |
+
from picarones.extras.importers.iiif import IIIFImporter as via_new
|
| 86 |
+
from picarones.importers.iiif import IIIFImporter as via_old
|
| 87 |
+
assert via_old is via_new
|
| 88 |
+
|
| 89 |
+
def test_gallica_identity(self):
|
| 90 |
+
with warnings.catch_warnings():
|
| 91 |
+
warnings.simplefilter("ignore")
|
| 92 |
+
from picarones.extras.importers.gallica import GallicaClient as via_new
|
| 93 |
+
from picarones.importers.gallica import GallicaClient as via_old
|
| 94 |
+
assert via_old is via_new
|
| 95 |
+
|
| 96 |
+
def test_http_helpers_identity(self):
|
| 97 |
+
with warnings.catch_warnings():
|
| 98 |
+
warnings.simplefilter("ignore")
|
| 99 |
+
from picarones.extras.importers._http import (
|
| 100 |
+
validate_http_url as via_new,
|
| 101 |
+
)
|
| 102 |
+
from picarones.importers._http import (
|
| 103 |
+
validate_http_url as via_old,
|
| 104 |
+
)
|
| 105 |
+
assert via_old is via_new
|
| 106 |
+
|
| 107 |
+
|
| 108 |
+
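The identity checks above rely on the shims re-exporting the very same objects rather than redefining them. The following is a minimal sketch of that mechanism, not the project's actual shim code: the module names `demo_extras_iiif` and `demo_legacy_iiif` are hypothetical stand-ins built in memory, and the real shims may be written differently.

```python
import sys
import types

# Hypothetical "real" module, standing in for picarones.extras.importers.iiif.
real = types.ModuleType("demo_extras_iiif")
exec("class IIIFImporter:\n    pass\n", real.__dict__)
sys.modules[real.__name__] = real

# Hypothetical shim, standing in for picarones.importers.iiif.
# It defines nothing of its own: it only re-exports the name.
shim = types.ModuleType("demo_legacy_iiif")
exec("from demo_extras_iiif import IIIFImporter\n", shim.__dict__)
sys.modules[shim.__name__] = shim

from demo_extras_iiif import IIIFImporter as via_new
from demo_legacy_iiif import IIIFImporter as via_old

# A re-export binds a second name to the same object, so `is` holds.
identity_preserved = via_old is via_new
print(identity_preserved)
```

Because both names point to one class object, `isinstance` checks and `except` clauses written against the old import path keep working with objects created through the new one.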
# ──────────────────────────────────────────────────────────────────────────
# 4. Experimental modules: UserWarning on import
# ──────────────────────────────────────────────────────────────────────────


def _force_reimport(module_name_substring: str) -> None:
    """Clear the import cache so that the UserWarning can be captured."""
    for name in list(sys.modules.keys()):
        if module_name_substring in name:
            del sys.modules[name]


class TestExperimentalImporters:
    def test_huggingface_emits_userwarning(self):
        _force_reimport("huggingface")
        with warnings.catch_warnings(record=True) as w:
            warnings.simplefilter("always")
            import picarones.extras.importers.huggingface  # noqa: F401
        msgs = [str(x.message) for x in w if issubclass(x.category, UserWarning)]
        assert any("experimental" in m for m in msgs), (
            "huggingface did not emit an experimental UserWarning; "
            f"warnings received: {[str(x.message) for x in w]}"
        )

    def test_escriptorium_emits_userwarning(self):
        _force_reimport("escriptorium")
        with warnings.catch_warnings(record=True) as w:
            warnings.simplefilter("always")
            import picarones.extras.importers.escriptorium  # noqa: F401
        msgs = [str(x.message) for x in w if issubclass(x.category, UserWarning)]
        assert any("experimental" in m for m in msgs)

    def test_iiif_does_not_emit_warning(self):
        """Maintained importers must NOT emit a warning."""
        _force_reimport("iiif")
        with warnings.catch_warnings(record=True) as w:
            warnings.simplefilter("always")
            import picarones.extras.importers.iiif  # noqa: F401
        msgs = [str(x.message) for x in w if issubclass(x.category, UserWarning)]
        # Other warnings may appear (Python deprecations, etc.),
        # but none marking iiif as "experimental".
        assert not any(
            "iiif" in m and "experimental" in m for m in msgs
        ), "iiif must not be marked experimental"


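The tests above clear `sys.modules` before importing because a module body, and therefore any warning it raises, only runs on the first import. Here is a small self-contained sketch of that behaviour. It assumes the experimental modules call `warnings.warn(..., UserWarning)` at module level, which the commit message implies but does not show; the module `demo_experimental` is hypothetical.

```python
import importlib
import sys
import tempfile
import warnings
from pathlib import Path

# Hypothetical experimental module: the warning is raised at import time.
MODULE_SOURCE = (
    "import warnings\n"
    "warnings.warn(\n"
    "    f'{__name__} is experimental and may change or be removed without notice.',\n"
    "    UserWarning,\n"
    "    stacklevel=2,\n"
    ")\n"
)


def import_and_collect(name):
    """Import ``name`` and return the UserWarning messages emitted meanwhile."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        importlib.import_module(name)
    return [str(w.message) for w in caught if issubclass(w.category, UserWarning)]


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo_experimental.py").write_text(MODULE_SOURCE, encoding="utf-8")
    sys.path.insert(0, tmp)
    try:
        first = import_and_collect("demo_experimental")   # module body runs
        cached = import_and_collect("demo_experimental")  # served from sys.modules
        del sys.modules["demo_experimental"]              # same idea as _force_reimport
        again = import_and_collect("demo_experimental")   # module body runs again
    finally:
        sys.path.remove(tmp)
        sys.modules.pop("demo_experimental", None)

print(len(first), len(cached), len(again))
```

Without the cache eviction, a test that runs after any earlier import of the module would record no warning and fail for reasons unrelated to the module itself.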
# ──────────────────────────────────────────────────────────────────────────
# 5. picarones.importers/__init__.py: legacy re-exports
# ──────────────────────────────────────────────────────────────────────────


class TestImportersInitReexports:
    def test_reexports_work(self):
        """The ``__init__`` re-exports symbols through the shims, which in
        turn load them from extras."""
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            from picarones.importers import (
                EScriptoriumClient,
                GallicaClient,
                IIIFImporter,
                connect_escriptorium,
                import_gallica_document,
                import_iiif_manifest,
                search_gallica,
            )
        assert IIIFImporter is not None
        assert GallicaClient is not None
        assert EScriptoriumClient is not None


# ──────────────────────────────────────────────────────────────────────────
# 6. cli/_imports.py: still functional
# ──────────────────────────────────────────────────────────────────────────


class TestCliImportsCommand:
    def test_cli_imports_module_loads(self):
        """``picarones.cli._imports`` imports IIIFImporter from
        ``picarones.importers.iiif``; this must work through the shim."""
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                import picarones.cli._imports  # noqa: F401
        except ImportError as exc:
            # Skip only when the optional click dependency is missing.
            if "click" in str(exc):
                pytest.skip("click is not installed")
            raise


# ──────────────────────────────────────────────────────────────────────────
# 7. pyproject.toml: [importers] extra
# ──────────────────────────────────────────────────────────────────────────


class TestPyprojectExtra:
    def test_importers_extra_declared(self):
        path = Path(__file__).parent.parent / "pyproject.toml"
        content = path.read_text(encoding="utf-8")
        # "importers = [" also matches the empty form "importers = []".
        assert "importers = [" in content
        assert "extras/importers" in content
        assert "Cercle 3" in content


# ──────────────────────────────────────────────────────────────────────────
# 8. Originals are thin shims
# ──────────────────────────────────────────────────────────────────────────


class TestOriginalsAreShims:
    @pytest.mark.parametrize("path", [
        "picarones/importers/_http.py",
        "picarones/importers/iiif.py",
        "picarones/importers/htr_united.py",
        "picarones/importers/gallica.py",
        "picarones/importers/huggingface.py",
        "picarones/importers/escriptorium.py",
    ])
    def test_is_thin_shim(self, path):
        repo_root = Path(__file__).parent.parent
        content = (repo_root / path).read_text(encoding="utf-8")
        # Count non-blank lines only.
        n_lines = len([line for line in content.splitlines() if line.strip()])
        assert n_lines < 30, (
            f"{path} has {n_lines} non-blank lines; it should be a thin shim"
        )
        # "déplacé" (French for "moved") is matched literally against the file text.
        assert "déplacé" in content or "extras" in content