refactor(adapters): Sprint A14-S11 - migrate the LLM adapters + 2 corpus importers
Sprint S11 of the targeted rewrite plan. Phase 2 continues.
Physical move (no logic changes) of 8 adapter files
to ``picarones/adapters/``:
- 5 LLM (base + 4 providers) → ``picarones/adapters/llm/``
- 2 corpus importers (htr_united, huggingface) +
1 helper (_fallback_log) → ``picarones/adapters/corpus/``
The old location becomes a re-export so that no consumer
breaks. No test modified.
Migrated (8)
------------
``picarones/adapters/llm/``
- ``base.py`` (BaseLLMAdapter, normalize_llm_content, etc.)
- ``openai_adapter.py``
- ``mistral_adapter.py``
- ``anthropic_adapter.py``
- ``ollama_adapter.py``
``picarones/adapters/corpus/``
- ``_fallback_log.py``
- ``htr_united.py``
- ``huggingface.py``
Internal imports updated
------------------------
The 4 LLM adapters imported ``picarones.llm.base``;
rewritten to ``picarones.adapters.llm.base``.
The 2 corpus importers imported
``picarones.extras.importers._fallback_log`` (lazy imports
inside functions); rewritten to
``picarones.adapters.corpus._fallback_log``.
Re-export mechanism
-------------------
For each migrated file, the old location is a 10-line
re-export. Three files additionally re-expose **private
symbols** imported by the tests:
- ``llm/mistral_adapter.py``: ``_TEXT_ONLY_MODELS``
- ``extras/importers/huggingface.py``: ``_REFERENCE_DATASETS``
- ``adapters/corpus/_fallback_log.py``: shared private helper
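The re-export mechanism can be illustrated with a minimal, self-contained sketch. It is not the content of the actual shim files: the module names ``demo_adapters_llm_mistral`` and ``demo_llm_mistral``, the placeholder class, and the value stored in ``_TEXT_ONLY_MODELS`` are all assumptions made up for the demo. It only shows why a shim that imports names from the new location keeps old import paths working, private names included.

```python
import sys
import types

# Hypothetical stand-in for the new canonical module
# (plays the role of ``picarones.adapters.llm.mistral_adapter``).
canonical = types.ModuleType("demo_adapters_llm_mistral")


class MistralAdapter:  # placeholder class, not the real adapter
    pass


canonical.MistralAdapter = MistralAdapter
canonical._TEXT_ONLY_MODELS = frozenset({"demo-text-only-model"})  # made-up value
sys.modules[canonical.__name__] = canonical

# What a legacy shim might contain: import the public API from the new
# location, plus the private name that tests still import from the old path.
SHIM_SOURCE = """
from demo_adapters_llm_mistral import MistralAdapter, _TEXT_ONLY_MODELS  # noqa: F401

__all__ = ["MistralAdapter"]
"""

legacy = types.ModuleType("demo_llm_mistral")
exec(SHIM_SOURCE, legacy.__dict__)
sys.modules[legacy.__name__] = legacy

# Consumers of the old path receive the very same objects as the new path.
from demo_llm_mistral import MistralAdapter as LegacyAdapter
from demo_llm_mistral import _TEXT_ONLY_MODELS as legacy_private

same_class = LegacyAdapter is canonical.MistralAdapter
same_private = legacy_private is canonical._TEXT_ONLY_MODELS
print(same_class, same_private)
```

Private names have to be imported explicitly in such a shim, because a star import skips names that start with an underscore unless they are listed in ``__all__``.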
Left to migrate (deferred)
--------------------------
**OCR adapters** (5 files: tesseract, pero_ocr, mistral_ocr,
google_vision, azure_doc_intel) stay in
``picarones/engines/``. All of them import ``engines/base.py``,
which inherits from ``core.modules.BaseModule``. Migration deferred
until S20, when ``core.modules`` will have been removed (replaced by
the ``StepExecutor`` protocol from S6).
**Heritage importers** (3 files: iiif, gallica,
escriptorium) stay in ``picarones/extras/importers/``. All of them
import ``core.corpus.{Corpus, Document}``. Migration deferred
until ``core.corpus`` moves to ``domain/`` (dedicated
sprint).
Documented in ``BACKLOG_POST_LIVRAISON.md`` §2.5b.
Budget updates
--------------
``tests/architecture/test_file_budgets.py``:
- ``picarones/adapters/corpus/htr_united.py`` (473 lines)
- ``picarones/adapters/corpus/huggingface.py`` (464 lines)
The old locations stay in the whitelist as re-exports
and keep their previous ceiling.
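For readers unfamiliar with file-budget tests, here is a rough sketch of the idea: a whitelist maps file paths to a maximum line count, and the check reports every file above its ceiling. The real ``test_file_budgets.py`` is not shown on this page, so the function names, the ceilings (500, 400, 10) and the throwaway files below are assumptions for illustration only.

```python
import tempfile
from pathlib import Path


def count_lines(path: Path) -> int:
    """Number of lines in a text file."""
    with path.open(encoding="utf-8") as fh:
        return sum(1 for _ in fh)


def over_budget(root: Path, budgets: dict[str, int]) -> list[str]:
    """Return one message per whitelisted file that exceeds its ceiling."""
    problems = []
    for rel_path, ceiling in sorted(budgets.items()):
        n = count_lines(root / rel_path)
        if n > ceiling:
            problems.append(f"{rel_path}: {n} lines > budget {ceiling}")
    return problems


# Demo on a throwaway tree; file names and ceilings are invented.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "big.py").write_text("x = 1\n" * 473, encoding="utf-8")
    (root / "shim.py").write_text("x = 1\n" * 3, encoding="utf-8")
    ok = over_budget(root, {"big.py": 500, "shim.py": 10})
    too_tight = over_budget(root, {"big.py": 400, "shim.py": 10})

print(ok)
print(too_tight)
```

In a scheme like this, a file that shrinks to a short re-export can keep its old, larger ceiling without failing the check, since only overruns are reported.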
Test suite status
-----------------
``pytest tests/ -q`` → 4163 passed, 8 skipped, 2 failed
(strictly environmental). +1 test vs S10. No regression
from S11.
S11 go/no-go criterion (partial) met
------------------------------------
- 5 LLM adapters migrated cleanly: all go through
``picarones.adapters.llm.*``. The legacy module
``picarones.llm.*`` is now a re-export layer.
- 2 corpus importers (htr_united, huggingface) migrated.
The original plan's criterion "engines/, llm/, extras/ contain only
re-exports" is NOT met (5 OCR adapters + 3 heritage
importers remain legacy). This is a deliberate pragmatic
choice documented in BACKLOG: migrating them first
requires moving ``core.modules`` and ``core.corpus``,
which is out of scope for S11.
Ready for S12 (numerical CER/WER equivalence with the old
runner on fixtures).
https://claude.ai/code/session_011XQZNitg1rCgia8ZD1a2hP
- BACKLOG_POST_LIVRAISON.md +18 -1
- picarones/adapters/corpus/__pycache__/__init__.cpython-311.pyc +0 -0
- picarones/adapters/corpus/__pycache__/_fallback_log.cpython-311.pyc +0 -0
- picarones/adapters/corpus/__pycache__/htr_united.cpython-311.pyc +0 -0
- picarones/adapters/corpus/__pycache__/huggingface.cpython-311.pyc +0 -0
- picarones/adapters/corpus/_fallback_log.py +98 -0
- picarones/adapters/corpus/htr_united.py +473 -0
- picarones/adapters/corpus/huggingface.py +464 -0
- picarones/adapters/llm/anthropic_adapter.py +111 -0
- picarones/adapters/llm/base.py +279 -0
- picarones/adapters/llm/mistral_adapter.py +157 -0
- picarones/adapters/llm/ollama_adapter.py +109 -0
- picarones/adapters/llm/openai_adapter.py +94 -0
- picarones/extras/importers/_fallback_log.py +3 -94
- picarones/extras/importers/htr_united.py +3 -469
- picarones/extras/importers/huggingface.py +6 -459
- picarones/llm/anthropic_adapter.py +7 -108
- picarones/llm/base.py +7 -276
- picarones/llm/mistral_adapter.py +8 -154
- picarones/llm/ollama_adapter.py +7 -106
- picarones/llm/openai_adapter.py +7 -91
- tests/architecture/test_file_budgets.py +5 -1
@@ -126,7 +126,24 @@ exister à la livraison BnF.
 
 → Sprint S5 + S20 of the rewrite.
 
-### 2.
+### 2.5b Migration of the remaining adapters
+
+Sprint S11 migrated 5 LLM (base + openai/mistral/anthropic/ollama)
++ 2 corpus importers (htr_united, huggingface) + 1 private helper
+(_fallback_log). The old location is a re-export.
+
+**OCR adapters** (5 files: tesseract, pero_ocr, mistral_ocr,
+google_vision, azure_doc_intel) stay in `picarones/engines/`.
+All of them import `engines/base.py`, which inherits from `core.modules.BaseModule`.
+Migration deferred until S20, when `core.modules` will have been removed
+(replaced by the `StepExecutor` protocol from S6).
+
+**Heritage importers** (3 files: iiif, gallica, escriptorium)
+stay in `picarones/extras/importers/`. All of them import
+`core.corpus.{Corpus, Document}`. Migration deferred until
+`core.corpus` moves to `domain/` (dedicated sprint).
+
+### 2.5c Migration of the remaining `measurements/*.py` files to `evaluation/metrics/`
 
 Sprint S10 migrated 23 standalone computation files. 17 files
 remain in `picarones/measurements/` to be migrated.
Binary file (892 Bytes).
Binary file (4.83 kB).
Binary file (23.6 kB).
Binary file (21.4 kB).
@@ -0,0 +1,98 @@
+"""In-memory log of importer fallbacks (Sprint A3, item B-3).
+
+When an importer (HuggingFace, HTR-United, Gallica, eScriptorium…)
+switches to degraded mode (network timeout, malformed JSON, corrupted ZIP,
+remote catalogue unavailable…), it records an incident here via
+:func:`record_fallback`. The narrative engine consumes these incidents via
+:func:`consume_fallback_log`, which **empties** the list so that a later
+benchmark does not report the incidents of the previous one.
+
+Deliberately minimal design:
+
+- No disk persistence (incidents are contextual to one run).
+- No complex structure (just a thread-safe ``list[dict]``).
+- The runner / the report can ignore the list without breaking.
+
+The corresponding Fact detector (``FactType.IMPORTER_FALLBACK_TRIGGERED``)
+is implemented in
+:mod:`picarones.measurements.narrative.detectors.history`.
+"""
+
+from __future__ import annotations
+
+import logging
+import threading
+from typing import Any
+
+logger = logging.getLogger(__name__)
+
+_lock = threading.Lock()
+_fallbacks: list[dict[str, Any]] = []
+
+
+def record_fallback(
+    importer: str,
+    operation: str,
+    error: BaseException | None = None,
+    *,
+    extra: dict[str, Any] | None = None,
+) -> None:
+    """Record a degraded-mode incident.
+
+    Also logs via ``logger.warning`` so that an operator sees
+    the incident in real time without depending on the report.
+
+    Parameters
+    ----------
+    importer:
+        Short name of the importer (e.g. ``"huggingface"``, ``"htr_united"``).
+    operation:
+        Short description of the operation (e.g. ``"yaml_catalogue_parse"``,
+        ``"image_save"``, ``"hub_search"``).
+    error:
+        Original exception (used for the log message and stored in
+        the payload as a string, not as the object, to avoid
+        lingering references).
+    extra:
+        Additional fields (remote URL, dataset identifier…) that may
+        be useful to a later Fact detector.
+    """
+    error_repr = repr(error) if error is not None else None
+    logger.warning(
+        "[importers/%s] %s a échoué (mode dégradé) : %s",
+        importer,
+        operation,
+        error_repr,
+    )
+    entry: dict[str, Any] = {
+        "importer": importer,
+        "operation": operation,
+        "error": error_repr,
+    }
+    if extra:
+        entry["extra"] = dict(extra)
+    with _lock:
+        _fallbacks.append(entry)
+
+
+def consume_fallback_log() -> list[dict[str, Any]]:
+    """Return AND empty the list of accumulated incidents.
+
+    The narrative engine calls this function when building
+    the summary to turn each incident into a ``Fact``."""
+    with _lock:
+        out = list(_fallbacks)
+        _fallbacks.clear()
+    return out
+
+
+def peek_fallback_log() -> list[dict[str, Any]]:
+    """Return a copy without clearing; useful for tests."""
+    with _lock:
+        return list(_fallbacks)
+
+
+def reset_fallback_log() -> None:
+    """Empty the list without returning anything; useful for pytest fixtures."""
+    with _lock:
+        _fallbacks.clear()
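The record-then-drain pattern of this module can be exercised in isolation. The sketch below is a condensed standalone restatement written for this page, not the module itself: ``record`` and ``consume`` are shortened, hypothetical names, and the thread count and operation labels are arbitrary. It shows that concurrent writers are all captured and that a second drain returns nothing.

```python
import threading
from typing import Any

_lock = threading.Lock()
_fallbacks: list[dict[str, Any]] = []


def record(importer: str, operation: str) -> None:
    """Append one incident under the lock (condensed stand-in)."""
    with _lock:
        _fallbacks.append({"importer": importer, "operation": operation})


def consume() -> list[dict[str, Any]]:
    """Copy the incidents, clear the shared list, return the copy."""
    with _lock:
        out = list(_fallbacks)
        _fallbacks.clear()
    return out


# Eight writers record concurrently (arbitrary demo values).
threads = [
    threading.Thread(target=record, args=("htr_united", f"op_{i}"))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

first = consume()   # everything recorded so far
second = consume()  # nothing left after the first drain
print(len(first), len(second))
```

Copying and clearing inside the same critical section is what keeps an incident from being lost or reported twice if a writer runs between the two steps.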
@@ -0,0 +1,473 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Import depuis le catalogue HTR-United.
|
| 2 |
+
|
| 3 |
+
HTR-United est un catalogue communautaire de vérités terrain HTR/OCR publiées
|
| 4 |
+
sur GitHub sous licence ouverte. Les métadonnées sont stockées dans un fichier
|
| 5 |
+
YAML (catalogue.yml) sur https://github.com/HTR-United/htr-united.
|
| 6 |
+
|
| 7 |
+
Ce module fournit :
|
| 8 |
+
- :class:`HTRUnitedCatalogue` — chargement et recherche dans le catalogue
|
| 9 |
+
- :func:`fetch_catalogue` — téléchargement du catalogue depuis GitHub
|
| 10 |
+
- :func:`import_htr_united_corpus` — téléchargement et import d'un corpus
|
| 11 |
+
|
| 12 |
+
Exemple
|
| 13 |
+
-------
|
| 14 |
+
catalogue = HTRUnitedCatalogue.from_remote()
|
| 15 |
+
results = catalogue.search("français médiéval")
|
| 16 |
+
corpus = import_htr_united_corpus(results[0], output_dir="./corpus/")
|
| 17 |
+
"""
|
| 18 |
+
|
| 19 |
+
from __future__ import annotations
|
| 20 |
+
|
| 21 |
+
import json
|
| 22 |
+
import logging
|
| 23 |
+
import re
|
| 24 |
+
import urllib.error
|
| 25 |
+
import urllib.request
|
| 26 |
+
from dataclasses import dataclass, field
|
| 27 |
+
from pathlib import Path
|
| 28 |
+
from typing import Optional
|
| 29 |
+
|
| 30 |
+
logger = logging.getLogger(__name__)
|
| 31 |
+
|
| 32 |
+
# ---------------------------------------------------------------------------
|
| 33 |
+
# Catalogue remote URL
|
| 34 |
+
# ---------------------------------------------------------------------------
|
| 35 |
+
|
| 36 |
+
_CATALOGUE_URL = (
|
| 37 |
+
"https://raw.githubusercontent.com/HTR-United/htr-united/master/htr-united.yml"
|
| 38 |
+
)
|
| 39 |
+
_CATALOGUE_API_URL = (
|
| 40 |
+
"https://api.github.com/repos/HTR-United/htr-united/contents/htr-united.yml"
|
| 41 |
+
)
|
| 42 |
+
|
| 43 |
+
# Catalogue de démonstration / fallback (hors-ligne)
|
| 44 |
+
_DEMO_CATALOGUE: list[dict] = [
|
| 45 |
+
{
|
| 46 |
+
"id": "lectaurep-repertoires",
|
| 47 |
+
"title": "Lectaurep — Répertoires de notaires parisiens",
|
| 48 |
+
"url": "https://github.com/HTR-United/lectaurep-repertoires",
|
| 49 |
+
"language": ["French"],
|
| 50 |
+
"script": ["Cursiva"],
|
| 51 |
+
"century": [17, 18],
|
| 52 |
+
"institution": "Archives nationales (France)",
|
| 53 |
+
"description": "Transcriptions de répertoires de notaires, XVIIe-XVIIIe siècles.",
|
| 54 |
+
"license": "CC-BY 4.0",
|
| 55 |
+
"lines": 12400,
|
| 56 |
+
"format": "ALTO",
|
| 57 |
+
"tags": ["notaires", "Paris", "cursive", "imprimé"],
|
| 58 |
+
},
|
| 59 |
+
{
|
| 60 |
+
"id": "bvmm-manuscripts",
|
| 61 |
+
"title": "BVMM — Manuscrits enluminés",
|
| 62 |
+
"url": "https://github.com/HTR-United/bvmm-manuscripts",
|
| 63 |
+
"language": ["Latin", "French"],
|
| 64 |
+
"script": ["Gothic"],
|
| 65 |
+
"century": [13, 14, 15],
|
| 66 |
+
"institution": "IRHT",
|
| 67 |
+
"description": "Manuscrits médiévaux latins et français, XIIIe-XVe siècles.",
|
| 68 |
+
"license": "CC-BY 4.0",
|
| 69 |
+
"lines": 8700,
|
| 70 |
+
"format": "ALTO",
|
| 71 |
+
"tags": ["manuscrits", "latin", "médiéval", "enluminure"],
|
| 72 |
+
},
|
| 73 |
+
{
|
| 74 |
+
"id": "cremma-medieval",
|
| 75 |
+
"title": "CREMMA Médiéval",
|
| 76 |
+
"url": "https://github.com/HTR-United/cremma-medieval",
|
| 77 |
+
"language": ["French", "Latin"],
|
| 78 |
+
"script": ["Gothic", "Humanistica"],
|
| 79 |
+
"century": [12, 13, 14, 15],
|
| 80 |
+
"institution": "École des chartes / Inria",
|
| 81 |
+
"description": "Corpus CREMMA de manuscrits médiévaux français et latins.",
|
| 82 |
+
"license": "CC-BY 4.0",
|
| 83 |
+
"lines": 6200,
|
| 84 |
+
"format": "ALTO",
|
| 85 |
+
"tags": ["médiéval", "chartes", "manuscrits"],
|
| 86 |
+
},
|
| 87 |
+
{
|
| 88 |
+
"id": "simssa-ocr-printed",
|
| 89 |
+
"title": "SIMSSA — Imprimés anciens (XVe-XVIIe)",
|
| 90 |
+
"url": "https://github.com/HTR-United/simssa-printed",
|
| 91 |
+
"language": ["French", "Latin"],
|
| 92 |
+
"script": ["Rotunda", "Roman"],
|
| 93 |
+
"century": [15, 16, 17],
|
| 94 |
+
"institution": "McGill University",
|
| 95 |
+
"description": "Corpus d'imprimés anciens romains et gothiques.",
|
| 96 |
+
"license": "CC-BY 4.0",
|
| 97 |
+
"lines": 4500,
|
| 98 |
+
"format": "PAGE",
|
| 99 |
+
"tags": ["imprimés", "incunables", "roman", "gothique"],
|
| 100 |
+
},
|
| 101 |
+
{
|
| 102 |
+
"id": "fonds-gallica-presse",
|
| 103 |
+
"title": "Presse ancienne — Gallica (XIXe)",
|
| 104 |
+
"url": "https://github.com/HTR-United/gallica-presse-xix",
|
| 105 |
+
"language": ["French"],
|
| 106 |
+
"script": ["Roman"],
|
| 107 |
+
"century": [19],
|
| 108 |
+
"institution": "Gallica",
|
| 109 |
+
"description": "Numérisations de journaux du XIXe siècle (Gallica).",
|
| 110 |
+
"license": "etalab-2.0",
|
| 111 |
+
"lines": 31000,
|
| 112 |
+
"format": "ALTO",
|
| 113 |
+
"tags": ["presse", "XIXe", "Gallica", "journaux"],
|
| 114 |
+
},
|
| 115 |
+
{
|
| 116 |
+
"id": "archives-departem-correspondances",
|
| 117 |
+
"title": "Correspondances administratives (XVIIIe-XIXe)",
|
| 118 |
+
"url": "https://github.com/HTR-United/correspondances-admin",
|
| 119 |
+
"language": ["French"],
|
| 120 |
+
"script": ["Cursiva"],
|
| 121 |
+
"century": [18, 19],
|
| 122 |
+
"institution": "Archives départementales",
|
| 123 |
+
"description": "Lettres et correspondances administratives manuscrites.",
|
| 124 |
+
"license": "CC-BY 4.0",
|
| 125 |
+
"lines": 9800,
|
| 126 |
+
"format": "ALTO",
|
| 127 |
+
"tags": ["correspondances", "administratif", "cursive"],
|
| 128 |
+
},
|
| 129 |
+
{
|
| 130 |
+
"id": "e-codices-latin",
|
| 131 |
+
"title": "e-codices — Manuscrits latins (Suisse)",
|
| 132 |
+
"url": "https://github.com/HTR-United/e-codices-latin",
|
| 133 |
+
"language": ["Latin"],
|
| 134 |
+
"script": ["Caroline", "Gothic"],
|
| 135 |
+
"century": [9, 10, 11, 12],
|
| 136 |
+
"institution": "Bibliothèque cantonale universitaire de Lausanne",
|
| 137 |
+
"description": "Manuscrits carolingiens et gothiques des bibliothèques suisses.",
|
| 138 |
+
"license": "CC-BY 4.0",
|
| 139 |
+
"lines": 3100,
|
| 140 |
+
"format": "ALTO",
|
| 141 |
+
"tags": ["caroline", "latin", "médiéval", "Suisse"],
|
| 142 |
+
},
|
| 143 |
+
{
|
| 144 |
+
"id": "registres-paroissiaux-17",
|
| 145 |
+
"title": "Registres paroissiaux — Bretagne (XVIIe)",
|
| 146 |
+
"url": "https://github.com/HTR-United/registres-paroissiaux-bretagne",
|
| 147 |
+
"language": ["French", "Latin"],
|
| 148 |
+
"script": ["Cursiva"],
|
| 149 |
+
"century": [17],
|
| 150 |
+
"institution": "Archives départementales du Finistère",
|
| 151 |
+
"description": "Registres paroissiaux bretons du XVIIe siècle.",
|
| 152 |
+
"license": "CC-BY 4.0",
|
| 153 |
+
"lines": 15600,
|
| 154 |
+
"format": "ALTO",
|
| 155 |
+
"tags": ["registres", "Bretagne", "paroissial", "cursive"],
|
| 156 |
+
},
|
| 157 |
+
]
|
| 158 |
+
|
| 159 |
+
|
| 160 |
+
# ---------------------------------------------------------------------------
|
| 161 |
+
# Dataclass entrée catalogue
|
| 162 |
+
# ---------------------------------------------------------------------------
|
| 163 |
+
|
| 164 |
+
@dataclass
|
| 165 |
+
class HTRUnitedEntry:
|
| 166 |
+
"""Une entrée dans le catalogue HTR-United."""
|
| 167 |
+
|
| 168 |
+
id: str
|
| 169 |
+
title: str
|
| 170 |
+
url: str
|
| 171 |
+
language: list[str] = field(default_factory=list)
|
| 172 |
+
script: list[str] = field(default_factory=list)
|
| 173 |
+
century: list[int] = field(default_factory=list)
|
| 174 |
+
institution: str = ""
|
| 175 |
+
description: str = ""
|
| 176 |
+
license: str = ""
|
| 177 |
+
lines: int = 0
|
| 178 |
+
format: str = "ALTO"
|
| 179 |
+
tags: list[str] = field(default_factory=list)
|
| 180 |
+
|
| 181 |
+
def as_dict(self) -> dict:
|
| 182 |
+
return {
|
| 183 |
+
"id": self.id,
|
| 184 |
+
"title": self.title,
|
| 185 |
+
"url": self.url,
|
| 186 |
+
"language": self.language,
|
| 187 |
+
"script": self.script,
|
| 188 |
+
"century": self.century,
|
| 189 |
+
"institution": self.institution,
|
| 190 |
+
"description": self.description,
|
| 191 |
+
"license": self.license,
|
| 192 |
+
"lines": self.lines,
|
| 193 |
+
"format": self.format,
|
| 194 |
+
"tags": self.tags,
|
| 195 |
+
}
|
| 196 |
+
|
| 197 |
+
@classmethod
|
| 198 |
+
def from_dict(cls, d: dict) -> "HTRUnitedEntry":
|
| 199 |
+
return cls(
|
| 200 |
+
id=d.get("id", ""),
|
| 201 |
+
title=d.get("title", ""),
|
| 202 |
+
url=d.get("url", ""),
|
| 203 |
+
language=d.get("language", []),
|
| 204 |
+
script=d.get("script", []),
|
| 205 |
+
century=d.get("century", []),
|
| 206 |
+
institution=d.get("institution", ""),
|
| 207 |
+
description=d.get("description", ""),
|
| 208 |
+
license=d.get("license", ""),
|
| 209 |
+
lines=d.get("lines", 0),
|
| 210 |
+
format=d.get("format", "ALTO"),
|
| 211 |
+
tags=d.get("tags", []),
|
| 212 |
+
)
|
| 213 |
+
|
| 214 |
+
@property
|
| 215 |
+
def century_str(self) -> str:
|
| 216 |
+
"""Siècles formatés en chiffres romains."""
|
| 217 |
+
roman = {
|
| 218 |
+
1: "Ier", 2: "IIe", 3: "IIIe", 4: "IVe", 5: "Ve",
|
| 219 |
+
6: "VIe", 7: "VIIe", 8: "VIIIe", 9: "IXe", 10: "Xe",
|
| 220 |
+
11: "XIe", 12: "XIIe", 13: "XIIIe", 14: "XIVe", 15: "XVe",
|
| 221 |
+
16: "XVIe", 17: "XVIIe", 18: "XVIIIe", 19: "XIXe", 20: "XXe",
|
| 222 |
+
}
|
| 223 |
+
return ", ".join(roman.get(c, f"{c}e") for c in self.century)
|
| 224 |
+
|
| 225 |
+
|
| 226 |
+
# ---------------------------------------------------------------------------
|
| 227 |
+
# Catalogue
|
| 228 |
+
# ---------------------------------------------------------------------------
|
| 229 |
+
|
| 230 |
+
class HTRUnitedCatalogue:
|
| 231 |
+
"""Catalogue HTR-United avec recherche et filtrage."""
|
| 232 |
+
|
| 233 |
+
def __init__(self, entries: list[HTRUnitedEntry], source: str = "demo") -> None:
|
| 234 |
+
self.entries = entries
|
| 235 |
+
self.source = source # "remote" | "demo" | "cache"
|
| 236 |
+
|
| 237 |
+
def __len__(self) -> int:
|
| 238 |
+
return len(self.entries)
|
| 239 |
+
|
| 240 |
+
@classmethod
|
| 241 |
+
def from_demo(cls) -> "HTRUnitedCatalogue":
|
| 242 |
+
"""Charge le catalogue de démonstration intégré."""
|
| 243 |
+
entries = [HTRUnitedEntry.from_dict(d) for d in _DEMO_CATALOGUE]
|
| 244 |
+
return cls(entries, source="demo")
|
| 245 |
+
|
| 246 |
+
@classmethod
|
| 247 |
+
def from_remote(cls, timeout: int = 10) -> "HTRUnitedCatalogue":
|
| 248 |
+
"""Télécharge le catalogue depuis GitHub.
|
| 249 |
+
|
| 250 |
+
En cas d'erreur réseau, retourne le catalogue de démonstration.
|
| 251 |
+
"""
|
| 252 |
+
try:
|
| 253 |
+
req = urllib.request.Request(
|
| 254 |
+
_CATALOGUE_URL,
|
| 255 |
+
headers={"User-Agent": "picarones-htr-united-importer/1.0"},
|
| 256 |
+
)
|
| 257 |
+
with urllib.request.urlopen(req, timeout=timeout) as resp:
|
| 258 |
+
raw = resp.read().decode("utf-8")
|
| 259 |
+
entries = _parse_yml_catalogue(raw)
|
| 260 |
+
return cls(entries, source="remote")
|
| 261 |
+
except (urllib.error.URLError, Exception) as exc:
|
| 262 |
+
# Fallback démo avec avertissement
|
| 263 |
+
logger.warning(
|
| 264 |
+
"[HTR-United] impossible de charger le catalogue distant (%s) : %s. "
|
| 265 |
+
"Utilisation des données de démonstration.",
|
| 266 |
+
_CATALOGUE_URL, exc,
|
| 267 |
+
)
|
| 268 |
+
return cls.from_demo()
|
| 269 |
+
|
| 270 |
+
def search(
|
| 271 |
+
self,
|
| 272 |
+
query: str = "",
|
| 273 |
+
language: Optional[str] = None,
|
| 274 |
+
script: Optional[str] = None,
|
| 275 |
+
century_min: Optional[int] = None,
|
| 276 |
+
century_max: Optional[int] = None,
|
| 277 |
+
) -> list[HTRUnitedEntry]:
|
| 278 |
+
"""Recherche dans le catalogue avec filtres optionnels."""
|
| 279 |
+
results = self.entries
|
| 280 |
+
|
| 281 |
+
if query:
|
| 282 |
+
q = query.lower()
|
| 283 |
+
results = [
|
| 284 |
+
e for e in results
|
| 285 |
+
if (q in e.title.lower()
|
| 286 |
+
or q in e.description.lower()
|
| 287 |
+
or q in e.institution.lower()
|
| 288 |
+
or any(q in t.lower() for t in e.tags)
|
| 289 |
+
or any(q in lang.lower() for lang in e.language))
|
| 290 |
+
]
|
| 291 |
+
|
| 292 |
+
if language:
|
| 293 |
+
lang_lower = language.lower()
|
| 294 |
+
results = [
|
| 295 |
+
e for e in results
|
| 296 |
+
if any(lang_lower in lg.lower() for lg in e.language)
|
| 297 |
+
]
|
| 298 |
+
|
| 299 |
+
if script:
|
| 300 |
+
sc_lower = script.lower()
|
| 301 |
+
results = [
|
| 302 |
+
e for e in results
|
| 303 |
+
if any(sc_lower in s.lower() for s in e.script)
|
| 304 |
+
]
|
| 305 |
+
|
| 306 |
+
if century_min is not None:
|
| 307 |
+
results = [
|
| 308 |
+
e for e in results
|
| 309 |
+
if any(c >= century_min for c in e.century)
|
| 310 |
+
]
|
| 311 |
+
|
| 312 |
+
if century_max is not None:
|
| 313 |
+
results = [
|
| 314 |
+
e for e in results
|
| 315 |
+
if any(c <= century_max for c in e.century)
|
| 316 |
+
]
|
| 317 |
+
|
| 318 |
+
return results
|
| 319 |
+
|
| 320 |
+
def get_by_id(self, entry_id: str) -> Optional[HTRUnitedEntry]:
|
| 321 |
+
"""Retourne une entrée par son identifiant."""
|
| 322 |
+
for e in self.entries:
|
| 323 |
+
if e.id == entry_id:
|
| 324 |
+
return e
|
| 325 |
+
return None
|
| 326 |
+
|
| 327 |
+
def available_languages(self) -> list[str]:
|
| 328 |
+
seen: set[str] = set()
|
| 329 |
+
result: list[str] = []
|
| 330 |
+
for e in self.entries:
|
| 331 |
+
for lang in e.language:
|
| 332 |
+
if lang not in seen:
|
| 333 |
+
seen.add(lang)
|
| 334 |
+
result.append(lang)
|
| 335 |
+
return sorted(result)
|
| 336 |
+
|
| 337 |
+
def available_scripts(self) -> list[str]:
|
| 338 |
+
seen: set[str] = set()
|
| 339 |
+
result: list[str] = []
|
| 340 |
+
for e in self.entries:
|
| 341 |
+
for sc in e.script:
|
| 342 |
+
if sc not in seen:
|
| 343 |
+
seen.add(sc)
|
| 344 |
+
result.append(sc)
|
| 345 |
+
return sorted(result)
|
| 346 |
+
|
| 347 |
+
|
| 348 |
+
# ---------------------------------------------------------------------------
|
| 349 |
+
# Import de corpus
|
| 350 |
+
# ---------------------------------------------------------------------------
|
| 351 |
+
|
| 352 |
+
def import_htr_united_corpus(
|
| 353 |
+
entry: HTRUnitedEntry,
|
| 354 |
+
output_dir: str | Path,
|
| 355 |
+
max_samples: int = 100,
|
| 356 |
+
show_progress: bool = True,
|
| 357 |
+
) -> dict:
|
| 358 |
+
"""Importe un corpus HTR-United dans un dossier local.
|
| 359 |
+
|
| 360 |
+
Retourne un dict avec les métadonnées de l'import.
|
| 361 |
+
Note : en l'absence d'accès réseau au dépôt GitHub, génère des fichiers
|
| 362 |
+
placeholder (pour tests et démo).
|
| 363 |
+
"""
|
| 364 |
+
output_path = Path(output_dir)
|
| 365 |
+
output_path.mkdir(parents=True, exist_ok=True)
|
| 366 |
+
|
| 367 |
+
# Sauvegarder les métadonnées
|
| 368 |
+
meta = {
|
| 369 |
+
"source": "htr-united",
|
| 370 |
+
"entry_id": entry.id,
|
| 371 |
+
"title": entry.title,
|
| 372 |
+
"url": entry.url,
|
| 373 |
+
"language": entry.language,
|
| 374 |
+
"script": entry.script,
|
| 375 |
+
"century": entry.century,
|
| 376 |
+
"institution": entry.institution,
|
| 377 |
+
"license": entry.license,
|
| 378 |
+
"format": entry.format,
|
| 379 |
+
"imported_at": _iso_now(),
|
| 380 |
+
}
|
| 381 |
+
(output_path / "htr_united_meta.json").write_text(
|
| 382 |
+
json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8"
|
| 383 |
+
)
|
| 384 |
+
|
| 385 |
+
# Essai de téléchargement réel depuis GitHub (archive releases)
|
| 386 |
+
downloaded = _try_download_corpus(entry, output_path, max_samples, show_progress)
|
| 387 |
+
|
| 388 |
+
return {
|
| 389 |
+
"entry_id": entry.id,
|
| 390 |
+
"title": entry.title,
|
| 391 |
+
"output_dir": str(output_path),
|
| 392 |
+
"files_imported": downloaded,
|
| 393 |
+
"metadata_file": str(output_path / "htr_united_meta.json"),
|
| 394 |
+
}
|
| 395 |
+
|
| 396 |
+
|
| 397 |
+
def _try_download_corpus(
|
| 398 |
+
entry: HTRUnitedEntry,
|
| 399 |
+
output_path: Path,
|
| 400 |
+
max_samples: int,
|
| 401 |
+
show_progress: bool,
|
| 402 |
+
) -> int:
|
| 403 |
+
"""Tente de télécharger le corpus depuis GitHub. Retourne le nombre de fichiers importés."""
|
| 404 |
+
# Construit l'URL de l'archive ZIP du dépôt GitHub
|
| 405 |
+
repo_path = _extract_github_repo(entry.url)
|
| 406 |
+
if not repo_path:
|
| 407 |
+
return 0
|
| 408 |
+
|
| 409 |
+
zip_url = f"https://github.com/{repo_path}/archive/refs/heads/main.zip"
|
| 410 |
+
try:
|
| 411 |
+
req = urllib.request.Request(
|
| 412 |
+
zip_url,
|
| 413 |
+
headers={"User-Agent": "picarones-htr-united-importer/1.0"},
|
| 414 |
+
)
|
| 415 |
+
with urllib.request.urlopen(req, timeout=30) as resp:
|
| 416 |
+
import io
|
| 417 |
+
import zipfile
|
| 418 |
+
|
| 419 |
+
data = resp.read()
|
| 420 |
+
with zipfile.ZipFile(io.BytesIO(data)) as zf:
|
| 421 |
+
# Extraire les fichiers ALTO/PAGE/GT
|
| 422 |
+
gt_files = [
|
| 423 |
+
n for n in zf.namelist()
|
| 424 |
+
if n.endswith((".alto.xml", ".page.xml", ".gt.txt", ".xml"))
|
| 425 |
+
and not n.endswith("/")
|
| 426 |
+
][:max_samples]
|
| 427 |
+
for i, fname in enumerate(gt_files):
|
| 428 |
+
dest = output_path / Path(fname).name
|
| 429 |
+
dest.write_bytes(zf.read(fname))
|
| 430 |
+
return len(gt_files)
|
| 431 |
+
except Exception as exc: # noqa: BLE001 — large surface (réseau, ZIP, FS)
|
| 432 |
+
# Sprint A3 (B-3) : on documente l'incident plutôt que de le
|
| 433 |
+
# masquer ; le caller reçoit toujours 0 pour préserver le
|
| 434 |
+
# contrat numérique de retour.
|
| 435 |
+
from picarones.adapters.corpus._fallback_log import record_fallback
|
| 436 |
+
record_fallback(
|
| 437 |
+
importer="htr_united",
|
| 438 |
+
operation="download_zip_samples",
|
| 439 |
+
error=exc,
|
| 440 |
+
extra={"output_path": str(output_path)},
|
| 441 |
+
)
|
| 442 |
+
return 0
|
| 443 |
+
|
| 444 |
+
|
| 445 |
+
def _extract_github_repo(url: str) -> Optional[str]:
|
| 446 |
+
"""Extrait 'owner/repo' depuis une URL GitHub."""
|
| 447 |
+
m = re.match(r"https?://github\.com/([^/]+/[^/]+?)(?:\.git)?/?$", url)
|
| 448 |
+
return m.group(1) if m else None


def _parse_yml_catalogue(raw: str) -> list[HTRUnitedEntry]:
    """Rudimentary parser for the HTR-United YAML catalogue."""
    try:
        import yaml
        data = yaml.safe_load(raw)
        if isinstance(data, list):
            return [HTRUnitedEntry.from_dict(d) for d in data if isinstance(d, dict)]
    except Exception as exc:  # noqa: BLE001 (yaml + parsing of user-supplied data)
        # Sprint A3 (B-3): a malformed YAML used to switch to demo mode
        # without the user being told; log it and emit a Fact so that
        # the report summary mentions the incident.
        from picarones.adapters.corpus._fallback_log import record_fallback
        record_fallback(
            importer="htr_united",
            operation="yaml_catalogue_parse",
            error=exc,
        )
    return [HTRUnitedEntry.from_dict(d) for d in _DEMO_CATALOGUE]


def _iso_now() -> str:
    from datetime import datetime, timezone
    return datetime.now(timezone.utc).isoformat(timespec="seconds")
|
|
@@ -0,0 +1,464 @@
"""Import OCR/HTR datasets from the HuggingFace Hub.

⚠ **Status: experimental** (phase C of the three-circle refactoring effort).
The HuggingFace ``datasets`` API changes frequently and this module has no
integration tests. Use at your own risk until an institutional use case
validates its behaviour. A ``UserWarning`` is emitted at import time as a
reminder.

This module provides:
- :class:`HuggingFaceDataset`: metadata for a HuggingFace dataset
- :class:`HuggingFaceImporter`: dataset search and import
- :func:`search_hf_datasets`: tag-based search through the HuggingFace API
- :func:`import_hf_dataset`: download of a dataset into a local folder

Reference heritage datasets are pre-registered so that they can be
discovered quickly without a network request.

Example
-------
    importer = HuggingFaceImporter()
    results = importer.search("medieval OCR", tags=["ocr"])
    corpus = importer.import_dataset(results[0].dataset_id, output_dir="./corpus/")
"""

from __future__ import annotations

import json
import os
import urllib.error
import urllib.parse
import urllib.request
import warnings
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


# Emit the ``experimental`` warning at import time. Phase C of the
# refactoring effort; see the module docstring above.
warnings.warn(
    "picarones.extras.importers.huggingface is experimental and may "
    "change or be removed without notice. Use at your own risk until "
    "an institutional use case validates the API.",
    category=UserWarning,
    stacklevel=2,
)

# ---------------------------------------------------------------------------
# Pre-registered reference datasets
# ---------------------------------------------------------------------------

_REFERENCE_DATASETS: list[dict] = [
    {
        "dataset_id": "Teklia/RIMES",
        "title": "RIMES — Reconnaissance et Indexation de données Manuscrites et de fac-similEs",
        "description": "Corpus de courriers manuscrits français modernes. Standard de référence pour la reconnaissance d'écriture manuscrite.",
        "language": ["French"],
        "tags": ["htr", "ocr", "handwritten", "french", "modern"],
        "license": "cc-by-4.0",
        "size_category": "1K<n<10K",
        "task": "image-to-text",
        "institution": "IRISA / A2iA",
        "downloads": 1200,
    },
    {
        "dataset_id": "Teklia/IAM",
        "title": "IAM Handwriting Database",
        "description": "Corpus de référence anglais pour la reconnaissance d'écriture manuscrite.",
        "language": ["English"],
        "tags": ["htr", "ocr", "handwritten", "english"],
        "license": "other",
        "size_category": "10K<n<100K",
        "task": "image-to-text",
        "institution": "University of Bern",
        "downloads": 8400,
    },
    {
        "dataset_id": "CATMuS/medieval",
        "title": "CATMuS Medieval — Consistent Approaches to Transcribing ManuScripts",
        "description": "Dataset multilingue de manuscrits médiévaux (latin, français, occitan, espagnol) pour l'entraînement de modèles HTR.",
        "language": ["Latin", "French", "Occitan", "Spanish"],
        "tags": ["htr", "medieval", "manuscripts", "latin", "french", "historical"],
        "license": "cc-by-4.0",
        "size_category": "100K<n<1M",
        "task": "image-to-text",
        "institution": "Inria / EPHE",
        "downloads": 3100,
    },
    {
        "dataset_id": "htr-united/cremma-medieval",
        "title": "CREMMA Medieval",
        "description": "Corpus de manuscrits médiévaux français XIIe-XVe siècles.",
        "language": ["French", "Latin"],
        "tags": ["htr", "medieval", "french", "manuscripts", "htr-united"],
        "license": "cc-by-4.0",
        "size_category": "1K<n<10K",
        "task": "image-to-text",
        "institution": "Inria",
        "downloads": 520,
    },
    {
        "dataset_id": "biglam/europeana_newspapers",
        "title": "Europeana Newspapers",
        "description": "Journaux numérisés européens du XIXe siècle (OCR + images).",
        "language": ["French", "German", "Dutch", "Finnish"],
        "tags": ["ocr", "newspapers", "historical", "19th-century", "europeana"],
        "license": "cc0-1.0",
        "size_category": "1M<n<10M",
        "task": "image-to-text",
        "institution": "Europeana Foundation",
        "downloads": 15200,
    },
    {
        "dataset_id": "stefanklut/esposalles",
        "title": "Esposalles Dataset",
        "description": "Registres de mariage catalans du XVIIe siècle pour la reconnaissance d'écriture historique.",
        "language": ["Catalan", "Latin"],
        "tags": ["htr", "historical", "registers", "catalan", "17th-century"],
        "license": "cc-by-4.0",
        "size_category": "1K<n<10K",
        "task": "image-to-text",
        "institution": "Universitat Autònoma de Barcelona",
        "downloads": 340,
    },
    {
        "dataset_id": "bnf-gallica/gallica-ocr",
        "title": "Gallica OCR",
        "description": "Extraits d'imprimés anciens numérisés depuis Gallica avec vérité terrain.",
        "language": ["French", "Latin"],
        "tags": ["ocr", "historical", "printed", "gallica", "french"],
        "license": "etalab-2.0",
        "size_category": "10K<n<100K",
        "task": "image-to-text",
        "institution": "Gallica",
        "downloads": 2800,
    },
    {
        "dataset_id": "Bozen-Baptism/baptism-records",
        "title": "Bozen Baptism Records",
        "description": "Registres de baptêmes de Bozen (Italie/Autriche) du XVIIIe siècle.",
        "language": ["German", "Latin"],
        "tags": ["htr", "historical", "registers", "german", "latin", "18th-century"],
        "license": "cc-by-4.0",
        "size_category": "1K<n<10K",
        "task": "image-to-text",
        "institution": "University of Innsbruck",
        "downloads": 190,
    },
    {
        "dataset_id": "read-bad/readbad",
        "title": "READ-BAD — Recognition and Enrichment of Archival Documents",
        "description": "Corpus multilingue de documents d'archives pour l'OCR historique (Latin, Allemand, Anglais).",
        "language": ["German", "English", "Latin"],
        "tags": ["ocr", "htr", "historical", "archives", "read"],
        "license": "cc-by-4.0",
        "size_category": "10K<n<100K",
        "task": "image-to-text",
        "institution": "University of Graz",
        "downloads": 1050,
    },
]

# ---------------------------------------------------------------------------
# Dataclass
# ---------------------------------------------------------------------------

@dataclass
class HuggingFaceDataset:
    """Metadata for a HuggingFace dataset."""

    dataset_id: str
    title: str
    description: str = ""
    language: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)
    license: str = ""
    size_category: str = ""
    task: str = "image-to-text"
    institution: str = ""
    downloads: int = 0
    source: str = "reference"  # "reference" | "api"

    def as_dict(self) -> dict:
        return {
            "dataset_id": self.dataset_id,
            "title": self.title,
            "description": self.description,
            "language": self.language,
            "tags": self.tags,
            "license": self.license,
            "size_category": self.size_category,
            "task": self.task,
            "institution": self.institution,
            "downloads": self.downloads,
            "source": self.source,
        }

    @classmethod
    def from_dict(cls, d: dict) -> "HuggingFaceDataset":
        return cls(
            dataset_id=d.get("dataset_id", d.get("id", "")),
            title=d.get("title", d.get("dataset_id", "")),
            description=d.get("description", ""),
            language=d.get("language", []),
            tags=d.get("tags", []),
            license=d.get("license", ""),
            # Fall back to the first ``cardData.size_categories`` entry
            # when no explicit ``size_category`` key is present.
            size_category=d.get(
                "size_category",
                d.get("cardData", {}).get("size_categories", [""])[0]
                if isinstance(d.get("cardData"), dict)
                else "",
            ),
            task=d.get("task", "image-to-text"),
            institution=d.get("institution", ""),
            downloads=d.get("downloads", d.get("downloadsAllTime", 0)),
            source=d.get("source", "api"),
        )

    @property
    def hf_url(self) -> str:
        return f"https://huggingface.co/datasets/{self.dataset_id}"


# ---------------------------------------------------------------------------
# Main importer
# ---------------------------------------------------------------------------

class HuggingFaceImporter:
    """Search and import datasets from the HuggingFace Hub."""

    _API_BASE = "https://huggingface.co/api"

    def __init__(self, token: Optional[str] = None) -> None:
        self._token = token or os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_TOKEN")

    def _headers(self) -> dict:
        h = {"User-Agent": "picarones-hf-importer/1.0"}
        if self._token:
            h["Authorization"] = f"Bearer {self._token}"
        return h

    def search(
        self,
        query: str = "",
        tags: Optional[list[str]] = None,
        language: Optional[str] = None,
        limit: int = 20,
        use_reference: bool = True,
    ) -> list[HuggingFaceDataset]:
        """Search datasets with filters.

        Queries the built-in reference datasets first, then the
        HuggingFace API if it is reachable.
        """
        results: list[HuggingFaceDataset] = []

        # Reference datasets
        if use_reference:
            ref_results = self._search_reference(query, tags, language)
            results.extend(ref_results)

        # HuggingFace API (optional; a failure does not raise)
        try:
            api_results = self._search_api(query, tags, language, limit)
            # Deduplicate (reference entries take priority)
            existing_ids = {r.dataset_id for r in results}
            for ds in api_results:
                if ds.dataset_id not in existing_ids:
                    results.append(ds)
                    existing_ids.add(ds.dataset_id)
        except Exception as exc:  # noqa: BLE001 (network / third-party API)
            # Sprint A3 (B-3): when the API search failed silently, the
            # user only saw the reference datasets and assumed the API
            # had no results. Record the incident.
            from picarones.adapters.corpus._fallback_log import record_fallback
            record_fallback(
                importer="huggingface",
                operation="hub_search_api",
                error=exc,
                extra={"query": query, "language": language, "limit": limit},
            )

        return results[:limit]

    def _search_reference(
        self,
        query: str,
        tags: Optional[list[str]],
        language: Optional[str],
    ) -> list[HuggingFaceDataset]:
        datasets = [HuggingFaceDataset.from_dict(d) for d in _REFERENCE_DATASETS]
        datasets = [ds._replace_source("reference") for ds in datasets]

        if query:
            q = query.lower()
            datasets = [
                ds for ds in datasets
                if (q in ds.title.lower()
                    or q in ds.description.lower()
                    or q in ds.dataset_id.lower()
                    or any(q in t.lower() for t in ds.tags)
                    or any(q in lg.lower() for lg in ds.language))
            ]

        if tags:
            for tag in tags:
                t_lower = tag.lower()
                datasets = [
                    ds for ds in datasets
                    if any(t_lower in dt.lower() for dt in ds.tags)
                ]

        if language:
            lang_lower = language.lower()
            datasets = [
                ds for ds in datasets
                if any(lang_lower in lg.lower() for lg in ds.language)
            ]

        return datasets

    def _search_api(
        self,
        query: str,
        tags: Optional[list[str]],
        language: Optional[str],
        limit: int,
    ) -> list[HuggingFaceDataset]:
        params: dict[str, str] = {
            "task_categories": "image-to-text",
            "limit": str(min(limit, 50)),
            "full": "False",
        }
        if query:
            params["search"] = query
        if language:
            params["language"] = language
        if tags:
            params["tags"] = ",".join(tags)

        url = f"{self._API_BASE}/datasets?" + urllib.parse.urlencode(params)
        req = urllib.request.Request(url, headers=self._headers())
        with urllib.request.urlopen(req, timeout=10) as resp:
            data = json.loads(resp.read().decode("utf-8"))

        results = []
        for item in data if isinstance(data, list) else []:
            ds = HuggingFaceDataset(
                dataset_id=item.get("id", ""),
                title=item.get("id", ""),
                description=item.get("description", ""),
                language=item.get("language", []),
                tags=item.get("tags", []),
                license=item.get("license", ""),
                size_category=(
                    item.get("cardData", {}).get("size_categories", [""])[0]
                    if isinstance(item.get("cardData"), dict)
                    else ""
                ),
                task="image-to-text",
                downloads=item.get("downloadsAllTime", 0),
                source="api",
            )
            if ds.dataset_id:
                results.append(ds)
        return results

    def import_dataset(
        self,
        dataset_id: str,
        output_dir: str | Path,
        split: str = "train",
        max_samples: int = 100,
        show_progress: bool = True,
    ) -> dict:
        """Import a dataset from HuggingFace into a local folder.

        Returns the metadata of the import.
        """
        output_path = Path(output_dir)
        output_path.mkdir(parents=True, exist_ok=True)

        meta = {
            "source": "huggingface",
            "dataset_id": dataset_id,
            "split": split,
            "max_samples": max_samples,
            "imported_at": _iso_now(),
        }
        meta_file = output_path / "huggingface_meta.json"
        meta_file.write_text(json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8")

        # Try to import through the ``datasets`` library when it is available
        files_imported = _try_import_with_datasets_lib(
            dataset_id, output_path, split, max_samples, show_progress
        )

        return {
            "dataset_id": dataset_id,
            "output_dir": str(output_path),
            "files_imported": files_imported,
            "metadata_file": str(meta_file),
        }
|
| 399 |
+
|
| 400 |
+
|
| 401 |
+
def _try_import_with_datasets_lib(
|
| 402 |
+
dataset_id: str,
|
| 403 |
+
output_path: Path,
|
| 404 |
+
split: str,
|
| 405 |
+
max_samples: int,
|
| 406 |
+
show_progress: bool,
|
| 407 |
+
) -> int:
|
| 408 |
"""Try to import using the HuggingFace `datasets` library."""
|
| 409 |
+
try:
|
| 410 |
+
from datasets import load_dataset # type: ignore
|
| 411 |
+
|
| 412 |
+
ds = load_dataset(dataset_id, split=split, streaming=True)
|
| 413 |
+
count = 0
|
| 414 |
+
for i, item in enumerate(ds):
|
| 415 |
+
if i >= max_samples:
|
| 416 |
+
break
|
| 417 |
+
# Look for the image and text fields
|
| 418 |
+
image = item.get("image") or item.get("img")
|
| 419 |
+
text = item.get("text") or item.get("transcription") or item.get("ground_truth", "")
|
| 420 |
+
|
| 421 |
+
if image is not None:
|
| 422 |
+
img_file = output_path / f"doc_{i:04d}.jpg"
|
| 423 |
+
try:
|
| 424 |
+
image.save(str(img_file))
|
| 425 |
+
except Exception as exc: # noqa: BLE001 — PIL/PIL-IO
|
| 426 |
+
# Sprint A3 (B-3): a failed image save would leave an orphan GT
# (text without image). Record the incident and carry on; the GT
# is still written so that the sample counter stays consistent.
|
| 430 |
+
from picarones.adapters.corpus._fallback_log import record_fallback
|
| 431 |
+
record_fallback(
|
| 432 |
+
importer="huggingface",
|
| 433 |
+
operation="image_save",
|
| 434 |
+
error=exc,
|
| 435 |
+
extra={"img_file": str(img_file), "doc_index": i},
|
| 436 |
+
)
|
| 437 |
+
|
| 438 |
+
gt_file = output_path / f"doc_{i:04d}.gt.txt"
|
| 439 |
+
gt_file.write_text(str(text), encoding="utf-8")
|
| 440 |
+
count += 1
|
| 441 |
+
|
| 442 |
+
return count
|
| 443 |
+
except Exception:  # noqa: BLE001 (ImportError is already covered by Exception)
|
| 444 |
+
return 0


def _iso_now() -> str:
    from datetime import datetime, timezone
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


# ---------------------------------------------------------------------------
# HuggingFaceDataset extension (private helper)
# ---------------------------------------------------------------------------

def _patch_dataset_replace_source() -> None:
    """Attach a _replace_source helper to HuggingFaceDataset."""
    def _replace_source(self, source: str) -> "HuggingFaceDataset":
        from dataclasses import replace
        return replace(self, source=source)
    HuggingFaceDataset._replace_source = _replace_source


_patch_dataset_replace_source()
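The `_replace_source` helper attached above is a thin wrapper around `dataclasses.replace`. As a minimal sketch of what that call does, the example below uses a reduced stand-in dataclass (`DatasetStub` is hypothetical, not part of the module): `replace` returns a new instance with the given fields overridden and leaves the original untouched, while unchanged field values, such as the `tags` list, are shared between the two objects.

```python
from __future__ import annotations

from dataclasses import dataclass, field, replace


@dataclass
class DatasetStub:
    """Reduced stand-in for HuggingFaceDataset (illustration only)."""

    dataset_id: str
    tags: list[str] = field(default_factory=list)
    source: str = "api"


original = DatasetStub("Teklia/RIMES", tags=["htr"])

# A new instance is built; only ``source`` differs.
tagged = replace(original, source="reference")
```

Calling `replace(ds, source="reference")` directly at the use site would give the same result as the patched method, at the cost of one extra import in the caller.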
|
|
@@ -0,0 +1,111 @@
"""LLM adapter: Anthropic (Claude Sonnet, Claude Haiku)."""

from __future__ import annotations

import logging
import os
from typing import Optional

from picarones.adapters.llm.base import (
    BaseLLMAdapter,
    log_http_error,
    normalize_llm_content,
)

logger = logging.getLogger(__name__)


class AnthropicAdapter(BaseLLMAdapter):
    """Adapter for the Anthropic Claude models.

    The API key is read from the ``ANTHROPIC_API_KEY`` environment variable.

    Supported modes: text_only, text_and_image, zero_shot.
    """

    api_key_env_var = "ANTHROPIC_API_KEY"

    @property
    def name(self) -> str:
        return "anthropic"

    @property
    def default_model(self) -> str:
        return "claude-sonnet-4-6"

    def __init__(
        self,
        model: Optional[str] = None,
        config: Optional[dict] = None,
    ) -> None:
        super().__init__(model, config)
        self._api_key = os.environ.get("ANTHROPIC_API_KEY")

    def _call(self, prompt: str, image_b64: Optional[str] = None) -> str:
        if not self._api_key:
            raise RuntimeError(
                "Clé API Anthropic manquante — définissez la variable d'environnement ANTHROPIC_API_KEY"
            )
        try:
            import anthropic
        except ImportError as exc:
            raise RuntimeError(
                "Le package 'anthropic' n'est pas installé. Lancez : pip install anthropic"
            ) from exc

        client = anthropic.Anthropic(api_key=self._api_key)
        temperature = float(self.config.get("temperature", 0.0))
        max_tokens = int(self.config.get("max_tokens", 4096))

        if image_b64:
            content: list | str = [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_b64,
                    },
                },
                {"type": "text", "text": prompt},
            ]
        else:
            content = prompt

        try:
            response = client.messages.create(
                model=self.model,
                max_tokens=max_tokens,
                temperature=temperature,
                messages=[{"role": "user", "content": content}],
            )
        except Exception as exc:
            # Chantier 4: shared status-aware logging (401/429/5xx).
            # Previously the Anthropic adapter did not distinguish HTTP
            # status codes, which made failures hard to diagnose
            # (invalid key vs. rate limit).
            log_http_error(
                "AnthropicAdapter", self.model, exc,
                env_var=self.api_key_env_var,
            )
            raise

        if not response.content:
            logger.warning(
                "[AnthropicAdapter] réponse vide (modèle=%s, stop_reason=%s).",
                self.model, getattr(response, "stop_reason", None),
            )
            return ""

        # Chantier 4: propagates the Sprint 15 fix. The Anthropic SDK
        # returns ``response.content`` as a list of blocks
        # (``ContentBlock`` with a ``text`` attribute). ``normalize_llm_content``
        # concatenates the text of every block instead of taking only
        # the first one, which matters when the model emits several blocks.
        text = normalize_llm_content(response.content)
        if not text:
            block = response.content[0]
            logger.warning(
                "[AnthropicAdapter] bloc de type '%s' sans texte (modèle=%s).",
                getattr(block, "type", "unknown"), self.model,
            )
        return text
|
|
@@ -0,0 +1,279 @@
"""Abstract interface shared by all LLM adapters."""

from __future__ import annotations

import logging
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Optional

logger = logging.getLogger(__name__)

# Default retry parameters
_DEFAULT_MAX_RETRIES = 3
_DEFAULT_BACKOFF_BASE = 2.0  # seconds: 2, 4, 8


def _is_retryable(exc: Exception) -> bool:
    """Tell whether an exception is retryable (429, 5xx, network timeout)."""
    # Retryable HTTP status codes
    status = getattr(exc, "status_code", None) or getattr(exc, "http_status", None)
    if status is not None:
        return status == 429 or status >= 500

    # Network errors / timeouts
    exc_name = type(exc).__name__
    if exc_name in ("TimeoutError", "ConnectionError", "URLError"):
        return True

    # Common error messages
    msg = str(exc).lower()
    if "rate" in msg and "limit" in msg:
        return True
    if "timeout" in msg or "connection" in msg:
        return True
    if "429" in msg or "503" in msg or "502" in msg:
        return True

    return False
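The retry loop itself lives in `BaseLLMAdapter.complete`, which is not visible in this part of the diff. The sketch below is therefore an assumption about how the defaults combine, not the adapter's code: it supposes a delay of `backoff_base ** attempt`, which matches the `2, 4, 8` comment on `_DEFAULT_BACKOFF_BASE`. The names `call_with_retry`, `FakeHTTPError` and `status_is_retryable` are hypothetical, and `sleep` is injectable so the example runs without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    max_retries: int = 3,
    backoff_base: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call ``fn``; retry retryable failures with exponential backoff."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception as exc:
            attempt += 1
            # Give up when retries are exhausted or the error is permanent.
            if attempt > max_retries or not is_retryable(exc):
                raise
            sleep(backoff_base ** attempt)  # 2, 4, 8 seconds with the defaults


class FakeHTTPError(Exception):
    """Hypothetical SDK error exposing a ``status_code`` attribute."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def status_is_retryable(exc: Exception) -> bool:
    # Simplified version of the status check: 429 and 5xx only.
    status = getattr(exc, "status_code", None)
    return status is not None and (status == 429 or status >= 500)


delays: list[float] = []
calls = {"n": 0}


def flaky() -> str:
    """Fail twice with a rate-limit error, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeHTTPError(429)
    return "ok"


result = call_with_retry(flaky, status_is_retryable, sleep=delays.append)
```

With two rate-limit failures followed by a success, the recorded delays are 2 and 4 seconds; a 401 would be re-raised immediately because it is not retryable.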
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
def normalize_llm_content(raw: Any) -> str:
|
| 43 |
+
"""Normalise une réponse LLM en chaîne plate.
|
| 44 |
+
|
| 45 |
+
Chantier 4 (post-Sprint 97) — propagation du fix Mistral
|
| 46 |
+
Sprint 15 à tous les providers. Le SDK Mistral peut retourner
|
| 47 |
+
une liste de ``ContentChunk`` au lieu d'une chaîne pour certains
|
| 48 |
+
modèles/versions ; le SDK OpenAI peut faire de même quand on
|
| 49 |
+
active des features de structuration. Ce helper applique la même
|
| 50 |
+
discipline pour les 4 adapters :
|
| 51 |
+
|
| 52 |
+
- ``str`` → renvoyée telle quelle (ou ``""``).
|
| 53 |
+
- ``None`` → ``""``.
|
| 54 |
+
- ``list[ContentChunk]`` → concaténation des ``.text``.
|
| 55 |
+
- ``list[dict]`` avec clé ``text`` → concaténation des ``["text"]``.
|
| 56 |
+
- ``list[str]`` → concaténation directe.
|
| 57 |
+
- autre objet avec ``.text`` → ``obj.text``.
|
| 58 |
+
- autre → ``str(obj)`` (best-effort).
|
| 59 |
+
|
| 60 |
+
Le résultat est garanti être une ``str`` ; ``""`` quand la réponse
|
| 61 |
+
est vide. La fonction est idempotente : ``normalize_llm_content(s)
|
| 62 |
+
== s`` pour toute chaîne ``s``.
|
| 63 |
+
"""
|
| 64 |
+
if raw is None:
|
| 65 |
+
return ""
|
| 66 |
+
if isinstance(raw, str):
|
| 67 |
+
return raw
|
| 68 |
+
if isinstance(raw, list):
|
| 69 |
+
parts: list[str] = []
|
| 70 |
+
for chunk in raw:
|
| 71 |
+
if chunk is None:
|
| 72 |
+
continue
|
| 73 |
+
if isinstance(chunk, str):
|
| 74 |
+
parts.append(chunk)
|
| 75 |
+
continue
|
| 76 |
+
if hasattr(chunk, "text"):
|
| 77 |
+
txt = getattr(chunk, "text", None)
|
| 78 |
+
if isinstance(txt, str):
|
| 79 |
+
parts.append(txt)
|
| 80 |
+
continue
|
| 81 |
+
if isinstance(chunk, dict) and isinstance(chunk.get("text"), str):
|
| 82 |
+
parts.append(chunk["text"])
|
| 83 |
+
continue
|
| 84 |
+
# Last resort: convert the chunk to a string
|
| 85 |
+
parts.append(str(chunk))
|
| 86 |
+
return "".join(parts)
|
| 87 |
+
if hasattr(raw, "text") and isinstance(getattr(raw, "text", None), str):
|
| 88 |
+
return raw.text # type: ignore[no-any-return]
|
| 89 |
+
return str(raw)
|
| 90 |
+
|
| 91 |
+
|
| 92 |
+
def log_http_error(
|
| 93 |
+
adapter_name: str,
|
| 94 |
+
model: str,
|
| 95 |
+
exc: Exception,
|
| 96 |
+
*,
|
| 97 |
+
env_var: Optional[str] = None,
|
| 98 |
+
) -> None:
|
| 99 |
+
"""Standardized logging of HTTP errors from LLM SDKs.
|
| 100 |
+
|
| 101 |
+
Chantier 4 (post-Sprint 97): propagates the status-specific
|
| 102 |
+
Mistral/OpenAI log to all providers. Inspects ``status_code`` and
|
| 103 |
+
``http_status``, then emits a targeted warning depending on the code:
|
| 104 |
+
|
| 105 |
+
- 401: invalid/expired API key (names the environment variable
|
| 106 |
+
to check, if provided).
|
| 107 |
+
- 429: rate limit / quota exceeded.
|
| 108 |
+
- 5xx: server-side problem at the provider.
|
| 109 |
+
- other / no status_code: generic log.
|
| 110 |
+
|
| 111 |
+
The exception is not raised here: the caller must ``raise``
|
| 112 |
+
explicitly after this log to propagate it (retries are handled
|
| 113 |
+
by ``BaseLLMAdapter.complete`` according to ``_is_retryable``).
|
| 114 |
+
"""
|
| 115 |
+
status = getattr(exc, "status_code", None) or getattr(exc, "http_status", None)
|
| 116 |
+
if status == 401:
|
| 117 |
+
suffix = f" Vérifier {env_var}." if env_var else ""
|
| 118 |
+
logger.warning(
|
| 119 |
+
"[%s] erreur HTTP 401 — clé API invalide ou expirée "
|
| 120 |
+
"(modèle=%s).%s",
|
| 121 |
+
adapter_name, model, suffix,
|
| 122 |
+
)
|
| 123 |
+
elif status == 429:
|
| 124 |
+
logger.warning(
|
| 125 |
+
"[%s] erreur HTTP 429 — quota dépassé ou rate-limit "
|
| 126 |
+
"(modèle=%s). Réessayer plus tard.",
|
| 127 |
+
adapter_name, model,
|
| 128 |
+
)
|
| 129 |
+
elif status is not None and status >= 500:
|
| 130 |
+
logger.warning(
|
| 131 |
+
"[%s] erreur HTTP %d — problème serveur (modèle=%s) : %s",
|
| 132 |
+
adapter_name, status, model, exc,
|
| 133 |
+
)
|
| 134 |
+
else:
|
| 135 |
+
logger.warning(
|
| 136 |
+
"[%s] erreur lors de l'appel API (modèle=%s) : %s",
|
| 137 |
+
adapter_name, model, exc,
|
| 138 |
+
)
|
| 139 |
+
|
| 140 |
+
|
| 141 |
+
@dataclass
|
| 142 |
+
class LLMResult:
|
| 143 |
+
"""Result produced by an LLM call."""
|
| 144 |
+
|
| 145 |
+
model_id: str
|
| 146 |
+
text: str
|
| 147 |
+
duration_seconds: float
|
| 148 |
+
tokens_used: Optional[int] = None
|
| 149 |
+
error: Optional[str] = None
|
| 150 |
+
|
| 151 |
+
@property
|
| 152 |
+
def success(self) -> bool:
|
| 153 |
+
return self.error is None
|
| 154 |
+
|
| 155 |
+
|
| 156 |
+
class BaseLLMAdapter(ABC):
|
| 157 |
+
"""Base class for all LLM adapters.
|
| 158 |
+
|
| 159 |
+
Each adapter must implement:
|
| 160 |
+
- ``name`` : provider identifier (e.g. 'openai')
|
| 161 |
+
- ``default_model``: the provider's default model
|
| 162 |
+
- ``_call()`` : the actual API call, returns the raw text
|
| 163 |
+
|
| 164 |
+
API keys are read from environment variables only.
|
| 165 |
+
|
| 166 |
+
Automatic retry
|
| 167 |
+
---------------
|
| 168 |
+
Retryable errors (HTTP 429, 5xx, network timeout) are retried automatically
|
| 169 |
+
with exponential backoff (2s, 4s, 8s by default). Configurable
|
| 170 |
+
via ``config["max_retries"]`` and ``config["retry_backoff"]``.
|
| 171 |
+
|
| 172 |
+
Response normalization (chantier 4)
|
| 173 |
+
-----------------------------------
|
| 174 |
+
Subclasses call :func:`normalize_llm_content` on the
|
| 175 |
+
SDK response before returning it, which guarantees that a response of
|
| 176 |
+
type ``list[ContentChunk]`` (Mistral, sometimes OpenAI) is
|
| 177 |
+
converted to a flat ``str``.
|
| 178 |
+
|
| 179 |
+
HTTP error logging (chantier 4)
|
| 180 |
+
-------------------------------
|
| 181 |
+
Subclasses call :func:`log_http_error` to produce
|
| 182 |
+
a log that distinguishes by ``status_code`` (401 → invalid key,
|
| 183 |
+
429 → rate limit, 5xx → server). Previously this log was
|
| 184 |
+
duplicated in Mistral/OpenAI and missing from Anthropic.
|
| 185 |
+
"""
|
| 186 |
+
|
| 187 |
+
# Environment variable holding the API key. Subclasses
|
| 188 |
+
# override it (e.g. ``"OPENAI_API_KEY"``); named by
|
| 189 |
+
# :func:`log_http_error` when a 401 is encountered. ``None``
|
| 190 |
+
# for providers without a key (Ollama).
|
| 191 |
+
api_key_env_var: Optional[str] = None
|
| 192 |
+
|
| 193 |
+
def __init__(
|
| 194 |
+
self,
|
| 195 |
+
model: Optional[str] = None,
|
| 196 |
+
config: Optional[dict] = None,
|
| 197 |
+
) -> None:
|
| 198 |
+
self.config: dict = config or {}
|
| 199 |
+
self.model: str = model or self.default_model
|
| 200 |
+
|
| 201 |
+
@property
|
| 202 |
+
@abstractmethod
|
| 203 |
+
def name(self) -> str:
|
| 204 |
+
"""Provider identifier (e.g. 'openai', 'anthropic')."""
|
| 205 |
+
|
| 206 |
+
@property
|
| 207 |
+
@abstractmethod
|
| 208 |
+
def default_model(self) -> str:
|
| 209 |
+
"""Model used when none is given explicitly."""
|
| 210 |
+
|
| 211 |
+
@abstractmethod
|
| 212 |
+
def _call(self, prompt: str, image_b64: Optional[str] = None) -> str:
|
| 213 |
+
"""Actual LLM call.
|
| 214 |
+
|
| 215 |
+
Parameters
|
| 216 |
+
----------
|
| 217 |
+
prompt:
|
| 218 |
+
Final prompt text (variables already substituted).
|
| 219 |
+
image_b64:
|
| 220 |
+
Base64-encoded image (without a data URI prefix).
|
| 221 |
+
None for text-only calls.
|
| 222 |
+
|
| 223 |
+
Returns
|
| 224 |
+
-------
|
| 225 |
+
str
|
| 226 |
+
Text generated by the LLM.
|
| 227 |
+
"""
|
| 228 |
+
|
| 229 |
+
def complete(
|
| 230 |
+
self,
|
| 231 |
+
prompt: str,
|
| 232 |
+
image_b64: Optional[str] = None,
|
| 233 |
+
) -> LLMResult:
|
| 234 |
+
"""Public entry point: calls the LLM with automatic retry."""
|
| 235 |
+
max_retries = int(self.config.get("max_retries", _DEFAULT_MAX_RETRIES))
|
| 236 |
+
backoff_base = float(self.config.get("retry_backoff", _DEFAULT_BACKOFF_BASE))
|
| 237 |
+
|
| 238 |
+
start = time.perf_counter()
|
| 239 |
+
last_exc: Optional[Exception] = None
|
| 240 |
+
|
| 241 |
+
for attempt in range(max_retries + 1):
|
| 242 |
+
try:
|
| 243 |
+
text = self._call(prompt, image_b64)
|
| 244 |
+
duration = time.perf_counter() - start
|
| 245 |
+
return LLMResult(
|
| 246 |
+
model_id=self.model,
|
| 247 |
+
text=text,
|
| 248 |
+
duration_seconds=round(duration, 4),
|
| 249 |
+
)
|
| 250 |
+
except Exception as exc: # noqa: BLE001
|
| 251 |
+
last_exc = exc
|
| 252 |
+
if attempt < max_retries and _is_retryable(exc):
|
| 253 |
+
wait = backoff_base ** (attempt + 1)
|
| 254 |
+
logger.warning(
|
| 255 |
+
"[%s] erreur retryable (tentative %d/%d, attente %.1fs) : %s",
|
| 256 |
+
self.name, attempt + 1, max_retries + 1, wait, exc,
|
| 257 |
+
)
|
| 258 |
+
time.sleep(wait)
|
| 259 |
+
else:
|
| 260 |
+
break
|
| 261 |
+
|
| 262 |
+
duration = time.perf_counter() - start
|
| 263 |
+
return LLMResult(
|
| 264 |
+
model_id=self.model,
|
| 265 |
+
text="",
|
| 266 |
+
duration_seconds=round(duration, 4),
|
| 267 |
+
error=str(last_exc),
|
| 268 |
+
)
|
| 269 |
+
|
| 270 |
+
def __repr__(self) -> str:
|
| 271 |
+
return f"{self.__class__.__name__}(model={self.model!r})"
|
| 272 |
+
|
| 273 |
+
|
| 274 |
+
__all__ = [
|
| 275 |
+
"BaseLLMAdapter",
|
| 276 |
+
"LLMResult",
|
| 277 |
+
"log_http_error",
|
| 278 |
+
"normalize_llm_content",
|
| 279 |
+
]
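
The retry loop in ``complete`` can be read in isolation. What follows is a minimal sketch, not the project's code: ``call_with_retry``, ``flaky`` and the injectable ``sleep`` parameter are hypothetical names introduced here, and the defaults (3 retries, base 2.0) are assumptions inferred from the "2s, 4s, 8s" wording of the class docstring, since ``_DEFAULT_MAX_RETRIES`` and ``_DEFAULT_BACKOFF_BASE`` are defined outside this excerpt.

```python
from typing import Callable, List, Optional, Tuple


def call_with_retry(
    call: Callable[[], str],
    is_retryable: Callable[[Exception], bool],
    max_retries: int = 3,          # assumed default
    backoff_base: float = 2.0,     # assumed default
    sleep: Callable[[float], None] = lambda seconds: None,
) -> Tuple[str, Optional[str]]:
    """Return (text, error); retry retryable failures with exponential backoff."""
    last_exc: Optional[Exception] = None
    for attempt in range(max_retries + 1):
        try:
            return call(), None
        except Exception as exc:  # broad catch, as in the diff
            last_exc = exc
            if attempt < max_retries and is_retryable(exc):
                # Same formula as the diff: base ** (attempt + 1)
                sleep(backoff_base ** (attempt + 1))
            else:
                break
    return "", str(last_exc)


waits: List[float] = []
attempts = {"n": 0}


def flaky() -> str:
    """Fail twice with a timeout, then succeed."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = call_with_retry(
    flaky,
    is_retryable=lambda exc: type(exc).__name__ == "TimeoutError",
    sleep=waits.append,  # record the waits instead of sleeping
)
print(result, waits)
```

Injecting ``sleep`` keeps the sketch fast to run; the code in the diff calls ``time.sleep`` directly.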
|
|
@@ -0,0 +1,157 @@
|
| 1 |
+
"""LLM adapter: Mistral AI (Mistral Large, Pixtral)."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
import logging
|
| 6 |
+
import os
|
| 7 |
+
from typing import Optional
|
| 8 |
+
|
| 9 |
+
from picarones.adapters.llm.base import (
|
| 10 |
+
BaseLLMAdapter,
|
| 11 |
+
log_http_error,
|
| 12 |
+
normalize_llm_content,
|
| 13 |
+
)
|
| 14 |
+
|
| 15 |
+
logger = logging.getLogger(__name__)
|
| 16 |
+
|
| 17 |
+
# Mistral models that do NOT support the multimodal chat/completions API.
|
| 18 |
+
# These small models are text-only; passing one an image causes an error.
|
| 19 |
+
_TEXT_ONLY_MODELS = frozenset({
|
| 20 |
+
"ministral-3b-latest",
|
| 21 |
+
"ministral-8b-latest",
|
| 22 |
+
"mistral-tiny",
|
| 23 |
+
"mistral-tiny-latest",
|
| 24 |
+
"open-mistral-7b",
|
| 25 |
+
"open-mixtral-8x7b",
|
| 26 |
+
})
|
| 27 |
+
|
| 28 |
+
|
| 29 |
+
class MistralAdapter(BaseLLMAdapter):
|
| 30 |
+
"""Adapter for Mistral AI models.
|
| 31 |
+
|
| 32 |
+
API key read from the ``MISTRAL_API_KEY`` environment variable.
|
| 33 |
+
|
| 34 |
+
Supported modes: text_only (all models), text_and_image and zero_shot
|
| 35 |
+
with multimodal models (pixtral-12b, pixtral-large).
|
| 36 |
+
|
| 37 |
+
Note
|
| 38 |
+
----
|
| 39 |
+
The ``ministral-3b-latest`` and ``ministral-8b-latest`` models do not support
|
| 40 |
+
multimodal mode; use ``PipelineMode.TEXT_ONLY`` with these models.
|
| 41 |
+
"""
|
| 42 |
+
|
| 43 |
+
api_key_env_var = "MISTRAL_API_KEY"
|
| 44 |
+
|
| 45 |
+
@property
|
| 46 |
+
def name(self) -> str:
|
| 47 |
+
return "mistral"
|
| 48 |
+
|
| 49 |
+
@property
|
| 50 |
+
def default_model(self) -> str:
|
| 51 |
+
return "mistral-large-latest"
|
| 52 |
+
|
| 53 |
+
def __init__(
|
| 54 |
+
self,
|
| 55 |
+
model: Optional[str] = None,
|
| 56 |
+
config: Optional[dict] = None,
|
| 57 |
+
) -> None:
|
| 58 |
+
super().__init__(model, config)
|
| 59 |
+
self._api_key = os.environ.get("MISTRAL_API_KEY")
|
| 60 |
+
if self.model in _TEXT_ONLY_MODELS:
|
| 61 |
+
logger.info(
|
| 62 |
+
"[MistralAdapter] modèle '%s' : text-only (pas de support multimodal).",
|
| 63 |
+
self.model,
|
| 64 |
+
)
|
| 65 |
+
|
| 66 |
+
def _call(self, prompt: str, image_b64: Optional[str] = None) -> str:
|
| 67 |
+
if not self._api_key:
|
| 68 |
+
raise RuntimeError(
|
| 69 |
+
"Clé API Mistral manquante — définissez la variable d'environnement MISTRAL_API_KEY"
|
| 70 |
+
)
|
| 71 |
+
try:
|
| 72 |
+
try:
|
| 73 |
+
from mistralai.client import Mistral
|
| 74 |
+
except ImportError:
|
| 75 |
+
from mistralai import Mistral # type: ignore[no-redef]
|
| 76 |
+
except ImportError as exc:
|
| 77 |
+
raise RuntimeError(
|
| 78 |
+
"Le package 'mistralai' n'est pas installé. Lancez : pip install mistralai"
|
| 79 |
+
) from exc
|
| 80 |
+
|
| 81 |
+
client = Mistral(api_key=self._api_key)
|
| 82 |
+
temperature = float(self.config.get("temperature", 0.0))
|
| 83 |
+
max_tokens = int(self.config.get("max_tokens", 4096))
|
| 84 |
+
|
| 85 |
+
# Text-only models do not support images
|
| 86 |
+
if image_b64 and self.model in _TEXT_ONLY_MODELS:
|
| 87 |
+
logger.warning(
|
| 88 |
+
"[MistralAdapter] modèle '%s' ne supporte pas les images — "
|
| 89 |
+
"image ignorée, appel en mode texte seul.",
|
| 90 |
+
self.model,
|
| 91 |
+
)
|
| 92 |
+
image_b64 = None
|
| 93 |
+
|
| 94 |
+
if image_b64:
|
| 95 |
+
content: list | str = [
|
| 96 |
+
{"type": "text", "text": prompt},
|
| 97 |
+
{
|
| 98 |
+
"type": "image_url",
|
| 99 |
+
"image_url": f"data:image/png;base64,{image_b64}",
|
| 100 |
+
},
|
| 101 |
+
]
|
| 102 |
+
else:
|
| 103 |
+
content = prompt
|
| 104 |
+
|
| 105 |
+
logger.info(
|
| 106 |
+
"[MistralAdapter] appel %s — prompt=%d chars, image=%s",
|
| 107 |
+
self.model, len(prompt), "oui" if image_b64 else "non",
|
| 108 |
+
)
|
| 109 |
+
|
| 110 |
+
try:
|
| 111 |
+
response = client.chat.complete(
|
| 112 |
+
model=self.model,
|
| 113 |
+
messages=[{"role": "user", "content": content}],
|
| 114 |
+
temperature=temperature,
|
| 115 |
+
max_tokens=max_tokens,
|
| 116 |
+
)
|
| 117 |
+
except Exception as exc:
|
| 118 |
+
log_http_error(
|
| 119 |
+
"MistralAdapter", self.model, exc,
|
| 120 |
+
env_var=self.api_key_env_var,
|
| 121 |
+
)
|
| 122 |
+
raise
|
| 123 |
+
|
| 124 |
+
if not response.choices:
|
| 125 |
+
logger.warning(
|
| 126 |
+
"[MistralAdapter] response.choices vide (modèle=%s).",
|
| 127 |
+
self.model,
|
| 128 |
+
)
|
| 129 |
+
return ""
|
| 130 |
+
|
| 131 |
+
_choice = response.choices[0]
|
| 132 |
+
raw = _choice.message.content
|
| 133 |
+
_finish_reason = _choice.finish_reason
|
| 134 |
+
|
| 135 |
+
# Chantier 4: normalization factored out into
|
| 136 |
+
# ``picarones.adapters.llm.base.normalize_llm_content`` (Sprint 15
|
| 137 |
+
# generalized: list[ContentChunk] / list[dict] / str → str).
|
| 138 |
+
text = normalize_llm_content(raw)
|
| 139 |
+
|
| 140 |
+
_completion_tokens = None
|
| 141 |
+
if hasattr(response, "usage") and response.usage:
|
| 142 |
+
_completion_tokens = getattr(response.usage, "completion_tokens", None)
|
| 143 |
+
|
| 144 |
+
logger.info(
|
| 145 |
+
"[MistralAdapter] réponse %s — finish_reason=%s, len=%d, tokens=%s",
|
| 146 |
+
self.model, _finish_reason, len(text), _completion_tokens,
|
| 147 |
+
)
|
| 148 |
+
|
| 149 |
+
if not text.strip():
|
| 150 |
+
logger.warning(
|
| 151 |
+
"[MistralAdapter] réponse vide du modèle '%s' "
|
| 152 |
+
"(finish_reason=%s, completion_tokens=%s). "
|
| 153 |
+
"Vérifier le prompt et la compatibilité du modèle.",
|
| 154 |
+
self.model, _finish_reason, _completion_tokens,
|
| 155 |
+
)
|
| 156 |
+
|
| 157 |
+
return text
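
The image-handling branch of ``_call`` reduces to a small pure function. The sketch below is for illustration only: ``build_user_content`` and the two-entry ``TEXT_ONLY`` set are hypothetical stand-ins (the adapter inlines this logic and uses the full ``_TEXT_ONLY_MODELS`` set), and the PNG data URI mirrors the prefix hard-coded in the diff.

```python
from typing import Optional, Union

# Illustrative subset of the adapter's _TEXT_ONLY_MODELS.
TEXT_ONLY = frozenset({"ministral-3b-latest", "open-mistral-7b"})


def build_user_content(
    prompt: str, image_b64: Optional[str], model: str
) -> Union[str, list]:
    """Return the ``content`` value for the single user message."""
    if image_b64 and model in TEXT_ONLY:
        # Text-only model: silently drop the image (the adapter also logs a warning).
        image_b64 = None
    if image_b64:
        return [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": f"data:image/png;base64,{image_b64}"},
        ]
    return prompt


text_only_payload = build_user_content("Transcribe.", "AAAA", "open-mistral-7b")
multimodal_payload = build_user_content("Transcribe.", "AAAA", "pixtral-12b")
print(text_only_payload)
print(multimodal_payload[1]["image_url"])
```

For a text-only model the image is dropped and the call proceeds with the plain prompt string rather than failing.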
|
|
@@ -0,0 +1,109 @@
|
| 1 |
+
"""LLM adapter: Ollama (local models: Llama 3, Gemma, Phi, local Mistral…)."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
import logging
|
| 6 |
+
from typing import Optional
|
| 7 |
+
from urllib.parse import urlparse
|
| 8 |
+
|
| 9 |
+
from picarones.adapters.llm.base import BaseLLMAdapter, normalize_llm_content
|
| 10 |
+
|
| 11 |
+
logger = logging.getLogger(__name__)
|
| 12 |
+
|
| 13 |
+
|
| 14 |
+
class OllamaAdapter(BaseLLMAdapter):
|
| 15 |
+
"""Adapter for local models served by Ollama.
|
| 16 |
+
|
| 17 |
+
No API key required. Needs a running Ollama server (by default
|
| 18 |
+
at http://localhost:11434).
|
| 19 |
+
|
| 20 |
+
Supported modes:
|
| 21 |
+
- text_only : all Ollama models
|
| 22 |
+
- text_and_image : multimodal models (llava, bakllava, moondream…)
|
| 23 |
+
- zero_shot : multimodal models only
|
| 24 |
+
|
| 25 |
+
Configuration (via ``config``):
|
| 26 |
+
- ``base_url`` : URL of the Ollama server (default: http://localhost:11434)
|
| 27 |
+
"""
|
| 28 |
+
|
| 29 |
+
@property
|
| 30 |
+
def name(self) -> str:
|
| 31 |
+
return "ollama"
|
| 32 |
+
|
| 33 |
+
@property
|
| 34 |
+
def default_model(self) -> str:
|
| 35 |
+
return "llama3"
|
| 36 |
+
|
| 37 |
+
def __init__(
|
| 38 |
+
self,
|
| 39 |
+
model: Optional[str] = None,
|
| 40 |
+
config: Optional[dict] = None,
|
| 41 |
+
) -> None:
|
| 42 |
+
super().__init__(model, config)
|
| 43 |
+
base_url = self.config.get("base_url", "http://localhost:11434").rstrip("/")
|
| 44 |
+
parsed = urlparse(base_url)
|
| 45 |
+
if parsed.scheme not in ("http", "https"):
|
| 46 |
+
raise ValueError(
|
| 47 |
+
f"URL Ollama invalide (schéma '{parsed.scheme}' non autorisé, "
|
| 48 |
+
f"seuls http/https sont acceptés) : {base_url}"
|
| 49 |
+
)
|
| 50 |
+
self._base_url = base_url
|
| 51 |
+
|
| 52 |
+
def _call(self, prompt: str, image_b64: Optional[str] = None) -> str:
|
| 53 |
+
import json
|
| 54 |
+
import urllib.error
|
| 55 |
+
import urllib.request
|
| 56 |
+
|
| 57 |
+
temperature = float(self.config.get("temperature", 0.0))
|
| 58 |
+
payload: dict = {
|
| 59 |
+
"model": self.model,
|
| 60 |
+
"prompt": prompt,
|
| 61 |
+
"stream": False,
|
| 62 |
+
"options": {"temperature": temperature},
|
| 63 |
+
}
|
| 64 |
+
if image_b64:
|
| 65 |
+
payload["images"] = [image_b64]
|
| 66 |
+
|
| 67 |
+
data = json.dumps(payload).encode("utf-8")
|
| 68 |
+
req = urllib.request.Request(
|
| 69 |
+
f"{self._base_url}/api/generate",
|
| 70 |
+
data=data,
|
| 71 |
+
headers={"Content-Type": "application/json"},
|
| 72 |
+
)
|
| 73 |
+
try:
|
| 74 |
+
with urllib.request.urlopen(req, timeout=120) as resp:
|
| 75 |
+
raw = resp.read().decode("utf-8")
|
| 76 |
+
except urllib.error.HTTPError as exc:
|
| 77 |
+
logger.warning(
|
| 78 |
+
"[OllamaAdapter] erreur HTTP %d (modèle=%s) : %s",
|
| 79 |
+
exc.code, self.model, exc,
|
| 80 |
+
)
|
| 81 |
+
raise RuntimeError(
|
| 82 |
+
f"Erreur HTTP {exc.code} du serveur Ollama ({self._base_url}) : {exc}"
|
| 83 |
+
) from exc
|
| 84 |
+
except urllib.error.URLError as exc:
|
| 85 |
+
raise RuntimeError(
|
| 86 |
+
f"Impossible de joindre le serveur Ollama sur {self._base_url}. "
|
| 87 |
+
f"Vérifiez qu'Ollama est démarré (ollama serve). Erreur : {exc}"
|
| 88 |
+
) from exc
|
| 89 |
+
|
| 90 |
+
try:
|
| 91 |
+
result = json.loads(raw)
|
| 92 |
+
except json.JSONDecodeError as exc:
|
| 93 |
+
logger.warning(
|
| 94 |
+
"[OllamaAdapter] réponse JSON invalide (modèle=%s) : %s",
|
| 95 |
+
self.model, raw[:200],
|
| 96 |
+
)
|
| 97 |
+
raise RuntimeError(
|
| 98 |
+
f"Réponse JSON invalide du serveur Ollama : {exc}"
|
| 99 |
+
) from exc
|
| 100 |
+
|
| 101 |
+
# Chantier 4, propagation of the Sprint 15 fix: Ollama returns
|
| 102 |
+
# ``response`` as a string, but we normalize defensively (in case
|
| 103 |
+
# a future build returns a structured format).
|
| 104 |
+
text = normalize_llm_content(result.get("response", ""))
|
| 105 |
+
if not text:
|
| 106 |
+
logger.warning(
|
| 107 |
+
"[OllamaAdapter] réponse vide (modèle=%s).", self.model,
|
| 108 |
+
)
|
| 109 |
+
return text
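
The ``base_url`` check in ``OllamaAdapter.__init__`` can be exercised on its own. This is a sketch under assumptions: ``validate_base_url`` is a hypothetical helper name (the adapter performs the check inline) and the error message is simplified.

```python
from urllib.parse import urlparse


def validate_base_url(base_url: str) -> str:
    """Strip trailing slashes and reject any scheme other than http/https."""
    cleaned = base_url.rstrip("/")
    scheme = urlparse(cleaned).scheme
    if scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme {scheme!r} in {cleaned!r}")
    return cleaned


print(validate_base_url("http://localhost:11434/"))
try:
    validate_base_url("file:///etc/passwd")
except ValueError as exc:
    print("rejected:", exc)
```

Restricting the scheme before building the request keeps ``urllib.request.urlopen`` from being pointed at ``file://`` or other non-HTTP handlers through configuration.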
|
|
@@ -0,0 +1,94 @@
|
| 1 |
+
"""LLM adapter: OpenAI (GPT-4o, GPT-4o-mini)."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
import logging
|
| 6 |
+
import os
|
| 7 |
+
from typing import Optional
|
| 8 |
+
|
| 9 |
+
from picarones.adapters.llm.base import (
|
| 10 |
+
BaseLLMAdapter,
|
| 11 |
+
log_http_error,
|
| 12 |
+
normalize_llm_content,
|
| 13 |
+
)
|
| 14 |
+
|
| 15 |
+
logger = logging.getLogger(__name__)
|
| 16 |
+
|
| 17 |
+
|
| 18 |
+
class OpenAIAdapter(BaseLLMAdapter):
|
| 19 |
+
"""Adapter for OpenAI models (GPT-4o, GPT-4o-mini).
|
| 20 |
+
|
| 21 |
+
API key read from the ``OPENAI_API_KEY`` environment variable.
|
| 22 |
+
|
| 23 |
+
Supported modes: text_only, text_and_image, zero_shot.
|
| 24 |
+
"""
|
| 25 |
+
|
| 26 |
+
api_key_env_var = "OPENAI_API_KEY"
|
| 27 |
+
|
| 28 |
+
@property
|
| 29 |
+
def name(self) -> str:
|
| 30 |
+
return "openai"
|
| 31 |
+
|
| 32 |
+
@property
|
| 33 |
+
def default_model(self) -> str:
|
| 34 |
+
return "gpt-4o"
|
| 35 |
+
|
| 36 |
+
def __init__(
|
| 37 |
+
self,
|
| 38 |
+
model: Optional[str] = None,
|
| 39 |
+
config: Optional[dict] = None,
|
| 40 |
+
) -> None:
|
| 41 |
+
super().__init__(model, config)
|
| 42 |
+
self._api_key = os.environ.get("OPENAI_API_KEY")
|
| 43 |
+
|
| 44 |
+
def _call(self, prompt: str, image_b64: Optional[str] = None) -> str:
|
| 45 |
+
if not self._api_key:
|
| 46 |
+
raise RuntimeError(
|
| 47 |
+
"Clé API OpenAI manquante — définissez la variable d'environnement OPENAI_API_KEY"
|
| 48 |
+
)
|
| 49 |
+
try:
|
| 50 |
+
from openai import OpenAI
|
| 51 |
+
except ImportError as exc:
|
| 52 |
+
raise RuntimeError(
|
| 53 |
+
"Le package 'openai' n'est pas installé. Lancez : pip install openai"
|
| 54 |
+
) from exc
|
| 55 |
+
|
| 56 |
+
client = OpenAI(api_key=self._api_key)
|
| 57 |
+
temperature = float(self.config.get("temperature", 0.0))
|
| 58 |
+
max_tokens = int(self.config.get("max_tokens", 4096))
|
| 59 |
+
|
| 60 |
+
if image_b64:
|
| 61 |
+
content = [
|
| 62 |
+
{"type": "text", "text": prompt},
|
| 63 |
+
{
|
| 64 |
+
"type": "image_url",
|
| 65 |
+
"image_url": {"url": f"data:image/png;base64,{image_b64}"},
|
| 66 |
+
},
|
| 67 |
+
]
|
| 68 |
+
else:
|
| 69 |
+
content = prompt # type: ignore[assignment]
|
| 70 |
+
|
| 71 |
+
try:
|
| 72 |
+
response = client.chat.completions.create(
|
| 73 |
+
model=self.model,
|
| 74 |
+
messages=[{"role": "user", "content": content}],
|
| 75 |
+
temperature=temperature,
|
| 76 |
+
max_tokens=max_tokens,
|
| 77 |
+
)
|
| 78 |
+
except Exception as exc:
|
| 79 |
+
log_http_error(
|
| 80 |
+
"OpenAIAdapter", self.model, exc,
|
| 81 |
+
env_var=self.api_key_env_var,
|
| 82 |
+
)
|
| 83 |
+
raise
|
| 84 |
+
|
| 85 |
+
if not response.choices:
|
| 86 |
+
logger.warning(
|
| 87 |
+
"[OpenAIAdapter] response.choices vide (modèle=%s).", self.model,
|
| 88 |
+
)
|
| 89 |
+
return ""
|
| 90 |
+
# Chantier 4, propagation of the Sprint 15 fix: the OpenAI SDK
|
| 91 |
+
# can return a ``list[ContentBlock]`` depending on the API
|
| 92 |
+
# (Responses, structured outputs). ``normalize_llm_content``
|
| 93 |
+
# handles both cases (str and list).
|
| 94 |
+
return normalize_llm_content(response.choices[0].message.content)
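
To make the str/list handling concrete without importing the package, here is a condensed stand-in. ``flatten_content`` is a hypothetical name; it is meant to follow the same branch order as ``normalize_llm_content`` in the diff, and ``SimpleNamespace`` merely imitates an SDK chunk object exposing ``.text``.

```python
from types import SimpleNamespace
from typing import Any, List


def flatten_content(raw: Any) -> str:
    """Condensed stand-in for normalize_llm_content (same branch order)."""
    if raw is None:
        return ""
    if isinstance(raw, str):
        return raw
    if isinstance(raw, list):
        parts: List[str] = []
        for chunk in raw:
            if chunk is None:
                continue
            if isinstance(chunk, str):
                parts.append(chunk)
            elif isinstance(getattr(chunk, "text", None), str):
                parts.append(chunk.text)      # SDK chunk object
            elif isinstance(chunk, dict) and isinstance(chunk.get("text"), str):
                parts.append(chunk["text"])   # plain dict chunk
            else:
                parts.append(str(chunk))      # last resort
        return "".join(parts)
    text = getattr(raw, "text", None)
    return text if isinstance(text, str) else str(raw)


mixed = [SimpleNamespace(text="ab"), {"text": "cd"}, "ef", None]
print(flatten_content(mixed))
print(repr(flatten_content(None)))
```

Whatever shape the SDK returns, the caller always receives a ``str``, so downstream code needs no type checks.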
|
|
@@ -1,98 +1,7 @@
|
|
| 1 |
-
"""
|
| 2 |
-
|
| 3 |
-
When an importer (HuggingFace, HTR-United, Gallica, eScriptorium…)
|
| 4 |
-
falls back to degraded mode (network timeout, malformed JSON, corrupt ZIP,
|
| 5 |
-
remote catalogue unavailable…), it records an incident here via
|
| 6 |
-
:func:`record_fallback`. The narrative engine consumes these incidents via
|
| 7 |
-
:func:`consume_fallback_log`, which **empties** the list so that a later
|
| 8 |
-
benchmark does not report incidents from the previous one.
|
| 9 |
-
|
| 10 |
-
Deliberately minimal design:
|
| 11 |
-
|
| 12 |
-
- No disk persistence (incidents are scoped to a single run).
|
| 13 |
-
- No complex structure (just a thread-safe ``list[dict]``).
|
| 14 |
-
- The runner / report can ignore the list without breaking anything.
|
| 15 |
-
|
| 16 |
-
The matching Fact detector (``FactType.IMPORTER_FALLBACK_TRIGGERED``)
|
| 17 |
-
is implemented in
|
| 18 |
-
:mod:`picarones.measurements.narrative.detectors.history`.
|
| 19 |
"""
|
| 20 |
|
| 21 |
from __future__ import annotations
|
| 22 |
|
| 23 |
-
import logging
|
| 24 |
-
import threading
|
| 25 |
-
from typing import Any
|
| 26 |
-
|
| 27 |
-
logger = logging.getLogger(__name__)
|
| 28 |
-
|
| 29 |
-
_lock = threading.Lock()
|
| 30 |
-
_fallbacks: list[dict[str, Any]] = []
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
def record_fallback(
|
| 34 |
-
importer: str,
|
| 35 |
-
operation: str,
|
| 36 |
-
error: BaseException | None = None,
|
| 37 |
-
*,
|
| 38 |
-
extra: dict[str, Any] | None = None,
|
| 39 |
-
) -> None:
|
| 40 |
-
"""Record a degraded-mode incident.
|
| 41 |
-
|
| 42 |
-
Also logs via ``logger.warning`` so that an operator sees
|
| 43 |
-
the incident in real time without relying on the report.
|
| 44 |
-
|
| 45 |
-
Parameters
|
| 46 |
-
----------
|
| 47 |
-
importer:
|
| 48 |
-
Short name of the importer (e.g. ``"huggingface"``, ``"htr_united"``).
|
| 49 |
-
operation:
|
| 50 |
-
Short description of the operation (e.g. ``"yaml_catalogue_parse"``,
|
| 51 |
-
``"image_save"``, ``"hub_search"``).
|
| 52 |
-
error:
|
| 53 |
-
Original exception (used for the log message and stored in
|
| 54 |
-
the payload as a string, not the object, to avoid
|
| 55 |
-
lingering references).
|
| 56 |
-
extra:
|
| 57 |
-
Additional fields (remote URL, dataset identifier…) that may
|
| 58 |
-
be useful to a later Fact detector.
|
| 59 |
-
"""
|
| 60 |
-
error_repr = repr(error) if error is not None else None
|
| 61 |
-
logger.warning(
|
| 62 |
-
"[importers/%s] %s a échoué (mode dégradé) : %s",
|
| 63 |
-
importer,
|
| 64 |
-
operation,
|
| 65 |
-
error_repr,
|
| 66 |
-
)
|
| 67 |
-
entry: dict[str, Any] = {
|
| 68 |
-
"importer": importer,
|
| 69 |
-
"operation": operation,
|
| 70 |
-
"error": error_repr,
|
| 71 |
-
}
|
| 72 |
-
if extra:
|
| 73 |
-
entry["extra"] = dict(extra)
|
| 74 |
-
with _lock:
|
| 75 |
-
_fallbacks.append(entry)
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
def consume_fallback_log() -> list[dict[str, Any]]:
|
| 79 |
-
"""Return AND empty the list of accumulated incidents.
|
| 80 |
-
|
| 81 |
-
The narrative engine calls this function when building
|
| 82 |
-
the summary to turn each incident into a ``Fact``."""
|
| 83 |
-
with _lock:
|
| 84 |
-
out = list(_fallbacks)
|
| 85 |
-
_fallbacks.clear()
|
| 86 |
-
return out
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
def peek_fallback_log() -> list[dict[str, Any]]:
|
| 90 |
-
"""Return a copy without emptying the list; useful for tests."""
|
| 91 |
-
with _lock:
|
| 92 |
-
return list(_fallbacks)
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
def reset_fallback_log() -> None:
|
| 96 |
-
"""Empty the list without returning anything; useful for pytest fixtures."""
|
| 97 |
-
with _lock:
|
| 98 |
-
_fallbacks.clear()
|
| 1 |
+
"""Re-export (Sprint A14-S11). The canonical content lives in
|
| 2 |
+
``picarones.adapters.corpus._fallback_log``.
|
| 3 |
"""
|
| 4 |
|
| 5 |
from __future__ import annotations
|
| 6 |
|
| 7 |
+
from picarones.adapters.corpus._fallback_log import * # noqa: F401,F403
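
The commit message notes that some shims also re-expose private symbols. One likely reason is that a star import skips underscore-prefixed names when the source module defines no ``__all__``. The sketch below demonstrates this with a throwaway in-memory module; ``canonical_demo`` and its contents are hypothetical stand-ins, not the real ``picarones`` package.

```python
import sys
import types

# Hypothetical stand-in for the canonical module (not the real picarones package).
canonical = types.ModuleType("canonical_demo")
exec(
    "def record_fallback():\n"
    "    return 'recorded'\n"
    "_fallbacks = []\n",
    canonical.__dict__,
)
sys.modules["canonical_demo"] = canonical

# The shim body: a star import, as in the re-export file.
shim_ns: dict = {}
exec("from canonical_demo import *", shim_ns)
public_visible = "record_fallback" in shim_ns
private_before = "_fallbacks" in shim_ns

# Private names need an explicit import to be re-exposed.
exec("from canonical_demo import _fallbacks", shim_ns)
private_after = "_fallbacks" in shim_ns

print(public_visible, private_before, private_after)
```

A shim whose consumers import private names therefore needs an explicit ``from … import _name`` line next to the star import.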
|
|
@@ -1,473 +1,7 @@
|
|
| 1 |
-
"""
|
| 2 |
-
|
| 3 |
-
HTR-United is a community catalogue of HTR/OCR ground truth published
|
| 4 |
-
on GitHub under open licences. The metadata is stored in a YAML
|
| 5 |
-
file (catalogue.yml) at https://github.com/HTR-United/htr-united.
|
| 6 |
-
|
| 7 |
-
This module provides:
|
| 8 |
-
- :class:`HTRUnitedCatalogue`: loading and searching the catalogue
|
| 9 |
-
- :func:`fetch_catalogue`: downloading the catalogue from GitHub
|
| 10 |
-
- :func:`import_htr_united_corpus`: downloading and importing a corpus
|
| 11 |
-
|
| 12 |
-
Example
|
| 13 |
-
-------
|
| 14 |
-
catalogue = HTRUnitedCatalogue.from_remote()
|
| 15 |
-
results = catalogue.search("français médiéval")
|
| 16 |
-
corpus = import_htr_united_corpus(results[0], output_dir="./corpus/")
|
| 17 |
"""
|
| 18 |
|
| 19 |
from __future__ import annotations
|
| 20 |
|
| 21 |
-
import
|
| 22 |
-
import logging
|
| 23 |
-
import re
|
| 24 |
-
import urllib.error
|
| 25 |
-
import urllib.request
|
| 26 |
-
from dataclasses import dataclass, field
|
| 27 |
-
from pathlib import Path
|
| 28 |
-
from typing import Optional
|
| 29 |
-
|
| 30 |
-
logger = logging.getLogger(__name__)
|
| 31 |
-
|
| 32 |
-
# ---------------------------------------------------------------------------
|
| 33 |
-
# Catalogue remote URL
|
| 34 |
-
# ---------------------------------------------------------------------------
|
| 35 |
-
|
| 36 |
-
_CATALOGUE_URL = (
|
| 37 |
-
"https://raw.githubusercontent.com/HTR-United/htr-united/master/htr-united.yml"
|
| 38 |
-
)
|
| 39 |
-
_CATALOGUE_API_URL = (
|
| 40 |
-
"https://api.github.com/repos/HTR-United/htr-united/contents/htr-united.yml"
|
| 41 |
-
)
|
| 42 |
-
|
| 43 |
-
# Demo / fallback catalogue (offline)
|
| 44 |
-
_DEMO_CATALOGUE: list[dict] = [
|
| 45 |
-
{
|
| 46 |
-
"id": "lectaurep-repertoires",
|
| 47 |
-
"title": "Lectaurep — Répertoires de notaires parisiens",
|
| 48 |
-
"url": "https://github.com/HTR-United/lectaurep-repertoires",
|
| 49 |
-
"language": ["French"],
|
| 50 |
-
"script": ["Cursiva"],
|
| 51 |
-
"century": [17, 18],
|
| 52 |
-
"institution": "Archives nationales (France)",
|
| 53 |
-
"description": "Transcriptions de répertoires de notaires, XVIIe-XVIIIe siècles.",
|
| 54 |
-
"license": "CC-BY 4.0",
|
| 55 |
-
"lines": 12400,
|
| 56 |
-
"format": "ALTO",
|
| 57 |
-
"tags": ["notaires", "Paris", "cursive", "imprimé"],
|
| 58 |
-
},
|
| 59 |
-
{
|
| 60 |
-
"id": "bvmm-manuscripts",
|
| 61 |
-
"title": "BVMM — Manuscrits enluminés",
|
| 62 |
-
"url": "https://github.com/HTR-United/bvmm-manuscripts",
|
| 63 |
-
"language": ["Latin", "French"],
|
| 64 |
-
"script": ["Gothic"],
|
| 65 |
-
"century": [13, 14, 15],
|
| 66 |
-
"institution": "IRHT",
|
| 67 |
-
"description": "Manuscrits médiévaux latins et français, XIIIe-XVe siècles.",
|
| 68 |
-
"license": "CC-BY 4.0",
|
| 69 |
-
"lines": 8700,
|
| 70 |
-
"format": "ALTO",
|
| 71 |
-
"tags": ["manuscrits", "latin", "médiéval", "enluminure"],
|
| 72 |
-
},
|
| 73 |
-
{
|
| 74 |
-
"id": "cremma-medieval",
|
| 75 |
-
"title": "CREMMA Médiéval",
|
| 76 |
-
"url": "https://github.com/HTR-United/cremma-medieval",
|
| 77 |
-
"language": ["French", "Latin"],
|
| 78 |
-
"script": ["Gothic", "Humanistica"],
|
| 79 |
-
"century": [12, 13, 14, 15],
|
| 80 |
-
"institution": "École des chartes / Inria",
|
| 81 |
-
"description": "Corpus CREMMA de manuscrits médiévaux français et latins.",
|
| 82 |
-
"license": "CC-BY 4.0",
|
| 83 |
-
"lines": 6200,
|
| 84 |
-
"format": "ALTO",
|
| 85 |
-
"tags": ["médiéval", "chartes", "manuscrits"],
|
| 86 |
-
},
|
| 87 |
-
{
|
| 88 |
-
"id": "simssa-ocr-printed",
|
| 89 |
-
"title": "SIMSSA — Imprimés anciens (XVe-XVIIe)",
|
| 90 |
-
"url": "https://github.com/HTR-United/simssa-printed",
|
| 91 |
-
"language": ["French", "Latin"],
|
| 92 |
-
"script": ["Rotunda", "Roman"],
|
| 93 |
-
"century": [15, 16, 17],
|
| 94 |
-
"institution": "McGill University",
|
| 95 |
-
"description": "Corpus d'imprimés anciens romains et gothiques.",
|
| 96 |
-
"license": "CC-BY 4.0",
|
| 97 |
-
"lines": 4500,
|
| 98 |
-
"format": "PAGE",
|
| 99 |
-
"tags": ["imprimés", "incunables", "roman", "gothique"],
|
| 100 |
-
},
|
| 101 |
-
{
|
| 102 |
-
"id": "fonds-gallica-presse",
|
| 103 |
-
"title": "Presse ancienne — Gallica (XIXe)",
|
| 104 |
-
"url": "https://github.com/HTR-United/gallica-presse-xix",
|
| 105 |
-
"language": ["French"],
|
| 106 |
-
"script": ["Roman"],
|
| 107 |
-
"century": [19],
|
| 108 |
-
"institution": "Gallica",
|
| 109 |
-
"description": "Numérisations de journaux du XIXe siècle (Gallica).",
|
| 110 |
-
"license": "etalab-2.0",
|
| 111 |
-
"lines": 31000,
|
| 112 |
-
"format": "ALTO",
|
| 113 |
-
"tags": ["presse", "XIXe", "Gallica", "journaux"],
|
| 114 |
-
},
|
| 115 |
-
{
|
| 116 |
-
"id": "archives-departem-correspondances",
|
| 117 |
-
"title": "Correspondances administratives (XVIIIe-XIXe)",
|
| 118 |
-
"url": "https://github.com/HTR-United/correspondances-admin",
|
| 119 |
-
"language": ["French"],
|
| 120 |
-
"script": ["Cursiva"],
|
| 121 |
-
"century": [18, 19],
|
| 122 |
-
"institution": "Archives départementales",
|
| 123 |
-
"description": "Lettres et correspondances administratives manuscrites.",
|
| 124 |
-
"license": "CC-BY 4.0",
|
| 125 |
-
"lines": 9800,
|
| 126 |
-
"format": "ALTO",
|
| 127 |
-
"tags": ["correspondances", "administratif", "cursive"],
|
| 128 |
-
},
|
| 129 |
-
{
|
| 130 |
-
"id": "e-codices-latin",
|
| 131 |
-
"title": "e-codices — Manuscrits latins (Suisse)",
|
| 132 |
-
"url": "https://github.com/HTR-United/e-codices-latin",
|
| 133 |
-
"language": ["Latin"],
|
| 134 |
-
"script": ["Caroline", "Gothic"],
|
| 135 |
-
"century": [9, 10, 11, 12],
|
| 136 |
-
"institution": "Bibliothèque cantonale universitaire de Lausanne",
|
| 137 |
-
"description": "Manuscrits carolingiens et gothiques des bibliothèques suisses.",
|
| 138 |
-
"license": "CC-BY 4.0",
|
| 139 |
-
"lines": 3100,
|
| 140 |
-
"format": "ALTO",
|
| 141 |
-
"tags": ["caroline", "latin", "médiéval", "Suisse"],
|
| 142 |
-
},
|
| 143 |
-
{
|
| 144 |
-
"id": "registres-paroissiaux-17",
|
| 145 |
-
"title": "Registres paroissiaux — Bretagne (XVIIe)",
|
| 146 |
-
"url": "https://github.com/HTR-United/registres-paroissiaux-bretagne",
|
| 147 |
-
"language": ["French", "Latin"],
|
| 148 |
-
"script": ["Cursiva"],
|
| 149 |
-
"century": [17],
|
| 150 |
-
"institution": "Archives départementales du Finistère",
|
| 151 |
-
"description": "Registres paroissiaux bretons du XVIIe siècle.",
|
| 152 |
-
"license": "CC-BY 4.0",
|
| 153 |
-
"lines": 15600,
|
| 154 |
-
"format": "ALTO",
|
| 155 |
-
"tags": ["registres", "Bretagne", "paroissial", "cursive"],
|
| 156 |
-
},
|
| 157 |
-
]
|
| 158 |
-
|
| 159 |
-
|
| 160 |
-
# ---------------------------------------------------------------------------
|
| 161 |
-
# Catalogue entry dataclass
|
| 162 |
-
# ---------------------------------------------------------------------------
|
| 163 |
-
|
| 164 |
-
@dataclass
|
| 165 |
-
class HTRUnitedEntry:
|
| 166 |
-
"""An entry in the HTR-United catalogue."""
|
| 167 |
-
|
| 168 |
-
id: str
|
| 169 |
-
title: str
|
| 170 |
-
url: str
|
| 171 |
-
language: list[str] = field(default_factory=list)
|
| 172 |
-
script: list[str] = field(default_factory=list)
|
| 173 |
-
century: list[int] = field(default_factory=list)
|
| 174 |
-
institution: str = ""
|
| 175 |
-
description: str = ""
|
| 176 |
-
license: str = ""
|
| 177 |
-
lines: int = 0
|
| 178 |
-
format: str = "ALTO"
|
| 179 |
-
tags: list[str] = field(default_factory=list)
|
| 180 |
-
|
| 181 |
-
def as_dict(self) -> dict:
|
| 182 |
-
return {
|
| 183 |
-
"id": self.id,
|
| 184 |
-
"title": self.title,
|
| 185 |
-
"url": self.url,
|
| 186 |
-
"language": self.language,
|
| 187 |
-
"script": self.script,
|
| 188 |
-
"century": self.century,
|
| 189 |
-
"institution": self.institution,
|
| 190 |
-
"description": self.description,
|
| 191 |
-
"license": self.license,
|
| 192 |
-
"lines": self.lines,
|
| 193 |
-
"format": self.format,
|
| 194 |
-
"tags": self.tags,
|
| 195 |
-
}
|
| 196 |
-
|
| 197 |
-
@classmethod
|
| 198 |
-
def from_dict(cls, d: dict) -> "HTRUnitedEntry":
|
| 199 |
-
return cls(
|
| 200 |
-
id=d.get("id", ""),
|
| 201 |
-
title=d.get("title", ""),
|
| 202 |
-
url=d.get("url", ""),
|
| 203 |
-
language=d.get("language", []),
|
| 204 |
-
script=d.get("script", []),
|
| 205 |
-
century=d.get("century", []),
|
| 206 |
-
institution=d.get("institution", ""),
|
| 207 |
-
description=d.get("description", ""),
|
| 208 |
-
license=d.get("license", ""),
|
| 209 |
-
lines=d.get("lines", 0),
|
| 210 |
-
format=d.get("format", "ALTO"),
|
| 211 |
-
tags=d.get("tags", []),
|
| 212 |
-
)
|
| 213 |
-
|
| 214 |
-
@property
|
| 215 |
-
def century_str(self) -> str:
|
| 216 |
-
"""Centuries formatted as Roman numerals (with French ordinal suffixes)."""
|
| 217 |
-
roman = {
|
| 218 |
-
1: "Ier", 2: "IIe", 3: "IIIe", 4: "IVe", 5: "Ve",
|
| 219 |
-
6: "VIe", 7: "VIIe", 8: "VIIIe", 9: "IXe", 10: "Xe",
|
| 220 |
-
11: "XIe", 12: "XIIe", 13: "XIIIe", 14: "XIVe", 15: "XVe",
|
| 221 |
-
16: "XVIe", 17: "XVIIe", 18: "XVIIIe", 19: "XIXe", 20: "XXe",
|
| 222 |
-
}
|
| 223 |
-
return ", ".join(roman.get(c, f"{c}e") for c in self.century)
|
| 224 |
-
|
| 225 |
-
|
| 226 |
-
# ---------------------------------------------------------------------------
|
| 227 |
-
# Catalogue
|
| 228 |
-
# ---------------------------------------------------------------------------
|
| 229 |
-
|
| 230 |
-
class HTRUnitedCatalogue:
|
| 231 |
-
"""HTR-United catalogue with search and filtering."""
|
| 232 |
-
|
| 233 |
-
def __init__(self, entries: list[HTRUnitedEntry], source: str = "demo") -> None:
|
| 234 |
-
self.entries = entries
|
| 235 |
-
self.source = source # "remote" | "demo" | "cache"
|
| 236 |
-
|
| 237 |
-
def __len__(self) -> int:
|
| 238 |
-
return len(self.entries)
|
| 239 |
-
|
| 240 |
-
@classmethod
|
| 241 |
-
def from_demo(cls) -> "HTRUnitedCatalogue":
|
| 242 |
-
"""Load the built-in demo catalogue."""
|
| 243 |
-
entries = [HTRUnitedEntry.from_dict(d) for d in _DEMO_CATALOGUE]
|
| 244 |
-
return cls(entries, source="demo")
|
| 245 |
-
|
| 246 |
-
@classmethod
|
| 247 |
-
def from_remote(cls, timeout: int = 10) -> "HTRUnitedCatalogue":
|
| 248 |
-
"""Download the catalogue from GitHub.
|
| 249 |
-
|
| 250 |
-
On a network error, return the demo catalogue.
|
| 251 |
-
"""
|
| 252 |
-
try:
|
| 253 |
-
req = urllib.request.Request(
|
| 254 |
-
_CATALOGUE_URL,
|
| 255 |
-
headers={"User-Agent": "picarones-htr-united-importer/1.0"},
|
| 256 |
-
)
|
| 257 |
-
with urllib.request.urlopen(req, timeout=timeout) as resp:
|
| 258 |
-
raw = resp.read().decode("utf-8")
|
| 259 |
-
entries = _parse_yml_catalogue(raw)
|
| 260 |
-
return cls(entries, source="remote")
|
| 261 |
-
except (urllib.error.URLError, Exception) as exc:
|
| 262 |
-
# Fall back to the demo catalogue, with a warning
|
| 263 |
-
logger.warning(
|
| 264 |
-
"[HTR-United] impossible de charger le catalogue distant (%s) : %s. "
|
| 265 |
-
"Utilisation des données de démonstration.",
|
| 266 |
-
_CATALOGUE_URL, exc,
|
| 267 |
-
)
|
| 268 |
-
return cls.from_demo()
|
| 269 |
-
|
| 270 |
-
def search(
|
| 271 |
-
self,
|
| 272 |
-
query: str = "",
|
| 273 |
-
language: Optional[str] = None,
|
| 274 |
-
script: Optional[str] = None,
|
| 275 |
-
century_min: Optional[int] = None,
|
| 276 |
-
century_max: Optional[int] = None,
|
| 277 |
-
) -> list[HTRUnitedEntry]:
|
| 278 |
-
"""Search the catalogue with optional filters."""
|
| 279 |
-
results = self.entries
|
| 280 |
-
|
| 281 |
-
if query:
|
| 282 |
-
q = query.lower()
|
| 283 |
-
results = [
|
| 284 |
-
e for e in results
|
| 285 |
-
if (q in e.title.lower()
|
| 286 |
-
or q in e.description.lower()
|
| 287 |
-
or q in e.institution.lower()
|
| 288 |
-
or any(q in t.lower() for t in e.tags)
|
| 289 |
-
or any(q in lang.lower() for lang in e.language))
|
| 290 |
-
]
|
| 291 |
-
|
| 292 |
-
if language:
|
| 293 |
-
lang_lower = language.lower()
|
| 294 |
-
results = [
|
| 295 |
-
e for e in results
|
| 296 |
-
if any(lang_lower in lg.lower() for lg in e.language)
|
| 297 |
-
]
|
| 298 |
-
|
| 299 |
-
if script:
|
| 300 |
-
sc_lower = script.lower()
|
| 301 |
-
results = [
|
| 302 |
-
e for e in results
|
| 303 |
-
if any(sc_lower in s.lower() for s in e.script)
|
| 304 |
-
]
|
| 305 |
-
|
| 306 |
-
if century_min is not None:
|
| 307 |
-
results = [
|
| 308 |
-
e for e in results
|
| 309 |
-
if any(c >= century_min for c in e.century)
|
| 310 |
-
]
|
| 311 |
-
|
| 312 |
-
if century_max is not None:
|
| 313 |
-
results = [
|
| 314 |
-
e for e in results
|
| 315 |
-
if any(c <= century_max for c in e.century)
|
| 316 |
-
]
|
| 317 |
-
|
| 318 |
-
return results
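The two century filters in `search()` are applied one after the other, each with its own `any()`. The sketch below, which uses a hypothetical `Entry` stand-in rather than the real `HTRUnitedEntry`, suggests one consequence: an entry whose centuries straddle the requested range can pass both filters without any single century falling inside it. Whether that is intended is not stated in the diff; `filter_overlap` is only an illustrative alternative.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical stand-in for HTRUnitedEntry, keeping only what the filter needs."""
    id: str
    century: list[int] = field(default_factory=list)


def filter_independent(entries, century_min, century_max):
    """Mirror the two separate any() filters used by search()."""
    kept = [e for e in entries if any(c >= century_min for c in e.century)]
    return [e for e in kept if any(c <= century_max for c in e.century)]


def filter_overlap(entries, century_min, century_max):
    """Stricter alternative: a single century must fall inside the range."""
    return [
        e for e in entries
        if any(century_min <= c <= century_max for c in e.century)
    ]


entries = [
    Entry("inside", [13, 14]),
    Entry("straddling", [9, 20]),  # no century in 12..15, but one on each side
    Entry("too-late", [19]),
]

independent_ids = [e.id for e in filter_independent(entries, 12, 15)]
overlap_ids = [e.id for e in filter_overlap(entries, 12, 15)]
print(independent_ids, overlap_ids)
```

With the demo catalogue shown in this diff every entry covers a contiguous run of centuries, so the two variants would agree there; the difference only appears for entries with gaps.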
|
| 319 |
-
|
| 320 |
-
def get_by_id(self, entry_id: str) -> Optional[HTRUnitedEntry]:
|
| 321 |
-
"""Return an entry by its identifier."""
|
| 322 |
-
for e in self.entries:
|
| 323 |
-
if e.id == entry_id:
|
| 324 |
-
return e
|
| 325 |
-
return None
|
| 326 |
-
|
| 327 |
-
def available_languages(self) -> list[str]:
|
| 328 |
-
seen: set[str] = set()
|
| 329 |
-
result: list[str] = []
|
| 330 |
-
for e in self.entries:
|
| 331 |
-
for lang in e.language:
|
| 332 |
-
if lang not in seen:
|
| 333 |
-
seen.add(lang)
|
| 334 |
-
result.append(lang)
|
| 335 |
-
return sorted(result)
|
| 336 |
-
|
| 337 |
-
def available_scripts(self) -> list[str]:
|
| 338 |
-
seen: set[str] = set()
|
| 339 |
-
result: list[str] = []
|
| 340 |
-
for e in self.entries:
|
| 341 |
-
for sc in e.script:
|
| 342 |
-
if sc not in seen:
|
| 343 |
-
seen.add(sc)
|
| 344 |
-
result.append(sc)
|
| 345 |
-
return sorted(result)
|
| 346 |
-
|
| 347 |
-
|
| 348 |
-
# ---------------------------------------------------------------------------
|
| 349 |
-
# Corpus import
|
| 350 |
-
# ---------------------------------------------------------------------------
|
| 351 |
-
|
| 352 |
-
def import_htr_united_corpus(
|
| 353 |
-
entry: HTRUnitedEntry,
|
| 354 |
-
output_dir: str | Path,
|
| 355 |
-
max_samples: int = 100,
|
| 356 |
-
show_progress: bool = True,
|
| 357 |
-
) -> dict:
|
| 358 |
-
"""Import an HTR-United corpus into a local folder.
|
| 359 |
-
|
| 360 |
-
Return a dict with the import metadata.
|
| 361 |
-
Note: when the GitHub repository cannot be reached over the network,
|
| 362 |
-
placeholder files are generated (for tests and demos).
|
| 363 |
-
"""
|
| 364 |
-
output_path = Path(output_dir)
|
| 365 |
-
output_path.mkdir(parents=True, exist_ok=True)
|
| 366 |
-
|
| 367 |
-
# Save the metadata
|
| 368 |
-
meta = {
|
| 369 |
-
"source": "htr-united",
|
| 370 |
-
"entry_id": entry.id,
|
| 371 |
-
"title": entry.title,
|
| 372 |
-
"url": entry.url,
|
| 373 |
-
"language": entry.language,
|
| 374 |
-
"script": entry.script,
|
| 375 |
-
"century": entry.century,
|
| 376 |
-
"institution": entry.institution,
|
| 377 |
-
"license": entry.license,
|
| 378 |
-
"format": entry.format,
|
| 379 |
-
"imported_at": _iso_now(),
|
| 380 |
-
}
|
| 381 |
-
(output_path / "htr_united_meta.json").write_text(
|
| 382 |
-
json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8"
|
| 383 |
-
)
|
| 384 |
-
|
| 385 |
-
# Attempt a real download from GitHub (ZIP archive of the main branch)
|
| 386 |
-
downloaded = _try_download_corpus(entry, output_path, max_samples, show_progress)
|
| 387 |
-
|
| 388 |
-
return {
|
| 389 |
-
"entry_id": entry.id,
|
| 390 |
-
"title": entry.title,
|
| 391 |
-
"output_dir": str(output_path),
|
| 392 |
-
"files_imported": downloaded,
|
| 393 |
-
"metadata_file": str(output_path / "htr_united_meta.json"),
|
| 394 |
-
}
|
| 395 |
-
|
| 396 |
-
|
| 397 |
-
def _try_download_corpus(
|
| 398 |
-
entry: HTRUnitedEntry,
|
| 399 |
-
output_path: Path,
|
| 400 |
-
max_samples: int,
|
| 401 |
-
show_progress: bool,
|
| 402 |
-
) -> int:
|
| 403 |
-
"""Try to download the corpus from GitHub. Return the number of files imported."""
|
| 404 |
-
# Build the URL of the GitHub repository's ZIP archive
|
| 405 |
-
repo_path = _extract_github_repo(entry.url)
|
| 406 |
-
if not repo_path:
|
| 407 |
-
return 0
|
| 408 |
-
|
| 409 |
-
zip_url = f"https://github.com/{repo_path}/archive/refs/heads/main.zip"
|
| 410 |
-
try:
|
| 411 |
-
req = urllib.request.Request(
|
| 412 |
-
zip_url,
|
| 413 |
-
headers={"User-Agent": "picarones-htr-united-importer/1.0"},
|
| 414 |
-
)
|
| 415 |
-
with urllib.request.urlopen(req, timeout=30) as resp:
|
| 416 |
-
import io
|
| 417 |
-
import zipfile
|
| 418 |
-
|
| 419 |
-
data = resp.read()
|
| 420 |
-
with zipfile.ZipFile(io.BytesIO(data)) as zf:
|
| 421 |
-
# Extract the ALTO/PAGE/GT files
|
| 422 |
-
gt_files = [
|
| 423 |
-
n for n in zf.namelist()
|
| 424 |
-
if n.endswith((".alto.xml", ".page.xml", ".gt.txt", ".xml"))
|
| 425 |
-
and not n.endswith("/")
|
| 426 |
-
][:max_samples]
|
| 427 |
-
for i, fname in enumerate(gt_files):
|
| 428 |
-
dest = output_path / Path(fname).name
|
| 429 |
-
dest.write_bytes(zf.read(fname))
|
| 430 |
-
return len(gt_files)
|
| 431 |
-
except Exception as exc:  # noqa: BLE001 - broad surface (network, ZIP, FS)
|
| 432 |
-
# Sprint A3 (B-3): record the incident rather than hide it;
|
| 433 |
-
# the caller still receives 0, which preserves the numeric
|
| 434 |
-
# return contract.
|
| 435 |
-
from picarones.extras.importers._fallback_log import record_fallback
|
| 436 |
-
record_fallback(
|
| 437 |
-
importer="htr_united",
|
| 438 |
-
operation="download_zip_samples",
|
| 439 |
-
error=exc,
|
| 440 |
-
extra={"output_path": str(output_path)},
|
| 441 |
-
)
|
| 442 |
-
return 0
|
| 443 |
-
|
| 444 |
-
|
| 445 |
-
def _extract_github_repo(url: str) -> Optional[str]:
|
| 446 |
-
"""Extract 'owner/repo' from a GitHub URL."""
|
| 447 |
-
m = re.match(r"https?://github\.com/([^/]+/[^/]+?)(?:\.git)?/?$", url)
|
| 448 |
-
return m.group(1) if m else None
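To make the behaviour of the pattern in `_extract_github_repo` concrete, here is a small standalone sketch. It copies the regular expression from the diff; the public function name and the sample URLs are illustrative assumptions, not part of the project.

```python
import re
from typing import Optional

# Same pattern as in the diff above; compiled once for reuse.
_GITHUB_REPO = re.compile(r"https?://github\.com/([^/]+/[^/]+?)(?:\.git)?/?$")


def extract_github_repo(url: str) -> Optional[str]:
    """Return 'owner/repo' for a repository root URL, else None."""
    m = _GITHUB_REPO.match(url)
    return m.group(1) if m else None


base = "https://github.com/HTR-United/cremma-medieval"
plain = extract_github_repo(base)
dot_git = extract_github_repo(base + ".git")        # optional .git suffix is stripped
trailing = extract_github_repo(base + "/")          # optional trailing slash is accepted
subpage = extract_github_repo(base + "/tree/main")  # deeper paths are rejected
other_host = extract_github_repo("https://gitlab.com/HTR-United/cremma-medieval")
print(plain, dot_git, trailing, subpage, other_host)
```

Only repository root URLs on `github.com` match, so an entry whose `url` points to a sub-page or to another host makes `_try_download_corpus` return 0 before any network call.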
|
| 449 |
-
|
| 450 |
-
|
| 451 |
-
def _parse_yml_catalogue(raw: str) -> list[HTRUnitedEntry]:
|
| 452 |
-
"""Rudimentary parsing of the HTR-United YAML catalogue."""
|
| 453 |
-
try:
|
| 454 |
-
import yaml
|
| 455 |
-
data = yaml.safe_load(raw)
|
| 456 |
-
if isinstance(data, list):
|
| 457 |
-
return [HTRUnitedEntry.from_dict(d) for d in data if isinstance(d, dict)]
|
| 458 |
-
except Exception as exc:  # noqa: BLE001 - yaml + parsing of user-supplied input
|
| 459 |
-
# Sprint A3 (B-3): malformed YAML switches to demo mode
|
| 460 |
-
# without the user being told; we log it and emit a Fact
|
| 461 |
-
# so that the report summary mentions the incident.
|
| 462 |
-
from picarones.extras.importers._fallback_log import record_fallback
|
| 463 |
-
record_fallback(
|
| 464 |
-
importer="htr_united",
|
| 465 |
-
operation="yaml_catalogue_parse",
|
| 466 |
-
error=exc,
|
| 467 |
-
)
|
| 468 |
-
return [HTRUnitedEntry.from_dict(d) for d in _DEMO_CATALOGUE]
|
| 469 |
-
|
| 470 |
-
|
| 471 |
-
def _iso_now() -> str:
|
| 472 |
-
from datetime import datetime, timezone
|
| 473 |
-
return datetime.now(timezone.utc).isoformat(timespec="seconds")
|
|
|
|
| 1 |
+
"""Re-export (Sprint A14-S11). The canonical content lives in
|
| 2 |
+
``picarones.adapters.corpus.htr_united``.
|
| 3 |
"""
|
| 4 |
|
| 5 |
from __future__ import annotations
|
| 6 |
|
| 7 |
+
from picarones.adapters.corpus.htr_united import * # noqa: F401,F403
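The re-export above relies on a star import. As a rough illustration of why the commit message mentions re-exposing some private symbols explicitly, the sketch below builds a throwaway module in memory (the names `canonical_mod`, `public_helper` and `_PRIVATE_TABLE` are hypothetical) and shows that `from module import *` skips underscore-prefixed names when the module defines no `__all__`.

```python
import sys
import types

# Hypothetical stand-in for the canonical module (not the real picarones code).
canonical = types.ModuleType("canonical_mod")
exec(
    "def public_helper():\n"
    "    return 'ok'\n"
    "_PRIVATE_TABLE = {'a': 1}\n",
    canonical.__dict__,
)
sys.modules["canonical_mod"] = canonical

# Namespace playing the role of the old-location shim file.
shim: dict = {}
exec("from canonical_mod import *", shim)

star_exports_public = "public_helper" in shim
star_exports_private = "_PRIVATE_TABLE" in shim

# Underscore-prefixed names need an explicit import to be re-exposed.
exec("from canonical_mod import _PRIVATE_TABLE", shim)
explicit_exports_private = "_PRIVATE_TABLE" in shim

print(star_exports_public, star_exports_private, explicit_exports_private)
```

A shim that must keep a private name importable from the old path would therefore add an explicit `from ... import _name` line next to the star import.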
|
@@ -1,464 +1,11 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
integration tests. Use at your own risk until an institutional use
|
| 6 |
-
case validates its behaviour. A ``UserWarning`` is emitted at
|
| 7 |
-
import time as a reminder.
|
| 8 |
-
|
| 9 |
-
This module provides:
|
| 10 |
-
- :class:`HuggingFaceDataset`: metadata for a HuggingFace dataset
|
| 11 |
-
- :class:`HuggingFaceImporter`: dataset search and import
|
| 12 |
-
- :func:`search_hf_datasets`: search by tags through the HuggingFace API
|
| 13 |
-
- :func:`import_hf_dataset`: download of a dataset to a local folder
|
| 14 |
-
|
| 15 |
-
Reference heritage datasets are pre-listed so they can be discovered
|
| 16 |
-
quickly without a network request.
|
| 17 |
-
|
| 18 |
-
Example
|
| 19 |
-
-------
|
| 20 |
-
importer = HuggingFaceImporter()
|
| 21 |
-
results = importer.search("medieval OCR", tags=["ocr"])
|
| 22 |
-
corpus = importer.import_dataset(results[0].dataset_id, output_dir="./corpus/")
|
| 23 |
"""
|
| 24 |
|
| 25 |
from __future__ import annotations
|
| 26 |
|
| 27 |
-
import json
|
| 28 |
-
import os
|
| 29 |
-
import urllib.error
|
| 30 |
-
import urllib.parse
|
| 31 |
-
import urllib.request
|
| 32 |
-
import warnings
|
| 33 |
-
from dataclasses import dataclass, field
|
| 34 |
-
from pathlib import Path
|
| 35 |
-
from typing import Optional
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
# Emit the ``experimental`` warning at import time. Phase C of the
|
| 39 |
-
# rewrite effort; see the module docstring above.
|
| 40 |
-
warnings.warn(
|
| 41 |
-
"picarones.extras.importers.huggingface is experimental and may "
|
| 42 |
-
"change or be removed without notice. Use at your own risk until "
|
| 43 |
-
"an institutional use case validates the API.",
|
| 44 |
-
category=UserWarning,
|
| 45 |
-
stacklevel=2,
|
| 46 |
-
)
|
| 47 |
-
|
| 48 |
-
# ---------------------------------------------------------------------------
|
| 49 |
-
# Pre-listed reference datasets
|
| 50 |
-
# ---------------------------------------------------------------------------
|
| 51 |
-
|
| 52 |
-
_REFERENCE_DATASETS: list[dict] = [
|
| 53 |
-
{
|
| 54 |
-
"dataset_id": "Teklia/RIMES",
|
| 55 |
-
"title": "RIMES — Reconnaissance et Indexation de données Manuscrites et de fac-similEs",
|
| 56 |
-
"description": "Corpus de courriers manuscrits français modernes. Standard de référence pour la reconnaissance d'écriture manuscrite.",
|
| 57 |
-
"language": ["French"],
|
| 58 |
-
"tags": ["htr", "ocr", "handwritten", "french", "modern"],
|
| 59 |
-
"license": "cc-by-4.0",
|
| 60 |
-
"size_category": "1K<n<10K",
|
| 61 |
-
"task": "image-to-text",
|
| 62 |
-
"institution": "IRISA / A2iA",
|
| 63 |
-
"downloads": 1200,
|
| 64 |
-
},
|
| 65 |
-
{
|
| 66 |
-
"dataset_id": "Teklia/IAM",
|
| 67 |
-
"title": "IAM Handwriting Database",
|
| 68 |
-
"description": "Corpus de référence anglais pour la reconnaissance d'écriture manuscrite.",
|
| 69 |
-
"language": ["English"],
|
| 70 |
-
"tags": ["htr", "ocr", "handwritten", "english"],
|
| 71 |
-
"license": "other",
|
| 72 |
-
"size_category": "10K<n<100K",
|
| 73 |
-
"task": "image-to-text",
|
| 74 |
-
"institution": "University of Bern",
|
| 75 |
-
"downloads": 8400,
|
| 76 |
-
},
|
| 77 |
-
{
|
| 78 |
-
"dataset_id": "CATMuS/medieval",
|
| 79 |
-
"title": "CATMuS Medieval — Consistent Approaches to Transcribing ManuScripts",
|
| 80 |
-
"description": "Dataset multilingue de manuscrits médiévaux (latin, français, occitan, espagnol) pour l'entraînement de modèles HTR.",
|
| 81 |
-
"language": ["Latin", "French", "Occitan", "Spanish"],
|
| 82 |
-
"tags": ["htr", "medieval", "manuscripts", "latin", "french", "historical"],
|
| 83 |
-
"license": "cc-by-4.0",
|
| 84 |
-
"size_category": "100K<n<1M",
|
| 85 |
-
"task": "image-to-text",
|
| 86 |
-
"institution": "Inria / EPHE",
|
| 87 |
-
"downloads": 3100,
|
| 88 |
-
},
|
| 89 |
-
{
|
| 90 |
-
"dataset_id": "htr-united/cremma-medieval",
|
| 91 |
-
"title": "CREMMA Medieval",
|
| 92 |
-
"description": "Corpus de manuscrits médiévaux français XIIe-XVe siècles.",
|
| 93 |
-
"language": ["French", "Latin"],
|
| 94 |
-
"tags": ["htr", "medieval", "french", "manuscripts", "htr-united"],
|
| 95 |
-
"license": "cc-by-4.0",
|
| 96 |
-
"size_category": "1K<n<10K",
|
| 97 |
-
"task": "image-to-text",
|
| 98 |
-
"institution": "Inria",
|
| 99 |
-
"downloads": 520,
|
| 100 |
-
},
|
| 101 |
-
{
|
| 102 |
-
"dataset_id": "biglam/europeana_newspapers",
|
| 103 |
-
"title": "Europeana Newspapers",
|
| 104 |
-
"description": "Journaux numérisés européens du XIXe siècle (OCR + images).",
|
| 105 |
-
"language": ["French", "German", "Dutch", "Finnish"],
|
| 106 |
-
"tags": ["ocr", "newspapers", "historical", "19th-century", "europeana"],
|
| 107 |
-
"license": "cc0-1.0",
|
| 108 |
-
"size_category": "1M<n<10M",
|
| 109 |
-
"task": "image-to-text",
|
| 110 |
-
"institution": "Europeana Foundation",
|
| 111 |
-
"downloads": 15200,
|
| 112 |
-
},
|
| 113 |
-
{
|
| 114 |
-
"dataset_id": "stefanklut/esposalles",
|
| 115 |
-
"title": "Esposalles Dataset",
|
| 116 |
-
"description": "Registres de mariage catalans du XVIIe siècle pour la reconnaissance d'écriture historique.",
|
| 117 |
-
"language": ["Catalan", "Latin"],
|
| 118 |
-
"tags": ["htr", "historical", "registers", "catalan", "17th-century"],
|
| 119 |
-
"license": "cc-by-4.0",
|
| 120 |
-
"size_category": "1K<n<10K",
|
| 121 |
-
"task": "image-to-text",
|
| 122 |
-
"institution": "Universitat Autònoma de Barcelona",
|
| 123 |
-
"downloads": 340,
|
| 124 |
-
},
|
| 125 |
-
{
|
| 126 |
-
"dataset_id": "bnf-gallica/gallica-ocr",
|
| 127 |
-
"title": "Gallica OCR",
|
| 128 |
-
"description": "Extraits d'imprimés anciens numérisés depuis Gallica avec vérité terrain.",
|
| 129 |
-
"language": ["French", "Latin"],
|
| 130 |
-
"tags": ["ocr", "historical", "printed", "gallica", "french"],
|
| 131 |
-
"license": "etalab-2.0",
|
| 132 |
-
"size_category": "10K<n<100K",
|
| 133 |
-
"task": "image-to-text",
|
| 134 |
-
"institution": "Gallica",
|
| 135 |
-
"downloads": 2800,
|
| 136 |
-
},
|
| 137 |
-
{
|
| 138 |
-
"dataset_id": "Bozen-Baptism/baptism-records",
|
| 139 |
-
"title": "Bozen Baptism Records",
|
| 140 |
-
"description": "Registres de baptêmes de Bozen (Italie/Autriche) du XVIIIe siècle.",
|
| 141 |
-
"language": ["German", "Latin"],
|
| 142 |
-
"tags": ["htr", "historical", "registers", "german", "latin", "18th-century"],
|
| 143 |
-
"license": "cc-by-4.0",
|
| 144 |
-
"size_category": "1K<n<10K",
|
| 145 |
-
"task": "image-to-text",
|
| 146 |
-
"institution": "University of Innsbruck",
|
| 147 |
-
"downloads": 190,
|
| 148 |
-
},
|
| 149 |
-
{
|
| 150 |
-
"dataset_id": "read-bad/readbad",
|
| 151 |
-
"title": "READ-BAD — Recognition and Enrichment of Archival Documents",
|
| 152 |
-
"description": "Corpus multilingue de documents d'archives pour l'OCR historique (Latin, Allemand, Anglais).",
|
| 153 |
-
"language": ["German", "English", "Latin"],
|
| 154 |
-
"tags": ["ocr", "htr", "historical", "archives", "read"],
|
| 155 |
-
"license": "cc-by-4.0",
|
| 156 |
-
"size_category": "10K<n<100K",
|
| 157 |
-
"task": "image-to-text",
|
| 158 |
-
"institution": "University of Graz",
|
| 159 |
-
"downloads": 1050,
|
| 160 |
-
},
|
| 161 |
-
]
|
| 162 |
-
|
| 163 |
-
# ---------------------------------------------------------------------------
|
| 164 |
-
# Dataclass
|
| 165 |
-
# ---------------------------------------------------------------------------
|
| 166 |
-
|
| 167 |
-
@dataclass
|
| 168 |
-
class HuggingFaceDataset:
|
| 169 |
-
"""Metadata for a HuggingFace dataset."""
|
| 170 |
-
|
| 171 |
-
dataset_id: str
|
| 172 |
-
title: str
|
| 173 |
-
description: str = ""
|
| 174 |
-
language: list[str] = field(default_factory=list)
|
| 175 |
-
tags: list[str] = field(default_factory=list)
|
| 176 |
-
license: str = ""
|
| 177 |
-
size_category: str = ""
|
| 178 |
-
task: str = "image-to-text"
|
| 179 |
-
institution: str = ""
|
| 180 |
-
downloads: int = 0
|
| 181 |
-
source: str = "reference" # "reference" | "api"
|
| 182 |
-
|
| 183 |
-
def as_dict(self) -> dict:
|
| 184 |
-
return {
|
| 185 |
-
"dataset_id": self.dataset_id,
|
| 186 |
-
"title": self.title,
|
| 187 |
-
"description": self.description,
|
| 188 |
-
"language": self.language,
|
| 189 |
-
"tags": self.tags,
|
| 190 |
-
"license": self.license,
|
| 191 |
-
"size_category": self.size_category,
|
| 192 |
-
"task": self.task,
|
| 193 |
-
"institution": self.institution,
|
| 194 |
-
"downloads": self.downloads,
|
| 195 |
-
"source": self.source,
|
| 196 |
-
}
|
| 197 |
-
|
| 198 |
-
@classmethod
|
| 199 |
-
def from_dict(cls, d: dict) -> "HuggingFaceDataset":
|
| 200 |
-
return cls(
|
| 201 |
-
dataset_id=d.get("dataset_id", d.get("id", "")),
|
| 202 |
-
title=d.get("title", d.get("dataset_id", "")),
|
| 203 |
-
description=d.get("description", ""),
|
| 204 |
-
language=d.get("language", []),
|
| 205 |
-
tags=d.get("tags", []),
|
| 206 |
-
license=d.get("license", ""),
|
| 207 |
-
size_category=d.get("size_category", d.get("cardData", {}).get("size_categories", [""])[0] if isinstance(d.get("cardData"), dict) else ""),
|
| 208 |
-
task=d.get("task", "image-to-text"),
|
| 209 |
-
institution=d.get("institution", ""),
|
| 210 |
-
downloads=d.get("downloads", d.get("downloadsAllTime", 0)),
|
| 211 |
-
source=d.get("source", "api"),
|
| 212 |
-
)
|
| 213 |
-
|
| 214 |
-
@property
|
| 215 |
-
def hf_url(self) -> str:
|
| 216 |
-
return f"https://huggingface.co/datasets/{self.dataset_id}"
|
| 217 |
-
|
| 218 |
-
|
| 219 |
-
# ---------------------------------------------------------------------------
|
| 220 |
-
# Main importer
|
| 221 |
-
# ---------------------------------------------------------------------------
|
| 222 |
-
|
| 223 |
-
class HuggingFaceImporter:
|
| 224 |
-
"""Search and import datasets from the HuggingFace Hub."""
|
| 225 |
-
|
| 226 |
-
_API_BASE = "https://huggingface.co/api"
|
| 227 |
-
|
| 228 |
-
def __init__(self, token: Optional[str] = None) -> None:
|
| 229 |
-
self._token = token or os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_TOKEN")
|
| 230 |
-
|
| 231 |
-
def _headers(self) -> dict:
|
| 232 |
-
h = {"User-Agent": "picarones-hf-importer/1.0"}
|
| 233 |
-
if self._token:
|
| 234 |
-
h["Authorization"] = f"Bearer {self._token}"
|
| 235 |
-
return h
|
| 236 |
-
|
| 237 |
-
def search(
|
| 238 |
-
self,
|
| 239 |
-
query: str = "",
|
| 240 |
-
tags: Optional[list[str]] = None,
|
| 241 |
-
language: Optional[str] = None,
|
| 242 |
-
limit: int = 20,
|
| 243 |
-
use_reference: bool = True,
|
| 244 |
-
) -> list[HuggingFaceDataset]:
|
| 245 |
-
"""Search datasets with filters.
|
| 246 |
-
|
| 247 |
-
Query the built-in reference datasets first, then the
|
| 248 |
-
HuggingFace API if available.
|
| 249 |
-
"""
|
| 250 |
-
results: list[HuggingFaceDataset] = []
|
| 251 |
-
|
| 252 |
-
# Reference datasets
|
| 253 |
-
if use_reference:
|
| 254 |
-
ref_results = self._search_reference(query, tags, language)
|
| 255 |
-
results.extend(ref_results)
|
| 256 |
-
|
| 257 |
-
# HuggingFace API (optional, may fail silently)
|
| 258 |
-
try:
|
| 259 |
-
api_results = self._search_api(query, tags, language, limit)
|
| 260 |
-
# Deduplicate (reference datasets take priority)
|
| 261 |
-
existing_ids = {r.dataset_id for r in results}
|
| 262 |
-
for ds in api_results:
|
| 263 |
-
if ds.dataset_id not in existing_ids:
|
| 264 |
-
results.append(ds)
|
| 265 |
-
existing_ids.add(ds.dataset_id)
|
| 266 |
-
except Exception as exc:  # noqa: BLE001 - network / third-party API
|
| 267 |
-
# Sprint A3 (B-3): the API search fails silently, so the user
|
| 268 |
-
# only sees the reference datasets and believes the API returned
|
| 269 |
-
# nothing. We record the incident.
|
| 270 |
-
from picarones.extras.importers._fallback_log import record_fallback
|
| 271 |
-
record_fallback(
|
| 272 |
-
importer="huggingface",
|
| 273 |
-
operation="hub_search_api",
|
| 274 |
-
error=exc,
|
| 275 |
-
extra={"query": query, "language": language, "limit": limit},
|
| 276 |
-
)
|
| 277 |
-
|
| 278 |
-
return results[:limit]
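The merge performed by `search()` (reference datasets first, API results appended only when their identifier is new, then truncation to `limit`) can be sketched in isolation. The helper name `merge_with_priority` and the plain-dict records are assumptions made for this illustration; the real method works on `HuggingFaceDataset` objects.

```python
def merge_with_priority(reference, api, limit):
    """Keep every reference record, append unseen API records, cap at limit."""
    seen = {r["dataset_id"] for r in reference}
    merged = list(reference)
    for ds in api:
        if ds["dataset_id"] not in seen:
            merged.append(ds)
            seen.add(ds["dataset_id"])
    return merged[:limit]


reference = [
    {"dataset_id": "Teklia/RIMES", "source": "reference"},
    {"dataset_id": "Teklia/IAM", "source": "reference"},
]
api = [
    {"dataset_id": "Teklia/IAM", "source": "api"},        # duplicate, dropped
    {"dataset_id": "CATMuS/medieval", "source": "api"},   # new, appended
    {"dataset_id": "example/extra", "source": "api"},     # new, cut by the limit
]

merged = merge_with_priority(reference, api, limit=3)
merged_ids = [d["dataset_id"] for d in merged]
merged_sources = [d["source"] for d in merged]
print(merged_ids, merged_sources)
```

Because the cap is applied after the merge, reference matches always come first and API results only fill whatever room is left under `limit`.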
|
| 279 |
-
|
| 280 |
-
def _search_reference(
|
| 281 |
-
self,
|
| 282 |
-
query: str,
|
| 283 |
-
tags: Optional[list[str]],
|
| 284 |
-
language: Optional[str],
|
| 285 |
-
) -> list[HuggingFaceDataset]:
|
| 286 |
-
datasets = [HuggingFaceDataset.from_dict(d) for d in _REFERENCE_DATASETS]
|
| 287 |
-
datasets = [ds._replace_source("reference") for ds in datasets]
|
| 288 |
-
|
| 289 |
-
if query:
|
| 290 |
-
q = query.lower()
|
| 291 |
-
datasets = [
|
| 292 |
-
ds for ds in datasets
|
| 293 |
-
if (q in ds.title.lower()
|
| 294 |
-
or q in ds.description.lower()
|
| 295 |
-
or q in ds.dataset_id.lower()
|
| 296 |
-
or any(q in t.lower() for t in ds.tags)
|
| 297 |
-
or any(q in lg.lower() for lg in ds.language))
|
| 298 |
-
]
|
| 299 |
-
|
| 300 |
-
if tags:
|
| 301 |
-
for tag in tags:
|
| 302 |
-
t_lower = tag.lower()
|
| 303 |
-
datasets = [
|
| 304 |
-
ds for ds in datasets
|
| 305 |
-
if any(t_lower in dt.lower() for dt in ds.tags)
|
| 306 |
-
]
|
| 307 |
-
|
| 308 |
-
if language:
|
| 309 |
-
lang_lower = language.lower()
|
| 310 |
-
datasets = [
|
| 311 |
-
ds for ds in datasets
|
| 312 |
-
if any(lang_lower in lg.lower() for lg in ds.language)
|
| 313 |
-
]
|
| 314 |
-
|
| 315 |
-
return datasets
|
| 316 |
-
|
| 317 |
-
def _search_api(
|
| 318 |
-
self,
|
| 319 |
-
query: str,
|
| 320 |
-
tags: Optional[list[str]],
|
| 321 |
-
language: Optional[str],
|
| 322 |
-
limit: int,
|
| 323 |
-
) -> list[HuggingFaceDataset]:
|
| 324 |
-
params: dict[str, str] = {
|
| 325 |
-
"task_categories": "image-to-text",
|
| 326 |
-
"limit": str(min(limit, 50)),
|
| 327 |
-
"full": "False",
|
| 328 |
-
}
|
| 329 |
-
if query:
|
| 330 |
-
params["search"] = query
|
| 331 |
-
if language:
|
| 332 |
-
params["language"] = language
|
| 333 |
-
if tags:
|
| 334 |
-
params["tags"] = ",".join(tags)
|
| 335 |
-
|
| 336 |
-
url = f"{self._API_BASE}/datasets?" + urllib.parse.urlencode(params)
|
| 337 |
-
req = urllib.request.Request(url, headers=self._headers())
|
| 338 |
-
with urllib.request.urlopen(req, timeout=10) as resp:
|
| 339 |
-
data = json.loads(resp.read().decode("utf-8"))
|
| 340 |
-
|
| 341 |
-
results = []
|
| 342 |
-
for item in data if isinstance(data, list) else []:
|
| 343 |
-
ds = HuggingFaceDataset(
|
| 344 |
-
dataset_id=item.get("id", ""),
|
| 345 |
-
title=item.get("id", ""),
|
| 346 |
-
description=item.get("description", ""),
|
| 347 |
-
language=item.get("language", []),
|
| 348 |
-
tags=item.get("tags", []),
|
| 349 |
-
license=item.get("license", ""),
|
| 350 |
-
size_category=(
|
| 351 |
-
item.get("cardData", {}).get("size_categories", [""])[0]
|
| 352 |
-
if isinstance(item.get("cardData"), dict)
|
| 353 |
-
else ""
|
| 354 |
-
),
|
| 355 |
-
task="image-to-text",
|
| 356 |
-
downloads=item.get("downloadsAllTime", 0),
|
| 357 |
-
source="api",
|
| 358 |
-
)
|
| 359 |
-
if ds.dataset_id:
|
| 360 |
-
results.append(ds)
|
| 361 |
-
return results
|
| 362 |
-
|
| 363 |
-
def import_dataset(
|
| 364 |
-
self,
|
| 365 |
-
dataset_id: str,
|
| 366 |
-
output_dir: str | Path,
|
| 367 |
-
split: str = "train",
|
| 368 |
-
max_samples: int = 100,
|
| 369 |
-
show_progress: bool = True,
|
| 370 |
-
) -> dict:
|
| 371 |
-
"""Importe un dataset depuis HuggingFace vers un dossier local.
|
| 372 |
-
|
| 373 |
-
Retourne les métadonnées de l'import.
|
| 374 |
-
"""
|
| 375 |
-
output_path = Path(output_dir)
|
| 376 |
-
output_path.mkdir(parents=True, exist_ok=True)
|
| 377 |
-
|
| 378 |
-
meta = {
|
| 379 |
-
"source": "huggingface",
|
| 380 |
-
"dataset_id": dataset_id,
|
| 381 |
-
"split": split,
|
| 382 |
-
"max_samples": max_samples,
|
| 383 |
-
"imported_at": _iso_now(),
|
| 384 |
-
}
|
| 385 |
-
meta_file = output_path / "huggingface_meta.json"
|
| 386 |
-
meta_file.write_text(json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8")
|
| 387 |
-
|
| 388 |
-
# Tentative d'import via datasets library si disponible
|
| 389 |
-
files_imported = _try_import_with_datasets_lib(
|
| 390 |
-
dataset_id, output_path, split, max_samples, show_progress
|
| 391 |
-
)
|
| 392 |
-
|
| 393 |
-
return {
|
| 394 |
-
"dataset_id": dataset_id,
|
| 395 |
-
"output_dir": str(output_path),
|
| 396 |
-
"files_imported": files_imported,
|
| 397 |
-
"metadata_file": str(meta_file),
|
| 398 |
-
}
|
| 399 |
-
|
| 400 |
-
|
| 401 |
-
def _try_import_with_datasets_lib(
|
| 402 |
-
dataset_id: str,
|
| 403 |
-
output_path: Path,
|
| 404 |
-
split: str,
|
| 405 |
-
max_samples: int,
|
| 406 |
-
show_progress: bool,
|
| 407 |
-
) -> int:
|
| 408 |
-
"""Essaie d'importer avec la librairie `datasets` de HuggingFace."""
|
| 409 |
-
try:
|
| 410 |
-
from datasets import load_dataset # type: ignore
|
| 411 |
-
|
| 412 |
-
ds = load_dataset(dataset_id, split=split, streaming=True)
|
| 413 |
-
count = 0
|
| 414 |
-
for i, item in enumerate(ds):
|
| 415 |
-
if i >= max_samples:
|
| 416 |
-
break
|
| 417 |
-
# Cherche champ image et texte
|
| 418 |
-
image = item.get("image") or item.get("img")
|
| 419 |
-
text = item.get("text") or item.get("transcription") or item.get("ground_truth", "")
|
| 420 |
-
|
| 421 |
-
if image is not None:
|
| 422 |
-
img_file = output_path / f"doc_{i:04d}.jpg"
|
| 423 |
-
try:
|
| 424 |
-
image.save(str(img_file))
|
| 425 |
-
except Exception as exc: # noqa: BLE001 — PIL/PIL-IO
|
| 426 |
-
# Sprint A3 (B-3) : un échec de sauvegarde d'image
|
| 427 |
-
# produirait un GT orphelin (texte sans image). On
|
| 428 |
-
# documente et on continue — le GT est tout de même
|
| 429 |
-
# écrit pour préserver la cohérence numérique du compteur.
|
| 430 |
-
from picarones.extras.importers._fallback_log import record_fallback
|
| 431 |
-
record_fallback(
|
| 432 |
-
importer="huggingface",
|
| 433 |
-
operation="image_save",
|
| 434 |
-
error=exc,
|
| 435 |
-
extra={"img_file": str(img_file), "doc_index": i},
|
| 436 |
-
)
|
| 437 |
-
|
| 438 |
-
gt_file = output_path / f"doc_{i:04d}.gt.txt"
|
| 439 |
-
gt_file.write_text(str(text), encoding="utf-8")
|
| 440 |
-
count += 1
|
| 441 |
-
|
| 442 |
-
return count
|
| 443 |
-
except (ImportError, Exception):
|
| 444 |
-
return 0
|
| 445 |
-
|
| 446 |
-
|
| 447 |
-
def _iso_now() -> str:
|
| 448 |
-
from datetime import datetime, timezone
|
| 449 |
-
return datetime.now(timezone.utc).isoformat(timespec="seconds")
|
| 450 |
-
|
| 451 |
-
|
| 452 |
-
# ---------------------------------------------------------------------------
|
| 453 |
-
# Extension de HuggingFaceDataset (helper privé)
|
| 454 |
-
# ---------------------------------------------------------------------------
|
| 455 |
-
|
| 456 |
-
def _patch_dataset_replace_source() -> None:
|
| 457 |
-
"""Ajoute un helper _replace_source à HuggingFaceDataset."""
|
| 458 |
-
def _replace_source(self, source: str) -> "HuggingFaceDataset":
|
| 459 |
-
from dataclasses import replace
|
| 460 |
-
return replace(self, source=source)
|
| 461 |
-
HuggingFaceDataset._replace_source = _replace_source
|
| 462 |
-
|
| 463 |
-
|
| 464 |
-
_patch_dataset_replace_source()
|
|
|
|
| 1 |
+
"""Re-export (Sprint A14-S11). The canonical content lives in
|
| 2 |
+
``picarones.adapters.corpus.huggingface``.
|
| 3 |
|
| 4 |
+
Explicitly re-exports ``_REFERENCE_DATASETS`` (imported by the
|
| 5 |
+
web tests).
|
| 6 |
"""
|
| 7 |
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.adapters.corpus.huggingface import * # noqa: F401,F403
|
| 11 |
+
from picarones.adapters.corpus.huggingface import _REFERENCE_DATASETS # noqa: F401
|
@@ -1,111 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
import os
|
| 7 |
-
from typing import Optional
|
| 8 |
-
|
| 9 |
-
from picarones.llm.base import (
|
| 10 |
-
BaseLLMAdapter,
|
| 11 |
-
log_http_error,
|
| 12 |
-
normalize_llm_content,
|
| 13 |
-
)
|
| 14 |
-
|
| 15 |
-
logger = logging.getLogger(__name__)
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
class AnthropicAdapter(BaseLLMAdapter):
|
| 19 |
-
"""Adaptateur pour les modèles Anthropic Claude.
|
| 20 |
-
|
| 21 |
-
Clé API via la variable d'environnement ``ANTHROPIC_API_KEY``.
|
| 22 |
-
|
| 23 |
-
Modes supportés : text_only, text_and_image, zero_shot.
|
| 24 |
-
"""
|
| 25 |
-
|
| 26 |
-
api_key_env_var = "ANTHROPIC_API_KEY"
|
| 27 |
|
| 28 |
-
|
| 29 |
-
def name(self) -> str:
|
| 30 |
-
return "anthropic"
|
| 31 |
-
|
| 32 |
-
@property
|
| 33 |
-
def default_model(self) -> str:
|
| 34 |
-
return "claude-sonnet-4-6"
|
| 35 |
-
|
| 36 |
-
def __init__(
|
| 37 |
-
self,
|
| 38 |
-
model: Optional[str] = None,
|
| 39 |
-
config: Optional[dict] = None,
|
| 40 |
-
) -> None:
|
| 41 |
-
super().__init__(model, config)
|
| 42 |
-
self._api_key = os.environ.get("ANTHROPIC_API_KEY")
|
| 43 |
-
|
| 44 |
-
def _call(self, prompt: str, image_b64: Optional[str] = None) -> str:
|
| 45 |
-
if not self._api_key:
|
| 46 |
-
raise RuntimeError(
|
| 47 |
-
"Clé API Anthropic manquante — définissez la variable d'environnement ANTHROPIC_API_KEY"
|
| 48 |
-
)
|
| 49 |
-
try:
|
| 50 |
-
import anthropic
|
| 51 |
-
except ImportError as exc:
|
| 52 |
-
raise RuntimeError(
|
| 53 |
-
"Le package 'anthropic' n'est pas installé. Lancez : pip install anthropic"
|
| 54 |
-
) from exc
|
| 55 |
-
|
| 56 |
-
client = anthropic.Anthropic(api_key=self._api_key)
|
| 57 |
-
temperature = float(self.config.get("temperature", 0.0))
|
| 58 |
-
max_tokens = int(self.config.get("max_tokens", 4096))
|
| 59 |
-
|
| 60 |
-
if image_b64:
|
| 61 |
-
content: list | str = [
|
| 62 |
-
{
|
| 63 |
-
"type": "image",
|
| 64 |
-
"source": {
|
| 65 |
-
"type": "base64",
|
| 66 |
-
"media_type": "image/png",
|
| 67 |
-
"data": image_b64,
|
| 68 |
-
},
|
| 69 |
-
},
|
| 70 |
-
{"type": "text", "text": prompt},
|
| 71 |
-
]
|
| 72 |
-
else:
|
| 73 |
-
content = prompt
|
| 74 |
-
|
| 75 |
-
try:
|
| 76 |
-
response = client.messages.create(
|
| 77 |
-
model=self.model,
|
| 78 |
-
max_tokens=max_tokens,
|
| 79 |
-
temperature=temperature,
|
| 80 |
-
messages=[{"role": "user", "content": content}],
|
| 81 |
-
)
|
| 82 |
-
except Exception as exc:
|
| 83 |
-
# Chantier 4 — log discriminant (401/429/5xx) factorisé.
|
| 84 |
-
# Auparavant Anthropic ne discriminait pas par code HTTP,
|
| 85 |
-
# difficile à diagnostiquer (clé invalide vs rate limit).
|
| 86 |
-
log_http_error(
|
| 87 |
-
"AnthropicAdapter", self.model, exc,
|
| 88 |
-
env_var=self.api_key_env_var,
|
| 89 |
-
)
|
| 90 |
-
raise
|
| 91 |
-
|
| 92 |
-
if not response.content:
|
| 93 |
-
logger.warning(
|
| 94 |
-
"[AnthropicAdapter] réponse vide (modèle=%s, stop_reason=%s).",
|
| 95 |
-
self.model, getattr(response, "stop_reason", None),
|
| 96 |
-
)
|
| 97 |
-
return ""
|
| 98 |
|
| 99 |
-
|
| 100 |
-
# retourne ``response.content`` comme une liste de blocs
|
| 101 |
-
# (``ContentBlock`` avec attribut ``text``). ``normalize_llm_content``
|
| 102 |
-
# concatène le texte de tous les blocs au lieu de ne prendre que
|
| 103 |
-
# le premier — utile quand le modèle émet plusieurs blocs.
|
| 104 |
-
text = normalize_llm_content(response.content)
|
| 105 |
-
if not text:
|
| 106 |
-
block = response.content[0]
|
| 107 |
-
logger.warning(
|
| 108 |
-
"[AnthropicAdapter] bloc de type '%s' sans texte (modèle=%s).",
|
| 109 |
-
getattr(block, "type", "unknown"), self.model,
|
| 110 |
-
)
|
| 111 |
-
return text
|
|
|
|
| 1 |
+
"""Re-export (Sprint A14-S11). The canonical content lives in
|
| 2 |
+
``picarones.adapters.llm.anthropic_adapter``.
|
| 3 |
|
| 4 |
+
The old path ``picarones.llm.anthropic_adapter`` is kept so that no
|
| 5 |
+
consumer breaks. This re-export will be removed in S22.
|
| 6 |
+
"""
|
| 7 |
|
| 8 |
+
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.adapters.llm.anthropic_adapter import * # noqa: F401,F403
|
@@ -1,279 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
import time
|
| 7 |
-
from abc import ABC, abstractmethod
|
| 8 |
-
from dataclasses import dataclass
|
| 9 |
-
from typing import Any, Optional
|
| 10 |
-
|
| 11 |
-
logger = logging.getLogger(__name__)
|
| 12 |
-
|
| 13 |
-
# Paramètres de retry par défaut
|
| 14 |
-
_DEFAULT_MAX_RETRIES = 3
|
| 15 |
-
_DEFAULT_BACKOFF_BASE = 2.0 # secondes : 2, 4, 8
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
def _is_retryable(exc: Exception) -> bool:
|
| 19 |
-
"""Détermine si une exception est retryable (429, 5xx, timeout réseau)."""
|
| 20 |
-
# HTTP status codes retryables
|
| 21 |
-
status = getattr(exc, "status_code", None) or getattr(exc, "http_status", None)
|
| 22 |
-
if status is not None:
|
| 23 |
-
return status == 429 or status >= 500
|
| 24 |
-
|
| 25 |
-
# Erreurs réseau / timeout
|
| 26 |
-
exc_name = type(exc).__name__
|
| 27 |
-
if exc_name in ("TimeoutError", "ConnectionError", "URLError"):
|
| 28 |
-
return True
|
| 29 |
-
|
| 30 |
-
# Messages d'erreur courants
|
| 31 |
-
msg = str(exc).lower()
|
| 32 |
-
if "rate" in msg and "limit" in msg:
|
| 33 |
-
return True
|
| 34 |
-
if "timeout" in msg or "connection" in msg:
|
| 35 |
-
return True
|
| 36 |
-
if "429" in msg or "503" in msg or "502" in msg:
|
| 37 |
-
return True
|
| 38 |
-
|
| 39 |
-
return False
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
def normalize_llm_content(raw: Any) -> str:
|
| 43 |
-
"""Normalise une réponse LLM en chaîne plate.
|
| 44 |
-
|
| 45 |
-
Chantier 4 (post-Sprint 97) — propagation du fix Mistral
|
| 46 |
-
Sprint 15 à tous les providers. Le SDK Mistral peut retourner
|
| 47 |
-
une liste de ``ContentChunk`` au lieu d'une chaîne pour certains
|
| 48 |
-
modèles/versions ; le SDK OpenAI peut faire de même quand on
|
| 49 |
-
active des features de structuration. Ce helper applique la même
|
| 50 |
-
discipline pour les 4 adapters :
|
| 51 |
-
|
| 52 |
-
- ``str`` → renvoyée telle quelle (ou ``""``).
|
| 53 |
-
- ``None`` → ``""``.
|
| 54 |
-
- ``list[ContentChunk]`` → concaténation des ``.text``.
|
| 55 |
-
- ``list[dict]`` avec clé ``text`` → concaténation des ``["text"]``.
|
| 56 |
-
- ``list[str]`` → concaténation directe.
|
| 57 |
-
- autre objet avec ``.text`` → ``obj.text``.
|
| 58 |
-
- autre → ``str(obj)`` (best-effort).
|
| 59 |
-
|
| 60 |
-
Le résultat est garanti être une ``str`` ; ``""`` quand la réponse
|
| 61 |
-
est vide. La fonction est idempotente : ``normalize_llm_content(s)
|
| 62 |
-
== s`` pour toute chaîne ``s``.
|
| 63 |
-
"""
|
| 64 |
-
if raw is None:
|
| 65 |
-
return ""
|
| 66 |
-
if isinstance(raw, str):
|
| 67 |
-
return raw
|
| 68 |
-
if isinstance(raw, list):
|
| 69 |
-
parts: list[str] = []
|
| 70 |
-
for chunk in raw:
|
| 71 |
-
if chunk is None:
|
| 72 |
-
continue
|
| 73 |
-
if isinstance(chunk, str):
|
| 74 |
-
parts.append(chunk)
|
| 75 |
-
continue
|
| 76 |
-
if hasattr(chunk, "text"):
|
| 77 |
-
txt = getattr(chunk, "text", None)
|
| 78 |
-
if isinstance(txt, str):
|
| 79 |
-
parts.append(txt)
|
| 80 |
-
continue
|
| 81 |
-
if isinstance(chunk, dict) and isinstance(chunk.get("text"), str):
|
| 82 |
-
parts.append(chunk["text"])
|
| 83 |
-
continue
|
| 84 |
-
# Dernier recours — convertit le chunk en chaîne
|
| 85 |
-
parts.append(str(chunk))
|
| 86 |
-
return "".join(parts)
|
| 87 |
-
if hasattr(raw, "text") and isinstance(getattr(raw, "text", None), str):
|
| 88 |
-
return raw.text # type: ignore[no-any-return]
|
| 89 |
-
return str(raw)
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
def log_http_error(
|
| 93 |
-
adapter_name: str,
|
| 94 |
-
model: str,
|
| 95 |
-
exc: Exception,
|
| 96 |
-
*,
|
| 97 |
-
env_var: Optional[str] = None,
|
| 98 |
-
) -> None:
|
| 99 |
-
"""Log standardisé des erreurs HTTP des SDK LLM.
|
| 100 |
-
|
| 101 |
-
Chantier 4 (post-Sprint 97) — propagation du log discriminant
|
| 102 |
-
Mistral/OpenAI à tous les providers. Inspecte ``status_code`` et
|
| 103 |
-
``http_status`` puis émet un warning ciblé selon le code :
|
| 104 |
-
|
| 105 |
-
- 401 : clé API invalide/expirée (mention de la variable
|
| 106 |
-
d'environnement à vérifier si fournie).
|
| 107 |
-
- 429 : rate limit / quota dépassé.
|
| 108 |
-
- 5xx : problème serveur côté provider.
|
| 109 |
-
- autre / pas de status_code : log générique.
|
| 110 |
-
|
| 111 |
-
L'exception n'est pas levée — l'appelant doit ``raise``
|
| 112 |
-
explicitement après ce log s'il veut propager (le retry est géré
|
| 113 |
-
par ``BaseLLMAdapter.complete`` selon ``_is_retryable``).
|
| 114 |
-
"""
|
| 115 |
-
status = getattr(exc, "status_code", None) or getattr(exc, "http_status", None)
|
| 116 |
-
if status == 401:
|
| 117 |
-
suffix = f" Vérifier {env_var}." if env_var else ""
|
| 118 |
-
logger.warning(
|
| 119 |
-
"[%s] erreur HTTP 401 — clé API invalide ou expirée "
|
| 120 |
-
"(modèle=%s).%s",
|
| 121 |
-
adapter_name, model, suffix,
|
| 122 |
-
)
|
| 123 |
-
elif status == 429:
|
| 124 |
-
logger.warning(
|
| 125 |
-
"[%s] erreur HTTP 429 — quota dépassé ou rate-limit "
|
| 126 |
-
"(modèle=%s). Réessayer plus tard.",
|
| 127 |
-
adapter_name, model,
|
| 128 |
-
)
|
| 129 |
-
elif status is not None and status >= 500:
|
| 130 |
-
logger.warning(
|
| 131 |
-
"[%s] erreur HTTP %d — problème serveur (modèle=%s) : %s",
|
| 132 |
-
adapter_name, status, model, exc,
|
| 133 |
-
)
|
| 134 |
-
else:
|
| 135 |
-
logger.warning(
|
| 136 |
-
"[%s] erreur lors de l'appel API (modèle=%s) : %s",
|
| 137 |
-
adapter_name, model, exc,
|
| 138 |
-
)
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
@dataclass
|
| 142 |
-
class LLMResult:
|
| 143 |
-
"""Résultat produit par un appel LLM."""
|
| 144 |
-
|
| 145 |
-
model_id: str
|
| 146 |
-
text: str
|
| 147 |
-
duration_seconds: float
|
| 148 |
-
tokens_used: Optional[int] = None
|
| 149 |
-
error: Optional[str] = None
|
| 150 |
-
|
| 151 |
-
@property
|
| 152 |
-
def success(self) -> bool:
|
| 153 |
-
return self.error is None
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
class BaseLLMAdapter(ABC):
|
| 157 |
-
"""Classe de base pour tous les adaptateurs LLM.
|
| 158 |
-
|
| 159 |
-
Chaque adaptateur doit implémenter :
|
| 160 |
-
- ``name`` : identifiant du provider (ex : 'openai')
|
| 161 |
-
- ``default_model``: modèle par défaut du provider
|
| 162 |
-
- ``_call()`` : appel API effectif, retourne le texte brut
|
| 163 |
-
|
| 164 |
-
Les clés API sont lues depuis les variables d'environnement uniquement.
|
| 165 |
-
|
| 166 |
-
Retry automatique
|
| 167 |
-
-----------------
|
| 168 |
-
Les erreurs retryables (HTTP 429, 5xx, timeout réseau) sont automatiquement
|
| 169 |
-
retentées avec backoff exponentiel (2s, 4s, 8s par défaut). Configurable
|
| 170 |
-
via ``config["max_retries"]`` et ``config["retry_backoff"]``.
|
| 171 |
-
|
| 172 |
-
Normalisation des réponses (chantier 4)
|
| 173 |
-
---------------------------------------
|
| 174 |
-
Les sous-classes utilisent :func:`normalize_llm_content` sur la
|
| 175 |
-
réponse SDK avant de la retourner — garantit qu'une réponse de
|
| 176 |
-
type ``list[ContentChunk]`` (Mistral, parfois OpenAI) est
|
| 177 |
-
convertie en ``str`` plate.
|
| 178 |
-
|
| 179 |
-
Logging d'erreurs HTTP (chantier 4)
|
| 180 |
-
-----------------------------------
|
| 181 |
-
Les sous-classes utilisent :func:`log_http_error` pour produire
|
| 182 |
-
un log discriminant par ``status_code`` (401 → clé invalide,
|
| 183 |
-
429 → rate limit, 5xx → serveur). Auparavant ce log était
|
| 184 |
-
dupliqué chez Mistral/OpenAI et absent chez Anthropic.
|
| 185 |
-
"""
|
| 186 |
-
|
| 187 |
-
# Variable d'environnement portant la clé API. Sous-classes
|
| 188 |
-
# surchargent (ex. ``"OPENAI_API_KEY"``) ; mention utilisée par
|
| 189 |
-
# :func:`log_http_error` quand un 401 est rencontré. ``None``
|
| 190 |
-
# pour les providers sans clé (Ollama).
|
| 191 |
-
api_key_env_var: Optional[str] = None
|
| 192 |
-
|
| 193 |
-
def __init__(
|
| 194 |
-
self,
|
| 195 |
-
model: Optional[str] = None,
|
| 196 |
-
config: Optional[dict] = None,
|
| 197 |
-
) -> None:
|
| 198 |
-
self.config: dict = config or {}
|
| 199 |
-
self.model: str = model or self.default_model
|
| 200 |
-
|
| 201 |
-
@property
|
| 202 |
-
@abstractmethod
|
| 203 |
-
def name(self) -> str:
|
| 204 |
-
"""Identifiant du provider (ex : 'openai', 'anthropic')."""
|
| 205 |
-
|
| 206 |
-
@property
|
| 207 |
-
@abstractmethod
|
| 208 |
-
def default_model(self) -> str:
|
| 209 |
-
"""Modèle utilisé si aucun n'est fourni explicitement."""
|
| 210 |
-
|
| 211 |
-
@abstractmethod
|
| 212 |
-
def _call(self, prompt: str, image_b64: Optional[str] = None) -> str:
|
| 213 |
-
"""Appel LLM effectif.
|
| 214 |
-
|
| 215 |
-
Parameters
|
| 216 |
-
----------
|
| 217 |
-
prompt:
|
| 218 |
-
Texte du prompt final (variables déjà substituées).
|
| 219 |
-
image_b64:
|
| 220 |
-
Image encodée en base64 (sans préfixe data URI).
|
| 221 |
-
None pour les appels texte-uniquement.
|
| 222 |
-
|
| 223 |
-
Returns
|
| 224 |
-
-------
|
| 225 |
-
str
|
| 226 |
-
Texte généré par le LLM.
|
| 227 |
-
"""
|
| 228 |
-
|
| 229 |
-
def complete(
|
| 230 |
-
self,
|
| 231 |
-
prompt: str,
|
| 232 |
-
image_b64: Optional[str] = None,
|
| 233 |
-
) -> LLMResult:
|
| 234 |
-
"""Point d'entrée public : appelle le LLM avec retry automatique."""
|
| 235 |
-
max_retries = int(self.config.get("max_retries", _DEFAULT_MAX_RETRIES))
|
| 236 |
-
backoff_base = float(self.config.get("retry_backoff", _DEFAULT_BACKOFF_BASE))
|
| 237 |
-
|
| 238 |
-
start = time.perf_counter()
|
| 239 |
-
last_exc: Optional[Exception] = None
|
| 240 |
-
|
| 241 |
-
for attempt in range(max_retries + 1):
|
| 242 |
-
try:
|
| 243 |
-
text = self._call(prompt, image_b64)
|
| 244 |
-
duration = time.perf_counter() - start
|
| 245 |
-
return LLMResult(
|
| 246 |
-
model_id=self.model,
|
| 247 |
-
text=text,
|
| 248 |
-
duration_seconds=round(duration, 4),
|
| 249 |
-
)
|
| 250 |
-
except Exception as exc: # noqa: BLE001
|
| 251 |
-
last_exc = exc
|
| 252 |
-
if attempt < max_retries and _is_retryable(exc):
|
| 253 |
-
wait = backoff_base ** (attempt + 1)
|
| 254 |
-
logger.warning(
|
| 255 |
-
"[%s] erreur retryable (tentative %d/%d, attente %.1fs) : %s",
|
| 256 |
-
self.name, attempt + 1, max_retries + 1, wait, exc,
|
| 257 |
-
)
|
| 258 |
-
time.sleep(wait)
|
| 259 |
-
else:
|
| 260 |
-
break
|
| 261 |
-
|
| 262 |
-
duration = time.perf_counter() - start
|
| 263 |
-
return LLMResult(
|
| 264 |
-
model_id=self.model,
|
| 265 |
-
text="",
|
| 266 |
-
duration_seconds=round(duration, 4),
|
| 267 |
-
error=str(last_exc),
|
| 268 |
-
)
|
| 269 |
-
|
| 270 |
-
def __repr__(self) -> str:
|
| 271 |
-
return f"{self.__class__.__name__}(model={self.model!r})"
|
| 272 |
|
|
|
|
| 273 |
|
| 274 |
-
|
| 275 |
-
"BaseLLMAdapter",
|
| 276 |
-
"LLMResult",
|
| 277 |
-
"log_http_error",
|
| 278 |
-
"normalize_llm_content",
|
| 279 |
-
]
|
|
|
|
| 1 |
+
"""Re-export (Sprint A14-S11). The canonical content lives in
|
| 2 |
+
``picarones.adapters.llm.base``.
|
| 3 |
|
| 4 |
+
The old path ``picarones.llm.base`` is kept so that no
|
| 5 |
+
consumer breaks. This re-export will be removed in S22.
|
| 6 |
+
"""
|
| 7 |
|
| 8 |
+
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.adapters.llm.base import * # noqa: F401,F403
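The retry loop in the removed `BaseLLMAdapter.complete` above sleeps `backoff_base ** (attempt + 1)` seconds between attempts. As a minimal sketch of that arithmetic only (the helper name `backoff_schedule` is hypothetical and does not exist in the codebase), the default settings `max_retries=3` and `retry_backoff=2.0` give the "2s, 4s, 8s" sequence mentioned in the class docstring:

```python
def backoff_schedule(max_retries: int = 3, backoff_base: float = 2.0) -> list[float]:
    """Return the waits (in seconds) slept before each retry.

    Mirrors the expression used in the diff: the wait before retry number
    ``attempt + 1`` is ``backoff_base ** (attempt + 1)``. No sleep follows
    the final attempt, so there are exactly ``max_retries`` waits.
    """
    return [backoff_base ** (attempt + 1) for attempt in range(max_retries)]


if __name__ == "__main__":
    waits = backoff_schedule()
    print(waits)       # [2.0, 4.0, 8.0]
    print(sum(waits))  # 14.0 seconds of sleeping in the worst case
```

With the defaults, a call that fails on every attempt with a retryable error therefore spends 14 seconds sleeping, in addition to the time taken by the four API calls themselves.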
|
@@ -1,157 +1,11 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
import os
|
| 7 |
-
from typing import Optional
|
| 8 |
-
|
| 9 |
-
from picarones.llm.base import (
|
| 10 |
-
BaseLLMAdapter,
|
| 11 |
-
log_http_error,
|
| 12 |
-
normalize_llm_content,
|
| 13 |
-
)
|
| 14 |
-
|
| 15 |
-
logger = logging.getLogger(__name__)
|
| 16 |
-
|
| 17 |
-
# Modèles Mistral qui NE supportent PAS l'API chat/completions multimodale.
|
| 18 |
-
# Ces petits modèles sont text-only; le passer avec une image provoque une erreur.
|
| 19 |
-
_TEXT_ONLY_MODELS = frozenset({
|
| 20 |
-
"ministral-3b-latest",
|
| 21 |
-
"ministral-8b-latest",
|
| 22 |
-
"mistral-tiny",
|
| 23 |
-
"mistral-tiny-latest",
|
| 24 |
-
"open-mistral-7b",
|
| 25 |
-
"open-mixtral-8x7b",
|
| 26 |
-
})
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
class MistralAdapter(BaseLLMAdapter):
|
| 30 |
-
"""Adaptateur pour les modèles Mistral AI.
|
| 31 |
-
|
| 32 |
-
Clé API via la variable d'environnement ``MISTRAL_API_KEY``.
|
| 33 |
-
|
| 34 |
-
Modes supportés : text_only (tous modèles), text_and_image et zero_shot
|
| 35 |
-
avec les modèles multimodaux (pixtral-12b, pixtral-large).
|
| 36 |
-
|
| 37 |
-
Note
|
| 38 |
-
----
|
| 39 |
-
Les modèles ``ministral-3b-latest`` et ``ministral-8b-latest`` ne supportent
|
| 40 |
-
pas le mode multimodal — utiliser ``PipelineMode.TEXT_ONLY`` avec ces modèles.
|
| 41 |
-
"""
|
| 42 |
-
|
| 43 |
-
api_key_env_var = "MISTRAL_API_KEY"
|
| 44 |
-
|
| 45 |
-
@property
|
| 46 |
-
def name(self) -> str:
|
| 47 |
-
return "mistral"
|
| 48 |
-
|
| 49 |
-
@property
|
| 50 |
-
def default_model(self) -> str:
|
| 51 |
-
return "mistral-large-latest"
|
| 52 |
|
| 53 |
-
|
| 54 |
-
self,
|
| 55 |
-
model: Optional[str] = None,
|
| 56 |
-
config: Optional[dict] = None,
|
| 57 |
-
) -> None:
|
| 58 |
-
super().__init__(model, config)
|
| 59 |
-
self._api_key = os.environ.get("MISTRAL_API_KEY")
|
| 60 |
-
if self.model in _TEXT_ONLY_MODELS:
|
| 61 |
-
logger.info(
|
| 62 |
-
"[MistralAdapter] modèle '%s' : text-only (pas de support multimodal).",
|
| 63 |
-
self.model,
|
| 64 |
-
)
|
| 65 |
-
|
| 66 |
-
def _call(self, prompt: str, image_b64: Optional[str] = None) -> str:
|
| 67 |
-
if not self._api_key:
|
| 68 |
-
raise RuntimeError(
|
| 69 |
-
"Clé API Mistral manquante — définissez la variable d'environnement MISTRAL_API_KEY"
|
| 70 |
-
)
|
| 71 |
-
try:
|
| 72 |
-
try:
|
| 73 |
-
from mistralai.client import Mistral
|
| 74 |
-
except ImportError:
|
| 75 |
-
from mistralai import Mistral # type: ignore[no-redef]
|
| 76 |
-
except ImportError as exc:
|
| 77 |
-
raise RuntimeError(
|
| 78 |
-
"Le package 'mistralai' n'est pas installé. Lancez : pip install mistralai"
|
| 79 |
-
) from exc
|
| 80 |
-
|
| 81 |
-
client = Mistral(api_key=self._api_key)
|
| 82 |
-
temperature = float(self.config.get("temperature", 0.0))
|
| 83 |
-
max_tokens = int(self.config.get("max_tokens", 4096))
|
| 84 |
-
|
| 85 |
-
# Les modèles text-only ne supportent pas les images
|
| 86 |
-
if image_b64 and self.model in _TEXT_ONLY_MODELS:
|
| 87 |
-
logger.warning(
|
| 88 |
-
"[MistralAdapter] modèle '%s' ne supporte pas les images — "
|
| 89 |
-
"image ignorée, appel en mode texte seul.",
|
| 90 |
-
self.model,
|
| 91 |
-
)
|
| 92 |
-
image_b64 = None
|
| 93 |
-
|
| 94 |
-
if image_b64:
|
| 95 |
-
content: list | str = [
|
| 96 |
-
{"type": "text", "text": prompt},
|
| 97 |
-
{
|
| 98 |
-
"type": "image_url",
|
| 99 |
-
"image_url": f"data:image/png;base64,{image_b64}",
|
| 100 |
-
},
|
| 101 |
-
]
|
| 102 |
-
else:
|
| 103 |
-
content = prompt
|
| 104 |
-
|
| 105 |
-
logger.info(
|
| 106 |
-
"[MistralAdapter] appel %s — prompt=%d chars, image=%s",
|
| 107 |
-
self.model, len(prompt), "oui" if image_b64 else "non",
|
| 108 |
-
)
|
| 109 |
-
|
| 110 |
-
try:
|
| 111 |
-
response = client.chat.complete(
|
| 112 |
-
model=self.model,
|
| 113 |
-
messages=[{"role": "user", "content": content}],
|
| 114 |
-
temperature=temperature,
|
| 115 |
-
max_tokens=max_tokens,
|
| 116 |
-
)
|
| 117 |
-
except Exception as exc:
|
| 118 |
-
log_http_error(
|
| 119 |
-
"MistralAdapter", self.model, exc,
|
| 120 |
-
env_var=self.api_key_env_var,
|
| 121 |
-
)
|
| 122 |
-
raise
|
| 123 |
-
|
| 124 |
-
if not response.choices:
|
| 125 |
-
logger.warning(
|
| 126 |
-
"[MistralAdapter] response.choices vide (modèle=%s).",
|
| 127 |
-
self.model,
|
| 128 |
-
)
|
| 129 |
-
return ""
|
| 130 |
-
|
| 131 |
-
_choice = response.choices[0]
|
| 132 |
-
raw = _choice.message.content
|
| 133 |
-
_finish_reason = _choice.finish_reason
|
| 134 |
-
|
| 135 |
-
# Chantier 4 — normalisation factorisée dans
|
| 136 |
-
# ``picarones.llm.base.normalize_llm_content`` (Sprint 15
|
| 137 |
-
# généralisé : list[ContentChunk] / list[dict] / str → str).
|
| 138 |
-
text = normalize_llm_content(raw)
|
| 139 |
-
|
| 140 |
-
_completion_tokens = None
|
| 141 |
-
if hasattr(response, "usage") and response.usage:
|
| 142 |
-
_completion_tokens = getattr(response.usage, "completion_tokens", None)
|
| 143 |
-
|
| 144 |
-
logger.info(
|
| 145 |
-
"[MistralAdapter] réponse %s — finish_reason=%s, len=%d, tokens=%s",
|
| 146 |
-
self.model, _finish_reason, len(text), _completion_tokens,
|
| 147 |
-
)
|
| 148 |
-
|
| 149 |
-
if not text.strip():
|
| 150 |
-
logger.warning(
|
| 151 |
-
"[MistralAdapter] réponse vide du modèle '%s' "
|
| 152 |
-
"(finish_reason=%s, completion_tokens=%s). "
|
| 153 |
-
"Vérifier le prompt et la compatibilité du modèle.",
|
| 154 |
-
self.model, _finish_reason, _completion_tokens,
|
| 155 |
-
)
|
| 156 |
|
| 157 |
-
|
|
|
|
|
|
| 1 |
+
"""Re-export (Sprint A14-S11). The canonical content lives in
|
| 2 |
+
``picarones.adapters.llm.mistral_adapter``.
|
| 3 |
|
| 4 |
+
Explicitly re-exports ``_TEXT_ONLY_MODELS`` (imported by the
|
| 5 |
+
Sprint 15 tests).
|
| 6 |
+
"""
|
| 7 |
|
| 8 |
+
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.adapters.llm.mistral_adapter import * # noqa: F401,F403
|
| 11 |
+
from picarones.adapters.llm.mistral_adapter import _TEXT_ONLY_MODELS # noqa: F401
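The second, explicit import in the shim above is needed because a star import skips underscore-prefixed names when the source module defines no `__all__`. The sketch below illustrates that rule with a throwaway in-memory module. `canonical_mod` and its contents are hypothetical stand-ins, not the real `picarones.adapters.llm.mistral_adapter`, and the sketch assumes the real module likewise defines no `__all__`.

```python
import sys
import types

# Hypothetical stand-in for the canonical module (not the real picarones code).
canonical = types.ModuleType("canonical_mod")
exec(
    "_TEXT_ONLY_MODELS = frozenset({'demo-text-model'})\n"
    "class MistralAdapter:\n"
    "    pass\n",
    canonical.__dict__,
)
sys.modules["canonical_mod"] = canonical

# Namespace of a shim that contains only the star import.
star_only = {}
exec("from canonical_mod import *", star_only)

# Namespace of a shim that also imports the private name explicitly,
# mirroring the two import lines of the re-export file.
star_plus_explicit = {}
exec(
    "from canonical_mod import *\n"
    "from canonical_mod import _TEXT_ONLY_MODELS\n",
    star_plus_explicit,
)

print("public via star import:", "MistralAdapter" in star_only)
print("private via star import:", "_TEXT_ONLY_MODELS" in star_only)
print("private via explicit import:", "_TEXT_ONLY_MODELS" in star_plus_explicit)
```

With the star import alone, a test doing `from picarones.llm.mistral_adapter import _TEXT_ONLY_MODELS` would presumably fail with `ImportError`, which is why the shim names the private symbol.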
|
|
@@ -1,109 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
from typing import Optional
|
| 7 |
-
from urllib.parse import urlparse
|
| 8 |
-
|
| 9 |
-
from picarones.llm.base import BaseLLMAdapter, normalize_llm_content
|
| 10 |
-
|
| 11 |
-
logger = logging.getLogger(__name__)
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
class OllamaAdapter(BaseLLMAdapter):
|
| 15 |
-
"""Adapter for local models served through Ollama.
|
| 16 |
-
|
| 17 |
-
No API key required. Needs a running Ollama server (by default
|
| 18 |
-
at http://localhost:11434).
|
| 19 |
-
|
| 20 |
-
Supported modes:
|
| 21 |
-
- text_only: all Ollama models
|
| 22 |
-
- text_and_image: multimodal models (llava, bakllava, moondream…)
|
| 23 |
-
- zero_shot: multimodal models only
|
| 24 |
|
| 25 |
-
|
| 26 |
-
- ``base_url``: URL of the Ollama server (default: http://localhost:11434)
|
| 27 |
-
"""
|
| 28 |
-
|
| 29 |
-
@property
|
| 30 |
-
def name(self) -> str:
|
| 31 |
-
return "ollama"
|
| 32 |
-
|
| 33 |
-
@property
|
| 34 |
-
def default_model(self) -> str:
|
| 35 |
-
return "llama3"
|
| 36 |
-
|
| 37 |
-
def __init__(
|
| 38 |
-
self,
|
| 39 |
-
model: Optional[str] = None,
|
| 40 |
-
config: Optional[dict] = None,
|
| 41 |
-
) -> None:
|
| 42 |
-
super().__init__(model, config)
|
| 43 |
-
base_url = self.config.get("base_url", "http://localhost:11434").rstrip("/")
|
| 44 |
-
parsed = urlparse(base_url)
|
| 45 |
-
if parsed.scheme not in ("http", "https"):
|
| 46 |
-
raise ValueError(
|
| 47 |
-
f"URL Ollama invalide (schéma '{parsed.scheme}' non autorisé, "
|
| 48 |
-
f"seuls http/https sont acceptés) : {base_url}"
|
| 49 |
-
)
|
| 50 |
-
self._base_url = base_url
|
| 51 |
-
|
| 52 |
-
def _call(self, prompt: str, image_b64: Optional[str] = None) -> str:
|
| 53 |
-
import json
|
| 54 |
-
import urllib.error
|
| 55 |
-
import urllib.request
|
| 56 |
-
|
| 57 |
-
temperature = float(self.config.get("temperature", 0.0))
|
| 58 |
-
payload: dict = {
|
| 59 |
-
"model": self.model,
|
| 60 |
-
"prompt": prompt,
|
| 61 |
-
"stream": False,
|
| 62 |
-
"options": {"temperature": temperature},
|
| 63 |
-
}
|
| 64 |
-
if image_b64:
|
| 65 |
-
payload["images"] = [image_b64]
|
| 66 |
-
|
| 67 |
-
data = json.dumps(payload).encode("utf-8")
|
| 68 |
-
req = urllib.request.Request(
|
| 69 |
-
f"{self._base_url}/api/generate",
|
| 70 |
-
data=data,
|
| 71 |
-
headers={"Content-Type": "application/json"},
|
| 72 |
-
)
|
| 73 |
-
try:
|
| 74 |
-
with urllib.request.urlopen(req, timeout=120) as resp:
|
| 75 |
-
raw = resp.read().decode("utf-8")
|
| 76 |
-
except urllib.error.HTTPError as exc:
|
| 77 |
-
logger.warning(
|
| 78 |
-
"[OllamaAdapter] erreur HTTP %d (modèle=%s) : %s",
|
| 79 |
-
exc.code, self.model, exc,
|
| 80 |
-
)
|
| 81 |
-
raise RuntimeError(
|
| 82 |
-
f"Erreur HTTP {exc.code} du serveur Ollama ({self._base_url}) : {exc}"
|
| 83 |
-
) from exc
|
| 84 |
-
except urllib.error.URLError as exc:
|
| 85 |
-
raise RuntimeError(
|
| 86 |
-
f"Impossible de joindre le serveur Ollama sur {self._base_url}. "
|
| 87 |
-
f"Vérifiez qu'Ollama est démarré (ollama serve). Erreur : {exc}"
|
| 88 |
-
) from exc
|
| 89 |
-
|
| 90 |
-
try:
|
| 91 |
-
result = json.loads(raw)
|
| 92 |
-
except json.JSONDecodeError as exc:
|
| 93 |
-
logger.warning(
|
| 94 |
-
"[OllamaAdapter] réponse JSON invalide (modèle=%s) : %s",
|
| 95 |
-
self.model, raw[:200],
|
| 96 |
-
)
|
| 97 |
-
raise RuntimeError(
|
| 98 |
-
f"Réponse JSON invalide du serveur Ollama : {exc}"
|
| 99 |
-
) from exc
|
| 100 |
|
| 101 |
-
|
| 102 |
-
# ``response`` as a string, but we normalize defensively (in case
|
| 103 |
-
# a future build returns a structured format).
|
| 104 |
-
text = normalize_llm_content(result.get("response", ""))
|
| 105 |
-
if not text:
|
| 106 |
-
logger.warning(
|
| 107 |
-
"[OllamaAdapter] réponse vide (modèle=%s).", self.model,
|
| 108 |
-
)
|
| 109 |
-
return text
|
|
|
|
| 1 |
+
"""Re-export (Sprint A14-S11). The canonical content lives in
|
| 2 |
+
``picarones.adapters.llm.ollama_adapter``.
|
| 3 |
|
| 4 |
+
The old path ``picarones.llm.ollama_adapter`` is kept so that no
|
| 5 |
+
consumer breaks. This re-export will be removed in S22.
|
| 6 |
+
"""
|
| 7 |
|
| 8 |
+
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.adapters.llm.ollama_adapter import * # noqa: F401,F403
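The removed `OllamaAdapter` code above validates the URL scheme before use and then builds a JSON body for `/api/generate`. A minimal sketch of those two steps follows, using the standard library only and sending nothing over the network. The function names `validate_base_url` and `build_generate_request` are hypothetical and do not appear in the commit; the payload fields mirror the ones visible in the removed lines.

```python
import json
from typing import Optional, Tuple
from urllib.parse import urlparse


def validate_base_url(base_url: str) -> str:
    """Strip trailing slashes and reject any scheme other than http/https."""
    cleaned = base_url.rstrip("/")
    scheme = urlparse(cleaned).scheme
    if scheme not in ("http", "https"):
        raise ValueError(
            f"Invalid Ollama URL (scheme '{scheme}' not allowed, "
            f"only http/https are accepted): {cleaned}"
        )
    return cleaned


def build_generate_request(
    base_url: str,
    model: str,
    prompt: str,
    image_b64: Optional[str] = None,
    temperature: float = 0.0,
) -> Tuple[str, bytes]:
    """Return the endpoint URL and the UTF-8 encoded JSON body."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }
    if image_b64:
        # Images are only attached when one is provided, as in the diff.
        payload["images"] = [image_b64]
    url = f"{validate_base_url(base_url)}/api/generate"
    return url, json.dumps(payload).encode("utf-8")


url, body = build_generate_request("http://localhost:11434/", "llama3", "Hello")
print(url)
print(json.loads(body)["options"])
```

Rejecting the scheme up front keeps values such as `file://` or `ftp://` from ever reaching `urllib.request.urlopen`.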
|
@@ -1,94 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
import os
|
| 7 |
-
from typing import Optional
|
| 8 |
-
|
| 9 |
-
from picarones.llm.base import (
|
| 10 |
-
BaseLLMAdapter,
|
| 11 |
-
log_http_error,
|
| 12 |
-
normalize_llm_content,
|
| 13 |
-
)
|
| 14 |
-
|
| 15 |
-
logger = logging.getLogger(__name__)
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
class OpenAIAdapter(BaseLLMAdapter):
|
| 19 |
-
"""Adapter for OpenAI models (GPT-4o, GPT-4o-mini).
|
| 20 |
-
|
| 21 |
-
API key read from the ``OPENAI_API_KEY`` environment variable.
|
| 22 |
-
|
| 23 |
-
Supported modes: text_only, text_and_image, zero_shot.
|
| 24 |
-
"""
|
| 25 |
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
@property
|
| 29 |
-
def name(self) -> str:
|
| 30 |
-
return "openai"
|
| 31 |
-
|
| 32 |
-
@property
|
| 33 |
-
def default_model(self) -> str:
|
| 34 |
-
return "gpt-4o"
|
| 35 |
-
|
| 36 |
-
def __init__(
|
| 37 |
-
self,
|
| 38 |
-
model: Optional[str] = None,
|
| 39 |
-
config: Optional[dict] = None,
|
| 40 |
-
) -> None:
|
| 41 |
-
super().__init__(model, config)
|
| 42 |
-
self._api_key = os.environ.get("OPENAI_API_KEY")
|
| 43 |
-
|
| 44 |
-
def _call(self, prompt: str, image_b64: Optional[str] = None) -> str:
|
| 45 |
-
if not self._api_key:
|
| 46 |
-
raise RuntimeError(
|
| 47 |
-
"Clé API OpenAI manquante — définissez la variable d'environnement OPENAI_API_KEY"
|
| 48 |
-
)
|
| 49 |
-
try:
|
| 50 |
-
from openai import OpenAI
|
| 51 |
-
except ImportError as exc:
|
| 52 |
-
raise RuntimeError(
|
| 53 |
-
"Le package 'openai' n'est pas installé. Lancez : pip install openai"
|
| 54 |
-
) from exc
|
| 55 |
-
|
| 56 |
-
client = OpenAI(api_key=self._api_key)
|
| 57 |
-
temperature = float(self.config.get("temperature", 0.0))
|
| 58 |
-
max_tokens = int(self.config.get("max_tokens", 4096))
|
| 59 |
-
|
| 60 |
-
if image_b64:
|
| 61 |
-
content = [
|
| 62 |
-
{"type": "text", "text": prompt},
|
| 63 |
-
{
|
| 64 |
-
"type": "image_url",
|
| 65 |
-
"image_url": {"url": f"data:image/png;base64,{image_b64}"},
|
| 66 |
-
},
|
| 67 |
-
]
|
| 68 |
-
else:
|
| 69 |
-
content = prompt # type: ignore[assignment]
|
| 70 |
-
|
| 71 |
-
try:
|
| 72 |
-
response = client.chat.completions.create(
|
| 73 |
-
model=self.model,
|
| 74 |
-
messages=[{"role": "user", "content": content}],
|
| 75 |
-
temperature=temperature,
|
| 76 |
-
max_tokens=max_tokens,
|
| 77 |
-
)
|
| 78 |
-
except Exception as exc:
|
| 79 |
-
log_http_error(
|
| 80 |
-
"OpenAIAdapter", self.model, exc,
|
| 81 |
-
env_var=self.api_key_env_var,
|
| 82 |
-
)
|
| 83 |
-
raise
|
| 84 |
|
| 85 |
-
if not response.choices:
|
| 86 |
-
logger.warning(
|
| 87 |
-
"[OpenAIAdapter] response.choices vide (modèle=%s).", self.model,
|
| 88 |
-
)
|
| 89 |
-
return ""
|
| 90 |
-
# Chantier 4: propagation of the Sprint 15 fix. The OpenAI SDK
|
| 91 |
-
# may return a ``list[ContentBlock]`` depending on the API
|
| 92 |
-
# (Responses, structured outputs). ``normalize_llm_content``
|
| 93 |
-
# handles both cases (str and list).
|
| 94 |
-
return normalize_llm_content(response.choices[0].message.content)
|
|
|
|
| 1 |
+
"""Re-export (Sprint A14-S11). The canonical content lives in
|
| 2 |
+
``picarones.adapters.llm.openai_adapter``.
|
| 3 |
|
| 4 |
+
The old path ``picarones.llm.openai_adapter`` is kept so that no
|
| 5 |
+
consumer breaks. This re-export will be removed in S22.
|
| 6 |
+
"""
|
| 7 |
|
| 8 |
+
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.adapters.llm.openai_adapter import * # noqa: F401,F403
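Both the Mistral and OpenAI adapters above pass `message.content` through `normalize_llm_content`, described in the comments as mapping `list[ContentChunk] / list[dict] / str` to `str`. The body of that helper is not part of this diff, so the following is only a guess at the shape of such a normalizer. `normalize_content_sketch` is a hypothetical name, and reading a `text` key or attribute from each chunk is an assumption.

```python
from typing import Any


def normalize_content_sketch(raw: Any) -> str:
    """Flatten str, list[dict] or list[object with .text] into one string.

    Illustrative only: the real ``normalize_llm_content`` may differ.
    """
    if raw is None:
        return ""
    if isinstance(raw, str):
        return raw
    if isinstance(raw, (list, tuple)):
        parts = []
        for chunk in raw:
            if isinstance(chunk, str):
                parts.append(chunk)
            elif isinstance(chunk, dict):
                # Assumed convention: text chunks carry a "text" key.
                text = chunk.get("text")
                if isinstance(text, str):
                    parts.append(text)
            else:
                # Assumed convention: SDK chunk objects expose ``.text``.
                text = getattr(chunk, "text", None)
                if isinstance(text, str):
                    parts.append(text)
        return "".join(parts)
    # Last resort for unexpected types.
    return str(raw)


print(normalize_content_sketch("plain string"))
print(normalize_content_sketch([{"type": "text", "text": "from "}, {"text": "dicts"}]))
```

Centralizing this in one helper lets every adapter return a plain `str` from `_call`, whatever structure the provider SDK returns.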
|
@@ -82,7 +82,11 @@ FILE_BUDGETS: dict[str, int] = {
|
|
| 82 |
"picarones/fixtures.py": 600, # actuel 510
|
| 83 |
"picarones/measurements/inter_engine.py": 575, # actuel 484
|
| 84 |
"picarones/measurements/roman_numerals.py": 575, # actuel 478
|
| 85 |
-
"picarones/extras/importers/htr_united.py": 575, # actuel 473
|
| 86 |
"picarones/cli/_workflows.py": 550, # actuel 469
|
| 87 |
"picarones/extras/importers/huggingface.py": 550, # actuel 464
|
| 88 |
"picarones/core/metric_hooks.py": 500, # actuel 423
|
|
|
|
| 82 |
"picarones/fixtures.py": 600, # actuel 510
|
| 83 |
"picarones/measurements/inter_engine.py": 575, # actuel 484
|
| 84 |
"picarones/measurements/roman_numerals.py": 575, # actuel 478
|
| 85 |
+
"picarones/extras/importers/htr_united.py": 575, # actuel 473 (re-export S11)
|
| 86 |
+
# Sprint A14-S11: moved from extras/importers/; the old
|
| 87 |
+
# location is now a re-export.
|
| 88 |
+
"picarones/adapters/corpus/htr_united.py": 575, # actuel 473
|
| 89 |
+
"picarones/adapters/corpus/huggingface.py": 550, # actuel 464
|
| 90 |
"picarones/cli/_workflows.py": 550, # actuel 469
|
| 91 |
"picarones/extras/importers/huggingface.py": 550, # actuel 464
|
| 92 |
"picarones/core/metric_hooks.py": 500, # actuel 423
|