Sprint 9: documentation, packaging, Docker and CI/CD (version 1.0.0)

Documentation
-------------
- README.md: complete bilingual README (French + English) covering the overview,
  features, supported engines, quick usage, environment variables and roadmap
- INSTALL.md: detailed installation guide for Linux (Ubuntu/Debian), macOS
  and Windows, covering Tesseract, Pero OCR, Ollama, API configuration and Docker
- CHANGELOG.md: complete history of sprints 1 to 9 with detailed deliverables
- CONTRIBUTING.md: guide to adding an OCR engine, an LLM adapter or
  an import source, plus code conventions (Google docstrings) and a PR checklist
Packaging
---------
- pyproject.toml: version 1.0.0, new extras [llm], [ocr-cloud], [all],
  project URLs (GitHub, docs, issues), updated classifiers (Production/Stable)
- picarones/__main__.py: enables python -m picarones
- picarones/__init__.py: version 1.0.0
- picarones.spec: PyInstaller configuration for a standalone executable
  (Linux, macOS, Windows), complete hiddenimports
Infrastructure
--------------
- Dockerfile: multi-stage build (builder + runtime), Python 3.11-slim, Tesseract
  preinstalled (fra, lat, eng, deu, ita, spa), non-root user,
  HEALTHCHECK, CMD ["picarones", "serve", "--host", "0.0.0.0"]
- docker-compose.yml: Picarones service + Ollama service (optional profile),
  persistent volumes (SQLite history, corpus, reports)
- Makefile: make install, make test, make demo, make serve, make build,
  make build-exe, make docker-build, make docker-run, make docker-compose-up,
  make lint, make clean
- .github/workflows/ci.yml: GitHub Actions pipeline with tests on Python 3.11/3.12
  across Linux/macOS/Windows, an end-to-end demo job, a distribution build job
  and a lint job (ruff, optional)
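
The end-to-end demo job can be pictured as a small validation routine. The sketch below mirrors the checks performed by the workflow's inline script (the `Picarones` and `CER` markers and the 50,000-character minimum come from ci.yml). The function name `validate_report` and the padded sample string are illustrative assumptions, not part of the project.

```python
def validate_report(content: str, min_size: int = 50_000) -> list[str]:
    """Return the problems found in a generated HTML report (empty list if OK)."""
    problems = []
    # The CI script asserts that both markers appear somewhere in the report.
    for marker in ("Picarones", "CER"):
        if marker not in content:
            problems.append(f"{marker} not found in report")
    # The CI script also rejects reports that are suspiciously small.
    if len(content) <= min_size:
        problems.append(f"report too small: {len(content)} characters")
    return problems


# Illustrative input only: a padded fake report, not real Picarones output.
fake_report = "<html><title>Picarones</title><p>CER</p>" + " " * 50_000 + "</html>"
problems = validate_report(fake_report)
```

Returning a list of problems rather than asserting lets a caller report every failed check at once; the workflow itself simply asserts and stops at the first failure.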
Sprint 9 tests (58 tests, 801 total)
------------------------------------
- tests/test_sprint9_packaging.py
  - TestVersion (4): version 1.0.0 is consistent across all files
  - TestMainModule (3): python -m picarones
  - TestMakefile (5): install/test/demo/docker-build/help targets
  - TestDockerfile (6): structure, Tesseract, CMD serve
  - TestDockerCompose (5): services, ports, volumes
  - TestCIWorkflow (6): Python 3.11/3.12, Linux/macOS/Windows, pytest, demo
  - TestPyInstallerSpec (4): Analysis, EXE, hiddenimports
  - TestCLIDemoEndToEnd (6): HTML generated, size, --with-history/robustness flags
  - TestReadme (5): FR+EN, installation, CLI, engines
  - TestInstallMd (4): Linux, macOS, Windows, Docker
  - TestChangelog (5): sprints 1/8/9, versions, dates
  - TestContributing (4): engines, tests, PR, style
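
As an illustration of what a version-consistency check such as TestVersion might do, here is a minimal sketch. It is not the project's actual test: the helper `extract_version`, the regular expression and the inline file contents are all assumptions made for the example.

```python
import re

# Matches lines such as `version = "1.0.0"` or `__version__ = "1.0.0"`.
VERSION_RE = re.compile(
    r'^\s*(?:__version__|version)\s*=\s*["\']([^"\']+)["\']', re.MULTILINE
)


def extract_version(text: str):
    """Return the first version string assigned in `text`, or None."""
    match = VERSION_RE.search(text)
    return match.group(1) if match else None


# Inline stand-ins for pyproject.toml and picarones/__init__.py (assumed layout).
pyproject_text = '[project]\nname = "picarones"\nversion = "1.0.0"\n'
init_text = '"""Picarones."""\n__version__ = "1.0.0"\n'

# A consistent release yields exactly one distinct version string.
versions = {extract_version(pyproject_text), extract_version(init_text)}
```

A real test would read the two files from disk and assert that `versions` contains a single entry.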
https://claude.ai/code/session_017gXea9mxBQqDTAsSQd7aAq
- .github/workflows/ci.yml +222 -0
- CHANGELOG.md +256 -0
- CONTRIBUTING.md +512 -0
- Dockerfile +100 -0
- INSTALL.md +501 -0
- Makefile +221 -0
- README.md +239 -73
- docker-compose.yml +111 -0
- picarones.spec +155 -0
- picarones/__init__.py +1 -1
- picarones/__main__.py +13 -0
- pyproject.toml +36 -5
- rapport_demo.html +2 -2
- tests/test_sprint9_packaging.py +415 -0
@@ -0,0 +1,222 @@
+# .github/workflows/ci.yml: Picarones CI/CD
+#
+# GitHub Actions pipeline:
+#   - Tests on Python 3.11 and 3.12
+#   - Linux, macOS, Windows
+#   - Coverage report (Codecov)
+#   - Python distribution build
+#   - Demo executable check
+
+name: CI
+
+on:
+  push:
+    branches: [main, develop, "feature/**", "sprint/**"]
+  pull_request:
+    branches: [main, develop]
+  workflow_dispatch:  # Manual trigger
+
+permissions:
+  contents: read
+
+# ──────────────────────────────────────────────────────────────────
+# Job 1: Unit and integration tests
+# ──────────────────────────────────────────────────────────────────
+jobs:
+  tests:
+    name: Tests Python ${{ matrix.python-version }} / ${{ matrix.os }}
+    runs-on: ${{ matrix.os }}
+
+    strategy:
+      fail-fast: false
+      matrix:
+        os: [ubuntu-latest, macos-latest, windows-latest]
+        python-version: ["3.11", "3.12"]
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+          cache: pip
+
+      # ── Tesseract ──────────────────────────────────────────────
+      - name: Install Tesseract (Ubuntu)
+        if: runner.os == 'Linux'
+        run: |
+          sudo apt-get update -qq
+          sudo apt-get install -y tesseract-ocr tesseract-ocr-fra tesseract-ocr-lat
+
+      - name: Install Tesseract (macOS)
+        if: runner.os == 'macOS'
+        run: |
+          brew install tesseract tesseract-lang
+        env:
+          HOMEBREW_NO_AUTO_UPDATE: "1"
+
+      - name: Install Tesseract (Windows)
+        if: runner.os == 'Windows'
+        run: |
+          choco install tesseract --version=5.3.3 -y
+          echo "C:\Program Files\Tesseract-OCR" >> $env:GITHUB_PATH
+        shell: pwsh
+
+      # ── Python dependencies ─────────────────────────────────────
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+
+      # ── Tests ───────────────────────────────────────────────────
+      - name: Run tests
+        run: |
+          pytest tests/ -q --tb=short --no-header \
+            --cov=picarones --cov-report=xml --cov-report=term-missing
+        env:
+          PYTHONIOENCODING: utf-8
+          PYTHONUTF8: "1"
+
+      # ── Coverage ────────────────────────────────────────────────
+      - name: Upload coverage to Codecov
+        if: runner.os == 'Linux' && matrix.python-version == '3.11'
+        uses: codecov/codecov-action@v4
+        with:
+          files: coverage.xml
+          flags: unittests
+          name: picarones-coverage
+          fail_ci_if_error: false
+
+  # ──────────────────────────────────────────────────────────────────
+  # Job 2: Demo report check
+  # ──────────────────────────────────────────────────────────────────
+  demo:
+    name: Demo end-to-end
+    runs-on: ubuntu-latest
+    needs: tests
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+          cache: pip
+
+      - name: Install Tesseract
+        run: |
+          sudo apt-get update -qq
+          sudo apt-get install -y tesseract-ocr tesseract-ocr-fra
+
+      - name: Install Picarones
+        run: pip install -e .
+
+      - name: Run demo
+        run: |
+          picarones demo --docs 12 --output rapport_demo_ci.html \
+            --with-history --with-robustness
+          ls -lh rapport_demo_ci.html
+          # Check that the file is valid and contains the expected sections
+          python -c "
+          content = open('rapport_demo_ci.html').read()
+          assert 'Picarones' in content, 'Picarones not found in the report'
+          assert 'CER' in content, 'CER not found in the report'
+          assert len(content) > 50000, f'Report too small: {len(content)} bytes'
+          print(f'Report OK: {len(content):,} bytes')
+          "
+
+      - name: Upload demo report as artifact
+        uses: actions/upload-artifact@v4
+        with:
+          name: rapport-demo
+          path: rapport_demo_ci.html
+          retention-days: 7
+
+  # ──────────────────────────────────────────────────────────────────
+  # Job 3: Python distribution build
+  # ──────────────────────────────────────────────────────────────────
+  build:
+    name: Build distribution
+    runs-on: ubuntu-latest
+    needs: tests
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+          cache: pip
+
+      - name: Install build tools
+        run: pip install --upgrade build twine
+
+      - name: Build wheel and sdist
+        run: python -m build
+
+      - name: Check distribution
+        run: twine check dist/*
+
+      - name: Upload distribution as artifact
+        uses: actions/upload-artifact@v4
+        with:
+          name: dist-packages
+          path: dist/
+          retention-days: 30
+
+  # ──────────────────────────────────────────────────────────────────
+  # Job 4: Code quality check (optional)
+  # ──────────────────────────────────────────────────────────────────
+  lint:
+    name: Code quality
+    runs-on: ubuntu-latest
+    continue-on-error: true  # Does not block CI if lint fails
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+          cache: pip
+
+      - name: Install ruff
+        run: pip install ruff
+
+      - name: Run ruff
+        run: |
+          ruff check picarones/ --select=E,W,F --ignore=E501,W503 || true
+          ruff check tests/ --select=E,W,F --ignore=E501,W503 || true
+
+  # ──────────────────────────────────────────────────────────────────
+  # Job 5: CI/CD, CER regression detection (optional)
+  # Commented out by default; enable it if you have a reference corpus
+  # ──────────────────────────────────────────────────────────────────
+  # regression-check:
+  #   name: Regression check
+  #   runs-on: ubuntu-latest
+  #   needs: tests
+  #   if: github.event_name == 'pull_request'
+  #
+  #   steps:
+  #     - name: Checkout
+  #       uses: actions/checkout@v4
+  #
+  #     - name: Install
+  #       run: pip install -e .
+  #
+  #     - name: Run benchmark on reference corpus
+  #       run: |
+  #         picarones run \
+  #           --corpus ./tests/fixtures/reference_corpus/ \
+  #           --engines tesseract \
+  #           --output results_pr.json \
+  #           --fail-if-cer-above 15.0
@@ -0,0 +1,256 @@
+# Changelog - Picarones
+
+All notable changes to this project are documented in this file.
+
+The format follows [Keep a Changelog](https://keepachangelog.com/fr/1.0.0/).
+Version numbering follows [Semantic Versioning](https://semver.org/lang/fr/).
+
+---
+
+## [1.0.0] - Sprint 9 - 2025-03
+
+### Added
+- `README.md`: complete bilingual README (French + English) with CI badges, feature descriptions, engine table, environment variables
+- `INSTALL.md`: detailed installation guide for Linux (Ubuntu/Debian), macOS and Windows, including Tesseract, Pero OCR, Ollama, API key configuration, Docker
+- `CHANGELOG.md`: history of sprints 1 to 9
+- `CONTRIBUTING.md`: contribution guide covering how to add an OCR engine, add an LLM adapter and submit a PR
+- `Makefile`: commands `make install`, `make test`, `make demo`, `make serve`, `make build`, `make build-exe`, `make docker-build`, `make lint`, `make clean`
+- `Dockerfile`: multi-stage Docker image based on Python 3.11-slim, Tesseract preinstalled, `CMD ["picarones", "serve", "--host", "0.0.0.0"]`
+- `docker-compose.yml`: Picarones service + optional Ollama service (`ollama` profile)
+- `.github/workflows/ci.yml`: GitHub Actions pipeline with tests on Python 3.11/3.12, Linux/macOS/Windows, coverage report
+- `picarones.spec`: PyInstaller configuration for building standalone executables (Linux, macOS, Windows)
+- `picarones/__main__.py`: enables running via `python -m picarones`
+- Version bumped to `1.0.0` in `pyproject.toml` and `__init__.py`
+- PyPI extras `[llm]`, `[ocr-cloud]`, `[all]` in `pyproject.toml`
+- Sprint 9 tests: `tests/test_sprint9_packaging.py` (30 tests)
+
+### Changed
+- `pyproject.toml`: version 1.0.0, new extras, updated classifiers, project URLs added
+
+---
+
+## [0.8.0] - Sprint 8 - 2025-03
+
+### Added
+- **eScriptorium** (`picarones/importers/escriptorium.py`)
+  - `EScriptoriumClient`: API token authentication, listing of projects/documents/pages, pagination handling
+  - `import_document()`: imports a document with its transcriptions as a Picarones corpus
+  - `export_benchmark_as_layer()`: exports benchmark results as a named OCR layer in eScriptorium
+  - `connect_escriptorium()`: connection with automatic validation
+- **Gallica API** (`picarones/importers/gallica.py`)
+  - `GallicaClient`: BnF SRU search by shelfmark/title/author/date/language/type
+  - Retrieval of Gallica plain-text OCR (`f{n}.texteBrut`)
+  - Gallica IIIF import enriched with OCR as reference ground truth
+  - OAI-PMH metadata (`/services/OAIRecord`)
+  - `search_gallica()`, `import_gallica_document()`: convenience functions
+- **Longitudinal tracking** (`picarones/core/history.py`)
+  - `BenchmarkHistory`: SQLite database timestamped per run, engine, corpus, CER/WER
+  - `record()` from a `BenchmarkResult`, `record_single()` for manual imports
+  - `query()` with engine/corpus/since/limit filters
+  - `get_cer_curve()`: data ready for Chart.js
+  - `detect_regression()` / `detect_all_regressions()`: configurable threshold in CER points
+  - `export_json()`: full export of the history
+  - `generate_demo_history()`: 8 mock runs with a simulated regression at run 5
+- **Robustness analysis** (`picarones/core/robustness.py`)
+  - 5 degradation types: Gaussian noise, blur, rotation, resolution reduction, binarization
+  - `degrade_image_bytes()`: Pillow (preferred) or pure-Python fallback
+  - `RobustnessAnalyzer.analyze()`: CER per level, automatic critical threshold
+  - `DegradationCurve`, `RobustnessReport`, `_build_summary()`
+  - `generate_demo_robustness_report()`: realistic mock report without a real engine
+- **Sprint 8 CLI**
+  - `picarones history`: history with filters, regression detection, JSON export, `--demo` mode
+  - `picarones robustness`: robustness analysis, ASCII bars, JSON export, `--demo` mode
+  - `picarones demo --with-history --with-robustness`: integrated demo
+- `picarones/importers/__init__.py` updated to export the new importers
+
+### Tests
+- `tests/test_sprint8_escriptorium_gallica.py`: 74 tests (eScriptorium, Gallica, CLI)
+- `tests/test_sprint8_longitudinal_robustness.py`: 86 tests (history, robustness, CLI)
+- **Total**: 743 tests (previously 583)
+
+---
+
+## [0.7.0] - Sprint 7 - 2025-02
+
+### Added
+- **HTML report v2**
+  - 95% bootstrap confidence intervals (`bootstrap_ci()`)
+  - Wilcoxon tests and pairwise test matrices (`wilcoxon_test()`, `pairwise_stats()`)
+  - Reliability curves (cumulative CER by quality percentile)
+  - Venn diagrams of shared/exclusive errors between competitors (2 and 3 sets)
+  - Error-pattern clustering (simplified k-means on error n-grams)
+  - Correlation matrix between metrics (Pearson)
+  - Intrinsic difficulty score per document (`compute_difficulty()`, `compute_all_difficulties()`)
+  - Interactive scatter plots of image quality vs CER, colored by script type
+  - Improved Unicode confusion heatmaps
+- `picarones/core/statistics.py`: dedicated module for statistical tests
+- `picarones/core/difficulty.py`: intrinsic difficulty score
+
+### Tests
+- `tests/test_sprint7_advanced_report.py`: 100 tests (bootstrap, Wilcoxon, Venn, clustering, difficulty)
+- **Total**: 583 tests (previously 483)
+
+---
+
+## [0.6.0] - Sprint 6 - 2025-02
+
+### Added
+- **FastAPI web interface** (`picarones/web/app.py`)
+  - REST endpoints to launch benchmarks, view results and list engines
+  - Real-time log streaming (Server-Sent Events)
+  - `picarones serve`: starts the uvicorn server
+- **HuggingFace Datasets import** (`picarones/importers/huggingface.py`)
+  - Search, filtering and partial import of OCR/HTR datasets
+  - Pre-referenced heritage datasets: IAM, RIMES, READ-BAD, Esposalles…
+  - Local cache with version management
+- **HTR-United import** (`picarones/importers/htr_united.py`)
+  - Listing and import from the HTR-United catalogue
+  - Metadata reading: language, script, institution, period
+- **Ollama adapters** (`picarones/llm/ollama_adapter.py`)
+  - Support for Llama 3, Gemma, Phi and any local Ollama model
+  - Text-only mode (non-multimodal LLMs)
+- **Preconfigured normalization profiles**
+  - Medieval French, Modern French, Medieval Latin, Early printed books
+  - Custom profile that can be exported and imported
+
+### Tests
+- `tests/test_sprint6_web_interface.py`: 90 tests
+- **Total**: 483 tests (previously 393)
+
+---
+
+## [0.5.0] - Sprint 5 - 2025-02
+
+### Added
+- **Unicode confusion matrix** (`picarones/core/confusion.py`)
+  - `build_confusion_matrix()`, `aggregate_confusion_matrices()`
+  - Compact display sorted by error frequency
+- **Ligature and diacritic scores** (`picarones/core/char_scores.py`)
+  - `compute_ligature_score()`: fi, fl, ff, ffi, ffl, st, ct, œ, æ, ꝑ, ꝓ…
+  - `compute_diacritic_score()`: accents, cedillas, diaereses, combining diacritics
+- **10-class error taxonomy** (`picarones/core/taxonomy.py`)
+  - Visual confusion, diacritic error, case, ligature, abbreviation, hapax, segmentation, out-of-vocabulary, lacuna, LLM over-normalization
+- **Structural analysis** (`picarones/core/structure.py`)
+  - Reading-order score, line segmentation rate, preservation of paragraph breaks
+- **Image quality metrics** (`picarones/core/image_quality.py`)
+  - Sharpness (Laplacian), noise level, contrast (Michelson), residual rotation detection
+  - Image ↔ CER correlations
+- Integration of all these metrics into the HTML report (Analysis view, Characters view)
+- Scatter plots of image quality vs CER
+
+### Tests
+- `tests/test_sprint5_advanced_metrics.py`: 100 tests
+- **Total**: 393 tests (previously 293)
+
+---
+
+## [0.4.0] - Sprint 4 - 2025-01
+
+### Added
+- **Cloud OCR API adapters**
+  - Mistral OCR (`picarones/engines/mistral_ocr.py`): Mistral OCR 3, multimodal
+  - Google Vision (`picarones/engines/google_vision.py`): Document AI
+  - Azure Document Intelligence (`picarones/engines/azure_doc_intel.py`)
+- **IIIF v2/v3 import** (`picarones/importers/iiif.py`)
+  - Page selector (`"1-10"`, `"1,3,5"`, `"all"`)
+  - Image download and extraction of transcription annotations when available
+  - Compatibility: Gallica, Bodleian, British Library, BSB, e-codices, Europeana
+  - `picarones import iiif <url>`: CLI command
+- **Unicode normalization** (`picarones/core/normalization.py`)
+  - NFC, caseless, diplomatic (tables ſ=s, u=v, i=j, æ=ae, œ=oe…)
+  - Profiles configurable via YAML
+  - Diplomatic CER in the metrics
+
+### Tests
+- `tests/test_sprint4_normalization_iiif.py`: 100 tests
+- **Total**: 293 tests (previously 193)
+
+---
+
+## [0.3.0] - Sprint 3 - 2025-01
+
+### Added
+- **OCR+LLM pipelines** (`picarones/pipelines/base.py`)
+  - Mode 1: plain-text post-correction (the LLM receives the OCR output)
+  - Mode 2: post-correction with image (the LLM receives image + OCR)
+  - Mode 3: zero-shot LLM (the LLM receives only the image)
+  - Composable multi-step chains
+- **LLM adapters**
+  - OpenAI (`picarones/llm/openai_adapter.py`): GPT-4o, GPT-4o mini
+  - Anthropic (`picarones/llm/anthropic_adapter.py`): Claude Sonnet, Haiku
+  - Mistral (`picarones/llm/mistral_adapter.py`): Mistral Large, Pixtral
+- **LLM over-normalization detection** (`picarones/pipelines/over_normalization.py`)
+  - Measures the modification rate on passages that were already correct
+  - Class 10 in the error taxonomy
+- **Prompt library**
+  - Prompts for medieval manuscripts, early printed books, Latin
+  - Prompt versioning in the report metadata
+- Dedicated OCR+LLM view in the report: three-way diff GT / raw OCR / after correction
+
+### Tests
+- `tests/test_sprint3_llm_pipelines.py`: 100 tests
+- **Total**: 193 tests (previously 93)
+
+---
+
+## [0.2.0] - Sprint 2 - 2025-01
+
+### Added
+- **Interactive HTML report** (`picarones/report/generator.py`)
+  - Self-contained HTML file, readable offline
+  - Ranking table of competitors (CER, WER, scores), sortable by column
+  - Radar chart (spider chart): CER / WER / diacritic accuracy / ligatures
+  - Gallery view: all images with colored CER badges (green→red), filters
+  - Document view: zoomable image + GitHub-style colored diff, N-way synchronized scrolling
+  - Analysis view: CER distribution histograms, scatter plots
+  - Automatic engine recommendation
+  - CSV, JSON and ALTO XML exports from the report
+- **Colored diff** (`picarones/report/diff_utils.py`)
+  - Character-level and word-level diff
+  - Insertions (green), deletions (red), substitutions (orange)
+  - Diplomatic / normalized toggle
+- `picarones demo`: demo report with realistic mock data
+- `picarones report --results results.json`: generates the HTML from an existing JSON file
+- `picarones/fixtures.py`: mock benchmark generator (12 medieval texts, 4 competitors)
+
+### Tests
+- `tests/test_report.py`, `tests/test_diff_utils.py`: 93 tests
+- **Total**: 93 tests (previously 20)
+
+---
+
+## [0.1.0] - Sprint 1 - 2025-01
+
+### Added
+- **Complete Python project structure** with `pyproject.toml`, `setup`, packaging
+- **Tesseract 5 adapter** (`picarones/engines/tesseract.py`) via `pytesseract`
+  - lang, PSM and DPI configuration
+  - Version retrieval
+- **Pero OCR adapter** (`picarones/engines/pero_ocr.py`)
+  - Model loading, image processing
+- **Abstract interface** `BaseOCREngine` with `process_image()`, `get_version()`, properties
+- **CER and WER computation** (`picarones/core/metrics.py`) via `jiwer`
+  - Raw CER, NFC, caseless
+  - WER, normalized WER, MER, WIL
+  - Reference and hypothesis lengths
+- **Corpus loading** (`picarones/core/corpus.py`)
+  - Local folder: image / `.gt.txt` pairs
+  - Automatic detection of image extensions (jpg, png, tif, bmp…)
+  - `Corpus` and `Document` classes
+- **JSON export** (`picarones/core/results.py`)
+  - `BenchmarkResult`, `EngineReport`, `DocumentResult`
+  - `ranking()`: ranking by mean CER
+  - `to_json()` with timestamp and metadata
+- **Benchmark orchestrator** (`picarones/core/runner.py`)
+  - Sequential processing of documents per engine
+  - `tqdm` progress bar
+  - Output cache keyed by SHA-256 hash
+- **Click CLI** (`picarones/cli.py`)
+  - `picarones run`: full benchmark
+  - `picarones metrics`: CER/WER between two files
+  - `picarones engines`: list of engines with status
+  - `picarones info`: version and dependencies
+  - `--fail-if-cer-above` for CI/CD integration
+
+### Tests
+- `tests/test_metrics.py`, `test_corpus.py`, `test_engines.py`, `test_results.py`: 20 tests
@@ -0,0 +1,512 @@
| 1 |
+
# Contribution guide: Picarones
|
| 2 |
+
|
| 3 |
+
Thank you for your interest in Picarones! This guide explains how to contribute to the project.
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## Sommaire
|
| 8 |
+
|
| 9 |
+
1. [Démarrage rapide](#1-démarrage-rapide)
|
| 10 |
+
2. [Ajouter un moteur OCR](#2-ajouter-un-moteur-ocr)
|
| 11 |
+
3. [Ajouter un adaptateur LLM](#3-ajouter-un-adaptateur-llm)
|
| 12 |
+
4. [Ajouter une source d'import](#4-ajouter-une-source-dimport)
|
| 13 |
+
5. [Écrire des tests](#5-écrire-des-tests)
|
| 14 |
+
6. [Soumettre une Pull Request](#6-soumettre-une-pull-request)
|
| 15 |
+
7. [Conventions de code](#7-conventions-de-code)
|
| 16 |
+
|
| 17 |
+
---
|
| 18 |
+
|
| 19 |
+
## 1. Démarrage rapide
|
| 20 |
+
|
| 21 |
+
```bash
|
| 22 |
+
# Forker le dépôt sur GitHub, puis :
|
| 23 |
+
git clone https://github.com/VOTRE_USERNAME/picarones.git
|
| 24 |
+
cd picarones
|
| 25 |
+
|
| 26 |
+
# Environnement de développement
|
| 27 |
+
python3.11 -m venv .venv
|
| 28 |
+
source .venv/bin/activate
|
| 29 |
+
pip install -e ".[dev,web]"
|
| 30 |
+
|
| 31 |
+
# Vérifier que tout passe
|
| 32 |
+
make test
|
| 33 |
+
# ou : pytest
|
| 34 |
+
|
| 35 |
+
# Créer une branche de travail
|
| 36 |
+
git checkout -b feat/mon-nouveau-moteur
|
| 37 |
+
```
|
| 38 |
+
|
| 39 |
+
---
|
| 40 |
+
|
| 41 |
+
## 2. Ajouter un moteur OCR
|
| 42 |
+
|
| 43 |
+
Adding a new OCR engine requires creating **a single Python file** and editing
|
| 44 |
+
two existing files (`picarones/cli.py` and, optionally, `pyproject.toml`). No refactoring of the rest of the code is needed.
|
| 45 |
+
|
| 46 |
+
### 2.1 Créer l'adaptateur
|
| 47 |
+
|
| 48 |
+
Create `picarones/engines/mon_moteur.py` with a class that inherits from `BaseOCREngine`:
|
| 49 |
+
|
| 50 |
+
```python
|
| 51 |
+
"""Adaptateur pour Mon Moteur OCR.
|
| 52 |
+
|
| 53 |
+
Installation :
|
| 54 |
+
pip install mon-moteur
|
| 55 |
+
|
| 56 |
+
Configuration :
|
| 57 |
+
config:
|
| 58 |
+
model: mon_modele_v2
|
| 59 |
+
lang: fra
|
| 60 |
+
"""
|
| 61 |
+
|
| 62 |
+
from __future__ import annotations
|
| 63 |
+
|
| 64 |
+
import logging
|
| 65 |
+
from pathlib import Path
|
| 66 |
+
from typing import Optional
|
| 67 |
+
|
| 68 |
+
from picarones.engines.base import BaseOCREngine
|
| 69 |
+
|
| 70 |
+
logger = logging.getLogger(__name__)
|
| 71 |
+
|
| 72 |
+
|
| 73 |
+
class MonMoteurEngine(BaseOCREngine):
|
| 74 |
+
"""Adaptateur pour Mon Moteur OCR.
|
| 75 |
+
|
| 76 |
+
Args:
|
| 77 |
+
config: Dictionnaire de configuration.
|
| 78 |
+
- ``model`` (str): Identifiant du modèle. Défaut: ``"default"``.
|
| 79 |
+
- ``lang`` (str): Code langue. Défaut: ``"fra"``.
|
| 80 |
+
"""
|
| 81 |
+
|
| 82 |
+
name = "mon_moteur"
|
| 83 |
+
|
| 84 |
+
def __init__(self, config: Optional[dict] = None) -> None:
|
| 85 |
+
super().__init__(config or {})
|
| 86 |
+
self.model = self.config.get("model", "default")
|
| 87 |
+
self.lang = self.config.get("lang", "fra")
|
| 88 |
+
|
| 89 |
+
def get_version(self) -> str:
|
| 90 |
+
"""Retourne la version du moteur."""
|
| 91 |
+
try:
|
| 92 |
+
import mon_moteur
|
| 93 |
+
return getattr(mon_moteur, "__version__", "inconnu")
|
| 94 |
+
except ImportError:
|
| 95 |
+
return "non installé"
|
| 96 |
+
|
| 97 |
+
def process_image(self, image_path: str) -> str:
|
| 98 |
+
"""Transcrit une image et retourne le texte.
|
| 99 |
+
|
| 100 |
+
Args:
|
| 101 |
+
image_path: Chemin absolu vers l'image (JPEG, PNG, TIFF…).
|
| 102 |
+
|
| 103 |
+
Returns:
|
| 104 |
+
Texte transcrit par le moteur.
|
| 105 |
+
|
| 106 |
+
Raises:
|
| 107 |
+
RuntimeError: Si le moteur n'est pas installé ou si la transcription échoue.
|
| 108 |
+
"""
|
| 109 |
+
try:
|
| 110 |
+
import mon_moteur
|
| 111 |
+
except ImportError as exc:
|
| 112 |
+
raise RuntimeError(
|
| 113 |
+
"mon-moteur n'est pas installé. Installez-le avec : pip install mon-moteur"
|
| 114 |
+
) from exc
|
| 115 |
+
|
| 116 |
+
try:
|
| 117 |
+
result = mon_moteur.transcribe(
|
| 118 |
+
image_path,
|
| 119 |
+
model=self.model,
|
| 120 |
+
lang=self.lang,
|
| 121 |
+
)
|
| 122 |
+
return result.text.strip()
|
| 123 |
+
except Exception as exc:
|
| 124 |
+
raise RuntimeError(f"Erreur de transcription : {exc}") from exc
|
| 125 |
+
```
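To try the adapter pattern without installing any OCR library, here is a minimal sketch. It assumes a simplified stand-in for `BaseOCREngine` (the real class in `picarones.engines.base` may define more than this) and a hypothetical `EchoEngine` that returns the image file stem instead of running OCR.

```python
from __future__ import annotations

from pathlib import Path
from typing import Optional


class BaseOCREngine:
    """Stand-in for picarones.engines.base.BaseOCREngine (assumed shape)."""

    name = "base"

    def __init__(self, config: Optional[dict] = None) -> None:
        self.config = config or {}

    def process_image(self, image_path: str) -> str:
        raise NotImplementedError


class EchoEngine(BaseOCREngine):
    """Hypothetical engine: returns the file stem instead of a transcription."""

    name = "echo"

    def __init__(self, config: Optional[dict] = None) -> None:
        super().__init__(config)
        # Same config-with-default pattern as in the adapter above.
        self.lang = self.config.get("lang", "fra")

    def process_image(self, image_path: str) -> str:
        return Path(image_path).stem


engine = EchoEngine(config={"lang": "lat"})
print(engine.name, engine.lang, engine.process_image("/tmp/folio_012.jpg"))
```

Only `name`, the config defaults and `process_image()` change from one engine to another, which is why a new engine fits in one file.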
|
| 126 |
+
|
| 127 |
+
### 2.2 Enregistrer le moteur dans le CLI
|
| 128 |
+
|
| 129 |
+
In `picarones/cli.py`, edit the `_engine_from_name()` function:
|
| 130 |
+
|
| 131 |
+
```python
|
| 132 |
+
def _engine_from_name(engine_name: str, lang: str, psm: int) -> "BaseOCREngine":
|
| 133 |
+
from picarones.engines.tesseract import TesseractEngine
|
| 134 |
+
if engine_name in {"tesseract", "tess"}:
|
| 135 |
+
return TesseractEngine(config={"lang": lang, "psm": psm})
|
| 136 |
+
|
| 137 |
+
# ↓ Ajouter ici
|
| 138 |
+
try:
|
| 139 |
+
from picarones.engines.mon_moteur import MonMoteurEngine
|
| 140 |
+
if engine_name in {"mon_moteur", "monmoteur"}:
|
| 141 |
+
return MonMoteurEngine(config={"lang": lang})
|
| 142 |
+
except ImportError:
|
| 143 |
+
pass
|
| 144 |
+
# ↑
|
| 145 |
+
|
| 146 |
+
raise click.BadParameter(...)
|
| 147 |
+
```
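The alias lookup above can also be expressed as a table, which keeps the dispatch readable as engines accumulate. This is only a sketch of that idea, not how `picarones/cli.py` is written: the factory functions and the `ENGINE_ALIASES` name are hypothetical, and the factories return plain dictionaries in place of real engine objects.

```python
from __future__ import annotations

from typing import Callable


# Hypothetical factories; real code would build engine instances instead.
def make_tesseract(lang: str) -> dict:
    return {"engine": "tesseract", "lang": lang}


def make_mon_moteur(lang: str) -> dict:
    return {"engine": "mon_moteur", "lang": lang}


# Every accepted spelling maps to the factory that builds the engine.
ENGINE_ALIASES: dict[str, Callable[[str], dict]] = {
    "tesseract": make_tesseract,
    "tess": make_tesseract,
    "mon_moteur": make_mon_moteur,
    "monmoteur": make_mon_moteur,
}


def engine_from_name(engine_name: str, lang: str) -> dict:
    """Resolve an engine name or alias, failing with the list of known names."""
    try:
        factory = ENGINE_ALIASES[engine_name.lower()]
    except KeyError:
        known = ", ".join(sorted(ENGINE_ALIASES))
        raise ValueError(
            f"Unknown engine {engine_name!r}; expected one of: {known}"
        ) from None
    return factory(lang)


print(engine_from_name("tess", "fra"))
```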
|
| 148 |
+
|
| 149 |
+
### 2.3 Ajouter dans la liste `picarones engines`
|
| 150 |
+
|
| 151 |
+
In `picarones/cli.py`, in the `engines_cmd()` function:
|
| 152 |
+
|
| 153 |
+
```python
|
| 154 |
+
engines = [
|
| 155 |
+
("tesseract", "Tesseract 5 (pytesseract)", "pytesseract"),
|
| 156 |
+
("pero_ocr", "Pero OCR", "pero_ocr"),
|
| 157 |
+
("mon_moteur", "Mon Moteur OCR", "mon_moteur"), # ← Ajouter
|
| 158 |
+
]
|
| 159 |
+
```
|
| 160 |
+
|
| 161 |
+
### 2.4 Ajouter l'extra dans `pyproject.toml` (optionnel)
|
| 162 |
+
|
| 163 |
+
```toml
|
| 164 |
+
[project.optional-dependencies]
|
| 165 |
+
mon-moteur = ["mon-moteur>=1.0.0"]
|
| 166 |
+
```
|
| 167 |
+
|
| 168 |
+
### 2.5 Écrire les tests
|
| 169 |
+
|
| 170 |
+
Create `tests/test_mon_moteur.py`:
|
| 171 |
+
|
| 172 |
+
```python
|
| 173 |
+
"""Tests pour l'adaptateur Mon Moteur OCR."""
|
| 174 |
+
|
| 175 |
+
import pytest
|
| 176 |
+
from unittest.mock import patch
|
| 177 |
+
|
| 178 |
+
|
| 179 |
+
class TestMonMoteurEngine:
|
| 180 |
+
|
| 181 |
+
def test_name(self):
|
| 182 |
+
from picarones.engines.mon_moteur import MonMoteurEngine
|
| 183 |
+
engine = MonMoteurEngine()
|
| 184 |
+
assert engine.name == "mon_moteur"
|
| 185 |
+
|
| 186 |
+
def test_process_image_mock(self):
|
| 187 |
+
from picarones.engines.mon_moteur import MonMoteurEngine
|
| 188 |
+
engine = MonMoteurEngine(config={"lang": "fra"})
|
| 189 |
+
mock_result = type("R", (), {"text": "Texte transcrit"})()
|
| 190 |
+
with patch("mon_moteur.transcribe", return_value=mock_result):
|
| 191 |
+
text = engine.process_image("/tmp/test.jpg")
|
| 192 |
+
assert text == "Texte transcrit"
|
| 193 |
+
|
| 194 |
+
def test_process_image_import_error(self):
|
| 195 |
+
from picarones.engines.mon_moteur import MonMoteurEngine
|
| 196 |
+
engine = MonMoteurEngine()
|
| 197 |
+
with patch.dict("sys.modules", {"mon_moteur": None}):
|
| 198 |
+
with pytest.raises(RuntimeError, match="pas installé"):
|
| 199 |
+
engine.process_image("/tmp/test.jpg")
|
| 200 |
+
```
|
| 201 |
+
|
| 202 |
+
---
|
| 203 |
+
|
| 204 |
+
## 3. Ajouter un adaptateur LLM
|
| 205 |
+
|
| 206 |
+
LLM adapters live in `picarones/llm/`. Create `picarones/llm/mon_llm_adapter.py`:
|
| 207 |
+
|
| 208 |
+
```python
|
| 209 |
+
"""Adaptateur pour Mon LLM.
|
| 210 |
+
|
| 211 |
+
Supporte les modes : text_only, text_and_image, zero_shot.
|
| 212 |
+
"""
|
| 213 |
+
|
| 214 |
+
from __future__ import annotations
|
| 215 |
+
|
| 216 |
+
import base64
|
| 217 |
+
import logging
|
| 218 |
+
from pathlib import Path
|
| 219 |
+
from typing import Optional
|
| 220 |
+
|
| 221 |
+
from picarones.llm.base import BaseLLMAdapter
|
| 222 |
+
|
| 223 |
+
logger = logging.getLogger(__name__)
|
| 224 |
+
|
| 225 |
+
|
| 226 |
+
class MonLLMAdapter(BaseLLMAdapter):
|
| 227 |
+
"""Adaptateur pour Mon LLM.
|
| 228 |
+
|
| 229 |
+
Args:
|
| 230 |
+
config: Configuration.
|
| 231 |
+
- ``model`` (str): Modèle à utiliser.
|
| 232 |
+
- ``api_key`` (str): Clé API (peut aussi être dans ``MON_LLM_API_KEY``).
|
| 233 |
+
- ``temperature`` (float): Température (0.0 à 1.0). Défaut: 0.0.
|
| 234 |
+
- ``max_tokens`` (int): Nombre maximum de tokens. Défaut: 4096.
|
| 235 |
+
"""
|
| 236 |
+
|
| 237 |
+
name = "mon_llm"
|
| 238 |
+
|
| 239 |
+
def __init__(self, config: Optional[dict] = None) -> None:
|
| 240 |
+
super().__init__(config or {})
|
| 241 |
+
import os
|
| 242 |
+
self.api_key = self.config.get("api_key") or os.getenv("MON_LLM_API_KEY", "")
|
| 243 |
+
self.model = self.config.get("model", "mon-modele-v1")
|
| 244 |
+
self.temperature = float(self.config.get("temperature", 0.0))
|
| 245 |
+
self.max_tokens = int(self.config.get("max_tokens", 4096))
|
| 246 |
+
|
| 247 |
+
def correct_text(self, ocr_text: str, prompt: str) -> str:
|
| 248 |
+
"""Corrige le texte OCR en mode texte seul (Mode 1).
|
| 249 |
+
|
| 250 |
+
Args:
|
| 251 |
+
ocr_text: Sortie brute du moteur OCR à corriger.
|
| 252 |
+
prompt: Prompt de correction.
|
| 253 |
+
|
| 254 |
+
Returns:
|
| 255 |
+
Texte corrigé par le LLM.
|
| 256 |
+
"""
|
| 257 |
+
# Implémenter l'appel API ici
|
| 258 |
+
full_prompt = prompt.replace("{ocr_output}", ocr_text)
|
| 259 |
+
return self._call_api(messages=[{"role": "user", "content": full_prompt}])
|
| 260 |
+
|
| 261 |
+
def correct_with_image(self, ocr_text: str, image_path: str, prompt: str) -> str:
|
| 262 |
+
"""Corrige le texte OCR avec l'image (Mode 2).
|
| 263 |
+
|
| 264 |
+
Args:
|
| 265 |
+
ocr_text: Sortie brute du moteur OCR.
|
| 266 |
+
image_path: Chemin vers l'image originale.
|
| 267 |
+
prompt: Prompt de correction.
|
| 268 |
+
|
| 269 |
+
Returns:
|
| 270 |
+
Texte corrigé.
|
| 271 |
+
"""
|
| 272 |
+
image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
|
| 273 |
+
# Implémenter selon l'API de votre LLM
|
| 274 |
+
return self._call_api_with_image(ocr_text, image_b64, prompt)
|
| 275 |
+
|
| 276 |
+
def transcribe_image(self, image_path: str, prompt: str) -> str:
|
| 277 |
+
"""Transcription zero-shot depuis l'image seule (Mode 3).
|
| 278 |
+
|
| 279 |
+
Args:
|
| 280 |
+
image_path: Chemin vers l'image.
|
| 281 |
+
prompt: Prompt de transcription.
|
| 282 |
+
|
| 283 |
+
Returns:
|
| 284 |
+
Transcription produite par le LLM.
|
| 285 |
+
"""
|
| 286 |
+
image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
|
| 287 |
+
return self._call_api_with_image("", image_b64, prompt)
|
| 288 |
+
|
| 289 |
+
def _call_api(self, messages: list[dict]) -> str:
|
| 290 |
+
"""Appel API générique."""
|
| 291 |
+
raise NotImplementedError("Implémenter _call_api()")
|
| 292 |
+
|
| 293 |
+
def _call_api_with_image(self, text: str, image_b64: str, prompt: str) -> str:
|
| 294 |
+
"""Appel API avec image."""
|
| 295 |
+
raise NotImplementedError("Implémenter _call_api_with_image()")
|
| 296 |
+
```
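The two building blocks the adapter relies on, filling the `{ocr_output}` placeholder and base64-encoding the image, can be exercised on their own. The sketch below assumes nothing about a particular LLM API: `build_prompt` and `encode_image` are hypothetical helper names, and the message format is just the `role`/`content` dictionary used in `correct_text()` above.

```python
from __future__ import annotations

import base64
import tempfile
from pathlib import Path


def build_prompt(template: str, ocr_text: str) -> str:
    """Insert the raw OCR output into the prompt template."""
    return template.replace("{ocr_output}", ocr_text)


def encode_image(image_path: str) -> str:
    """Return the image bytes as an ASCII base64 string."""
    return base64.b64encode(Path(image_path).read_bytes()).decode("ascii")


prompt = build_prompt("Corrige ce texte OCR :\n{ocr_output}", "Bnjour le mnde")

# A few fake bytes stand in for a real page image.
with tempfile.TemporaryDirectory() as tmp:
    fake_image = Path(tmp) / "page.png"
    fake_image.write_bytes(b"\x89PNG")
    image_b64 = encode_image(str(fake_image))

messages = [{"role": "user", "content": prompt}]
print(messages[0]["content"])
print(image_b64)
```

How the encoded image is attached to the request depends on the provider, which is why `_call_api_with_image()` is left to implement.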
|
| 297 |
+
|
| 298 |
+
---
|
| 299 |
+
|
| 300 |
+
## 4. Ajouter une source d'import
|
| 301 |
+
|
| 302 |
+
Importers live in `picarones/importers/`. See `iiif.py` and `gallica.py` for examples.
|
| 303 |
+
|
| 304 |
+
Your importer must return a `Corpus` object from `picarones.core.corpus`:
|
| 305 |
+
|
| 306 |
+
```python
|
| 307 |
+
from pathlib import Path

from picarones.core.corpus import Corpus, Document
|
| 308 |
+
|
| 309 |
+
def import_from_ma_source(url: str, output_dir: str) -> Corpus:
|
| 310 |
+
documents = []
|
| 311 |
+
# ... télécharger et préparer les documents ...
|
| 312 |
+
for img_path, gt_text in zip(images, ground_truths):
|
| 313 |
+
documents.append(Document(
|
| 314 |
+
doc_id=Path(img_path).stem,
|
| 315 |
+
image_path=str(img_path),
|
| 316 |
+
ground_truth=gt_text,
|
| 317 |
+
metadata={"source": "ma_source"},
|
| 318 |
+
))
|
| 319 |
+
return Corpus(
|
| 320 |
+
name="Corpus depuis Ma Source",
|
| 321 |
+
source=url,
|
| 322 |
+
documents=documents,
|
| 323 |
+
)
|
| 324 |
+
```
|
| 325 |
+
|
| 326 |
+
Add the new command in `picarones/cli.py` as a subcommand of `picarones import`.
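The "download and prepare the documents" step usually ends with pairing each image with its ground-truth text. A minimal sketch of that pairing follows, under stated assumptions: the `Document` dataclass is a stand-in with the fields used in the example above (the real class is in `picarones.core.corpus`), and the `<stem>.gt.txt` naming convention is hypothetical.

```python
from __future__ import annotations

import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """Stand-in for picarones.core.corpus.Document (assumed fields)."""

    doc_id: str
    image_path: str
    ground_truth: str
    metadata: dict = field(default_factory=dict)


def pair_images_with_ground_truth(folder: str, source: str) -> list[Document]:
    """Build one Document per image that has a matching '<stem>.gt.txt' file."""
    documents = []
    for img_path in sorted(Path(folder).glob("*.jpg")):
        gt_path = img_path.with_name(img_path.stem + ".gt.txt")
        if not gt_path.exists():
            continue  # images without ground truth are skipped
        documents.append(Document(
            doc_id=img_path.stem,
            image_path=str(img_path),
            ground_truth=gt_path.read_text(encoding="utf-8").strip(),
            metadata={"source": source},
        ))
    return documents


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "f001.jpg").write_bytes(b"")
    (Path(tmp) / "f001.gt.txt").write_text("Incipit liber\n", encoding="utf-8")
    (Path(tmp) / "f002.jpg").write_bytes(b"")  # no ground truth: ignored
    docs = pair_images_with_ground_truth(tmp, source="ma_source")

print([(d.doc_id, d.ground_truth) for d in docs])
```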
|
| 327 |
+
|
| 328 |
+
---
|
| 329 |
+
|
| 330 |
+
## 5. Écrire des tests
|
| 331 |
+
|
| 332 |
+
### Conventions
|
| 333 |
+
|
| 334 |
+
- One test file per module or sprint: `tests/test_mon_module.py`
|
| 335 |
+
- Group test classes by feature: `class TestMonModule:`
|
| 336 |
+
- Mock network calls and OCR engines with `unittest.mock.patch`
|
| 337 |
+
- Aim for **100% coverage** of public modules
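As an illustration of the mocking convention, the sketch below tests a function that would normally hit the network. `fetch_manifest_label` is a hypothetical helper written for this example, not part of Picarones; only `urllib.request.urlopen` and `unittest.mock` are real.

```python
from __future__ import annotations

import json
import urllib.request
from unittest.mock import MagicMock, patch


def fetch_manifest_label(url: str) -> str:
    """Hypothetical helper: download a JSON manifest and return its label."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read())["label"]


def test_fetch_manifest_label_without_network() -> None:
    # The fake response acts as its own context manager and returns canned bytes.
    fake_response = MagicMock()
    fake_response.read.return_value = b'{"label": "Ms. latin 1"}'
    fake_response.__enter__.return_value = fake_response

    with patch("urllib.request.urlopen", return_value=fake_response) as mocked:
        label = fetch_manifest_label("https://example.org/manifest.json")

    assert label == "Ms. latin 1"
    mocked.assert_called_once_with("https://example.org/manifest.json")


test_fetch_manifest_label_without_network()
```

Patching at the call site keeps the test fast and independent of any remote server.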
|
| 338 |
+
|
| 339 |
+
### Structure recommandée
|
| 340 |
+
|
| 341 |
+
```python
|
| 342 |
+
"""Tests pour MonModule.
|
| 343 |
+
|
| 344 |
+
Classes
|
| 345 |
+
-------
|
| 346 |
+
TestFonctionnalite1 (N tests) — description
|
| 347 |
+
TestFonctionnalite2 (M tests) — description
|
| 348 |
+
"""
|
| 349 |
+
|
| 350 |
+
from __future__ import annotations
|
| 351 |
+
import pytest
|
| 352 |
+
from unittest.mock import patch, MagicMock
|
| 353 |
+
|
| 354 |
+
|
| 355 |
+
class TestFonctionnalite1:
|
| 356 |
+
|
| 357 |
+
def test_cas_nominal(self):
|
| 358 |
+
from picarones.mon_module import ma_fonction
|
| 359 |
+
result = ma_fonction("entrée")
|
| 360 |
+
assert result == "sortie attendue"
|
| 361 |
+
|
| 362 |
+
def test_cas_erreur(self):
|
| 363 |
+
from picarones.mon_module import ma_fonction
|
| 364 |
+
with pytest.raises(ValueError, match="message d'erreur"):
|
| 365 |
+
ma_fonction(None)
|
| 366 |
+
|
| 367 |
+
def test_avec_mock(self):
|
| 368 |
+
from picarones.mon_module import MonClient
|
| 369 |
+
client = MonClient("https://example.org", token="tok")
|
| 370 |
+
with patch.object(client, "_fetch", return_value=b"réponse"):
|
| 371 |
+
result = client.appel_api()
|
| 372 |
+
assert result is not None
|
| 373 |
+
```
|
| 374 |
+
|
| 375 |
+
### Lancer les tests
|
| 376 |
+
|
| 377 |
+
```bash
|
| 378 |
+
# Tous les tests
|
| 379 |
+
make test
|
| 380 |
+
# ou
|
| 381 |
+
pytest
|
| 382 |
+
|
| 383 |
+
# Un fichier spécifique
|
| 384 |
+
pytest tests/test_mon_module.py -v
|
| 385 |
+
|
| 386 |
+
# Avec couverture
|
| 387 |
+
pytest --cov=picarones --cov-report=html
|
| 388 |
+
open htmlcov/index.html
|
| 389 |
+
|
| 390 |
+
# Tests rapides (sans les tests lents)
|
| 391 |
+
pytest -m "not slow"
|
| 392 |
+
```
|
| 393 |
+
|
| 394 |
+
---
|
| 395 |
+
|
| 396 |
+
## 6. Soumettre une Pull Request
|
| 397 |
+
|
| 398 |
+
### Avant de soumettre
|
| 399 |
+
|
| 400 |
+
```bash
|
| 401 |
+
# 1. Vérifier que tous les tests passent
|
| 402 |
+
make test
|
| 403 |
+
|
| 404 |
+
# 2. Vérifier le style de code (si ruff/flake8 disponible)
|
| 405 |
+
make lint
|
| 406 |
+
|
| 407 |
+
# 3. Mettre à jour le CHANGELOG.md
|
| 408 |
+
|
| 409 |
+
# 4. Pousser votre branche
|
| 410 |
+
git push origin feat/mon-nouveau-moteur
|
| 411 |
+
```
|
| 412 |
+
|
| 413 |
+
### Checklist PR
|
| 414 |
+
|
| 415 |
+
- [ ] Unit tests for every new public function
|
| 416 |
+
- [ ] Google-style docstrings on public classes and methods
|
| 417 |
+
- [ ] CHANGELOG.md updated in the `[Unreleased]` section
|
| 418 |
+
- [ ] No regression in the existing test suite (`pytest` passes)
|
| 419 |
+
- [ ] Code compatible with Python 3.11 and 3.12
|
| 420 |
+
- [ ] No hard-coded API keys in the code
|
| 421 |
+
|
| 422 |
+
### Description de PR
|
| 423 |
+
|
| 424 |
+
```markdown
|
| 425 |
+
## Résumé
|
| 426 |
+
- Ajout de l'adaptateur pour Mon Moteur OCR
|
| 427 |
+
- Support des langues latin et français
|
| 428 |
+
|
| 429 |
+
## Tests
|
| 430 |
+
- 15 tests unitaires dans `tests/test_mon_moteur.py`
|
| 431 |
+
- Mocké avec `unittest.mock.patch` (pas de dépendance externe requise pour les tests)
|
| 432 |
+
|
| 433 |
+
## Changements
|
| 434 |
+
- `picarones/engines/mon_moteur.py` : nouvel adaptateur
|
| 435 |
+
- `picarones/cli.py` : enregistrement du moteur
|
| 436 |
+
- `pyproject.toml` : extra `[mon-moteur]`
|
| 437 |
+
```
|
| 438 |
+
|
| 439 |
+
---
|
| 440 |
+
|
| 441 |
+
## 7. Conventions de code
|
| 442 |
+
|
| 443 |
+
### Style
|
| 444 |
+
|
| 445 |
+
- **Python 3.11+** with type annotations
|
| 446 |
+
- `from __future__ import annotations` at the top of each file
|
| 447 |
+
- Formatting: PEP 8, lines of at most 100 characters (no automatic formatter is enforced)
|
| 448 |
+
|
| 449 |
+
### Docstrings — format Google
|
| 450 |
+
|
| 451 |
+
```python
|
| 452 |
+
def compute_cer(reference: str, hypothesis: str) -> float:
|
| 453 |
+
"""Calcule le Character Error Rate (CER) entre référence et hypothèse.
|
| 454 |
+
|
| 455 |
+
Le CER est défini comme la distance de Levenshtein au niveau caractère
|
| 456 |
+
divisée par la longueur de la référence.
|
| 457 |
+
|
| 458 |
+
Args:
|
| 459 |
+
reference: Texte de vérité terrain (GT).
|
| 460 |
+
hypothesis: Texte produit par le moteur OCR.
|
| 461 |
+
|
| 462 |
+
Returns:
|
| 463 |
+
CER entre 0.0 (parfait) et 1.0+ (nombreuses erreurs).
|
| 464 |
+
|
| 465 |
+
Raises:
|
| 466 |
+
ValueError: Si ``reference`` est vide.
|
| 467 |
+
|
| 468 |
+
Examples:
|
| 469 |
+
>>> compute_cer("bonjour", "bnjour")
|
| 470 |
+
0.14285714285714285
|
| 471 |
+
"""
|
| 472 |
+
```
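For reference, the definition in the docstring above (character-level Levenshtein distance divided by the reference length) can be implemented in a few lines. This is a plain dynamic-programming sketch consistent with that docstring, not necessarily the implementation used in Picarones; the `levenshtein` helper name is an assumption.

```python
from __future__ import annotations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def compute_cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


print(compute_cer("bonjour", "bnjour"))  # one missing character out of seven
```

Because the hypothesis can be longer than the reference, the value can exceed 1.0, which matches the "1.0+" in the docstring.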
|
| 473 |
+
|
| 474 |
+
### Nommage
|
| 475 |
+
|
| 476 |
+
- Classes: `PascalCase` (e.g. `TesseractEngine`, `GallicaClient`)
|
| 477 |
+
- Functions and methods: `snake_case` (e.g. `compute_metrics`, `list_projects`)
|
| 478 |
+
- Constants: `UPPER_SNAKE_CASE` (e.g. `DEGRADATION_LEVELS`)
|
| 479 |
+
- Module files: `snake_case.py` (e.g. `gallica.py`, `char_scores.py`)
|
| 480 |
+
|
| 481 |
+
### Gestion des imports optionnels
|
| 482 |
+
|
| 483 |
+
```python
|
| 484 |
+
# Pattern recommandé pour les dépendances optionnelles
|
| 485 |
+
def process_image(self, image_path: str) -> str:
|
| 486 |
+
try:
|
| 487 |
+
import mon_moteur
|
| 488 |
+
except ImportError as exc:
|
| 489 |
+
raise RuntimeError(
|
| 490 |
+
"mon-moteur n'est pas installé. Installez-le avec : pip install mon-moteur"
|
| 491 |
+
) from exc
|
| 492 |
+
# utiliser mon_moteur...
|
| 493 |
+
```
|
| 494 |
+
|
| 495 |
+
### Variables d'environnement pour les clés API
|
| 496 |
+
|
| 497 |
+
```python
|
| 498 |
+
import os
|
| 499 |
+
|
| 500 |
+
api_key = config.get("api_key") or os.getenv("MON_API_KEY", "")
|
| 501 |
+
if not api_key:
|
| 502 |
+
raise RuntimeError(
|
| 503 |
+
"Clé API manquante. Définissez MON_API_KEY ou passez api_key dans la config."
|
| 504 |
+
)
|
| 505 |
+
```
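The same lookup can be wrapped in a small helper so that every adapter resolves its key the same way. A sketch, assuming a hypothetical `resolve_api_key` function; the precedence (config first, then environment) is the one shown above.

```python
from __future__ import annotations

import os
from typing import Optional


def resolve_api_key(config: Optional[dict], env_var: str) -> str:
    """Return the key from the config if present, else from the environment."""
    api_key = (config or {}).get("api_key") or os.getenv(env_var, "")
    if not api_key:
        raise RuntimeError(
            f"Missing API key. Set {env_var} or pass api_key in the config."
        )
    return api_key


# An explicit config value takes precedence over the environment variable.
os.environ["MON_API_KEY"] = "from-env"
print(resolve_api_key({"api_key": "from-config"}, "MON_API_KEY"))
print(resolve_api_key({}, "MON_API_KEY"))
```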
|
| 506 |
+
|
| 507 |
+
---
|
| 508 |
+
|
| 509 |
+
## Licence
|
| 510 |
+
|
| 511 |
+
By contributing to Picarones, you agree that your contribution will be distributed
|
| 512 |
+
under the Apache 2.0 license.
|
|
@@ -0,0 +1,100 @@
|
| 1 |
+
# Dockerfile — Picarones
|
| 2 |
+
# Image Docker multi-étape avec Tesseract OCR pré-installé
|
| 3 |
+
#
|
| 4 |
+
# Usage :
|
| 5 |
+
# docker build -t picarones:latest .
|
| 6 |
+
# docker run -p 8000:8000 picarones:latest
|
| 7 |
+
# docker run -p 8000:8000 -v $(pwd)/corpus:/app/corpus picarones:latest
|
| 8 |
+
#
|
| 9 |
+
# Variables d'environnement supportées :
|
| 10 |
+
# OPENAI_API_KEY, ANTHROPIC_API_KEY, MISTRAL_API_KEY
|
| 11 |
+
# GOOGLE_APPLICATION_CREDENTIALS
|
| 12 |
+
# AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION
|
| 13 |
+
# AZURE_DOC_INTEL_ENDPOINT, AZURE_DOC_INTEL_KEY
|
| 14 |
+
|
| 15 |
+
# ──────────────────────────────────────────────────────────────────
|
| 16 |
+
# Étape 1 : builder — installe les dépendances Python dans un venv
|
| 17 |
+
# ──────────────────────────────────────────────────────────────────
|
| 18 |
+
FROM python:3.11-slim AS builder
|
| 19 |
+
|
| 20 |
+
WORKDIR /app
|
| 21 |
+
|
| 22 |
+
# Dépendances système pour la compilation
|
| 23 |
+
RUN apt-get update && apt-get install -y --no-install-recommends \
|
| 24 |
+
build-essential \
|
| 25 |
+
git \
|
| 26 |
+
&& rm -rf /var/lib/apt/lists/*
|
| 27 |
+
|
| 28 |
+
# Copier les fichiers de configuration du package
|
| 29 |
+
COPY pyproject.toml .
|
| 30 |
+
COPY README.md .
|
| 31 |
+
COPY picarones/ picarones/
|
| 32 |
+
|
| 33 |
+
# Créer un venv isolé et installer Picarones avec les extras web
|
| 34 |
+
RUN python -m venv /opt/venv
|
| 35 |
+
ENV PATH="/opt/venv/bin:$PATH"
|
| 36 |
+
RUN pip install --upgrade pip && \
|
| 37 |
+
pip install -e ".[web]" && \
|
| 38 |
+
pip cache purge
|
| 39 |
+
|
| 40 |
+
# ──────────────────────────────────────────────────────────────────
|
| 41 |
+
# Étape 2 : runtime — image finale légère avec Tesseract
|
| 42 |
+
# ──────────────────────────────────────────────────────────────────
|
| 43 |
+
FROM python:3.11-slim AS runtime
|
| 44 |
+
|
| 45 |
+
LABEL maintainer="BnF — Département numérique"
|
| 46 |
+
LABEL description="Picarones — Plateforme de comparaison de moteurs OCR pour documents patrimoniaux"
|
| 47 |
+
LABEL version="1.0.0"
|
| 48 |
+
LABEL org.opencontainers.image.source="https://github.com/bnf/picarones"
|
| 49 |
+
LABEL org.opencontainers.image.licenses="Apache-2.0"
|
| 50 |
+
|
| 51 |
+
WORKDIR /app
|
| 52 |
+
|
| 53 |
+
# ── Dépendances système ─────────────────────────────────────────
|
| 54 |
+
RUN apt-get update && apt-get install -y --no-install-recommends \
|
| 55 |
+
# Tesseract OCR 5 et modèles de langues
|
| 56 |
+
tesseract-ocr \
|
| 57 |
+
tesseract-ocr-fra \
|
| 58 |
+
tesseract-ocr-lat \
|
| 59 |
+
tesseract-ocr-eng \
|
| 60 |
+
tesseract-ocr-deu \
|
| 61 |
+
tesseract-ocr-ita \
|
| 62 |
+
tesseract-ocr-spa \
|
| 63 |
+
# Bibliothèques image pour Pillow
|
| 64 |
+
libpng16-16 \
|
| 65 |
+
libjpeg62-turbo \
|
| 66 |
+
libtiff6 \
|
| 67 |
+
libwebp7 \
|
| 68 |
+
# Utilitaires
|
| 69 |
+
curl \
|
| 70 |
+
&& rm -rf /var/lib/apt/lists/*
|
| 71 |
+
|
| 72 |
+
# ── Venv Python depuis le builder ──────────────────────────────
|
| 73 |
+
COPY --from=builder /opt/venv /opt/venv
|
| 74 |
+
ENV PATH="/opt/venv/bin:$PATH"
|
| 75 |
+
|
| 76 |
+
# ── Code source de l'application ───────────────────────────────
|
| 77 |
+
COPY --from=builder /app /app
|
| 78 |
+
|
| 79 |
+
# ── Répertoires de données ──────────────────────────────────────
|
| 80 |
+
RUN mkdir -p /app/corpus /app/rapports /app/data
|
| 81 |
+
|
| 82 |
+
# ── Utilisateur non-root pour la sécurité ──────────────────────
|
| 83 |
+
RUN useradd -m -u 1000 picarones && \
|
| 84 |
+
chown -R picarones:picarones /app
|
| 85 |
+
USER picarones
|
| 86 |
+
|
| 87 |
+
# ── Variables d'environnement par défaut ───────────────────────
|
| 88 |
+
ENV PYTHONUNBUFFERED=1
|
| 89 |
+
ENV PYTHONIOENCODING=utf-8
|
| 90 |
+
ENV TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata
|
| 91 |
+
|
| 92 |
+
# ── Ports ───────────────────────────────────────────────────────
|
| 93 |
+
EXPOSE 8000
|
| 94 |
+
|
| 95 |
+
# ── Health check ────────────────────────────────────────────────
|
| 96 |
+
HEALTHCHECK --interval=30s --timeout=10s --start-period=15s --retries=3 \
|
| 97 |
+
CMD curl -f http://localhost:8000/health || exit 1
|
| 98 |
+
|
| 99 |
+
# ── Démarrage ───────────────────────────────────────────────────
|
| 100 |
+
CMD ["picarones", "serve", "--host", "0.0.0.0", "--port", "8000"]
|
|
@@ -0,0 +1,501 @@
|
| 1 |
+
# Installation guide: Picarones
|
| 2 |
+
|
| 3 |
+
> Detailed guide for Linux, macOS and Windows.
|
| 4 |
+
> For a 5-minute installation, see [README.md](README.md#installation-rapide).
|
| 5 |
+
|
| 6 |
+
---
|
| 7 |
+
|
| 8 |
+
## Sommaire
|
| 9 |
+
|
| 10 |
+
1. [Prérequis](#1-prérequis)
|
| 11 |
+
2. [Installation Linux (Ubuntu/Debian)](#2-installation-linux-ubuntudebian)
|
| 12 |
+
3. [Installation macOS](#3-installation-macos)
|
| 13 |
+
4. [Installation Windows](#4-installation-windows)
|
| 14 |
+
5. [Configuration des moteurs OCR](#5-configuration-des-moteurs-ocr)
|
| 15 |
+
6. [Configuration des APIs](#6-configuration-des-apis)
|
| 16 |
+
7. [Lancement de l'interface web](#7-lancement-de-linterface-web)
|
| 17 |
+
8. [Installation Docker](#8-installation-docker)
|
| 18 |
+
9. [Vérification de l'installation](#9-vérification-de-linstallation)
|
| 19 |
+
10. [Résolution des problèmes courants](#10-résolution-des-problèmes-courants)
|
| 20 |
+
|
| 21 |
+
---
|
| 22 |
+
|
| 23 |
+
## 1. Prérequis
|
| 24 |
+
|
| 25 |
+
| Component | Minimum version | Required |
|
| 26 |
+
|-----------|-----------------|-------------|
|
| 27 |
+
| Python | 3.11 | Yes |
|
| 28 |
+
| pip | 23.0+ | Yes |
|
| 29 |
+
| Git | 2.x | Yes (to clone the repository) |
|
| 30 |
+
| Tesseract | 5.0+ | Only for the Tesseract engine |
|
| 31 |
+
| Pero OCR | 0.1+ | Only for the Pero OCR engine |
|
| 32 |
+
| Docker | 24.x | Only for containerized deployment |
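A quick way to see which prerequisites are already present is to probe them from Python. The sketch below is an assumption-laden convenience, not a Picarones command: `check_prerequisites` is a hypothetical name, and it only checks that the tools are on the `PATH`, not their versions.

```python
from __future__ import annotations

import shutil
import sys


def check_prerequisites() -> dict[str, bool]:
    """Report which prerequisites are available on this machine."""
    return {
        "python>=3.11": sys.version_info >= (3, 11),
        "git": shutil.which("git") is not None,
        "tesseract": shutil.which("tesseract") is not None,
        "docker": shutil.which("docker") is not None,
    }


for component, available in check_prerequisites().items():
    print(f"{component:<14} {'ok' if available else 'missing'}")
```

Only Python, pip and Git are mandatory; a missing `tesseract` or `docker` matters only if you plan to use them.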
|
| 33 |
+
|
| 34 |
+
---
|
| 35 |
+
|
| 36 |
+
## 2. Installation Linux (Ubuntu/Debian)
|
| 37 |
+
|
| 38 |
+
### 2.1 Python et pip
|
| 39 |
+
|
| 40 |
+
```bash
|
| 41 |
+
sudo apt update
|
| 42 |
+
sudo apt install python3.11 python3.11-venv python3-pip git
|
| 43 |
+
python3.11 --version # Vérifier : Python 3.11.x
|
| 44 |
+
```
|
| 45 |
+
|
| 46 |
+
### 2.2 Tesseract OCR
|
| 47 |
+
|
| 48 |
+
```bash
|
| 49 |
+
# Tesseract 5 (the PPA is needed on Ubuntu releases that still ship Tesseract 4, such as 22.04 and earlier)
|
| 50 |
+
sudo add-apt-repository ppa:alex-p/tesseract-ocr5 -y
|
| 51 |
+
sudo apt update
|
| 52 |
+
sudo apt install tesseract-ocr
|
| 53 |
+
|
| 54 |
+
# Modèles de langues (choisir selon votre corpus)
|
| 55 |
+
sudo apt install tesseract-ocr-fra # Français
|
| 56 |
+
sudo apt install tesseract-ocr-lat # Latin
|
| 57 |
+
sudo apt install tesseract-ocr-eng # Anglais
|
| 58 |
+
sudo apt install tesseract-ocr-deu # Allemand
|
| 59 |
+
sudo apt install tesseract-ocr-ita # Italien
|
| 60 |
+
sudo apt install tesseract-ocr-spa # Espagnol
|
| 61 |
+
|
| 62 |
+
# Vérifier
|
| 63 |
+
tesseract --version # Tesseract 5.x.x
|
| 64 |
+
tesseract --list-langs
|
| 65 |
+
```
|
| 66 |
+
|
| 67 |
+
### 2.3 Picarones
|
| 68 |
+
|
| 69 |
+
```bash
|
| 70 |
+
git clone https://github.com/bnf/picarones.git
|
| 71 |
+
cd picarones
|
| 72 |
+
|
| 73 |
+
# Créer un environnement virtuel (recommandé)
|
| 74 |
+
python3.11 -m venv .venv
|
| 75 |
+
source .venv/bin/activate
|
| 76 |
+
|
| 77 |
+
# Installation de base
|
| 78 |
+
pip install -e .
|
| 79 |
+
|
| 80 |
+
# Installation avec interface web (FastAPI + uvicorn)
|
| 81 |
+
pip install -e ".[web]"
|
| 82 |
+
|
| 83 |
+
# Installation with the web, hf and dev extras
|
| 84 |
+
pip install -e ".[web,hf,dev]"
|
| 85 |
+
```
|
| 86 |
+
|
| 87 |
+
### 2.4 Pero OCR (optionnel)
|
| 88 |
+
|
| 89 |
+
```bash
|
| 90 |
+
# Pero OCR nécessite quelques dépendances système
|
| 91 |
+
sudo apt install libgl1 libglib2.0-0
|
| 92 |
+
|
| 93 |
+
pip install pero-ocr
|
| 94 |
+
|
| 95 |
+
# Télécharger un modèle pré-entraîné
|
| 96 |
+
# Voir https://github.com/DCGM/pero-ocr pour les modèles disponibles
|
| 97 |
+
```
|
| 98 |
+
|
| 99 |
+
---
|
| 100 |
+
|
| 101 |
+
## 3. Installation macOS
|
| 102 |
+
|
| 103 |
+
### 3.1 Homebrew (si non installé)
|
| 104 |
+
|
| 105 |
+
```bash
|
| 106 |
+
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
|
| 107 |
+
```
|
| 108 |
+
|
| 109 |
+
### 3.2 Python et Tesseract
|
| 110 |
+
|
| 111 |
+
```bash
|
| 112 |
+
brew install python@3.11 tesseract
|
| 113 |
+
|
| 114 |
+
# Modèles de langues Tesseract
|
| 115 |
+
brew install tesseract-lang # Installe tous les modèles
|
| 116 |
+
|
| 117 |
+
# Ou modèles individuels via les données de tessdata
|
| 118 |
+
# Voir https://github.com/tesseract-ocr/tessdata
|
| 119 |
+
```
|
| 120 |
+
|
| 121 |
+
### 3.3 Picarones
|
| 122 |
+
|
| 123 |
+
```bash
|
| 124 |
+
git clone https://github.com/bnf/picarones.git
|
| 125 |
+
cd picarones
|
| 126 |
+
|
| 127 |
+
python3.11 -m venv .venv
|
| 128 |
+
source .venv/bin/activate
|
| 129 |
+
|
| 130 |
+
pip install -e ".[web]"
|
| 131 |
+
```
|
| 132 |
+
|
| 133 |
+
### 3.4 Résolution d'un problème courant macOS
|
| 134 |
+
|
| 135 |
+
If `pytesseract` cannot find Tesseract:
|
| 136 |
+
|
| 137 |
+
```bash
|
| 138 |
+
# Trouver le chemin de Tesseract
|
| 139 |
+
which tesseract # Ex : /opt/homebrew/bin/tesseract
|
| 140 |
+
|
| 141 |
+
# Then set it explicitly in your Python script (the two lines below are Python, not shell):
|
| 142 |
+
import pytesseract
|
| 143 |
+
pytesseract.pytesseract.tesseract_cmd = '/opt/homebrew/bin/tesseract'
|
| 144 |
+
```
|
| 145 |
+
|
| 146 |
+
Ou définir la variable d'environnement :
|
| 147 |
+
|
| 148 |
+
```bash
|
| 149 |
+
export TESSDATA_PREFIX=/opt/homebrew/share/tessdata/
|
| 150 |
+
```
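
The same lookup can be done from Python before configuring `pytesseract`. The sketch below is a minimal illustration that uses only the standard library; the helper name `find_tesseract` and the list of fallback paths are assumptions made for this example, not part of Picarones.

```python
import os
import shutil

# Hypothetical fallback locations (Homebrew on Apple Silicon and Intel, Linux packages).
FALLBACK_PATHS = [
    "/opt/homebrew/bin/tesseract",
    "/usr/local/bin/tesseract",
    "/usr/bin/tesseract",
]


def find_tesseract(extra_paths=FALLBACK_PATHS):
    """Return the path to the tesseract binary, or None if it cannot be found."""
    found = shutil.which("tesseract")  # same lookup as `which tesseract`
    if found:
        return found
    for candidate in extra_paths:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found: install it or add it to the PATH")
    else:
        print(f"tesseract found at {path}")
        # With pytesseract installed, the path could then be set explicitly:
        # import pytesseract
        # pytesseract.pytesseract.tesseract_cmd = path
```

Returning `None` instead of raising lets the caller decide whether a missing binary is fatal or whether the Tesseract engine should simply be skipped.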

---

## 4. Windows installation

### 4.1 Python

1. Download Python 3.11+ from [python.org](https://www.python.org/downloads/windows/)
2. Check "Add Python to PATH" during installation
3. Verify: `python --version` in PowerShell

### 4.2 Tesseract

1. Download the installer from [UB-Mannheim/tesseract](https://github.com/UB-Mannheim/tesseract/wiki)
2. Choose version 5.x (64-bit recommended)
3. **During installation**: select the language models you need (French, Latin…)
4. Add Tesseract to the PATH:
   - Search for "Environment variables" in the Start menu
   - Add `C:\Program Files\Tesseract-OCR` to the `Path` variable
5. Verify: `tesseract --version` in PowerShell

### 4.3 Git

Download it from [git-scm.com](https://git-scm.com/download/win) and install it.

### 4.4 Picarones

```powershell
git clone https://github.com/bnf/picarones.git
cd picarones

python -m venv .venv
.venv\Scripts\activate

pip install -e ".[web]"
```

### 4.5 Windows encoding issue

If you run into encoding errors, set:

```powershell
$env:PYTHONIOENCODING = "utf-8"
```

Or, in your PowerShell profile: `[Console]::OutputEncoding = [System.Text.Encoding]::UTF8`

---

## 5. Configuring the OCR engines

### 5.1 Tesseract: advanced configuration

```bash
# Check the installed models
tesseract --list-langs

# Test on an image
tesseract image.jpg sortie -l fra --psm 6

# Configuration in Picarones
picarones run --corpus ./corpus/ --engines tesseract --lang fra --psm 6
```

Recommended PSM (Page Segmentation Mode) values:

| PSM | Usage |
|-----|-------|
| 6 (default) | Uniform block of text |
| 3 | Automatic page layout detection |
| 11 | Sparse text, no layout |
| 1 | Automatic detection with OSD |
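
To script the test call shown above, the command line can be assembled in Python. This is a minimal sketch: the function name `build_tesseract_cmd` and the restriction to the four PSM values of the table are assumptions made for illustration (Tesseract itself accepts more modes).

```python
import shlex

# PSM values taken from the table above; this sketch accepts only these four.
PSM_MODES = {
    1: "automatic detection with OSD",
    3: "automatic page layout detection",
    6: "uniform block of text",
    11: "sparse text, no layout",
}


def build_tesseract_cmd(image, output_base, lang="fra", psm=6):
    """Build the argument list for the `tesseract` CLI call shown above."""
    if psm not in PSM_MODES:
        raise ValueError(f"unsupported PSM {psm}; expected one of {sorted(PSM_MODES)}")
    return ["tesseract", str(image), str(output_base), "-l", lang, "--psm", str(psm)]


if __name__ == "__main__":
    cmd = build_tesseract_cmd("image.jpg", "sortie")
    print(shlex.join(cmd))
    # To actually run it (requires Tesseract on the PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.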

### 5.2 Pero OCR

```bash
# Download a pretrained model (example)
mkdir -p ~/.pero/models
# See https://github.com/DCGM/pero-ocr/releases

# Configure via YAML
cat > pero_config.yaml << 'EOF'
name: pero_printed
type: pero_ocr
config_path: /path/to/pero_model/config.yaml
EOF
```

### 5.3 Kraken (optional)

```bash
pip install kraken

# Download a model
kraken get 10.5281/zenodo.XXXXXXX

# List the models available in the repository
kraken list
```

### 5.4 Ollama (local LLMs)

```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Start the service (it keeps running; use a separate terminal for the next commands)
ollama serve

# Download a model
ollama pull llama3
ollama pull gemma2

# Verify
ollama list
```

---

## 6. API configuration

API keys are read from environment variables. **Never write them in the code.**

### 6.1 `.env` file (recommended)

Create a `.env` file at the project root (and make sure it is listed in `.gitignore`):

```bash
# .env: do not commit this file!

# OpenAI (GPT-4o, GPT-4o mini)
OPENAI_API_KEY=sk-...

# Anthropic (Claude Sonnet, Haiku)
ANTHROPIC_API_KEY=sk-ant-...

# Mistral (Mistral Large, Pixtral, Mistral OCR)
MISTRAL_API_KEY=...

# Google Vision
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json

# AWS Textract
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_DEFAULT_REGION=eu-west-1

# Azure Document Intelligence
AZURE_DOC_INTEL_ENDPOINT=https://...cognitiveservices.azure.com/
AZURE_DOC_INTEL_KEY=...
```

Load it with `python-dotenv` or directly in the shell:

```bash
# Linux/macOS: export every variable defined in .env
set -a; . ./.env; set +a

# Or with python-dotenv
pip install python-dotenv
```
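
When `python-dotenv` is not available, a file in this simple `KEY=value` format can be read with the standard library alone. The sketch below is an illustration under that assumption; `parse_env_file` and `load_env_file` are hypothetical helper names, and the parser deliberately ignores the richer syntax (variable expansion, multi-line values) that `python-dotenv` supports.

```python
import os


def parse_env_file(path):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop optional surrounding quotes around the value.
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path, override=False):
    """Copy parsed values into os.environ; existing variables win unless override is set."""
    for key, value in parse_env_file(path).items():
        if override or key not in os.environ:
            os.environ[key] = value
```

Keeping already-set variables by default means a key exported in the shell takes precedence over the file, which is convenient for one-off overrides.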

### 6.2 Checking the APIs

```bash
# Test the configured APIs
picarones engines   # lists the available engines and their status
```
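
Before running the command, it can help to confirm that the expected variables are actually set. The sketch below only inspects the environment; the provider labels and the function name `provider_status` are assumptions for illustration, while the variable names come from the `.env` example in section 6.1. It does not reproduce how Picarones itself detects engines.

```python
import os

# Variable names from the .env example; the provider labels are hypothetical.
REQUIRED_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "mistral": ["MISTRAL_API_KEY"],
    "google_vision": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "aws_textract": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION"],
    "azure_doc_intel": ["AZURE_DOC_INTEL_ENDPOINT", "AZURE_DOC_INTEL_KEY"],
}


def provider_status(env=None):
    """Map each provider to True when all of its variables are set and non-empty."""
    env = os.environ if env is None else env
    return {
        name: all(env.get(var) for var in names)
        for name, names in REQUIRED_VARS.items()
    }


if __name__ == "__main__":
    for name, ready in provider_status().items():
        print(f"{name:16s} {'configured' if ready else 'missing variables'}")
```

A provider is reported as configured only when every variable it needs is present, so a partial AWS setup (for example a key without a region) shows up as missing.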

---

## 7. Launching the web interface

```bash
# Install the web dependencies
pip install -e ".[web]"

# Start the server (localhost only)
picarones serve

# Or listen on all interfaces (Docker, remote server)
picarones serve --host 0.0.0.0 --port 8000

# Development mode (auto-reload)
picarones serve --reload --verbose

# Open in a browser
# http://localhost:8000
```

---

## 8. Docker installation

### 8.1 Building and running the Docker image

```bash
# Build the image
docker build -t picarones:latest .

# Run the service
docker run -p 8000:8000 \
  -e OPENAI_API_KEY="$OPENAI_API_KEY" \
  -v "$(pwd)/corpus:/app/corpus" \
  picarones:latest

# Open in a browser
# http://localhost:8000
```

### 8.2 Docker Compose (Picarones + Ollama)

```bash
# Start the default services (Ollama sits behind an optional profile)
docker compose up -d

# With Ollama for local LLMs
docker compose --profile ollama up -d

# Stop
docker compose down
```

See [docker-compose.yml](docker-compose.yml) for the full configuration.

### 8.3 Environment variables for Docker

Create a `.env.docker` file:

```bash
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
MISTRAL_API_KEY=...
```

```bash
docker compose --env-file .env.docker up -d
```

---

## 9. Verifying the installation

```bash
# 1. Version and dependencies
picarones info

# 2. Available engines
picarones engines

# 3. Demo report (no real OCR engine required)
picarones demo --docs 3 --output test_demo.html
# Open test_demo.html in a browser

# 4. Longitudinal tracking (demo)
picarones history --demo

# 5. Robustness analysis (demo)
picarones robustness --corpus . --engine tesseract --demo

# 6. Full test suite
make test
# or
pytest
```

---

## 10. Troubleshooting common problems

### `tesseract: command not found`

```bash
# Ubuntu: reinstall
sudo apt install tesseract-ocr

# macOS: install via Homebrew
brew install tesseract

# Windows: check the PATH (in PowerShell, plain `where` is an alias for Where-Object)
where.exe tesseract   # should print a path
```

### `Error: No module named 'picarones'`

```bash
# Reinstall in editable mode
pip install -e .

# Check which virtual environment is active
which python   # should point to .venv/bin/python
```

### `pytesseract.pytesseract.TesseractNotFoundError`

```bash
# Linux/macOS: check the PATH
which tesseract

# Windows: check the installation and the PATH,
# then set the path in your Python script (Python code, not shell):
#   import pytesseract
#   pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
```

### UTF-8 encoding error (Windows)

```powershell
$env:PYTHONIOENCODING = "utf-8"
$env:PYTHONUTF8 = "1"
```

### Web interface unreachable

```bash
# Check whether the port is already in use
lsof -i :8000                  # Linux/macOS
netstat -ano | findstr :8000   # Windows

# Use another port
picarones serve --port 8080
```
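
The port check can also be done portably from Python, which avoids the difference between `lsof` and `netstat`. This is a minimal sketch with hypothetical helper names (`is_port_free`, `first_free_port`); it tests whether something accepts TCP connections on the port, which is a close proxy for "already in use".

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if nothing accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds, i.e. the port is taken.
        return sock.connect_ex((host, port)) != 0


def first_free_port(start=8000, attempts=20, host="127.0.0.1"):
    """Return the first free port in [start, start + attempts), or None."""
    for port in range(start, start + attempts):
        if is_port_free(port, host):
            return port
    return None


if __name__ == "__main__":
    port = first_free_port()
    print(f"picarones serve --port {port}" if port else "no free port found")
```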

### `ImportError: No module named 'fastapi'`

```bash
pip install -e ".[web]"
```

### Tesseract is slow on large corpora

```bash
# Increase parallelism if your machine allows it; as written, this command
# processes the corpus sequentially (the default)
picarones run --corpus ./corpus/ --engines tesseract
```

---

## Uninstalling

```bash
# Inside the virtual environment
pip uninstall picarones

# Delete the SQLite history (optional)
rm -rf ~/.picarones/

# Delete the virtual environment
deactivate
rm -rf .venv/
```
@@ -0,0 +1,221 @@
# Makefile for Picarones
# Usage: make <target>
# Main targets: install, test, demo, serve, build, build-exe, docker-build, clean

.PHONY: all help install install-dev install-web install-all \
        test test-cov test-fast test-sprint9 lint typecheck \
        demo demo-json demo-history demo-robustness \
        serve serve-public serve-dev build build-exe build-exe-onefile \
        docker-build docker-run docker-compose-up docker-compose-down docker-compose-logs \
        clean clean-all info engines history-demo changelog version

PYTHON     := python3
PIP        := pip
VENV       := .venv
VENV_BIN   := $(VENV)/bin
PICARONES  := $(VENV_BIN)/picarones
PYTEST     := $(VENV_BIN)/pytest
PACKAGE    := picarones

# Colors
BOLD  := \033[1m
GREEN := \033[32m
CYAN  := \033[36m
RESET := \033[0m

# ──────────────────────────────────────────────────────────────────
# Help
# ──────────────────────────────────────────────────────────────────

help: ## Show this help
	@echo ""
	@echo "$(BOLD)Picarones: available commands$(RESET)"
	@echo ""
	@grep -E '^[a-zA-Z0-9_-]+:.*## ' $(MAKEFILE_LIST) \
		| sort \
		| awk 'BEGIN {FS = ":.*## "}; {printf "  $(CYAN)%-18s$(RESET) %s\n", $$1, $$2}'
	@echo ""

all: install test ## Install and test

# ──────────────────────────────────────────────────────────────────
# Installation
# ──────────────────────────────────────────────────────────────────

$(VENV):
	$(PYTHON) -m venv $(VENV)

install: $(VENV) ## Install Picarones in editable mode (base dependencies)
	$(VENV_BIN)/pip install --upgrade pip
	$(VENV_BIN)/pip install -e .
	@echo "$(GREEN)✓ Base installation complete$(RESET)"
	@echo "  Activate the environment: source $(VENV)/bin/activate"

install-dev: $(VENV) ## Install with the development dependencies (tests, lint)
	$(VENV_BIN)/pip install --upgrade pip
	$(VENV_BIN)/pip install -e ".[dev]"
	@echo "$(GREEN)✓ Dev installation complete$(RESET)"

install-web: $(VENV) ## Install with the web interface (FastAPI + uvicorn) and dev dependencies
	$(VENV_BIN)/pip install --upgrade pip
	$(VENV_BIN)/pip install -e ".[web,dev]"
	@echo "$(GREEN)✓ Web installation complete$(RESET)"

install-all: $(VENV) ## Install with all extras (web, HuggingFace, dev)
	$(VENV_BIN)/pip install --upgrade pip
	$(VENV_BIN)/pip install -e ".[web,hf,dev]"
	@echo "$(GREEN)✓ Full installation complete$(RESET)"

# ──────────────────────────────────────────────────────────────────
# Tests
# ──────────────────────────────────────────────────────────────────

test: ## Run the full test suite
	$(PYTEST) tests/ -q --tb=short
	@echo "$(GREEN)✓ Tests finished$(RESET)"

test-cov: ## Run the tests with an HTML coverage report
	$(PYTEST) tests/ --cov=$(PACKAGE) --cov-report=html --cov-report=term-missing -q
	@echo "$(GREEN)✓ Coverage report: htmlcov/index.html$(RESET)"

test-fast: ## Run the tests and stop at the first failure (-x)
	$(PYTEST) tests/ -q --tb=short -x

test-sprint9: ## Run the Sprint 9 tests only
	$(PYTEST) tests/test_sprint9_packaging.py -v

# ──────────────────────────────────────────────────────────────────
# Code quality
# ──────────────────────────────────────────────────────────────────

lint: ## Check code style (ruff if available, otherwise flake8)
	@if command -v ruff > /dev/null 2>&1; then \
		ruff check $(PACKAGE)/ tests/; \
	elif $(VENV_BIN)/python -m ruff --version > /dev/null 2>&1; then \
		$(VENV_BIN)/python -m ruff check $(PACKAGE)/ tests/; \
	elif command -v flake8 > /dev/null 2>&1; then \
		flake8 $(PACKAGE)/ tests/ --max-line-length=100 --ignore=E501,W503; \
	else \
		echo "No linter available (install ruff: pip install ruff)"; \
	fi

# Test for mypy first, so that real type errors are not reported as "not installed"
typecheck: ## Type-check with mypy (if installed)
	@if $(VENV_BIN)/python -m mypy --version > /dev/null 2>&1; then \
		$(VENV_BIN)/python -m mypy $(PACKAGE)/ --ignore-missing-imports --no-strict-optional; \
	else \
		echo "mypy is not installed: pip install mypy"; \
	fi

# ──────────────────────────────────────────────────────────────────
# Demo
# ──────────────────────────────────────────────────────────────────

demo: ## Generate a full demo report (rapport_demo.html)
	$(PICARONES) demo --docs 12 --output rapport_demo.html \
		--with-history --with-robustness
	@echo "$(GREEN)✓ Demo report: rapport_demo.html$(RESET)"
	@echo "  Open: file://$(CURDIR)/rapport_demo.html"

demo-json: ## Generate the demo report plus a JSON export
	$(PICARONES) demo --docs 12 --output rapport_demo.html --json-output resultats_demo.json
	@echo "$(GREEN)✓ Report: rapport_demo.html | JSON: resultats_demo.json$(RESET)"

demo-history: ## Demo of the longitudinal tracking
	$(PICARONES) history --demo --regression

demo-robustness: ## Demo of the robustness analysis
	mkdir -p /tmp/picarones_demo_corpus
	$(PICARONES) robustness \
		--corpus /tmp/picarones_demo_corpus \
		--engine tesseract \
		--demo \
		--degradations noise,blur,rotation

# ──────────────────────────────────────────────────────────────────
# Web server
# ──────────────────────────────────────────────────────────────────

serve: ## Start the local web interface (http://localhost:8000)
	$(PICARONES) serve --host 127.0.0.1 --port 8000

serve-public: ## Start the server on all interfaces (0.0.0.0:8000)
	$(PICARONES) serve --host 0.0.0.0 --port 8000

serve-dev: ## Start the server in development mode (auto-reload)
	$(PICARONES) serve --reload --verbose

# ──────────────────────────────────────────────────────────────────
# Build & packaging
# ──────────────────────────────────────────────────────────────────

build: ## Build the Python distribution (wheel + sdist)
	$(VENV_BIN)/pip install --upgrade build
	$(VENV_BIN)/python -m build
	@echo "$(GREEN)✓ Distribution: dist/$(RESET)"

build-exe: ## Build a standalone executable with PyInstaller
	@echo "$(CYAN)Building the standalone executable…$(RESET)"
	$(VENV_BIN)/pip install pyinstaller
	$(VENV_BIN)/pyinstaller picarones.spec --noconfirm
	@echo "$(GREEN)✓ Executable: dist/picarones/$(RESET)"

build-exe-onefile: ## Build a single-file executable (slower to start)
	$(VENV_BIN)/pip install pyinstaller
	$(VENV_BIN)/pyinstaller picarones.spec --noconfirm --onefile
	@echo "$(GREEN)✓ Executable: dist/picarones$(RESET)"

# ──────────────────────────────────────────────────────────────────
# Docker
# ──────────────────────────────────────────────────────────────────

docker-build: ## Build the Picarones Docker image
	docker build -t picarones:latest -t picarones:1.0.0 .
	@echo "$(GREEN)✓ Docker image: picarones:latest$(RESET)"

docker-run: ## Run Picarones in Docker (http://localhost:8000)
	docker run --rm -p 8000:8000 \
		-e OPENAI_API_KEY="$${OPENAI_API_KEY:-}" \
		-e ANTHROPIC_API_KEY="$${ANTHROPIC_API_KEY:-}" \
		-e MISTRAL_API_KEY="$${MISTRAL_API_KEY:-}" \
		-v "$(CURDIR)/corpus:/app/corpus:ro" \
		picarones:latest

# The Ollama service sits behind an optional Compose profile, so enable it explicitly
docker-compose-up: ## Start Picarones + Ollama with Docker Compose
	docker compose --profile ollama up -d
	@echo "$(GREEN)✓ Services started$(RESET)"
	@echo "  Picarones: http://localhost:8000"
	@echo "  Ollama   : http://localhost:11434"

docker-compose-down: ## Stop the Docker Compose services
	docker compose down

docker-compose-logs: ## Follow the Docker Compose logs
	docker compose logs -f picarones

# ──────────────────────────────────────────────────────────────────
# Cleanup
# ──────────────────────────────────────────────────────────────────

clean: ## Remove generated files (cache, build, dist)
	find . -type d -name __pycache__ -exec rm -rf {} + 2>/dev/null || true
	find . -type f -name "*.pyc" -delete 2>/dev/null || true
	find . -type f -name "*.pyo" -delete 2>/dev/null || true
	find . -type d -name "*.egg-info" -exec rm -rf {} + 2>/dev/null || true
	rm -rf dist/ build/ .eggs/ htmlcov/ .coverage .pytest_cache/
	@echo "$(GREEN)✓ Cleanup done$(RESET)"

clean-all: clean ## Also remove the virtual environment
	rm -rf $(VENV)/
	@echo "$(GREEN)✓ Virtual environment removed$(RESET)"

# ──────────────────────────────────────────────────────────────────
# Utilities
# ──────────────────────────────────────────────────────────────────

info: ## Show Picarones version information
	$(PICARONES) info

engines: ## List the available OCR engines
	$(PICARONES) engines

history-demo: ## Show the demo history
	$(PICARONES) history --demo --regression

changelog: ## Show the first 80 lines of the CHANGELOG
	@head -80 CHANGELOG.md

version: ## Show the current version
	@grep -m1 '^version' pyproject.toml | awk '{print $$3}' | tr -d '"'
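
The `version` target extracts the version by matching text in `pyproject.toml`. The same idea can be sketched in Python with an anchored pattern, so that keys which merely contain the word "version" are not picked up. The function name `read_version` is an assumption for illustration; a full TOML parser would be more robust than a regular expression.

```python
import re

# Match a line that starts with `version = "..."` (single or double quotes).
VERSION_RE = re.compile(r'^version\s*=\s*["\']([^"\']+)["\']', re.MULTILINE)


def read_version(pyproject_text):
    """Return the first `version = "..."` value found at the start of a line, or None."""
    match = VERSION_RE.search(pyproject_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    with open("pyproject.toml", encoding="utf-8") as handle:
        print(read_version(handle.read()))
```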
|
|
@@ -1,119 +1,285 @@
|
|
| 1 |
# Picarones
|
| 2 |
|
| 3 |
-
> **Plateforme de comparaison de moteurs OCR/HTR pour documents patrimoniaux**
|
| 4 |
-
>
|
|
|
|
| 5 |
|
| 6 |
-
|
|
|
|
|
|
|
| 7 |
|
| 8 |
---
|
| 9 |
|
| 10 |
-
|
|
|
|
|
|
|
| 11 |
|
| 12 |
-
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
- Interface abstraite `BaseOCREngine` pour ajouter facilement de nouveaux moteurs
|
| 16 |
-
- Calcul **CER** et **WER** via `jiwer` (brut, NFC, caseless, normalisé, MER, WIL)
|
| 17 |
-
- Chargement de **corpus** depuis dossier local (paires image / `.gt.txt`)
|
| 18 |
-
- **Export JSON** structuré des résultats avec classement
|
| 19 |
-
- **CLI** `click` : commandes `run`, `metrics`, `engines`, `info`
|
| 20 |
|
| 21 |
---
|
| 22 |
|
| 23 |
-
##
|
| 24 |
|
| 25 |
```bash
|
|
|
|
|
|
|
|
|
|
| 26 |
pip install -e .
|
| 27 |
|
| 28 |
-
#
|
| 29 |
-
# Ubuntu/Debian
|
| 30 |
-
|
| 31 |
|
| 32 |
-
#
|
| 33 |
-
|
|
|
|
|
|
|
|
|
|
| 34 |
```
|
| 35 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 36 |
## Usage rapide
|
| 37 |
|
| 38 |
```bash
|
| 39 |
-
#
|
|
|
|
|
|
|
|
|
|
| 40 |
picarones run --corpus ./mon_corpus/ --engines tesseract --output resultats.json
|
| 41 |
|
| 42 |
-
#
|
| 43 |
-
picarones
|
| 44 |
|
| 45 |
# Calculer CER/WER entre deux fichiers
|
| 46 |
picarones metrics --reference gt.txt --hypothesis ocr.txt
|
| 47 |
|
| 48 |
-
#
|
| 49 |
-
picarones
|
|
|
|
|
|
|
|
|
|
|
|
|
| 50 |
|
| 51 |
-
#
|
| 52 |
-
picarones
|
|
|
|
|
|
|
|
|
|
| 53 |
```
|
| 54 |
|
| 55 |
## Structure du projet
|
| 56 |
|
| 57 |
```
|
| 58 |
picarones/
|
| 59 |
-
├──
|
| 60 |
-
├──
|
| 61 |
├── core/
|
| 62 |
-
│ ├── corpus.py
|
| 63 |
-
│ ├── metrics.py
|
| 64 |
-
│ ├──
|
| 65 |
-
│
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
├──
|
| 72 |
-
├──
|
| 73 |
-
├──
|
| 74 |
-
|
| 75 |
```
|
| 76 |
|
| 77 |
-
|
| 78 |
|
| 79 |
-
|
| 80 |
|
| 81 |
-
```
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
├── page_002.gt.txt
|
| 87 |
-
└── ...
|
| 88 |
-
```
|
| 89 |
|
| 90 |
-
#
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
"ranking": [
|
| 98 |
-
{ "engine": "tesseract", "mean_cer": 0.043, "mean_wer": 0.112 }
|
| 99 |
-
],
|
| 100 |
-
"engine_reports": [...]
|
| 101 |
-
}
|
| 102 |
```
|
| 103 |
|
| 104 |
-
|
| 105 |
|
| 106 |
```bash
|
| 107 |
-
|
| 108 |
```
|
| 109 |
|
| 110 |
-
|
| 111 |
|
| 112 |
-
|
| 113 |
-
|--------|-----------|
|
| 114 |
-
| **Sprint 1** ✅ | Structure, adaptateurs Tesseract + Pero OCR, CER/WER, JSON, CLI |
|
| 115 |
-
| Sprint 2 | Rapport HTML interactif avec diff coloré |
|
| 116 |
-
| Sprint 3 | Pipelines OCR+LLM (GPT-4o, Claude) |
|
| 117 |
-
| Sprint 4 | APIs cloud OCR, import IIIF, normalisation diplomatique |
|
| 118 |
-
| Sprint 5 | Métriques avancées : matrice de confusion unicode, ligatures |
|
| 119 |
-
| Sprint 6 | Interface web FastAPI, import HTR-United / HuggingFace |
|
|
|
|
# Picarones

> **Plateforme de comparaison et d'évaluation de moteurs OCR/HTR pour documents patrimoniaux**
>
> BnF — Département numérique · [Apache 2.0](LICENSE)

[CI](https://github.com/bnf/picarones/actions/workflows/ci.yml)
[Python](https://www.python.org/downloads/)
[Licence](LICENSE)

---

**Picarones** est un outil open-source conçu pour comparer rigoureusement des moteurs OCR et HTR
(Tesseract, Pero OCR, Kraken, APIs cloud…) ainsi que des pipelines OCR+LLM sur des corpus de
documents historiques — manuscrits, imprimés anciens, archives.

---

*[English version below](#english)*

---

## Sommaire

- [Fonctionnalités](#fonctionnalités)
- [Installation rapide](#installation-rapide)
- [Usage rapide](#usage-rapide)
- [Moteurs supportés](#moteurs-supportés)
- [Structure du projet](#structure-du-projet)
- [Variables d'environnement](#variables-denvironnement)
- [Roadmap](#roadmap)
- [English](#english)

---

## Fonctionnalités

### Métriques adaptées aux documents patrimoniaux

- **CER** (Character Error Rate) : brut, NFC, caseless, diplomatique (ſ=s, u=v, i=j…)
- **WER**, MER, WIL avec tokenisation historique
- **Matrice de confusion unicode** — fingerprint de chaque moteur
- **Scores ligatures** : fi, fl, ff, œ, æ, ꝑ, ꝓ…
- **Scores diacritiques** : accents, cédilles, trémas
- **Taxonomie des erreurs** en 10 classes (confusion visuelle, abréviation, ligature, casse…)
- **Intervalles de confiance à 95%** par bootstrap — tests de Wilcoxon pour la significativité
- **Score de difficulté intrinsèque** par document (indépendant des moteurs)
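
To make the CER variants listed above concrete, here is a minimal sketch of a character error rate with optional NFC, caseless and diplomatic normalization. It is a plain standard-library illustration, not the Picarones implementation: the function names and the three-entry `DIPLOMATIC_MAP` (built from the examples ſ=s, u=v, i=j) are assumptions, and a real diplomatic table would be larger.

```python
import unicodedata

# Illustrative diplomatic equivalences from the examples above (ſ=s, u=v, i=j).
DIPLOMATIC_MAP = str.maketrans({"ſ": "s", "v": "u", "j": "i"})


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def cer(reference, hypothesis, nfc=False, caseless=False, diplomatic=False):
    """Character error rate: edit distance divided by the reference length."""
    def prepare(text):
        if nfc:
            text = unicodedata.normalize("NFC", text)
        if caseless:
            text = text.casefold()
        if diplomatic:
            text = text.translate(DIPLOMATIC_MAP)
        return text

    ref, hyp = prepare(reference), prepare(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return levenshtein(ref, hyp) / len(ref)
```

With this sketch, `cer("maiſon", "maison")` counts one substitution out of six characters, while `cer("maiſon", "maison", diplomatic=True)` is zero because both strings normalize to the same text. That difference is exactly what separates a raw CER from a diplomatic one.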

### Pipelines OCR+LLM

- Chaînes composables : `tesseract → gpt-4o`, `pero_ocr → claude-sonnet`, LLM zero-shot…
- Modes : texte seul, image+texte, zero-shot
- Détection de **sur-normalisation LLM** : le LLM modernise-t-il à tort la graphie médiévale ?
- Bibliothèque de prompts pour manuscrits médiévaux, imprimés anciens, latin…

### Import de corpus

| Source | Commande |
|--------|----------|
| Dossier local | `picarones run --corpus ./corpus/` |
| IIIF (Gallica, Bodleian, BL…) | `picarones import iiif <url>` |
| Gallica (API BnF + OCR) | `GallicaClient` / `picarones import iiif` |
| HuggingFace Datasets | `picarones import hf <dataset>` |
| HTR-United | `picarones import htr-united` |
| eScriptorium | `EScriptoriumClient` |

### Rapport HTML interactif

- Fichier HTML **auto-contenu**, lisible hors-ligne
- Tableau de classement trié, graphiques radar, histogrammes
- Vue galerie avec filtres dynamiques et badges CER colorés
- Diff coloré façon GitHub, scroll synchronisé N-way
- Vue spécifique OCR+LLM : diff triple GT / OCR brut / après correction
- Vue Caractères : matrice de confusion unicode interactive
- Export CSV, JSON, ALTO XML, PAGE XML, images annotées

### Suivi longitudinal & robustesse

- **Base SQLite** optionnelle pour historiser les runs
- **Courbes d'évolution CER** dans le temps par moteur
- **Détection automatique des régressions** entre deux runs
- **Analyse de robustesse** : bruit, flou, rotation, réduction de résolution, binarisation
- Commandes `picarones history`, `picarones robustness`
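
Regression detection between two runs can be pictured as a comparison of per-engine mean CER values. The sketch below is only an illustration of that idea: the function name `detect_regressions`, the dictionary layout and the `tolerance` threshold are assumptions, and the criterion Picarones actually applies is not described here.

```python
def detect_regressions(previous, current, tolerance=0.01):
    """Return engines whose mean CER rose by more than `tolerance` between two runs.

    `previous` and `current` map an engine name to its mean CER (0.0 to 1.0).
    Engines missing from the previous run are ignored, since there is no baseline.
    """
    regressions = {}
    for engine, new_cer in current.items():
        old_cer = previous.get(engine)
        if old_cer is not None and new_cer - old_cer > tolerance:
            regressions[engine] = (old_cer, new_cer)
    return regressions


if __name__ == "__main__":
    before = {"tesseract": 0.043, "pero_ocr": 0.050}
    after = {"tesseract": 0.060, "pero_ocr": 0.049}
    for engine, (old, new) in detect_regressions(before, after).items():
        print(f"{engine}: CER {old:.3f} -> {new:.3f}")
```

A fixed tolerance keeps small run-to-run fluctuations from being flagged; a statistical test on per-document scores would be a stricter alternative.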
+
+---
+
+## Quick install

 ```bash
+# Clone and install
+git clone https://github.com/bnf/picarones.git
+cd picarones
 pip install -e .

+# Tesseract (system binary, required for the Tesseract engine)
+# Ubuntu/Debian
+sudo apt install tesseract-ocr tesseract-ocr-fra tesseract-ocr-lat

+# macOS
+brew install tesseract
+
+# Check the installation
+picarones engines
 ```

+See [INSTALL.md](INSTALL.md) for a detailed guide (Linux, macOS, Windows, Docker).
+
+---
+
 ## Quick usage

 ```bash
+# Demo report (no OCR engine needs to be installed)
+picarones demo
+
+# Benchmark on a local corpus
 picarones run --corpus ./mon_corpus/ --engines tesseract --output resultats.json

+# Generate the interactive HTML report
+picarones report --results resultats.json --output rapport.html

 # Compute CER/WER between two files
 picarones metrics --reference gt.txt --hypothesis ocr.txt

+# Import from Gallica (IIIF)
+picarones import iiif https://gallica.bnf.fr/ark:/12148/xxx/manifest.json --pages 1-10
+
+# Longitudinal tracking (run history)
+picarones history --demo
+picarones history --engine tesseract --regression

+# Robustness analysis
+picarones robustness --corpus ./gt/ --engine tesseract --demo
+
+# Local web interface
+picarones serve
 ```
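For readers unfamiliar with the metrics behind `picarones metrics`, the following is a minimal pure-Python sketch of CER and WER as normalised edit distances. It is an illustration only: the project tree lists `jiwer` as the metrics backend, and the `levenshtein`, `cer` and `wer` helpers below are hypothetical names, not Picarones functions.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)
```

With this sketch, reading the long s in `maiſtre` as `f` counts as one substitution out of seven characters, so `cer("maiſtre", "maiftre")` is 1/7.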

+---
+
+## Supported engines
+
+| Engine | Type | Installation |
+|--------|------|--------------|
+| **Tesseract 5** | Local CLI | `pip install pytesseract` + system binary |
+| **Pero OCR** | Local Python | `pip install pero-ocr` |
+| **Kraken** | Local Python | `pip install kraken` |
+| **Mistral OCR** | REST API | `MISTRAL_API_KEY` key |
+| **GPT-4o** (LLM) | REST API | `OPENAI_API_KEY` key |
+| **Claude Sonnet** (LLM) | REST API | `ANTHROPIC_API_KEY` key |
+| **Mistral Large** (LLM) | REST API | `MISTRAL_API_KEY` key |
+| **Google Vision** | REST API | Google JSON credentials |
+| **AWS Textract** | REST API | AWS credentials |
+| **Azure Doc. Intel.** | REST API | Azure endpoint + key |
+| **Ollama** (local LLM) | Local | `ollama serve` |
+| **Custom engine** | CLI/API YAML | YAML declaration, no code |
+
+---
+
 ## Project structure

 ```
 picarones/
+├── cli.py                # Click CLI (run, demo, report, history, robustness…)
+├── fixtures.py           # Realistic dummy test data
 ├── core/
+│   ├── corpus.py         # Corpus loading (folder, ALTO, PAGE XML…)
+│   ├── metrics.py        # CER, WER, MER, WIL (jiwer)
+│   ├── normalization.py  # Unicode normalisation, diplomatic profiles
+│   ├── statistics.py     # Bootstrap CI, Wilcoxon, correlations
+│   ├── confusion.py      # Unicode confusion matrix
+│   ├── char_scores.py    # Ligature and diacritic scores
+│   ├── taxonomy.py       # Error taxonomy (10 classes)
+│   ├── structure.py      # Structural analysis
+│   ├── image_quality.py  # Image quality metrics
+│   ├── difficulty.py     # Intrinsic difficulty score
+│   ├── history.py        # SQLite longitudinal tracking
+│   ├── robustness.py     # Robustness analysis
+│   ├── results.py        # Data models + JSON export
+│   └── runner.py         # Benchmark orchestrator
+├── engines/              # OCR engine adapters
+├── llm/                  # LLM adapters (OpenAI, Anthropic, Mistral, Ollama)
+├── importers/            # Import sources (IIIF, Gallica, eScriptorium, HF…)
+├── pipelines/            # OCR+LLM orchestrator
+├── report/               # HTML report generator
+└── web/                  # FastAPI web interface
+tests/                    # Unit and integration tests (743 tests)
 ```

+---

+## Environment variables

+```bash
+# LLM APIs (depending on the engines used)
+export OPENAI_API_KEY="sk-..."
+export ANTHROPIC_API_KEY="sk-ant-..."
+export MISTRAL_API_KEY="..."

+# Cloud OCR APIs (optional)
+export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
+export AWS_ACCESS_KEY_ID="..."
+export AWS_SECRET_ACCESS_KEY="..."
+export AWS_DEFAULT_REGION="eu-west-1"
+export AZURE_DOC_INTEL_ENDPOINT="https://..."
+export AZURE_DOC_INTEL_KEY="..."
 ```
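Since each cloud engine only works when its credentials are exported, a small pre-flight check can report what is missing before a benchmark starts. The sketch below is an assumption-laden illustration: the variable names come from the block above, but the `REQUIRED_ENV` mapping, the engine keys and the `missing_env` helper are hypothetical and not part of the Picarones API.

```python
import os

# Hypothetical mapping from an engine label to the variables it needs.
REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "mistral": ["MISTRAL_API_KEY"],
    "google_vision": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "aws_textract": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
                     "AWS_DEFAULT_REGION"],
    "azure_doc_intel": ["AZURE_DOC_INTEL_ENDPOINT", "AZURE_DOC_INTEL_KEY"],
}


def missing_env(engine, environ=None):
    """Return the variables required by `engine` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV[engine] if not env.get(name)]


if __name__ == "__main__":
    for engine in REQUIRED_ENV:
        missing = missing_env(engine)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{engine}: {status}")
```

Passing an explicit `environ` mapping makes the helper easy to test without touching the real process environment.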

+---
+
+## Roadmap
+
+| Sprint | Status | Deliverables |
+|--------|--------|--------------|
+| Sprint 1 | ✅ | Structure, Tesseract, Pero OCR, CER/WER, CLI |
+| Sprint 2 | ✅ | HTML report v1, coloured diff, gallery |
+| Sprint 3 | ✅ | OCR+LLM pipelines, GPT-4o, Claude |
+| Sprint 4 | ✅ | Cloud APIs, IIIF import, diplomatic normalisation |
+| Sprint 5 | ✅ | Advanced metrics: Unicode confusion, ligatures, taxonomy |
+| Sprint 6 | ✅ | FastAPI web interface, HTR-United, HuggingFace, Ollama |
+| Sprint 7 | ✅ | HTML report v2: Wilcoxon, clustering, scatter plots |
+| Sprint 8 | ✅ | eScriptorium, Gallica API, longitudinal history, robustness |
+| Sprint 9 | ✅ | Documentation, packaging, Docker, CI/CD |
+
+---
+
+## Contributing
+
+See [CONTRIBUTING.md](CONTRIBUTING.md) to add an OCR engine or an LLM adapter, or to submit a pull request.
+
+---
+
+## Licence
+
+Apache License 2.0 — © BnF — Département numérique
+
+---
+
+---
+
+# English
+
+## Picarones — OCR/HTR Benchmark Platform for Heritage Documents
+
+**Picarones** is an open-source platform for rigorously comparing OCR and HTR engines (Tesseract,
+Pero OCR, Kraken, cloud APIs…) and OCR+LLM pipelines on historical document corpora — manuscripts,
+early printed books, archives.
+
+### Key Features
+
+- **Metrics tailored to historical documents**: CER (raw, NFC, caseless, diplomatic), WER, MER,
+  WIL; unicode confusion matrix; ligature and diacritic scores; 10-class error taxonomy; bootstrap
+  confidence intervals; Wilcoxon significance tests
+- **OCR+LLM pipelines**: composable chains (`tesseract → gpt-4o`), three modes (text-only,
+  image+text, zero-shot), LLM over-normalisation detection
+- **Corpus import**: local folder, IIIF (Gallica, Bodleian, BL…), Gallica API + OCR, HuggingFace
+  Datasets, HTR-United, eScriptorium
+- **Interactive HTML report**: self-contained file, sortable ranking, gallery, coloured diff,
+  unicode character view, CSV/JSON/ALTO/PAGE XML export
+- **Longitudinal tracking**: SQLite benchmark history, CER evolution curves, automatic regression
+  detection
+- **Robustness analysis**: degraded image versions (noise, blur, rotation, resolution,
+  binarisation), critical threshold detection
+
+### Quick Start

 ```bash
+pip install -e .
+sudo apt install tesseract-ocr tesseract-ocr-fra  # Ubuntu/Debian
+picarones demo                                    # demo report without any engine installed
+picarones engines                                 # list available engines
+picarones run --corpus ./corpus/ --engines tesseract --output results.json
+picarones report --results results.json
 ```

+See [INSTALL.md](INSTALL.md) for detailed installation on Linux, macOS, Windows, and Docker.
+
+### Supported Engines
+
+Tesseract 5 · Pero OCR · Kraken · Mistral OCR · GPT-4o · Claude Sonnet · Mistral Large ·
+Google Vision · AWS Textract · Azure Document Intelligence · Ollama (local LLMs) · Custom YAML engine
+
+### License

+Apache License 2.0 — © BnF — Département numérique
@@ -0,0 +1,111 @@
+# docker-compose.yml — Picarones
+#
+# Available services:
+#   - picarones : web interface + benchmarks (port 8000)
+#   - ollama    : local LLMs (port 11434, optional profile)
+#
+# Usage:
+#   docker compose up -d                     # Picarones only
+#   docker compose --profile ollama up -d    # Picarones + Ollama
+#   docker compose down
+#
+# Environment variables:
+#   Create a .env file at the project root (see .env.example)
+
+services:
+
+  # ────────────────────────────────────────────────
+  # Main service: Picarones
+  # ────────────────────────────────────────────────
+  picarones:
+    build:
+      context: .
+      dockerfile: Dockerfile
+      target: runtime
+    image: picarones:latest
+    container_name: picarones
+    restart: unless-stopped
+    ports:
+      - "${PICARONES_PORT:-8000}:8000"
+    volumes:
+      # Corpus to benchmark (read-only)
+      - "${CORPUS_DIR:-./corpus}:/app/corpus:ro"
+      # Generated reports (read/write)
+      - "${RAPPORTS_DIR:-./rapports}:/app/rapports:rw"
+      # SQLite history (persistent)
+      - picarones_history:/home/picarones/.picarones
+    environment:
+      # LLM APIs
+      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
+      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
+      - MISTRAL_API_KEY=${MISTRAL_API_KEY:-}
+      # OCR cloud APIs
+      - GOOGLE_APPLICATION_CREDENTIALS=${GOOGLE_APPLICATION_CREDENTIALS:-}
+      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID:-}
+      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY:-}
+      - AWS_DEFAULT_REGION=${AWS_DEFAULT_REGION:-eu-west-1}
+      - AZURE_DOC_INTEL_ENDPOINT=${AZURE_DOC_INTEL_ENDPOINT:-}
+      - AZURE_DOC_INTEL_KEY=${AZURE_DOC_INTEL_KEY:-}
+      # Ollama (if the ollama service is active)
+      - OLLAMA_BASE_URL=http://ollama:11434
+      # Python
+      - PYTHONUNBUFFERED=1
+      - PYTHONIOENCODING=utf-8
+    depends_on:
+      ollama: { condition: service_started, required: false }  # optional: ollama sits behind a profile (Compose 2.20+)
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
+      interval: 30s
+      timeout: 10s
+      retries: 3
+      start_period: 20s
+    networks:
+      - picarones_net
+
+  # ────────────────────────────────────────────────
+  # Optional service: Ollama (local LLMs)
+  # Enable with: docker compose --profile ollama up
+  # ────────────────────────────────────────────────
+  ollama:
+    image: ollama/ollama:latest
+    container_name: picarones_ollama
+    restart: unless-stopped
+    profiles:
+      - ollama
+    ports:
+      - "${OLLAMA_PORT:-11434}:11434"
+    volumes:
+      - ollama_models:/root/.ollama
+    environment:
+      - OLLAMA_ORIGINS=*
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: all
+              capabilities: [gpu]
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
+      interval: 30s
+      timeout: 10s
+      retries: 5
+      start_period: 30s
+    networks:
+      - picarones_net
+
+# ────────────────────────────────────────────────
+# Persistent volumes
+# ────────────────────────────────────────────────
+volumes:
+  picarones_history:
+    driver: local
+  ollama_models:
+    driver: local
+
+# ────────────────────────────────────────────────
+# Internal network
+# ────────────────────────────────────────────────
+networks:
+  picarones_net:
+    driver: bridge
@@ -0,0 +1,155 @@
+# picarones.spec — PyInstaller configuration
+#
+# Builds a standalone Picarones executable for Linux, macOS and Windows.
+# The executable bundles Python and all dependencies; no installation is required.
+#
+# Usage:
+#   pip install pyinstaller
+#   pyinstaller picarones.spec --noconfirm
+#
+# Output:
+#   dist/picarones/picarones      (Linux/macOS)
+#   dist/picarones/picarones.exe  (Windows)
+#
+# For a single file (slower startup):
+#   pyinstaller picarones.spec --noconfirm --onefile
+
+import sys
+from pathlib import Path
+
+# Project root (SPECPATH, the directory containing this spec file, is injected by PyInstaller)
+ROOT = Path(SPECPATH)  # noqa: F821 (SPECPATH is defined by PyInstaller; `spec_file` is not)
+
+# ──────────────────────────────────────────────────────────────────
+# Dependency analysis
+# ──────────────────────────────────────────────────────────────────
+a = Analysis(
+    # Entry point: the main CLI script
+    scripts=[str(ROOT / "picarones" / "__main__.py")],
+
+    # Module search paths
+    pathex=[str(ROOT)],
+
+    # Additional binary dependencies (DLLs, .so)
+    binaries=[],
+
+    # Data files to bundle
+    datas=[
+        # Configuration data
+        (str(ROOT / "picarones"), "picarones"),
+        # LLM prompts (if present)
+        # (str(ROOT / "prompts"), "prompts"),
+    ],
+
+    # Hidden imports (not detected automatically by PyInstaller)
+    hiddenimports=[
+        # CLI
+        "picarones.cli",
+        "picarones.core.corpus",
+        "picarones.core.metrics",
+        "picarones.core.results",
+        "picarones.core.runner",
+        "picarones.core.normalization",
+        "picarones.core.statistics",
+        "picarones.core.confusion",
+        "picarones.core.char_scores",
+        "picarones.core.taxonomy",
+        "picarones.core.structure",
+        "picarones.core.image_quality",
+        "picarones.core.difficulty",
+        "picarones.core.history",
+        "picarones.core.robustness",
+        "picarones.engines.base",
+        "picarones.engines.tesseract",
+        "picarones.engines.pero_ocr",
+        "picarones.engines.mistral_ocr",
+        "picarones.engines.google_vision",
+        "picarones.engines.azure_doc_intel",
+        "picarones.llm.base",
+        "picarones.llm.openai_adapter",
+        "picarones.llm.anthropic_adapter",
+        "picarones.llm.mistral_adapter",
+        "picarones.llm.ollama_adapter",
+        "picarones.importers.iiif",
+        "picarones.importers.gallica",
+        "picarones.importers.escriptorium",
+        "picarones.importers.huggingface",
+        "picarones.importers.htr_united",
+        "picarones.pipelines.base",
+        "picarones.pipelines.over_normalization",
+        "picarones.report.generator",
+        "picarones.report.diff_utils",
+        "picarones.fixtures",
+        # Third-party dependencies
+        "click",
+        "jiwer",
+        "PIL",
+        "PIL.Image",
+        "PIL.ImageFilter",
+        "PIL.ImageOps",
+        "yaml",
+        "tqdm",
+        "numpy",
+        "pytesseract",
+        # SQLite (stdlib, but sometimes missed)
+        "sqlite3",
+        # Encoding
+        "unicodedata",
+    ],
+
+    # Modules to exclude to reduce size
+    excludes=[
+        "tkinter",
+        "matplotlib",
+        "scipy",
+        "sklearn",
+        "pandas",
+        "IPython",
+        "jupyter",
+        "notebook",
+    ],
+
+    # Collection options
+    win_no_prefer_redirects=False,
+    win_private_assemblies=False,
+    noarchive=False,
+)
+
+# ──────────────────────────────────────────────────────────────────
+# PYZ archive (compiled Python modules)
+# ──────────────────────────────────────────────────────────────────
+pyz = PYZ(a.pure, a.zipped_data)  # noqa: F821
+
+# ──────────────────────────────────────────────────────────────────
+# Main executable
+# ──────────────────────────────────────────────────────────────────
+exe = EXE(  # noqa: F821
+    pyz,
+    a.scripts,
+    [],
+    exclude_binaries=True,
+    name="picarones",
+    debug=False,
+    bootloader_ignore_signals=False,
+    strip=False,
+    upx=True,      # UPX compression if available
+    console=True,  # Console mode (no graphical window)
+    disable_windowed_traceback=False,
+    argv_emulation=False,
+    # Icon (optional)
+    # icon=str(ROOT / "assets" / "picarones.ico"),
+)
+
+# ──────────────────────────────────────────────────────────────────
+# Final collection (dist/picarones/ folder)
+# ──────────────────────────────────────────────────────────────────
+coll = COLLECT(  # noqa: F821
+    exe,
+    a.binaries,
+    a.zipfiles,
+    a.datas,
+    strip=False,
+    upx=True,
+    upx_exclude=[],
+    name="picarones",
+)
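The hand-written `hiddenimports` list above has to be updated whenever a module is added to the package. As a hedged illustration of how such a list could be generated instead, the sketch below walks a package with the standard-library `pkgutil` module. `list_submodules` is a hypothetical helper, not part of Picarones; PyInstaller also ships `collect_submodules` in `PyInstaller.utils.hooks` for the same purpose.

```python
import importlib
import pkgutil


def list_submodules(package_name):
    """Return the package name plus the dotted names of all its submodules."""
    package = importlib.import_module(package_name)
    names = [package_name]
    # Only packages have a __path__; plain modules simply yield nothing here.
    search_path = getattr(package, "__path__", [])
    for info in pkgutil.walk_packages(search_path, prefix=package_name + "."):
        names.append(info.name)
    return sorted(names)


if __name__ == "__main__":
    # Demonstrated on a stdlib package so the sketch runs anywhere;
    # in the spec file the argument would be "picarones".
    for name in list_submodules("json"):
        print(name)
```

Note that `walk_packages` imports subpackages in order to recurse into them, so in a spec file this runs the package's `__init__` modules at build time.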
@@ -5,5 +5,5 @@ BnF — Département numérique, 2025.
 Licence Apache 2.0.
 """

-__version__ = "
+__version__ = "1.0.0"
 __author__ = "BnF — Département numérique"
@@ -0,0 +1,13 @@
+"""Entry point for running via ``python -m picarones``.
+
+Lets you use Picarones when the ``picarones`` command is not on the PATH:
+
+    python -m picarones demo
+    python -m picarones run --corpus ./corpus/ --engines tesseract
+    python -m picarones --help
+"""
+
+from picarones.cli import cli
+
+if __name__ == "__main__":
+    cli()
@@ -4,19 +4,23 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "picarones"
-version = "
-description = "Plateforme de comparaison de moteurs OCR pour documents patrimoniaux"
+version = "1.0.0"
+description = "Plateforme de comparaison de moteurs OCR/HTR pour documents patrimoniaux"
 readme = "README.md"
 requires-python = ">=3.11"
 license = { text = "Apache-2.0" }
 authors = [{ name = "Bibliothèque nationale de France — Département numérique" }]
-keywords = ["ocr", "htr", "patrimoine", "benchmark", "cer", "wer"]
+keywords = ["ocr", "htr", "patrimoine", "benchmark", "cer", "wer", "gallica", "escriptorium", "iiif"]
 classifiers = [
-    "Development Status ::
+    "Development Status :: 5 - Production/Stable",
     "Programming Language :: Python :: 3.11",
     "Programming Language :: Python :: 3.12",
     "License :: OSI Approved :: Apache Software License",
+    "Operating System :: OS Independent",
     "Topic :: Scientific/Engineering :: Artificial Intelligence",
+    "Topic :: Text Processing :: Linguistic",
+    "Intended Audience :: Science/Research",
+    "Natural Language :: French",
 ]
 dependencies = [
     "click>=8.1.0",
@@ -28,11 +32,38 @@ dependencies = [
     "numpy>=1.24.0",
 ]

+[project.urls]
+Homepage = "https://github.com/bnf/picarones"
+Documentation = "https://github.com/bnf/picarones/blob/main/INSTALL.md"
+Repository = "https://github.com/bnf/picarones"
+Changelog = "https://github.com/bnf/picarones/blob/main/CHANGELOG.md"
+"Bug Tracker" = "https://github.com/bnf/picarones/issues"
+
 [project.optional-dependencies]
+# Development and tests
 dev = ["pytest>=7.4.0", "pytest-cov>=4.1.0", "httpx>=0.27.0"]
-
+# FastAPI web interface
 web = ["fastapi>=0.111.0", "uvicorn[standard]>=0.29.0", "httpx>=0.27.0"]
+# HuggingFace Datasets import
 hf = ["datasets>=2.19.0"]
+# Optional OCR engines
+pero = ["pero-ocr>=0.1.0"]
+kraken = ["kraken>=4.0.0"]
+# LLM adapters
+llm = [
+    "openai>=1.0.0",
+    "anthropic>=0.20.0",
+]
+# OCR cloud APIs
+ocr-cloud = [
+    "google-cloud-vision>=3.0.0",
+    "boto3>=1.34.0",
+    "azure-ai-formrecognizer>=3.3.0",
+]
+# Full installation (web, hf, llm, dev; does not include pero, kraken or ocr-cloud)
+all = [
+    "picarones[web,hf,llm,dev]",
+]

 [project.scripts]
 picarones = "picarones.cli:cli"
@@ -796,14 +796,14 @@ body.present-mode nav .meta { display: none; }
 </main>

 <footer>
-Généré par <strong>Picarones</strong>
 — BnF, Département numérique
 — <span id="footer-date"></span>
 </footer>

 <!-- ── Données embarquées ──────────────────────────────────────────── -->
 <script>
-
const DATA = {"meta":{"corpus_name":"Corpus de test — Chroniques médiévales BnF","corpus_source":"/corpus/chroniques/","document_count":3,"run_date":"2026-03-05T15:25:49.934520+00:00","picarones_version":"0.1.0","metadata":{"description":"Données de démonstration générées par picarones.fixtures","script":"gothique textura","langue":"Français médiéval (XIVe-XVe siècle)","institution":"BnF — Département des manuscrits","_images_b64":{"folio_001":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAUAAAADcCAIAAACOIe9xAAAC4ElEQVR4nO3bsVFDUQxFQZf4y6EIiqAIiiJ0Qu7QCRFg6Z2ZvbMFKDmhbp8f70DUbf0C4NcEDGHPgL/vX0CCgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhbCjgy8yuS8Bm4QnYLLxqwMArCBjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjDPDGZzE7BZeAI2C68aMPAKAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIezfAn4zs79tM2BgnoAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjAP/WanTMBm4W0GDMwTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQ5qHf7JQJ2Cy8zYCBeQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMI89JudMgGbhbcZMDBPwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmId+s1MmYLPwNgMG5gkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAjz0G92ygRsFt5mwMA8AUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYR76zU6ZgM3C2wwYmCdgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCHMQ7/ZKROwWXibAQPzBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhHnoNztlAjYLbzNgYJ6AIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwD/1mp0zAZuFtBgzMEzCECRjCfggYyBEwhAkYwh59L5rsdXrQDgAAAABJRU5ErkJggg==","folio_002":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAUAAAADcCAIAAACOIe9xAAAC4ElEQVR4nO3bsVFDUQxFQZf4y6EIiqAIiiJ0Qu7QCRFg6Z2ZvbMFKDmhbp8f70DUbf0C4NcEDGHPgL/vX0CCgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhbCjgy8yuS8Bm4QnYLLxqwMArCBjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjDPDGZzE7BZeAI2C68aMPAKAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIezfAn4zs79tM2BgnoAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjAP/WanTMBm4W0GDMwTMIQJGMIEDGEChjAB
Q5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQ5qHf7JQJ2Cy8zYCBeQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMI89JudMgGbhbcZMDBPwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmId+s1MmYLPwNgMG5gkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAjz0G92ygRsFt5mwMA8AUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYR76zU6ZgM3C2wwYmCdgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCHMQ7/ZKROwWXibAQPzBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhHnoNztlAjYLbzNgYJ6AIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwD/1mp0zAZuFtBgzMEzCECRjCfggYyBEwhAkYwh59L5rsdXrQDgAAAABJRU5ErkJggg==","folio_003":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAUAAAADcCAIAAACOIe9xAAAC4ElEQVR4nO3bsVFDUQxFQZf4y6EIiqAIiiJ0Qu7QCRFg6Z2ZvbMFKDmhbp8f70DUbf0C4NcEDGHPgL/vX0CCgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhbCjgy8yuS8Bm4QnYLLxqwMArCBjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjDPDGZzE7BZeAI2C68aMPAKAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIezfAn4zs79tM2BgnoAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjAP/WanTMBm4W0GDMwTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQ5qHf7JQJ2Cy8zYCBeQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMI89JudMgGbhbcZMDBPwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmId+s1MmYLPwNgMG5gkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAjz0G92ygRsFt5mwMA8AUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYR76zU6ZgM3C2wwYmCdgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCHMQ7/ZKROwWXibAQPzBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhHnoNztlAjYLbzNgYJ6AIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwD/1mp0zAZuFtBgzMEzCECRjCfggYyBEwhAkYwh59L5rsdXrQDgAAAABJRU5ErkJggg=="}}},"ranking":[{"engine":"tesseract → 
gpt-4o","mean_cer":0.038091,"mean_wer":0.038095,"documents":3,"failed":0},{"engine":"tesseract","mean_cer":0.044933,"mean_wer":0.08254,"documents":3,"failed":0},{"engine":"ancien_moteur","mean_cer":0.179834,"mean_wer":0.288889,"documents":3,"failed":0},{"engine":"pero_ocr","mean_cer":0.0,"mean_wer":0.0,"documents":3,"failed":0}],"engines":[{"name":"pero_ocr","version":"0.7.2","cer":0.0,"wer":0.0,"mer":0.0,"wil":0.0,"cer_median":0.0,"cer_min":0.0,"cer_max":0.0,"doc_count":3,"failed":0,"cer_diplomatic":0.0,"cer_diplomatic_profile":"medieval_french","cer_values":[0.0,0.0,0.0],"cer_diplomatic_values":[0.0,0.0,0.0],"is_pipeline":false,"pipeline_info":{},"ligature_score":1.0,"diacritic_score":1.0,"aggregated_confusion":{"matrix":{},"total_substitutions":0,"total_insertions":0,"total_deletions":0},"aggregated_taxonomy":{"counts":{"visual_confusion":0,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":0,"oov_character":0,"lacuna":0},"total_errors":0,"class_distribution":{"visual_confusion":0.0,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.0,"segmentation_error":0.0,"oov_character":0.0,"lacuna":0.0}},"aggregated_structure":{"mean_line_fusion_rate":0.0,"mean_line_fragmentation_rate":0.0,"mean_reading_order_score":1.0,"mean_paragraph_conservation":1.0,"mean_line_accuracy":1.0,"document_count":3},"aggregated_image_quality":{"mean_quality_score":0.732,"mean_sharpness":0.614,"mean_noise_level":0.2979,"quality_distribution":{"good":2,"medium":1,"poor":0},"document_count":3,"scores":[0.5875,0.7747,0.8339]}},{"name":"tesseract","version":"5.3.3","cer":0.0449,"wer":0.0825,"mer":0.0825,"wil":0.139,"cer_median":0.01,"cer_min":0.009,"cer_max":0.1158,"doc_count":3,"failed":0,"cer_diplomatic":0.0513,"cer_diplomatic_profile":"medieval_french","cer_values":[0.1158,0.009,0.01],"cer_diplomatic_values":[0.125,0.009,0.0198],"is_pipeline":false,"pipeline_info":{},"ligature_score":1.0,"diacri
tic_score":1.0,"aggregated_confusion":{"matrix":{"c":{"∅":1},"r":{"∅":1},"o":{"∅":1},"n":{"∅":1},"i":{"∅":1},"q":{"∅":1},"u":{"∅":1},"e":{"∅":1},"s":{"∅":1},"&":{"8":2},"ſ":{"f":1}},"total_substitutions":3,"total_insertions":0,"total_deletions":9},"aggregated_taxonomy":{"counts":{"visual_confusion":1,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":2,"segmentation_error":0,"oov_character":0,"lacuna":1},"total_errors":4,"class_distribution":{"visual_confusion":0.25,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.5,"segmentation_error":0.0,"oov_character":0.0,"lacuna":0.25}},"aggregated_structure":{"mean_line_fusion_rate":0.0,"mean_line_fragmentation_rate":0.0,"mean_reading_order_score":0.9274,"mean_paragraph_conservation":1.0,"mean_line_accuracy":1.0,"document_count":3},"aggregated_image_quality":{"mean_quality_score":0.7363,"mean_sharpness":0.7263,"mean_noise_level":0.2437,"quality_distribution":{"good":2,"medium":1,"poor":0},"document_count":3,"scores":[0.5284,0.8648,0.8158]}},{"name":"ancien_moteur","version":"2.1.0","cer":0.1798,"wer":0.2889,"mer":0.2889,"wil":0.3963,"cer_median":0.09,"cer_min":0.0811,"cer_max":0.3684,"doc_count":3,"failed":0,"cer_diplomatic":0.1783,"cer_diplomatic_profile":"medieval_french","cer_values":[0.3684,0.0811,0.09],"cer_diplomatic_values":[0.3646,0.0811,0.0891],"is_pipeline":false,"pipeline_info":{},"ligature_score":1.0,"diacritic_score":1.0,"aggregated_confusion":{"matrix":{"p":{"∅":1},"r":{"∅":5,"z":1},"o":{"∅":3},"l":{"∅":1},"g":{"∅":1},"u":{"∅":1},"e":{"∅":5},"m":{"∅":2},"a":{"∅":3,"f":1,"w":1},"i":{"∅":2},"ſ":{"∅":3},"t":{"∅":3},"F":{"∅":2},"s":{"t":1},"n":{"∅":2},"c":{"∅":1},"E":{"∅":1},"x":{"f":1},"b":{"y":1},"J":{"z":1},"y":{"w":1},"I":{"∅":1},"d":{"∅":1}},"total_substitutions":8,"total_insertions":0,"total_deletions":38},"aggregated_taxonomy":{"counts":{"visual_confusion":1,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error
":0,"hapax":5,"segmentation_error":2,"oov_character":0,"lacuna":5},"total_errors":13,"class_distribution":{"visual_confusion":0.0769,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.3846,"segmentation_error":0.1538,"oov_character":0.0,"lacuna":0.3846}},"aggregated_structure":{"mean_line_fusion_rate":0.0,"mean_line_fragmentation_rate":0.0,"mean_reading_order_score":0.7697,"mean_paragraph_conservation":1.0,"mean_line_accuracy":1.0,"document_count":3},"aggregated_image_quality":{"mean_quality_score":0.4803,"mean_sharpness":0.4196,"mean_noise_level":0.4834,"quality_distribution":{"good":1,"medium":0,"poor":2},"document_count":3,"scores":[0.2888,0.388,0.7641]}},{"name":"tesseract → gpt-4o","version":"ocr=5.3.3; llm=gpt-4o","cer":0.0381,"wer":0.0381,"mer":0.0381,"wil":0.0532,"cer_median":0.009,"cer_min":0.0,"cer_max":0.1053,"doc_count":3,"failed":0,"cer_diplomatic":0.0377,"cer_diplomatic_profile":"medieval_french","cer_values":[0.1053,0.009,0.0],"cer_diplomatic_values":[0.1042,0.009,0.0],"is_pipeline":true,"pipeline_info":{"pipeline_mode":"text_and_image","prompt_file":"correction_medieval_french.txt","llm_model":"gpt-4o","llm_provider":"openai","pipeline_steps":[{"type":"ocr","engine":"tesseract","version":"5.3.3"},{"type":"llm","model":"gpt-4o","provider":"openai","mode":"text_and_image","prompt_file":"correction_medieval_french.txt"}],"over_normalization":{"score":0.0,"total_correct_ocr_words":44,"over_normalized_count":0,"document_count":3}},"ligature_score":1.0,"diacritic_score":1.0,"aggregated_confusion":{"matrix":{"c":{"∅":1},"r":{"∅":1},"o":{"∅":1},"n":{"∅":1},"i":{"∅":1},"q":{"∅":1},"u":{"∅":1},"e":{"∅":1},"s":{"∅":1},"ſ":{"f":1}},"total_substitutions":1,"total_insertions":0,"total_deletions":9},"aggregated_taxonomy":{"counts":{"visual_confusion":1,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":0,"oov_character":0,"lacuna":1},"total_errors":2,"class_distribu
tion":{"visual_confusion":0.5,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.0,"segmentation_error":0.0,"oov_character":0.0,"lacuna":0.5}},"aggregated_structure":{"mean_line_fusion_rate":0.0,"mean_line_fragmentation_rate":0.0,"mean_reading_order_score":0.9726,"mean_paragraph_conservation":1.0,"mean_line_accuracy":1.0,"document_count":3},"aggregated_image_quality":{"mean_quality_score":0.6755,"mean_sharpness":0.7034,"mean_noise_level":0.2303,"quality_distribution":{"good":1,"medium":2,"poor":0},"document_count":3,"scores":[0.6047,0.6787,0.7431]}}],"documents":[{"doc_id":"folio_001","image_path":"/corpus/images/folio_001.jpg","image_b64":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAUAAAADcCAIAAACOIe9xAAAC4ElEQVR4nO3bsVFDUQxFQZf4y6EIiqAIiiJ0Qu7QCRFg6Z2ZvbMFKDmhbp8f70DUbf0C4NcEDGHPgL/vX0CCgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhbCjgy8yuS8Bm4QnYLLxqwMArCBjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjDPDGZzE7BZeAI2C68aMPAKAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIezfAn4zs79tM2BgnoAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjAP/WanTMBm4W0GDMwTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQ5qHf7JQJ2Cy8zYCBeQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMI89JudMgGbhbcZMDBPwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmId+s1MmYLPwNgMG5gkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAjz0G92ygRsFt5mwMA8AUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYR76zU6ZgM3C2wwYmCdgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCHMQ7/ZKROwWXibAQPzBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhHnoNztlAjYLbzNgYJ6AIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwD/1mp0zAZuFtBgzMEzCECRjCfggYyBEwhAkYwh59L5rsdXrQDgAAAABJRU5ErkJggg==","ground_truth":"Icy commence le prologue de maiſtre Jehan Froiſſart ſus les croniques de France & d'Angleterre.","mean_cer":0.1474,"best_engine":"pero_ocr","engine_results":[{"engine":"pero_ocr","hypothesis":"Icy commence 
le prologue de maiſtre Jehan Froiſſart ſus les croniques de France & d'Angleterre.","cer":0.0,"cer_diplomatic":0.0,"wer":0.0,"duration":0.405,"error":null,"diff":[{"op":"equal","text":"Icy commence le prologue de maiſtre Jehan Froiſſart ſus les croniques de France & d'Angleterre."}],"ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":0,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":0,"oov_character":0,"lacuna":0},"total_errors":0,"class_distribution":{},"examples":{"visual_confusion":[],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[],"segmentation_error":[],"oov_character":[],"lacuna":[]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":1.0,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.5031,"noise_level":0.4962,"rotation_degrees":0.05,"contrast_score":0.6198,"quality_score":0.5875,"quality_tier":"medium","analysis_method":"mock","script_type":"gothique textura"}},{"engine":"tesseract","hypothesis":"Icy commence le prologue de maiſtre Jehan Froiſſart ſus les de France 8 d'Angleterre.","cer":0.1158,"cer_diplomatic":0.125,"wer":0.1333,"duration":0.411,"error":null,"diff":[{"op":"equal","text":"Icy commence le prologue de maiſtre Jehan Froiſſart ſus les"},{"op":"delete","text":"croniques"},{"op":"equal","text":"de 
France"},{"op":"replace","old":"&","new":"8"},{"op":"equal","text":"d'Angleterre."}],"ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":0,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":1,"segmentation_error":0,"oov_character":0,"lacuna":1},"total_errors":2,"class_distribution":{"visual_confusion":0.0,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.5,"segmentation_error":0.0,"oov_character":0.0,"lacuna":0.5},"examples":{"visual_confusion":[],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[{"gt":"&","ocr":"8"}],"segmentation_error":[],"oov_character":[],"lacuna":[{"gt":"croniques","ocr":"","position":10}]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":0.8966,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.6518,"noise_level":0.495,"rotation_degrees":-1.34,"contrast_score":0.2668,"quality_score":0.5284,"quality_tier":"medium","analysis_method":"mock","script_type":"gothique textura"}},{"engine":"ancien_moteur","hypothesis":"Icy commence le de Jehan ſut les croniques de & d'Angleterre.","cer":0.3684,"cer_diplomatic":0.3646,"wer":0.3333,"duration":3.892,"error":null,"diff":[{"op":"equal","text":"Icy commence le"},{"op":"delete","text":"prologue"},{"op":"equal","text":"de"},{"op":"delete","text":"maiſtre"},{"op":"equal","text":"Jehan"},{"op":"replace","old":"Froiſſart ſus","new":"ſut"},{"op":"equal","text":"les croniques de"},{"op":"delete","text":"France"},{"op":"equal","text":"& 
d'Angleterre."}],"ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":0,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":1,"oov_character":0,"lacuna":3},"total_errors":4,"class_distribution":{"visual_confusion":0.0,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.0,"segmentation_error":0.25,"oov_character":0.0,"lacuna":0.75},"examples":{"visual_confusion":[],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[],"segmentation_error":[{"gt":"Froiſſart ſus","ocr":"ſut","position":7}],"oov_character":[],"lacuna":[{"gt":"prologue","ocr":"","position":3},{"gt":"maiſtre","ocr":"","position":5},{"gt":"France","ocr":"","position":12}]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":0.7692,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.1537,"noise_level":0.5589,"rotation_degrees":-2.09,"contrast_score":0.2,"quality_score":0.2888,"quality_tier":"poor","analysis_method":"mock","script_type":"gothique textura"}},{"engine":"tesseract → gpt-4o","hypothesis":"Icy commence le prologue de maiſtre Jehan Froiſſart ſus les de France & d'Angleterre.","cer":0.1053,"cer_diplomatic":0.1042,"wer":0.0667,"duration":11.725,"error":null,"diff":[{"op":"equal","text":"Icy commence le prologue de maiſtre Jehan Froiſſart ſus les"},{"op":"delete","text":"croniques"},{"op":"equal","text":"de France & d'Angleterre."}],"ocr_intermediate":"Icy commence le prologue de maiſtre Jehan Froiſſart ſus les de France 8 d'Angleterre.","ocr_diff":[{"op":"equal","text":"Icy commence le prologue de maiſtre Jehan Froiſſart ſus les"},{"op":"delete","text":"croniques"},{"op":"equal","text":"de 
France"},{"op":"replace","old":"&","new":"8"},{"op":"equal","text":"d'Angleterre."}],"llm_correction_diff":[{"op":"equal","text":"Icy commence le prologue de maiſtre Jehan Froiſſart ſus les de France"},{"op":"replace","old":"8","new":"&"},{"op":"equal","text":"d'Angleterre."}],"over_normalization":{"score":0.0,"total_correct_ocr_words":10,"over_normalized_count":0,"over_normalized_passages":[]},"pipeline_mode":"text_and_image","ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":0,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":0,"oov_character":0,"lacuna":1},"total_errors":1,"class_distribution":{"visual_confusion":0.0,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.0,"segmentation_error":0.0,"oov_character":0.0,"lacuna":1.0},"examples":{"visual_confusion":[],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[],"segmentation_error":[],"oov_character":[],"lacuna":[{"gt":"croniques","ocr":"","position":10}]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":0.9655,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.6971,"noise_level":0.3585,"rotation_degrees":2.94,"contrast_score":0.4231,"quality_score":0.6047,"quality_tier":"medium","analysis_method":"mock","script_type":"gothique textura"}}],"script_type":"gothique 
textura","difficulty_score":0.2072,"difficulty_label":"Facile"},{"doc_id":"folio_002","image_path":"/corpus/images/folio_002.jpg","image_b64":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAUAAAADcCAIAAACOIe9xAAAC4ElEQVR4nO3bsVFDUQxFQZf4y6EIiqAIiiJ0Qu7QCRFg6Z2ZvbMFKDmhbp8f70DUbf0C4NcEDGHPgL/vX0CCgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhbCjgy8yuS8Bm4QnYLLxqwMArCBjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjDPDGZzE7BZeAI2C68aMPAKAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIezfAn4zs79tM2BgnoAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjAP/WanTMBm4W0GDMwTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQ5qHf7JQJ2Cy8zYCBeQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMI89JudMgGbhbcZMDBPwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmId+s1MmYLPwNgMG5gkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAjz0G92ygRsFt5mwMA8AUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYR76zU6ZgM3C2wwYmCdgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCHMQ7/ZKROwWXibAQPzBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhHnoNztlAjYLbzNgYJ6AIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwD/1mp0zAZuFtBgzMEzCECRjCfggYyBEwhAkYwh59L5rsdXrQDgAAAABJRU5ErkJggg==","ground_truth":"En l'an de grace mil trois cens ſoixante, regnoit en France le noble roy Jehan, filz du roy Phelippe de Valois.","mean_cer":0.0248,"best_engine":"pero_ocr","engine_results":[{"engine":"pero_ocr","hypothesis":"En l'an de grace mil trois cens ſoixante, regnoit en France le noble roy Jehan, filz du roy Phelippe de Valois.","cer":0.0,"cer_diplomatic":0.0,"wer":0.0,"duration":0.886,"error":null,"diff":[{"op":"equal","text":"En l'an de grace mil trois cens ſoixante, regnoit en France le noble roy Jehan, filz du roy Phelippe de 
Valois."}],"ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":0,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":0,"oov_character":0,"lacuna":0},"total_errors":0,"class_distribution":{},"examples":{"visual_confusion":[],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[],"segmentation_error":[],"oov_character":[],"lacuna":[]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":1.0,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.6798,"noise_level":0.2595,"rotation_degrees":1.37,"contrast_score":0.8946,"quality_score":0.7747,"quality_tier":"good","analysis_method":"mock","script_type":"humanistique"}},{"engine":"tesseract","hypothesis":"En l'an de grace mil trois cens foixante, regnoit en France le noble roy Jehan, filz du roy Phelippe de Valois.","cer":0.009,"cer_diplomatic":0.009,"wer":0.0476,"duration":0.971,"error":null,"diff":[{"op":"equal","text":"En l'an de grace mil trois cens"},{"op":"replace","old":"ſoixante,","new":"foixante,"},{"op":"equal","text":"regnoit en France le noble roy Jehan, filz du roy Phelippe de 
Valois."}],"ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":1,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":0,"oov_character":0,"lacuna":0},"total_errors":1,"class_distribution":{"visual_confusion":1.0,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.0,"segmentation_error":0.0,"oov_character":0.0,"lacuna":0.0},"examples":{"visual_confusion":[{"gt":"ſoixante,","ocr":"foixante,"}],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[],"segmentation_error":[],"oov_character":[],"lacuna":[]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":0.9524,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.7507,"noise_level":0.0967,"rotation_degrees":0.68,"contrast_score":0.969,"quality_score":0.8648,"quality_tier":"good","analysis_method":"mock","script_type":"humanistique"}},{"engine":"ancien_moteur","hypothesis":"l'fn de grwce mil trois cens ſoifante, regnoit en France le noyle roy zehan, filz du row Phelippe de Valois.","cer":0.0811,"cer_diplomatic":0.0811,"wer":0.3333,"duration":2.227,"error":null,"diff":[{"op":"replace","old":"En l'an","new":"l'fn"},{"op":"equal","text":"de"},{"op":"replace","old":"grace","new":"grwce"},{"op":"equal","text":"mil trois cens"},{"op":"replace","old":"ſoixante,","new":"ſoifante,"},{"op":"equal","text":"regnoit en France le"},{"op":"replace","old":"noble","new":"noyle"},{"op":"equal","text":"roy"},{"op":"replace","old":"Jehan,","new":"zehan,"},{"op":"equal","text":"filz du"},{"op":"replace","old":"roy","new":"row"},{"op":"equal","text":"Phelippe de 
Valois."}],"ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":0,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":5,"segmentation_error":1,"oov_character":0,"lacuna":0},"total_errors":6,"class_distribution":{"visual_confusion":0.0,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.8333,"segmentation_error":0.1667,"oov_character":0.0,"lacuna":0.0},"examples":{"visual_confusion":[],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[{"gt":"grace","ocr":"grwce"},{"gt":"ſoixante,","ocr":"ſoifante,"},{"gt":"noble","ocr":"noyle"}],"segmentation_error":[{"gt":"En l'an","ocr":"l'fn","position":0}],"oov_character":[],"lacuna":[]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":0.6829,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.1939,"noise_level":0.5855,"rotation_degrees":-0.28,"contrast_score":0.4345,"quality_score":0.388,"quality_tier":"poor","analysis_method":"mock","script_type":"humanistique"}},{"engine":"tesseract → gpt-4o","hypothesis":"En l'an de grace mil trois cens foixante, regnoit en France le noble roy Jehan, filz du roy Phelippe de Valois.","cer":0.009,"cer_diplomatic":0.009,"wer":0.0476,"duration":8.963,"error":null,"diff":[{"op":"equal","text":"En l'an de grace mil trois cens"},{"op":"replace","old":"ſoixante,","new":"foixante,"},{"op":"equal","text":"regnoit en France le noble roy Jehan, filz du roy Phelippe de Valois."}],"ocr_intermediate":"En l'an de grace mil trois cens foixante, regnoit en France le noble roy Jehan, filz du roy Phelippe de Valois.","ocr_diff":[{"op":"equal","text":"En l'an de grace mil trois cens"},{"op":"replace","old":"ſoixante,","new":"foixante,"},{"op":"equal","text":"regnoit en France le noble roy Jehan, filz 
du roy Phelippe de Valois."}],"llm_correction_diff":[{"op":"equal","text":"En l'an de grace mil trois cens foixante, regnoit en France le noble roy Jehan, filz du roy Phelippe de Valois."}],"over_normalization":{"score":0.0,"total_correct_ocr_words":20,"over_normalized_count":0,"over_normalized_passages":[]},"pipeline_mode":"text_and_image","ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":1,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":0,"oov_character":0,"lacuna":0},"total_errors":1,"class_distribution":{"visual_confusion":1.0,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.0,"segmentation_error":0.0,"oov_character":0.0,"lacuna":0.0},"examples":{"visual_confusion":[{"gt":"ſoixante,","ocr":"foixante,"}],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[],"segmentation_error":[],"oov_character":[],"lacuna":[]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":0.9524,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.7141,"noise_level":0.3019,"rotation_degrees":0.75,"contrast_score":0.5365,"quality_score":0.6787,"quality_tier":"medium","analysis_method":"mock","script_type":"humanistique"}}],"script_type":"humanistique","difficulty_score":0.1209,"difficulty_label":"Facile"},{"doc_id":"folio_003","image_path":"/corpus/images/folio_003.jpg","image_b64":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAUAAAADcCAIAAACOIe9xAAAC4ElEQVR4nO3bsVFDUQxFQZf4y6EIiqAIiiJ0Qu7QCRFg6Z2ZvbMFKDmhbp8f70DUbf0C4NcEDGHPgL/vX0CCgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhbCjgy8yuS8Bm4QnYLLxqwMArCBjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjDPDGZzE7BZeAI2C68aMPAKAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIezfAn4zs79tM2BgnoAhTMAQJmAIEzC
ECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjAP/WanTMBm4W0GDMwTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQ5qHf7JQJ2Cy8zYCBeQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMI89JudMgGbhbcZMDBPwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmId+s1MmYLPwNgMG5gkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAjz0G92ygRsFt5mwMA8AUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYR76zU6ZgM3C2wwYmCdgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCHMQ7/ZKROwWXibAQPzBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhHnoNztlAjYLbzNgYJ6AIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwD/1mp0zAZuFtBgzMEzCECRjCfggYyBEwhAkYwh59L5rsdXrQDgAAAABJRU5ErkJggg==","ground_truth":"Item ledit iour furent menez en ladicte ville de Paris pluſieurs priſonniers ſaraſins & mahommetans.","mean_cer":0.025,"best_engine":"pero_ocr","engine_results":[{"engine":"pero_ocr","hypothesis":"Item ledit iour furent menez en ladicte ville de Paris pluſieurs priſonniers ſaraſins & mahommetans.","cer":0.0,"cer_diplomatic":0.0,"wer":0.0,"duration":2.78,"error":null,"diff":[{"op":"equal","text":"Item ledit iour furent menez en ladicte ville de Paris pluſieurs priſonniers ſaraſins & 
mahommetans."}],"ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":0,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":0,"oov_character":0,"lacuna":0},"total_errors":0,"class_distribution":{},"examples":{"visual_confusion":[],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[],"segmentation_error":[],"oov_character":[],"lacuna":[]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":1.0,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.6592,"noise_level":0.138,"rotation_degrees":-0.22,"contrast_score":1.0,"quality_score":0.8339,"quality_tier":"good","analysis_method":"mock","script_type":"cursive administrative"}},{"engine":"tesseract","hypothesis":"Item ledit iour furent menez en ladicte ville de Paris pluſieurs priſonniers ſaraſins 8 mahommetans.","cer":0.01,"cer_diplomatic":0.0198,"wer":0.0667,"duration":0.69,"error":null,"diff":[{"op":"equal","text":"Item ledit iour furent menez en ladicte ville de Paris pluſieurs priſonniers 
ſaraſins"},{"op":"replace","old":"&","new":"8"},{"op":"equal","text":"mahommetans."}],"ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":0,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":1,"segmentation_error":0,"oov_character":0,"lacuna":0},"total_errors":1,"class_distribution":{"visual_confusion":0.0,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":1.0,"segmentation_error":0.0,"oov_character":0.0,"lacuna":0.0},"examples":{"visual_confusion":[],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[{"gt":"&","ocr":"8"}],"segmentation_error":[],"oov_character":[],"lacuna":[]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":0.9333,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.7764,"noise_level":0.1395,"rotation_degrees":-0.69,"contrast_score":0.8002,"quality_score":0.8158,"quality_tier":"good","analysis_method":"mock","script_type":"cursive administrative"}},{"engine":"ancien_moteur","hypothesis":"ledit iour furent menez en ladicte ville Paris pluſieurs priſonniers ſazaſins & mahommetans.","cer":0.09,"cer_diplomatic":0.0891,"wer":0.2,"duration":2.803,"error":null,"diff":[{"op":"delete","text":"Item"},{"op":"equal","text":"ledit iour furent menez en ladicte ville"},{"op":"delete","text":"de"},{"op":"equal","text":"Paris pluſieurs priſonniers"},{"op":"replace","old":"ſaraſins","new":"ſazaſins"},{"op":"equal","text":"& 
mahommetans."}],"ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":1,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":0,"oov_character":0,"lacuna":2},"total_errors":3,"class_distribution":{"visual_confusion":0.3333,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.0,"segmentation_error":0.0,"oov_character":0.0,"lacuna":0.6667},"examples":{"visual_confusion":[{"gt":"ſaraſins","ocr":"ſazaſins"}],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[],"segmentation_error":[],"oov_character":[],"lacuna":[{"gt":"Item","ocr":"","position":0},{"gt":"de","ocr":"","position":8}]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":0.8571,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.9112,"noise_level":0.3059,"rotation_degrees":1.1,"contrast_score":0.5727,"quality_score":0.7641,"quality_tier":"good","analysis_method":"mock","script_type":"cursive administrative"}},{"engine":"tesseract → gpt-4o","hypothesis":"Item ledit iour furent menez en ladicte ville de Paris pluſieurs priſonniers ſaraſins & mahommetans.","cer":0.0,"cer_diplomatic":0.0,"wer":0.0,"duration":7.601,"error":null,"diff":[{"op":"equal","text":"Item ledit iour furent menez en ladicte ville de Paris pluſieurs priſonniers ſaraſins & mahommetans."}],"ocr_intermediate":"Item ledit iour furent menez en ladicte ville de Paris pluſieurs priſonniers ſaraſins 8 mahommetans.","ocr_diff":[{"op":"equal","text":"Item ledit iour furent menez en ladicte ville de Paris pluſieurs priſonniers ſaraſins"},{"op":"replace","old":"&","new":"8"},{"op":"equal","text":"mahommetans."}],"llm_correction_diff":[{"op":"equal","text":"Item ledit iour furent menez en ladicte ville de Paris pluſieurs 
priſonniers ſaraſins"},{"op":"replace","old":"8","new":"&"},{"op":"equal","text":"mahommetans."}],"over_normalization":{"score":0.0,"total_correct_ocr_words":14,"over_normalized_count":0,"over_normalized_passages":[]},"pipeline_mode":"text_and_image","ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":0,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":0,"oov_character":0,"lacuna":0},"total_errors":0,"class_distribution":{},"examples":{"visual_confusion":[],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[],"segmentation_error":[],"oov_character":[],"lacuna":[]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":1.0,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.6989,"noise_level":0.0306,"rotation_degrees":2.4,"contrast_score":0.6456,"quality_score":0.7431,"quality_tier":"good","analysis_method":"mock","script_type":"cursive administrative"}}],"script_type":"cursive administrative","difficulty_score":0.1297,"difficulty_label":"Facile"}],"statistics":{"pairwise_wilcoxon":[{"engine_a":"pero_ocr","engine_b":"tesseract","statistic":0,"p_value":0.04,"significant":true,"interpretation":"Différence statistiquement significative (p = 0.0400 < 0.05). Le premier concurrent obtient de meilleurs scores.","n_pairs":3,"W_plus":0,"W_minus":6.0},{"engine_a":"pero_ocr","engine_b":"ancien_moteur","statistic":0,"p_value":0.04,"significant":true,"interpretation":"Différence statistiquement significative (p = 0.0400 < 0.05). 
Le premier concurrent obtient de meilleurs scores.","n_pairs":3,"W_plus":0,"W_minus":6.0},{"engine_a":"pero_ocr","engine_b":"tesseract → gpt-4o","statistic":0,"p_value":0.04,"significant":true,"interpretation":"Différence statistiquement significative (p = 0.0400 < 0.05). Le premier concurrent obtient de meilleurs scores.","n_pairs":2,"W_plus":0,"W_minus":3.0},{"engine_a":"tesseract","engine_b":"ancien_moteur","statistic":0,"p_value":0.04,"significant":true,"interpretation":"Différence statistiquement significative (p = 0.0400 < 0.05). Le premier concurrent obtient de meilleurs scores.","n_pairs":3,"W_plus":0,"W_minus":6.0},{"engine_a":"tesseract","engine_b":"tesseract → gpt-4o","statistic":0,"p_value":0.04,"significant":true,"interpretation":"Différence statistiquement significative (p = 0.0400 < 0.05). Le second concurrent obtient de meilleurs scores.","n_pairs":2,"W_plus":3.0,"W_minus":0},{"engine_a":"ancien_moteur","engine_b":"tesseract → gpt-4o","statistic":0,"p_value":0.04,"significant":true,"interpretation":"Différence statistiquement significative (p = 0.0400 < 0.05). 
Le second concurrent obtient de meilleurs scores.","n_pairs":3,"W_plus":6.0,"W_minus":0}],"bootstrap_cis":[{"engine":"pero_ocr","mean":0.0,"ci_lower":0.0,"ci_upper":0.0},{"engine":"tesseract","mean":0.0449,"ci_lower":0.009,"ci_upper":0.1158},{"engine":"ancien_moteur","mean":0.1798,"ci_lower":0.0811,"ci_upper":0.3684},{"engine":"tesseract → gpt-4o","mean":0.0381,"ci_lower":0.0,"ci_upper":0.1053}]},"reliability_curves":[{"engine":"pero_ocr","points":[{"pct_docs":5.0,"mean_cer":0.0},{"pct_docs":10.0,"mean_cer":0.0},{"pct_docs":15.0,"mean_cer":0.0},{"pct_docs":20.0,"mean_cer":0.0},{"pct_docs":25.0,"mean_cer":0.0},{"pct_docs":30.0,"mean_cer":0.0},{"pct_docs":35.0,"mean_cer":0.0},{"pct_docs":40.0,"mean_cer":0.0},{"pct_docs":45.0,"mean_cer":0.0},{"pct_docs":50.0,"mean_cer":0.0},{"pct_docs":55.0,"mean_cer":0.0},{"pct_docs":60.0,"mean_cer":0.0},{"pct_docs":65.0,"mean_cer":0.0},{"pct_docs":70.0,"mean_cer":0.0},{"pct_docs":75.0,"mean_cer":0.0},{"pct_docs":80.0,"mean_cer":0.0},{"pct_docs":85.0,"mean_cer":0.0},{"pct_docs":90.0,"mean_cer":0.0},{"pct_docs":95.0,"mean_cer":0.0},{"pct_docs":100.0,"mean_cer":0.0}]},{"engine":"tesseract","points":[{"pct_docs":5.0,"mean_cer":0.009},{"pct_docs":10.0,"mean_cer":0.009},{"pct_docs":15.0,"mean_cer":0.009},{"pct_docs":20.0,"mean_cer":0.009},{"pct_docs":25.0,"mean_cer":0.009},{"pct_docs":30.0,"mean_cer":0.009},{"pct_docs":35.0,"mean_cer":0.009},{"pct_docs":40.0,"mean_cer":0.009},{"pct_docs":45.0,"mean_cer":0.009},{"pct_docs":50.0,"mean_cer":0.009},{"pct_docs":55.0,"mean_cer":0.009},{"pct_docs":60.0,"mean_cer":0.009},{"pct_docs":65.0,"mean_cer":0.009},{"pct_docs":70.0,"mean_cer":0.0095},{"pct_docs":75.0,"mean_cer":0.0095},{"pct_docs":80.0,"mean_cer":0.0095},{"pct_docs":85.0,"mean_cer":0.0095},{"pct_docs":90.0,"mean_cer":0.0095},{"pct_docs":95.0,"mean_cer":0.0095},{"pct_docs":100.0,"mean_cer":0.044933}]},{"engine":"ancien_moteur","points":[{"pct_docs":5.0,"mean_cer":0.0811},{"pct_docs":10.0,"mean_cer":0.0811},{"pct_docs":15.0,"mean_cer":0.0811}
,{"pct_docs":20.0,"mean_cer":0.0811},{"pct_docs":25.0,"mean_cer":0.0811},{"pct_docs":30.0,"mean_cer":0.0811},{"pct_docs":35.0,"mean_cer":0.0811},{"pct_docs":40.0,"mean_cer":0.0811},{"pct_docs":45.0,"mean_cer":0.0811},{"pct_docs":50.0,"mean_cer":0.0811},{"pct_docs":55.0,"mean_cer":0.0811},{"pct_docs":60.0,"mean_cer":0.0811},{"pct_docs":65.0,"mean_cer":0.0811},{"pct_docs":70.0,"mean_cer":0.08555},{"pct_docs":75.0,"mean_cer":0.08555},{"pct_docs":80.0,"mean_cer":0.08555},{"pct_docs":85.0,"mean_cer":0.08555},{"pct_docs":90.0,"mean_cer":0.08555},{"pct_docs":95.0,"mean_cer":0.08555},{"pct_docs":100.0,"mean_cer":0.179833}]},{"engine":"tesseract → gpt-4o","points":[{"pct_docs":5.0,"mean_cer":0.0},{"pct_docs":10.0,"mean_cer":0.0},{"pct_docs":15.0,"mean_cer":0.0},{"pct_docs":20.0,"mean_cer":0.0},{"pct_docs":25.0,"mean_cer":0.0},{"pct_docs":30.0,"mean_cer":0.0},{"pct_docs":35.0,"mean_cer":0.0},{"pct_docs":40.0,"mean_cer":0.0},{"pct_docs":45.0,"mean_cer":0.0},{"pct_docs":50.0,"mean_cer":0.0},{"pct_docs":55.0,"mean_cer":0.0},{"pct_docs":60.0,"mean_cer":0.0},{"pct_docs":65.0,"mean_cer":0.0},{"pct_docs":70.0,"mean_cer":0.0045},{"pct_docs":75.0,"mean_cer":0.0045},{"pct_docs":80.0,"mean_cer":0.0045},{"pct_docs":85.0,"mean_cer":0.0045},{"pct_docs":90.0,"mean_cer":0.0045},{"pct_docs":95.0,"mean_cer":0.0045},{"pct_docs":100.0,"mean_cer":0.0381}]}],"venn_data":{"type":"venn3","label_a":"pero_ocr","label_b":"tesseract","label_c":"ancien_moteur","only_a":0,"only_b":4,"only_c":13,"ab":0,"ac":0,"bc":0,"abc":0},"error_clusters":[{"cluster_id":5,"label":"autres 
substitutions","count":13,"examples":[{"engine":"tesseract","gt_fragment":"croniques","ocr_fragment":""},{"engine":"tesseract","gt_fragment":"ſoixante,","ocr_fragment":"foixante,"},{"engine":"ancien_moteur","gt_fragment":"prologue","ocr_fragment":""},{"engine":"ancien_moteur","gt_fragment":"maiſtre","ocr_fragment":""},{"engine":"ancien_moteur","gt_fragment":"France","ocr_fragment":""}]},{"cluster_id":1,"label":"&→8","count":2,"examples":[{"engine":"tesseract","gt_fragment":"&","ocr_fragment":"8"},{"engine":"tesseract","gt_fragment":"&","ocr_fragment":"8"}]},{"cluster_id":2,"label":"confusion ſ/f/s","count":2,"examples":[{"engine":"ancien_moteur","gt_fragment":"Froiſſart ſus","ocr_fragment":"ſut"},{"engine":"ancien_moteur","gt_fragment":"ſaraſins","ocr_fragment":"ſazaſins"}]},{"cluster_id":3,"label":"roy→row","count":1,"examples":[{"engine":"ancien_moteur","gt_fragment":"roy","ocr_fragment":"row"}]},{"cluster_id":4,"label":"de→—","count":1,"examples":[{"engine":"ancien_moteur","gt_fragment":"de","ocr_fragment":""}]}],"correlation_per_engine":[{"engine":"pero_ocr","labels":["cer","wer","mer","wil","quality_score","sharpness","ligature","diacritic"],"matrix":[[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],[0.0,0.0,0.0,0.0,1.0,0.9431,0.0,0.0],[0.0,0.0,0.0,0.0,0.9431,1.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]]},{"engine":"tesseract","labels":["cer","wer","mer","wil","quality_score","sharpness","ligature","diacritic"],"matrix":[[1.0,0.9789,0.9789,0.9409,-0.9919,-0.9791,0.0,0.0],[0.9789,1.0,1.0,0.9903,-0.9969,-0.9169,0.0,0.0],[0.9789,1.0,1.0,0.9903,-0.9969,-0.9169,0.0,0.0],[0.9409,0.9903,0.9903,1.0,-0.9763,-0.8525,0.0,0.0],[-0.9919,-0.9969,-0.9969,-0.9763,1.0,0.9455,0.0,0.0],[-0.9791,-0.9169,-0.9169,-0.8525,0.9455,1.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]]},{"engine":"ancien_moteur","labels":["cer","wer","mer"
,"wil","quality_score","sharpness","ligature","diacritic"],"matrix":[[1.0,0.4762,0.4762,-0.0421,-0.6408,-0.5172,0.0,0.0],[0.4762,1.0,1.0,0.8585,-0.9802,-0.9989,0.0,0.0],[0.4762,1.0,1.0,0.8585,-0.9802,-0.9989,0.0,0.0],[-0.0421,0.8585,0.8585,1.0,-0.7401,-0.8334,0.0,0.0],[-0.6408,-0.9802,-0.9802,-0.7401,1.0,0.9885,0.0,0.0],[-0.5172,-0.9989,-0.9989,-0.8334,0.9885,1.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]]},{"engine":"tesseract → gpt-4o","labels":["cer","wer","mer","wil","quality_score","sharpness","ligature","diacritic"],"matrix":[[1.0,0.7723,0.7723,0.3173,-0.9185,-0.5167,0.0,0.0],[0.7723,1.0,1.0,0.8475,-0.9605,0.1448,0.0,0.0],[0.7723,1.0,1.0,0.8475,-0.9605,0.1448,0.0,0.0],[0.3173,0.8475,0.8475,1.0,-0.6664,0.648,0.0,0.0],[-0.9185,-0.9605,-0.9605,-0.6664,1.0,0.1361,0.0,0.0],[-0.5167,0.1448,0.1448,0.648,0.1361,1.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]]}]};
</script>
<!-- ── Application ────────────────────────────────────────────────── -->
</main>
<footer>
+
Generated by <strong>Picarones</strong> v1.0.0
— BnF, Département numérique
— <span id="footer-date"></span>
</footer>
<!-- ── Embedded data ──────────────────────────────────────────── -->
<script>
+
const DATA = {"meta":{"corpus_name":"Corpus de test — Chroniques médiévales BnF","corpus_source":"/corpus/chroniques/","document_count":3,"run_date":"2026-03-05T15:58:04.169037+00:00","picarones_version":"1.0.0","metadata":{"description":"Données de démonstration générées par picarones.fixtures","script":"gothique textura","langue":"Français médiéval (XIVe-XVe siècle)","institution":"BnF — Département des manuscrits","_images_b64":{"folio_001":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAUAAAADcCAIAAACOIe9xAAAC4ElEQVR4nO3bsVFDUQxFQZf4y6EIiqAIiiJ0Qu7QCRFg6Z2ZvbMFKDmhbp8f70DUbf0C4NcEDGHPgL/vX0CCgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhbCjgy8yuS8Bm4QnYLLxqwMArCBjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjDPDGZzE7BZeAI2C68aMPAKAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIezfAn4zs79tM2BgnoAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjAP/WanTMBm4W0GDMwTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQ5qHf7JQJ2Cy8zYCBeQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMI89JudMgGbhbcZMDBPwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmId+s1MmYLPwNgMG5gkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAjz0G92ygRsFt5mwMA8AUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYR76zU6ZgM3C2wwYmCdgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCHMQ7/ZKROwWXibAQPzBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhHnoNztlAjYLbzNgYJ6AIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwD/1mp0zAZuFtBgzMEzCECRjCfggYyBEwhAkYwh59L5rsdXrQDgAAAABJRU5ErkJggg==","folio_002":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAUAAAADcCAIAAACOIe9xAAAC4ElEQVR4nO3bsVFDUQxFQZf4y6EIiqAIiiJ0Qu7QCRFg6Z2ZvbMFKDmhbp8f70DUbf0C4NcEDGHPgL/vX0CCgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhbCjgy8yuS8Bm4QnYLLxqwMArCBjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjDPDGZzE7BZeAI2C68aMPAKAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIezfAn4zs79tM2BgnoAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjAP/WanTMBm4W0GDMwTMIQJGMIEDGEChjAB
Q5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQ5qHf7JQJ2Cy8zYCBeQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMI89JudMgGbhbcZMDBPwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmId+s1MmYLPwNgMG5gkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAjz0G92ygRsFt5mwMA8AUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYR76zU6ZgM3C2wwYmCdgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCHMQ7/ZKROwWXibAQPzBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhHnoNztlAjYLbzNgYJ6AIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwD/1mp0zAZuFtBgzMEzCECRjCfggYyBEwhAkYwh59L5rsdXrQDgAAAABJRU5ErkJggg==","folio_003":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAUAAAADcCAIAAACOIe9xAAAC4ElEQVR4nO3bsVFDUQxFQZf4y6EIiqAIiiJ0Qu7QCRFg6Z2ZvbMFKDmhbp8f70DUbf0C4NcEDGHPgL/vX0CCgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhbCjgy8yuS8Bm4QnYLLxqwMArCBjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjDPDGZzE7BZeAI2C68aMPAKAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIezfAn4zs79tM2BgnoAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjAP/WanTMBm4W0GDMwTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQ5qHf7JQJ2Cy8zYCBeQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMI89JudMgGbhbcZMDBPwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmId+s1MmYLPwNgMG5gkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAjz0G92ygRsFt5mwMA8AUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYR76zU6ZgM3C2wwYmCdgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCHMQ7/ZKROwWXibAQPzBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhHnoNztlAjYLbzNgYJ6AIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwD/1mp0zAZuFtBgzMEzCECRjCfggYyBEwhAkYwh59L5rsdXrQDgAAAABJRU5ErkJggg=="}}},"ranking":[{"engine":"tesseract → 
gpt-4o","mean_cer":0.038091,"mean_wer":0.038095,"documents":3,"failed":0},{"engine":"tesseract","mean_cer":0.044933,"mean_wer":0.08254,"documents":3,"failed":0},{"engine":"ancien_moteur","mean_cer":0.179834,"mean_wer":0.288889,"documents":3,"failed":0},{"engine":"pero_ocr","mean_cer":0.0,"mean_wer":0.0,"documents":3,"failed":0}],"engines":[{"name":"pero_ocr","version":"0.7.2","cer":0.0,"wer":0.0,"mer":0.0,"wil":0.0,"cer_median":0.0,"cer_min":0.0,"cer_max":0.0,"doc_count":3,"failed":0,"cer_diplomatic":0.0,"cer_diplomatic_profile":"medieval_french","cer_values":[0.0,0.0,0.0],"cer_diplomatic_values":[0.0,0.0,0.0],"is_pipeline":false,"pipeline_info":{},"ligature_score":1.0,"diacritic_score":1.0,"aggregated_confusion":{"matrix":{},"total_substitutions":0,"total_insertions":0,"total_deletions":0},"aggregated_taxonomy":{"counts":{"visual_confusion":0,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":0,"oov_character":0,"lacuna":0},"total_errors":0,"class_distribution":{"visual_confusion":0.0,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.0,"segmentation_error":0.0,"oov_character":0.0,"lacuna":0.0}},"aggregated_structure":{"mean_line_fusion_rate":0.0,"mean_line_fragmentation_rate":0.0,"mean_reading_order_score":1.0,"mean_paragraph_conservation":1.0,"mean_line_accuracy":1.0,"document_count":3},"aggregated_image_quality":{"mean_quality_score":0.732,"mean_sharpness":0.614,"mean_noise_level":0.2979,"quality_distribution":{"good":2,"medium":1,"poor":0},"document_count":3,"scores":[0.5875,0.7747,0.8339]}},{"name":"tesseract","version":"5.3.3","cer":0.0449,"wer":0.0825,"mer":0.0825,"wil":0.139,"cer_median":0.01,"cer_min":0.009,"cer_max":0.1158,"doc_count":3,"failed":0,"cer_diplomatic":0.0513,"cer_diplomatic_profile":"medieval_french","cer_values":[0.1158,0.009,0.01],"cer_diplomatic_values":[0.125,0.009,0.0198],"is_pipeline":false,"pipeline_info":{},"ligature_score":1.0,"diacri
tic_score":1.0,"aggregated_confusion":{"matrix":{"c":{"∅":1},"r":{"∅":1},"o":{"∅":1},"n":{"∅":1},"i":{"∅":1},"q":{"∅":1},"u":{"∅":1},"e":{"∅":1},"s":{"∅":1},"&":{"8":2},"ſ":{"f":1}},"total_substitutions":3,"total_insertions":0,"total_deletions":9},"aggregated_taxonomy":{"counts":{"visual_confusion":1,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":2,"segmentation_error":0,"oov_character":0,"lacuna":1},"total_errors":4,"class_distribution":{"visual_confusion":0.25,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.5,"segmentation_error":0.0,"oov_character":0.0,"lacuna":0.25}},"aggregated_structure":{"mean_line_fusion_rate":0.0,"mean_line_fragmentation_rate":0.0,"mean_reading_order_score":0.9274,"mean_paragraph_conservation":1.0,"mean_line_accuracy":1.0,"document_count":3},"aggregated_image_quality":{"mean_quality_score":0.7363,"mean_sharpness":0.7263,"mean_noise_level":0.2437,"quality_distribution":{"good":2,"medium":1,"poor":0},"document_count":3,"scores":[0.5284,0.8648,0.8158]}},{"name":"ancien_moteur","version":"2.1.0","cer":0.1798,"wer":0.2889,"mer":0.2889,"wil":0.3963,"cer_median":0.09,"cer_min":0.0811,"cer_max":0.3684,"doc_count":3,"failed":0,"cer_diplomatic":0.1783,"cer_diplomatic_profile":"medieval_french","cer_values":[0.3684,0.0811,0.09],"cer_diplomatic_values":[0.3646,0.0811,0.0891],"is_pipeline":false,"pipeline_info":{},"ligature_score":1.0,"diacritic_score":1.0,"aggregated_confusion":{"matrix":{"p":{"∅":1},"r":{"∅":5,"z":1},"o":{"∅":3},"l":{"∅":1},"g":{"∅":1},"u":{"∅":1},"e":{"∅":5},"m":{"∅":2},"a":{"∅":3,"f":1,"w":1},"i":{"∅":2},"ſ":{"∅":3},"t":{"∅":3},"F":{"∅":2},"s":{"t":1},"n":{"∅":2},"c":{"∅":1},"E":{"∅":1},"x":{"f":1},"b":{"y":1},"J":{"z":1},"y":{"w":1},"I":{"∅":1},"d":{"∅":1}},"total_substitutions":8,"total_insertions":0,"total_deletions":38},"aggregated_taxonomy":{"counts":{"visual_confusion":1,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error
":0,"hapax":5,"segmentation_error":2,"oov_character":0,"lacuna":5},"total_errors":13,"class_distribution":{"visual_confusion":0.0769,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.3846,"segmentation_error":0.1538,"oov_character":0.0,"lacuna":0.3846}},"aggregated_structure":{"mean_line_fusion_rate":0.0,"mean_line_fragmentation_rate":0.0,"mean_reading_order_score":0.7697,"mean_paragraph_conservation":1.0,"mean_line_accuracy":1.0,"document_count":3},"aggregated_image_quality":{"mean_quality_score":0.4803,"mean_sharpness":0.4196,"mean_noise_level":0.4834,"quality_distribution":{"good":1,"medium":0,"poor":2},"document_count":3,"scores":[0.2888,0.388,0.7641]}},{"name":"tesseract → gpt-4o","version":"ocr=5.3.3; llm=gpt-4o","cer":0.0381,"wer":0.0381,"mer":0.0381,"wil":0.0532,"cer_median":0.009,"cer_min":0.0,"cer_max":0.1053,"doc_count":3,"failed":0,"cer_diplomatic":0.0377,"cer_diplomatic_profile":"medieval_french","cer_values":[0.1053,0.009,0.0],"cer_diplomatic_values":[0.1042,0.009,0.0],"is_pipeline":true,"pipeline_info":{"pipeline_mode":"text_and_image","prompt_file":"correction_medieval_french.txt","llm_model":"gpt-4o","llm_provider":"openai","pipeline_steps":[{"type":"ocr","engine":"tesseract","version":"5.3.3"},{"type":"llm","model":"gpt-4o","provider":"openai","mode":"text_and_image","prompt_file":"correction_medieval_french.txt"}],"over_normalization":{"score":0.0,"total_correct_ocr_words":44,"over_normalized_count":0,"document_count":3}},"ligature_score":1.0,"diacritic_score":1.0,"aggregated_confusion":{"matrix":{"c":{"∅":1},"r":{"∅":1},"o":{"∅":1},"n":{"∅":1},"i":{"∅":1},"q":{"∅":1},"u":{"∅":1},"e":{"∅":1},"s":{"∅":1},"ſ":{"f":1}},"total_substitutions":1,"total_insertions":0,"total_deletions":9},"aggregated_taxonomy":{"counts":{"visual_confusion":1,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":0,"oov_character":0,"lacuna":1},"total_errors":2,"class_distribu
tion":{"visual_confusion":0.5,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.0,"segmentation_error":0.0,"oov_character":0.0,"lacuna":0.5}},"aggregated_structure":{"mean_line_fusion_rate":0.0,"mean_line_fragmentation_rate":0.0,"mean_reading_order_score":0.9726,"mean_paragraph_conservation":1.0,"mean_line_accuracy":1.0,"document_count":3},"aggregated_image_quality":{"mean_quality_score":0.6755,"mean_sharpness":0.7034,"mean_noise_level":0.2303,"quality_distribution":{"good":1,"medium":2,"poor":0},"document_count":3,"scores":[0.6047,0.6787,0.7431]}}],"documents":[{"doc_id":"folio_001","image_path":"/corpus/images/folio_001.jpg","image_b64":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAUAAAADcCAIAAACOIe9xAAAC4ElEQVR4nO3bsVFDUQxFQZf4y6EIiqAIiiJ0Qu7QCRFg6Z2ZvbMFKDmhbp8f70DUbf0C4NcEDGHPgL/vX0CCgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhbCjgy8yuS8Bm4QnYLLxqwMArCBjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjDPDGZzE7BZeAI2C68aMPAKAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIezfAn4zs79tM2BgnoAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjAP/WanTMBm4W0GDMwTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQ5qHf7JQJ2Cy8zYCBeQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMI89JudMgGbhbcZMDBPwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmId+s1MmYLPwNgMG5gkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAjz0G92ygRsFt5mwMA8AUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYR76zU6ZgM3C2wwYmCdgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCHMQ7/ZKROwWXibAQPzBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhHnoNztlAjYLbzNgYJ6AIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwD/1mp0zAZuFtBgzMEzCECRjCfggYyBEwhAkYwh59L5rsdXrQDgAAAABJRU5ErkJggg==","ground_truth":"Icy commence le prologue de maiſtre Jehan Froiſſart ſus les croniques de France & d'Angleterre.","mean_cer":0.1474,"best_engine":"pero_ocr","engine_results":[{"engine":"pero_ocr","hypothesis":"Icy commence 
le prologue de maiſtre Jehan Froiſſart ſus les croniques de France & d'Angleterre.","cer":0.0,"cer_diplomatic":0.0,"wer":0.0,"duration":0.405,"error":null,"diff":[{"op":"equal","text":"Icy commence le prologue de maiſtre Jehan Froiſſart ſus les croniques de France & d'Angleterre."}],"ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":0,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":0,"oov_character":0,"lacuna":0},"total_errors":0,"class_distribution":{},"examples":{"visual_confusion":[],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[],"segmentation_error":[],"oov_character":[],"lacuna":[]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":1.0,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.5031,"noise_level":0.4962,"rotation_degrees":0.05,"contrast_score":0.6198,"quality_score":0.5875,"quality_tier":"medium","analysis_method":"mock","script_type":"gothique textura"}},{"engine":"tesseract","hypothesis":"Icy commence le prologue de maiſtre Jehan Froiſſart ſus les de France 8 d'Angleterre.","cer":0.1158,"cer_diplomatic":0.125,"wer":0.1333,"duration":0.411,"error":null,"diff":[{"op":"equal","text":"Icy commence le prologue de maiſtre Jehan Froiſſart ſus les"},{"op":"delete","text":"croniques"},{"op":"equal","text":"de 
France"},{"op":"replace","old":"&","new":"8"},{"op":"equal","text":"d'Angleterre."}],"ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":0,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":1,"segmentation_error":0,"oov_character":0,"lacuna":1},"total_errors":2,"class_distribution":{"visual_confusion":0.0,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.5,"segmentation_error":0.0,"oov_character":0.0,"lacuna":0.5},"examples":{"visual_confusion":[],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[{"gt":"&","ocr":"8"}],"segmentation_error":[],"oov_character":[],"lacuna":[{"gt":"croniques","ocr":"","position":10}]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":0.8966,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.6518,"noise_level":0.495,"rotation_degrees":-1.34,"contrast_score":0.2668,"quality_score":0.5284,"quality_tier":"medium","analysis_method":"mock","script_type":"gothique textura"}},{"engine":"ancien_moteur","hypothesis":"Icy commence le de Jehan ſut les croniques de & d'Angleterre.","cer":0.3684,"cer_diplomatic":0.3646,"wer":0.3333,"duration":3.892,"error":null,"diff":[{"op":"equal","text":"Icy commence le"},{"op":"delete","text":"prologue"},{"op":"equal","text":"de"},{"op":"delete","text":"maiſtre"},{"op":"equal","text":"Jehan"},{"op":"replace","old":"Froiſſart ſus","new":"ſut"},{"op":"equal","text":"les croniques de"},{"op":"delete","text":"France"},{"op":"equal","text":"& 
d'Angleterre."}],"ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":0,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":1,"oov_character":0,"lacuna":3},"total_errors":4,"class_distribution":{"visual_confusion":0.0,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.0,"segmentation_error":0.25,"oov_character":0.0,"lacuna":0.75},"examples":{"visual_confusion":[],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[],"segmentation_error":[{"gt":"Froiſſart ſus","ocr":"ſut","position":7}],"oov_character":[],"lacuna":[{"gt":"prologue","ocr":"","position":3},{"gt":"maiſtre","ocr":"","position":5},{"gt":"France","ocr":"","position":12}]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":0.7692,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.1537,"noise_level":0.5589,"rotation_degrees":-2.09,"contrast_score":0.2,"quality_score":0.2888,"quality_tier":"poor","analysis_method":"mock","script_type":"gothique textura"}},{"engine":"tesseract → gpt-4o","hypothesis":"Icy commence le prologue de maiſtre Jehan Froiſſart ſus les de France & d'Angleterre.","cer":0.1053,"cer_diplomatic":0.1042,"wer":0.0667,"duration":11.725,"error":null,"diff":[{"op":"equal","text":"Icy commence le prologue de maiſtre Jehan Froiſſart ſus les"},{"op":"delete","text":"croniques"},{"op":"equal","text":"de France & d'Angleterre."}],"ocr_intermediate":"Icy commence le prologue de maiſtre Jehan Froiſſart ſus les de France 8 d'Angleterre.","ocr_diff":[{"op":"equal","text":"Icy commence le prologue de maiſtre Jehan Froiſſart ſus les"},{"op":"delete","text":"croniques"},{"op":"equal","text":"de 
France"},{"op":"replace","old":"&","new":"8"},{"op":"equal","text":"d'Angleterre."}],"llm_correction_diff":[{"op":"equal","text":"Icy commence le prologue de maiſtre Jehan Froiſſart ſus les de France"},{"op":"replace","old":"8","new":"&"},{"op":"equal","text":"d'Angleterre."}],"over_normalization":{"score":0.0,"total_correct_ocr_words":10,"over_normalized_count":0,"over_normalized_passages":[]},"pipeline_mode":"text_and_image","ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":0,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":0,"oov_character":0,"lacuna":1},"total_errors":1,"class_distribution":{"visual_confusion":0.0,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.0,"segmentation_error":0.0,"oov_character":0.0,"lacuna":1.0},"examples":{"visual_confusion":[],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[],"segmentation_error":[],"oov_character":[],"lacuna":[{"gt":"croniques","ocr":"","position":10}]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":0.9655,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.6971,"noise_level":0.3585,"rotation_degrees":2.94,"contrast_score":0.4231,"quality_score":0.6047,"quality_tier":"medium","analysis_method":"mock","script_type":"gothique textura"}}],"script_type":"gothique 
textura","difficulty_score":0.2072,"difficulty_label":"Facile"},{"doc_id":"folio_002","image_path":"/corpus/images/folio_002.jpg","image_b64":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAUAAAADcCAIAAACOIe9xAAAC4ElEQVR4nO3bsVFDUQxFQZf4y6EIiqAIiiJ0Qu7QCRFg6Z2ZvbMFKDmhbp8f70DUbf0C4NcEDGHPgL/vX0CCgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhbCjgy8yuS8Bm4QnYLLxqwMArCBjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjDPDGZzE7BZeAI2C68aMPAKAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIezfAn4zs79tM2BgnoAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjAP/WanTMBm4W0GDMwTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQ5qHf7JQJ2Cy8zYCBeQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMI89JudMgGbhbcZMDBPwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmId+s1MmYLPwNgMG5gkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAjz0G92ygRsFt5mwMA8AUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYR76zU6ZgM3C2wwYmCdgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCHMQ7/ZKROwWXibAQPzBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhHnoNztlAjYLbzNgYJ6AIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwD/1mp0zAZuFtBgzMEzCECRjCfggYyBEwhAkYwh59L5rsdXrQDgAAAABJRU5ErkJggg==","ground_truth":"En l'an de grace mil trois cens ſoixante, regnoit en France le noble roy Jehan, filz du roy Phelippe de Valois.","mean_cer":0.0248,"best_engine":"pero_ocr","engine_results":[{"engine":"pero_ocr","hypothesis":"En l'an de grace mil trois cens ſoixante, regnoit en France le noble roy Jehan, filz du roy Phelippe de Valois.","cer":0.0,"cer_diplomatic":0.0,"wer":0.0,"duration":0.886,"error":null,"diff":[{"op":"equal","text":"En l'an de grace mil trois cens ſoixante, regnoit en France le noble roy Jehan, filz du roy Phelippe de 
Valois."}],"ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":0,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":0,"oov_character":0,"lacuna":0},"total_errors":0,"class_distribution":{},"examples":{"visual_confusion":[],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[],"segmentation_error":[],"oov_character":[],"lacuna":[]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":1.0,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.6798,"noise_level":0.2595,"rotation_degrees":1.37,"contrast_score":0.8946,"quality_score":0.7747,"quality_tier":"good","analysis_method":"mock","script_type":"humanistique"}},{"engine":"tesseract","hypothesis":"En l'an de grace mil trois cens foixante, regnoit en France le noble roy Jehan, filz du roy Phelippe de Valois.","cer":0.009,"cer_diplomatic":0.009,"wer":0.0476,"duration":0.971,"error":null,"diff":[{"op":"equal","text":"En l'an de grace mil trois cens"},{"op":"replace","old":"ſoixante,","new":"foixante,"},{"op":"equal","text":"regnoit en France le noble roy Jehan, filz du roy Phelippe de 
Valois."}],"ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":1,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":0,"oov_character":0,"lacuna":0},"total_errors":1,"class_distribution":{"visual_confusion":1.0,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.0,"segmentation_error":0.0,"oov_character":0.0,"lacuna":0.0},"examples":{"visual_confusion":[{"gt":"ſoixante,","ocr":"foixante,"}],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[],"segmentation_error":[],"oov_character":[],"lacuna":[]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":0.9524,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.7507,"noise_level":0.0967,"rotation_degrees":0.68,"contrast_score":0.969,"quality_score":0.8648,"quality_tier":"good","analysis_method":"mock","script_type":"humanistique"}},{"engine":"ancien_moteur","hypothesis":"l'fn de grwce mil trois cens ſoifante, regnoit en France le noyle roy zehan, filz du row Phelippe de Valois.","cer":0.0811,"cer_diplomatic":0.0811,"wer":0.3333,"duration":2.227,"error":null,"diff":[{"op":"replace","old":"En l'an","new":"l'fn"},{"op":"equal","text":"de"},{"op":"replace","old":"grace","new":"grwce"},{"op":"equal","text":"mil trois cens"},{"op":"replace","old":"ſoixante,","new":"ſoifante,"},{"op":"equal","text":"regnoit en France le"},{"op":"replace","old":"noble","new":"noyle"},{"op":"equal","text":"roy"},{"op":"replace","old":"Jehan,","new":"zehan,"},{"op":"equal","text":"filz du"},{"op":"replace","old":"roy","new":"row"},{"op":"equal","text":"Phelippe de 
Valois."}],"ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":0,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":5,"segmentation_error":1,"oov_character":0,"lacuna":0},"total_errors":6,"class_distribution":{"visual_confusion":0.0,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.8333,"segmentation_error":0.1667,"oov_character":0.0,"lacuna":0.0},"examples":{"visual_confusion":[],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[{"gt":"grace","ocr":"grwce"},{"gt":"ſoixante,","ocr":"ſoifante,"},{"gt":"noble","ocr":"noyle"}],"segmentation_error":[{"gt":"En l'an","ocr":"l'fn","position":0}],"oov_character":[],"lacuna":[]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":0.6829,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.1939,"noise_level":0.5855,"rotation_degrees":-0.28,"contrast_score":0.4345,"quality_score":0.388,"quality_tier":"poor","analysis_method":"mock","script_type":"humanistique"}},{"engine":"tesseract → gpt-4o","hypothesis":"En l'an de grace mil trois cens foixante, regnoit en France le noble roy Jehan, filz du roy Phelippe de Valois.","cer":0.009,"cer_diplomatic":0.009,"wer":0.0476,"duration":8.963,"error":null,"diff":[{"op":"equal","text":"En l'an de grace mil trois cens"},{"op":"replace","old":"ſoixante,","new":"foixante,"},{"op":"equal","text":"regnoit en France le noble roy Jehan, filz du roy Phelippe de Valois."}],"ocr_intermediate":"En l'an de grace mil trois cens foixante, regnoit en France le noble roy Jehan, filz du roy Phelippe de Valois.","ocr_diff":[{"op":"equal","text":"En l'an de grace mil trois cens"},{"op":"replace","old":"ſoixante,","new":"foixante,"},{"op":"equal","text":"regnoit en France le noble roy Jehan, filz 
du roy Phelippe de Valois."}],"llm_correction_diff":[{"op":"equal","text":"En l'an de grace mil trois cens foixante, regnoit en France le noble roy Jehan, filz du roy Phelippe de Valois."}],"over_normalization":{"score":0.0,"total_correct_ocr_words":20,"over_normalized_count":0,"over_normalized_passages":[]},"pipeline_mode":"text_and_image","ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":1,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":0,"oov_character":0,"lacuna":0},"total_errors":1,"class_distribution":{"visual_confusion":1.0,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.0,"segmentation_error":0.0,"oov_character":0.0,"lacuna":0.0},"examples":{"visual_confusion":[{"gt":"ſoixante,","ocr":"foixante,"}],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[],"segmentation_error":[],"oov_character":[],"lacuna":[]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":0.9524,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.7141,"noise_level":0.3019,"rotation_degrees":0.75,"contrast_score":0.5365,"quality_score":0.6787,"quality_tier":"medium","analysis_method":"mock","script_type":"humanistique"}}],"script_type":"humanistique","difficulty_score":0.1209,"difficulty_label":"Facile"},{"doc_id":"folio_003","image_path":"/corpus/images/folio_003.jpg","image_b64":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAUAAAADcCAIAAACOIe9xAAAC4ElEQVR4nO3bsVFDUQxFQZf4y6EIiqAIiiJ0Qu7QCRFg6Z2ZvbMFKDmhbp8f70DUbf0C4NcEDGHPgL/vX0CCgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhbCjgy8yuS8Bm4QnYLLxqwMArCBjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjDPDGZzE7BZeAI2C68aMPAKAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIezfAn4zs79tM2BgnoAhTMAQJmAIEzC
ECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjAP/WanTMBm4W0GDMwTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQ5qHf7JQJ2Cy8zYCBeQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMI89JudMgGbhbcZMDBPwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYQKGMAFDmId+s1MmYLPwNgMG5gkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCFMwBAmYAjz0G92ygRsFt5mwMA8AUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhAkYwgQMYR76zU6ZgM3C2wwYmCdgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwAUOYgCHMQ7/ZKROwWXibAQPzBAxhAoYwAUOYgCFMwBAmYAgTMIQJGMIEDGEChjABQ5iAIUzAECZgCBMwhHnoNztlAjYLbzNgYJ6AIUzAECZgCBMwhAkYwgQMYQKGMAFDmIAhTMAQJmAIEzCECRjCBAxhAoYwD/1mp0zAZuFtBgzMEzCECRjCfggYyBEwhAkYwh59L5rsdXrQDgAAAABJRU5ErkJggg==","ground_truth":"Item ledit iour furent menez en ladicte ville de Paris pluſieurs priſonniers ſaraſins & mahommetans.","mean_cer":0.025,"best_engine":"pero_ocr","engine_results":[{"engine":"pero_ocr","hypothesis":"Item ledit iour furent menez en ladicte ville de Paris pluſieurs priſonniers ſaraſins & mahommetans.","cer":0.0,"cer_diplomatic":0.0,"wer":0.0,"duration":2.78,"error":null,"diff":[{"op":"equal","text":"Item ledit iour furent menez en ladicte ville de Paris pluſieurs priſonniers ſaraſins & 
mahommetans."}],"ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":0,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":0,"oov_character":0,"lacuna":0},"total_errors":0,"class_distribution":{},"examples":{"visual_confusion":[],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[],"segmentation_error":[],"oov_character":[],"lacuna":[]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":1.0,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.6592,"noise_level":0.138,"rotation_degrees":-0.22,"contrast_score":1.0,"quality_score":0.8339,"quality_tier":"good","analysis_method":"mock","script_type":"cursive administrative"}},{"engine":"tesseract","hypothesis":"Item ledit iour furent menez en ladicte ville de Paris pluſieurs priſonniers ſaraſins 8 mahommetans.","cer":0.01,"cer_diplomatic":0.0198,"wer":0.0667,"duration":0.69,"error":null,"diff":[{"op":"equal","text":"Item ledit iour furent menez en ladicte ville de Paris pluſieurs priſonniers 
ſaraſins"},{"op":"replace","old":"&","new":"8"},{"op":"equal","text":"mahommetans."}],"ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":0,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":1,"segmentation_error":0,"oov_character":0,"lacuna":0},"total_errors":1,"class_distribution":{"visual_confusion":0.0,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":1.0,"segmentation_error":0.0,"oov_character":0.0,"lacuna":0.0},"examples":{"visual_confusion":[],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[{"gt":"&","ocr":"8"}],"segmentation_error":[],"oov_character":[],"lacuna":[]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":0.9333,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.7764,"noise_level":0.1395,"rotation_degrees":-0.69,"contrast_score":0.8002,"quality_score":0.8158,"quality_tier":"good","analysis_method":"mock","script_type":"cursive administrative"}},{"engine":"ancien_moteur","hypothesis":"ledit iour furent menez en ladicte ville Paris pluſieurs priſonniers ſazaſins & mahommetans.","cer":0.09,"cer_diplomatic":0.0891,"wer":0.2,"duration":2.803,"error":null,"diff":[{"op":"delete","text":"Item"},{"op":"equal","text":"ledit iour furent menez en ladicte ville"},{"op":"delete","text":"de"},{"op":"equal","text":"Paris pluſieurs priſonniers"},{"op":"replace","old":"ſaraſins","new":"ſazaſins"},{"op":"equal","text":"& 
mahommetans."}],"ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":1,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":0,"oov_character":0,"lacuna":2},"total_errors":3,"class_distribution":{"visual_confusion":0.3333,"diacritic_error":0.0,"case_error":0.0,"ligature_error":0.0,"abbreviation_error":0.0,"hapax":0.0,"segmentation_error":0.0,"oov_character":0.0,"lacuna":0.6667},"examples":{"visual_confusion":[{"gt":"ſaraſins","ocr":"ſazaſins"}],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[],"segmentation_error":[],"oov_character":[],"lacuna":[{"gt":"Item","ocr":"","position":0},{"gt":"de","ocr":"","position":8}]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":0.8571,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.9112,"noise_level":0.3059,"rotation_degrees":1.1,"contrast_score":0.5727,"quality_score":0.7641,"quality_tier":"good","analysis_method":"mock","script_type":"cursive administrative"}},{"engine":"tesseract → gpt-4o","hypothesis":"Item ledit iour furent menez en ladicte ville de Paris pluſieurs priſonniers ſaraſins & mahommetans.","cer":0.0,"cer_diplomatic":0.0,"wer":0.0,"duration":7.601,"error":null,"diff":[{"op":"equal","text":"Item ledit iour furent menez en ladicte ville de Paris pluſieurs priſonniers ſaraſins & mahommetans."}],"ocr_intermediate":"Item ledit iour furent menez en ladicte ville de Paris pluſieurs priſonniers ſaraſins 8 mahommetans.","ocr_diff":[{"op":"equal","text":"Item ledit iour furent menez en ladicte ville de Paris pluſieurs priſonniers ſaraſins"},{"op":"replace","old":"&","new":"8"},{"op":"equal","text":"mahommetans."}],"llm_correction_diff":[{"op":"equal","text":"Item ledit iour furent menez en ladicte ville de Paris pluſieurs 
priſonniers ſaraſins"},{"op":"replace","old":"8","new":"&"},{"op":"equal","text":"mahommetans."}],"over_normalization":{"score":0.0,"total_correct_ocr_words":14,"over_normalized_count":0,"over_normalized_passages":[]},"pipeline_mode":"text_and_image","ligature_score":1.0,"diacritic_score":1.0,"taxonomy":{"counts":{"visual_confusion":0,"diacritic_error":0,"case_error":0,"ligature_error":0,"abbreviation_error":0,"hapax":0,"segmentation_error":0,"oov_character":0,"lacuna":0},"total_errors":0,"class_distribution":{},"examples":{"visual_confusion":[],"diacritic_error":[],"case_error":[],"ligature_error":[],"abbreviation_error":[],"hapax":[],"segmentation_error":[],"oov_character":[],"lacuna":[]}},"structure":{"gt_line_count":1,"ocr_line_count":1,"line_fusion_count":0,"line_fragmentation_count":0,"line_fusion_rate":0.0,"line_fragmentation_rate":0.0,"line_accuracy":1.0,"reading_order_score":1.0,"paragraph_conservation_score":1.0},"image_quality":{"sharpness_score":0.6989,"noise_level":0.0306,"rotation_degrees":2.4,"contrast_score":0.6456,"quality_score":0.7431,"quality_tier":"good","analysis_method":"mock","script_type":"cursive administrative"}}],"script_type":"cursive administrative","difficulty_score":0.1297,"difficulty_label":"Facile"}],"statistics":{"pairwise_wilcoxon":[{"engine_a":"pero_ocr","engine_b":"tesseract","statistic":0,"p_value":0.04,"significant":true,"interpretation":"Différence statistiquement significative (p = 0.0400 < 0.05). Le premier concurrent obtient de meilleurs scores.","n_pairs":3,"W_plus":0,"W_minus":6.0},{"engine_a":"pero_ocr","engine_b":"ancien_moteur","statistic":0,"p_value":0.04,"significant":true,"interpretation":"Différence statistiquement significative (p = 0.0400 < 0.05). 
Le premier concurrent obtient de meilleurs scores.","n_pairs":3,"W_plus":0,"W_minus":6.0},{"engine_a":"pero_ocr","engine_b":"tesseract → gpt-4o","statistic":0,"p_value":0.04,"significant":true,"interpretation":"Différence statistiquement significative (p = 0.0400 < 0.05). Le premier concurrent obtient de meilleurs scores.","n_pairs":2,"W_plus":0,"W_minus":3.0},{"engine_a":"tesseract","engine_b":"ancien_moteur","statistic":0,"p_value":0.04,"significant":true,"interpretation":"Différence statistiquement significative (p = 0.0400 < 0.05). Le premier concurrent obtient de meilleurs scores.","n_pairs":3,"W_plus":0,"W_minus":6.0},{"engine_a":"tesseract","engine_b":"tesseract → gpt-4o","statistic":0,"p_value":0.04,"significant":true,"interpretation":"Différence statistiquement significative (p = 0.0400 < 0.05). Le second concurrent obtient de meilleurs scores.","n_pairs":2,"W_plus":3.0,"W_minus":0},{"engine_a":"ancien_moteur","engine_b":"tesseract → gpt-4o","statistic":0,"p_value":0.04,"significant":true,"interpretation":"Différence statistiquement significative (p = 0.0400 < 0.05). 
Le second concurrent obtient de meilleurs scores.","n_pairs":3,"W_plus":6.0,"W_minus":0}],"bootstrap_cis":[{"engine":"pero_ocr","mean":0.0,"ci_lower":0.0,"ci_upper":0.0},{"engine":"tesseract","mean":0.0449,"ci_lower":0.009,"ci_upper":0.1158},{"engine":"ancien_moteur","mean":0.1798,"ci_lower":0.0811,"ci_upper":0.3684},{"engine":"tesseract → gpt-4o","mean":0.0381,"ci_lower":0.0,"ci_upper":0.1053}]},"reliability_curves":[{"engine":"pero_ocr","points":[{"pct_docs":5.0,"mean_cer":0.0},{"pct_docs":10.0,"mean_cer":0.0},{"pct_docs":15.0,"mean_cer":0.0},{"pct_docs":20.0,"mean_cer":0.0},{"pct_docs":25.0,"mean_cer":0.0},{"pct_docs":30.0,"mean_cer":0.0},{"pct_docs":35.0,"mean_cer":0.0},{"pct_docs":40.0,"mean_cer":0.0},{"pct_docs":45.0,"mean_cer":0.0},{"pct_docs":50.0,"mean_cer":0.0},{"pct_docs":55.0,"mean_cer":0.0},{"pct_docs":60.0,"mean_cer":0.0},{"pct_docs":65.0,"mean_cer":0.0},{"pct_docs":70.0,"mean_cer":0.0},{"pct_docs":75.0,"mean_cer":0.0},{"pct_docs":80.0,"mean_cer":0.0},{"pct_docs":85.0,"mean_cer":0.0},{"pct_docs":90.0,"mean_cer":0.0},{"pct_docs":95.0,"mean_cer":0.0},{"pct_docs":100.0,"mean_cer":0.0}]},{"engine":"tesseract","points":[{"pct_docs":5.0,"mean_cer":0.009},{"pct_docs":10.0,"mean_cer":0.009},{"pct_docs":15.0,"mean_cer":0.009},{"pct_docs":20.0,"mean_cer":0.009},{"pct_docs":25.0,"mean_cer":0.009},{"pct_docs":30.0,"mean_cer":0.009},{"pct_docs":35.0,"mean_cer":0.009},{"pct_docs":40.0,"mean_cer":0.009},{"pct_docs":45.0,"mean_cer":0.009},{"pct_docs":50.0,"mean_cer":0.009},{"pct_docs":55.0,"mean_cer":0.009},{"pct_docs":60.0,"mean_cer":0.009},{"pct_docs":65.0,"mean_cer":0.009},{"pct_docs":70.0,"mean_cer":0.0095},{"pct_docs":75.0,"mean_cer":0.0095},{"pct_docs":80.0,"mean_cer":0.0095},{"pct_docs":85.0,"mean_cer":0.0095},{"pct_docs":90.0,"mean_cer":0.0095},{"pct_docs":95.0,"mean_cer":0.0095},{"pct_docs":100.0,"mean_cer":0.044933}]},{"engine":"ancien_moteur","points":[{"pct_docs":5.0,"mean_cer":0.0811},{"pct_docs":10.0,"mean_cer":0.0811},{"pct_docs":15.0,"mean_cer":0.0811}
,{"pct_docs":20.0,"mean_cer":0.0811},{"pct_docs":25.0,"mean_cer":0.0811},{"pct_docs":30.0,"mean_cer":0.0811},{"pct_docs":35.0,"mean_cer":0.0811},{"pct_docs":40.0,"mean_cer":0.0811},{"pct_docs":45.0,"mean_cer":0.0811},{"pct_docs":50.0,"mean_cer":0.0811},{"pct_docs":55.0,"mean_cer":0.0811},{"pct_docs":60.0,"mean_cer":0.0811},{"pct_docs":65.0,"mean_cer":0.0811},{"pct_docs":70.0,"mean_cer":0.08555},{"pct_docs":75.0,"mean_cer":0.08555},{"pct_docs":80.0,"mean_cer":0.08555},{"pct_docs":85.0,"mean_cer":0.08555},{"pct_docs":90.0,"mean_cer":0.08555},{"pct_docs":95.0,"mean_cer":0.08555},{"pct_docs":100.0,"mean_cer":0.179833}]},{"engine":"tesseract → gpt-4o","points":[{"pct_docs":5.0,"mean_cer":0.0},{"pct_docs":10.0,"mean_cer":0.0},{"pct_docs":15.0,"mean_cer":0.0},{"pct_docs":20.0,"mean_cer":0.0},{"pct_docs":25.0,"mean_cer":0.0},{"pct_docs":30.0,"mean_cer":0.0},{"pct_docs":35.0,"mean_cer":0.0},{"pct_docs":40.0,"mean_cer":0.0},{"pct_docs":45.0,"mean_cer":0.0},{"pct_docs":50.0,"mean_cer":0.0},{"pct_docs":55.0,"mean_cer":0.0},{"pct_docs":60.0,"mean_cer":0.0},{"pct_docs":65.0,"mean_cer":0.0},{"pct_docs":70.0,"mean_cer":0.0045},{"pct_docs":75.0,"mean_cer":0.0045},{"pct_docs":80.0,"mean_cer":0.0045},{"pct_docs":85.0,"mean_cer":0.0045},{"pct_docs":90.0,"mean_cer":0.0045},{"pct_docs":95.0,"mean_cer":0.0045},{"pct_docs":100.0,"mean_cer":0.0381}]}],"venn_data":{"type":"venn3","label_a":"pero_ocr","label_b":"tesseract","label_c":"ancien_moteur","only_a":0,"only_b":4,"only_c":13,"ab":0,"ac":0,"bc":0,"abc":0},"error_clusters":[{"cluster_id":5,"label":"autres 
substitutions","count":13,"examples":[{"engine":"tesseract","gt_fragment":"croniques","ocr_fragment":""},{"engine":"tesseract","gt_fragment":"ſoixante,","ocr_fragment":"foixante,"},{"engine":"ancien_moteur","gt_fragment":"prologue","ocr_fragment":""},{"engine":"ancien_moteur","gt_fragment":"maiſtre","ocr_fragment":""},{"engine":"ancien_moteur","gt_fragment":"France","ocr_fragment":""}]},{"cluster_id":1,"label":"&→8","count":2,"examples":[{"engine":"tesseract","gt_fragment":"&","ocr_fragment":"8"},{"engine":"tesseract","gt_fragment":"&","ocr_fragment":"8"}]},{"cluster_id":2,"label":"confusion ſ/f/s","count":2,"examples":[{"engine":"ancien_moteur","gt_fragment":"Froiſſart ſus","ocr_fragment":"ſut"},{"engine":"ancien_moteur","gt_fragment":"ſaraſins","ocr_fragment":"ſazaſins"}]},{"cluster_id":3,"label":"roy→row","count":1,"examples":[{"engine":"ancien_moteur","gt_fragment":"roy","ocr_fragment":"row"}]},{"cluster_id":4,"label":"de→—","count":1,"examples":[{"engine":"ancien_moteur","gt_fragment":"de","ocr_fragment":""}]}],"correlation_per_engine":[{"engine":"pero_ocr","labels":["cer","wer","mer","wil","quality_score","sharpness","ligature","diacritic"],"matrix":[[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],[0.0,0.0,0.0,0.0,1.0,0.9431,0.0,0.0],[0.0,0.0,0.0,0.0,0.9431,1.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]]},{"engine":"tesseract","labels":["cer","wer","mer","wil","quality_score","sharpness","ligature","diacritic"],"matrix":[[1.0,0.9789,0.9789,0.9409,-0.9919,-0.9791,0.0,0.0],[0.9789,1.0,1.0,0.9903,-0.9969,-0.9169,0.0,0.0],[0.9789,1.0,1.0,0.9903,-0.9969,-0.9169,0.0,0.0],[0.9409,0.9903,0.9903,1.0,-0.9763,-0.8525,0.0,0.0],[-0.9919,-0.9969,-0.9969,-0.9763,1.0,0.9455,0.0,0.0],[-0.9791,-0.9169,-0.9169,-0.8525,0.9455,1.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]]},{"engine":"ancien_moteur","labels":["cer","wer","mer"
,"wil","quality_score","sharpness","ligature","diacritic"],"matrix":[[1.0,0.4762,0.4762,-0.0421,-0.6408,-0.5172,0.0,0.0],[0.4762,1.0,1.0,0.8585,-0.9802,-0.9989,0.0,0.0],[0.4762,1.0,1.0,0.8585,-0.9802,-0.9989,0.0,0.0],[-0.0421,0.8585,0.8585,1.0,-0.7401,-0.8334,0.0,0.0],[-0.6408,-0.9802,-0.9802,-0.7401,1.0,0.9885,0.0,0.0],[-0.5172,-0.9989,-0.9989,-0.8334,0.9885,1.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]]},{"engine":"tesseract → gpt-4o","labels":["cer","wer","mer","wil","quality_score","sharpness","ligature","diacritic"],"matrix":[[1.0,0.7723,0.7723,0.3173,-0.9185,-0.5167,0.0,0.0],[0.7723,1.0,1.0,0.8475,-0.9605,0.1448,0.0,0.0],[0.7723,1.0,1.0,0.8475,-0.9605,0.1448,0.0,0.0],[0.3173,0.8475,0.8475,1.0,-0.6664,0.648,0.0,0.0],[-0.9185,-0.9605,-0.9605,-0.6664,1.0,0.1361,0.0,0.0],[-0.5167,0.1448,0.1448,0.648,0.1361,1.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]]}]};
</script>

<!-- ── Application ────────────────────────────────────────────────── -->
@@ -0,0 +1,415 @@
+"""Sprint 9 tests: documentation, packaging, and final integration.
+
+Test classes
+------------
+TestVersion (4 tests): version is consistent across all files
+TestMainModule (3 tests): python -m picarones works
+TestMakefile (6 tests): Makefile syntax and targets
+TestDockerfile (6 tests): Dockerfile structure and commands
+TestDockerCompose (5 tests): docker-compose.yml structure
+TestCIWorkflow (6 tests): .github/workflows/ci.yml structure
+TestPyInstallerSpec (4 tests): picarones.spec structure
+TestCLIDemoEndToEnd (6 tests): picarones demo end to end
+TestReadme (5 tests): README.md is complete and bilingual
+TestInstallMd (4 tests): INSTALL.md content
+TestChangelog (5 tests): CHANGELOG.md content and structure
+TestContributing (4 tests): CONTRIBUTING.md content
+"""
+
+from __future__ import annotations
+
+import re
+from pathlib import Path
+import pytest
+
+ROOT = Path(__file__).parent.parent
+
+
+# ===========================================================================
+# TestVersion
+# ===========================================================================
+
+class TestVersion:
+
+    def test_version_in_init(self):
+        from picarones import __version__
+        assert __version__ == "1.0.0"
+
+    def test_version_in_pyproject(self):
+        pyproject = (ROOT / "pyproject.toml").read_text(encoding="utf-8")
+        assert 'version = "1.0.0"' in pyproject
+
+    def test_version_cli(self):
+        from click.testing import CliRunner
+        from picarones.cli import cli
+        runner = CliRunner()
+        result = runner.invoke(cli, ["--version"])
+        assert result.exit_code == 0
+        assert "1.0.0" in result.output
+
+    def test_version_consistent(self):
+        """The version in __init__.py and pyproject.toml must be identical."""
+        from picarones import __version__
+        pyproject = (ROOT / "pyproject.toml").read_text(encoding="utf-8")
+        m = re.search(r'version\s*=\s*"([^"]+)"', pyproject)
+        assert m is not None
+        pyproject_version = m.group(1)
+        assert __version__ == pyproject_version, (
+            f"Inconsistent version: __init__.py={__version__} vs pyproject.toml={pyproject_version}"
+        )
+
+
+# ===========================================================================
+# TestMainModule
+# ===========================================================================
+
+class TestMainModule:
+
+    def test_main_module_exists(self):
+        main_path = ROOT / "picarones" / "__main__.py"
+        assert main_path.exists(), "picarones/__main__.py is missing"
+
+    def test_main_imports_cli(self):
+        content = (ROOT / "picarones" / "__main__.py").read_text(encoding="utf-8")
+        assert "from picarones.cli import cli" in content
+
+    def test_main_importable(self):
+        import importlib
+        mod = importlib.import_module("picarones.__main__")
+        assert hasattr(mod, "cli")
+
+
+# ===========================================================================
+# TestMakefile
+# ===========================================================================
+
+class TestMakefile:
+
+    @pytest.fixture
+    def makefile(self):
+        path = ROOT / "Makefile"
+        assert path.exists(), "Makefile is missing"
+        return path.read_text(encoding="utf-8")
+
+    def test_makefile_exists(self):
+        assert (ROOT / "Makefile").exists()
+
+    def test_has_install_target(self, makefile):
+        assert "install:" in makefile
+
+    def test_has_test_target(self, makefile):
+        assert "test:" in makefile
+
+    def test_has_demo_target(self, makefile):
+        assert "demo:" in makefile
+
+    def test_has_docker_build_target(self, makefile):
+        assert "docker-build:" in makefile
+
+    def test_has_help_target(self, makefile):
+        assert "help:" in makefile
+
+
+# ===========================================================================
+# TestDockerfile
+# ===========================================================================
+
+class TestDockerfile:
+
+    @pytest.fixture
+    def dockerfile(self):
+        path = ROOT / "Dockerfile"
+        assert path.exists(), "Dockerfile is missing"
+        return path.read_text(encoding="utf-8")
+
+    def test_dockerfile_exists(self):
+        assert (ROOT / "Dockerfile").exists()
+
+    def test_has_python_base(self, dockerfile):
+        assert "python:3.11" in dockerfile
+
+    def test_has_tesseract_install(self, dockerfile):
+        assert "tesseract-ocr" in dockerfile
+
+    def test_has_picarones_serve_cmd(self, dockerfile):
+        assert "picarones" in dockerfile
+        assert "serve" in dockerfile
+        assert "0.0.0.0" in dockerfile
+
+    def test_has_workdir(self, dockerfile):
+        assert "WORKDIR" in dockerfile
+
+    def test_has_healthcheck(self, dockerfile):
+        assert "HEALTHCHECK" in dockerfile
+
+
+# ===========================================================================
+# TestDockerCompose
+# ===========================================================================
+
+class TestDockerCompose:
+
+    @pytest.fixture
+    def compose(self):
+        path = ROOT / "docker-compose.yml"
+        assert path.exists(), "docker-compose.yml is missing"
+        return path.read_text(encoding="utf-8")
+
+    def test_compose_exists(self):
+        assert (ROOT / "docker-compose.yml").exists()
+
+    def test_has_picarones_service(self, compose):
+        assert "picarones:" in compose
+
+    def test_has_ollama_service(self, compose):
+        assert "ollama" in compose
+
+    def test_has_port_mapping(self, compose):
+        assert "8000" in compose
+
+    def test_has_volume_for_history(self, compose):
+        assert "picarones_history" in compose
+
+
+# ===========================================================================
+# TestCIWorkflow
+# ===========================================================================
+
+class TestCIWorkflow:
+
+    @pytest.fixture
+    def ci(self):
+        path = ROOT / ".github" / "workflows" / "ci.yml"
+        assert path.exists(), ".github/workflows/ci.yml is missing"
+        return path.read_text(encoding="utf-8")
+
+    def test_ci_exists(self):
+        assert (ROOT / ".github" / "workflows" / "ci.yml").exists()
+
+    def test_has_python_311(self, ci):
+        assert "3.11" in ci
+
+    def test_has_python_312(self, ci):
+        assert "3.12" in ci
+
+    def test_has_linux_macos_windows(self, ci):
+        assert "ubuntu-latest" in ci
+        assert "macos-latest" in ci
+        assert "windows-latest" in ci
+
+    def test_has_pytest_step(self, ci):
+        assert "pytest" in ci
+
+    def test_has_demo_job(self, ci):
+        assert "demo" in ci
+
+
+# ===========================================================================
+# TestPyInstallerSpec
+# ===========================================================================
+
+class TestPyInstallerSpec:
+
+    @pytest.fixture
+    def spec(self):
+        path = ROOT / "picarones.spec"
+        assert path.exists(), "picarones.spec is missing"
+        return path.read_text(encoding="utf-8")
+
+    def test_spec_exists(self):
+        assert (ROOT / "picarones.spec").exists()
+
+    def test_spec_has_analysis(self, spec):
+        assert "Analysis(" in spec
+
+    def test_spec_has_picarones_cli(self, spec):
+        assert "picarones.cli" in spec
+
+    def test_spec_has_exe(self, spec):
+        assert "EXE(" in spec
+
+
+# ===========================================================================
+# TestCLIDemoEndToEnd
+# ===========================================================================
+
+class TestCLIDemoEndToEnd:
+
+    def test_demo_runs_without_error(self, tmp_path):
+        from click.testing import CliRunner
+        from picarones.cli import cli
+        runner = CliRunner()
+        result = runner.invoke(cli, [
+            "demo", "--docs", "3",
+            "--output", str(tmp_path / "test.html"),
+        ])
+        assert result.exit_code == 0, f"demo failed: {result.output}"
+
+    def test_demo_generates_html_file(self, tmp_path):
+        from click.testing import CliRunner
+        from picarones.cli import cli
+        runner = CliRunner()
+        output = tmp_path / "rapport.html"
+        runner.invoke(cli, ["demo", "--docs", "3", "--output", str(output)])
+        assert output.exists()
+
+    def test_demo_html_contains_expected_content(self, tmp_path):
+        from click.testing import CliRunner
+        from picarones.cli import cli
+        runner = CliRunner()
+        output = tmp_path / "rapport.html"
+        runner.invoke(cli, ["demo", "--docs", "3", "--output", str(output)])
+        content = output.read_text(encoding="utf-8")
+        assert "Picarones" in content
+        assert "CER" in content
+        assert len(content) > 50_000, f"Report too small: {len(content):,} characters"
+
+    def test_demo_with_history_flag(self, tmp_path):
+        from click.testing import CliRunner
+        from picarones.cli import cli
+        runner = CliRunner()
+        result = runner.invoke(cli, [
+            "demo", "--docs", "3",
+            "--output", str(tmp_path / "test.html"),
+            "--with-history",
+        ])
+        assert result.exit_code == 0
+        assert "CER" in result.output
+
+    def test_demo_with_robustness_flag(self, tmp_path):
+        from click.testing import CliRunner
+        from picarones.cli import cli
+        runner = CliRunner()
+        result = runner.invoke(cli, [
+            "demo", "--docs", "3",
+            "--output", str(tmp_path / "test.html"),
+            "--with-robustness",
+        ])
+        assert result.exit_code == 0
+
+    def test_demo_with_json_output(self, tmp_path):
+        from click.testing import CliRunner
+        from picarones.cli import cli
+        import json
+        runner = CliRunner()
+        json_out = tmp_path / "results.json"
+        result = runner.invoke(cli, [
+            "demo", "--docs", "3",
+            "--output", str(tmp_path / "test.html"),
+            "--json-output", str(json_out),
+        ])
+        assert result.exit_code == 0
+        assert json_out.exists()
+        data = json.loads(json_out.read_text(encoding="utf-8"))
+        assert "engine_reports" in data
+
+
+# ===========================================================================
+# TestReadme
+# ===========================================================================
+
+class TestReadme:
+
+    @pytest.fixture
+    def readme(self):
+        path = ROOT / "README.md"
+        assert path.exists()
+        return path.read_text(encoding="utf-8")
+
+    def test_readme_has_french_section(self, readme):
+        assert "Fonctionnalités" in readme or "Picarones" in readme
+
+    def test_readme_has_english_section(self, readme):
+        assert "English" in readme or "Quick Start" in readme
+
+    def test_readme_has_installation(self, readme):
+        assert "Installation" in readme
+        assert "pip install" in readme
+
+    def test_readme_has_cli_examples(self, readme):
+        assert "picarones demo" in readme
+        assert "picarones run" in readme
+
+    def test_readme_has_engines_table(self, readme):
+        assert "Tesseract" in readme
+        assert "Pero OCR" in readme
+
+
+# ===========================================================================
+# TestInstallMd
+# ===========================================================================
+
+class TestInstallMd:
+
+    @pytest.fixture
+    def install(self):
+        path = ROOT / "INSTALL.md"
+        assert path.exists(), "INSTALL.md is missing"
+        return path.read_text(encoding="utf-8")
+
+    def test_has_linux_section(self, install):
+        assert "Linux" in install or "Ubuntu" in install
+
+    def test_has_macos_section(self, install):
+        assert "macOS" in install
+
+    def test_has_windows_section(self, install):
+        assert "Windows" in install
+
+    def test_has_docker_section(self, install):
+        assert "Docker" in install
+
+
+# ===========================================================================
+# TestChangelog
+# ===========================================================================
+
+class TestChangelog:
+
+    @pytest.fixture
+    def changelog(self):
+        path = ROOT / "CHANGELOG.md"
+        assert path.exists(), "CHANGELOG.md is missing"
+        return path.read_text(encoding="utf-8")
+
+    def test_has_sprint1(self, changelog):
+        assert "Sprint 1" in changelog or "0.1.0" in changelog
+
+    def test_has_sprint8(self, changelog):
+        assert "Sprint 8" in changelog or "0.8.0" in changelog
+
+    def test_has_sprint9(self, changelog):
+        assert "Sprint 9" in changelog or "1.0.0" in changelog
+
+    def test_has_versions(self, changelog):
+        # At least 2 documented versions
+        versions = re.findall(r"\[[\d.]+\]", changelog)
+        assert len(versions) >= 2
+
+    def test_has_date(self, changelog):
+        assert "2025" in changelog
+
+
+# ===========================================================================
+# TestContributing
+# ===========================================================================
+
+class TestContributing:
+
+    @pytest.fixture
+    def contrib(self):
+        path = ROOT / "CONTRIBUTING.md"
+        assert path.exists(), "CONTRIBUTING.md is missing"
+        return path.read_text(encoding="utf-8")
+
+    def test_has_how_to_add_engine(self, contrib):
+        assert "moteur" in contrib.lower() or "engine" in contrib.lower()
+
+    def test_has_tests_section(self, contrib):
+        assert "test" in contrib.lower()
+
+    def test_has_pull_request_section(self, contrib):
+        assert "pull request" in contrib.lower() or "PR" in contrib
+
+    def test_has_code_style(self, contrib):
+        assert "Google" in contrib or "docstring" in contrib.lower() or "style" in contrib.lower()
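Review note on `test_version_consistent`: the pattern `version\s*=\s*"([^"]+)"` is unanchored, so `re.search` returns the first key whose name merely ends in `version` (for example `minversion` or `target-version`) if such a key appears before `[project]`. The sketch below shows one possible way to restrict the lookup to the `[project]` table using only `re`. The helper name `project_version` and the sample `pyproject.toml` text are hypothetical and not part of this commit; the real file layout is an assumption, and array-of-tables headers (`[[...]]`) are not treated as table boundaries here.

```python
from __future__ import annotations

import re

# A table header alone on its line, e.g. "[project]" or "[tool.ruff]".
_TABLE_HEADER = re.compile(r"^\[(?P<name>[^\[\]]+)\][ \t]*$", re.MULTILINE)
# A key named exactly "version" at the start of a line.
_VERSION_KEY = re.compile(r'^version\s*=\s*"([^"]+)"', re.MULTILINE)


def project_version(pyproject_text: str) -> str | None:
    """Return the version declared in the [project] table, or None."""
    headers = list(_TABLE_HEADER.finditer(pyproject_text))
    for i, header in enumerate(headers):
        if header.group("name").strip() != "project":
            continue
        # The table body runs until the next header or the end of the file.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(pyproject_text)
        match = _VERSION_KEY.search(pyproject_text, header.end(), end)
        return match.group(1) if match else None
    return None


# Hypothetical pyproject.toml in which a tool table precedes [project].
SAMPLE = (
    "[tool.pytest.ini_options]\n"
    'minversion = "7.0"\n'
    "\n"
    "[project]\n"
    'name = "picarones"\n'
    'version = "1.0.0"\n'
)

# The unanchored pattern from the test matches inside "minversion".
naive = re.search(r'version\s*=\s*"([^"]+)"', SAMPLE).group(1)
scoped = project_version(SAMPLE)
```

On Python 3.11 and later, parsing the file with the standard-library `tomllib` module and reading `data["project"]["version"]` would be a sturdier alternative to any regex.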