FiberGate — Document Analysis Pipeline for Orange's PAR Localisation Workflow
Automated processing of demandes de localisation du Point d'Accès au Réseau (PAR) for the Orange "Guichet Accueil Infrastructures" team. Given a folder (or ZIP) of documents submitted by a bureau d'études, the system:
- classifies each document (fiche / autorisation / mandat / plan de masse / plan de situation / certificat),
- extracts 13 business fields with a fine-tuned LayoutLMv3 model,
- applies the AGILIS rule set to verdict the demande's completeness (complète / incomplète / hors-périmètre),
- pre-fills the CMS IMMO 9 BANBOU Excel template with the derived values,
- drafts the AR mail ready to paste into MSURVEY.
A polished Streamlit demo wraps the whole pipeline with one-click sample loaders for presentation.
Architecture
flowchart LR
classDef transformer fill:#1e3a8a,stroke:#93c5fd,stroke-width:2px,color:#e0f2fe
classDef rule fill:#1c1917,stroke:#fb923c,stroke-width:2px,color:#ffedd5
classDef output fill:#14532d,stroke:#4ade80,stroke-width:2px,color:#dcfce7
classDef io fill:#312e81,stroke:#a5b4fc,stroke-width:2px,color:#e0e7ff
classDef decision fill:#7c2d12,stroke:#fb923c,stroke-width:2px,color:#ffedd5
ZIP(["📁 ZIP / loose files"]):::io
subgraph PIPE["🔬 Transformer Pipeline · guichetoi.inference"]
direction TB
OCR["🔍 OCR<br/>Tesseract · fra<br/>conf ≥ 30"]:::transformer
CLS["🧠 Classifier<br/>LayoutLMv3<br/>6 document classes"]:::transformer
EXT["🧠 Extractor<br/>LayoutLMv3 BIO<br/>13 business fields"]:::transformer
POST["⚙️ Post-processing<br/>regex cleaners<br/>per-class allowlist<br/>mandat checkbox scorer"]:::transformer
OCR --> CLS --> EXT --> POST
end
subgraph RULES["📋 Rule Engine · guichetoi.recommendation"]
direction TB
FNHINT["🏷️ Filename hints<br/>PlanSituation · PlanMasse<br/>ARRETE PC · ADRESSAGE"]:::rule
OOS["🚫 Out-of-scope filter<br/>PV-Loc-PAR · Autre_*<br/>Plan-et-ou-photo"]:::rule
RECOL{"♻️ Récolement?"}:::decision
AGILIS["📐 AGILIS rules<br/>R1 – R5<br/>champs obligatoires fiche"]:::rule
REFMATCH["🔗 Reference cross-check<br/>fiche ↔ autorisation<br/>Levenshtein-tolerant"]:::rule
FNHINT --> OOS --> RECOL
RECOL -- "non" --> AGILIS --> REFMATCH
end
subgraph OUT["📤 Outputs"]
direction TB
VERDICT["✅ Verdict<br/>complète / incomplète<br/>hors-périmètre"]:::output
ARMAIL["📨 Brouillon AR<br/>ready to paste<br/>into MSURVEY"]:::output
CMS["📊 CMS IMMO 9<br/>BANBOU pre-filled<br/>xlsx template"]:::output
end
UI(["🌐 FastAPI · /docs<br/>azizmiladi-fibergate.hf.space"]):::io
ZIP --> PIPE
PIPE --> RULES
RECOL -- "oui" --> VERDICT
REFMATCH --> VERDICT
VERDICT --> ARMAIL
VERDICT --> CMS
OUT --> UI
Two-tier design: the transformer handles perception (what kind of document it is, where the data is), rules handle business logic (what makes a demande complete, how to fill the CMS). Each layer is independently testable and fixable — extraction errors don't propagate into wrong verdicts thanks to per-field validators and OCR-tolerant cross-checks.
Headline numbers
| Metric | Value |
|---|---|
| Document classes | 6 (fiche, Autorisation, Mandat, Certificat, PlanMasse, PlanSituation) |
| Fields extracted | 13 (Reference_Urbanisme, DLPI, nb_log_totale, Disposition_Mandat, …) |
| Training set (de-duped, leakage-free) | 754 annotated pages → 528 train / 114 val / 112 test |
| Classifier accuracy (val) | ~ 95 % |
| Extractor macro span-F1 (val, honest) | 0.62 — Reference_Urbanisme 0.77, Email 1.00, nb_log_totale 0.82 |
| Audited demandes (real Orange data) | 11 ZIPs → 7 auto-complète, 3 justifiably-incomplète, 1 hors-périmètre |
| Test suite | 171 passing unit + integration tests (pytest -q, ~25 s) |
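The "macro span-F1" figure above can be read as follows: a predicted field counts as correct only when its label and its exact token span both match the annotation, F1 is computed per field, and the per-field scores are averaged with equal weight. The sketch below illustrates that definition on BIO tag sequences. It is a minimal illustration, not the project's `05_evaluate.py`; the function names `bio_spans` and `macro_span_f1` are hypothetical.

```python
from collections import defaultdict


def bio_spans(tags):
    """Return the set of (label, start, end) spans in a BIO sequence (end exclusive)."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and tag[2:] == label
        if not continues and label is not None:
            spans.add((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def macro_span_f1(gold_seqs, pred_seqs):
    """Exact-match span F1 per label, then the unweighted mean over labels."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = bio_spans(gold), bio_spans(pred)
        for label, _, _ in g & p:
            tp[label] += 1
        for label, _, _ in p - g:
            fp[label] += 1
        for label, _, _ in g - p:
            fn[label] += 1
    labels = set(tp) | set(fp) | set(fn)
    scores = {
        label: 2 * tp[label] / (2 * tp[label] + fp[label] + fn[label])
        for label in labels
    }
    macro = sum(scores.values()) / len(scores) if scores else 0.0
    return macro, scores


gold = [["B-Email", "I-Email", "O", "B-DLPI"]]
pred = [["B-Email", "I-Email", "O", "O"]]
macro, per_label = macro_span_f1(gold, pred)
print(macro, per_label)
```

Because every field weighs the same in the average, a couple of hard fields (such as the low-scoring form-cell counts) pull the headline number down even when most fields extract well.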
Repository layout
GuichetOI_ML/
├── src/guichetoi/ The library (importable as `guichetoi.…`)
│ ├── inference.py GuichetOIPipeline + post-processing
│ ├── recommendation.py AGILIS rule engine + AR-mail rendering
│ ├── cms.py Fills the CMS IMMO 9 BANBOU xlsx
│ └── api/main.py FastAPI service (Spring Boot / Angular ready)
├── apps/
│ └── streamlit_demo.py One-page demo UI (Orange-branded)
├── scripts/ Training pipeline + batch CLIs
│ ├── 01_convert_labelstudio.py
│ ├── 02_train_classifier.py
│ ├── 03_train_extractor.py
│ ├── 05_evaluate.py
│ ├── ocr_rasterise.py
│ ├── batch_process_dataref.py
│ └── resplit.py · label.py
├── tools/ Dev / debug one-offs
├── tests/ 171 pytest unit/integration tests
├── docs/
│ ├── DEMO_SCRIPT.md Voiceover script for the recorded demo
│ └── LOGEMENT_IMPROVEMENTS.md
├── assets/
│ ├── orange_logo.png Brand mark used by the demo
│ ├── cms_template.xlsx Official CMS template
│ └── label_mappings.json 6 doc classes + 13 field labels (training output)
├── models/ Gitignored — LayoutLMv3 weights
│ ├── classifier/ Fine-tuned doc-class model
│ ├── extractor_v3/ Field extractor (current production)
│ └── extractor_v3_backup_v2/ Previous training run (kept for rollback)
├── .github/workflows/ci.yml Ruff + mypy + pytest on every PR
├── outputs/ Generated verdicts + CMS files (gitignored)
├── Dockerfile · .dockerignore Production container image
├── pyproject.toml Installable package metadata
├── requirements.txt Pinned dependencies (Dockerfile + CI)
├── Makefile Common dev shortcuts (test, demo, api, docker, …)
├── pytest.ini · mypy.ini Test + type-check config
└── CONTRIBUTING.md Branch strategy, setup, sensitive-data rules
Setup
Prerequisites
- Python 3.14 (tested) — likely works on 3.11+
- Tesseract OCR with the French language pack
- Windows: download from https://github.com/UB-Mannheim/tesseract/wiki
- During install, tick "Additional language data" → French
- 8 GB+ RAM (model loading), CPU works but GPU strongly recommended for retraining
Install
python -m venv .venv
.venv\Scripts\activate
pip install -e .[dev,ui] # installs the guichetoi package
pip install -r requirements.txt
Verify
python -m pytest -q # should print: 171 passed in ~25 s
Common dev commands (Makefile)
make help # list all targets
make test # full pytest suite (171 tests)
make test-fast # cms tests only (no model load, < 2 s)
make demo # streamlit run apps/streamlit_demo.py
make api # uvicorn guichetoi.api.main:app on :8000
make docker # docker build -t guichetoi-ml .
make lint # ruff + mypy
make clean # remove caches and temp outputs
On Windows without make, open the Makefile and run the command listed under each target directly.
Live demo (Hugging Face Space)
The FastAPI service is deployed at azizmiladi-fibergate.hf.space.
- The App tab opens the interactive Swagger UI (/docs): try POST /analyze directly in the browser.
- Models are downloaded from HF Hub on first boot (not baked into the image); cold-start takes ~2 min on HF's free CPU tier.
Run the demo locally
streamlit run apps/streamlit_demo.py
A browser tab opens at http://localhost:8501.
For a quick demo: click any 🎬 Échantillon de démonstration button — results are pre-computed and appear instantly (~1 s).
For a live analysis: drop a ZIP of a real demande into the file uploader. CPU inference takes ~5-15 s per document.
See docs/DEMO_SCRIPT.md for a 3-5 minute presentation script with timing and key talking points.
CLI usage
Analyse one document
python -m guichetoi.inference --image path/to/doc.pdf
# → prints classification + extracted fields, saves JSON to outputs/
Analyse a complete demande (folder)
python -m guichetoi.recommendation --folder path/to/demande/
# → produces outputs/<demande>/verdict.json + ar_mail.txt
Use as a Python library
from guichetoi.recommendation import RecommendationEngine
engine = RecommendationEngine() # loads model once
verdict = engine.evaluate_folder("path/to/demande/")
print(verdict.status) # "complète" / "incomplète" / "hors-périmètre"
Run as an HTTP service (for Spring Boot / Angular)
uvicorn guichetoi.api.main:app --host 0.0.0.0 --port 8000
# or: docker build -t guichetoi-ml . && docker run -p 8000:8000 guichetoi-ml
Endpoints: POST /analyze, POST /cms, POST /cms/preview, GET /metadata, GET /health.
GET / redirects to /docs (Swagger UI).
OpenAPI spec at /openapi.json (consume with openapi-generator for a typed Spring WebClient).
Deployment
Hugging Face Space (public demo)
The canonical live environment is the HF Space azizmiladi/fibergate (Docker SDK, port 8000).
- Models are not baked into the image: the container downloads them from HF Hub on first boot using HF_TOKEN (set as a Space secret).
- GET / → 302 → /docs, so the HF "App" tab shows the Swagger UI immediately.
- Free CPU tier: ~2 min cold-start, ~10–30 s per analyze call.
Render (production / Spring Boot integration)
Topology: Angular (Static Site) → Spring Boot (Web Service) → guichetoi-ml (Private Service). The Python service has no public URL; only your Spring Boot can reach it.
Why a Pro plan is required: LayoutLMv3 holds ~1.2 GB resident. Free/Starter (512 MB) crashes at model load. Pro = 2 GB = minimum viable.
One-time setup
- Create a GitHub PAT with write:packages and read:packages scopes (Settings → Developer settings → Personal access tokens → Fine-grained → repo-scoped to this one).
- Log in to GHCR locally: echo $env:GHCR_PAT | docker login ghcr.io -u medaziz012 --password-stdin
- In Render dashboard → Settings → Registry Credentials → add one named ghcr (username = GitHub username, password = same PAT).
Each deploy
make release # builds locally and pushes to GHCR
Render auto-pulls :latest and redeploys (autoDeploy: true in render.yaml). First boot takes ~30 s; Render's health check polls /health until pipeline_loaded: true.
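A caller that starts before the model has loaded can apply the same readiness check as Render's health probe. The sketch below polls /health until it reports pipeline_loaded: true. It assumes the endpoint returns a JSON object with a top-level boolean `pipeline_loaded` (the README names the key but not the full response shape); `fetch_health` and `wait_until_ready` are hypothetical helper names, and the timeout values are placeholders.

```python
import json
import time
import urllib.request


def fetch_health(base_url):
    """GET <base_url>/health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_ready(base_url, fetch=fetch_health, timeout_s=180.0,
                     interval_s=2.0, sleep=time.sleep):
    """Poll /health until pipeline_loaded is true; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if fetch(base_url).get("pipeline_loaded") is True:
                return True
        except (OSError, ValueError):
            pass  # service not reachable yet, or body was not JSON: retry
        sleep(interval_s)
    return False


if __name__ == "__main__":
    # Local service started with: uvicorn guichetoi.api.main:app --port 8000
    print(wait_until_ready("http://localhost:8000", timeout_s=10.0))
```

The `fetch` and `sleep` parameters are injectable so the loop can be unit-tested without a running service.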
Spring Boot configuration
guichetoi:
ml:
base-url: http://guichetoi-ml:10000
No CORS, no public exposure, no separate auth — Spring Boot is the only client.
Retraining
# 1. Annotate new documents in Label Studio, export JSON
# 2. Convert to training format
python scripts/01_convert_labelstudio.py path/to/export.json
# 3. Train (writes to models/extractor_v3/)
python scripts/03_train_extractor.py
# 4. Evaluate on the held-out test split
python scripts/05_evaluate.py
Training the extractor takes ~6 hours on CPU, ~30 min on a single GPU.
Move old checkpoints first: HuggingFace Trainer's save_total_limit=3 rotates by step number, not date — leaving old checkpoints in place silently keeps the old model.
mv models/extractor_v3/checkpoint-* models/extractor_v3_backup_v2/
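On Windows, where `mv` with a glob is not available in every shell, the same move can be done from Python. This is a minimal sketch using only the standard library; `archive_checkpoints` is a hypothetical helper, not a script shipped in the repository. It refuses to overwrite an existing backup rather than merging into it.

```python
import shutil
from pathlib import Path


def archive_checkpoints(model_dir, backup_dir):
    """Move every checkpoint-* directory out of model_dir; return the moved names."""
    src, dst = Path(model_dir), Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for ckpt in sorted(src.glob("checkpoint-*")):
        if not ckpt.is_dir():
            continue  # ignore stray files that happen to match the pattern
        target = dst / ckpt.name
        if target.exists():
            raise FileExistsError(f"{target} already exists; refusing to overwrite")
        shutil.move(str(ckpt), str(target))
        moved.append(ckpt.name)
    return moved


if __name__ == "__main__":
    print(archive_checkpoints("models/extractor_v3", "models/extractor_v3_backup_v2"))
```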
Architecture highlights
Hybrid Transformer + rules
Pure LayoutLMv3 (a multimodal document transformer) extraction was unreliable on this small dataset (528 training examples, noisy OCR on form-cell digits). Wrapping the transformer with regex post-processing + per-class field allowlists + OCR-tolerant cross-checks turned a "mostly works" prototype into a system whose verdicts can be trusted at the demande level — even when individual field confidences are low.
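To make the wrapping concrete, here is a rough sketch of two of those layers: a per-class field allowlist that discards fields a document class should never produce, and a regex cleaner for a numeric field. The class-to-field mapping and the names `FIELD_ALLOWLIST`, `clean_count`, and `postprocess` are assumptions for illustration; the real mapping lives in `guichetoi.inference` and may differ.

```python
import re

# Hypothetical allowlist: which fields each document class may contribute.
FIELD_ALLOWLIST = {
    "fiche": {"nb_log_totale", "Email", "DLPI", "Reference_Urbanisme"},
    "Autorisation": {"Reference_Urbanisme"},
    "Mandat": {"Disposition_Mandat"},
}

_DIGITS = re.compile(r"\d+")


def clean_count(raw):
    """Return the first run of digits in a noisy OCR string as an int, or None."""
    match = _DIGITS.search(raw)
    return int(match.group()) if match else None


def postprocess(doc_class, fields):
    """Drop fields not allowed for this class, then clean the numeric field."""
    allowed = FIELD_ALLOWLIST.get(doc_class, set())
    kept = {name: value for name, value in fields.items() if name in allowed}
    if "nb_log_totale" in kept:
        kept["nb_log_totale"] = clean_count(kept["nb_log_totale"])
    return kept


print(postprocess("Mandat", {"Disposition_Mandat": "OUI", "nb_log_totale": "12"}))
print(postprocess("fiche", {"nb_log_totale": " 12 logts", "Email": "a@b.fr"}))
```

The allowlist is what keeps a stray prediction on the wrong document type (for instance a housing count read from a mandat) from reaching the rule engine.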
Six engine adjustments derived from real-data audit
An 11-demande audit on production-shaped ZIPs surfaced systemic failure modes that the test scores didn't reveal. Each was addressed with a targeted fix (all locked in by regression tests):
- Stricter _RE_REFURB: rejects "rue Abbé" / "Parcelle" false positives from the RU/PA prefixes.
- Tri-state _autorisation_matches: distinguishes "different ref" (incohérent) from "no ref readable" (manual review).
- Out-of-scope filename detection: PV-Loc-PAR, Plan-et-ou-photo, Autre_* files no longer satisfy class requirements.
- Récolement short-circuit: dossiers de récolement get hors-périmètre status + dedicated AR mail.
- Filename hints broadened: ARRETE PC.jpg, CERTIFICAT ADRESSAGE.jpg, Mandat_PAR-1-1.pdf all match now.
- Strict mandat checkbox scorer: ! and si no longer count as marked boxes; ambiguous cases fall through to manual review instead of false OUI.
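The tri-state reference check from the list above, combined with the Levenshtein-tolerant comparison shown in the architecture diagram, could look roughly like this. It is a sketch under assumptions: the actual `_autorisation_matches` signature, its normalisation rules, and its edit-distance threshold are not given in this README, so `compare_refs`, `normalise`, `max_edits=2`, and the sample references are all illustrative.

```python
from enum import Enum


class RefMatch(Enum):
    MATCH = "match"
    MISMATCH = "incohérent"
    UNREADABLE = "manual review"


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def normalise(ref):
    """Upper-case and keep only letters and digits, so spacing differences vanish."""
    return "".join(ch for ch in ref.upper() if ch.isalnum())


def compare_refs(fiche_ref, autorisation_ref, max_edits=2):
    """Three outcomes: close enough, clearly different, or nothing readable."""
    if not fiche_ref or not autorisation_ref:
        return RefMatch.UNREADABLE
    a, b = normalise(fiche_ref), normalise(autorisation_ref)
    if not a or not b:
        return RefMatch.UNREADABLE
    return RefMatch.MATCH if levenshtein(a, b) <= max_edits else RefMatch.MISMATCH


# Made-up references: the second has one OCR confusion (digit 0 read as letter O).
print(compare_refs("PC 075 108 21 V0012", "PC07510821V0O12"))
print(compare_refs("PC07510821V0012", None))
```

Keeping "unreadable" separate from "different" matters for the verdict: a missing reference is sent to manual review, while a genuinely different one marks the demande as inconsistent.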
Test suite (171 tests, ~25 s)
| File | Tests | Coverage |
|---|---|---|
| tests/test_cms_generator.py | 67 | All derivations + 4 end-to-end fill_cms scenarios |
| tests/test_recommendation_engine.py | 50 | Rule helpers + verdict logic on synthetic Documents |
| tests/test_inference_postprocess.py | 54 | Regex constants + mandat detector + cleaner |
Every bug found during development has a regression test, so running the suite replaces "I checked it manually".
Limits & known gaps
- Handwritten / small-font form-cell digits drop Tesseract confidence below MIN_CONF=30 → Nb_log_pro and Nb_log_res macro-F1 ≈ 0.25. Mitigated by regex backstops where possible; falls through to "manual completion" otherwise.
- No live re-extraction after filename override: when the model picks PlanMasse with 65% confidence and we override to Autorisation, we don't re-run extraction on the override target. The CMS gets the right class but no fields; the consultant fills them in.
- XY coordinates (Géoréso) and Mondofi ref are always manual — explicitly listed in the CMS download's "À compléter manuellement" panel.
- Single-page PDFs assumed for several extraction shortcuts — multi-page docs work but only the first page drives classification.
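The first limit above comes from the OCR confidence gate. As an illustration of how such a gate drops low-confidence digits, the sketch below filters a word-level OCR result shaped like the dictionary returned by `pytesseract.image_to_data(..., output_type=Output.DICT)`, which carries parallel `text` and `conf` lists. Whether the project calls pytesseract this way is an assumption, and `keep_confident_words` is a hypothetical name; the sample data is made up.

```python
MIN_CONF = 30  # threshold named in this README


def keep_confident_words(ocr, min_conf=MIN_CONF):
    """Return (text, confidence) pairs whose confidence meets the threshold."""
    kept = []
    for text, conf in zip(ocr["text"], ocr["conf"]):
        score = float(conf)  # confidences may arrive as int, float, or str
        if text.strip() and score >= min_conf:
            kept.append((text, score))
    return kept


# Made-up OCR output: an empty layout box, two confident words, one weak digit.
sample = {"text": ["", "Nombre", "12", "3"], "conf": ["-1", 96, "41.5", 12]}
print(keep_confident_words(sample))
```

In this example the weak "3" is discarded, which is exactly the situation where a regex backstop or manual completion has to take over.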
Author
Mohamed Aziz Miladi — AMARIS internship project (Guichet Accueil Infrastructures).