FiberGate — Document Analysis Pipeline for Orange's PAR Localisation Workflow

Automated processing of demandes de localisation du Point d'Accès au Réseau (PAR), i.e. requests to locate the network access point, for the Orange "Guichet Accueil Infrastructures" team. Given a folder (or ZIP) of documents submitted by a bureau d'études (engineering consultancy), the system:

  1. classifies each document (fiche / autorisation / mandat / plan de masse / plan de situation / certificat),
  2. extracts 13 business fields with a fine-tuned LayoutLMv3 model,
  3. applies the AGILIS rule set to decide whether the demande is complete (complète / incomplète / hors-périmètre),
  4. pre-fills the CMS IMMO 9 BANBOU Excel template with the derived values,
  5. drafts the AR mail ready to paste into MSURVEY.

A polished Streamlit demo wraps the whole pipeline with one-click sample loaders for presentation.


Architecture

flowchart LR
    classDef transformer fill:#1e3a8a,stroke:#93c5fd,stroke-width:2px,color:#e0f2fe
    classDef rule     fill:#1c1917,stroke:#fb923c,stroke-width:2px,color:#ffedd5
    classDef output   fill:#14532d,stroke:#4ade80,stroke-width:2px,color:#dcfce7
    classDef io       fill:#312e81,stroke:#a5b4fc,stroke-width:2px,color:#e0e7ff
    classDef decision fill:#7c2d12,stroke:#fb923c,stroke-width:2px,color:#ffedd5

    ZIP(["📁 ZIP / loose files"]):::io

    subgraph PIPE["🔬 Transformer Pipeline · guichetoi.inference"]
        direction TB
        OCR["🔍 OCR<br/>Tesseract · fra<br/>conf ≥ 30"]:::transformer
        CLS["🧠 Classifier<br/>LayoutLMv3<br/>6 document classes"]:::transformer
        EXT["🧠 Extractor<br/>LayoutLMv3 BIO<br/>13 business fields"]:::transformer
        POST["⚙️ Post-processing<br/>regex cleaners<br/>per-class allowlist<br/>mandat checkbox scorer"]:::transformer
        OCR --> CLS --> EXT --> POST
    end

    subgraph RULES["📋 Rule Engine · guichetoi.recommendation"]
        direction TB
        FNHINT["🏷️ Filename hints<br/>PlanSituation · PlanMasse<br/>ARRETE PC · ADRESSAGE"]:::rule
        OOS["🚫 Out-of-scope filter<br/>PV-Loc-PAR · Autre_*<br/>Plan-et-ou-photo"]:::rule
        RECOL{"♻️ Récolement?"}:::decision
        AGILIS["📐 AGILIS rules<br/>R1 – R5<br/>champs obligatoires fiche"]:::rule
        REFMATCH["🔗 Reference cross-check<br/>fiche ↔ autorisation<br/>Levenshtein-tolerant"]:::rule
        FNHINT --> OOS --> RECOL
        RECOL -- "non" --> AGILIS --> REFMATCH
    end

    subgraph OUT["📤 Outputs"]
        direction TB
        VERDICT["✅ Verdict<br/>complète / incomplète<br/>hors-périmètre"]:::output
        ARMAIL["📨 Brouillon AR<br/>ready to paste<br/>into MSURVEY"]:::output
        CMS["📊 CMS IMMO 9<br/>BANBOU pre-filled<br/>xlsx template"]:::output
    end

    UI(["🌐 FastAPI · /docs<br/>azizmiladi-fibergate.hf.space"]):::io

    ZIP      --> PIPE
    PIPE     --> RULES
    RECOL    -- "oui" --> VERDICT
    REFMATCH --> VERDICT
    VERDICT  --> ARMAIL
    VERDICT  --> CMS
    OUT      --> UI

Two-tier design: the transformer handles perception (what kind of document it is, where the data is), while rules handle business logic (what makes a demande complete, how to fill the CMS). Each layer is independently testable and fixable, and per-field validators plus OCR-tolerant cross-checks limit how far extraction errors propagate into verdicts.
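The split between the two tiers can be pictured with a small sketch. Everything in it is illustrative: the `Document` dataclass, the `REQUIRED_CLASSES` set, and both function names are assumptions made for this example and do not reproduce the real AGILIS rules R1 to R5 or the library's actual types.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical hand-over object produced by the transformer tier."""
    filename: str
    doc_class: str                      # one of the 6 predicted classes
    fields: dict[str, str] = field(default_factory=dict)


# ASSUMPTION for illustration only: the real required set lives in the
# AGILIS rules and may differ from this one.
REQUIRED_CLASSES = {"fiche", "Autorisation", "PlanMasse", "PlanSituation"}


def missing_classes(documents: list[Document]) -> set[str]:
    """Rule tier: pure function over perception output, no model involved."""
    present = {doc.doc_class for doc in documents}
    return REQUIRED_CLASSES - present


def completeness_verdict(documents: list[Document]) -> str:
    """Return 'complète' only when every required class is present."""
    return "complète" if not missing_classes(documents) else "incomplète"
```

Because the rule tier only consumes plain data, it can be unit-tested with synthetic `Document` objects and without loading any model weights.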


Headline numbers

| Metric | Value |
| --- | --- |
| Document classes | 6 (fiche, Autorisation, Mandat, Certificat, PlanMasse, PlanSituation) |
| Fields extracted | 13 (Reference_Urbanisme, DLPI, nb_log_totale, Disposition_Mandat, …) |
| Training set (de-duped, leakage-free) | 754 annotated pages → 528 train / 114 val / 112 test |
| Classifier accuracy (val) | ~95 % |
| Extractor macro span-F1 (val, honest) | 0.62 (Reference_Urbanisme 0.77, Email 1.00, nb_log_totale 0.82) |
| Audited demandes (real Orange data) | 11 ZIPs → 7 auto-complète, 3 justifiably-incomplète, 1 hors-périmètre |
| Test suite | 171 passing unit + integration tests (pytest -q, ~25 s) |

Repository layout

GuichetOI_ML/
├── src/guichetoi/              The library (importable as `guichetoi.…`)
│   ├── inference.py            GuichetOIPipeline + post-processing
│   ├── recommendation.py       AGILIS rule engine + AR-mail rendering
│   ├── cms.py                  Fills the CMS IMMO 9 BANBOU xlsx
│   └── api/main.py             FastAPI service (Spring Boot / Angular ready)
├── apps/
│   └── streamlit_demo.py       One-page demo UI (Orange-branded)
├── scripts/                    Training pipeline + batch CLIs
│   ├── 01_convert_labelstudio.py
│   ├── 02_train_classifier.py
│   ├── 03_train_extractor.py
│   ├── 05_evaluate.py
│   ├── ocr_rasterise.py
│   ├── batch_process_dataref.py
│   └── resplit.py · label.py
├── tools/                      Dev / debug one-offs
├── tests/                      171 pytest unit/integration tests
├── docs/
│   ├── DEMO_SCRIPT.md          Voiceover script for the recorded demo
│   └── LOGEMENT_IMPROVEMENTS.md
├── assets/
│   ├── orange_logo.png         Brand mark used by the demo
│   ├── cms_template.xlsx       Official CMS template
│   └── label_mappings.json     6 doc classes + 13 field labels (training output)
├── models/                     Gitignored — LayoutLMv3 weights
│   ├── classifier/             Fine-tuned doc-class model
│   ├── extractor_v3/           Field extractor (current production)
│   └── extractor_v3_backup_v2/ Previous training run (kept for rollback)
├── .github/workflows/ci.yml    Ruff + mypy + pytest on every PR
├── outputs/                    Generated verdicts + CMS files (gitignored)
├── Dockerfile · .dockerignore  Production container image
├── pyproject.toml              Installable package metadata
├── requirements.txt            Pinned dependencies (Dockerfile + CI)
├── Makefile                    Common dev shortcuts (test, demo, api, docker, …)
├── pytest.ini · mypy.ini       Test + type-check config
└── CONTRIBUTING.md             Branch strategy, setup, sensitive-data rules

Setup

Prerequisites

  • Python 3.14 (tested) — likely works on 3.11+
  • Tesseract OCR with the French language pack
  • 8 GB+ RAM (model loading); CPU works, but a GPU is strongly recommended for retraining

Install

python -m venv .venv
.venv\Scripts\activate        # Windows; on Linux/macOS: source .venv/bin/activate
pip install -e .[dev,ui]      # installs the guichetoi package
pip install -r requirements.txt

Verify

python -m pytest -q     # should print: 171 passed in ~25 s

Common dev commands (Makefile)

make help          # list all targets
make test          # full pytest suite (171 tests)
make test-fast     # cms tests only (no model load, < 2 s)
make demo          # streamlit run apps/streamlit_demo.py
make api           # uvicorn guichetoi.api.main:app on :8000
make docker        # docker build -t guichetoi-ml .
make lint          # ruff + mypy
make clean         # remove caches and temp outputs

On Windows without make, open the Makefile and run the command shown to the right of each target's colon directly.


Live demo (Hugging Face Space)

The FastAPI service is deployed at azizmiladi-fibergate.hf.space.

  • The App tab opens the interactive Swagger UI (/docs) — try POST /analyze directly in the browser.
  • Models are downloaded from HF Hub on first boot (not baked into the image); cold-start takes ~2 min on HF's free CPU tier.

Run the demo locally

streamlit run apps/streamlit_demo.py

A browser tab opens at http://localhost:8501.

For a quick demo: click any 🎬 Échantillon de démonstration button — results are pre-computed and appear instantly (~1 s).

For a live analysis: drop a ZIP of a real demande into the file uploader. CPU inference takes ~5-15 s per document.

See docs/DEMO_SCRIPT.md for a 3-5 minute presentation script with timing and key talking points.


CLI usage

Analyse one document

python -m guichetoi.inference --image path/to/doc.pdf
# → prints classification + extracted fields, saves JSON to outputs/

Analyse a complete demande (folder)

python -m guichetoi.recommendation --folder path/to/demande/
# → produces outputs/<demande>/verdict.json + ar_mail.txt

Use as a Python library

from guichetoi.recommendation import RecommendationEngine

engine = RecommendationEngine()    # loads model once
verdict = engine.evaluate_folder("path/to/demande/")
print(verdict.status)              # "complète" / "incomplète" / "hors-périmètre"

Run as an HTTP service (for Spring Boot / Angular)

uvicorn guichetoi.api.main:app --host 0.0.0.0 --port 8000
# or: docker build -t guichetoi-ml . && docker run -p 8000:8000 guichetoi-ml

Endpoints: POST /analyze, POST /cms, POST /cms/preview, GET /metadata, GET /health. GET / redirects to /docs (Swagger UI). OpenAPI spec at /openapi.json (consume with openapi-generator for a typed Spring WebClient).
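For callers that are not generated from the OpenAPI spec, a request to POST /analyze could be assembled with the standard library alone. This is a sketch under assumptions: it supposes the endpoint accepts a multipart upload and that the form field is named `file`; neither is confirmed here, so check /docs or /openapi.json for the real schema. The helper name `build_analyze_request` is hypothetical.

```python
import urllib.request
import uuid


def build_analyze_request(base_url: str, zip_bytes: bytes, filename: str,
                          field_name: str = "file") -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST to /analyze.

    ASSUMPTION: the upload field is called "file"; adjust to the real schema.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/zip\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/analyze",
        data=head + zip_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Sending it is then a matter of passing the request to `urllib.request.urlopen` and decoding the JSON body; on the free CPU tier a generous timeout is advisable given the per-call latency quoted in the Deployment section.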


Deployment

Hugging Face Space (public demo)

The canonical live environment is the HF Space azizmiladi/fibergate (Docker SDK, port 8000).

  • Models are not baked into the image — the container downloads them from HF Hub on first boot using HF_TOKEN (set as a Space secret).
  • GET / → 302 → /docs so the HF "App" tab shows the Swagger UI immediately.
  • Free CPU tier: ~2 min cold-start, ~10–30 s per analyze call.

Render (production / Spring Boot integration)

Topology: Angular (Static Site) → Spring Boot (Web Service) → guichetoi-ml (Private Service). The Python service has no public URL; only the Spring Boot service can reach it.

Why a Pro plan is required: LayoutLMv3 holds ~1.2 GB resident. Free/Starter (512 MB) crashes at model load. Pro = 2 GB = minimum viable.

One-time setup

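  1. Create a GitHub personal access token (classic) with the write:packages and read:packages scopes (Settings → Developer settings → Personal access tokens → Tokens (classic)). These scopes belong to classic tokens; fine-grained tokens do not offer them for GitHub Packages.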
  2. Log in to GHCR locally:
    echo $env:GHCR_PAT | docker login ghcr.io -u medaziz012 --password-stdin
    
  3. In Render dashboard → Settings → Registry Credentials → add one named ghcr (username = GitHub username, password = same PAT).

Each deploy

make release      # builds locally and pushes to GHCR

Render auto-pulls :latest and redeploys (autoDeploy: true in render.yaml). First boot takes ~30 s; Render's health check polls /health until pipeline_loaded: true.
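The same readiness check can be reproduced from a deploy script or a smoke test. The sketch below assumes that GET /health returns a JSON object containing the `pipeline_loaded` flag mentioned above; the function names, timeout, and polling interval are illustrative choices, not values taken from the service.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(url: str) -> dict:
    """GET the health endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_until_loaded(url: str,
                      fetch: Callable[[str], dict] = fetch_health,
                      timeout_s: float = 180.0,
                      interval_s: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the service reports pipeline_loaded: true, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if fetch(url).get("pipeline_loaded") is True:
                return True
        except (OSError, ValueError):
            # Connection refused / HTTP error (OSError) or a non-JSON body
            # (ValueError): the service is not ready yet, keep polling.
            pass
        sleep(interval_s)
    return False
```

`fetch` and `sleep` are injectable so the loop can be tested without a running server.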

Spring Boot configuration

guichetoi:
  ml:
    base-url: http://guichetoi-ml:10000

No CORS, no public exposure, no separate auth — Spring Boot is the only client.


Retraining

# 1. Annotate new documents in Label Studio, export JSON
# 2. Convert to training format
python scripts/01_convert_labelstudio.py path/to/export.json

# 3. Train (writes to models/extractor_v3/)
python scripts/03_train_extractor.py

# 4. Evaluate on the held-out test split
python scripts/05_evaluate.py

Training the extractor takes ~6 hours on CPU, ~30 min on a single GPU. Move old checkpoints first: HuggingFace Trainer's save_total_limit=3 rotates by step number, not date — leaving old checkpoints in place silently keeps the old model.

mv models/extractor_v3/checkpoint-* models/extractor_v3_backup_v2/
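On systems where a shell `mv` with globbing is not available, the same move can be done from Python. This is a minimal sketch: the function name is hypothetical, and refusing to overwrite an existing backup is a design choice made here, not documented behaviour of the project.

```python
import shutil
from pathlib import Path


def archive_checkpoints(model_dir: Path, backup_dir: Path) -> list[Path]:
    """Move every checkpoint-* directory out of model_dir before retraining."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved: list[Path] = []
    for ckpt in sorted(model_dir.glob("checkpoint-*")):
        if not ckpt.is_dir():
            continue  # ignore stray files that happen to match the pattern
        target = backup_dir / ckpt.name
        if target.exists():
            # Fail loudly rather than merge two training runs together.
            raise FileExistsError(f"{target} already exists in the backup")
        shutil.move(str(ckpt), str(target))
        moved.append(target)
    return moved
```

A call such as `archive_checkpoints(Path("models/extractor_v3"), Path("models/extractor_v3_backup_v2"))` mirrors the command above while leaving the non-checkpoint files of the model directory in place.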

Architecture highlights

Hybrid Transformer + rules

Pure LayoutLMv3 (a multimodal document transformer) extraction was unreliable on this small dataset (528 training examples, noisy OCR on form-cell digits). Wrapping the transformer with regex post-processing + per-class field allowlists + OCR-tolerant cross-checks turned a "mostly works" prototype into a system whose verdicts can be trusted at the demande level — even when individual field confidences are low.

Six engine adjustments derived from real-data audit

An 11-demande audit on production-shaped ZIPs surfaced systemic failure modes that the test scores didn't reveal. Each was addressed with a targeted fix (all locked in by regression tests):

  • Stricter _RE_REFURB: rejects "rue Abbé" / "Parcelle" false positives from the RU/PA prefixes.
  • Tri-state _autorisation_matches: distinguishes "different ref" (incohérent) from "no ref readable" (manual review).
  • Out-of-scope filename detection: PV-Loc-PAR, Plan-et-ou-photo, Autre_* files no longer satisfy class requirements.
  • Recolement short-circuit: dossiers de récolement get hors-périmètre status + dedicated AR mail.
  • Filename hints broadened: ARRETE PC.jpg, CERTIFICAT ADRESSAGE.jpg, Mandat_PAR-1-1.pdf all match now.
  • Strict mandat checkbox scorer: "!" and "si" no longer count as marked boxes; ambiguous cases fall through to manual review instead of false OUI.
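The tri-state, Levenshtein-tolerant reference check could look roughly like the sketch below. It is not the project's `_autorisation_matches`: the enum, the normalisation step, and the tolerance of 2 edits are all assumptions chosen for illustration.

```python
import re
from enum import Enum
from typing import Optional


class RefMatch(Enum):
    MATCH = "match"
    MISMATCH = "incohérent"        # both refs readable, but different
    UNREADABLE = "manual review"   # at least one ref could not be read


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def normalise(ref: str) -> str:
    """Uppercase and drop spaces/punctuation so layout noise is ignored."""
    return re.sub(r"[^A-Z0-9]", "", ref.upper())


def autorisation_matches(fiche_ref: Optional[str],
                         autorisation_ref: Optional[str],
                         max_distance: int = 2) -> RefMatch:
    # ASSUMPTION: max_distance=2 is an illustrative tolerance, not the real one.
    if not fiche_ref or not autorisation_ref:
        return RefMatch.UNREADABLE
    a, b = normalise(fiche_ref), normalise(autorisation_ref)
    if not a or not b:
        return RefMatch.UNREADABLE
    return RefMatch.MATCH if levenshtein(a, b) <= max_distance else RefMatch.MISMATCH
```

Keeping "unreadable" separate from "mismatch" is what lets an OCR failure route to manual review instead of producing an incohérent verdict.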

Test suite (171 tests, ~25 s)

| File | Tests | Coverage |
| --- | --- | --- |
| tests/test_cms_generator.py | 67 | All derivations + 4 end-to-end fill_cms scenarios |
| tests/test_recommendation_engine.py | 50 | Rule helpers + verdict logic on synthetic Documents |
| tests/test_inference_postprocess.py | 54 | Regex constants + mandat detector + cleaner |

Every bug fixed during development has a regression test, so running the suite replaces ad-hoc manual re-checking.
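To give an idea of the shape such a regression test might take, here is a pytest-style sketch for the "rue Abbé" / "Parcelle" false positives. The pattern `STRICT_REF` is a hypothetical stand-in, not the project's `_RE_REFURB`, and the reference layout it assumes (prefix, 3+3+2 digits, a letter, 4 digits) is an assumption made for this example.

```python
import re
from typing import Optional

# HYPOTHETICAL pattern: a prefix only counts when the digit groups follow it.
STRICT_REF = re.compile(
    r"\b(?:PC|PA|RU)\s?\d{3}\s?\d{3}\s?\d{2}\s?[A-Z]\s?\d{4}\b",
    re.IGNORECASE,
)


def find_reference(text: str) -> Optional[str]:
    """Return the first reference found, without spaces, or None."""
    match = STRICT_REF.search(text)
    return re.sub(r"\s+", "", match.group(0)).upper() if match else None


def test_street_name_is_not_a_reference():
    # "rue" starts with the RU prefix but is not followed by digits.
    assert find_reference("12 rue Abbé Pierre") is None


def test_parcelle_is_not_a_reference():
    # "Parcelle" starts with the PA prefix but is not followed by digits.
    assert find_reference("Parcelle AB 0123") is None


def test_well_formed_reference_is_found():
    assert find_reference("Permis PA 033 063 21 A0007 accordé") == "PA03306321A0007"
```

Each test pins one previously observed failure, so a later loosening of the pattern is caught immediately.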


Limits & known gaps

  • Handwritten / small-font form-cell digits drop Tesseract confidence below MIN_CONF=30 → Nb_log_pro and Nb_log_res macro-F1 ≈ 0.25. Mitigated by regex backstops where possible, falls through to "manual completion" otherwise.
  • No live re-extraction after filename override — when the model picks PlanMasse with 65% confidence and we override to Autorisation, we don't re-run extraction on the override target. The CMS gets the right class but no fields; consultant fills them in.
  • XY coordinates (Géoréso) and Mondofi ref are always manual — explicitly listed in the CMS download's "À compléter manuellement" panel.
  • Single-page PDFs assumed for several extraction shortcuts — multi-page docs work but only the first page drives classification.
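The confidence cut-off behind the first limitation can be illustrated on a dictionary shaped like the one pytesseract's `image_to_data` returns with `Output.DICT` (parallel `text` and `conf` lists). The function name is hypothetical and the sketch does not call Tesseract itself; it only shows how words below the threshold disappear before extraction.

```python
MIN_CONF = 30  # threshold quoted above ("conf ≥ 30" in the architecture diagram)


def keep_confident_words(ocr: dict, min_conf: float = MIN_CONF) -> list[str]:
    """Keep non-empty OCR words whose confidence reaches min_conf.

    `conf` values may arrive as ints, floats, or strings depending on the
    pytesseract version, hence the float() conversion. Layout-only rows
    carry an empty text and a confidence of -1 and are dropped.
    """
    kept: list[str] = []
    for text, conf in zip(ocr["text"], ocr["conf"]):
        if text.strip() and float(conf) >= min_conf:
            kept.append(text)
    return kept
```

With `{"text": ["Nombre", "", "12", "logements"], "conf": ["96", "-1", "18", 87.5]}` the digit "12" (confidence 18) is discarded, which is exactly how a handwritten housing count can vanish and leave the field for manual completion.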

Author

Mohamed Aziz Miladi — AMARIS internship project (Guichet Accueil Infrastructures).