# FiberGate: Document Analysis Pipeline for Orange's PAR Localisation Workflow

Automated processing of *demandes de localisation du Point d'Accès au Réseau (PAR)* for the Orange "Guichet Accueil Infrastructures" team. Given a folder (or ZIP) of documents submitted by a bureau d'études, the system:

1. **classifies** each document (fiche / autorisation / mandat / plan de masse / plan de situation / certificat),
2. **extracts** 13 business fields with a fine-tuned LayoutLMv3 model,
3. **applies the AGILIS rule set** to reach a verdict on the demande's completeness (complète / incomplète / hors-périmètre),
4. **pre-fills the CMS IMMO 9 BANBOU** Excel template with the derived values,
5. **drafts the AR mail** ready to paste into MSURVEY.

A polished Streamlit demo wraps the whole pipeline with one-click sample loaders for presentation.

---

## Architecture

```mermaid
flowchart LR
    classDef transformer fill:#1e3a8a,stroke:#93c5fd,stroke-width:2px,color:#e0f2fe
    classDef rule fill:#1c1917,stroke:#fb923c,stroke-width:2px,color:#ffedd5
    classDef output fill:#14532d,stroke:#4ade80,stroke-width:2px,color:#dcfce7
    classDef io fill:#312e81,stroke:#a5b4fc,stroke-width:2px,color:#e0e7ff
    classDef decision fill:#7c2d12,stroke:#fb923c,stroke-width:2px,color:#ffedd5

    ZIP(["📁 ZIP / loose files"]):::io

    subgraph PIPE["🔬 Transformer Pipeline · guichetoi.inference"]
        direction TB
        OCR["🔍 OCR
Tesseract · fra
conf ≥ 30"]:::transformer
        CLS["🧠 Classifier
LayoutLMv3
6 document classes"]:::transformer
        EXT["🧠 Extractor
LayoutLMv3 BIO
13 business fields"]:::transformer
        POST["⚙️ Post-processing
regex cleaners
per-class allowlist
mandat checkbox scorer"]:::transformer
        OCR --> CLS --> EXT --> POST
    end

    subgraph RULES["📋 Rule Engine · guichetoi.recommendation"]
        direction TB
        FNHINT["🏷️ Filename hints
PlanSituation · PlanMasse
ARRETE PC · ADRESSAGE"]:::rule
        OOS["🚫 Out-of-scope filter
PV-Loc-PAR · Autre_*
Plan-et-ou-photo"]:::rule
        RECOL{"♻️ Récolement?"}:::decision
        AGILIS["📐 AGILIS rules
R1 – R5
champs obligatoires fiche"]:::rule
        REFMATCH["🔗 Reference cross-check
fiche ↔ autorisation
Levenshtein-tolerant"]:::rule
        FNHINT --> OOS --> RECOL
        RECOL -- "non" --> AGILIS --> REFMATCH
    end

    subgraph OUT["📤 Outputs"]
        direction TB
        VERDICT["✅ Verdict
complète / incomplète
hors-périmètre"]:::output
        ARMAIL["📨 Brouillon AR
ready to paste
into MSURVEY"]:::output
        CMS["📊 CMS IMMO 9
BANBOU pre-filled
xlsx template"]:::output
    end

    UI(["🌐 FastAPI · /docs
azizmiladi-fibergate.hf.space"]):::io

    ZIP --> PIPE
    PIPE --> RULES
    RECOL -- "oui" --> VERDICT
    REFMATCH --> VERDICT
    VERDICT --> ARMAIL
    VERDICT --> CMS
    OUT --> UI
```

**Two-tier design**: the transformer handles perception (what kind of document it is, where the data is), rules handle business logic (what makes a demande complete, how to fill the CMS). Each layer is independently testable and fixable, and per-field validators plus OCR-tolerant cross-checks keep most extraction errors from turning into wrong verdicts.

---

## Headline numbers

| Metric | Value |
|---|---|
| Document classes | 6 (fiche, Autorisation, Mandat, Certificat, PlanMasse, PlanSituation) |
| Fields extracted | 13 (Reference_Urbanisme, DLPI, nb_log_totale, Disposition_Mandat, …) |
| Training set (de-duped, leakage-free) | 754 annotated pages → 528 train / 114 val / 112 test |
| Classifier accuracy (val) | ~ 95 % |
| Extractor macro span-F1 (val, honest) | **0.62** (Reference_Urbanisme 0.77, Email 1.00, nb_log_totale 0.82) |
| Audited demandes (real Orange data) | 11 ZIPs → 7 auto-complète, 3 justifiably-incomplète, 1 hors-périmètre |
| Test suite | **171 passing** unit + integration tests (`pytest -q`, ~25 s) |

---

## Repository layout

```
GuichetOI_ML/
├── src/guichetoi/                 The library (importable as `guichetoi.…`)
│   ├── inference.py               GuichetOIPipeline + post-processing
│   ├── recommendation.py          AGILIS rule engine + AR-mail rendering
│   ├── cms.py                     Fills the CMS IMMO 9 BANBOU xlsx
│   └── api/main.py                FastAPI service (Spring Boot / Angular ready)
├── apps/
│   └── streamlit_demo.py          One-page demo UI (Orange-branded)
├── scripts/                       Training pipeline + batch CLIs
│   ├── 01_convert_labelstudio.py
│   ├── 02_train_classifier.py
│   ├── 03_train_extractor.py
│   ├── 05_evaluate.py
│   ├── ocr_rasterise.py
│   ├── batch_process_dataref.py
│   └── resplit.py · label.py
├── tools/                         Dev / debug one-offs
├── tests/                         171 pytest unit/integration tests
├── docs/
│   ├── DEMO_SCRIPT.md             Voiceover script for the recorded demo
│   └── LOGEMENT_IMPROVEMENTS.md
├── assets/
│   ├── orange_logo.png            Brand mark used by the demo
│   ├── cms_template.xlsx          Official CMS template
│   └── label_mappings.json        6 doc classes + 13 field labels (training output)
├── models/                        Gitignored: LayoutLMv3 weights
│   ├── classifier/                Fine-tuned doc-class model
│   ├── extractor_v3/              Field extractor (current production)
│   └── extractor_v3_backup_v2/    Previous training run (kept for rollback)
├── .github/workflows/ci.yml       Ruff + mypy + pytest on every PR
├── outputs/                       Generated verdicts + CMS files (gitignored)
├── Dockerfile · .dockerignore     Production container image
├── pyproject.toml                 Installable package metadata
├── requirements.txt               Pinned dependencies (Dockerfile + CI)
├── Makefile                       Common dev shortcuts (test, demo, api, docker, …)
├── pytest.ini · mypy.ini          Test + type-check config
└── CONTRIBUTING.md                Branch strategy, setup, sensitive-data rules
```

---

## Setup

### Prerequisites

- **Python 3.14** (tested); likely works on 3.11+
- **Tesseract OCR** with the French language pack
  - Windows: download from [https://github.com/UB-Mannheim/tesseract/wiki](https://github.com/UB-Mannheim/tesseract/wiki)
  - During install, tick "Additional language data" → French
- **8 GB+ RAM** (model loading). CPU works, but a GPU is strongly recommended for retraining.

### Install

```powershell
python -m venv .venv
.venv\Scripts\activate
pip install -e ".[dev,ui]"   # installs the guichetoi package (quoted so the shell passes the extras as one argument)
pip install -r requirements.txt
```

### Verify

```powershell
python -m pytest -q
# should print: 171 passed in ~25 s
```

### Common dev commands ([Makefile](Makefile))

```bash
make help        # list all targets
make test        # full pytest suite (171 tests)
make test-fast   # cms tests only (no model load, < 2 s)
make demo        # streamlit run apps/streamlit_demo.py
make api         # uvicorn guichetoi.api.main:app on :8000
make docker      # docker build -t guichetoi-ml .
make lint        # ruff + mypy
make clean       # remove caches and temp outputs
```

On Windows without `make`, look up the target in the `Makefile` and run its command directly.

---

## Live demo (Hugging Face Space)

The FastAPI service is deployed at **[azizmiladi-fibergate.hf.space](https://azizmiladi-fibergate.hf.space)**.

- The **App** tab opens the interactive Swagger UI (`/docs`); try `POST /analyze` directly in the browser.
- Models are downloaded from HF Hub on first boot (not baked into the image); cold start takes ~2 min on HF's free CPU tier.

---

## Run the demo locally

```powershell
streamlit run apps/streamlit_demo.py
```

A browser tab opens at `http://localhost:8501`.

**For a quick demo**: click any **🎬 Échantillon de démonstration** button. Results are pre-computed and appear instantly (~1 s).

**For a live analysis**: drop a ZIP of a real demande into the file uploader. CPU inference takes ~5-15 s per document.

See [docs/DEMO_SCRIPT.md](docs/DEMO_SCRIPT.md) for a 3-5 minute presentation script with timing and key talking points.

---

## CLI usage

### Analyse one document

```powershell
python -m guichetoi.inference --image path/to/doc.pdf
# → prints classification + extracted fields, saves JSON to outputs/
```

### Analyse a complete demande (folder)

```powershell
python -m guichetoi.recommendation --folder path/to/demande/
# → produces outputs//verdict.json + ar_mail.txt
```

### Use as a Python library

```python
from guichetoi.recommendation import RecommendationEngine

engine = RecommendationEngine()  # loads model once
verdict = engine.evaluate_folder("path/to/demande/")
print(verdict.status)  # "complète" / "incomplète" / "hors-périmètre"
```

### Run as an HTTP service (for Spring Boot / Angular)

```powershell
uvicorn guichetoi.api.main:app --host 0.0.0.0 --port 8000
# or: docker build -t guichetoi-ml . && docker run -p 8000:8000 guichetoi-ml
```

Endpoints: `POST /analyze`, `POST /cms`, `POST /cms/preview`, `GET /metadata`, `GET /health`.
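
Because the model takes a while to load on first boot, a client may want to wait for the service before calling `POST /analyze`. The snippet below is a minimal sketch of that idea, not part of the `guichetoi` package: `fetch_health` and `wait_until_ready` are hypothetical helper names, and it assumes `GET /health` returns a JSON object with a boolean `pipeline_loaded` flag (the flag name appears in the deployment notes; the exact response shape is an assumption).

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str) -> dict:
    """GET <base_url>/health and decode the JSON body (assumed to be an object)."""
    url = f"{base_url.rstrip('/')}/health"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def wait_until_ready(
    base_url: str,
    fetch: Callable[[str], dict] = fetch_health,
    timeout_s: float = 180.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the health endpoint until `pipeline_loaded` is true or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if fetch(base_url).get("pipeline_loaded") is True:
                return True
        except (OSError, ValueError):
            # Connection refused, HTTP error, or a non-JSON body: treat as "not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Assumes a local instance started with the uvicorn command above.
    print(wait_until_ready("http://localhost:8000"))
```

Passing `fetch` and `sleep` as parameters keeps the polling loop testable without a running server.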
`GET /` redirects to `/docs` (Swagger UI). The OpenAPI spec is served at `/openapi.json` (consume it with `openapi-generator` for a typed Spring `WebClient`).

---

## Deployment

### Hugging Face Space (public demo)

The canonical live environment is the HF Space **azizmiladi/fibergate** (Docker SDK, port 8000).

- Models are **not baked into the image**: the container downloads them from HF Hub on first boot using `HF_TOKEN` (set as a Space secret).
- `GET /` → 302 → `/docs` so the HF "App" tab shows the Swagger UI immediately.
- Free CPU tier: ~2 min cold start, ~10–30 s per analyze call.

### Render (production / Spring Boot integration)

Topology: **Angular (Static Site) → Spring Boot (Web Service) → guichetoi-ml (Private Service)**. The Python service has no public URL; only your Spring Boot service can reach it.

**Why a Pro plan is required**: LayoutLMv3 holds ~1.2 GB resident. Free/Starter (512 MB) crashes at model load. The Pro plan (2 GB) is the minimum viable option.

#### One-time setup

1. **Create a GitHub PAT** with `write:packages` and `read:packages` scopes (Settings → Developer settings → Personal access tokens → Fine-grained → repo-scoped to this one).
2. **Log in to GHCR locally**:
   ```powershell
   echo $env:GHCR_PAT | docker login ghcr.io -u medaziz012 --password-stdin
   ```
3. **In Render dashboard** → Settings → Registry Credentials → add one named `ghcr` (username = GitHub username, password = same PAT).

#### Each deploy

```powershell
make release   # builds locally and pushes to GHCR
```

Render auto-pulls `:latest` and redeploys (`autoDeploy: true` in [render.yaml](render.yaml)). First boot takes ~30 s; Render's health check polls `/health` until `pipeline_loaded: true`.

#### Spring Boot configuration

```yaml
guichetoi:
  ml:
    base-url: http://guichetoi-ml:10000
```

No CORS, no public exposure, no separate auth: Spring Boot is the only client.

---

## Retraining

```powershell
# 1. Annotate new documents in Label Studio, export JSON

# 2. Convert to training format
python scripts/01_convert_labelstudio.py path/to/export.json

# 3. Train (writes to models/extractor_v3/)
python scripts/03_train_extractor.py

# 4. Evaluate on the held-out test split
python scripts/05_evaluate.py
```

Training the extractor takes ~6 hours on CPU, ~30 min on a single GPU.

**Move old checkpoints first**: HuggingFace Trainer's `save_total_limit=3` rotates by step number, not date, so leaving old checkpoints in place silently keeps the *old* model.

```powershell
mv models/extractor_v3/checkpoint-* models/extractor_v3_backup_v2/
```

---

## Architecture highlights

### Hybrid Transformer + rules

Pure LayoutLMv3 (a multimodal document transformer) extraction was unreliable on this small dataset (528 training examples, noisy OCR on form-cell digits). Wrapping the transformer with **regex post-processing + per-class field allowlists + OCR-tolerant cross-checks** turned a "mostly works" prototype into a system whose verdicts can be trusted at the demande level, even when individual field confidences are low.

### Six engine adjustments derived from real-data audit

An 11-demande audit on production-shaped ZIPs surfaced systemic failure modes that the test scores didn't reveal. Each was addressed with a targeted fix (all locked in by regression tests):

- **Stricter `_RE_REFURB`**: rejects "rue Abbé" / "Parcelle" false positives from the `RU`/`PA` prefixes.
- **Tri-state `_autorisation_matches`**: distinguishes "different ref" (incohérent) from "no ref readable" (manual review).
- **Out-of-scope filename detection**: `PV-Loc-PAR`, `Plan-et-ou-photo`, `Autre_*` files no longer satisfy class requirements.
- **Récolement short-circuit**: dossiers de récolement get `hors-périmètre` status + a dedicated AR mail.
- **Filename hints broadened**: `ARRETE PC.jpg`, `CERTIFICAT ADRESSAGE.jpg`, `Mandat_PAR-1-1.pdf` all match now.
- **Strict mandat checkbox scorer**: `!` and `si` no longer count as marked boxes; ambiguous cases fall through to manual review instead of a false OUI.

### Test suite (171 tests, ~25 s)

| File | Tests | Coverage |
|---|---|---|
| `tests/test_cms_generator.py` | 67 | All derivations + 4 end-to-end fill_cms scenarios |
| `tests/test_recommendation_engine.py` | 50 | Rule helpers + verdict logic on synthetic Documents |
| `tests/test_inference_postprocess.py` | 54 | Regex constants + mandat detector + cleaner |

Every bug fixed during development has a regression test, so running the suite takes the place of "I checked it manually".

---

## Limits & known gaps

- **Handwritten / small-font form-cell digits** drop Tesseract confidence below `MIN_CONF=30` → `Nb_log_pro` and `Nb_log_res` macro-F1 ≈ 0.25. Mitigated by regex backstops where possible; falls through to "manual completion" otherwise.
- **No live re-extraction after filename override**: when the model picks PlanMasse with 65% confidence and we override to Autorisation, we don't re-run extraction on the override target. The CMS gets the right class but no fields; the consultant fills them in.
- **XY coordinates (Géoréso) and Mondofi ref** are always manual; they are explicitly listed in the CMS download's "À compléter manuellement" panel.
- **Single-page PDFs assumed** for several extraction shortcuts: multi-page docs work, but only the first page drives classification.

---

## Author

Mohamed Aziz Miladi, AMARIS internship project (Guichet Accueil Infrastructures).