| # FiberGate — Document Analysis Pipeline for Orange's PAR Localisation Workflow | |
| Automated processing of *demandes de localisation du Point d'Accès au Réseau (PAR)* | |
| for the Orange "Guichet Accueil Infrastructures" team. Given a folder (or ZIP) of | |
| documents submitted by a bureau d'études, the system: | |
| 1. **classifies** each document (fiche / autorisation / mandat / plan de masse / plan de situation / certificat), | |
| 2. **extracts** 13 business fields with a fine-tuned LayoutLMv3 model, | |
| 3. **applies the AGILIS rule set** to rule on the demande's completeness (complète / incomplète / hors-périmètre), | |
| 4. **pre-fills the CMS IMMO 9 BANBOU** Excel template with the derived values, | |
| 5. **drafts the AR mail** ready to paste into MSURVEY. | |
| A polished Streamlit demo wraps the whole pipeline with one-click sample loaders for presentation. | |
| --- | |
| ## Architecture | |
| ```mermaid | |
| flowchart LR | |
| classDef transformer fill:#1e3a8a,stroke:#93c5fd,stroke-width:2px,color:#e0f2fe | |
| classDef rule fill:#1c1917,stroke:#fb923c,stroke-width:2px,color:#ffedd5 | |
| classDef output fill:#14532d,stroke:#4ade80,stroke-width:2px,color:#dcfce7 | |
| classDef io fill:#312e81,stroke:#a5b4fc,stroke-width:2px,color:#e0e7ff | |
| classDef decision fill:#7c2d12,stroke:#fb923c,stroke-width:2px,color:#ffedd5 | |
| ZIP(["📁 ZIP / loose files"]):::io | |
| subgraph PIPE["🔬 Transformer Pipeline · guichetoi.inference"] | |
| direction TB | |
| OCR["🔍 OCR<br/>Tesseract · fra<br/>conf ≥ 30"]:::transformer | |
| CLS["🧠 Classifier<br/>LayoutLMv3<br/>6 document classes"]:::transformer | |
| EXT["🧠 Extractor<br/>LayoutLMv3 BIO<br/>13 business fields"]:::transformer | |
| POST["⚙️ Post-processing<br/>regex cleaners<br/>per-class allowlist<br/>mandat checkbox scorer"]:::transformer | |
| OCR --> CLS --> EXT --> POST | |
| end | |
| subgraph RULES["📋 Rule Engine · guichetoi.recommendation"] | |
| direction TB | |
| FNHINT["🏷️ Filename hints<br/>PlanSituation · PlanMasse<br/>ARRETE PC · ADRESSAGE"]:::rule | |
| OOS["🚫 Out-of-scope filter<br/>PV-Loc-PAR · Autre_*<br/>Plan-et-ou-photo"]:::rule | |
| RECOL{"♻️ Récolement?"}:::decision | |
| AGILIS["📐 AGILIS rules<br/>R1 – R5<br/>champs obligatoires fiche"]:::rule | |
| REFMATCH["🔗 Reference cross-check<br/>fiche ↔ autorisation<br/>Levenshtein-tolerant"]:::rule | |
| FNHINT --> OOS --> RECOL | |
| RECOL -- "non" --> AGILIS --> REFMATCH | |
| end | |
| subgraph OUT["📤 Outputs"] | |
| direction TB | |
| VERDICT["✅ Verdict<br/>complète / incomplète<br/>hors-périmètre"]:::output | |
| ARMAIL["📨 Brouillon AR<br/>ready to paste<br/>into MSURVEY"]:::output | |
| CMS["📊 CMS IMMO 9<br/>BANBOU pre-filled<br/>xlsx template"]:::output | |
| end | |
| UI(["🌐 FastAPI · /docs<br/>azizmiladi-fibergate.hf.space"]):::io | |
| ZIP --> PIPE | |
| PIPE --> RULES | |
| RECOL -- "oui" --> VERDICT | |
| REFMATCH --> VERDICT | |
| VERDICT --> ARMAIL | |
| VERDICT --> CMS | |
| OUT --> UI | |
| ``` | |
| **Two-tier design**: the transformer handles perception (what kind of document it is, where the data is), while rules handle business logic (what makes a demande complete, how to fill the CMS). Each layer can be tested and fixed independently, and per-field validators plus OCR-tolerant cross-checks limit how far extraction errors propagate into verdicts. | |
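To illustrate what an OCR-tolerant, Levenshtein-based cross-check between the fiche and autorisation references could look like, here is a minimal sketch. It is not the project's actual implementation: the function names, the normalisation step, the `max_distance=2` threshold, and the three return labels are all assumptions made for illustration.

```python
import re
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert / delete / substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalise_ref(raw: Optional[str]) -> str:
    """Upper-case and keep only A-Z and 0-9 (hypothetical normalisation)."""
    return re.sub(r"[^A-Z0-9]", "", (raw or "").upper())


def cross_check(fiche_ref: Optional[str],
                autorisation_ref: Optional[str],
                max_distance: int = 2) -> str:
    """Return 'match', 'mismatch', or 'unreadable' (labels are illustrative)."""
    a, b = normalise_ref(fiche_ref), normalise_ref(autorisation_ref)
    if not a or not b:
        # Nothing usable on one side: defer to a human instead of guessing.
        return "unreadable"
    return "match" if levenshtein(a, b) <= max_distance else "mismatch"
```

Keeping "unreadable" separate from "mismatch" means an OCR failure leads to manual review rather than a false inconsistency.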
| --- | |
| ## Headline numbers | |
| | Metric | Value | | |
| |---|---| | |
| | Document classes | 6 (fiche, Autorisation, Mandat, Certificat, PlanMasse, PlanSituation) | | |
| | Fields extracted | 13 (Reference_Urbanisme, DLPI, nb_log_totale, Disposition_Mandat, …) | | |
| | Training set (de-duped, leakage-free) | 754 annotated pages → 528 train / 114 val / 112 test | | |
| | Classifier accuracy (val) | ~ 95 % | | |
| | Extractor macro span-F1 (val, leakage-free split) | **0.62** overall; per field: Reference_Urbanisme 0.77, Email 1.00, nb_log_totale 0.82 | |
| | Audited demandes (real Orange data) | 11 ZIPs → 7 auto-complète, 3 justifiably-incomplète, 1 hors-périmètre | | |
| | Test suite | **171 passing** unit + integration tests (`pytest -q`, ~25 s) | | |
| --- | |
| ## Repository layout | |
| ``` | |
| GuichetOI_ML/ | |
| ├── src/guichetoi/ The library (importable as `guichetoi.…`) | |
| │ ├── inference.py GuichetOIPipeline + post-processing | |
| │ ├── recommendation.py AGILIS rule engine + AR-mail rendering | |
| │ ├── cms.py Fills the CMS IMMO 9 BANBOU xlsx | |
| │ └── api/main.py FastAPI service (Spring Boot / Angular ready) | |
| ├── apps/ | |
| │ └── streamlit_demo.py One-page demo UI (Orange-branded) | |
| ├── scripts/ Training pipeline + batch CLIs | |
| │ ├── 01_convert_labelstudio.py | |
| │ ├── 02_train_classifier.py | |
| │ ├── 03_train_extractor.py | |
| │ ├── 05_evaluate.py | |
| │ ├── ocr_rasterise.py | |
| │ ├── batch_process_dataref.py | |
| │ └── resplit.py · label.py | |
| ├── tools/ Dev / debug one-offs | |
| ├── tests/ 171 pytest unit/integration tests | |
| ├── docs/ | |
| │ ├── DEMO_SCRIPT.md Voiceover script for the recorded demo | |
| │ └── LOGEMENT_IMPROVEMENTS.md | |
| ├── assets/ | |
| │ ├── orange_logo.png Brand mark used by the demo | |
| │ ├── cms_template.xlsx Official CMS template | |
| │ └── label_mappings.json 6 doc classes + 13 field labels (training output) | |
| ├── models/ Gitignored — LayoutLMv3 weights | |
| │ ├── classifier/ Fine-tuned doc-class model | |
| │ ├── extractor_v3/ Field extractor (current production) | |
| │ └── extractor_v3_backup_v2/ Previous training run (kept for rollback) | |
| ├── .github/workflows/ci.yml Ruff + mypy + pytest on every PR | |
| ├── outputs/ Generated verdicts + CMS files (gitignored) | |
| ├── Dockerfile · .dockerignore Production container image | |
| ├── pyproject.toml Installable package metadata | |
| ├── requirements.txt Pinned dependencies (Dockerfile + CI) | |
| ├── Makefile Common dev shortcuts (test, demo, api, docker, …) | |
| ├── pytest.ini · mypy.ini Test + type-check config | |
| └── CONTRIBUTING.md Branch strategy, setup, sensitive-data rules | |
| ``` | |
| --- | |
| ## Setup | |
| ### Prerequisites | |
| - **Python 3.14** (tested) — likely works on 3.11+ | |
| - **Tesseract OCR** with the French language pack | |
| - Windows: download from [https://github.com/UB-Mannheim/tesseract/wiki](https://github.com/UB-Mannheim/tesseract/wiki) | |
| - During install, tick "Additional language data" → French | |
| - **8 GB+ RAM** for model loading. Inference works on CPU; a GPU is strongly recommended for retraining. | |
| ### Install | |
| ```powershell | |
| python -m venv .venv | |
| .venv\Scripts\activate | |
| pip install -e ".[dev,ui]" # installs the guichetoi package (quotes keep PowerShell from splitting on the comma) | |
| pip install -r requirements.txt | |
| ``` | |
| ### Verify | |
| ```powershell | |
| python -m pytest -q # should print: 171 passed in ~25 s | |
| ``` | |
| ### Common dev commands ([Makefile](Makefile)) | |
| ```bash | |
| make help # list all targets | |
| make test # full pytest suite (171 tests) | |
| make test-fast # cms tests only (no model load, < 2 s) | |
| make demo # streamlit run apps/streamlit_demo.py | |
| make api # uvicorn guichetoi.api.main:app on :8000 | |
| make docker # docker build -t guichetoi-ml . | |
| make lint # ruff + mypy | |
| make clean # remove caches and temp outputs | |
| ``` | |
| On Windows without `make`, open `Makefile` and run the command listed for the target you need directly. | |
| --- | |
| ## Live demo (Hugging Face Space) | |
| The FastAPI service is deployed at **[azizmiladi-fibergate.hf.space](https://azizmiladi-fibergate.hf.space)**. | |
| - The **App** tab opens the interactive Swagger UI (`/docs`) — try `POST /analyze` directly in the browser. | |
| - Models are downloaded from HF Hub on first boot (not baked into the image); cold-start takes ~2 min on HF's free CPU tier. | |
| --- | |
| ## Run the demo locally | |
| ```powershell | |
| streamlit run apps/streamlit_demo.py | |
| ``` | |
| A browser tab opens at `http://localhost:8501`. | |
| **For a quick demo**: click any **🎬 Échantillon de démonstration** button — results are pre-computed and appear instantly (~1 s). | |
| **For a live analysis**: drop a ZIP of a real demande into the file uploader. CPU inference takes ~5-15 s per document. | |
| See [docs/DEMO_SCRIPT.md](docs/DEMO_SCRIPT.md) for a 3-5 minute presentation script with timing and key talking points. | |
| --- | |
| ## CLI usage | |
| ### Analyse one document | |
| ```powershell | |
| python -m guichetoi.inference --image path/to/doc.pdf | |
| # → prints classification + extracted fields, saves JSON to outputs/ | |
| ``` | |
| ### Analyse a complete demande (folder) | |
| ```powershell | |
| python -m guichetoi.recommendation --folder path/to/demande/ | |
| # → produces outputs/<demande>/verdict.json + ar_mail.txt | |
| ``` | |
| ### Use as a Python library | |
| ```python | |
| from guichetoi.recommendation import RecommendationEngine | |
| engine = RecommendationEngine() # loads model once | |
| verdict = engine.evaluate_folder("path/to/demande/") | |
| print(verdict.status) # "complète" / "incomplète" / "hors-périmètre" | |
| ``` | |
| ### Run as an HTTP service (for Spring Boot / Angular) | |
| ```powershell | |
| uvicorn guichetoi.api.main:app --host 0.0.0.0 --port 8000 | |
| # or: docker build -t guichetoi-ml . && docker run -p 8000:8000 guichetoi-ml | |
| ``` | |
| Endpoints: `POST /analyze`, `POST /cms`, `POST /cms/preview`, `GET /metadata`, `GET /health`. | |
| `GET /` redirects to `/docs` (Swagger UI). | |
| OpenAPI spec at `/openapi.json` (consume with `openapi-generator` for a typed Spring `WebClient`). | |
| --- | |
| ## Deployment | |
| ### Hugging Face Space (public demo) | |
| The canonical live environment is the HF Space **azizmiladi/fibergate** (Docker SDK, port 8000). | |
| - Models are **not baked into the image** — the container downloads them from HF Hub on first boot using `HF_TOKEN` (set as a Space secret). | |
| - `GET /` → 302 → `/docs` so the HF "App" tab shows the Swagger UI immediately. | |
| - Free CPU tier: ~2 min cold-start, ~10–30 s per analyze call. | |
| ### Render (production / Spring Boot integration) | |
| Topology: **Angular (Static Site) → Spring Boot (Web Service) → guichetoi-ml (Private Service)**. | |
| The Python service has no public URL; only your Spring Boot can reach it. | |
| **Why a Pro plan is required**: LayoutLMv3 holds ~1.2 GB resident. Free/Starter (512 MB) crashes at model load. Pro = 2 GB = minimum viable. | |
| #### One-time setup | |
| 1. **Create a GitHub personal access token (classic)** with `write:packages` and `read:packages` scopes | |
| (Settings → Developer settings → Personal access tokens → Tokens (classic)); these scopes are not offered on fine-grained tokens. | |
| 2. **Log in to GHCR locally**: | |
| ```powershell | |
| echo $env:GHCR_PAT | docker login ghcr.io -u medaziz012 --password-stdin | |
| ``` | |
| 3. **In Render dashboard** → Settings → Registry Credentials → add one named `ghcr` (username = GitHub username, password = same PAT). | |
| #### Each deploy | |
| ```powershell | |
| make release # builds locally and pushes to GHCR | |
| ``` | |
| Render auto-pulls `:latest` and redeploys (`autoDeploy: true` in [render.yaml](render.yaml)). First boot takes ~30 s; Render's health check polls `/health` until `pipeline_loaded: true`. | |
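A client such as a deployment script can apply the same readiness check before sending traffic. The sketch below polls `/health` until the JSON body reports `pipeline_loaded: true`. Only that key comes from the description above; the function names, retry counts, and the injectable `fetch` / `sleep` parameters are assumptions added so the logic can be exercised without a running service.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str) -> dict:
    """GET <base_url>/health and decode the JSON body."""
    url = f"{base_url.rstrip('/')}/health"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_until_ready(base_url: str,
                     fetch: Callable[[str], dict] = fetch_health,
                     attempts: int = 30,
                     delay_s: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll the health endpoint until the pipeline reports it is loaded."""
    for attempt in range(attempts):
        try:
            if fetch(base_url).get("pipeline_loaded") is True:
                return True
        except (OSError, ValueError):
            # Connection refused, timeout, or a non-JSON body: keep waiting.
            pass
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

For a local run this could be called as `wait_until_ready("http://localhost:8000")`, matching the `uvicorn` command shown earlier.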
| #### Spring Boot configuration | |
| ```yaml | |
| guichetoi: | |
| ml: | |
| base-url: http://guichetoi-ml:10000 | |
| ``` | |
| No CORS, no public exposure, no separate auth — Spring Boot is the only client. | |
| --- | |
| ## Retraining | |
| ```powershell | |
| # 1. Annotate new documents in Label Studio, export JSON | |
| # 2. Convert to training format | |
| python scripts/01_convert_labelstudio.py path/to/export.json | |
| # 3. Train (writes to models/extractor_v3/) | |
| python scripts/03_train_extractor.py | |
| # 4. Evaluate on the held-out test split | |
| python scripts/05_evaluate.py | |
| ``` | |
| Training the extractor takes ~6 hours on CPU, ~30 min on a single GPU. | |
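| **Move old checkpoints first**: with `save_total_limit=3`, the Hugging Face Trainer decides which checkpoints to delete by step number, not by date. If checkpoints from a previous run are still in the output directory, the new run's checkpoints can be rotated out and the *old* model is silently kept. | |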
| ```powershell | |
| mv models/extractor_v3/checkpoint-* models/extractor_v3_backup_v2/ | |
| ``` | |
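The same move can be scripted so it runs right before training on any platform. This is a minimal sketch, not part of the repository: `archive_checkpoints` is a hypothetical helper, and it assumes checkpoints are directories named `checkpoint-<step>` directly under the model folder.

```python
import shutil
from pathlib import Path
from typing import List, Union


def archive_checkpoints(model_dir: Union[str, Path],
                        backup_dir: Union[str, Path]) -> List[Path]:
    """Move every checkpoint-* directory from model_dir into backup_dir."""
    model_dir, backup_dir = Path(model_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for ckpt in sorted(model_dir.glob("checkpoint-*")):
        if not ckpt.is_dir():
            continue  # ignore stray files that happen to match the pattern
        target = backup_dir / ckpt.name
        if target.exists():
            # Refuse to overwrite an earlier backup with the same step number.
            raise FileExistsError(f"{target} already exists")
        shutil.move(str(ckpt), str(target))
        moved.append(target)
    return moved


if __name__ == "__main__":
    for path in archive_checkpoints("models/extractor_v3",
                                    "models/extractor_v3_backup_v2"):
        print(f"archived {path}")
```

Other files in the model folder (config, tokenizer, final weights) are left untouched.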
| --- | |
| ## Architecture highlights | |
| ### Hybrid Transformer + rules | |
| Pure LayoutLMv3 (a multimodal document transformer) extraction was unreliable on this small dataset (528 training examples, noisy OCR on form-cell digits). Wrapping the transformer with **regex post-processing + per-class field allowlists + OCR-tolerant cross-checks** turned a "mostly works" prototype into a system whose verdicts can be trusted at the demande level — even when individual field confidences are low. | |
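A per-class field allowlist can be pictured as a simple filter applied after extraction. The sketch below is illustrative only: the field names are taken from this README, but which field is allowed on which document class is a made-up mapping, and `apply_allowlist` is a hypothetical helper rather than the function in `guichetoi.inference`.

```python
from typing import Dict, Set

# Hypothetical mapping: the real class-to-field assignment is not documented here.
ALLOWED_FIELDS: Dict[str, Set[str]] = {
    "fiche": {"Reference_Urbanisme", "DLPI", "nb_log_totale", "Email"},
    "Autorisation": {"Reference_Urbanisme"},
    "Mandat": {"Disposition_Mandat"},
}


def apply_allowlist(doc_class: str, fields: Dict[str, str]) -> Dict[str, str]:
    """Drop any extracted field that is not expected on this document class."""
    allowed = ALLOWED_FIELDS.get(doc_class, set())
    return {name: value for name, value in fields.items() if name in allowed}


if __name__ == "__main__":
    raw = {"Disposition_Mandat": "OUI", "Email": "contact@example.fr"}
    # An e-mail address predicted on a mandat is treated as a spurious span.
    print(apply_allowlist("Mandat", raw))
```

The idea is that a span predicted on the wrong kind of document is more likely noise than data, so it is discarded before the rule engine sees it.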
| ### Six engine adjustments derived from real-data audit | |
| An 11-demande audit on production-shaped ZIPs surfaced systemic failure modes that the test scores didn't reveal. Each was addressed with a targeted fix (all locked in by regression tests): | |
| - **Stricter `_RE_REFURB`** — rejects "rue Abbé" / "Parcelle" false positives from the `RU`/`PA` prefixes. | |
| - **Tri-state `_autorisation_matches`** — distinguishes "different ref" (incohérent) from "no ref readable" (manual review). | |
| - **Out-of-scope filename detection** — `PV-Loc-PAR`, `Plan-et-ou-photo`, `Autre_*` files no longer satisfy class requirements. | |
| - **Récolement short-circuit** — dossiers de récolement get `hors-périmètre` status + dedicated AR mail. | |
| - **Filename hints broadened** — `ARRETE PC.jpg`, `CERTIFICAT ADRESSAGE.jpg`, `Mandat_PAR-1-1.pdf` all match now. | |
| - **Strict mandat checkbox scorer** — `!` and `si` no longer count as marked boxes; ambiguous cases fall through to manual review instead of false OUI. | |
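As an example of the filename-based adjustments listed above, out-of-scope detection can be sketched with a single regular expression. The three patterns come from the list; the matching semantics (case-insensitive, substring match for the first two, prefix match for `Autre_`) and the helper name are assumptions.

```python
import re
from pathlib import PurePosixPath

# Assumed semantics: case-insensitive; "Autre_" must start the file name.
_OUT_OF_SCOPE = re.compile(r"pv-loc-par|plan-et-ou-photo|^autre_", re.IGNORECASE)


def is_out_of_scope(member_name: str) -> bool:
    """True when the file name marks a document outside the PAR workflow."""
    # ZIP member names use forward slashes, so only the last component counts.
    filename = PurePosixPath(member_name).name
    return _OUT_OF_SCOPE.search(filename) is not None


if __name__ == "__main__":
    for name in ("demande/PV-Loc-PAR_signed.pdf", "Autre_photo.jpg",
                 "Mandat_PAR-1-1.pdf"):
        print(name, is_out_of_scope(name))
```

Files flagged this way would simply be skipped when checking which document classes a demande contains.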
| ### Test suite (171 tests, ~25 s) | |
| | File | Tests | Coverage | | |
| |---|---|---| | |
| | `tests/test_cms_generator.py` | 67 | All derivations + 4 end-to-end fill_cms scenarios | | |
| | `tests/test_recommendation_engine.py` | 50 | Rule helpers + verdict logic on synthetic Documents | | |
| | `tests/test_inference_postprocess.py` | 54 | Regex constants + mandat detector + cleaner | | |
| Every bug found during development has a regression test, so running the suite replaces ad-hoc manual checks. | |
| --- | |
| ## Limits & known gaps | |
| - **Handwritten / small-font form-cell digits** drop Tesseract confidence below `MIN_CONF=30`, so macro-F1 for `Nb_log_pro` and `Nb_log_res` is ≈ 0.25. Regex backstops mitigate this where possible; otherwise the field falls through to "manual completion". | |
| - **No live re-extraction after filename override** — when the model picks PlanMasse with 65% confidence and we override to Autorisation, we don't re-run extraction on the override target. The CMS gets the right class but no fields; consultant fills them in. | |
| - **XY coordinates (Géoréso) and Mondofi ref** are always manual — explicitly listed in the CMS download's "À compléter manuellement" panel. | |
| - **Single-page PDFs assumed** for several extraction shortcuts — multi-page docs work but only the first page drives classification. | |
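To make the first limitation concrete, the sketch below shows how a confidence cut-off of 30 discards low-confidence words. It operates on a dictionary shaped like the output of `pytesseract.image_to_data(..., output_type=Output.DICT)`, which has parallel `text` and `conf` lists. Whether the project calls Tesseract this way is an assumption, and `keep_confident_words` is a hypothetical helper.

```python
from typing import Dict, List

MIN_CONF = 30  # threshold quoted in this README


def keep_confident_words(ocr: Dict[str, list], min_conf: float = MIN_CONF) -> List[str]:
    """Keep non-empty words whose Tesseract confidence is at least min_conf."""
    kept = []
    for text, conf in zip(ocr["text"], ocr["conf"]):
        # Tesseract reports -1 for layout rows that carry no word;
        # depending on the version, conf may arrive as a string or a number.
        if text.strip() and float(conf) >= min_conf:
            kept.append(text)
    return kept


if __name__ == "__main__":
    sample = {"text": ["", "Nombre", "12", " "],
              "conf": ["-1", "91", 24.5, "95"]}
    # The digit "12" (confidence 24.5) is lost, which is the failure mode above.
    print(keep_confident_words(sample))
```

A small handwritten digit that Tesseract reads correctly but scores below the threshold never reaches the extractor, which is why these fields rely on regex backstops or manual completion.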
| --- | |
| ## Author | |
| Mohamed Aziz Miladi — AMARIS internship project (Guichet Accueil Infrastructures). | |