# FiberGate — Document Analysis Pipeline for Orange's PAR Localisation Workflow
Automated processing of *demandes de localisation du Point d'Accès au Réseau (PAR)*
for the Orange "Guichet Accueil Infrastructures" team. Given a folder (or ZIP) of
documents submitted by a bureau d'études, the system:
1. **classifies** each document (fiche / autorisation / mandat / plan de masse / plan de situation / certificat),
2. **extracts** 13 business fields with a fine-tuned LayoutLMv3 model,
3. **applies the AGILIS rule set** to reach a verdict on the demande's completeness (complète / incomplète / hors-périmètre),
4. **pre-fills the CMS IMMO 9 BANBOU** Excel template with the derived values,
5. **drafts the AR mail** ready to paste into MSURVEY.
A polished Streamlit demo wraps the whole pipeline with one-click sample loaders for presentation.
---
## Architecture
```mermaid
flowchart LR
    classDef transformer fill:#1e3a8a,stroke:#93c5fd,stroke-width:2px,color:#e0f2fe
    classDef rule fill:#1c1917,stroke:#fb923c,stroke-width:2px,color:#ffedd5
    classDef output fill:#14532d,stroke:#4ade80,stroke-width:2px,color:#dcfce7
    classDef io fill:#312e81,stroke:#a5b4fc,stroke-width:2px,color:#e0e7ff
    classDef decision fill:#7c2d12,stroke:#fb923c,stroke-width:2px,color:#ffedd5
    ZIP(["📁 ZIP / loose files"]):::io
    subgraph PIPE["🔬 Transformer Pipeline · guichetoi.inference"]
        direction TB
        OCR["🔍 OCR<br/>Tesseract · fra<br/>conf ≥ 30"]:::transformer
        CLS["🧠 Classifier<br/>LayoutLMv3<br/>6 document classes"]:::transformer
        EXT["🧠 Extractor<br/>LayoutLMv3 BIO<br/>13 business fields"]:::transformer
        POST["⚙️ Post-processing<br/>regex cleaners<br/>per-class allowlist<br/>mandat checkbox scorer"]:::transformer
        OCR --> CLS --> EXT --> POST
    end
    subgraph RULES["📋 Rule Engine · guichetoi.recommendation"]
        direction TB
        FNHINT["🏷️ Filename hints<br/>PlanSituation · PlanMasse<br/>ARRETE PC · ADRESSAGE"]:::rule
        OOS["🚫 Out-of-scope filter<br/>PV-Loc-PAR · Autre_*<br/>Plan-et-ou-photo"]:::rule
        RECOL{"♻️ Récolement?"}:::decision
        AGILIS["📐 AGILIS rules<br/>R1 – R5<br/>champs obligatoires fiche"]:::rule
        REFMATCH["🔗 Reference cross-check<br/>fiche ↔ autorisation<br/>Levenshtein-tolerant"]:::rule
        FNHINT --> OOS --> RECOL
        RECOL -- "non" --> AGILIS --> REFMATCH
    end
    subgraph OUT["📤 Outputs"]
        direction TB
        VERDICT["✅ Verdict<br/>complète / incomplète<br/>hors-périmètre"]:::output
        ARMAIL["📨 Brouillon AR<br/>ready to paste<br/>into MSURVEY"]:::output
        CMS["📊 CMS IMMO 9<br/>BANBOU pre-filled<br/>xlsx template"]:::output
    end
    UI(["🌐 FastAPI · /docs<br/>azizmiladi-fibergate.hf.space"]):::io
    ZIP --> PIPE
    PIPE --> RULES
    RECOL -- "oui" --> VERDICT
    REFMATCH --> VERDICT
    VERDICT --> ARMAIL
    VERDICT --> CMS
    OUT --> UI
```
**Two-tier design**: the transformer handles perception (what kind of document it is, where the data is); rules handle business logic (what makes a demande complete, how to fill the CMS). Each layer is independently testable and fixable, and per-field validators plus OCR-tolerant cross-checks limit how far an extraction error can propagate into a wrong verdict.
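
As an illustration of the rule tier, the out-of-scope filter shown in the diagram could be sketched as below. The three patterns come from the diagram; the case-insensitive matching on the file stem and the names `is_out_of_scope` and `in_scope` are assumptions, not the actual code in `guichetoi.recommendation`.

```python
from fnmatch import fnmatchcase
from pathlib import PurePath

# Patterns taken from the architecture diagram; matching semantics are assumed.
OUT_OF_SCOPE_PATTERNS = ("*pv-loc-par*", "autre_*", "*plan-et-ou-photo*")


def is_out_of_scope(filename: str) -> bool:
    """Return True when the lower-cased file stem matches an out-of-scope pattern."""
    stem = PurePath(filename).stem.lower()
    return any(fnmatchcase(stem, pattern) for pattern in OUT_OF_SCOPE_PATTERNS)


def in_scope(filenames: list[str]) -> list[str]:
    """Keep only the files that may count towards class requirements."""
    return [name for name in filenames if not is_out_of_scope(name)]
```

Because this step only looks at file names, it can be unit-tested without loading any model.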
---
## Headline numbers
| Metric | Value |
|---|---|
| Document classes | 6 (fiche, Autorisation, Mandat, Certificat, PlanMasse, PlanSituation) |
| Fields extracted | 13 (Reference_Urbanisme, DLPI, nb_log_totale, Disposition_Mandat, …) |
| Training set (de-duped, leakage-free) | 754 annotated pages → 528 train / 114 val / 112 test |
| Classifier accuracy (val) | ~ 95 % |
| Extractor macro span-F1 (val, honest) | **0.62** — Reference_Urbanisme 0.77, Email 1.00, nb_log_totale 0.82 |
| Audited demandes (real Orange data) | 11 ZIPs → 7 auto-complète, 3 justifiably-incomplète, 1 hors-périmètre |
| Test suite | **171 passing** unit + integration tests (`pytest -q`, ~25 s) |
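
For readers unfamiliar with the extractor metric, here is a minimal sketch of one common definition of macro span-F1: a predicted span counts only if its label and both boundaries match the gold span exactly, and per-field F1 scores are averaged with equal weight. This is an assumption about the definition; `scripts/05_evaluate.py` may compute it differently (for example through a library), and the function names are hypothetical.

```python
from collections import defaultdict

Span = tuple[str, int, int]  # (field label, start token, end token exclusive)


def bio_to_spans(tags: list[str]) -> set[Span]:
    """Collect (label, start, end) spans from a BIO tag sequence."""
    spans: set[Span] = set()
    label, start = None, 0
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and tag[2:] == label
        if label is not None and not continues:
            spans.add((label, start, i))
            label = None
        if tag.startswith("B-"):
            label, start = tag[2:], i
    return spans


def macro_span_f1(gold: list[list[str]], pred: list[list[str]]) -> float:
    """Average the exact-match span F1 over every field seen in gold or pred."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
        for label, *_ in g & p:
            tp[label] += 1
        for label, *_ in p - g:
            fp[label] += 1
        for label, *_ in g - p:
            fn[label] += 1
    labels = set(tp) | set(fp) | set(fn)
    scores = [2 * tp[lab] / (2 * tp[lab] + fp[lab] + fn[lab]) for lab in labels]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition a field that is rarely extracted correctly pulls the macro average down as much as a frequent one, which is why a single weak field can hold the headline number well below the best per-field scores.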
---
## Repository layout
```
GuichetOI_ML/
├── src/guichetoi/                  The library (importable as `guichetoi.…`)
│   ├── inference.py                GuichetOIPipeline + post-processing
│   ├── recommendation.py           AGILIS rule engine + AR-mail rendering
│   ├── cms.py                      Fills the CMS IMMO 9 BANBOU xlsx
│   └── api/main.py                 FastAPI service (Spring Boot / Angular ready)
├── apps/
│   └── streamlit_demo.py           One-page demo UI (Orange-branded)
├── scripts/                        Training pipeline + batch CLIs
│   ├── 01_convert_labelstudio.py
│   ├── 02_train_classifier.py
│   ├── 03_train_extractor.py
│   ├── 05_evaluate.py
│   ├── ocr_rasterise.py
│   ├── batch_process_dataref.py
│   └── resplit.py · label.py
├── tools/                          Dev / debug one-offs
├── tests/                          171 pytest unit/integration tests
├── docs/
│   ├── DEMO_SCRIPT.md              Voiceover script for the recorded demo
│   └── LOGEMENT_IMPROVEMENTS.md
├── assets/
│   ├── orange_logo.png             Brand mark used by the demo
│   ├── cms_template.xlsx           Official CMS template
│   └── label_mappings.json         6 doc classes + 13 field labels (training output)
├── models/                         LayoutLMv3 weights (gitignored)
│   ├── classifier/                 Fine-tuned doc-class model
│   ├── extractor_v3/               Field extractor (current production)
│   └── extractor_v3_backup_v2/     Previous training run (kept for rollback)
├── .github/workflows/ci.yml        Ruff + mypy + pytest on every PR
├── outputs/                        Generated verdicts + CMS files (gitignored)
├── Dockerfile · .dockerignore      Production container image
├── pyproject.toml                  Installable package metadata
├── requirements.txt                Pinned dependencies (Dockerfile + CI)
├── Makefile                        Common dev shortcuts (test, demo, api, docker, …)
├── pytest.ini · mypy.ini           Test + type-check config
└── CONTRIBUTING.md                 Branch strategy, setup, sensitive-data rules
```
---
## Setup
### Prerequisites
- **Python 3.14** (tested) — likely works on 3.11+
- **Tesseract OCR** with the French language pack
  - Windows: download from [https://github.com/UB-Mannheim/tesseract/wiki](https://github.com/UB-Mannheim/tesseract/wiki)
  - During install, tick "Additional language data" → French
- **8 GB+ RAM** (model loading); CPU works, but a GPU is strongly recommended for retraining
### Install
```powershell
python -m venv .venv
.venv\Scripts\activate
pip install -e ".[dev,ui]" # installs the guichetoi package; quoting the extras is safe in every shell
pip install -r requirements.txt
```
### Verify
```powershell
python -m pytest -q # should print: 171 passed in ~25 s
```
### Common dev commands ([Makefile](Makefile))
```bash
make help # list all targets
make test # full pytest suite (171 tests)
make test-fast # cms tests only (no model load, < 2 s)
make demo # streamlit run apps/streamlit_demo.py
make api # uvicorn guichetoi.api.main:app on :8000
make docker # docker build -t guichetoi-ml .
make lint # ruff + mypy
make clean # remove caches and temp outputs
```
On Windows without `make`, run the command on the right of each `:` line in `Makefile` directly.
---
## Live demo (Hugging Face Space)
The FastAPI service is deployed at **[azizmiladi-fibergate.hf.space](https://azizmiladi-fibergate.hf.space)**.
- The **App** tab opens the interactive Swagger UI (`/docs`) — try `POST /analyze` directly in the browser.
- Models are downloaded from HF Hub on first boot (not baked into the image); cold-start takes ~2 min on HF's free CPU tier.
---
## Run the demo locally
```powershell
streamlit run apps/streamlit_demo.py
```
A browser tab opens at `http://localhost:8501`.
**For a quick demo**: click any **🎬 Échantillon de démonstration** button — results are pre-computed and appear instantly (~1 s).
**For a live analysis**: drop a ZIP of a real demande into the file uploader. CPU inference takes ~5-15 s per document.
See [docs/DEMO_SCRIPT.md](docs/DEMO_SCRIPT.md) for a 3-5 minute presentation script with timing and key talking points.
---
## CLI usage
### Analyse one document
```powershell
python -m guichetoi.inference --image path/to/doc.pdf
# → prints classification + extracted fields, saves JSON to outputs/
```
### Analyse a complete demande (folder)
```powershell
python -m guichetoi.recommendation --folder path/to/demande/
# → produces outputs//verdict.json + ar_mail.txt
```
### Use as a Python library
```python
from guichetoi.recommendation import RecommendationEngine
engine = RecommendationEngine() # loads model once
verdict = engine.evaluate_folder("path/to/demande/")
print(verdict.status) # "complète" / "incomplète" / "hors-périmètre"
```
### Run as an HTTP service (for Spring Boot / Angular)
```powershell
uvicorn guichetoi.api.main:app --host 0.0.0.0 --port 8000
# or: docker build -t guichetoi-ml . && docker run -p 8000:8000 guichetoi-ml
```
Endpoints: `POST /analyze`, `POST /cms`, `POST /cms/preview`, `GET /metadata`, `GET /health`.
`GET /` redirects to `/docs` (Swagger UI).
OpenAPI spec at `/openapi.json` (consume with `openapi-generator` for a typed Spring `WebClient`).
---
## Deployment
### Hugging Face Space (public demo)
The canonical live environment is the HF Space **azizmiladi/fibergate** (Docker SDK, port 8000).
- Models are **not baked into the image** — the container downloads them from HF Hub on first boot using `HF_TOKEN` (set as a Space secret).
- `GET /` → 302 → `/docs` so the HF "App" tab shows the Swagger UI immediately.
- Free CPU tier: ~2 min cold-start, ~10–30 s per analyze call.
### Render (production / Spring Boot integration)
Topology: **Angular (Static Site) → Spring Boot (Web Service) → guichetoi-ml (Private Service)**.
The Python service has no public URL; only your Spring Boot can reach it.
**Why a Pro plan is required**: LayoutLMv3 holds ~1.2 GB resident. Free/Starter (512 MB) crashes at model load. Pro = 2 GB = minimum viable.
#### One-time setup
1. **Create a GitHub PAT** with `write:packages` and `read:packages` scopes
(Settings → Developer settings → Personal access tokens → Tokens (classic); these package scopes are offered on classic tokens, not on fine-grained ones).
2. **Log in to GHCR locally**:
```powershell
echo $env:GHCR_PAT | docker login ghcr.io -u medaziz012 --password-stdin
```
3. **In Render dashboard** → Settings → Registry Credentials → add one named `ghcr` (username = GitHub username, password = same PAT).
#### Each deploy
```powershell
make release # builds locally and pushes to GHCR
```
Render auto-pulls `:latest` and redeploys (`autoDeploy: true` in [render.yaml](render.yaml)). First boot takes ~30 s; Render's health check polls `/health` until `pipeline_loaded: true`.
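
A client that starts alongside the service can apply the same readiness rule. The sketch below polls `/health` until `pipeline_loaded` is true; that field name comes from the paragraph above, while the rest of the response shape, the function names and the retry parameters are assumptions.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str) -> dict:
    """GET <base_url>/health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as response:
        return json.load(response)


def wait_until_ready(
    base_url: str,
    fetch: Callable[[str], dict] = fetch_health,
    attempts: int = 30,
    delay_s: float = 2.0,
) -> bool:
    """Poll /health until it reports pipeline_loaded == true, or give up."""
    for _ in range(attempts):
        try:
            if fetch(base_url).get("pipeline_loaded") is True:
                return True
        except OSError:  # connection refused, timeout, DNS failure, HTTP error
            pass
        time.sleep(delay_s)
    return False
```

For a local run, `wait_until_ready("http://localhost:8000")` would block for up to about a minute with the default settings. The `fetch` parameter exists only so the loop can be tested without a network.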
#### Spring Boot configuration
```yaml
guichetoi:
  ml:
    base-url: http://guichetoi-ml:10000
```
No CORS, no public exposure, no separate auth — Spring Boot is the only client.
---
## Retraining
```powershell
# 1. Annotate new documents in Label Studio, export JSON
# 2. Convert to training format
python scripts/01_convert_labelstudio.py path/to/export.json
# 3. Train (writes to models/extractor_v3/)
python scripts/03_train_extractor.py
# 4. Evaluate on the held-out test split
python scripts/05_evaluate.py
```
Training the extractor takes ~6 hours on CPU, ~30 min on a single GPU.
**Move old checkpoints first**: HuggingFace Trainer's `save_total_limit=3` rotates by step number, not date — leaving old checkpoints in place silently keeps the *old* model.
```powershell
mv models/extractor_v3/checkpoint-* models/extractor_v3_backup_v2/
```
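
The same move can be scripted in Python for use outside PowerShell. This is a sketch only: `archive_checkpoints` is a hypothetical helper, not part of the repository, and refusing to overwrite an existing backup is a design choice made here.

```python
import shutil
from pathlib import Path


def archive_checkpoints(run_dir: Path, backup_dir: Path) -> list[Path]:
    """Move every checkpoint-* entry from run_dir into backup_dir."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for checkpoint in sorted(run_dir.glob("checkpoint-*")):
        target = backup_dir / checkpoint.name
        if target.exists():
            raise FileExistsError(f"{target} already exists; refusing to overwrite")
        shutil.move(str(checkpoint), str(target))
        moved.append(target)
    return moved
```

A call such as `archive_checkpoints(Path("models/extractor_v3"), Path("models/extractor_v3_backup_v2"))` mirrors the command above and leaves non-checkpoint files in place.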
---
## Architecture highlights
### Hybrid Transformer + rules
Pure LayoutLMv3 (a multimodal document transformer) extraction was unreliable on this small dataset (528 training examples, noisy OCR on form-cell digits). Wrapping the transformer with **regex post-processing + per-class field allowlists + OCR-tolerant cross-checks** turned a "mostly works" prototype into a system whose verdicts can be trusted at the demande level — even when individual field confidences are low.
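
To make the per-class allowlist idea concrete, here is a minimal sketch. The field and class names appear elsewhere in this README, but which class is allowed to carry which field is an assumption made for illustration, as are the names `FIELD_ALLOWLIST` and `apply_allowlist`.

```python
# Hypothetical mapping: which fields each document class may contribute.
FIELD_ALLOWLIST: dict[str, frozenset[str]] = {
    "fiche": frozenset({"Reference_Urbanisme", "DLPI", "nb_log_totale", "Email"}),
    "Autorisation": frozenset({"Reference_Urbanisme"}),
    "Mandat": frozenset({"Disposition_Mandat"}),
}


def apply_allowlist(doc_class: str, fields: dict[str, str]) -> dict[str, str]:
    """Drop extracted fields that the predicted document class should not carry."""
    allowed = FIELD_ALLOWLIST.get(doc_class, frozenset())
    return {name: value for name, value in fields.items() if name in allowed}
```

With such a filter, a stray date predicted on a mandat page is discarded before it reaches the rule engine, regardless of the model's confidence in it.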
### Six engine adjustments derived from real-data audit
An 11-demande audit on production-shaped ZIPs surfaced systemic failure modes that the test scores didn't reveal. Each was addressed with a targeted fix (all locked in by regression tests):
- **Stricter `_RE_REFURB`** — rejects "rue Abbé" / "Parcelle" false positives from the `RU`/`PA` prefixes.
- **Tri-state `_autorisation_matches`** — distinguishes "different ref" (incohérent) from "no ref readable" (manual review).
- **Out-of-scope filename detection** — `PV-Loc-PAR`, `Plan-et-ou-photo`, `Autre_*` files no longer satisfy class requirements.
- **Recolement short-circuit** — dossiers de récolement get `hors-périmètre` status + dedicated AR mail.
- **Filename hints broadened** — `ARRETE PC.jpg`, `CERTIFICAT ADRESSAGE.jpg`, `Mandat_PAR-1-1.pdf` all match now.
- **Strict mandat checkbox scorer** — `!` and `si` no longer count as marked boxes; ambiguous cases fall through to manual review instead of false OUI.
### Test suite (171 tests, ~25 s)
| File | Tests | Coverage |
|---|---|---|
| `tests/test_cms_generator.py` | 67 | All derivations + 4 end-to-end fill_cms scenarios |
| `tests/test_recommendation_engine.py` | 50 | Rule helpers + verdict logic on synthetic Documents |
| `tests/test_inference_postprocess.py` | 54 | Regex constants + mandat detector + cleaner |
Every bug found during development has a regression test, so running the suite replaces ad-hoc "I checked it manually" verification.
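
A regression test in this spirit might look like the sketch below, using the strict checkbox rule as the example. `is_marked` and `MARKED_TOKENS` are hypothetical stand-ins for the real scorer in `guichetoi.inference`, and the set of accepted tick glyphs is an assumption. Only the rejection of `!` and `si` comes from the audit notes in this README.

```python
# Assumed set of glyphs that count as a ticked box; the real scorer may differ.
MARKED_TOKENS = frozenset({"x", "☒", "☑", "✓", "✔"})


def is_marked(token: str) -> bool:
    """Strict check: only an exact, known tick glyph counts as marked."""
    return token.strip().lower() in MARKED_TOKENS


def test_noise_tokens_are_not_marked_boxes():
    # Regression: OCR noise once read as a ticked box must stay rejected.
    for noise in ("!", "si", ""):
        assert not is_marked(noise)


def test_clear_tick_is_marked():
    assert is_marked("X")
    assert is_marked(" ☒ ")
```

Tests of this shape need no model weights, so they can run quickly alongside the rest of the suite.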
---
## Limits & known gaps
- **Handwritten / small-font form-cell digits** drop Tesseract confidence below `MIN_CONF=30`, so `Nb_log_pro` and `Nb_log_res` reach only macro-F1 ≈ 0.25. Regex backstops mitigate this where possible; otherwise the field falls through to "manual completion".
- **No live re-extraction after filename override** — when the model picks PlanMasse with 65% confidence and we override to Autorisation, we don't re-run extraction on the override target. The CMS gets the right class but no fields; consultant fills them in.
- **XY coordinates (Géoréso) and Mondofi ref** are always manual — explicitly listed in the CMS download's "À compléter manuellement" panel.
- **Single-page PDFs assumed** for several extraction shortcuts — multi-page docs work but only the first page drives classification.
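
The first limit above follows directly from how a confidence threshold behaves. The sketch below filters OCR words using the dictionary layout that `pytesseract.image_to_data(..., output_type=Output.DICT)` returns (parallel `text` and `conf` lists). The function name is hypothetical and this is not the pipeline's actual OCR code.

```python
MIN_CONF = 30  # threshold quoted in the limits list and the architecture diagram


def confident_words(ocr: dict[str, list], min_conf: float = MIN_CONF) -> list[str]:
    """Keep non-empty OCR words whose confidence reaches the threshold."""
    kept = []
    for text, conf in zip(ocr["text"], ocr["conf"]):
        # Tesseract reports -1 for layout rows that carry no word.
        if text.strip() and float(conf) >= min_conf:
            kept.append(text)
    return kept
```

A digit read with confidence 18 never reaches the extractor, so the field stays empty no matter how good the model is. This is why these fields rely on regex backstops or manual completion.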
---
## Author
Mohamed Aziz Miladi — AMARIS internship project (Guichet Accueil Infrastructures).