	# FiberGate — Document Analysis Pipeline for Orange's PAR Localisation Workflow

	Automated processing of demandes de localisation du Point d'Accès au Réseau (PAR)
	for the Orange "Guichet Accueil Infrastructures" team. Given a folder (or ZIP) of
	documents submitted by a bureau d'études, the system:

	1. classifies each document (fiche / autorisation / mandat / plan de masse / plan de situation / certificat),
	2. extracts 13 business fields with a fine-tuned LayoutLMv3 model,
	3. applies the AGILIS rule set to issue a verdict on the demande's completeness (complète / incomplète / hors-périmètre),
	4. pre-fills the CMS IMMO 9 BANBOU Excel template with the derived values,
	5. drafts the AR mail ready to paste into MSURVEY.

	A polished Streamlit demo wraps the whole pipeline with one-click sample loaders for presentation.

	---

	## Architecture

	```mermaid
	flowchart LR
	classDef transformer fill:#1e3a8a,stroke:#93c5fd,stroke-width:2px,color:#e0f2fe
	classDef rule fill:#1c1917,stroke:#fb923c,stroke-width:2px,color:#ffedd5
	classDef output fill:#14532d,stroke:#4ade80,stroke-width:2px,color:#dcfce7
	classDef io fill:#312e81,stroke:#a5b4fc,stroke-width:2px,color:#e0e7ff
	classDef decision fill:#7c2d12,stroke:#fb923c,stroke-width:2px,color:#ffedd5

	ZIP(["📁 ZIP / loose files"]):::io

	subgraph PIPE["🔬 Transformer Pipeline · guichetoi.inference"]
	direction TB
	OCR["🔍 OCR<br/>Tesseract · fra<br/>conf ≥ 30"]:::transformer
	CLS["🧠 Classifier<br/>LayoutLMv3<br/>6 document classes"]:::transformer
	EXT["🧠 Extractor<br/>LayoutLMv3 BIO<br/>13 business fields"]:::transformer
	POST["⚙️ Post-processing<br/>regex cleaners<br/>per-class allowlist<br/>mandat checkbox scorer"]:::transformer
	OCR --> CLS --> EXT --> POST
	end

	subgraph RULES["📋 Rule Engine · guichetoi.recommendation"]
	direction TB
	FNHINT["🏷️ Filename hints<br/>PlanSituation · PlanMasse<br/>ARRETE PC · ADRESSAGE"]:::rule
	OOS["🚫 Out-of-scope filter<br/>PV-Loc-PAR · Autre_*<br/>Plan-et-ou-photo"]:::rule
	RECOL{"♻️ Récolement?"}:::decision
	AGILIS["📐 AGILIS rules<br/>R1 – R5<br/>champs obligatoires fiche"]:::rule
	REFMATCH["🔗 Reference cross-check<br/>fiche ↔ autorisation<br/>Levenshtein-tolerant"]:::rule
	FNHINT --> OOS --> RECOL
	RECOL -- "non" --> AGILIS --> REFMATCH
	end

	subgraph OUT["📤 Outputs"]
	direction TB
	VERDICT["✅ Verdict<br/>complète / incomplète<br/>hors-périmètre"]:::output
	ARMAIL["📨 Brouillon AR<br/>ready to paste<br/>into MSURVEY"]:::output
	CMS["📊 CMS IMMO 9<br/>BANBOU pre-filled<br/>xlsx template"]:::output
	end

	UI(["🌐 FastAPI · /docs<br/>azizmiladi-fibergate.hf.space"]):::io

	ZIP --> PIPE
	PIPE --> RULES
	RECOL -- "oui" --> VERDICT
	REFMATCH --> VERDICT
	VERDICT --> ARMAIL
	VERDICT --> CMS
	OUT --> UI
	```

	Two-tier design: the transformer handles perception (what kind of document it is, where the data is), while rules handle business logic (what makes a demande complete, how to fill the CMS). Each layer can be tested and fixed independently, and per-field validators plus OCR-tolerant cross-checks limit how often an extraction error turns into a wrong verdict.
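
The rule-engine half of the diagram can be read as a short decision function. The sketch below is illustrative only: `Doc`, `is_out_of_scope`, `decide` and the `REQUIRED_CLASSES` set are hypothetical names and values chosen for the example, not the actual `guichetoi.recommendation` API. Only the filename patterns and the three verdict labels come from this README, and the real AGILIS rules (R1 to R5) check more than class presence.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase

# Filename patterns named in the diagram's out-of-scope filter.
OUT_OF_SCOPE_PATTERNS = ("*PV-Loc-PAR*", "Autre_*", "*Plan-et-ou-photo*")

# Hypothetical: classes a demande must contain to count as complete.
REQUIRED_CLASSES = frozenset({"fiche", "Autorisation", "Mandat"})


@dataclass(frozen=True)
class Doc:
    filename: str
    doc_class: str  # label predicted by the classifier


def is_out_of_scope(filename: str) -> bool:
    """True when the filename matches one of the out-of-scope patterns."""
    return any(fnmatchcase(filename, pattern) for pattern in OUT_OF_SCOPE_PATTERNS)


def decide(docs: list[Doc], is_recolement: bool) -> str:
    """Return one of the three verdict labels for a demande."""
    if is_recolement:
        # Short-circuit: récolement dossiers never reach the completeness rules.
        return "hors-périmètre"
    in_scope = [doc for doc in docs if not is_out_of_scope(doc.filename)]
    present = {doc.doc_class for doc in in_scope}
    return "complète" if REQUIRED_CLASSES <= present else "incomplète"
```

Keeping the filter and the short-circuit ahead of the completeness check mirrors the diagram: an out-of-scope file cannot satisfy a class requirement, whatever the classifier predicted for it.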

	---

	## Headline numbers

	| Metric | Value |
	|---|---|
	| Document classes | 6 (fiche, Autorisation, Mandat, Certificat, PlanMasse, PlanSituation) |
	| Fields extracted | 13 (Reference_Urbanisme, DLPI, nb_log_totale, Disposition_Mandat, …) |
	| Training set (de-duped, leakage-free) | 754 annotated pages → 528 train / 114 val / 112 test |
	| Classifier accuracy (val) | ~ 95 % |
	| Extractor macro span-F1 (val, honest) | 0.62 (Reference_Urbanisme 0.77, Email 1.00, nb_log_totale 0.82) |
	| Audited demandes (real Orange data) | 11 ZIPs → 7 auto-complète, 3 justifiably-incomplète, 1 hors-périmètre |
	| Test suite | 171 passing unit + integration tests (`pytest -q`, ~25 s) |

	---

	## Repository layout

	```
	GuichetOI_ML/
	├── src/guichetoi/                  The library (importable as `guichetoi.…`)
	│   ├── inference.py                GuichetOIPipeline + post-processing
	│   ├── recommendation.py           AGILIS rule engine + AR-mail rendering
	│   ├── cms.py                      Fills the CMS IMMO 9 BANBOU xlsx
	│   └── api/main.py                 FastAPI service (Spring Boot / Angular ready)
	├── apps/
	│   └── streamlit_demo.py           One-page demo UI (Orange-branded)
	├── scripts/                        Training pipeline + batch CLIs
	│   ├── 01_convert_labelstudio.py
	│   ├── 02_train_classifier.py
	│   ├── 03_train_extractor.py
	│   ├── 05_evaluate.py
	│   ├── ocr_rasterise.py
	│   ├── batch_process_dataref.py
	│   └── resplit.py · label.py
	├── tools/                          Dev / debug one-offs
	├── tests/                          171 pytest unit/integration tests
	├── docs/
	│   ├── DEMO_SCRIPT.md              Voiceover script for the recorded demo
	│   └── LOGEMENT_IMPROVEMENTS.md
	├── assets/
	│   ├── orange_logo.png             Brand mark used by the demo
	│   ├── cms_template.xlsx           Official CMS template
	│   └── label_mappings.json         6 doc classes + 13 field labels (training output)
	├── models/                         Gitignored: LayoutLMv3 weights
	│   ├── classifier/                 Fine-tuned doc-class model
	│   ├── extractor_v3/               Field extractor (current production)
	│   └── extractor_v3_backup_v2/     Previous training run (kept for rollback)
	├── .github/workflows/ci.yml        Ruff + mypy + pytest on every PR
	├── outputs/                        Generated verdicts + CMS files (gitignored)
	├── Dockerfile · .dockerignore      Production container image
	├── pyproject.toml                  Installable package metadata
	├── requirements.txt                Pinned dependencies (Dockerfile + CI)
	├── Makefile                        Common dev shortcuts (test, demo, api, docker, …)
	├── pytest.ini · mypy.ini           Test + type-check config
	└── CONTRIBUTING.md                 Branch strategy, setup, sensitive-data rules
	```

	---

	## Setup

	### Prerequisites

	- Python 3.14 (tested) — likely works on 3.11+
	- Tesseract OCR with the French language pack
	  - Windows: download from [https://github.com/UB-Mannheim/tesseract/wiki](https://github.com/UB-Mannheim/tesseract/wiki)
	  - During install, tick "Additional language data" → French
	- 8 GB+ RAM for model loading. CPU is enough to run the pipeline, but a GPU is strongly recommended for retraining.

	### Install

	```powershell
	python -m venv .venv
	.venv\Scripts\activate
	pip install -e ".[dev,ui]" # installs the guichetoi package (quotes keep the shell from parsing the brackets)
	pip install -r requirements.txt
	```

	### Verify

	```powershell
	python -m pytest -q # should print: 171 passed in ~25 s
	```

	### Common dev commands ([Makefile](Makefile))

	```bash
	make help # list all targets
	make test # full pytest suite (171 tests)
	make test-fast # cms tests only (no model load, < 2 s)
	make demo # streamlit run apps/streamlit_demo.py
	make api # uvicorn guichetoi.api.main:app on :8000
	make docker # docker build -t guichetoi-ml .
	make lint # ruff + mypy
	make clean # remove caches and temp outputs
	```

	On Windows without `make`, open `Makefile`, find the target you need, and run the commands listed under it directly.

	---

	## Live demo (Hugging Face Space)

	The FastAPI service is deployed at [azizmiladi-fibergate.hf.space](https://azizmiladi-fibergate.hf.space).

	- The App tab opens the interactive Swagger UI (`/docs`) — try `POST /analyze` directly in the browser.
	- Models are downloaded from HF Hub on first boot (not baked into the image); cold-start takes ~2 min on HF's free CPU tier.

	---

	## Run the demo locally

	```powershell
	streamlit run apps/streamlit_demo.py
	```

	A browser tab opens at `http://localhost:8501`.

	For a quick demo: click any 🎬 Échantillon de démonstration button — results are pre-computed and appear instantly (~1 s).

	For a live analysis: drop a ZIP of a real demande into the file uploader. CPU inference takes ~5-15 s per document.

	See [docs/DEMO_SCRIPT.md](docs/DEMO_SCRIPT.md) for a 3-5 minute presentation script with timing and key talking points.

	---

	## CLI usage

	### Analyse one document
	```powershell
	python -m guichetoi.inference --image path/to/doc.pdf
	# → prints classification + extracted fields, saves JSON to outputs/
	```

	### Analyse a complete demande (folder)
	```powershell
	python -m guichetoi.recommendation --folder path/to/demande/
	# → produces outputs/<demande>/verdict.json + ar_mail.txt
	```

	### Use as a Python library
	```python
	from guichetoi.recommendation import RecommendationEngine

	engine = RecommendationEngine() # loads model once
	verdict = engine.evaluate_folder("path/to/demande/")
	print(verdict.status) # "complète" / "incomplète" / "hors-périmètre"
	```

	### Run as an HTTP service (for Spring Boot / Angular)
	```powershell
	uvicorn guichetoi.api.main:app --host 0.0.0.0 --port 8000
	# or: docker build -t guichetoi-ml . && docker run -p 8000:8000 guichetoi-ml
	```
	Endpoints: `POST /analyze`, `POST /cms`, `POST /cms/preview`, `GET /metadata`, `GET /health`.
	`GET /` redirects to `/docs` (Swagger UI).
	OpenAPI spec at `/openapi.json` (consume with `openapi-generator` for a typed Spring `WebClient`).

	---

	## Deployment

	### Hugging Face Space (public demo)

	The canonical live environment is the HF Space azizmiladi/fibergate (Docker SDK, port 8000).

	- Models are not baked into the image — the container downloads them from HF Hub on first boot using `HF_TOKEN` (set as a Space secret).
	- `GET /` → 302 → `/docs` so the HF "App" tab shows the Swagger UI immediately.
	- Free CPU tier: ~2 min cold-start, ~10–30 s per analyze call.

	### Render (production / Spring Boot integration)

	Topology: Angular (Static Site) → Spring Boot (Web Service) → guichetoi-ml (Private Service).
	The Python service has no public URL; only your Spring Boot can reach it.

	Why a Pro plan is required: the LayoutLMv3 models hold ~1.2 GB resident, so Free/Starter instances (512 MB) crash at model load. Pro (2 GB) is the smallest plan that fits.

	#### One-time setup

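	1. Create a GitHub personal access token (classic) with the `write:packages` and `read:packages` scopes
	(Settings → Developer settings → Personal access tokens → Tokens (classic)). These scopes exist only on classic tokens; GitHub Packages does not accept fine-grained tokens.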
	2. Log in to GHCR locally:
	```powershell
	echo $env:GHCR_PAT | docker login ghcr.io -u medaziz012 --password-stdin
	```
	3. In Render dashboard → Settings → Registry Credentials → add one named `ghcr` (username = GitHub username, password = same PAT).

	#### Each deploy

	```powershell
	make release # builds locally and pushes to GHCR
	```

	Render auto-pulls `:latest` and redeploys (`autoDeploy: true` in [render.yaml](render.yaml)). First boot takes ~30 s; Render's health check polls `/health` until `pipeline_loaded: true`.
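
A client that calls the service right after a deploy can apply the same readiness check. The following is a minimal sketch using only the standard library; it assumes `/health` returns a JSON object with a top-level boolean `pipeline_loaded`, as the health-check description above suggests. `fetch_health` and `wait_until_ready` are hypothetical helper names and are not part of `guichetoi`.

```python
import json
import time
from typing import Callable
from urllib.request import urlopen


def fetch_health(base_url: str) -> dict:
    """GET <base_url>/health and decode the JSON body."""
    with urlopen(f"{base_url}/health", timeout=10) as response:
        return json.load(response)


def wait_until_ready(
    fetch: Callable[[], dict],
    attempts: int = 30,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the payload reports pipeline_loaded, or attempts run out."""
    for attempt in range(attempts):
        try:
            # Assumed payload shape: {"pipeline_loaded": true, ...}
            if fetch().get("pipeline_loaded") is True:
                return True
        except OSError:
            pass  # connection refused or timed out: service not up yet
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

For a local run this would be called as `wait_until_ready(lambda: fetch_health("http://localhost:8000"))`. Passing `fetch` and `sleep` as arguments keeps the loop testable without a running server.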

	#### Spring Boot configuration

	```yaml
	guichetoi:
	  ml:
	    base-url: http://guichetoi-ml:10000
	```

	No CORS, no public exposure, no separate auth — Spring Boot is the only client.

	---

	## Retraining

	```powershell
	# 1. Annotate new documents in Label Studio, export JSON
	# 2. Convert to training format
	python scripts/01_convert_labelstudio.py path/to/export.json

	# 3. Train (writes to models/extractor_v3/)
	python scripts/03_train_extractor.py

	# 4. Evaluate on the held-out test split
	python scripts/05_evaluate.py
	```

	Training the extractor takes ~6 hours on CPU, ~30 min on a single GPU.
	Move old checkpoints out of the output directory before retraining: Hugging Face Trainer's `save_total_limit=3` rotates checkpoints by step number, not by date, so checkpoints left over from a previous run can survive the rotation and the old model is silently kept.

	```powershell
	mv models/extractor_v3/checkpoint-* models/extractor_v3_backup_v2/
	```
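
The same housekeeping can be scripted so it behaves the same way on every platform. This is a sketch under assumptions: `archive_checkpoints` is a hypothetical helper, not one of the repository's scripts, and it relies only on the `checkpoint-<step>` directory naming that the Trainer uses.

```python
import re
import shutil
from pathlib import Path

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def archive_checkpoints(model_dir: Path, backup_dir: Path) -> list[str]:
    """Move every checkpoint-<step> directory from model_dir to backup_dir.

    Returns the moved directory names, ordered by step number.
    """
    found = []
    for child in model_dir.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))

    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for _, path in sorted(found, key=lambda item: item[0]):
        target = backup_dir / path.name
        if target.exists():
            # Refuse to nest or overwrite an earlier backup with the same step.
            raise FileExistsError(f"{target} already exists")
        shutil.move(str(path), str(target))
        moved.append(path.name)
    return moved
```

A call such as `archive_checkpoints(Path("models/extractor_v3"), Path("models/extractor_v3_backup_v2"))` mirrors the `mv` command above and leaves non-checkpoint files in place.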

	---

	## Architecture highlights

	### Hybrid Transformer + rules

	Pure LayoutLMv3 (a multimodal document transformer) extraction was unreliable on this small dataset (528 training examples, noisy OCR on form-cell digits). Wrapping the transformer with regex post-processing + per-class field allowlists + OCR-tolerant cross-checks turned a "mostly works" prototype into a system whose verdicts can be trusted at the demande level — even when individual field confidences are low.
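
To illustrate the per-class field allowlist idea, here is a minimal sketch. The mapping below is hypothetical: the field names appear in this README, but which class may carry which field is an assumption made for the example, and `apply_allowlist` is not the library's actual function.

```python
# Hypothetical allowlist: which fields each document class may contribute.
FIELD_ALLOWLIST: dict[str, frozenset[str]] = {
    "fiche": frozenset({"Reference_Urbanisme", "DLPI", "nb_log_totale", "Email"}),
    "Autorisation": frozenset({"Reference_Urbanisme"}),
    "Mandat": frozenset({"Disposition_Mandat"}),
}


def apply_allowlist(doc_class: str, fields: dict[str, str]) -> dict[str, str]:
    """Drop extracted fields that the predicted class is not allowed to carry."""
    allowed = FIELD_ALLOWLIST.get(doc_class, frozenset())
    return {name: value for name, value in fields.items() if name in allowed}
```

With this filter, a spurious `nb_log_totale` span predicted on an Autorisation page is discarded before it reaches the rules, which is one way a low-confidence extraction can be kept out of the verdict.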

	### Six engine adjustments derived from real-data audit

	An 11-demande audit on production-shaped ZIPs surfaced systemic failure modes that the test scores didn't reveal. Each was addressed with a targeted fix (all locked in by regression tests):

	- Stricter `_RE_REFURB`: rejects "rue Abbé" / "Parcelle" false positives from the `RU`/`PA` prefixes.
	- Tri-state `_autorisation_matches`: distinguishes "different ref" (incohérent) from "no ref readable" (manual review).
	- Out-of-scope filename detection: `PV-Loc-PAR`, `Plan-et-ou-photo`, `Autre_*` files no longer satisfy class requirements.
	- Récolement short-circuit: dossiers de récolement get `hors-périmètre` status and a dedicated AR mail.
	- Filename hints broadened: `ARRETE PC.jpg`, `CERTIFICAT ADRESSAGE.jpg`, `Mandat_PAR-1-1.pdf` all match now.
	- Strict mandat checkbox scorer: `!` and `si` no longer count as marked boxes; ambiguous cases fall through to manual review instead of a false OUI.
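
The tri-state cross-check can be sketched as follows. This is not the body of `_autorisation_matches`: `RefMatch`, `compare_refs`, the normalisation step and the tolerance of 2 edits are assumptions chosen to show the idea of separating "mismatch" from "unreadable" under a Levenshtein tolerance.

```python
from enum import Enum
from typing import Optional


class RefMatch(Enum):
    MATCH = "match"
    MISMATCH = "incohérent"
    UNREADABLE = "manual review"


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalise(ref: str) -> str:
    """Upper-case and keep only letters and digits, so spacing is ignored."""
    return "".join(ch for ch in ref.upper() if ch.isalnum())


def compare_refs(fiche_ref: Optional[str], autorisation_ref: Optional[str],
                 tolerance: int = 2) -> RefMatch:
    """Compare two references; the tolerance value is an assumed default."""
    a = normalise(fiche_ref or "")
    b = normalise(autorisation_ref or "")
    if not a or not b:
        return RefMatch.UNREADABLE  # nothing to compare: send to manual review
    if edit_distance(a, b) <= tolerance:
        return RefMatch.MATCH
    return RefMatch.MISMATCH
```

Returning a third state instead of a boolean keeps an OCR failure from being reported to the bureau d'études as an inconsistency in their dossier.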

	### Test suite (171 tests, ~25 s)

	| File | Tests | Coverage |
	|---|---|---|
	| `tests/test_cms_generator.py` | 67 | All derivations + 4 end-to-end fill_cms scenarios |
	| `tests/test_recommendation_engine.py` | 50 | Rule helpers + verdict logic on synthetic Documents |
	| `tests/test_inference_postprocess.py` | 54 | Regex constants + mandat detector + cleaner |

	Every bug fixed during development has a regression test, so running the suite replaces "I checked it manually".

	---

	## Limits & known gaps

	- Handwritten or small-font form-cell digits push Tesseract confidence below `MIN_CONF=30`, so `Nb_log_pro` and `Nb_log_res` reach a macro-F1 of only ≈ 0.25. Regex backstops recover some values; the rest fall through to "manual completion".
	- No live re-extraction after a filename override: when the model picks PlanMasse with 65% confidence and the filename hint overrides it to Autorisation, extraction is not re-run for the new class. The CMS gets the right class but no fields, and the consultant fills them in.
	- XY coordinates (Géoréso) and the Mondofi ref are always manual: both are listed explicitly in the CMS download's "À compléter manuellement" panel.
	- Several extraction shortcuts assume single-page PDFs: multi-page docs work, but only the first page drives classification.
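
The confidence cut-off in the first item can be pictured with a small filter over Tesseract's word table. This is a sketch, not the code in `guichetoi.inference`: `keep_confident_words` is a hypothetical name, and the input is assumed to have the parallel-list layout that `pytesseract.image_to_data(..., output_type=Output.DICT)` returns.

```python
MIN_CONF = 30  # threshold quoted in this README


def keep_confident_words(ocr: dict[str, list], min_conf: float = MIN_CONF) -> list[dict]:
    """Keep non-empty words whose OCR confidence is at least min_conf.

    `ocr` holds parallel lists under "text", "conf", "left", "top",
    "width" and "height".
    """
    words = []
    for i, text in enumerate(ocr["text"]):
        # Confidence may arrive as int, float or str; -1 marks non-word rows.
        conf = float(ocr["conf"][i])
        if text.strip() and conf >= min_conf:
            left, top = ocr["left"][i], ocr["top"][i]
            words.append({
                "text": text.strip(),
                "conf": conf,
                # (x0, y0, x1, y1) box, the layout input a LayoutLM-style model needs
                "box": (left, top, left + ocr["width"][i], top + ocr["height"][i]),
            })
    return words
```

A digit read at confidence 28 is dropped before the extractor ever sees it, which is why faint form-cell numbers end up in the manual-completion path rather than as wrong values.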

	---

	## Author

	Mohamed Aziz Miladi — AMARIS internship project (Guichet Accueil Infrastructures).