---
language:
- en
license: mit
tags:
- lightgbm
- tabular-regression
- energy
- epc
- uk
- property
datasets: []
metrics:
- mae
model-index:
- name: uk-epc-model
  results:
  - task:
      type: tabular-regression
    metrics:
    - type: mae
      value: 3.09
      name: MAE (SAP score, test set 2024–2026)
---

# UK EPC Rating Predictor

A LightGBM gradient-boosted tree model that predicts residential Energy Performance Certificate (EPC) ratings for properties in England and Wales.

Given property characteristics a homeowner already knows (wall type, heating system, floor area, age band, etc.), the model predicts:
- A numeric SAP 2012 efficiency score (1–100)
- A letter grade (A–G)

## Model details

| Detail | Value |
|--------|-------|
| Algorithm | LightGBM (gradient-boosted trees) |
| Objective | MAE regression (`regression_l1`) |
| Trees | 5,000 |
| Leaves per tree | up to 857 |
| Features | 40 |
| Training rows | 19,279,916 |
| Test rows | 4,045,192 |
| MAE (test set) | 3.09 SAP points |
| Exact grade accuracy | 77.4% (calibrated) |
| Within-1-band accuracy | 98.7% |

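
The three headline metrics in the table can be computed as in the minimal sketch below. It assumes that "within-1-band" means the predicted grade is at most one band away from the certificate grade; the function names and the sample values are illustrative and are not taken from the repository.

```python
from typing import Sequence

GRADES = "GFEDCBA"  # ordered from worst to best band


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute gap between true and predicted SAP scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def grade_accuracy(true_grades: Sequence[str], pred_grades: Sequence[str],
                   tolerance: int = 0) -> float:
    """Share of predictions whose band is within `tolerance` bands of the truth."""
    hits = sum(
        abs(GRADES.index(t) - GRADES.index(p)) <= tolerance
        for t, p in zip(true_grades, pred_grades)
    )
    return hits / len(true_grades)


# Tiny illustrative sample, not real EPC data.
true_scores, pred_scores = [72.0, 55.0, 40.0, 85.0], [70.0, 58.0, 35.0, 85.0]
true_grades, pred_grades = ["C", "D", "E", "B"], ["C", "D", "F", "B"]

mae = mean_absolute_error(true_scores, pred_scores)                  # 2.5
exact = grade_accuracy(true_grades, pred_grades)                     # 0.75
within_one = grade_accuracy(true_grades, pred_grades, tolerance=1)   # 1.0
```
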
## Training data

Trained on the public EPC register maintained by the Ministry of Housing, Communities and Local Government (MHCLG), covering all domestic EPC assessments lodged in England and Wales from 2012 to 2023.

- **Source:** https://get-energy-performance-data.communities.gov.uk/
- **Train split:** 2012–2023 (19.3M records)
- **Test split:** 2024–2026 (4.0M records), a time-based split that simulates deployment on certificates lodged after the training period

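
A time-based split of this kind might look like the sketch below. The field name `lodgement_date` and the helper `time_split` are assumptions made for illustration; the repository's training pipeline may name and structure things differently.

```python
from datetime import date

TRAIN_END_YEAR = 2023  # last lodgement year in the training split, per this card


def time_split(records, date_key="lodgement_date"):
    """Split records into (train, test) by lodgement year."""
    train = [r for r in records if r[date_key].year <= TRAIN_END_YEAR]
    test = [r for r in records if r[date_key].year > TRAIN_END_YEAR]
    return train, test


# Illustrative records, not real register rows.
records = [
    {"lodgement_date": date(2015, 6, 1), "sap_score": 61},
    {"lodgement_date": date(2023, 12, 31), "sap_score": 74},
    {"lodgement_date": date(2024, 1, 1), "sap_score": 68},
]
train, test = time_split(records)
```

Splitting on date rather than at random keeps every test certificate later than every training certificate, so the test score reflects how the model handles assessments it could not have seen.
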
## Features

The model uses 40 features across four categories:
- **Component efficiency ratings** (9): walls, roof, floor, windows, main heating, heating controls, hot water, lighting, secondary heating
- **Binary flags** (5): mains gas, solar water heating, solar PV, low energy lighting, flat top storey
- **Numeric** (10): floor area, room counts, floor level, storey count, glazing proportion, age band, etc.
- **Categorical** (16): property type, built form, fuel type, heating system description, wall/roof/floor descriptions, etc.

Assessor-only fields not available to homeowners (transaction type, floor height, lighting outlet counts) are excluded from the feature set.

## Usage

```python
import json

import lightgbm as lgb

# Load the trained booster and the accompanying feature metadata
booster = lgb.Booster(model_file="lgbm_epc.txt")
with open("feature_meta.json") as f:
    meta = json.load(f)

# See the full inference pipeline at:
# https://github.com/kulbinderdio/uk-epc-model/blob/main/src/model/predict.py
```

The full inference pipeline (including categorical encoding and grade threshold calibration) is in [`src/model/predict.py`](https://github.com/kulbinderdio/uk-epc-model/blob/main/src/model/predict.py) in the GitHub repository.

## Grade calibration

After training, the grade boundaries are optimised with Nelder-Mead minimisation on the first 100K test rows. The calibrated boundaries, compared with the SAP 2012 standard:

| Boundary | SAP 2012 standard | Calibrated |
|----------|-------------------|------------|
| G/F | 21.0 | 22.3 |
| F/E | 39.0 | 38.3 |
| E/D | 55.0 | 53.7 |
| D/C | 69.0 | 68.0 |
| C/B | 81.0 | 80.1 |
| B/A | 92.0 | 91.1 |

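
To show how the two sets of boundaries change the resulting grade, here is a minimal sketch that maps a predicted score to a letter. It assumes a score equal to a boundary belongs to the better band, as in the SAP 2012 standard, where 21 is the lowest F. The helper `score_to_grade` is illustrative and is not the repository's function.

```python
from bisect import bisect_right

# Lower edge of bands F, E, D, C, B, A, copied from the table above.
SAP_STANDARD = [21.0, 39.0, 55.0, 69.0, 81.0, 92.0]
CALIBRATED = [22.3, 38.3, 53.7, 68.0, 80.1, 91.1]
GRADES = "GFEDCBA"  # index 0 is the worst band


def score_to_grade(score: float, boundaries=CALIBRATED) -> str:
    """Return the band whose range contains `score`."""
    # bisect_right counts how many boundaries are <= score,
    # which is exactly the index of the band in GRADES.
    return GRADES[bisect_right(boundaries, score)]


print(score_to_grade(68.5))                # C with calibrated boundaries
print(score_to_grade(68.5, SAP_STANDARD))  # D with standard boundaries
```
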
Calibrated thresholds are stored in `feature_meta.json`.

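
The calibration step could be sketched as follows, using SciPy's `minimize` with `method="Nelder-Mead"`. The objective (share of rows assigned the wrong band), the synthetic data, and the helper names are all assumptions; this card does not say which loss the repository minimises.

```python
import numpy as np
from scipy.optimize import minimize

SAP_STANDARD = np.array([21.0, 39.0, 55.0, 69.0, 81.0, 92.0])


def grade_index(scores, boundaries):
    """Band index per score: 0 = G ... 6 = A."""
    return np.searchsorted(np.sort(boundaries), scores, side="right")


def error_rate(boundaries, pred_scores, true_bands):
    """Share of rows whose predicted band differs from the certificate band."""
    return float(np.mean(grade_index(pred_scores, boundaries) != true_bands))


# Synthetic stand-in for a calibration slice: true scores plus biased, noisy predictions.
rng = np.random.default_rng(0)
true_scores = rng.normal(65, 12, size=5_000).clip(1, 100)
pred_scores = 0.9 * true_scores + 6 + rng.normal(0, 3, size=5_000)
true_bands = grade_index(true_scores, SAP_STANDARD)

# Start from the standard boundaries and let Nelder-Mead move them.
result = minimize(
    error_rate,
    x0=SAP_STANDARD,
    args=(pred_scores, true_bands),
    method="Nelder-Mead",
    options={"xatol": 0.01, "fatol": 1e-4, "maxiter": 2_000},
)
calibrated = np.sort(result.x)
```

A derivative-free method suits this objective because the error rate is piecewise constant in the boundaries, so gradients carry no information.
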
## Accuracy by property type

| Type | MAE (SAP points) | Exact grade accuracy |
|------|------------------|----------------------|
| House | 2.99 | 75.3% |
| Flat | 3.04 | 74.6% |
| Maisonette | 3.13 | 75.3% |
| Park home | 3.84 | 76.8% |
| Bungalow | 3.92 | 70.0% |

## Top features (by gain)

1. `walls_description`: wall construction and insulation type
2. `construction_age_band`: period in which the property was built
3. `floor_description`: floor construction and insulation
4. `total_floor_area`: property size in m²
5. `roof_description`: roof type and insulation level

## Limitations

- Predictions are estimates only and are not a substitute for an official EPC from an accredited assessor
- Uncertainty is highest near grade boundaries: with a test MAE of 3.09, a prediction within about ±3 SAP points of a boundary can easily fall in the neighbouring band
- Bungalows have the lowest exact grade accuracy (70.0%) due to higher variance in insulation setups
- The model was trained on assessor-submitted data; self-reported inputs add a further layer of uncertainty

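
One way to act on the boundary caveat is to flag predictions that sit close to a threshold. The sketch below uses the calibrated boundaries from this card and a 3-point margin; the helper `near_boundary` and the margin value are illustrative assumptions, not part of the repository.

```python
CALIBRATED = [22.3, 38.3, 53.7, 68.0, 80.1, 91.1]  # from the grade calibration table
MARGIN = 3.0  # SAP points, roughly the test-set MAE


def near_boundary(score: float, boundaries=CALIBRATED, margin: float = MARGIN) -> bool:
    """True when `score` lies within `margin` points of any grade boundary."""
    return min(abs(score - b) for b in boundaries) <= margin


print(near_boundary(66.5))  # True: 1.5 points below the D/C boundary
print(near_boundary(60.0))  # False: more than 6 points from the nearest boundary
```
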
## Repository

Full source code, training pipeline, API, and web frontend:
https://github.com/kulbinderdio/uk-epc-model

## License

MIT