| # 📐 LLM-BEM-Engineer Benchmark for Automated Building Energy Model Generation |
|
|
| This benchmark dataset is designed to evaluate the capability of **Large Language Models (LLMs)** in generating **Building Energy Models (BEMs)** from natural language descriptions. |
|
|
| The benchmark focuses on two essential aspects of real-world applicability: |
|
|
| - **Scalability**: The ability of LLMs to handle a wide range of building configurations and system complexities. |
| - **Robustness**: The ability of LLMs to correctly infer user intent under noisy, ambiguous, or incomplete inputs. |
|
|
|
|
| ## 📁 Dataset Overview |
|
|
| The benchmark consists of **two complementary test sets**, each designed to evaluate a different capability of LLMs in automated building energy model generation. |
|
|
| | Dataset | Purpose | Description | |
| |-------|--------|-------------| |
| | `detailed_prompt_test` | Scalability benchmark | Well-specified and detailed building modeling prompts | |
| | `robust_prompt_test` | Robustness benchmark | Noisy and high-level user input prompts | |
|
|
| Each test set follows a consistent internal structure: |
| - `*_prompts.py` |
| Contains the natural language prompts used as inputs to LLMs. |
|
|
| - `corresponding_auto-generated_models/` |
| Stores the building energy models automatically generated by LLMs for the corresponding prompts. |
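
The layout above can be traversed programmatically. A minimal sketch, assuming the directory names shown; the model file extensions and helper name are illustrative and may differ from the actual repository:

```python
from pathlib import Path

def list_generated_models(test_set_dir: Path) -> list[Path]:
    """Collect the auto-generated model files for one test set."""
    model_dir = test_set_dir / "corresponding_auto-generated_models"
    return sorted(model_dir.glob("*")) if model_dir.is_dir() else []

# Example: iterate both benchmark test sets.
for name in ("detailed_prompt_test", "robust_prompt_test"):
    models = list_generated_models(Path(name))
    print(f"{name}: {len(models)} generated models")
```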
|
|
|
|
| ## (1) Scalability Benchmark |
|
|
| The `detailed_prompt_test` dataset contains **various building energy modeling scenarios**, designed to test whether LLMs can scale across diverse modeling requirements. |
|
|
| ### Covered Modeling Dimensions |
|
|
| Each prompt may include combinations of the following specifications: |
|
|
- HVAC (heating, ventilation, and air-conditioning) systems, with related configuration parameters (e.g., COP, capacity, efficiency, flow rate) and settings (e.g., auto-sizing)
| - Building geometries |
| - Number of stories |
| - Number of space types and related details |
| - Envelope constructions and materials |
| - Occupancy and operational schedules |
| - Thermostat setpoints |
- Internal loads (e.g., lighting, equipment)
| - Building orientations |
| - Window-to-wall ratios (WWRs) |
| - Zoning strategies |
|
|
| Each file name ends with **two numeric suffixes**, encoding the **HVAC system type** and the **building geometry type**: |
|
|
| ### First Suffix — HVAC System Type |
|
|
| | Suffix | HVAC System | |
| |----|------------| |
| | 1 | Direct expansion (DX) air-conditioning system with electric heater | |
| | 2 | DX air-conditioning system with fuel burner | |
| | 3 | DX air-source heat pump | |
| 4 | Variable refrigerant flow (VRF) system |
| | 5 | Dedicated outdoor air system (DOAS) + VRF system, with multiple air handling units (AHUs) | |
| | 6 | DOAS + fan coil unit (FCU) system, with multiple AHUs | |
| | 7 | FCU system | |
| | 8 | Variable air volume (VAV) system, with multiple AHUs | |
| 9 | Hybrid VAV + FCU system |
|
|
| ### Second Suffix — Building Geometry Type |
|
|
| | Suffix | Geometry Description | |
| |----|---------------------| |
| | 1 | U-shaped building with gable roof | |
| | 2 | U-shaped building with flat roof | |
| | 3 | T-shaped building with gable roof | |
| | 4 | T-shaped building with flat roof | |
| | 5 | Square (rectangular) building with hip roof | |
| | 6 | Square (rectangular) building with gable roof | |
| | 7 | Square (rectangular) building with flat roof | |
| | 8 | Square (rectangular) building with core–perimeter zoning and hip roof | |
| | 9 | Square (rectangular) building with core–perimeter zoning and gable roof | |
| | 10 | Square (rectangular) building with core–perimeter zoning and flat roof | |
| | 11 | L-shaped building with gable roof | |
| | 12 | L-shaped building with flat roof | |
| | 13 | Hollow square (courtyard) building with gable roof | |
| | 14 | Hollow square (courtyard) building with flat roof | |
|
|
|
|
| ## (2) Robustness Benchmark |
|
|
| The `robust_prompt_test` dataset evaluates the robustness of LLMs to **noisy and imperfect user inputs**, simulating real-world interactions. |
|
|
| ### Noise Characteristics |
|
|
| Prompts include various types of input noise, such as: |
|
|
| - Spelling errors |
| - Ambiguous or general descriptions |
| - Incomplete or missing specifications |
| - Diverse sentence structures |
| - Informal or unstructured language |
|
|
| All prompts are **synthetically generated by GPT-5**, simulating noisy and high-level user intent commonly observed in real-world applications. |
|
|
| Each file name ends with a numeric suffix indicating a **distinct robustness test case**, where each case corresponds to a unique noisy user input scenario. |
|
|
| ### Ambiguity, User Intent, and Modeling Precision |
|
|
When users aim to obtain more specific or accurate building energy models, they are expected to explicitly provide key modeling information such as building geometry, number of thermal zones per story, space type definitions, and system details. Providing such information improves modeling precision and helps reduce model hallucination. This mirrors human communication: even in expert-to-expert interactions, clear and explicit specifications are required to produce accurate technical outputs.
|
|
| In cases where user intent is ambiguous or underspecified, the generated building model inevitably reflects the LLM’s own interpretation of the input. As a result, the output represents the closest plausible model inferred from the provided intent, rather than a uniquely determined solution. |
|
|
| <!-- ### Multi-Round Robust Inference Mechanism |
|
|
| Because ambiguous user intent must be interpreted by LLMs themselves, this benchmark incorporates a multi-round try to improve robustness as LLMs progressively better understand user intent, ensuring that the generated models converge toward the intended user requirements. |
| --> |
|
|
| ## 🎯 Benchmark Objectives |
|
|
| This dataset is suitable for: |
|
|
| - Benchmarking LLMs in building energy modeling tasks |
| - Research on AI-assisted building simulation workflows |
| - Robustness testing of natural language interfaces |
- Comparative evaluation of different state-of-the-art (SOTA) LLMs
|
|
|
|
| ## 🧪 Suggested Evaluation Criteria |
|
|
| Users may evaluate LLM outputs using one or more of the following criteria: |
|
|
| - Geometry correctness |
| - HVAC system selection accuracy |
| - Completeness of generated model components |
| - Constraint violations |
| - Simulation success rate (e.g., EnergyPlus error-free execution) |
| - Robust intent inference under noisy prompts |
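
For the simulation success criterion, one lightweight approach is to scan the EnergyPlus `.err` file for Severe or Fatal messages while tolerating warnings. A minimal sketch; the message layout follows the usual `.err` conventions but should be verified against the EnergyPlus version in use, and the excerpts below are illustrative, not taken from the dataset:

```python
def simulation_error_free(err_text: str) -> bool:
    """Return True when an EnergyPlus .err file reports no Severe/Fatal issues.

    Heuristic: EnergyPlus message lines begin with "**" followed by a
    severity keyword (Warning, Severe, Fatal). Warnings are tolerated.
    """
    for line in err_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("**") and any(
            word in stripped.lower() for word in ("severe", "fatal")
        ):
            return False
    return True

# Illustrative .err excerpts.
ok = "   ** Warning ** Weather file location differs from input.\n"
bad = "   ** Severe  ** Node connection error.\n"
print(simulation_error_free(ok), simulation_error_free(bad))  # True False
```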
|
|
|
|
| ## 📄 License & Citation |
|
|
Please cite this repository if it is used in academic or technical work.
|
|
| Related papers coming soon! |
|
|