# 📐 LLM-BEM-Engineer Benchmark for Automated Building Energy Model Generation
This benchmark dataset is designed to evaluate the capability of Large Language Models (LLMs) in generating Building Energy Models (BEMs) from natural language descriptions.
The benchmark focuses on two essential aspects of real-world applicability:
- Scalability: The ability of LLMs to handle a wide range of building configurations and system complexities.
- Robustness: The ability of LLMs to correctly infer user intent under noisy, ambiguous, or incomplete inputs.
## 📁 Dataset Overview
The benchmark consists of two complementary test sets, each designed to evaluate a different capability of LLMs in automated building energy model generation.
| Dataset | Purpose | Description |
|---|---|---|
| `detailed_prompt_test` | Scalability benchmark | Well-specified and detailed building modeling prompts |
| `robust_prompt_test` | Robustness benchmark | Noisy and high-level user input prompts |
Each test set follows a consistent internal structure:

- `*_prompts.py`: contains the natural language prompts used as inputs to LLMs.
- `corresponding_auto-generated_models/`: stores the building energy models automatically generated by LLMs for the corresponding prompts.
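Given that layout, pairing each prompt script with its generated models reduces to two glob patterns. The sketch below assumes only the structure described above (a top-level `*_prompts.py` file and a `corresponding_auto-generated_models/` directory inside each test set); the function name `list_test_set` is illustrative, not part of the dataset.

```python
from pathlib import Path

def list_test_set(root: str) -> dict:
    """Collect prompt scripts and generated-model files for one test set.

    Assumes the documented layout: `*_prompts.py` at the top level of the
    test set, and LLM-generated models under
    `corresponding_auto-generated_models/`.
    """
    base = Path(root)
    return {
        "prompt_files": sorted(base.glob("*_prompts.py")),
        "model_files": sorted(
            (base / "corresponding_auto-generated_models").glob("*")
        ),
    }
```

Sorting both lists keeps prompt/model ordering deterministic across runs, which matters when results are compared file-by-file.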
### (1) Scalability Benchmark
The detailed_prompt_test dataset contains various building energy modeling scenarios, designed to test whether LLMs can scale across diverse modeling requirements.
#### Covered Modeling Dimensions
Each prompt may include combinations of the following specifications:
- HVAC (heating, ventilation, and air-conditioning) systems, with related configs (e.g., COP, capacity, efficiency, flow rate) and settings (e.g., auto-sizing)
- Building geometries
- Number of stories
- Number of space types and related details
- Envelope constructions and materials
- Occupancy and operational schedules
- Thermostat setpoints
- Internal loads (e.g., lighting and equipment)
- Building orientations
- Window-to-wall ratios (WWRs)
- Zoning strategies
Each file name ends with two numeric suffixes, encoding the HVAC system type and the building geometry type:
#### First Suffix — HVAC System Type
| Suffix | HVAC System |
|---|---|
| 1 | Direct expansion (DX) air-conditioning system with electric heater |
| 2 | DX air-conditioning system with fuel burner |
| 3 | DX air-source heat pump |
| 4 | Variable refrigerant flow (VRF) system |
| 5 | Dedicated outdoor air system (DOAS) + VRF system, with multiple air handling units (AHUs) |
| 6 | DOAS + fan coil unit (FCU) system, with multiple AHUs |
| 7 | FCU system |
| 8 | Variable air volume (VAV) system, with multiple AHUs |
| 9 | Hybrid VAV + FCU system |
#### Second Suffix — Building Geometry Type
| Suffix | Geometry Description |
|---|---|
| 1 | U-shaped building with gable roof |
| 2 | U-shaped building with flat roof |
| 3 | T-shaped building with gable roof |
| 4 | T-shaped building with flat roof |
| 5 | Square (rectangular) building with hip roof |
| 6 | Square (rectangular) building with gable roof |
| 7 | Square (rectangular) building with flat roof |
| 8 | Square (rectangular) building with core–perimeter zoning and hip roof |
| 9 | Square (rectangular) building with core–perimeter zoning and gable roof |
| 10 | Square (rectangular) building with core–perimeter zoning and flat roof |
| 11 | L-shaped building with gable roof |
| 12 | L-shaped building with flat roof |
| 13 | Hollow square (courtyard) building with gable roof |
| 14 | Hollow square (courtyard) building with flat roof |
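The two suffix tables above can be turned into a small lookup utility. The sketch below assumes the suffixes are underscore-separated at the end of the file stem (e.g. a stem like `model_5_10`); the exact separator and the function name `decode_suffixes` are assumptions for illustration, not guarantees about the dataset's naming.

```python
import re

# Descriptions abbreviated from the suffix tables above.
HVAC_TYPES = {
    1: "DX air-conditioning with electric heater",
    2: "DX air-conditioning with fuel burner",
    3: "DX air-source heat pump",
    4: "VRF system",
    5: "DOAS + VRF, multiple AHUs",
    6: "DOAS + FCU, multiple AHUs",
    7: "FCU system",
    8: "VAV system, multiple AHUs",
    9: "Hybrid VAV + FCU system",
}

GEOMETRY_TYPES = {
    1: "U-shaped, gable roof", 2: "U-shaped, flat roof",
    3: "T-shaped, gable roof", 4: "T-shaped, flat roof",
    5: "Square, hip roof", 6: "Square, gable roof", 7: "Square, flat roof",
    8: "Square, core-perimeter zoning, hip roof",
    9: "Square, core-perimeter zoning, gable roof",
    10: "Square, core-perimeter zoning, flat roof",
    11: "L-shaped, gable roof", 12: "L-shaped, flat roof",
    13: "Courtyard, gable roof", 14: "Courtyard, flat roof",
}

def decode_suffixes(stem: str) -> tuple[str, str]:
    """Map the two trailing numeric suffixes to (HVAC, geometry) labels."""
    m = re.search(r"_(\d+)_(\d+)$", stem)
    if m is None:
        raise ValueError(f"no trailing numeric suffixes in {stem!r}")
    return HVAC_TYPES[int(m.group(1))], GEOMETRY_TYPES[int(m.group(2))]
```

A lookup like this makes it easy to group benchmark results by HVAC type or geometry when aggregating scores.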
### (2) Robustness Benchmark
The robust_prompt_test dataset evaluates the robustness of LLMs to noisy and imperfect user inputs, simulating real-world interactions.
#### Noise Characteristics
Prompts include various types of input noise, such as:
- Spelling errors
- Ambiguous or general descriptions
- Incomplete or missing specifications
- Diverse sentence structures
- Informal or unstructured language
All prompts are synthetically generated by GPT-5, simulating noisy and high-level user intent commonly observed in real-world applications.
Each file name ends with a numeric suffix indicating a distinct robustness test case, where each case corresponds to a unique noisy user input scenario.
### Ambiguity, User Intent, and Modeling Precision
When users aim to obtain more specific or accurate building energy models, they are expected to explicitly provide key modeling information, such as building geometry, the number of thermal zones per story, space type definitions, and system details. Providing such information improves modeling precision and helps reduce model hallucination. This mirrors human communication: even in expert-to-expert interactions, clear and explicit specifications are required to produce accurate technical outputs.
In cases where user intent is ambiguous or underspecified, the generated building model inevitably reflects the LLM’s own interpretation of the input. As a result, the output represents the closest plausible model inferred from the provided intent, rather than a uniquely determined solution.
## 🎯 Benchmark Objectives
This dataset is suitable for:
- Benchmarking LLMs in building energy modeling tasks
- Research on AI-assisted building simulation workflows
- Robustness testing of natural language interfaces
- Comparative evaluation of different state-of-the-art (SOTA) LLMs
## 🧪 Suggested Evaluation Criteria
Users may evaluate LLM outputs using one or more of the following criteria:
- Geometry correctness
- HVAC system selection accuracy
- Completeness of generated model components
- Constraint violations
- Simulation success rate (e.g., EnergyPlus error-free execution)
- Robust intent inference under noisy prompts
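For the simulation-success criterion, one option is to classify each run from the `.err` file that EnergyPlus writes alongside its outputs. The sketch below assumes the standard EnergyPlus error-file markers (`** Severe  **`, `**  Fatal  **`, and the final `EnergyPlus Completed Successfully` summary line); verify these against the EnergyPlus version you run, and note that the function name `simulation_succeeded` is illustrative.

```python
def simulation_succeeded(err_text: str) -> bool:
    """Classify an EnergyPlus .err file as an error-free run.

    A run counts as successful when the summary line reports completion
    and no Severe or Fatal messages appear in the log.
    """
    has_fatal = "**  Fatal  **" in err_text
    has_severe = "** Severe  **" in err_text
    completed = "EnergyPlus Completed Successfully" in err_text
    return completed and not (has_fatal or has_severe)
```

Applying this check across all generated models of a test set and taking the mean yields the simulation success rate suggested above.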
## 📄 License & Citation
Please cite this repository if used in academic or technical work.
Related papers coming soon!