# 📐 LLM-BEM-Engineer Benchmark for Automated Building Energy Model Generation
This benchmark dataset is designed to evaluate the capability of Large Language Models (LLMs) in generating Building Energy Models (BEMs) from natural language descriptions.
The benchmark focuses on two essential aspects of real-world applicability:
- Scalability: The ability of LLMs to handle a wide range of building configurations and system complexities.
- Robustness: The ability of LLMs to correctly infer user intent under noisy, ambiguous, or incomplete inputs.
## 📁 Dataset Overview
The benchmark consists of two complementary test sets, each designed to evaluate a different capability of LLMs in automated building energy model generation.
| Dataset | Purpose | Description |
|---|---|---|
| `detailed_prompt_test` | Scalability benchmark | Well-specified and detailed building modeling prompts |
| `robust_prompt_test` | Robustness benchmark | Noisy and high-level user input prompts |
Each test set follows a consistent internal structure:

- `*_prompts.py`: contains the natural language prompts used as inputs to LLMs.
- `corresponding_auto-generated_models/`: stores the building energy models automatically generated by LLMs for the corresponding prompts.
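Given that layout, pairing each prompt script with its generated models reduces to two glob patterns. The sketch below assumes only the structure described above (a top-level `*_prompts.py` file and a `corresponding_auto-generated_models/` directory inside each test set); the function name `list_test_set` is illustrative, not part of the dataset.

```python
from pathlib import Path

def list_test_set(root: str) -> dict:
    """Collect prompt scripts and generated-model files for one test set.

    Assumes the documented layout: `*_prompts.py` at the top level of the
    test set, and LLM-generated models under
    `corresponding_auto-generated_models/`.
    """
    base = Path(root)
    return {
        "prompt_files": sorted(base.glob("*_prompts.py")),
        "model_files": sorted(
            (base / "corresponding_auto-generated_models").glob("*")
        ),
    }
```

Sorting both lists keeps prompt/model ordering deterministic across runs, which matters when results are compared file-by-file.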
### (1) Scalability Benchmark
The detailed_prompt_test dataset contains various building energy modeling scenarios, designed to test whether LLMs can scale across diverse modeling requirements.
#### Covered Modeling Dimensions
Each prompt may include combinations of the following specifications:
- HVAC (heating, ventilation, and air-conditioning) systems, with related configs (e.g., COP, capacity, efficiency, flow rate) and settings (e.g., auto-sizing)
- Building geometries
- Number of stories
- Number of space types and related details
- Envelope constructions and materials
- Occupancy and operational schedules
- Thermostat setpoints
- Internal loads (e.g., lighting and equipment)
- Building orientations
- Window-to-wall ratios (WWRs)
- Zoning strategies
Each file name ends with two numeric suffixes, encoding the HVAC system type and the building geometry type:
#### First Suffix — HVAC System Type
| Suffix | HVAC System |
|---|---|
| 1 | Direct expansion (DX) air-conditioning system with electric heater |
| 2 | DX air-conditioning system with fuel burner |
| 3 | DX air-source heat pump |
| 4 | Variable refrigerant flow (VRF) system |
| 5 | Dedicated outdoor air system (DOAS) + VRF system, with multiple air handling units (AHUs) |
| 6 | DOAS + fan coil unit (FCU) system, with multiple AHUs |
| 7 | FCU system |
| 8 | Variable air volume (VAV) system, with multiple AHUs |
| 9 | Hybrid VAV + FCU system |
#### Second Suffix — Building Geometry Type
| Suffix | Geometry Description |
|---|---|
| 1 | U-shaped building with gable roof |
| 2 | U-shaped building with flat roof |
| 3 | T-shaped building with gable roof |
| 4 | T-shaped building with flat roof |
| 5 | Square (rectangular) building with hip roof |
| 6 | Square (rectangular) building with gable roof |
| 7 | Square (rectangular) building with flat roof |
| 8 | Square (rectangular) building with core–perimeter zoning and hip roof |
| 9 | Square (rectangular) building with core–perimeter zoning and gable roof |
| 10 | Square (rectangular) building with core–perimeter zoning and flat roof |
| 11 | L-shaped building with gable roof |
| 12 | L-shaped building with flat roof |
| 13 | Hollow square (courtyard) building with gable roof |
| 14 | Hollow square (courtyard) building with flat roof |
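The two suffix tables above can be turned into a small lookup utility. The sketch below assumes the suffixes are underscore-separated at the end of the file stem (e.g. a stem like `model_5_10`); the exact separator and the function name `decode_suffixes` are assumptions for illustration, not guarantees about the dataset's naming.

```python
import re

# Descriptions abbreviated from the suffix tables above.
HVAC_TYPES = {
    1: "DX air-conditioning with electric heater",
    2: "DX air-conditioning with fuel burner",
    3: "DX air-source heat pump",
    4: "VRF system",
    5: "DOAS + VRF, multiple AHUs",
    6: "DOAS + FCU, multiple AHUs",
    7: "FCU system",
    8: "VAV system, multiple AHUs",
    9: "Hybrid VAV + FCU system",
}

GEOMETRY_TYPES = {
    1: "U-shaped, gable roof", 2: "U-shaped, flat roof",
    3: "T-shaped, gable roof", 4: "T-shaped, flat roof",
    5: "Square, hip roof", 6: "Square, gable roof", 7: "Square, flat roof",
    8: "Square, core-perimeter zoning, hip roof",
    9: "Square, core-perimeter zoning, gable roof",
    10: "Square, core-perimeter zoning, flat roof",
    11: "L-shaped, gable roof", 12: "L-shaped, flat roof",
    13: "Courtyard, gable roof", 14: "Courtyard, flat roof",
}

def decode_suffixes(stem: str) -> tuple[str, str]:
    """Map the two trailing numeric suffixes to (HVAC, geometry) labels."""
    m = re.search(r"_(\d+)_(\d+)$", stem)
    if m is None:
        raise ValueError(f"no trailing numeric suffixes in {stem!r}")
    return HVAC_TYPES[int(m.group(1))], GEOMETRY_TYPES[int(m.group(2))]
```

A lookup like this makes it easy to group benchmark results by HVAC type or geometry when aggregating scores.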
### (2) Robustness Benchmark
The robust_prompt_test dataset evaluates the robustness of LLMs to noisy and imperfect user inputs, simulating real-world interactions.
#### Noise Characteristics
Prompts include various types of input noise, such as:
- Spelling errors
- Ambiguous or general descriptions
- Incomplete or missing specifications
- Diverse sentence structures
- Informal or unstructured language
All prompts are synthetically generated by GPT-5, simulating noisy and high-level user intent commonly observed in real-world applications.
Each file name ends with a numeric suffix indicating a distinct robustness test case, where each case corresponds to a unique noisy user input scenario.
### Ambiguity, User Intent, and Modeling Precision
When users aim to obtain more specific or accurate building energy models, they are expected to explicitly provide key modeling information, such as building geometry, the number of thermal zones per story, space type definitions, and system details. Providing such information improves modeling precision and helps reduce model hallucination. This mirrors human communication: even in expert-to-expert interactions, clear and explicit specifications are required to produce accurate technical outputs.
In cases where user intent is ambiguous or underspecified, the generated building model inevitably reflects the LLM’s own interpretation of the input. As a result, the output represents the closest plausible model inferred from the provided intent, rather than a uniquely determined solution.
## 🎯 Benchmark Objectives
This dataset is suitable for:
- Benchmarking LLMs in building energy modeling tasks
- Research on AI-assisted building simulation workflows
- Robustness testing of natural language interfaces
- Comparative evaluation of different state-of-the-art (SOTA) LLMs
## 🧪 Suggested Evaluation Criteria
Users may evaluate LLM outputs using one or more of the following criteria:
- Geometry correctness
- HVAC system selection accuracy
- Completeness of generated model components
- Constraint violations
- Simulation success rate (e.g., EnergyPlus error-free execution)
- Robust intent inference under noisy prompts
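For the simulation-success criterion, one option is to classify each run from the `.err` file that EnergyPlus writes alongside its outputs. The sketch below assumes the standard EnergyPlus error-file markers (`** Severe  **`, `**  Fatal  **`, and the final `EnergyPlus Completed Successfully` summary line); verify these against the EnergyPlus version you run, and note that the function name `simulation_succeeded` is illustrative.

```python
def simulation_succeeded(err_text: str) -> bool:
    """Classify an EnergyPlus .err file as an error-free run.

    A run counts as successful when the summary line reports completion
    and no Severe or Fatal messages appear in the log.
    """
    has_fatal = "**  Fatal  **" in err_text
    has_severe = "** Severe  **" in err_text
    completed = "EnergyPlus Completed Successfully" in err_text
    return completed and not (has_fatal or has_severe)
```

Applying this check across all generated models of a test set and taking the mean yields the simulation success rate suggested above.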
## 📄 License & Citation
Please cite this repository if used in academic or technical work.
Related papers coming soon!