
📐 LLM-BEM-Engineer Benchmark for Automated Building Energy Model Generation

This benchmark dataset is designed to evaluate the capability of Large Language Models (LLMs) in generating Building Energy Models (BEMs) from natural language descriptions.

The benchmark focuses on two essential aspects of real-world applicability:

  • Scalability: The ability of LLMs to handle a wide range of building configurations and system complexities.
  • Robustness: The ability of LLMs to correctly infer user intent under noisy, ambiguous, or incomplete inputs.

📁 Dataset Overview

The benchmark consists of two complementary test sets, each designed to evaluate a different capability of LLMs in automated building energy model generation.

| Dataset | Purpose | Description |
|---|---|---|
| `detailed_prompt_test` | Scalability benchmark | Well-specified, detailed building modeling prompts |
| `robust_prompt_test` | Robustness benchmark | Noisy, high-level user input prompts |

Each test set follows a consistent internal structure:

  • *_prompts.py
    Contains the natural language prompts used as inputs to LLMs.

  • corresponding_auto-generated_models/
    Stores the building energy models automatically generated by LLMs for the corresponding prompts.
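As a minimal sketch of how the two parts fit together, the helper below pairs each prompt with its generated model file. The `model_<id>.<ext>` naming pattern and the function itself are illustrative assumptions, not part of the dataset:

```python
def pair_prompts_with_models(prompts: dict, model_files: list) -> dict:
    """Match each prompt ID to its auto-generated model file.

    Assumes `prompts` maps an ID string (e.g. "3_7") to prompt text, and
    model files are named like "model_3_7.osm" (hypothetical pattern).
    Requires Python 3.9+ for str.removeprefix.
    """
    pairs = {}
    for name in sorted(model_files):
        stem = name.rsplit(".", 1)[0]          # drop the extension
        model_id = stem.removeprefix("model_")  # recover the shared ID
        if model_id in prompts:
            pairs[model_id] = {"prompt": prompts[model_id], "model": name}
    return pairs
```

Prompts without a matching model file are simply skipped, which makes it easy to spot generation failures by comparing the two key sets.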

(1) Scalability Benchmark

The detailed_prompt_test dataset contains various building energy modeling scenarios, designed to test whether LLMs can scale across diverse modeling requirements.

Covered Modeling Dimensions

Each prompt may include combinations of the following specifications:

  • HVAC (heating, ventilation, and air-conditioning) systems, with associated configurations (e.g., COP, capacity, efficiency, flow rate) and settings (e.g., auto-sizing)
  • Building geometries
  • Number of stories
  • Number of space types and related details
  • Envelope constructions and materials
  • Occupancy and operational schedules
  • Thermostat setpoints
  • Internal loads (e.g., lighting and equipment)
  • Building orientations
  • Window-to-wall ratios (WWRs)
  • Zoning strategies

Each file name ends with two numeric suffixes, encoding the HVAC system type and the building geometry type:

First Suffix — HVAC System Type

| Suffix | HVAC System |
|---|---|
| 1 | Direct expansion (DX) air-conditioning system with electric heater |
| 2 | DX air-conditioning system with fuel burner |
| 3 | DX air-source heat pump |
| 4 | Variable refrigerant flow (VRF) system |
| 5 | Dedicated outdoor air system (DOAS) + VRF system, with multiple air handling units (AHUs) |
| 6 | DOAS + fan coil unit (FCU) system, with multiple AHUs |
| 7 | FCU system |
| 8 | Variable air volume (VAV) system, with multiple AHUs |
| 9 | Hybrid VAV + FCU system |

Second Suffix — Building Geometry Type

| Suffix | Geometry Description |
|---|---|
| 1 | U-shaped building with gable roof |
| 2 | U-shaped building with flat roof |
| 3 | T-shaped building with gable roof |
| 4 | T-shaped building with flat roof |
| 5 | Square (rectangular) building with hip roof |
| 6 | Square (rectangular) building with gable roof |
| 7 | Square (rectangular) building with flat roof |
| 8 | Square (rectangular) building with core–perimeter zoning and hip roof |
| 9 | Square (rectangular) building with core–perimeter zoning and gable roof |
| 10 | Square (rectangular) building with core–perimeter zoning and flat roof |
| 11 | L-shaped building with gable roof |
| 12 | L-shaped building with flat roof |
| 13 | Hollow square (courtyard) building with gable roof |
| 14 | Hollow square (courtyard) building with flat roof |
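The two suffixes can be decoded programmatically. The sketch below assumes file names end in `_<hvac>_<geometry>.<ext>`; the abbreviated label tables are condensed from the full tables above:

```python
import re

# Condensed labels for the two suffix tables (see the full tables above).
HVAC_TYPES = {
    1: "DX AC + electric heater", 2: "DX AC + fuel burner",
    3: "DX air-source heat pump", 4: "VRF", 5: "DOAS + VRF",
    6: "DOAS + FCU", 7: "FCU", 8: "VAV", 9: "Hybrid VAV + FCU",
}
GEOMETRY_TYPES = {
    1: "U-shape, gable", 2: "U-shape, flat", 3: "T-shape, gable",
    4: "T-shape, flat", 5: "square, hip", 6: "square, gable",
    7: "square, flat", 8: "square core-perimeter, hip",
    9: "square core-perimeter, gable", 10: "square core-perimeter, flat",
    11: "L-shape, gable", 12: "L-shape, flat",
    13: "courtyard, gable", 14: "courtyard, flat",
}

def decode_suffixes(filename: str) -> tuple:
    """Return (hvac_label, geometry_label) from a name like 'prompt_5_10.py'."""
    m = re.search(r"_(\d+)_(\d+)\.\w+$", filename)
    if not m:
        raise ValueError(f"no recognizable suffixes in {filename!r}")
    return HVAC_TYPES[int(m.group(1))], GEOMETRY_TYPES[int(m.group(2))]
```

This also doubles as a sanity check: a `KeyError` or `ValueError` flags a file whose name does not follow the suffix convention.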

(2) Robustness Benchmark

The robust_prompt_test dataset evaluates the robustness of LLMs to noisy and imperfect user inputs, simulating real-world interactions.

Noise Characteristics

Prompts include various types of input noise, such as:

  • Spelling errors
  • Ambiguous or general descriptions
  • Incomplete or missing specifications
  • Diverse sentence structures
  • Informal or unstructured language

All prompts are synthetically generated by GPT-5, simulating noisy and high-level user intent commonly observed in real-world applications.

Each file name ends with a numeric suffix indicating a distinct robustness test case, where each case corresponds to a unique noisy user input scenario.

Ambiguity, User Intent, and Modeling Precision

When users aim to obtain more specific or accurate building energy models, they are expected to explicitly provide key modeling information, such as building geometry, the number of thermal zones per story, space type definitions, and system details. Providing this information improves modeling precision and helps reduce model hallucination. This mirrors human communication: even in expert-to-expert interactions, clear and explicit specifications are required to produce accurate technical outputs.

In cases where user intent is ambiguous or underspecified, the generated building model inevitably reflects the LLM’s own interpretation of the input. As a result, the output represents the closest plausible model inferred from the provided intent, rather than a uniquely determined solution.

🎯 Benchmark Objectives

This dataset is suitable for:

  • Benchmarking LLMs in building energy modeling tasks
  • Research on AI-assisted building simulation workflows
  • Robustness testing of natural language interfaces
  • Comparative evaluation of different state-of-the-art (SOTA) LLMs

🧪 Suggested Evaluation Criteria

Users may evaluate LLM outputs using one or more of the following criteria:

  • Geometry correctness
  • HVAC system selection accuracy
  • Completeness of generated model components
  • Constraint violations
  • Simulation success rate (e.g., EnergyPlus error-free execution)
  • Robust intent inference under noisy prompts
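The simulation-success criterion can be checked mechanically from an EnergyPlus `.err` log. The sketch below is an assumption-laden heuristic: the exact marker strings ("** Severe", "**  Fatal", and the completion banner) should be verified against the EnergyPlus version in use:

```python
def simulation_succeeded(err_text: str) -> bool:
    """Heuristic check of an EnergyPlus .err log.

    A run is treated as successful when the log contains the completion
    banner and no Severe/Fatal message markers. Marker strings are
    assumptions to verify against your EnergyPlus installation.
    """
    error_markers = ("** Severe", "** Fatal", "**  Fatal")
    has_errors = any(marker in err_text for marker in error_markers)
    completed = "EnergyPlus Completed Successfully" in err_text
    return completed and not has_errors
```

Aggregating this boolean over all generated models yields the simulation success rate for a given LLM under test.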

📄 License & Citation

Please cite this repository if used in academic or technical work.

Related papers coming soon!