# šŸ“ LLM-BEM-Engineer Benchmark for Automated Building Energy Model Generation This benchmark dataset is designed to evaluate the capability of **Large Language Models (LLMs)** in generating **Building Energy Models (BEMs)** from natural language descriptions. The benchmark focuses on two essential aspects of real-world applicability: - **Scalability**: The ability of LLMs to handle a wide range of building configurations and system complexities. - **Robustness**: The ability of LLMs to correctly infer user intent under noisy, ambiguous, or incomplete inputs. ## šŸ“ Dataset Overview The benchmark consists of **two complementary test sets**, each designed to evaluate a different capability of LLMs in automated building energy model generation. | Dataset | Purpose | Description | |-------|--------|-------------| | `detailed_prompt_test` | Scalability benchmark | Well-specified and detailed building modeling prompts | | `robust_prompt_test` | Robustness benchmark | Noisy and high-level user input prompts | Each test set follows a consistent internal structure: - `*_prompts.py` Contains the natural language prompts used as inputs to LLMs. - `corresponding_auto-generated_models/` Stores the building energy models automatically generated by LLMs for the corresponding prompts. ## (1) Scalability Benchmark The `detailed_prompt_test` dataset contains **various building energy modeling scenarios**, designed to test whether LLMs can scale across diverse modeling requirements. ### Covered Modeling Dimensions Each prompt may include combinations of the following specifications: - HVAC (heating, ventilation, and air-conditioning) systems, with related configs (e.g., COP, capacity, efficiency, flow rate) and settings (e.g., auto-sizing) - Building geometries - Number of stories - Number of space types and related details - Envelope constructions and materials - Occupancy and operational schedules - Thermostat setpoints - Internal loads (e.g., lighting, equipment, etc.) 
- Building orientations
- Window-to-wall ratios (WWRs)
- Zoning strategies

Each file name ends with **two numeric suffixes**, encoding the **HVAC system type** and the **building geometry type**:

### First Suffix — HVAC System Type

| Suffix | HVAC System |
|--------|-------------|
| 1 | Direct expansion (DX) air-conditioning system with electric heater |
| 2 | DX air-conditioning system with fuel burner |
| 3 | DX air-source heat pump |
| 4 | Variable refrigerant flow (VRF) system |
| 5 | Dedicated outdoor air system (DOAS) + VRF system, with multiple air handling units (AHUs) |
| 6 | DOAS + fan coil unit (FCU) system, with multiple AHUs |
| 7 | FCU system |
| 8 | Variable air volume (VAV) system, with multiple AHUs |
| 9 | Hybrid VAV + FCU system |

### Second Suffix — Building Geometry Type

| Suffix | Geometry Description |
|--------|----------------------|
| 1 | U-shaped building with gable roof |
| 2 | U-shaped building with flat roof |
| 3 | T-shaped building with gable roof |
| 4 | T-shaped building with flat roof |
| 5 | Square (rectangular) building with hip roof |
| 6 | Square (rectangular) building with gable roof |
| 7 | Square (rectangular) building with flat roof |
| 8 | Square (rectangular) building with core–perimeter zoning and hip roof |
| 9 | Square (rectangular) building with core–perimeter zoning and gable roof |
| 10 | Square (rectangular) building with core–perimeter zoning and flat roof |
| 11 | L-shaped building with gable roof |
| 12 | L-shaped building with flat roof |
| 13 | Hollow square (courtyard) building with gable roof |
| 14 | Hollow square (courtyard) building with flat roof |

## (2) Robustness Benchmark

The `robust_prompt_test` dataset evaluates the robustness of LLMs to **noisy and imperfect user inputs**, simulating real-world interactions.
### Noise Characteristics

Prompts include various types of input noise, such as:

- Spelling errors
- Ambiguous or general descriptions
- Incomplete or missing specifications
- Diverse sentence structures
- Informal or unstructured language

All prompts are **synthetically generated by GPT-5**, simulating the noisy, high-level user intent commonly observed in real-world applications.

Each file name ends with a numeric suffix indicating a **distinct robustness test case**, where each case corresponds to a unique noisy user input scenario.

### Ambiguity, User Intent, and Modeling Precision

When users aim to obtain more specific or accurate building energy models, they are expected to explicitly provide key modeling information, such as building geometry, number of thermal zones per story, space type definitions, and system details. Providing such information improves modeling precision and helps reduce model hallucination. This mirrors human communication: even in expert-to-expert interactions, clear and explicit specifications are required to produce accurate technical outputs.

In cases where user intent is ambiguous or underspecified, the generated building model inevitably reflects the LLM’s own interpretation of the input. As a result, the output represents the closest plausible model inferred from the provided intent, rather than a uniquely determined solution.
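The file-naming conventions described above (a two-number suffix for `detailed_prompt_test`, a single-number suffix for `robust_prompt_test`) can be decoded with a small helper. This is a minimal sketch: the concrete file names and the `.osm` extension in the examples are hypothetical, since the benchmark does not fix them here.

```python
import re


def parse_detailed_name(filename: str) -> tuple[int, int]:
    """Return (hvac_type, geometry_type) from a detailed_prompt_test file name.

    Assumes the stem ends in two underscore-separated numbers, e.g. the
    hypothetical "model_4_10.osm" -> (4, 10).
    """
    stem = filename.rsplit(".", 1)[0]
    m = re.search(r"_(\d+)_(\d+)$", stem)
    if m is None:
        raise ValueError(f"no (hvac, geometry) suffix pair in {filename!r}")
    return int(m.group(1)), int(m.group(2))


def parse_robust_name(filename: str) -> int:
    """Return the robustness test-case number from a robust_prompt_test file name.

    Assumes the stem ends in a single number, e.g. the hypothetical
    "robust_case_7.osm" -> 7.
    """
    stem = filename.rsplit(".", 1)[0]
    m = re.search(r"(\d+)$", stem)
    if m is None:
        raise ValueError(f"no case suffix in {filename!r}")
    return int(m.group(1))
```

Combined with the suffix tables above, this is enough to group generated models by HVAC system, geometry, or robustness scenario when computing per-category scores.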
## šŸŽÆ Benchmark Objectives

This dataset is suitable for:

- Benchmarking LLMs on building energy modeling tasks
- Research on AI-assisted building simulation workflows
- Robustness testing of natural language interfaces
- Comparative evaluation of state-of-the-art (SOTA) LLMs

## 🧪 Suggested Evaluation Criteria

Users may evaluate LLM outputs against one or more of the following criteria:

- Geometry correctness
- HVAC system selection accuracy
- Completeness of generated model components
- Constraint violations
- Simulation success rate (e.g., error-free EnergyPlus execution)
- Robust intent inference under noisy prompts

## šŸ“„ License & Citation

Please cite this repository if it is used in academic or technical work. Related papers coming soon!