GangJiang
/

LLM-BEM-Engineer

GangJiang committed on Jan 30

Commit

c3b5f03

verified ·

1 Parent(s): 1fe4dfd

Update LLM-BEM-Engineer_Benchmark/README.md

Files changed (1)

LLM-BEM-Engineer_Benchmark/README.md +158 -1

@@ -1 +1,158 @@
- User instruction

+# 🏗️ LLM Benchmark for Automated Building Energy Model Generation
+This repository provides a benchmark dataset designed to evaluate the capability of **Large Language Models (LLMs)** in generating **Building Energy Models (BEMs)** from natural language descriptions.
+The benchmark focuses on two essential aspects of real-world applicability:
+- **Scalability**: The ability of LLMs to handle a wide range of building configurations and system complexities.
+- **Robustness**: The ability of LLMs to correctly infer user intent from noisy, ambiguous, or incomplete inputs.
+---
+## 📦 Dataset Overview
+The benchmark consists of **two complementary test sets**:
+| Dataset | Purpose | Description |
+|-------|--------|-------------|
+| `detailed_prompt_test` | Scalability benchmark | Well-specified, detailed building modeling prompts |
+| `robust_prompt_test` | Robustness benchmark | Noisy and ambiguous user input prompts |
+---
+## 1️⃣ detailed_prompt_test — Scalability Benchmark
+The `detailed_prompt_test` dataset contains **126 building energy modeling scenarios**, designed to test whether LLMs can scale across diverse modeling requirements.
+### Covered Modeling Dimensions
+Each prompt may include combinations of the following specifications:
+- Building geometry
+- HVAC systems (heating, ventilation, and air-conditioning)
+- Number of stories
+- Envelope constructions and materials
+- Occupancy and operational schedules
+- Thermostat setpoints
+- Space types
+- Building orientation
+- Window-to-wall ratios (WWRs)
+- Zoning strategies
+This dataset reflects realistic complexity encountered in professional building energy modeling workflows.
+---
+### 📄 File Naming Convention
+Each file name ends with **two numeric codes**: the first identifies the HVAC system type (1 to 9) and the second identifies the building geometry type (1 to 14).
+---
+### 🔢 First Digit — HVAC System Type
+| Code | HVAC System |
+|----|------------|
+| 1 | DX system with electric heater |
+| 2 | DX system with fuel burner |
+| 3 | Heat pump |
+| 4 | VRF system |
+| 5 | DOAS + VRF, with multiple AHU units |
+| 6 | DOAS + FCU, with multiple AHU units |
+| 7 | FCU system |
+| 8 | VAV system, with multiple AHU units |
+| 9 | Hybrid VAV + FCU system |
+---
+### 🔢 Second Digit — Building Geometry Type
+| Code | Geometry Description |
+|----|---------------------|
+| 1 | U-shaped building with gable roof |
+| 2 | U-shaped building with flat roof |
+| 3 | T-shaped building with gable roof |
+| 4 | T-shaped building with flat roof |
+| 5 | Rectangular building with hip roof |
+| 6 | Rectangular building with gable roof |
+| 7 | Rectangular building with flat roof |
+| 8 | Rectangular building with core–perimeter zoning and hip roof |
+| 9 | Rectangular building with core–perimeter zoning and gable roof |
+| 10 | Rectangular building with core–perimeter zoning and flat roof |
+| 11 | L-shaped building with gable roof |
+| 12 | Flat-shaped building |
+| 13 | Hollow square (courtyard) building with gable roof |
+| 14 | Hollow square (courtyard) building with flat roof |
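The two code tables above can be expressed as lookup dictionaries. The sketch below is illustrative only and is not part of the benchmark: `HVAC_SYSTEMS`, `GEOMETRIES`, `describe_case`, and `ALL_CASES` are hypothetical names, and the helper takes the two codes as integers rather than parsing file names, because the exact file name pattern is not shown here. Enumerating every pairing gives 9 × 14 = 126 combinations, which is consistent with the 126 scenarios stated for `detailed_prompt_test`, assuming each pairing appears exactly once.

```python
from itertools import product

# HVAC system codes, copied from the "HVAC System Type" table.
HVAC_SYSTEMS = {
    1: "DX system with electric heater",
    2: "DX system with fuel burner",
    3: "Heat pump",
    4: "VRF system",
    5: "DOAS + VRF, with multiple AHU units",
    6: "DOAS + FCU, with multiple AHU units",
    7: "FCU system",
    8: "VAV system, with multiple AHU units",
    9: "Hybrid VAV + FCU system",
}

# Geometry codes, copied from the "Building Geometry Type" table.
GEOMETRIES = {
    1: "U-shaped building with gable roof",
    2: "U-shaped building with flat roof",
    3: "T-shaped building with gable roof",
    4: "T-shaped building with flat roof",
    5: "Rectangular building with hip roof",
    6: "Rectangular building with gable roof",
    7: "Rectangular building with flat roof",
    8: "Rectangular building with core-perimeter zoning and hip roof",
    9: "Rectangular building with core-perimeter zoning and gable roof",
    10: "Rectangular building with core-perimeter zoning and flat roof",
    11: "L-shaped building with gable roof",
    12: "Flat-shaped building",
    13: "Hollow square (courtyard) building with gable roof",
    14: "Hollow square (courtyard) building with flat roof",
}


def describe_case(hvac_code: int, geometry_code: int) -> str:
    """Return a readable label for one (HVAC, geometry) code pair."""
    if hvac_code not in HVAC_SYSTEMS:
        raise ValueError(f"unknown HVAC code: {hvac_code}")
    if geometry_code not in GEOMETRIES:
        raise ValueError(f"unknown geometry code: {geometry_code}")
    return f"{HVAC_SYSTEMS[hvac_code]} | {GEOMETRIES[geometry_code]}"


# Every (HVAC code, geometry code) pairing; assumes each occurs once.
ALL_CASES = list(product(HVAC_SYSTEMS, GEOMETRIES))

if __name__ == "__main__":
    print(len(ALL_CASES))        # 126
    print(describe_case(3, 7))   # Heat pump | Rectangular building with flat roof
```

Keeping the codes as plain integers avoids committing to a separator or prefix in the file names; a parser for the real names can be layered on top once the naming pattern is confirmed against the dataset files.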
+---
+## 2️⃣ robust_prompt_test — Robustness Benchmark
+The `robust_prompt_test` dataset evaluates the robustness of LLMs to **noisy and imperfect user inputs**, simulating real-world interactions.
+### Noise Characteristics
+Prompts include various types of input noise, such as:
+- Spelling errors
+- Ambiguous or vague descriptions
+- Incomplete or missing specifications
+- Diverse sentence structures
+- Informal or unstructured language
+All prompts are **synthetically generated by GPT-5**, simulating noisy user intent.
+---
+### 📄 File Naming Convention
+Each file name ends with a numeric suffix indicating a **distinct robustness test case**.
+Each case corresponds to a unique noisy user input scenario.
+---
+## 🎯 Benchmark Objectives
+This benchmark is designed to support the evaluation of:
+- LLM generalization across building types and HVAC systems
+- Accuracy of system and geometry inference
+- Completeness and validity of generated building models
+- Robustness to noisy, ambiguous, or incomplete user intent
+- Failure modes under increasing modeling complexity
+---
+## 🧪 Suggested Evaluation Criteria (Optional)
+Users may evaluate LLM outputs using one or more of the following criteria:
+- Geometry correctness
+- HVAC system selection accuracy
+- Completeness of generated model components
+- Constraint violations
+- Simulation success rate (e.g., EnergyPlus error-free execution)
+- Robust intent inference under noisy prompts
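Two of the criteria above, HVAC system selection accuracy and simulation success rate, reduce to simple ratios over the test cases. The following is a minimal sketch under stated assumptions: the `CaseResult` record and the `summarize` function are hypothetical and not provided by this repository, and it assumes that the expected HVAC code, the code inferred from the LLM output, and the simulation outcome have already been collected for each case by some external process.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CaseResult:
    """Outcome of one benchmark prompt (hypothetical record layout)."""
    case_id: str
    expected_hvac: int              # HVAC code the prompt asked for
    predicted_hvac: Optional[int]   # None if the output could not be parsed
    simulation_ok: bool             # True if the simulation ran without errors


def summarize(results: List[CaseResult]) -> Dict[str, float]:
    """Compute HVAC selection accuracy and simulation success rate."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    hvac_hits = sum(r.predicted_hvac == r.expected_hvac for r in results)
    sim_hits = sum(r.simulation_ok for r in results)
    return {
        "hvac_accuracy": hvac_hits / n,
        "simulation_success_rate": sim_hits / n,
    }


if __name__ == "__main__":
    demo = [
        CaseResult("a", 3, 3, True),
        CaseResult("b", 8, 8, False),
        CaseResult("c", 5, 4, True),
        CaseResult("d", 1, 1, False),
    ]
    print(summarize(demo))
```

An unparseable output is recorded as `predicted_hvac=None` and counts as a miss, so failures to produce a usable model lower the accuracy instead of being dropped from the denominator.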
+---
+## 📌 Intended Use
+This dataset is suitable for:
+- Benchmarking LLMs in building energy modeling tasks
+- Research on AI-assisted building simulation workflows
+- Robustness testing of natural language interfaces
+- Comparative evaluation of different LLM architectures
+---
+## 📄 License & Citation
+Please cite this repository if you use it in academic or technical work.
+(You may add license information, BibTeX, or DOI here.)
+---