🚌 EV-CarbonBench

A Comprehensive Benchmark and Domain Dataset for Assessing LLMs in Electric Bus Energy and Carbon Analysis

Submitted to NeurIPS 2026 Evaluations & Datasets Track

1. Dataset Summary

EV-CarbonBench is a domain-specific benchmark that pairs real-world electric-bus telemetry with structured evaluation questions designed to probe whether LLMs and VLMs can perform the multi-step reasoning, numerical estimation, multimodal fusion, and structured report generation required for trustworthy electric-bus carbon accounting.


Vehicles	10 BYD E6 electric buses (anonymized as `BUS_001` – `BUS_010`)
Period	2024-09-11 to 2024-09-20 (10 days)
Region	Shenzhen, China
Trips	1,044
Telemetry rows	~1.8 M (≈1 Hz, 92 columns)
Geographic images	4,176 (4 modalities × 1,044 trips)
Evaluation questions	≈505 across 7 modules
License	CC-BY-4.0 (data) / MIT (code)

⚠️ VIN anonymization. The 10 buses are released under anonymous identifiers BUS_001 through BUS_010. The mapping to original Vehicle Identification Numbers (VINs) is retained only by the authors to protect the bus operator's commercial information.

2. Dataset Structure

EV-CarbonBench/
├── README.md                   ← this file
├── croissant.json              ← machine-readable metadata (NeurIPS 2026 required)
├── LICENSE / LICENSE-CODE
├── trip_rows/                  ← raw telemetry (organised by anonymized bus ID)
│   └── BUS_001/  ...  BUS_010/
│       ├── BUS_00X_trip_0001.csv     (one CSV per trip, ~1Hz × 92 cols)
│       ├── BUS_00X_trip_0002.csv
│       ├── BUS_00X_road_info.csv     (road context, one row per trip)
│       └── BUS_00X_weather_info.csv  (weather context, one row per trip)
├── images/
│   ├── satellite/BUS_00X/BUS_00X_trip_*.png
│   ├── landuse/BUS_00X/...
│   ├── roadnet/BUS_00X/...
│   └── terrain/BUS_00X/...
├── questions/                  ← evaluation questions, one folder per module
│   ├── G1_visual/
│   ├── G2_cot/
│   ├── G3_format/
│   ├── S1a_energy/
│   ├── S1b_physical/
│   ├── S2_reports/
│   └── S3_multimodal/
├── splits/
│   ├── train_trip_ids.txt      (70%)
│   ├── dev_trip_ids.txt        (10%)
│   └── test_trip_ids.txt       (20%)
├── samples/                    ← 5 trips with full data + images for quick review
└── code/
    ├── config.py
    ├── utils.py
    ├── s1_scoring.py
    ├── aggregate_scores.py
    └── run_benchmark_v31.py
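
The split files can be sanity-checked before use. The following is a minimal sketch, assuming each file under splits/ holds one trip ID per line (the file format is not specified in this README); `load_split_ids` and `check_disjoint` are hypothetical helper names, not part of the released code/ folder.

```python
from pathlib import Path

# File names as listed under splits/ in the directory tree.
SPLIT_FILES = {
    "train": "train_trip_ids.txt",
    "dev": "dev_trip_ids.txt",
    "test": "test_trip_ids.txt",
}


def load_split_ids(splits_dir):
    """Read each split file into a set of trip IDs.

    Assumption: one trip ID per line; blank lines are ignored.
    """
    splits = {}
    for split, filename in SPLIT_FILES.items():
        text = Path(splits_dir, filename).read_text(encoding="utf-8")
        splits[split] = {line.strip() for line in text.splitlines() if line.strip()}
    return splits


def check_disjoint(splits):
    """Return True if no trip ID appears in more than one split."""
    total = sum(len(ids) for ids in splits.values())
    return total == len(set().union(*splits.values()))
```

Calling `check_disjoint(load_split_ids("splits"))` from the repository root should return True if the train, dev, and test sets do not share trips.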

3. Telemetry Schema (Selected Columns)

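Group	Columns (name as it appears in the raw CSV, with English gloss)
Vehicle ID	VIN (anonymized to `BUS_00X`), 累计里程 (cumulative mileage), 数据采集时间 (data acquisition time)
Powertrain	总电压 (total voltage), 总电流 (total current), SOC, 绝缘电阻 (insulation resistance), DC-DC 状态 (DC-DC status)
Battery cells	单体电压最高/最低值 (max/min cell voltage), 最高/最低温度值 (max/min temperature)
Motor	驱动电机状态 (drive motor status), 转速 (rotational speed), 转矩 (torque), 温度 (temperature)
Geography	经度 (longitude), 纬度 (latitude), 经过道路 (road traversed), 经过行政区 (administrative district traversed)
Weather	temperature_c, humidity_pct, wind_speed_kmh, precipitation_mm, ...
Road	dominant_road_type, avg_speed_limit_kmh, 快速路/主干道/次干道/支路占比 (share of expressway / arterial / secondary arterial / branch roads)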
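
As an illustration of how the powertrain columns relate to trip energy, the sketch below integrates voltage times current over time with the trapezoidal rule. It assumes that 总电压 is in volts, 总电流 is in amperes, and positive current means discharge; none of these conventions is stated in this README, so check them against the raw data first. `trip_energy_kwh` is a hypothetical helper, not a function from the released code.

```python
import numpy as np
import pandas as pd


def trip_energy_kwh(df, time_col="数据采集时间", volt_col="总电压", curr_col="总电流"):
    """Net battery energy over one trip in kWh (trapezoidal integral of V * I).

    Assumptions: volts, amperes, and positive current = discharge.
    """
    # Parse timestamps and sort so that time increases monotonically.
    df = df.assign(_t=pd.to_datetime(df[time_col])).sort_values("_t")
    power_kw = (df[volt_col] * df[curr_col]).to_numpy(dtype=float) / 1000.0
    hours = (df["_t"] - df["_t"].iloc[0]).dt.total_seconds().to_numpy() / 3600.0
    # Trapezoidal rule: average power of neighbouring samples times the time step.
    dt = np.diff(hours)
    return float(np.sum(0.5 * (power_kw[1:] + power_kw[:-1]) * dt))
```

Integrating over the actual timestamps, rather than assuming exactly one second between rows, keeps the estimate stable when samples are missing from the ~1 Hz stream.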

4. Evaluation Modules (v3.1)

General capabilities

G1 — Visual Recognition (~60 Q): chart reading + geographic image interpretation
G3 — Instruction Following (~40 Q): structured-format compliance

Domain capabilities

G2 — Domain CoT Reasoning (~60 Q): multi-step causal reasoning under physical constraints
S1-a — Energy Numerical Estimation (~150 Q): segment-level energy / CO₂ with bin-tolerance scoring
S1-b — Physical Consistency Judgment (~35 Q): identifying scenarios that violate physical bounds (new in v3.1)
S2 — Carbon-Accounting Report Generation (10 reports): 6-section daily reports, LLM-as-Judge
S3 — Multimodal Fusion Gain (~150 Q): same numerical task as S1-a but with multimodal context

Final score is a weighted geometric mean. See code/aggregate_scores.py.
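
For readers unfamiliar with the aggregation, here is a minimal sketch of a weighted geometric mean. The per-module weights are not listed in this README, so any values passed in are placeholders; `weighted_geometric_mean` is a hypothetical name, and code/aggregate_scores.py remains the authoritative implementation.

```python
import math


def weighted_geometric_mean(scores, weights):
    """Compute exp(sum_i w_i * ln(s_i) / sum_i w_i) over module scores.

    `scores` and `weights` are dicts keyed by module name. The weights
    are caller-supplied placeholders, not the benchmark's actual values.
    """
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same modules")
    if any(s < 0 for s in scores.values()) or any(w < 0 for w in weights.values()):
        raise ValueError("scores and weights must be non-negative")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    # A zero score on any weighted module drives the geometric mean to zero.
    if any(scores[m] == 0 and weights[m] > 0 for m in scores):
        return 0.0
    log_sum = sum(w * math.log(scores[m]) for m, w in weights.items() if w > 0)
    return math.exp(log_sum / total_weight)
```

Unlike an arithmetic mean, this aggregate penalizes a model that scores very poorly on one module, since a high score elsewhere cannot fully compensate.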

5. Quick Start

from datasets import load_dataset

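# Telemetry: match only the per-trip CSVs. The broader pattern "trip_rows/*/*.csv"
# would also pick up *_road_info.csv and *_weather_info.csv, which hold one row
# per trip rather than 1 Hz telemetry and may not share the same columns.
telemetry = load_dataset("ANONYMIZE/EV-CarbonBench", data_files="trip_rows/*/*_trip_*.csv")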

# Or clone everything
# git clone https://huggingface.co/datasets/ANONYMIZE/EV-CarbonBench

6. Datasheet (Gebru et al. 2021)

Motivation. Carbon accounting for electric public transport is a high-stakes engineering task that combines time-series telemetry, multimodal geographic context, physics-based reasoning, and structured reporting. No existing benchmark stresses LLMs along these dimensions simultaneously. EV-CarbonBench fills that gap.

Composition. Each instance is either (a) a 1 Hz telemetry row, (b) a trip-aligned PNG image, or (c) an evaluation Q-A item.

Collection. Telemetry was collected via the buses' on-board T-Box at ~1 Hz under contractual data-sharing agreements. Images were programmatically generated from public OpenStreetMap and SRTM tiles; weather data come from Open-Meteo. Questions were programmatically generated and then expert-reviewed (κ > 0.85 on a pilot set, ~87% retention).

Pre-processing. Trip segmentation by ignition cycle; deduplication; coordinate transformation to WGS-84; physically-impossible-value removal; 500 m fixed-distance segmentation. No imputation on raw telemetry. VINs replaced with BUS_001–BUS_010.
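
The 500 m fixed-distance segmentation can be pictured with the sketch below. It assumes a per-row cumulative distance in metres is already available (for example, derived from the odometer or from GPS positions; the README does not say which source the authors used). `assign_segments` is a hypothetical helper, not the authors' pipeline.

```python
import numpy as np


def assign_segments(cum_dist_m, segment_len_m=500.0):
    """Map each telemetry row to a fixed-distance segment index.

    `cum_dist_m` is the cumulative distance in metres for each row,
    assumed to be non-decreasing within a trip.
    """
    d = np.asarray(cum_dist_m, dtype=float)
    if d.size == 0:
        return np.array([], dtype=int)
    # Measure distance from the start of the trip, then bucket by segment length.
    d = d - d[0]
    return np.floor(d / segment_len_m).astype(int)
```

Rows sharing a segment index can then be grouped to compute segment-level quantities such as energy or mean speed.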

Uses. See croissant.json → rai:dataUseCases. Out-of-scope: training driver-surveillance systems, attempts to re-identify drivers, and deployment to other regions or vehicles without recalibration.

Distribution. Hugging Face dataset repo, CC-BY-4.0.

Maintenance. The authors will maintain the dataset for at least 5 years after publication. Issues and corrections are handled via the HF discussion tab. Releases are versioned.

7. Limitations & Known Biases

Geographic — single city (Shenzhen). May not transfer to highway-dominant or mountainous regions.
Temporal — 10 days in September (warm-humid). No winter / cold-start data.
Vehicle homogeneity — all 10 vehicles are BYD E6.
Carbon factor — 0.4268 kgCO₂/kWh corresponds to Shenzhen 2024 grid mix; not transferable as-is.
LLM-as-Judge bias — G2 and S2 use LLM-as-Judge scoring, so their scores may reflect the judge model's own preferences.
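
To make the carbon-factor limitation concrete, the sketch below converts energy to CO₂ with the factor as an explicit parameter. The default is the 0.4268 kgCO₂/kWh value quoted above; whether the benchmark applies it to battery-side or grid-side energy (that is, before or after charging losses) is not stated here, so treat that as an open assumption. `co2_kg` is a hypothetical helper.

```python
# Shenzhen 2024 grid factor quoted in this README, in kgCO2 per kWh.
SHENZHEN_2024_KG_PER_KWH = 0.4268


def co2_kg(energy_kwh, factor_kg_per_kwh=SHENZHEN_2024_KG_PER_KWH):
    """Convert electrical energy (kWh) to CO2 mass (kg) with a grid emission factor.

    Pass a region- and year-specific factor when working outside Shenzhen 2024.
    """
    if energy_kwh < 0 or factor_kg_per_kwh < 0:
        raise ValueError("energy and factor must be non-negative")
    return energy_kwh * factor_kg_per_kwh
```

For example, 100 kWh at the default factor gives about 42.68 kg CO₂, while the same energy at a hypothetical factor of 0.5 kgCO₂/kWh gives 50 kg.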

8. Ethics & Responsible Use

No personal data. Bus IDs are anonymized; original VINs are retained only by the authors.
GPS resolution is at trip-segment level.
Recommended uses: transportation electrification research, LLM/VLM evaluation, energy modeling, education.
Discouraged uses: re-identifying vehicles or drivers; replacing safety-critical or regulatory carbon disclosures without independent verification.

9. Citation

@inproceedings{evcarbonbench2026,
  title  = {{EV-CarbonBench}: A Comprehensive Benchmark and Domain Dataset
            for Assessing LLMs in Electric Bus Energy and Carbon Analysis},
  author = {Anonymous},
  booktitle = {Advances in Neural Information Processing Systems
               (NeurIPS) Datasets and Benchmarks Track},
  year   = {2026}
}

10. License

Data: CC-BY-4.0. Code under code/: MIT.

11. Contact

During double-blind review, please use the OpenReview discussion. After acceptance, the authors and an issue tracker will be listed here.

Last updated: 2026-05-05 — version 3.1.0
