EV-CarbonBench
A Comprehensive Benchmark and Domain Dataset for Assessing LLMs in Electric Bus Energy and Carbon Analysis
Submitted to NeurIPS 2026 Evaluations & Datasets Track
1. Dataset Summary
EV-CarbonBench is a domain-specific benchmark that pairs real-world electric-bus telemetry with structured evaluation questions designed to probe whether LLMs and VLMs can perform the multi-step reasoning, numerical estimation, multimodal fusion, and structured report generation required for trustworthy electric-bus carbon accounting.
| Item | Details |
|---|---|
| Vehicles | 10 BYD E6 electric buses (anonymized as BUS_001 to BUS_010) |
| Period | 2024-09-11 to 2024-09-20 (10 days) |
| Region | Shenzhen, China |
| Trips | 1,044 |
| Telemetry rows | ~1.8 M (≈1 Hz, 92 columns) |
| Geographic images | 4,176 (4 modalities × 1,044 trips) |
| Evaluation questions | ≈505 across 7 modules |
| License | CC-BY-4.0 (data) / MIT (code) |
⚠️ VIN anonymization. The 10 buses are released under the anonymous identifiers `BUS_001` through `BUS_010`. The mapping to the original Vehicle Identification Numbers (VINs) is retained only by the authors, to protect the bus operator's commercial information.
2. Dataset Structure
```
EV-CarbonBench/
├── README.md                          ← this file
├── croissant.json                     ← machine-readable metadata (NeurIPS 2026 required)
├── LICENSE / LICENSE-CODE
├── trip_rows/                         ← raw telemetry (organised by anonymized bus ID)
│   └── BUS_001/ ... BUS_010/
│       ├── BUS_00X_trip_0001.csv      (one CSV per trip, ~1 Hz × 92 cols)
│       ├── BUS_00X_trip_0002.csv
│       ├── BUS_00X_road_info.csv      (road context, one row per trip)
│       └── BUS_00X_weather_info.csv   (weather context, one row per trip)
├── images/
│   ├── satellite/BUS_00X/BUS_00X_trip_*.png
│   ├── landuse/BUS_00X/...
│   ├── roadnet/BUS_00X/...
│   └── terrain/BUS_00X/...
├── questions/                         ← evaluation questions, one folder per module
│   ├── G1_visual/
│   ├── G2_cot/
│   ├── G3_format/
│   ├── S1a_energy/
│   ├── S1b_physical/
│   ├── S2_reports/
│   └── S3_multimodal/
├── splits/
│   ├── train_trip_ids.txt             (70%)
│   ├── dev_trip_ids.txt               (10%)
│   └── test_trip_ids.txt              (20%)
├── samples/                           ← 5 trips with full data + images for quick review
└── code/
    ├── config.py
    ├── utils.py
    ├── s1_scoring.py
    ├── aggregate_scores.py
    └── run_benchmark_v31.py
```
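The layout above implies a fixed naming pattern, so the files belonging to one trip can be located without scanning directories. The following is a minimal sketch, not part of the released `code/` folder: the helper names are hypothetical, and it assumes that image files reuse the trip CSV stem with a 4-digit, zero-padded trip number.

```python
from pathlib import Path

# Modality folder names as they appear under images/ in the layout above.
MODALITIES = ("satellite", "landuse", "roadnet", "terrain")


def trip_stem(bus_id: str, trip_no: int) -> str:
    """Build the shared file stem, e.g. 'BUS_003_trip_0012' (assumed 4-digit padding)."""
    return f"{bus_id}_trip_{trip_no:04d}"


def trip_csv_path(root: str, bus_id: str, trip_no: int) -> Path:
    """Path of the per-trip telemetry CSV under trip_rows/."""
    return Path(root) / "trip_rows" / bus_id / f"{trip_stem(bus_id, trip_no)}.csv"


def image_paths(root: str, bus_id: str, trip_no: int) -> dict:
    """Map each modality to the PNG path expected for this trip (assumed naming)."""
    stem = trip_stem(bus_id, trip_no)
    return {m: Path(root) / "images" / m / bus_id / f"{stem}.png" for m in MODALITIES}
```

Checking `path.exists()` on the returned paths is a quick way to confirm whether the assumed naming matches a local copy of the dataset.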
3. Telemetry Schema (Selected Columns)
| Group | Columns (Chinese name in raw CSV) |
|---|---|
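| Vehicle ID | VIN (anonymized to BUS_00X), 累计里程 (cumulative mileage), 数据采集时间 (data acquisition time) |
| Powertrain | 总电压 (total voltage), 总电流 (total current), SOC, 绝缘电阻 (insulation resistance), DC-DC 状态 (DC-DC status) |
| Battery cells | 单体电压最高/最低值 (max/min cell voltage), 最高/最低温度值 (max/min temperature) |
| Motor | 驱动电机状态 (drive motor status), 转速 (speed), 转矩 (torque), 温度 (temperature) |
| Geography | 经度 (longitude), 纬度 (latitude), 经过道路 (roads traversed), 经过行政区 (districts traversed) |
| Weather | temperature_c, humidity_pct, wind_speed_kmh, precipitation_mm, ... |
| Road | dominant_road_type, avg_speed_limit_kmh, 快速路/主干道/次干道/支路 占比 (share of expressway / arterial / secondary arterial / branch road) |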
4. Evaluation Modules (v3.1)
General capabilities
- G1 (Visual Recognition, ~60 Q): chart reading and geographic image interpretation
- G3 (Instruction Following, ~40 Q): structured-format compliance
Domain capabilities
- G2 (Domain CoT Reasoning, ~60 Q): multi-step causal reasoning under physical constraints
- S1-a (Energy Numerical Estimation, ~150 Q): segment-level energy / CO₂ estimation with bin-tolerance scoring
- S1-b (Physical Consistency Judgment, ~35 Q): identifying scenarios that violate physical bounds (new in v3.1)
- S2 (Carbon-Accounting Report Generation, 10 reports): six-section daily reports, scored by LLM-as-Judge
- S3 (Multimodal Fusion Gain, ~150 Q): same numerical task as S1-a, but with multimodal context
The final score is a weighted geometric mean of the module scores; see `code/aggregate_scores.py` for the weights and implementation.
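For readers unfamiliar with the aggregation, a weighted geometric mean can be sketched as follows. This is an illustration only: the function name, the clamping constant `eps`, and the example weights are hypothetical, and the authoritative weights and any edge-case handling live in `code/aggregate_scores.py`.

```python
import math


def weighted_geometric_mean(scores: dict, weights: dict, eps: float = 1e-6) -> float:
    """Compute exp(sum(w_i * ln(s_i)) / sum(w_i)) over per-module scores in [0, 1]."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same modules")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Clamp at eps (an assumption of this sketch) so a zero score avoids log(0).
    log_sum = sum(w * math.log(max(scores[m], eps)) for m, w in weights.items())
    return math.exp(log_sum / total_weight)


# Hypothetical scores and equal weights, for illustration only.
example_scores = {"G1": 0.25, "G2": 1.0}
example_weights = {"G1": 1.0, "G2": 1.0}
example_final = weighted_geometric_mean(example_scores, example_weights)
```

Unlike an arithmetic mean, a geometric mean is pulled down sharply by any single weak module, so a model cannot compensate for failing one capability by excelling at another.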
5. Quick Start
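```python
from datasets import load_dataset

# Telemetry: match only the per-trip CSVs. The road_info / weather_info CSVs in
# the same folders have a different schema and would break a combined load.
telemetry = load_dataset("ANONYMIZE/EV-CarbonBench", data_files="trip_rows/*/*_trip_*.csv")

# Or clone everything:
# git clone https://huggingface.co/datasets/ANONYMIZE/EV-CarbonBench
```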
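After cloning, the split files under `splits/` can be read directly. The sketch below assumes each file holds one trip ID per line, which this README does not confirm. The helper names are hypothetical.

```python
from pathlib import Path


def read_trip_ids(path) -> list:
    """Return the non-empty, stripped lines of a split file (assumed: one trip ID per line)."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def check_disjoint(splits: dict) -> None:
    """Raise ValueError if any trip ID appears in more than one split (leakage guard)."""
    seen = {}
    for name, ids in splits.items():
        for trip_id in ids:
            if trip_id in seen and seen[trip_id] != name:
                raise ValueError(f"{trip_id} is in both {seen[trip_id]} and {name}")
            seen[trip_id] = name
```

A typical call would pass `{"train": read_trip_ids("splits/train_trip_ids.txt"), ...}` to `check_disjoint` before running any evaluation.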
6. Datasheet (Gebru et al. 2021)
Motivation. Carbon accounting for electric public transport is a high-stakes engineering task that combines time-series telemetry, multimodal geographic context, physics-based reasoning, and structured reporting. No existing benchmark stresses LLMs along these dimensions simultaneously. EV-CarbonBench fills that gap.
Composition. Each instance is either (a) a 1 Hz telemetry row, (b) a trip-aligned PNG image, or (c) an evaluation Q-A item.
Collection. Telemetry was collected via the buses' on-board T-Box at ~1 Hz under contractual data-sharing agreements; images were programmatically generated from public OpenStreetMap and SRTM tiles; weather data comes from Open-Meteo. Questions were programmatically generated and expert-reviewed (κ > 0.85 on the pilot, ~87% retention).
Pre-processing. Trip segmentation by ignition cycle; deduplication; coordinate transformation to WGS-84; removal of physically impossible values; fixed-distance segmentation into 500 m segments. No imputation on raw telemetry. VINs replaced with BUS_001 through BUS_010.
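To make the fixed-distance segmentation step concrete, here is a minimal sketch that labels each telemetry sample with a 500 m segment index. It assumes distance is accumulated from consecutive WGS-84 coordinates with the haversine formula. The authors' pipeline may instead use the odometer column or another method, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS-84 points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def segment_ids(lats, lons, segment_m=500.0):
    """Assign a 0-based segment index to each sample by cumulative distance travelled."""
    if not lats:
        return []
    ids, travelled = [0], 0.0
    for i in range(1, len(lats)):
        travelled += haversine_m(lats[i - 1], lons[i - 1], lats[i], lons[i])
        ids.append(int(travelled // segment_m))
    return ids
```

Grouping telemetry rows by the returned index then gives one group per segment, which is the granularity at which S1-a poses its energy and CO₂ questions.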
Uses. See croissant.json → rai:dataUseCases. Out-of-scope: training driver-surveillance systems, attempts to re-identify drivers, and deployment to other regions or vehicles without recalibration.
Distribution. Hugging Face dataset repo, CC-BY-4.0.
Maintenance. The authors will maintain the dataset for ≥ 5 years post-publication. Issues and corrections go through the HF discussion tab. Releases are versioned.
7. Limitations & Known Biases
- Geographic: single city (Shenzhen). Results may not transfer to highway-dominant or mountainous regions.
- Temporal: 10 days in September (warm-humid). No winter / cold-start data.
- Vehicle homogeneity: all 10 vehicles are BYD E6.
- Carbon factor: 0.4268 kgCO₂/kWh corresponds to the Shenzhen 2024 grid mix; it is not transferable as-is.
- LLM-as-Judge bias: G2 and S2 use LLM-as-Judge scoring, so their scores may inherit the judge model's biases.
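The carbon-factor limitation is easiest to see with a worked calculation. The sketch below integrates electrical power from the total voltage (总电压) and total current (总电流) columns and multiplies by the stated factor. It assumes volts and amperes, positive current meaning discharge, and a uniform 1 s sampling step. None of these are confirmed by this README, and the function names are hypothetical.

```python
CARBON_FACTOR_KG_PER_KWH = 0.4268  # Shenzhen 2024 grid mix, as stated in the list above

JOULES_PER_KWH = 3.6e6


def trip_energy_kwh(voltage_v, current_a, dt_s=1.0):
    """Integrate P = V * I over time (rectangle rule) and convert joules to kWh.

    Assumed units: volts, amperes, seconds; positive current = battery discharge.
    """
    joules = sum(v * i * dt_s for v, i in zip(voltage_v, current_a))
    return joules / JOULES_PER_KWH


def trip_co2_kg(energy_kwh, factor=CARBON_FACTOR_KG_PER_KWH):
    """Convert energy to CO2 mass; pass a different factor for another grid."""
    return energy_kwh * factor


# One hour at a constant 600 V and 60 A is 36 kW for 1 h, i.e. 36 kWh.
example_energy = trip_energy_kwh([600.0] * 3600, [60.0] * 3600)
example_co2 = trip_co2_kg(example_energy)
```

Because the factor enters as a plain multiplier, applying the benchmark to another grid only requires swapping `factor`. Note that this battery-side energy does not include charging losses, so a grid-side estimate would be somewhat higher.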
8. Ethics & Responsible Use
- No personal data. Bus IDs are anonymized; original VINs are retained only by the authors.
- GPS resolution is at trip-segment level.
- Recommended uses: transportation electrification research, LLM/VLM evaluation, energy modeling, education.
- Discouraged uses: re-identifying vehicles or drivers; replacing safety-critical or regulatory carbon disclosures without independent verification.
9. Citation
```bibtex
@inproceedings{evcarbonbench2026,
  title     = {{EV-CarbonBench}: A Comprehensive Benchmark and Domain Dataset
               for Assessing LLMs in Electric Bus Energy and Carbon Analysis},
  author    = {Anonymous},
  booktitle = {Advances in Neural Information Processing Systems
               (NeurIPS) Datasets and Benchmarks Track},
  year      = {2026}
}
```
10. License
Data: CC-BY-4.0. Code under code/: MIT.
11. Contact
During double-blind review, please use the OpenReview discussion. After acceptance, the authors and an issue tracker will be listed here.
Last updated: 2026-05-05, version 3.1.0