🚌 EV-CarbonBench

A Comprehensive Benchmark and Domain Dataset for Assessing LLMs in Electric Bus Energy and Carbon Analysis

Submitted to NeurIPS 2026 Evaluations & Datasets Track

1. Dataset Summary

EV-CarbonBench is a domain-specific benchmark that pairs real-world electric-bus telemetry with structured evaluation questions designed to probe whether LLMs and VLMs can perform the multi-step reasoning, numerical estimation, multimodal fusion, and structured report generation required for trustworthy electric-bus carbon accounting.

Vehicles: 10 BYD E6 electric buses (anonymized as BUS_001 – BUS_010)
Period: 2024-09-11 to 2024-09-20 (10 days)
Region: Shenzhen, China
Trips: 1,044
Telemetry rows: ~1.8 M (≈1 Hz, 92 columns)
Geographic images: 4,176 (4 modalities × 1,044 trips)
Evaluation questions: ≈505 across 7 modules
License: CC-BY-4.0 (data) / MIT (code)

⚠️ VIN anonymization. The 10 buses are released under anonymous identifiers BUS_001 through BUS_010. The mapping to original Vehicle Identification Numbers (VINs) is retained only by the authors to protect the bus operator's commercial information.

2. Dataset Structure

EV-CarbonBench/
├── README.md                   ← this file
├── croissant.json              ← machine-readable metadata (NeurIPS 2026 required)
├── LICENSE / LICENSE-CODE
├── trip_rows/                  ← raw telemetry (organised by anonymized bus ID)
│   └── BUS_001/  ...  BUS_010/
│       ├── BUS_00X_trip_0001.csv     (one CSV per trip, ~1 Hz × 92 cols)
│       ├── BUS_00X_trip_0002.csv
│       ├── BUS_00X_road_info.csv     (road context, one row per trip)
│       └── BUS_00X_weather_info.csv  (weather context, one row per trip)
├── images/
│   ├── satellite/BUS_00X/BUS_00X_trip_*.png
│   ├── landuse/BUS_00X/...
│   ├── roadnet/BUS_00X/...
│   └── terrain/BUS_00X/...
├── questions/                  ← evaluation questions, one folder per module
│   ├── G1_visual/
│   ├── G2_cot/
│   ├── G3_format/
│   ├── S1a_energy/
│   ├── S1b_physical/
│   ├── S2_reports/
│   └── S3_multimodal/
├── splits/
│   ├── train_trip_ids.txt      (70%)
│   ├── dev_trip_ids.txt        (10%)
│   └── test_trip_ids.txt       (20%)
├── samples/                    ← 5 trips with full data + images for quick review
└── code/
    ├── config.py
    ├── utils.py
    ├── s1_scoring.py
    ├── aggregate_scores.py
    └── run_benchmark_v31.py
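
The layout above suggests that each trip CSV has four companion images sharing its file stem. The following is a minimal sketch of how the paths for one trip could be assembled. It assumes the image files use the same `BUS_00X_trip_NNNN` stem as the CSVs, which the tree implies but does not state. The helper name `trip_assets` is hypothetical and is not part of the released code.

```python
from pathlib import Path

# The four image modalities listed under images/ in the directory tree.
MODALITIES = ("satellite", "landuse", "roadnet", "terrain")


def trip_assets(root, bus_id, trip_no):
    """Build the telemetry CSV path and the four image paths for one trip.

    Assumption: images share the CSV's stem (BUS_00X_trip_NNNN).
    Paths are only constructed here, not checked for existence.
    """
    root = Path(root)
    stem = f"{bus_id}_trip_{trip_no:04d}"
    csv_path = root / "trip_rows" / bus_id / f"{stem}.csv"
    images = {m: root / "images" / m / bus_id / f"{stem}.png" for m in MODALITIES}
    return csv_path, images


csv_path, images = trip_assets("EV-CarbonBench", "BUS_001", 1)
print(csv_path.as_posix())
for modality, path in images.items():
    print(modality, path.as_posix())
```

Calling `path.exists()` on each returned path is a cheap way to confirm the assumed naming against a local clone before relying on it.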

3. Telemetry Schema (Selected Columns)

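| Group | Columns (Chinese name in raw CSV, English gloss in parentheses) |
| --- | --- |
| Vehicle ID | VIN (anonymized to BUS_00X), 累计里程 (cumulative mileage), 数据采集时间 (data collection time) |
| Powertrain | 总电压 (total voltage), 总电流 (total current), SOC, 绝缘电阻 (insulation resistance), DC-DC 状态 (DC-DC status) |
| Battery cells | 单体电压最高/最低值 (max/min cell voltage), 最高/最低温度值 (max/min temperature) |
| Motor | 驱动电机状态 (drive motor status), 转速 (speed), 转矩 (torque), 温度 (temperature) |
| Geography | 经度 (longitude), 纬度 (latitude), 经过道路 (roads traversed), 经过行政区 (districts traversed) |
| Weather | temperature_c, humidity_pct, wind_speed_kmh, precipitation_mm, ... |
| Road | dominant_road_type, avg_speed_limit_kmh, 快速路/主干道/次干道/支路 占比 (share of expressway / arterial / secondary / branch roads) |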
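
The powertrain columns 总电压 (total voltage) and 总电流 (total current) are the natural inputs for a trip-level energy estimate. The sketch below integrates their product over time with a rectangle rule. It rests on assumptions this README does not confirm: voltage in volts, current in amperes, rows exactly 1 s apart, and positive current meaning discharge. The function name `trip_energy_kwh` is hypothetical, and the inline CSV is made-up illustration data, not a dataset sample.

```python
import csv
import io


def trip_energy_kwh(csv_text, dt_s=1.0, v_col="总电压", i_col="总电流"):
    """Integrate V * I over time (rectangle rule) and return energy in kWh.

    Assumptions: volts, amperes, a fixed sampling interval of dt_s seconds.
    Rows with missing values would raise ValueError; handle them upstream.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    joules = 0.0
    for row in reader:
        joules += float(row[v_col]) * float(row[i_col]) * dt_s
    return joules / 3.6e6  # 1 kWh = 3.6e6 J


# Made-up three-row example: 600 V at 10, 20 and 30 A for one second each.
sample = "总电压,总电流\n600,10\n600,20\n600,30\n"
print(trip_energy_kwh(sample))
```

If the sign convention turns out to be the opposite, or regenerative braking should be reported separately, positive and negative products can be summed into two accumulators instead of one.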

4. Evaluation Modules (v3.1)

General capabilities

  • G1 - Visual Recognition (~60 Q): chart reading + geographic image interpretation
  • G3 - Instruction Following (~40 Q): structured-format compliance

Domain capabilities

  • G2 - Domain CoT Reasoning (~60 Q): multi-step causal reasoning under physical constraints
  • S1-a - Energy Numerical Estimation (~150 Q): segment-level energy / CO₂ with bin-tolerance scoring
  • S1-b - Physical Consistency Judgment (~35 Q): identifying scenarios that violate physical bounds (new in v3.1)
  • S2 - Carbon-Accounting Report Generation (10 reports): 6-section daily reports, LLM-as-Judge
  • S3 - Multimodal Fusion Gain (~150 Q): same numerical task as S1-a but with multimodal context

Final score is a weighted geometric mean. See code/aggregate_scores.py.
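
For readers unfamiliar with the aggregation, a weighted geometric mean can be sketched as follows. This is an illustration only: the module weights and scores below are placeholders, and the function name is hypothetical. The actual weights and any special-case handling are defined in code/aggregate_scores.py.

```python
import math


def weighted_geometric_mean(scores, weights):
    """Compute exp(sum(w_i * ln(s_i)) / sum(w_i)) over the modules in `scores`.

    Scores are expected in (0, 1]. A score of 0 on any module with a
    non-zero weight makes the aggregate 0.
    """
    total_w = sum(weights[name] for name in scores)
    if total_w <= 0:
        raise ValueError("weights must sum to a positive value")
    log_sum = 0.0
    for name, score in scores.items():
        w = weights[name]
        if w == 0:
            continue
        if score <= 0:
            return 0.0
        log_sum += w * math.log(score)
    return math.exp(log_sum / total_w)


# Placeholder per-module scores and equal weights (NOT the benchmark's values).
scores = {"G1": 0.80, "G2": 0.60, "G3": 0.90, "S1a": 0.50,
          "S1b": 0.70, "S2": 0.65, "S3": 0.55}
weights = {name: 1.0 for name in scores}
print(round(weighted_geometric_mean(scores, weights), 4))
```

Compared with an arithmetic mean, the geometric form penalizes a model more heavily for one very weak module, so a high aggregate requires reasonable performance everywhere.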

5. Quick Start

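from datasets import load_dataset

# Telemetry: match only the per-trip files. A bare *.csv glob would also pick up
# the per-bus road_info / weather_info files, which the directory tree describes
# as one-row-per-trip context tables rather than 1 Hz telemetry.
telemetry = load_dataset(
    "ANONYMIZE/EV-CarbonBench",
    data_files="trip_rows/*/*_trip_*.csv",
)

# Or clone everything
# git clone https://huggingface.co/datasets/ANONYMIZE/EV-CarbonBench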

6. Datasheet (Gebru et al. 2021)

Motivation. Carbon accounting for electric public transport is a high-stakes engineering task that combines time-series telemetry, multimodal geographic context, physics-based reasoning, and structured reporting. No existing benchmark stresses LLMs along these dimensions simultaneously. EV-CarbonBench fills that gap.

Composition. Each instance is one of: (a) a 1 Hz telemetry row, (b) a trip-aligned PNG image, or (c) an evaluation Q-A item.

Collection. Telemetry collected via the buses' on-board T-Box at ~1 Hz under contractual data-sharing agreements; images programmatically generated from public OpenStreetMap and SRTM tiles; weather from Open-Meteo. Questions programmatically generated and expert-reviewed (κ > 0.85 on pilot, ~87% retention).

Pre-processing. Trip segmentation by ignition cycle; deduplication; coordinate transformation to WGS-84; physically-impossible-value removal; 500 m fixed-distance segmentation. No imputation on raw telemetry. VINs replaced with BUS_001–BUS_010.
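
The 500 m fixed-distance segmentation step can be pictured with the sketch below. It is not the authors' pipeline. It assumes a per-sample cumulative distance already converted to metres and non-decreasing within a trip, and the function name `segment_ids` is hypothetical.

```python
def segment_ids(cum_dist_m, segment_len_m=500.0):
    """Assign each sample a 0-based segment index by distance since the first sample.

    Assumption: cum_dist_m is in metres and non-decreasing within the trip.
    """
    if not cum_dist_m:
        return []
    start = cum_dist_m[0]
    return [int((d - start) // segment_len_m) for d in cum_dist_m]


# Samples at 0, 200, 499, 500 and 1100 m from the trip start.
print(segment_ids([1000.0, 1200.0, 1499.0, 1500.0, 2100.0]))
# [0, 0, 0, 1, 2]
```

Grouping rows by the returned index then yields the per-segment slices on which segment-level energy or CO₂ can be computed.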

Uses. See croissant.json → rai:dataUseCases. Out-of-scope: training driver-surveillance systems, attempts to re-identify drivers, deployment without recalibration to other regions / vehicles.

Distribution. Hugging Face dataset repo, CC-BY-4.0.

Maintenance. Authors maintain ≥ 5 years post-publication. Issues / corrections via the HF discussion tab. Versioned releases.

7. Limitations & Known Biases

  • Geographic - single city (Shenzhen). May not transfer to highway-dominant or mountainous regions.
  • Temporal - 10 days in September (warm-humid). No winter / cold-start data.
  • Vehicle homogeneity - all 10 vehicles are BYD E6.
  • Carbon factor - 0.4268 kg CO₂/kWh corresponds to Shenzhen 2024 grid mix; not transferable as-is.
  • LLM-as-Judge bias - G2 and S2 use LLM-as-Judge scoring.
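
To make the carbon-factor caveat concrete, the sketch below converts energy to emissions with the factor exposed as a parameter, so that it can be replaced when the data are reused for another grid. The function name `co2_kg` is hypothetical, and the alternative factor of 0.60 is an arbitrary placeholder rather than a real grid value.

```python
# Factor stated in this README for the Shenzhen 2024 grid mix (kg CO2 per kWh).
SHENZHEN_2024_KG_PER_KWH = 0.4268


def co2_kg(energy_kwh, factor_kg_per_kwh=SHENZHEN_2024_KG_PER_KWH):
    """Convert electrical energy (kWh) to emissions (kg CO2) with a grid factor."""
    if factor_kg_per_kwh < 0:
        raise ValueError("emission factor must be non-negative")
    return energy_kwh * factor_kg_per_kwh


print(round(co2_kg(100.0), 2))        # 42.68 with the Shenzhen 2024 factor
print(round(co2_kg(100.0, 0.60), 2))  # 60.0 with a placeholder factor
```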

8. Ethics & Responsible Use

  • No personal data. Bus IDs are anonymized; original VINs are retained only by the authors.
  • GPS resolution is at trip-segment level.
  • Recommended uses: transportation electrification research, LLM/VLM evaluation, energy modeling, education.
  • Discouraged uses: re-identifying vehicles or drivers; replacing safety-critical or regulatory carbon disclosures without independent verification.

9. Citation

@inproceedings{evcarbonbench2026,
  title  = {{EV-CarbonBench}: A Comprehensive Benchmark and Domain Dataset
            for Assessing LLMs in Electric Bus Energy and Carbon Analysis},
  author = {Anonymous},
  booktitle = {Advances in Neural Information Processing Systems
               (NeurIPS) Datasets and Benchmarks Track},
  year   = {2026}
}

10. License

Data: CC-BY-4.0. Code under code/: MIT.

11. Contact

During double-blind review, please use the OpenReview discussion. After acceptance, the authors and an issue tracker will be listed here.


Last updated: 2026-05-05 (version 3.1.0)
