Corrupt image in training set

#2
by ghanning - opened

Iterating over the training dataset fails with the following error:

PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7f4b76b54ea0>

It happens at iteration 13077/19677. Not sure which scene but it appears to contain a corrupt image file.

To reproduce:

from datasets import load_dataset

dataset = load_dataset("usm3d/hoho22k_2026_trainval", trust_remote_code=True)["train"]

for data in dataset:
    pass
Urban Scene Modeling Competition CVPR 2026 (Image Track) org

Hi,

Thank you for the report! I am not sure it will be easy for us to update the training set in place: pushing a new version runs into HF size limits, and deleting the dataset and uploading a new one under the same name is complicated by the gated access. For now, the best course of action is probably to wrap the image loading in a try/except.
I apologize for the issue and will try to fix it in the meantime.
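
A minimal sketch of that try/except workaround, not a tested fix for this dataset. It assumes the object returned by `load_dataset(...)["train"]` supports `len()` and integer indexing, and that the decoding error is raised when a sample is accessed. The helper name `iter_skipping_corrupt` and its `skipped` parameter are hypothetical. It catches `OSError` by default because `PIL.UnidentifiedImageError` is a subclass of `OSError`.

```python
from typing import Any, Iterator, List, Optional, Tuple, Type


def iter_skipping_corrupt(
    dataset: Any,
    errors: Tuple[Type[BaseException], ...] = (OSError,),
    skipped: Optional[List[int]] = None,
) -> Iterator[Tuple[int, Any]]:
    """Yield (index, sample) pairs, skipping samples that fail to load.

    Hypothetical helper: `dataset` only needs __len__ and __getitem__.
    Indices of samples that raised one of `errors` are appended to
    `skipped` when a list is passed in.
    """
    for index in range(len(dataset)):
        try:
            # Assumption: image decoding happens here, on item access.
            sample = dataset[index]
        except errors:
            if skipped is not None:
                skipped.append(index)
            continue
        yield index, sample


# Usage with the dataset from the report (assumed, not run here):
#
#   from datasets import load_dataset
#   dataset = load_dataset("usm3d/hoho22k_2026_trainval", trust_remote_code=True)["train"]
#   bad = []
#   for index, data in iter_skipping_corrupt(dataset, skipped=bad):
#       pass
#   print(bad)  # indices of the samples that could not be decoded
```

The sketch indexes samples one by one instead of wrapping `for data in dataset`, because an exception raised while the loop fetches the next item ends the loop and cannot be caught inside the loop body. Collecting the skipped indices should also help identify which scene contains the corrupt file.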

--
Best, Dmytro
