Datasets from the paper "A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness" (arXiv: https://arxiv.org/abs/2603.06594).