What this shows: Each violin is the Bayesian posterior distribution over a policy's true success rate, given the observed rollouts. Wide = high uncertainty; narrow = high certainty. The compact letter display (CLD) letters above each violin summarise which policies are statistically separable after Bonferroni correction. Policies that share a letter are not significantly different.
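To make the width reading concrete, here is a minimal sketch of how a posterior over a success rate narrows as rollouts accumulate. It assumes a uniform Beta(1, 1) prior and a binomial likelihood; the caption does not state which prior was used, and the function name and rollout counts below are hypothetical.

```python
from math import sqrt

def beta_posterior_summary(successes, trials, prior_a=1.0, prior_b=1.0):
    """Mean and standard deviation of a Beta posterior over a success rate.

    Assumes a Beta(prior_a, prior_b) prior (uniform by default), which is
    an illustrative choice and not necessarily the one behind the figure.
    """
    a = prior_a + successes
    b = prior_b + (trials - successes)
    mean = a / (a + b)
    sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd

# Same observed success rate (70%), different numbers of rollouts.
few = beta_posterior_summary(successes=7, trials=10)
many = beta_posterior_summary(successes=70, trials=100)

print(f"10 rollouts:  mean={few[0]:.3f}, sd={few[1]:.3f}")
print(f"100 rollouts: mean={many[0]:.3f}, sd={many[1]:.3f}")
```

Under these assumptions the 10-rollout posterior has a larger standard deviation than the 100-rollout one, which is what a wider violin conveys.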
Violins represent posterior uncertainty, not confidence intervals. Two overlapping violins can still be statistically distinct. Significance is assessed with Barnard's exact test, Bonferroni-corrected for 55 pairwise comparisons; CLD groups are computed at a family-wise α = 0.10, which gives a per-pair threshold of p < 0.10/55 ≈ 0.0018.
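The per-pair threshold follows from simple arithmetic, sketched below. The figure of 11 policies is inferred from the 55 pairwise comparisons (11 choose 2 = 55) rather than stated in the caption, and the function names are hypothetical.

```python
from math import comb

def bonferroni_threshold(n_policies, family_alpha):
    """Return (number of pairwise comparisons, per-pair p-value threshold)."""
    n_pairs = comb(n_policies, 2)
    return n_pairs, family_alpha / n_pairs

def is_separable(p_value, per_pair_alpha):
    """A pair counts as statistically separable only below the corrected threshold."""
    return p_value < per_pair_alpha

# 11 policies is an inference from the 55 comparisons quoted in the caption.
n_pairs, per_pair_alpha = bonferroni_threshold(n_policies=11, family_alpha=0.10)
print(n_pairs, round(per_pair_alpha, 4))  # 55 0.0018

# Illustrative p-values only, not results from the figure.
print(is_separable(0.001, per_pair_alpha))  # True
print(is_separable(0.01, per_pair_alpha))   # False
```

A pair with p = 0.01 would pass an uncorrected 0.10 cut-off but does not clear the corrected threshold, so those two policies would share a CLD letter.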