md896
Harden GRPO generation stability on CUDA: bf16 + eager attention + invalid-logit guards.
948530a