I think SFT would help a lot, as you suspected.
The way I see it, the model is actually succeeding at what CPT is good at (pattern matching). That means somewhere in your dataset there is data that favors White over Yolk, and somewhere else there is data where Yolk is preferred over White. It doesn't even have to be that explicitly stated; the preference could be indirect.
So what I think happens is this:
Short question (100 words) ===> Matches pattern from Q&A sites and FAQ sections (just as an example) ===> This data mentions yolk wins
Long question (500 words) ===> Matches pattern from blog posts and academic articles (also just examples) ===> This data mentions white wins
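If you want to check whether this is really what's going on, here is a minimal sketch of how you could tally the model's answers by prompt length. Everything in it is an assumption for illustration: the `tally_by_length` name, the 300-word threshold, and the `(prompt, answer_label)` pairs, which you would have to produce yourself by running your model and labelling which side it picked.

```python
from collections import Counter, defaultdict


def tally_by_length(samples, threshold_words=300):
    """Count answer labels separately for short and long prompts.

    samples: iterable of (prompt, answer_label) pairs (hypothetical format).
    threshold_words: arbitrary cut-off between "short" and "long".
    """
    counts = defaultdict(Counter)
    for prompt, label in samples:
        # Whitespace split is a rough word count, good enough for bucketing.
        bucket = "short" if len(prompt.split()) < threshold_words else "long"
        counts[bucket][label] += 1
    return {bucket: dict(c) for bucket, c in counts.items()}


# Toy data standing in for real model outputs.
samples = [
    ("word " * 100, "yolk"),
    ("word " * 100, "yolk"),
    ("word " * 100, "white"),
    ("word " * 500, "white"),
    ("word " * 500, "white"),
]
result = tally_by_length(samples)
print(result)
```

If the winner flips between the two buckets on your real outputs, that would be consistent with the length-triggered pattern matching described above, though it wouldn't prove where in the data it comes from.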
So besides cleaning up the data, which is really kind of out of scope because you'd be babysitting your data for every possible length/answer combination, I think SFT will help.
With SFT, the model doesn't just learn the patterns but also what humans prefer, which here is consistency across length. It's basically statistical correlation with CPT vs. behavioral alignment with SFT.
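As a rough sketch of what "consistency across length" could look like in the SFT data itself: pair a short and a long phrasing of the same question with the same target answer. The `build_consistency_examples` helper and the `prompt`/`completion` field names are assumptions for illustration, not a format any particular trainer requires; adapt them to whatever your SFT tooling expects.

```python
import json


def build_consistency_examples(short_question, long_question, answer):
    """Return two SFT records that share one target answer.

    The "prompt"/"completion" keys are a hypothetical schema.
    """
    return [
        {"prompt": short_question, "completion": answer},
        {"prompt": long_question, "completion": answer},
    ]


# Placeholder texts; in practice these are your own questions and answer.
records = build_consistency_examples(
    "Short version of the question.",
    "Long version of the same question, with much more background detail.",
    "The single answer you want regardless of length.",
)

# One JSON object per line (JSONL), a common layout for fine-tuning data.
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)
```

The idea is only that both lengths point at the same completion, so length stops being a usable signal for which answer to give.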
There's also a thing called attention drift that you may want to look into; it could be relevant here.