Some clarity is emerging:

The distribution of response lengths has shifted considerably in 3.6, and 2 of my tasks no longer fit into 16k, so the ignorance zone blows up.
Re-running at 32k, then we'll see if that extra thinking pays off or nah.
An interesting outlier here is the word-sort task, where 3.6 thinks ~half as much, and this costs it about 10pp of performance.
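
For anyone wondering what I mean by the ignorance zone, here is a rough sketch. It assumes the zone is the gap between scoring every truncated response as wrong and scoring every one as right. The function name and the "correct" / "wrong" / "truncated" labels are made up for illustration, not what my harness actually uses.

```python
def accuracy_bounds(results):
    """Return (lower, upper) accuracy bounds for a list of outcomes.

    Each outcome is one of the hypothetical labels "correct", "wrong",
    or "truncated" (response hit the length limit before answering).
    """
    n = len(results)
    if n == 0:
        raise ValueError("need at least one result")
    correct = results.count("correct")
    truncated = results.count("truncated")
    lower = correct / n                 # assume every truncated response is wrong
    upper = (correct + truncated) / n   # assume every truncated response is right
    return lower, upper


# The width of the zone is just the truncated fraction.
lower, upper = accuracy_bounds(["correct", "wrong", "truncated", "truncated"])
print(f"accuracy is somewhere in [{lower:.2f}, {upper:.2f}]")
```

The more responses run past the limit, the wider that interval gets, which is why a higher limit should shrink it.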