Cool work, I've been quite excited about AllenAI's take on an improved hybrid arch. Question for you, though:
The one genuinely matched-data comparison in the paper is the 1B ladder (transformer / hybrid / pure-RNN, identical mix), which you use for the 6 filtered-loss eval, but only as aggregate loss, not the POS/bracket/copy decomposition. Since the decomposition only needs forward passes on the released checkpoints, have you run (or could you run) the same tag-stratified analysis on those models? It'd help show whether the content-word / open-close / copy structure survives when data is actually held constant (unlike the ~7B case).
Curious if you've looked at this internally as well.
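
For concreteness, here's roughly the aggregation I have in mind, as a minimal sketch and not your actual pipeline. It assumes you already have per-token losses from a forward pass of each 1B model over the same text, plus one tag per token (POS, open/close bracket, copy, etc.), and that the models share a tokenizer so the tags line up. The names `loss_by_tag` and `gap_by_tag` are made up for illustration.

```python
from collections import defaultdict


def loss_by_tag(token_losses, token_tags):
    """Mean per-token loss within each tag bucket.

    token_losses: per-token losses from one model's forward pass.
    token_tags: one label per token (e.g. "NOUN", "CLOSE", "COPY").
    """
    if len(token_losses) != len(token_tags):
        raise ValueError("need exactly one tag per token loss")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for loss, tag in zip(token_losses, token_tags):
        sums[tag] += loss
        counts[tag] += 1
    return {tag: sums[tag] / counts[tag] for tag in sums}


def gap_by_tag(losses_a, losses_b, token_tags):
    """Per-tag mean loss of model A minus model B on the same tokens.

    Assumes both loss lists are aligned to the same tokenization.
    """
    mean_a = loss_by_tag(losses_a, token_tags)
    mean_b = loss_by_tag(losses_b, token_tags)
    return {tag: mean_a[tag] - mean_b[tag] for tag in mean_a}
```

Running `gap_by_tag` for transformer vs. hybrid and hybrid vs. pure-RNN on the matched-data ladder would show whether the per-tag gaps look like the ones you report at larger scale.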