Abstract
CHIAR-Former uses spectral entropy-based routing to dynamically select between DCT, RBF, and self-attention operators, achieving improved efficiency on large text datasets while maintaining performance through hybrid attention mechanisms.
Standard transformers apply self-attention uniformly at every layer and token, regardless of whether the input requires dynamic cross-token interaction. We propose CHIAR-Former (Chiaroscuro Attention), a 4-layer hybrid transformer that routes each token to one of three operators (DCT spectral mixing, RBF kernel mixing, or full self-attention) based on per-token spectral entropy, a theoretically justified complexity signal. Through systematic ablation on WikiText-103, we discover routing collapse: the router consistently rejects RBF in favour of DCT and attention, revealing that spectral mixing and dynamic attention are complementary and sufficient. A purpose-designed DCT+Attention-only variant achieves a validation perplexity (PPL) of 36.54 on WikiText-103, 45% lower than a full-attention baseline (PPL 66.62), with 62.5% fewer attention FLOPs. We extend evaluation to WikiText-2, IMDB sentiment classification, and synthetic ListOps operations, establishing a clear operating regime: CHIAR-Former excels on large-scale naturalistic text where token diversity supports spectral specialisation, while full attention retains an edge on small datasets and synthetic pattern-matching tasks. These findings, both the wins and the losses, together define when and why spectral routing earns its keep.
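As a rough illustration of the routing signal described above, here is a minimal Python sketch of entropy-based routing between DCT mixing and attention. It is not the authors' implementation. It assumes the entropy is the normalised Shannon entropy of the DCT power spectrum of a single token vector, that low entropy is sent to the DCT operator and high entropy to attention, and that a single threshold `tau` separates the two. The names `dct2`, `spectral_entropy`, `route`, and the default `tau=0.5` are hypothetical.

```python
import math


def dct2(x):
    """Unnormalised DCT-II of a 1-D sequence (O(n^2), for illustration only)."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def spectral_entropy(x, eps=1e-12):
    """Normalised Shannon entropy of the DCT power spectrum, in [0, 1].

    Assumption: the paper's exact definition is not given in the abstract;
    this is one common way to define spectral entropy.
    """
    if len(x) < 2:
        raise ValueError("need at least two values to normalise the entropy")
    power = [c * c for c in dct2(x)]
    total = sum(power) + eps  # eps guards against an all-zero input
    probs = [p / total for p in power]
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(x))


def route(x, tau=0.5):
    """Send low-entropy tokens to DCT mixing, the rest to attention.

    Assumption: both the direction of the rule and tau=0.5 are illustrative.
    """
    return "dct" if spectral_entropy(x) < tau else "attention"


if __name__ == "__main__":
    smooth = [1.0] * 8          # energy concentrated in one DCT coefficient
    spiky = [1.0] + [0.0] * 7   # energy spread across many coefficients
    print(route(smooth), route(spiky))
```

A constant vector puts all of its energy in the zero-frequency coefficient, so its entropy is near zero, while a single spike spreads energy across the spectrum and scores much higher. The two-operator form mirrors the DCT+Attention-only variant; a three-operator router would need a second threshold, whose placement the abstract does not specify.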
Community
A few threads I'd love to explore with the community:
Routing collapse as a design principle — the router rejected RBF entirely and the purpose-built DCT+Attn variant beat all 3-operator versions. Has anyone seen analogous collapse behaviour in MoE or other multi-operator architectures that turned out to be informative rather than a failure?
Scaling hypothesis — at 17M params the spectral-then-dynamic pipeline emerges naturally from routing. My conjecture for a 12-layer BERT-sized model: first 3–4 layers route to DCT, middle layers mix, later layers go full attention. Would love to hear intuitions on whether this holds.
The ListOps failure is intentional — DCT blurs the sharp integer boundaries that symbolic rule-following needs. If you're working on hybrid architectures that preserve spectral routing for naturalistic tokens while handling discrete pattern-matching, I'd love to compare notes.
Tau calibration — thresholds are currently post-training and corpus-specific. Online adaptation during training feels like the right direction. Any pointers to related work on dynamic threshold learning?
All code, training configs, and ablation checkpoints available on request.
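On the tau calibration thread, here is a minimal sketch of what post-training, corpus-specific calibration could look like, plus one naive online variant. Everything here is an assumption for discussion rather than the method used in the paper: `calibrate_tau` picks the threshold as an empirical quantile of per-token entropies so that a chosen fraction of tokens goes to attention, and `update_tau` nudges a running threshold toward each batch's quantile. The function names, `attention_fraction`, and `lr` are hypothetical.

```python
def calibrate_tau(entropies, attention_fraction):
    """Pick tau so roughly `attention_fraction` of tokens have entropy >= tau.

    Assumption: tokens with entropy >= tau are routed to attention.
    """
    if not entropies:
        raise ValueError("need at least one entropy value")
    if not 0.0 < attention_fraction < 1.0:
        raise ValueError("attention_fraction must be strictly between 0 and 1")
    ordered = sorted(entropies)
    # Index of the first (lowest-entropy) token that should go to attention.
    cut = int(round(len(ordered) * (1.0 - attention_fraction)))
    cut = min(max(cut, 0), len(ordered) - 1)
    return ordered[cut]


def update_tau(tau, batch_entropies, attention_fraction, lr=0.1):
    """Naive online variant: move tau a step toward the batch quantile."""
    target = calibrate_tau(batch_entropies, attention_fraction)
    return tau + lr * (target - tau)


if __name__ == "__main__":
    ents = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
    tau = calibrate_tau(ents, attention_fraction=0.25)
    print(tau, sum(e >= tau for e in ents))
```

The quantile form makes the attention budget explicit, which is convenient when the goal is a fixed share of attention FLOPs. The online update is only a moving average of that quantile and ignores the interaction with the model's own training dynamics.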