arxiv:2606.08327

Chiaroscuro Attention: Spending Compute in the Dark

Published on Jun 6

Submitted by Prateek Sikdar on Jun 9

Accenture

Abstract

CHIAR-Former uses spectral entropy-based routing to dynamically select between DCT, RBF, and self-attention operators, achieving improved efficiency on large text datasets while maintaining performance through hybrid attention mechanisms.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Standard transformers apply self-attention uniformly at every layer and token, regardless of whether the input requires dynamic cross-token interaction. We propose CHIAR-Former (Chiaroscuro Attention), a 4-layer hybrid transformer that routes each token to one of three operators (DCT spectral mixing, RBF kernel mixing, or full self-attention) based on per-token spectral entropy, a theoretically justified complexity signal. Through systematic ablation on WikiText-103, we discover routing collapse: the router consistently rejects RBF in favour of DCT and attention, revealing that spectral mixing and dynamic attention are complementary and sufficient. A purpose-designed DCT+Attention-only variant achieves Val PPL 36.54 on WikiText-103, a 45% reduction in perplexity relative to a full-attention baseline (PPL 66.62), at 62.5% fewer attention FLOPs. We extend evaluation to WikiText-2, IMDB sentiment classification, and synthetic ListOps operations, establishing a clear operating regime: CHIAR-Former excels on large-scale naturalistic text where token diversity supports spectral specialisation, while full attention retains an edge on small datasets and synthetic pattern-matching tasks. These findings, both the wins and the losses, together define when and why spectral routing earns its keep.
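To make the routing idea concrete, here is a minimal sketch of per-token spectral-entropy routing. It is not the paper's implementation. Several things are assumptions made for illustration: that entropy is taken over the DCT-II power spectrum of a token's feature vector, that it is normalised to [0, 1], that low entropy maps to DCT and high entropy to attention, and the names `spectral_entropy`, `route`, `tau_low`, and `tau_high` along with their default values.

```python
import math

def dct2(x):
    """Unnormalised DCT-II of a real sequence (O(n^2), fine for a sketch)."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]

def spectral_entropy(x, eps=1e-12):
    """Shannon entropy of the normalised DCT power spectrum, scaled to [0, 1].

    ASSUMPTION: the abstract does not say which transform or normalisation
    is used; this is one plausible definition.
    """
    n = len(x)
    if n < 2:
        return 0.0  # a single coefficient carries no spectral spread
    power = [c * c for c in dct2(x)]
    total = sum(power) + eps  # eps avoids division by zero for all-zero input
    probs = [p / total for p in power]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(n)  # log(n) is the maximum possible entropy

def route(x, tau_low=0.3, tau_high=0.7):
    """Pick an operator name for one token (hypothetical thresholds)."""
    h = spectral_entropy(x)
    if h < tau_low:
        return "dct"        # energy concentrated in few frequencies
    if h < tau_high:
        return "rbf"        # intermediate band
    return "attention"      # energy spread across many frequencies
```

Under these assumptions, a constant vector such as `[1.0] * 8` puts all its energy in the zero-frequency coefficient and routes to `"dct"`, while a one-hot vector spreads energy across the spectrum and lands above `tau_high`, routing to `"attention"`.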

Community

prateeksikdar

Paper submitter about 10 hours ago

A few threads I'd love to explore with the community:

Routing collapse as a design principle: the router rejected RBF entirely, and the purpose-built DCT+Attn variant beat all 3-operator versions. Has anyone seen analogous collapse behaviour in MoE or other multi-operator architectures that turned out to be informative rather than a failure?
Scaling hypothesis: at 17M params the spectral-then-dynamic pipeline emerges naturally from routing. My conjecture for a 12-layer BERT-sized model: the first 3–4 layers route to DCT, the middle layers mix, and the later layers go full attention. Would love to hear intuitions on whether this holds.
The ListOps failure is intentional: DCT blurs the sharp integer boundaries that symbolic rule-following needs. If you're working on hybrid architectures that preserve spectral routing for naturalistic tokens while handling discrete pattern-matching, I'd love to compare notes.
Tau calibration: thresholds are currently set after training and are corpus-specific. Online adaptation during training feels like the right direction. Any pointers to related work on dynamic threshold learning?
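On the tau calibration point, one simple way to frame both the offline and the online case is as quantile estimation over observed entropies. The sketch below is an illustration only, not the calibration used for the paper. The function names, the single-threshold DCT+Attention setup, the `attn_fraction` target, and the learning rate are all assumptions.

```python
def calibrate_tau(entropies, attn_fraction):
    """Offline: choose tau so that roughly attn_fraction of the observed
    tokens have entropy >= tau (and would be sent to attention).

    ASSUMPTION: a single threshold separating DCT from attention.
    """
    ordered = sorted(entropies)
    cut = int(round((1.0 - attn_fraction) * len(ordered)))
    cut = min(max(cut, 0), len(ordered) - 1)  # clamp to a valid index
    return ordered[cut]

def update_tau(tau, entropy, attn_fraction, lr=0.01):
    """Online: one stochastic-approximation step toward the same quantile.

    If the token would go to attention (entropy >= tau), tau rises by
    lr * (1 - attn_fraction); otherwise it falls by lr * attn_fraction.
    At equilibrium the two moves balance when attn_fraction of tokens
    exceed tau.
    """
    exceeded = 1.0 if entropy >= tau else 0.0
    return tau + lr * (exceeded - attn_fraction)

# Usage: calibrate on held-out entropies, then keep adapting on a stream.
held_out = [i / 100 for i in range(100)]       # hypothetical entropy values
tau = calibrate_tau(held_out, attn_fraction=0.25)
for h in held_out:
    tau = update_tau(tau, h, attn_fraction=0.25)
```

The online rule tracks a running quantile, so it follows drift in the entropy distribution during training without storing past values. Whether such drift interacts badly with the router's own learning is an open question this sketch does not address.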

All code, training configs, and ablation checkpoints are available on request.


