Title: SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning

URL Source: https://arxiv.org/html/2012.09852

Hanrui Wang, Zhekai Zhang, Song Han [https://github.com/mit-han-lab/spatten](https://github.com/mit-han-lab/spatten)

###### Abstract
The attention mechanism is becoming increasingly popular in Natural Language Processing (NLP) applications, showing superior performance to convolutional and recurrent architectures. However, attention has become the computation bottleneck because of its quadratic computational complexity with respect to input length, complicated data movement, and low arithmetic intensity. Moreover, existing NN accelerators mainly focus on optimizing convolutional or recurrent models and cannot efficiently support attention. In this paper, we present SpAtten, an efficient _algorithm-architecture co-design_ that leverages token sparsity, head sparsity, and quantization opportunities to reduce attention computation and memory access. Inspired by the high redundancy of human languages, we propose the novel _cascade token pruning_ to prune away unimportant tokens in the sentence. We also propose _cascade head pruning_ to remove unessential heads. Cascade pruning is fundamentally different from weight pruning since there is no trainable weight in the attention mechanism, and the pruned tokens and heads are selected on the fly. To efficiently support them in hardware, we design a novel _top-k engine_ to rank token and head importance scores with high throughput. Furthermore, we propose _progressive quantization_, which first fetches MSBs only and performs the computation; if the confidence is low, it fetches LSBs and recomputes the attention outputs, trading computation for memory reduction.
Extensive experiments on 30 benchmarks show that, on average, SpAtten reduces DRAM access by 10.0× with no accuracy loss, and achieves 1.6×, 3.0×, 162×, 347× speedup, and 1.4×, 3.2×, 1193×, 4059× energy savings over the A³ accelerator, MNNFast accelerator, TITAN Xp GPU, and Xeon CPU, respectively.
###### Index Terms:
Natural Language Processing; Attention; Domain-Specific Accelerator; Algorithm-Architecture Co-design; Pruning; Quantization
| I Introduction | |
| -------------- | |
Natural Language Processing (NLP) has witnessed rapid progress in recent years, driven by the attention mechanism[[1](https://arxiv.org/html/2012.09852v3#bib.bib1)]. Attention models such as Transformer[[1](https://arxiv.org/html/2012.09852v3#bib.bib1)], BERT[[2](https://arxiv.org/html/2012.09852v3#bib.bib2)], and GPT-2[[3](https://arxiv.org/html/2012.09852v3#bib.bib3)] provide significant performance improvements over models based on Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). BERT[[2](https://arxiv.org/html/2012.09852v3#bib.bib2)] even outstrips human performance on the challenging sentence classification[[4](https://arxiv.org/html/2012.09852v3#bib.bib4)] tasks.
Unfortunately, the high accuracy comes at the expense of efficiency. Attention runs extremely slowly on general-purpose platforms due to its quadratic computational complexity with respect to input length, complex data movement, and low arithmetic intensity. For instance, to generate a sentence with only 30 tokens, a GPT-2 model takes a total of 370 ms on a TITAN Xp GPU to perform attention inference. That is two orders of magnitude slower than MobileNet-V2, which takes only 6 ms to classify an image. On the resource-limited Raspberry Pi ARM CPU, attention costs 43 s, making interactive dialog applications impossible. The inefficiency is even more severe for modern Large Language Models (LLMs) with long context lengths and large KV caches. The efficiency barrier prevents attention models from being deployed on mobile devices. Many accelerators have been proposed to accelerate CNN and RNN, but they cannot be easily applied to attention due to its distinct operations.
Figure 1: Cascade token and head pruning removes redundant tokens and heads globally across layers. Evaluated with BERT-Base on the SST-2 dataset.
In this paper, we propose SpAtten (homophonic with _spartan_, meaning simple and frugal; this is analogous to _token and head pruning_, which makes sentences shorter and simpler), an algorithm-architecture co-design to enable efficient attention inference. We propose three algorithmic optimizations: _cascade token pruning_, _cascade head pruning_, and _progressive quantization_ to reduce computation and memory access. Different from conventional techniques, pruning is applied to the tokens and heads, not to weights. Cascade means that once a token/head is pruned, it is removed in all following layers, so one layer only needs to process the tokens/heads remaining from previous layers. The deeper the layer, the more tokens/heads are pruned. The three techniques are input-dependent since the pruned computation and bit-widths are _adaptive_ to input instances. Cascade pruning requires sorting token/head importance scores on the fly. Thus we design a hardware architecture with high-parallelism top-k engines for token/head selection, a specialized memory hierarchy, and a fully-pipelined datapath to translate theoretical savings into real speedup and energy reduction.
The inputs of attention are Query (Q), Key (K), and Value (V), each split into multiple heads. Attention probabilities are then computed as the softmax of Q × K^T. Multiplying the attention probabilities with V gives the result of one head; concatenating all heads together gives the attention output. The arithmetic intensity of attention in the generation stage is low: only two operations per data element (0.5 ops/Byte) for the vector-matrix multiplication (Q × K^T). Generation takes the largest part of overall latency in GPT-2 models (97% when generating 32 tokens); thus, the overall performance is memory-bounded. For BERT, the overall performance is computation-bounded.
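The 0.5 ops/Byte figure can be reproduced with a back-of-envelope calculation. The sketch below uses illustrative sizes (a 64-dimensional head, 1024 cached keys, 32-bit operands), which are our assumptions and not the paper's exact profiling setup:

```python
# Back-of-envelope arithmetic intensity of Q x K^T in the generation stage,
# where Q is a single query vector and K holds all cached keys.
# Sizes below are illustrative assumptions, not the paper's exact setup.
D = 64              # dimension per head
L = 1024            # number of cached keys
bytes_per_el = 4    # 32-bit operands

macs = L * D                                   # multiply-accumulates
ops = 2 * macs                                 # 1 multiply + 1 add each
bytes_moved = (D + L * D + L) * bytes_per_el   # load q and K, store scores

intensity = ops / bytes_moved                  # dominated by 2LD / 4LD = 0.5
print(f"{intensity:.2f} ops/Byte")
```

Since the L × D key matrix dominates the traffic, the ratio approaches 2LD ops / 4LD bytes = 0.5 ops/Byte, far below what is needed to keep compute units busy.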
Therefore, we propose _cascade token pruning_, as shown in Figure[1](https://arxiv.org/html/2012.09852v3#S1.F1 "Figure 1 ‣ I Introduction ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"), to reduce both DRAM access and computation. Inspired by the observation that human languages are highly redundant, containing many structural and meaningless tokens such as prepositions, articles, and adverbs, we can safely remove unimportant tokens with little impact on the results. Moreover, attention uses many heads to capture various dependencies, but some of them are also redundant[[5](https://arxiv.org/html/2012.09852v3#bib.bib5)]. Therefore, we also propose _cascade head pruning_, which determines the importance of heads based on their influence on the outputs and then prunes unessential heads. Cascade token and head pruning are fundamentally different from classic weight pruning and classic head pruning because: (i) there is no trainable weight in attention; (ii) conventionally, pruned weights and heads are determined at compile time and are the same for all inputs, whereas the tokens and heads to be pruned in SpAtten are selected on the fly and vary between inputs. Token pruning is also different from classic activation pruning because it depends on attention probabilities, not activation magnitudes. Specifically, we prune tokens according to cumulative token importance scores obtained by accumulating attention probabilities (indicators of token influence) across layers. Since long sentences are naturally more redundant, we also adjust the pruning ratio based on sentence length: the longer the sentence, the more tokens are pruned away. Moreover, heads are pruned according to cumulative head importance scores, which are computed by accumulating each head's output magnitude across layers.
To support cascade token and head pruning, we design and implement a specialized high-parallelism top-k engine with O(n) time complexity to find the k most important tokens or heads. On average, token and head pruning reduce DRAM access and computation by 3.8× and 1.1× on eight GPT-2 models.
To further reduce DRAM access, we also propose _progressive quantization_ for attention inputs. We find an interesting phenomenon: quantization errors are related to attention probability distributions. If a few tokens dominate the distribution, the quantization error is small, and only MSBs are needed; for a flat distribution, the error is large, and both LSBs and MSBs are needed. We also provide a theoretical proof of this phenomenon in Section[III-D](https://arxiv.org/html/2012.09852v3#S3.SS4 "III-D Progressive Quantization ‣ III Algorithmic Optimizations ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"). Based on this observation, we quantize more aggressively for attention with dominated probability distributions and more conservatively for others. Concretely, we first fetch the MSBs of attention inputs to compute the attention probabilities. If the maximum probability is smaller than a threshold, indicating a flat distribution, we fetch the LSBs on-chip and recompute the attention probabilities. In this way, we trade computation for less memory access, which is beneficial for memory-bounded models. With progressive quantization, we can save another 5.1× memory access.
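In software, the MSB-first policy can be sketched as follows. This is a simplified NumPy model of the decision rule, not the hardware datapath; the helper name `progressive_softmax` and the `threshold` value are illustrative assumptions:

```python
import numpy as np

def progressive_softmax(scores_msb, scores_full, threshold=0.1):
    """Sketch of progressive quantization (hypothetical helper): compute
    attention probabilities from MSB-only scores first; refetch LSBs and
    recompute only when the distribution is too flat, i.e. the maximum
    probability falls below the confidence threshold."""
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    prob = softmax(scores_msb)          # cheap pass with MSBs only
    if prob.max() < threshold:          # flat distribution -> low confidence
        prob = softmax(scores_full)     # fetch LSBs, recompute exactly
        refetched = True
    else:
        refetched = False
    return prob, refetched
```

A peaked score vector passes the confidence check on the first (MSB-only) pass, so its LSBs never leave DRAM; only flat distributions pay for the second pass.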
Previous state-of-the-art attention accelerators A³[[6](https://arxiv.org/html/2012.09852v3#bib.bib6)] and MNNFast[[7](https://arxiv.org/html/2012.09852v3#bib.bib7)] also leverage sparsity. However, they have three main limitations. (i) A³ and MNNFast need to fetch everything from DRAM before calculating what can be pruned. Thus the overhead is already paid, and no DRAM access is reduced. They only optimize computation-bounded discriminative models and cannot accelerate memory-bounded generative models. SpAtten not only improves computation-bounded discriminative models (BERT) but also solves the challenge of memory-bounded generative ones (GPT-2). It significantly reduces QKV DRAM access with token pruning (3.8×), head pruning (1.1×), and progressive quantization (5.1×). (ii) Head sparsity is an opportunity to further reduce DRAM access and computation, but A³ and MNNFast do not support head pruning. (iii) A³ prunes the QKV vectors of a token only locally in one head, and MNNFast only prunes V vectors locally. Therefore, they reduce only the attention layer's computation, not that of the Feed-Forward Network (FFN) layers. SpAtten prunes tokens _globally_: once a token is pruned, the involved computations in _all_ following layers are skipped. Therefore, the computations in FFN layers are also reduced in SpAtten.
SpAtten has a wide support range thanks to the generalization ability of attention-based NLP models. For instance, BERT can be used for arbitrary discriminative tasks, such as sentence sentiment classification and sentence similarity regression, for which the backbone of BERT is the same and only the last layer needs to be changed. Likewise, GPT-2 can handle all generative tasks, including language modeling, document summarization, etc.
Figure 2: End-to-end GPT-2 latency breakdown on various platforms, and attention latency breakdown on a TITAN Xp GPU. Attention accounts for over 50% of total latency. Data movements account for 73% of attention latency.
In summary, SpAtten performs algorithm-architecture co-design for sparse and quantized attention computing while preserving the accuracy. It makes four contributions:
*   **Cascade Token Pruning** removes unimportant tokens according to cumulative token importance scores, reducing DRAM access and computation by up to 3.8×.
*   **Cascade Head Pruning** removes unimportant heads, saving another 1.1× in DRAM access and computation.
*   **Progressive Quantization** trades a little more computation for less memory access. We change the bitwidths of different attention heads and layers based on the attention probability distribution, reducing DRAM access by 5.1×.
*   **Specialized High-Parallelism top-k Engine** with O(n) time complexity to efficiently support on-the-fly token and head selection.
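Algorithmically, average-case O(n) top-k selection can be achieved with randomized quickselect. The sketch below is a software analogue of the idea only; the actual hardware engine uses a parallel partition pipeline rather than this sequential code, and the function name is ours:

```python
import random

def topk_indices(scores, k):
    """Return indices of the k largest scores (order unspecified) in
    average-case O(n) time via randomized quickselect. A software
    analogue of a high-throughput top-k engine, not the RTL itself."""
    idx = list(range(len(scores)))
    lo, hi = 0, len(idx)
    while hi - lo > 1 and lo < k < hi:
        pivot = scores[random.choice(idx[lo:hi])]
        # Three-way partition of the active window around the pivot value.
        greater = [i for i in idx[lo:hi] if scores[i] > pivot]
        equal   = [i for i in idx[lo:hi] if scores[i] == pivot]
        less    = [i for i in idx[lo:hi] if scores[i] < pivot]
        idx[lo:hi] = greater + equal + less
        if k <= lo + len(greater):
            hi = lo + len(greater)          # k-th boundary is in 'greater'
        elif k <= lo + len(greater) + len(equal):
            return idx[:k]                  # boundary falls inside 'equal'
        else:
            lo = lo + len(greater) + len(equal)
    return idx[:k]
```

Unlike a full sort (O(n log n)), quickselect only partitions around the k-th boundary, which is why a selection-based engine can sustain one ranking per layer without becoming the bottleneck.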
We extensively evaluate SpAtten on 30 benchmarks, including the GLUE set[[4](https://arxiv.org/html/2012.09852v3#bib.bib4)], SQuAD[[8](https://arxiv.org/html/2012.09852v3#bib.bib8)], Wikitext-2[[9](https://arxiv.org/html/2012.09852v3#bib.bib9)], Wikitext-103[[9](https://arxiv.org/html/2012.09852v3#bib.bib9)], Penn Treebank[[10](https://arxiv.org/html/2012.09852v3#bib.bib10)], and Google One-Billion Word[[11](https://arxiv.org/html/2012.09852v3#bib.bib11)], with BERT and GPT-2 models. SpAtten reduces DRAM access by 10.0× with _no accuracy loss_, and achieves 1.6×, 3.0×, 162×, 347×, 1095×, 5071× speedup, and 1.4×, 3.2×, 1193×, 4059×, 406×, 1910× energy savings over the A³ accelerator, MNNFast accelerator, TITAN Xp GPU, Xeon CPU, Nano GPU, and Raspberry Pi ARM CPU, respectively.
| II Background and Motivation | |
| ---------------------------- | |
Figure 3: NLP model architecture with attention. BERT only contains the summarization stage. GPT-2 contains summarization and generation stages.
    Input: Q_in ∈ ℝ^(L0×D_in), K_in ∈ ℝ^(L1×D_in), V_in ∈ ℝ^(L1×D_in);
           number of heads: h
    Split Q_in, K_in, V_in into h chunks:
        Q ∈ ℝ^(h×L0×D), K ∈ ℝ^(h×L1×D), V ∈ ℝ^(h×L1×D), where D = D_in / h
    for head_id = 0 to h do
        attention_score ∈ ℝ^(L0×L1)
        attention_score = Q[head_id] · K[head_id]^T
        attention_score = attention_score / sqrt(D)
        attention_prob ∈ ℝ^(L0×L1)
        for row_id = 0 to L0 do
            attention_prob[row_id] = Softmax(attention_score[row_id])
        end for
        E[head_id] = attention_prob · V[head_id]
    end for
    Concatenate heads of E ∈ ℝ^(h×L0×D) as output
    Output: attention_out ∈ ℝ^(L0×D_in)

Algorithm 1: Attention
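Algorithm 1 maps directly onto a few matrix operations. The following NumPy sketch (ours, for illustration; dimension names follow the algorithm) implements it:

```python
import numpy as np

def attention(Q_in, K_in, V_in, h):
    """Multi-head attention as in Algorithm 1 (NumPy sketch).
    Q_in: (L0, D_in); K_in, V_in: (L1, D_in); h: number of heads."""
    L0, D_in = Q_in.shape
    L1 = K_in.shape[0]
    D = D_in // h
    # Split each input into h chunks of width D: shape (h, L, D).
    Q = Q_in.reshape(L0, h, D).transpose(1, 0, 2)
    K = K_in.reshape(L1, h, D).transpose(1, 0, 2)
    V = V_in.reshape(L1, h, D).transpose(1, 0, 2)

    score = Q @ K.transpose(0, 2, 1) / np.sqrt(D)   # (h, L0, L1)
    score -= score.max(axis=-1, keepdims=True)      # numerical stability
    prob = np.exp(score)
    prob /= prob.sum(axis=-1, keepdims=True)        # row-wise softmax
    E = prob @ V                                    # (h, L0, D)
    return E.transpose(1, 0, 2).reshape(L0, D_in)   # concatenate heads
```

With an all-zero query, every row of `prob` is uniform, so each output row reduces to the column mean of V, which gives a quick sanity check on the softmax and head concatenation.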
### II-A Background
Attention-Based NLP Models. NLP tasks can be categorized into two types: discriminative and generative. For discriminative ones, the model needs to summarize the input information and make predictions. Discriminative tasks include token-level classification[[12](https://arxiv.org/html/2012.09852v3#bib.bib12)], sentence-level classification and regression[[13](https://arxiv.org/html/2012.09852v3#bib.bib13)], etc. Meanwhile, models for generative tasks need to first summarize the input information and then generate new tokens. Exemplary generative tasks include Language Modeling (LM)[[3](https://arxiv.org/html/2012.09852v3#bib.bib3)] and machine translation[[1](https://arxiv.org/html/2012.09852v3#bib.bib1)].
BERT for discriminative tasks and GPT-2 for generative tasks are the most widely used models, as illustrated in Figure[3](https://arxiv.org/html/2012.09852v3#S2.F3 "Figure 3 ‣ II Background and Motivation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"). BERT only contains the summarization stage, while GPT-2 contains both summarization and generation stages. In summarization (Figure 3 left), the input tokens are first embedded into vectors and processed by blocks. Inside each block, `block_in` is first multiplied with three matrices to get Query (Q), Key (K), and Value (V). Then Q, K, and V are processed by attention to get the intermediate features `attention_out`. A residual layer adds `attention_out` to `block_in` and conducts layer normalization. There is an additional FC on `attention_out` if there is more than one head. Furthermore, a Feed-Forward Network (FFN) layer containing two Fully-Connected (FC) layers is applied. Finally, another residual operation is conducted, producing `block_out`.
The same block is repeated multiple times, e.g., 12 times for BERT-Base. In BERT, the last block is followed by one classification layer to get the _final result_. In contrast, GPT-2 applies an LM head to generate one new token and enters the generation stage.
The generation stage (Figure[3](https://arxiv.org/html/2012.09852v3#S2.F3 "Figure 3 ‣ II Background and Motivation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning") right) has two main differences from summarization: 1) each iteration processes only _one single token_ instead of the whole sentence; 2) the Ks and Vs from the summarization stage are concatenated with the current K and V and sent to attention _in batch_, while the query is still _one single vector_. After the last block, another new token is generated. The generation stage ends when the end_of_sentence token is generated or the sentence length reaches a pre-defined limit. One generation iteration's runtime is similar to that of the whole summarization stage on a GPU, because the summarization is processed in batch.
Figure 4: Cascade token pruning removes redundant tokens and the corresponding entire Q, K, V vectors according to cumulative token importance scores computed from `attention_prob`. Cascade head pruning removes unimportant heads and the corresponding chunks in all Q, K, V vectors according to cumulative head importance scores computed from `attention_out`. Once a token/head is pruned, it never appears in any following layer, hence the name cascade pruning. More tokens and heads are pruned away as the layer goes deeper.
    Input: number of heads: h; token pruning ratio: p_t; head pruning ratio: p_h;
           attention_prob ∈ ℝ^(h×L0×L1); attention_out ∈ ℝ^(L0×D_in);
           previous cumulative token importance scores: s_t ∈ ℝ^(L1);
           previous cumulative head importance scores: s_h ∈ ℝ^(h)
    Reshape attention_out by head, get E ∈ ℝ^(h×L0×D)
    /* accumulate the token importance */
    for token_id = 0 to L1 do
        for l0 = 0 to L0 do
            for head_id = 0 to h do
                s_t[token_id] += attention_prob[head_id][l0][token_id]
            end for
        end for
    end for
    /* accumulate the head importance */
    for head_id = 0 to h do
        for l0 = 0 to L0 do
            for d = 0 to D do
                s_h[head_id] += |E[head_id][l0][d]|
            end for
        end for
    end for
    remained_token_id = top-k(s_t, L1 × (1 − p_t))
    remained_head_id = top-k(s_h, h × (1 − p_h))
    Output: remained_token_id, remained_head_id, s_t, s_h

Algorithm 2: Token and Head Pruning (one layer)
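Algorithm 2 can be sketched in NumPy as follows. For clarity this sketch uses `argsort` in place of the O(n) top-k engine, and the function and variable names are ours:

```python
import numpy as np

def prune_step(attention_prob, E, s_t, s_h, p_t, p_h):
    """One layer of cascade token/head pruning (Algorithm 2, sketch).
    attention_prob: (h, L0, L1); E: (h, L0, D), i.e. attention_out
    reshaped by head; s_t: (L1,) and s_h: (h,) are the cumulative
    importance scores carried over from previous layers."""
    h, L0, L1 = attention_prob.shape
    # Token importance: sum attention probabilities over heads and rows
    # (each column of attention_prob corresponds to one attended token).
    s_t = s_t + attention_prob.sum(axis=(0, 1))
    # Head importance: accumulate each head's output magnitude.
    s_h = s_h + np.abs(E).sum(axis=(1, 2))
    k_t = int(L1 * (1 - p_t))
    k_h = int(h * (1 - p_h))
    # argsort is O(n log n); the hardware uses an O(n) top-k engine instead.
    remained_token_id = np.argsort(-s_t)[:k_t]
    remained_head_id = np.argsort(-s_h)[:k_h]
    return remained_token_id, remained_head_id, s_t, s_h
```

Because the updated `s_t` and `s_h` are returned, calling `prune_step` once per layer reproduces the cascade: scores keep accumulating, and each layer keeps only the survivors of the previous one.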
Attention Mechanism. The attention mechanism[[1](https://arxiv.org/html/2012.09852v3#bib.bib1)] is shown in Algorithm[1](https://arxiv.org/html/2012.09852v3#alg1 "In II Background and Motivation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"). In the summarization stage, Q, K, and V are matrices with the same dimensions, while in the generation stage, Q is one single vector, and K and V are matrices.
Attention has multiple heads, each processing a chunk of Q, K, and V. Different heads capture various dependencies between tokens, some long-term, some short-term. Inside each head, Q × K^T / sqrt(D) gives the attention scores, which indicate whether two tokens are related. For instance, in Figure[5](https://arxiv.org/html/2012.09852v3#S2.F5 "Figure 5 ‣ II-B Motivation ‣ II Background and Motivation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"), the score for 'more' attending to 'than' is large. After that, a row-wise softmax computes the attention probabilities. The exponential in softmax further enlarges the scores of highly-related token pairs. The head's feature is then computed as attention_prob × V. This step lets each token fetch information from the tokens it attends to. Finally, multiple heads are concatenated together as the attention output. In BERT-Base and GPT-2-Small, there are 12 blocks, and each attention layer has 12 heads. In BERT-Large and GPT-2-Medium, there are 24 blocks and 16 heads per attention layer.
| ### II-B Motivation | |
| We profile the end-to-end latency of a GPT-2 model on multiple hardware platforms in Figure[2](https://arxiv.org/html/2012.09852v3#S1.F2 "Figure 2 ‣ I Introduction ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"). Attention typically accounts for over 50% of the latency, even though it contributes only around 10% of the overall FLOPs. In the right part of Figure[2](https://arxiv.org/html/2012.09852v3#S1.F2 "Figure 2 ‣ I Introduction ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"), around 73% of the time is spent on data movement such as splitting heads, K and V concatenations, reshapes, and transposes. GPUs and CPUs are well-optimized for matrix multiplication but handle these complex memory operations poorly, which slows attention down. It is therefore necessary to build an accelerator that removes the attention bottleneck as a co-processor. The FC layers of NLP models are processed by GPUs, CPUs, or tensor algebra accelerators, which are highly optimized for FC, while our SpAtten co-processor handles all attention layers. | |
|  | |
| Figure 5: Attention probabilities for BERT are summed over each column to get importance scores. Tokens with small importance scores are pruned. | |
| III Algorithmic Optimizations | |
| ----------------------------- | |
| ### III-A Cascade Token Pruning | |
| Plenty of unessential tokens exist in human languages, which can be pruned away to boost efficiency. Therefore, we propose cascade token pruning to assess token importance based on attention probabilities and remove trivial ones. Cascade means that once a token is pruned, it is removed in all the following layers, so one layer only needs to process _remaining_ tokens from previous layers. | |
| In cascade token pruning, tokens to be pruned are determined by an array of _cumulative token importance scores_, one for each token (Figure[4](https://arxiv.org/html/2012.09852v3#S2.F4 "Figure 4 ‣ II-A Background ‣ II Background and Motivation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning") and Algorithm[2](https://arxiv.org/html/2012.09852v3#alg2 "In II-A Background ‣ II Background and Motivation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning")). The scores are obtained by accumulating attention probabilities across multiple rounds of attention. The probability indicates whether a token is important to the sentence: if it is large, the outputs are more influenced by the corresponding token, since attention_out = attention_prob × V, while V is computed from the input features block_in (see Figure[3](https://arxiv.org/html/2012.09852v3#S2.F3 "Figure 3 ‣ II Background and Motivation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning")). The block_in of the first block comes directly from the input tokens; for later blocks, the V vectors are still largely determined by the input tokens because of residual connections. Therefore, the probability is an indicator of token importance, and accumulating it across several heads and layers makes the importance estimate more reliable. 
For example, in Figure[5](https://arxiv.org/html/2012.09852v3#S2.F5 "Figure 5 ‣ II-B Motivation ‣ II Background and Motivation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"), many tokens attend to the word ‘fun’, implying its high importance. In each head, the scores are accumulated L_0 (the number of query vectors) times in the summarization stage and once in the generation stage. In BERT, we accumulate importance scores in earlier heads and layers and apply token pruning in later ones. In GPT-2, we further accumulate importance scores _across generation iterations_ because, intuitively, tokens unimportant for generating one token should also be unimportant for others. With the cumulative importance scores, we can remove a pre-defined ratio of tokens. Once a token is pruned, its Q, K, and V are _never_ used in any following attention head or layer; in every layer/head, several new tokens can be selected and pruned away, making the pruning _global and cascade_. Token pruning reduces the computation and memory access of both attention and the FC layers outside attention. On eight GPT-2 benchmarks, token pruning achieves a 3.8× reduction in DRAM access. | |
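As a concrete sketch (our naming; in hardware the scores live in on-chip accumulators and the sort is replaced by the top-k engine of Section IV-B), cumulative importance and cascade pruning can be modeled as:

```python
def accumulate_importance(scores, attention_probs):
    # sum attention probabilities over each column: how much all queries
    # attend to each token (the column-wise sums of Figure 5)
    for row in attention_probs:
        for t, p in enumerate(row):
            scores[t] += p
    return scores

def cascade_prune(remaining_ids, scores, prune_ratio):
    # drop a pre-defined ratio of the remaining tokens; pruned tokens
    # never reappear in later heads/layers (the cascade property)
    keep = len(remaining_ids) - int(len(remaining_ids) * prune_ratio)
    ranked = sorted(remaining_ids, key=lambda t: scores[t], reverse=True)
    kept = set(ranked[:keep])
    return [t for t in remaining_ids if t in kept]   # original order
```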
| ### III-B Cascade Head Pruning | |
| Each QKV vector has multiple chunks corresponding to multiple heads, which capture various token dependency relationships. However, some heads are redundant[[5](https://arxiv.org/html/2012.09852v3#bib.bib5)] and have little influence on the outputs. Token pruning reduces the sentence length; head pruning reduces the feature length. Hence, redundancies in both dimensions are removed. The head importance is calculated by accumulating the absolute values of each head’s attention_out elements across layers to get _cumulative head importance scores_ (Figure[4](https://arxiv.org/html/2012.09852v3#S2.F4 "Figure 4 ‣ II-A Background ‣ II Background and Motivation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning") and Algorithm[2](https://arxiv.org/html/2012.09852v3#alg2 "In II-A Background ‣ II Background and Motivation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning")). The magnitude of a head’s outputs indicates its importance because an FC layer processes the concatenation of all heads: if one head’s magnitude is large, the outputs of that FC layer and of the whole block (block_out) are more heavily influenced by that head. Similar to token pruning, head pruning is also cascaded: once removed, a head does not appear in any following layer. | |
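A minimal sketch of the head-importance accumulation (our naming; `attention_out` is the concatenated per-head output described above):

```python
def accumulate_head_importance(attention_out, n_heads, head_scores):
    # attention_out: [seq_len][d_model], with head h owning columns
    # [h*d, (h+1)*d); add up the |elements| of each head's slice
    d = len(attention_out[0]) // n_heads
    for row in attention_out:
        for h in range(n_heads):
            head_scores[h] += sum(abs(x) for x in row[h * d:(h + 1) * d])
    return head_scores
```

Heads whose cumulative score ranks below the pruning threshold are then removed for all later layers, mirroring cascade token pruning.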
|  | |
| Figure 6: Progressive Quantization. MSBs are fetched first. Only if the resulting probability distribution is flat, LSBs are fetched, and attention probabilities are recomputed – this trades computation for memory saving. | |
| ### III-C Local Value Pruning | |
| SpAtten also supports local Value (V) pruning, which is performed after Softmax. The V vectors to be pruned are decided _solely by the current head’s attention probabilities_. A pre-defined ratio of V vectors with the smallest attention probabilities are pruned and will not be fetched for the attention_prob × V computation. Compared to cascade token pruning, which removes the Q, K, and V of pruned tokens for the current and all following attention heads and layers, local V pruning only removes V vectors of the current head. | |
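A sketch of the selection rule (ours; the hardware performs it with the second top-k engine rather than a software sort):

```python
def local_value_prune(probs, prune_ratio):
    # probs: the current head's attention probabilities for one query;
    # returns indices of the V vectors that will actually be fetched
    n_prune = int(len(probs) * prune_ratio)
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    pruned = set(order[:n_prune])        # smallest-probability Vs dropped
    return [i for i in range(len(probs)) if i not in pruned]
```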
| ### III-D Progressive Quantization | |
| Softmax layers are abundant in attention, which allows us to apply more aggressive quantization than in CNN/RNN models because Softmax reduces quantization error. The Softmax for attention probabilities is $p_i = e^{s_i} / \sum_{j=0}^{L_1-1} e^{s_j}$, where $L_1$ is the number of K vectors, $p$ is the attention probability, and $s$ is the attention score. Quantization of Q and K can be viewed as adding a small error $\Delta s$ to the attention score $s$. We examine the influence of $\Delta s$ on the output attention probabilities by computing the Softmax derivative: | |
| $\mathrm{If\ } i = j:\ \frac{\partial p_i}{\partial s_j} = \frac{\partial}{\partial s_i}\left(\frac{e^{s_i}}{\sum_{i=0}^{L_1-1} e^{s_i}}\right) = p_i \cdot (1 - p_i)$ (1) | |
| $\mathrm{If\ } i \neq j:\ \frac{\partial p_i}{\partial s_j} = \frac{\partial}{\partial s_j}\left(\frac{e^{s_i}}{\sum_{i=0}^{L_1-1} e^{s_i}}\right) = -p_i \cdot p_j$ | |
| Without loss of generality, we assume $s_0$ changes by $\Delta s_0 > 0$, and sum the absolute errors of all outputs with Equation[1](https://arxiv.org/html/2012.09852v3#S3.E1 "In III-D Progressive Quantization ‣ III Algorithmic Optimizations ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"): | |
| $error = \lvert \Delta s_0 \cdot p_0 \cdot (1 - p_0) \rvert + \sum_{i=1}^{L_1-1} \lvert -\Delta s_0 \cdot p_0 \cdot p_i \rvert = \Delta s_0 \cdot (2 \cdot p_0 \cdot (1 - p_0)) < \Delta s_0$ (2) | |
| Since $0 \leq p_0 \leq 1$, $2p_0(1-p_0)$ is at most 0.5, so the total quantization error is reduced after Softmax. | |
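The bound can be checked numerically (a sanity-check script of ours, not part of the paper):

```python
import math

def softmax(s):
    m = max(s)
    e = [math.exp(x - m) for x in s]
    z = sum(e)
    return [x / z for x in e]

def perturbation_error(s, delta):
    # total absolute change of the output probabilities when the
    # attention score s[0] is perturbed by delta
    p = softmax(s)
    q = softmax([s[0] + delta] + s[1:])
    return sum(abs(a - b) for a, b in zip(p, q))
```

For any score vector and any `delta > 0`, `perturbation_error(s, delta)` stays below `delta`, matching Equation 2's conclusion that Softmax attenuates score-level quantization error.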
|  | |
| Figure 7: When the attention probability is evenly distributed (left), the quantization error is large; both MSB and LSB are required. When dominant probability exists (right), the quantization error is small; only MSB is required. Grey tokens are pruned by token pruning, thus no need to compute their attention probabilities. | |
|  | |
| Figure 8: SpAtten Architecture Overview. Modules on the critical path (6,7,8,10,11) are fully pipelined to maximize the throughput. | |
| On top of static quantization, we propose _progressive quantization_ (Figure[6](https://arxiv.org/html/2012.09852v3#S3.F6 "Figure 6 ‣ III-B Cascade Head Pruning ‣ III Algorithmic Optimizations ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning")) to progressively increase the input bitwidth when aggressive quantization hurts accuracy. An interesting phenomenon shown in Figure[7](https://arxiv.org/html/2012.09852v3#S3.F7 "Figure 7 ‣ III-D Progressive Quantization ‣ III Algorithmic Optimizations ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning") is that if the attention probability distribution is dominated by a few tokens, the 4-bit quantization error is smaller; if the distribution is flat, the error is larger. Intuitively, dominant tokens are semantically important and thus cannot be easily influenced by quantization. Theoretically, errors are proportional to $p(1-p)$ (Equation[2](https://arxiv.org/html/2012.09852v3#S3.E2 "In III-D Progressive Quantization ‣ III Algorithmic Optimizations ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning")). When probability dominators exist, $p$ is close to zero or one, so errors are small; otherwise, errors are larger. Therefore, robust layers with probability dominators can use a small bitwidth, while more sensitive layers should use more bits. As shown in Figure[6](https://arxiv.org/html/2012.09852v3#S3.F6 "Figure 6 ‣ III-B Cascade Head Pruning ‣ III Algorithmic Optimizations ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"), we first apply an aggressive bitwidth (fetching MSBs only) for the inputs and compute the attention probabilities. If the maximum computed probability is smaller than a threshold, indicating a flat distribution, we fetch the LSBs and recompute the attention probabilities once; otherwise, the LSBs are not needed. 
Intuitively, progressive quantization finds which input sample is _more difficult_ and applies a higher bitwidth instead of using a high bitwidth universally, thus reducing DRAM access under the same accuracy. | |
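The decision rule can be sketched as follows (a software model under our assumptions: integer attention scores, a fixed MSB width, and an illustrative confidence threshold; the real design quantizes the fetched Q/K operands and tunes thresholds per task):

```python
import math

def softmax(s):
    m = max(s)
    e = [math.exp(x - m) for x in s]
    z = sum(e)
    return [x / z for x in e]

def keep_msbs(x, total_bits, msb_bits):
    # zero out the low (total_bits - msb_bits) bits of a non-negative int
    shift = total_bits - msb_bits
    return (x >> shift) << shift

def progressive_probs(scores, total_bits=12, msb_bits=4, thresh=0.5):
    coarse = softmax([keep_msbs(s, total_bits, msb_bits) for s in scores])
    if max(coarse) >= thresh:        # a dominator exists: MSBs suffice
        return coarse, False         # (probabilities, fetched_lsb?)
    # flat distribution: fetch LSBs and recompute once at full precision
    return softmax(scores), True
```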
| For fast interaction between SpAtten and the hardware that runs the FC parts, we use _linear symmetric quantization_, which is much faster than K-Means quantization. We have five MSB+LSB settings: 4+4, 6+4, 8+4, 10+4, and 12+4. The setting can differ across tasks but is fixed within one task; different inputs of one task determine _whether to fetch LSBs on the fly_. We store MSBs contiguously and LSBs contiguously in DRAM so that they can be fetched separately. Progressive quantization trades more computation for less memory access and can effectively accelerate memory-bound generative models such as GPT-2. It also improves energy efficiency, since DRAM access takes around 70% of SpAtten power, much more expensive than computation. On average, only 5.9% of input samples require LSBs. For BERT, we apply only static quantization because BERT models are computation-bound, and fetching LSBs for recomputation would degrade BERT’s performance. | |
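The split storage can be modeled in a few lines (illustrative; the actual DRAM layout packs whole MSB and LSB planes contiguously so each can be streamed on its own):

```python
def split_msb_lsb(values, lsb_bits=4):
    # separate each value into an MSB plane and a 4-bit LSB plane so the
    # two can be fetched from DRAM independently
    msb_plane = [v >> lsb_bits for v in values]
    lsb_plane = [v & ((1 << lsb_bits) - 1) for v in values]
    return msb_plane, lsb_plane

def combine_planes(msb_plane, lsb_plane, lsb_bits=4):
    # rebuild full-precision values once an LSB fetch is triggered
    return [(m << lsb_bits) | l for m, l in zip(msb_plane, lsb_plane)]
```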
| IV Hardware Architecture | |
| ------------------------ | |
| ### IV-A Overview | |
| An overview of SpAtten is shown in Figure[8](https://arxiv.org/html/2012.09852v3#S3.F8 "Figure 8 ‣ III-D Progressive Quantization ‣ III Algorithmic Optimizations ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"). To support token/head pruning, a novel top-k engine (Figure[9](https://arxiv.org/html/2012.09852v3#S4.F9 "Figure 9 ‣ IV-A Overview ‣ IV Hardware Architecture ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning")) is designed to rank the token/head importance. Pruning reduces computation and memory traffic but incurs random access. Thus, a crossbar is applied to process the addresses, keep each memory channel busy, and increase the bandwidth utilization rate. To support progressive quantization, we implement an on-chip bitwidth converter to handle the splits of fetched bits and concatenations of MSBs and LSBs. | |
| SpAtten processes attention head by head and query by query, thus well balancing the pruning granularity and parallelism. One query of a head is fed to the pipeline at a time, enabling token pruning in both head-wise and layer-wise granularity. Inner-head parallelism can keep all on-chip computation resources busy, so no need for inter-head parallelism. In the summarization stage, K and V that survive cascade token pruning are fetched to the on-chip SRAM and will be _reused_ across multiple queries. In the generation stage, the Q is a single vector, so there is no reuse of K and V, and no need to store them in the on-chip SRAM. | |
| For each fetched Q, the top-k engine (Figure[9](https://arxiv.org/html/2012.09852v3#S4.F9 "Figure 9 ‣ IV-A Overview ‣ IV Hardware Architecture ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning")) first ranks the token importance scores and gets the k most important Ks. A data fetcher then computes the Ks’ addresses and feeds them to a 32×16 crossbar for 16 HBM channels; the data returns through a reverse 16×32 crossbar to preserve the correct order. Q and K are processed by a matrix-vector multiplication module (Figure[11](https://arxiv.org/html/2012.09852v3#S4.F11 "Figure 11 ‣ IV-C Zero Eliminator ‣ IV Hardware Architecture ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning")) to get the attention scores. A Softmax module (Figure[12](https://arxiv.org/html/2012.09852v3#S4.F12 "Figure 12 ‣ IV-G Attention Prob-Value Multiplication ‣ IV Hardware Architecture ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning") left) then turns the attention scores into attention probabilities and sends them to the progressive quantization module (Figure[12](https://arxiv.org/html/2012.09852v3#S4.F12 "Figure 12 ‣ IV-G Attention Prob-Value Multiplication ‣ IV Hardware Architecture ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning") right), which decides whether LSBs are required. The probabilities are also sent to the token importance score accumulator. After that, the local Value pruning top-k engine takes the probabilities, selects the k most locally important Vs, and sends their indices to the data fetcher. Finally, the remaining probabilities are multiplied with the fetched V to produce the attention outputs. After computing one head, the head importance score is accumulated. 
After finishing all heads in a layer, a top-k module prunes unimportant heads, which will not be computed in any following layers. | |
| Our dataflow guarantees that pruned tokens and value vectors are never fetched, thus reducing DRAM access. The critical path (modules 6, 7, 8, 10, 11) is fully pipelined. The accumulation of token/head importance scores and the token/head top-k run _in parallel_ with the critical path. For the module 3 branches, when multiple sources of Q/K/V requests arrive simultaneously, the fetcher processes them one by one and sends addresses to FIFOs. The branches after modules 8/11 send attention_prob/attention_out to the accumulators and to the next computation module simultaneously. For the module 9 branch, if LSBs are required, it discards attention_prob and initiates recomputation; modules 10 and 11 then sit idle, waiting for the recomputed attention probabilities. For the module 10 branch, fetching un-pruned V from DRAM is part of the coarse-grained pipeline. | |
|  | |
| Figure 9: High Parallelism top-k Engine. | |
| The on-chip memory system has two main SRAMs for Key and Value, 196KB each, in module 7 and module 11. They store the K and V from the QKV fetcher. Since queries are processed one by one, the Q vector is stored in registers. We also have 32 64-depth×8B address FIFOs after the QKV fetcher and 32 64-depth×16B data FIFOs before the bitwidth converter. | |
|  | |
| Figure 10: Zero Eliminator. A zero eliminator of n elements has log n stages. | |
| Input: top-k parameter k, data vector inputs; | |
| FIFO_L, FIFO_R depth: 64; FIFO = [FIFO_L, FIFO_R]; | |
| Initialize FIFO_L with inputs, FIFO_R = ∅; | |
| target = k, num_eq_pivot = 0; | |
| START: | |
| if size(FIFO_R) + num_eq_pivot ≤ target then | |
| /* the selected pivot is too large */ | |
| target = target − size(FIFO_R) − num_eq_pivot; | |
| FIFO_R = ∅; select = 0; pivot = FIFO_L[rand]; | |
| Goto STATE_RUN; | |
| else if size(FIFO_R) > target then | |
| /* the selected pivot is too small */ | |
| FIFO_L = ∅; select = 1; pivot = FIFO_R[rand]; | |
| Goto STATE_RUN; | |
| else | |
| /* size(FIFO_R) ≤ target < size(FIFO_R) + num_eq_pivot */ | |
| k_th_largest = pivot; | |
| num_eq_k_th_largest = target − size(FIFO_R); | |
| Output: [k_th_largest, num_eq_k_th_largest]; | |
| end if | |
| STATE_RUN: (Figure [9](https://arxiv.org/html/2012.09852v3#S4.F9 "Figure 9 ‣ IV-A Overview ‣ IV Hardware Architecture ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning") Right) | |
| /* items smaller than pivot → FIFO_L */ | |
| /* items larger than pivot → FIFO_R */ | |
| num_eq_pivot = 0; | |
| for i = 0 to size(FIFO[select]) − 1 do | |
| Pop FIFO[select] as item; | |
| if item < pivot then Push item to FIFO_L; | |
| else if item > pivot then Push item to FIFO_R; | |
| else /* item == pivot */ num_eq_pivot = num_eq_pivot + 1; | |
| end if | |
| end for | |
| Goto START; | |
| Algorithm 3 top-k Engine | |
| ### IV-B Top-k Engine | |
| To support cascade token/head pruning and local Value pruning, we need to find the top k elements of an array. A naïve solution is to sort the array and output the first k elements, but this needs O(n·log n) time and an O(n·log²n)-space sorting network, and completely randomizes data fetching. Instead, we design a novel top-k engine (Figure[9](https://arxiv.org/html/2012.09852v3#S4.F9 "Figure 9 ‣ IV-A Overview ‣ IV Hardware Architecture ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning")) with a quick-select module that finds the k-th largest element and uses it as a threshold to filter the input array. It has much lower time complexity (O(n) on average) and keeps the original order of the inputs. The engine uses a randomly chosen pivot to partition the input array into two parts: elements smaller and larger than the pivot. Two FIFOs, FIFO_L and FIFO_R, store the partitioned arrays. An array to be partitioned is fed into two comparator arrays and compared with the pivot; the left/right comparator array preserves only elements smaller/larger than the pivot, while the others are set to zero and removed by a zero eliminator. Quick-select runs iteratively until the k-th largest element is found. The control logic is in Algorithm[3](https://arxiv.org/html/2012.09852v3#alg3 "In IV-A Overview ‣ IV Hardware Architecture ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"). 
Afterwards, the k-th largest element is used to filter the input array, which is buffered in another FIFO (Figure[9](https://arxiv.org/html/2012.09852v3#S4.F9 "Figure 9 ‣ IV-A Overview ‣ IV Hardware Architecture ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning") left) before the quick-select process. The filtered outputs are processed by another zero eliminator to finally produce the top-k elements. The design scales up easily with more comparators; in SpAtten, we use 16 comparators in each array so that the engine is not the bottleneck of the pipeline. Since head pruning runs much less frequently than token pruning and local Value pruning, we reuse the token pruning top-k engine for head pruning, so SpAtten contains two top-k engines in total. | |
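A software model of the quick-select loop and the final filtering (ours; the hardware pushes 16 elements per cycle through the comparator arrays instead of looping, and this sketch folds the equality case into the termination branch so it never selects a pivot from an empty FIFO):

```python
import random

def kth_largest_threshold(data, k):
    # Partition around a random pivot into FIFO_L (< pivot) and
    # FIFO_R (> pivot), recursing into one side, as in Algorithm 3.
    # Returns (threshold, how many elements equal to it belong in the top-k).
    assert 1 <= k <= len(data)
    fifo_l, fifo_r, num_eq, target, pivot = list(data), [], 0, k, None
    while True:
        n_ge = len(fifo_r) + num_eq          # elements >= current pivot
        if n_ge < target:                    # pivot too large: go left
            target -= n_ge
            pool = fifo_l
        elif len(fifo_r) > target:           # pivot too small: go right
            pool = fifo_r
        else:                                # pivot is the k-th largest
            return pivot, target - len(fifo_r)
        pivot = random.choice(pool)
        fifo_l, fifo_r, num_eq = [], [], 0
        for item in pool:
            if item < pivot:
                fifo_l.append(item)
            elif item > pivot:
                fifo_r.append(item)
            else:
                num_eq += 1

def top_k_filter(data, k):
    # keep the k largest elements while preserving their original order
    t, eq_keep = kth_largest_threshold(data, k)
    out = []
    for x in data:
        if x > t or (x == t and eq_keep > 0):
            if x == t:
                eq_keep -= 1
            out.append(x)
    return out
```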
| ### IV-C Zero Eliminator | |
| We design an innovative logic for the zero eliminator, which first uses a prefix sum module to calculate the number of zeros before each element (zero_cnt). These zero_cnt values guide a log N-layer shifter that shifts the input array by 1, 2, 4, … positions. In the n-th layer, whether to shift an element is determined by the n-th bit of its corresponding zero_cnt. An example is illustrated in Figure[10](https://arxiv.org/html/2012.09852v3#S4.F10 "Figure 10 ‣ IV-A Overview ‣ IV Hardware Architecture ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"). | |
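In software, the prefix-sum-plus-shifter structure looks like this (our behavioral model; the hardware evaluates each stage combinationally):

```python
def zero_eliminate(arr):
    n = len(arr)
    # prefix sum: number of zeros strictly before each position
    zero_cnt, zeros = [], 0
    for x in arr:
        zero_cnt.append(zeros)
        if x == 0:
            zeros += 1
    # log2(n) stages: stage s shifts an element left by 2^s iff
    # bit s of its zero count is set
    cur = list(zip(arr, zero_cnt))
    shift, stage = 1, 0
    while shift < n:
        nxt = [(0, 0)] * n
        for i, (x, c) in enumerate(cur):
            if x != 0:
                d = shift if (c >> stage) & 1 else 0
                nxt[i - d] = (x, c)
        cur = nxt
        shift, stage = shift * 2, stage + 1
    return [x for x, _ in cur]
```

Each surviving element ends at index `i - zero_cnt[i]`, and no two survivors collide at any intermediate stage, which is what makes the shifter safe to pipeline.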
The top-k engine equipped with zero eliminators has much higher parallelism than a direct implementation of quick-select; without high parallelism, performance would be bottlenecked by finding the top-k. In Figure [20](https://arxiv.org/html/2012.09852v3#S5.F20 "Figure 20 ‣ V-C Performance Analysis ‣ V Evaluation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"), we show that SpAtten with a parallelized top-k engine achieves 3× speedup over a baseline top-k engine with parallelism 1. We also compare a regular full sorting unit (a Batcher's odd-even sorter [[14](https://arxiv.org/html/2012.09852v3#bib.bib14)] performing merge-sort) against the worst case of the top-k engine (selecting the median) with an input length of 1024. Experimental results show that we achieve 1.4× higher throughput with 3.5× lower power consumption than the full sorting unit.
Figure 11: Query-Key multiplication module with a reconfigurable adder tree. The adder tree can output a varying number of partial sums according to the query vector's dimension.
### IV-D Data Fetcher and Bitwidth Converter
The Q-K-V data fetcher is designed to send multiple random read requests per cycle to all 16 HBM channels, where Q, K, and V are interleaved across channels. We use a 32-to-16 crossbar to route these read requests to the correct channels; the master side is larger than the slave side. There is no memory access conflict because the crossbar generates at most one memory request for each channel at a time.
In order to support progressive quantization while avoiding complex logic overhead, we fix the bitwidth of the on-chip SRAMs and multipliers. A bitwidth converter converts the data loaded from DRAM (4, 8, or 12 bits) uniformly into the on-chip bitwidth (12 bits). The converter consists of MUXes to select the correct bits from the input and a shifter to allow reading data from an unaligned address.
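As a rough illustration of the converter's job, an MSB-only DRAM word can be left-aligned to the 12-bit on-chip width, with later-fetched LSBs (from progressive quantization) merged underneath. The padding and merging scheme below is our assumption for illustration, not the exact RTL wiring:

```python
def to_onchip(msb_val, msb_bits, lsb_val=0, lsb_bits=0, onchip_bits=12):
    """Left-align a DRAM word (e.g., 4, 8, or 12 MSBs) to the fixed
    on-chip bitwidth; optionally merge later-fetched LSBs below the MSBs.
    The zero-padding/merging scheme is illustrative, not the exact RTL."""
    v = msb_val << (onchip_bits - msb_bits)                  # MSBs on top
    if lsb_bits:
        v |= lsb_val << (onchip_bits - msb_bits - lsb_bits)  # LSBs below
    return v
```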
### IV-E Query-Key Multiplication Module
The query-key multiplication module (Figure [11](https://arxiv.org/html/2012.09852v3#S4.F11 "Figure 11 ‣ IV-C Zero Eliminator ‣ IV Hardware Architecture ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning")) is designed to calculate the matrix-vector multiplication between K and Q. In each cycle, a row K_i of the key matrix is loaded from the K SRAM, multiplied by Q with a multiplier array, and fed to an adder tree. The adder tree then computes the attention score by reducing all products: s_i = Σ_j K_ij × Q_j. We use 512 multipliers in this module to fully utilize the DRAM bandwidth. To support queries and keys with dimension D lower than 512, we compute 512/D attention scores in each cycle. The corresponding K_i rows are packed into the same line of the key SRAM, and the query is broadcast 512/D times so that every K_i has access to the same Q. The adder tree is also configurable to output the results of its last several layers, making it function like 512/D separate D-way adder trees capable of producing multiple scores s_i.
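A behavioral sketch of one such cycle (pure Python, with the 512-wide line generalized to any width that is a multiple of D):

```python
def qk_cycle(packed_keys, q):
    """One cycle of the Q-K module (behavioral model): a wide SRAM line
    holds width/D key rows of dimension D; the query q (length D) is
    broadcast width/D times so all multipliers fire, and the adder tree,
    truncated to act as width/D separate D-way trees, emits width/D
    scores s_i = sum_j K_ij * Q_j."""
    D, width = len(q), len(packed_keys)
    assert width % D == 0
    lanes = width // D
    q_bcast = q * lanes                                      # broadcast query
    products = [k * x for k, x in zip(packed_keys, q_bcast)]  # multiplier array
    return [sum(products[i * D:(i + 1) * D]) for i in range(lanes)]
```

With D = 64 and a 512-wide line, this yields 8 attention scores per cycle, matching the 512/D figure above.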
### IV-F Softmax and Progressive Quantization
The fixed-point attention scores s from the query-key multiplication are first dequantized using a scaling factor. The attention score normalization factor sqrt(D) is folded into the scaling factor, so dequantization and normalization are performed simultaneously. After that, a pipeline of floating-point exponential, accumulation, and division operations computes the Softmax results e^{s_i} / Σ_j e^{s_j}. The results are finally quantized again so that the operations after Softmax can be performed in fixed-point.
The Softmax results are then fed to the progressive quantization determination module to examine whether LSBs are required. Specifically, we compare the max attention probability with a pre-defined threshold; if it is smaller than the threshold, the Q-K-V data fetcher is informed to fetch the LSBs.
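The dequantize-softmax-decide pipeline can be sketched as below. The 0.1 threshold is the paper's typical value; the exact way the scaling factor folds in 1/sqrt(D) is our assumption for illustration:

```python
import math

def softmax_progressive(scores_int, scale, D, threshold=0.1):
    """Dequantize fixed-point scores and decide whether LSBs are needed.
    The 1/sqrt(D) normalization is folded into the scaling factor, as in
    the text. Returns the attention probabilities and a 'fetch LSBs'
    flag (True when the max probability is below the threshold, i.e.,
    the MSB-only result has low confidence)."""
    s = [x * scale / math.sqrt(D) for x in scores_int]  # dequant + normalize
    m = max(s)
    e = [math.exp(x - m) for x in s]                    # stable exponent
    total = sum(e)
    probs = [x / total for x in e]
    return probs, max(probs) < threshold
```

A flat attention distribution (e.g., 16 equal scores, max probability 1/16 < 0.1) triggers the LSB refetch; a peaked one does not.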
### IV-G Attention Prob-Value Multiplication
The attention prob-value multiplication unit takes the attention probabilities as inputs, multiplies them with the V vectors, and accumulates to get the attention outputs A_j = Σ_i attention_prob_i × V_ij. It contains another broadcast-multiply-reduce pipeline, similar to the one in the query-key multiplication module, to process multiple attention probabilities at the same time. The module has 512 multipliers. The attention probabilities are broadcast D times, and the adder tree is configured to function like D separate 512/D-way adder trees. The attention outputs are in 12 bits.
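Mirroring the Q-K sketch, one cycle of this module can be modeled as follows (behavioral only; lane widths generalized):

```python
def pv_cycle(probs, packed_values, D):
    """One cycle of the prob-value module (behavioral model): len(probs)
    value rows of length D share one wide line; each attention probability
    is broadcast across its D-element row, and the adder tree acts as D
    separate trees, accumulating partial sums of A_j = sum_i p_i * V_ij."""
    lanes = len(probs)                         # 512 // D rows per cycle
    partial = [0.0] * D
    for i in range(lanes):
        for j in range(D):                     # prob_i broadcast D times
            partial[j] += probs[i] * packed_values[i * D + j]
    return partial                             # D partial output sums
```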
Figure 12: Softmax and progressive quantization determination modules.
TABLE I: Architectural Setups of SpAtten.
TABLE II: Power Breakdown of SpAtten.
V Evaluation
------------
### V-A Evaluation Methodology
We implement SpAtten in SpinalHDL, compile it to RTL, and simulate each application with Verilator to obtain cycle counts. For HBM modeling, we use Ramulator [[15](https://arxiv.org/html/2012.09852v3#bib.bib15)] with HBM2 settings. We synthesize two versions of SpAtten: SpAtten and SpAtten 1/8. SpAtten 1/8 is only used for fair comparisons with MNNFast and A³. The parameters for SpAtten are listed in Table [I](https://arxiv.org/html/2012.09852v3#S4.T1 "TABLE I ‣ IV-G Attention Prob-Value Multiplication ‣ IV Hardware Architecture ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"). SpAtten 1/8 is one-eighth the scale of SpAtten and contains 128 multipliers. We synthesize SpAtten with Cadence Genus under the TSMC 40nm library to estimate the area and power consumption of the logic, including all fixed-point adders and multipliers. We obtain the number of floating-point operations in Softmax from the simulator. The exponential function is approximated with a 5th-order Taylor expansion [[16](https://arxiv.org/html/2012.09852v3#bib.bib16)] and performed with floating-point multiply-accumulate units (FMAs). The power and area of the FMA are obtained from [[17](https://arxiv.org/html/2012.09852v3#bib.bib17)]. We perform division and estimate its power and area with the floating-point unit (FPU) from [[17](https://arxiv.org/html/2012.09852v3#bib.bib17)]. The FMAs and FPUs are in 45nm technology, and we use them as an upper-bound estimate for 40nm units. We also obtain the width, size, and number of reads/writes of each SRAM and FIFO from the simulator and use CACTI [[18](https://arxiv.org/html/2012.09852v3#bib.bib18)] to estimate the energy and area of SRAMs and FIFOs. For HBM, we simulate the number of row activations and reads/writes with Ramulator, and use the energy numbers from [[19](https://arxiv.org/html/2012.09852v3#bib.bib19)] to calculate overall energy.
We extensively select various hardware platforms as evaluation baselines, including a server GPU (NVIDIA TITAN Xp), a mobile GPU (NVIDIA Jetson Nano), a server CPU (Intel Xeon E5-2640 v4 @ 2.40GHz), a mobile CPU (4-core ARM A53 on a Raspberry Pi 4), and the state-of-the-art accelerators A³ [[6](https://arxiv.org/html/2012.09852v3#bib.bib6)] and MNNFast [[7](https://arxiv.org/html/2012.09852v3#bib.bib7)]. For GPUs and CPUs, we run attention with PyTorch, using the well-optimized cuDNN library on GPU and MKL on CPU. torch.cuda.Event on GPU and time.time on CPU are used to measure latencies. We measure power with nvidia-smi and pcm-power for the TITAN Xp GPU and Xeon CPU, respectively. For the Nano GPU and Raspberry Pi ARM CPU, we use a power meter. For latency measurements, we repeat 1000 times, remove the largest 15% and smallest 15%, and average the remainder. For power measurements, we first measure the system's idle power, then repeatedly run workloads and measure total power; the dynamic power is total power minus idle power.
We evaluate SpAtten on the attention layers of two discriminative models, BERT-Base and BERT-Large, and two generative models, GPT-2-Small and GPT-2-Medium. Tasks for BERT are the nine from GLUE [[4](https://arxiv.org/html/2012.09852v3#bib.bib4)] and V1.1 and V2.0 of SQuAD [[8](https://arxiv.org/html/2012.09852v3#bib.bib8)]; for GPT-2, we use the language modeling task on four datasets: Wikitext-2 [[9](https://arxiv.org/html/2012.09852v3#bib.bib9)], Wikitext-103 [[9](https://arxiv.org/html/2012.09852v3#bib.bib9)], Penn Treebank [[10](https://arxiv.org/html/2012.09852v3#bib.bib10)], and One-Billion Word [[11](https://arxiv.org/html/2012.09852v3#bib.bib11)]. In total, we have 30 benchmarks. For all tasks, finetuning is performed for 2 hours on average on GPU after token pruning to recover accuracy. For each task, we try multiple sets of token/head pruning ratios and quantization bitwidths so as not to lose accuracy, except for a 2% loss for BERT-Large on the SQuAD tasks. Given the same overall pruning ratio, the distribution of ratios among layers/heads does not have a significant influence. We typically keep the first 15% of layers un-pruned, then compute the average ratio of the remaining layers, r_avg. We set a start ratio r_start and an end ratio r_end with r_start + r_end = 2 × r_avg, and interpolate the ratios of the remaining layers.
For head pruning, we keep the first 30% of layers un-pruned and apply a similar method as for token pruning. For progressive quantization, the typical max attention probability threshold is 0.1, and the common MSB+LSB combinations are 6+4 and 8+4.
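The per-layer schedule above can be sketched as follows. The spread d between r_start and r_avg, and the interpretation of the ratio as a keep-fraction, are our assumptions; the paper fixes only the constraint r_start + r_end = 2 × r_avg:

```python
def layer_prune_ratios(num_layers, r_avg, d=0.2, frac_unpruned=0.15):
    """Sketch of the per-layer token pruning schedule: the first ~15% of
    layers keep all tokens (ratio 1.0); the remaining layers get ratios
    linearly interpolated from r_start = r_avg + d down to
    r_end = r_avg - d, so r_start + r_end = 2 * r_avg holds."""
    front = max(1, round(frac_unpruned * num_layers))
    rest = num_layers - front
    r_start, r_end = r_avg + d, r_avg - d      # mean is exactly r_avg
    if rest == 1:
        sched = [r_avg]
    else:
        sched = [r_start + (r_end - r_start) * i / (rest - 1)
                 for i in range(rest)]
    return [1.0] * front + sched
```

For a 12-layer model with r_avg = 0.5, this keeps the first 2 layers intact and ramps the rest from 0.7 down to 0.3, averaging 0.5.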
To measure BERT latency, we set the input sentence length to the average length of each task's dev set. For the GPT-2 models, we set the initial input sentence length to 992 and measure the latency of generating 32 tokens. The energy efficiency of the models is assessed by energy consumption, which is power × latency.
### V-B Experimental Results
Figure 13: On-chip (a) Area and (b) Power Breakdowns of SpAtten.
Figure 14: Speedup and Energy Efficiency of SpAtten over baselines on attention layers. SpAtten can accelerate both discriminative and generative models.
Throughput, Power, and Area. SpAtten prunes tokens and value vectors with cascade token pruning and local value pruning by 1.9× (average over all models) and 3.8× (average over GPT-2 models). Cascade head pruning gives a 1.1× reduction on average. Note that the pruning ratio can be larger when a task's input sentences are longer, because they contain more redundancy; GPT-2 models have longer inputs than BERT, so their pruning ratios can be larger. SpAtten reduces computation by 2.1× and DRAM access by 10.0× on average. It achieves 1.61 TFLOPS on the 22 computation-bounded BERT benchmarks and 0.43 TFLOPS on the 8 memory-bounded GPT-2 benchmarks. SpAtten consumes 8.30 W, broken down in Table [II](https://arxiv.org/html/2012.09852v3#S4.T2 "TABLE II ‣ IV-G Attention Prob-Value Multiplication ‣ IV Hardware Architecture ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"), and occupies 18.71 mm². Figure [13](https://arxiv.org/html/2012.09852v3#S5.F13 "Figure 13 ‣ V-B Experimental Results ‣ V Evaluation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning") shows the area and on-chip power breakdowns. The Q×K and Attention_Prob×V modules consume the largest portions of energy and area since they are the two most computationally intensive modules; the latter consumes less energy thanks to local V pruning. The top-k engines are relatively efficient, taking only 1.0% of overall power and 2.7% of area, so they will not cause severe congestion issues.
Comparisons with CPUs and GPUs. Figure [14](https://arxiv.org/html/2012.09852v3#S5.F14 "Figure 14 ‣ V-B Experimental Results ‣ V Evaluation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning") shows the speedup and energy efficiency comparisons of SpAtten with the baselines on the attention layers of the benchmarks. On average, SpAtten achieves 162×, 347×, 1095×, and 5071× speedup, and 1193×, 4059×, 406×, and 1910× energy savings over the TITAN Xp GPU, Xeon CPU, Nano GPU, and Raspberry Pi ARM CPU. SpAtten obtains high speedup because it has a highly parallelized and pipelined datapath, while cascade pruning and progressive quantization further reduce computation and DRAM access. The energy savings mainly come from the DRAM fetch reduction; the specialized datapath also reduces intermediate SRAM fetches. Since token pruning reduces FFN layer computation, CPUs and GPUs can also be accelerated. We implement token pruning on CPUs/GPUs using topk and gather operations to select un-pruned tokens and shrink the QKV matrices, thus reducing computation, latency, and memory footprint. A 3× pruning ratio brings up to 2.3× speedup for BERT in batch mode (assuming token pruning is performed twice). The GPT-2 results in Figure [14](https://arxiv.org/html/2012.09852v3#S5.F14 "Figure 14 ‣ V-B Experimental Results ‣ V Evaluation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning") do not use beam search. However, our techniques can also accelerate the beam search case, because when a token (and its K, V) is pruned, it _will not be used by any beam_.
TABLE III: Comparison of SpAtten 1/8 with prior art A³ and MNNFast.
Comparisons with A³ and MNNFast. A³ [[6](https://arxiv.org/html/2012.09852v3#bib.bib6)] and MNNFast [[7](https://arxiv.org/html/2012.09852v3#bib.bib7)] are also attention accelerators exploiting sparsity. A³ first sorts each dimension of the key vectors among all keys. It then uses a pre-specified number of largest/smallest elements of the keys to conduct multiplications with a query and obtain _partial_ attention scores. If a score is smaller than a threshold, the corresponding key is pruned. MNNFast removes V vectors whose attention probabilities are smaller than a threshold. We compare the differences and performance of SpAtten 1/8, A³, and MNNFast in Table [III](https://arxiv.org/html/2012.09852v3#S5.T3 "TABLE III ‣ V-B Experimental Results ‣ V Evaluation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"). Specifically: (i) In A³ and MNNFast, all QKV vectors need to be fetched from DRAM to on-chip buffers before determining what can be pruned, so they cannot reduce DRAM access. Thus, they can only accelerate computation-bounded models (discriminative BERT), not memory-bounded models (generative GPT-2). (ii) A³ has a pre-processing overhead: sorting the keys. (iii) Token pruning in SpAtten is _global and cascade_, while that in A³ is local within one head. Therefore, only SpAtten can reduce the computation in both attention and FFN layers.
(iv) Cascade token pruning is interpretable and can be intuitively visualized step by step (Figure [22](https://arxiv.org/html/2012.09852v3#S5.F22 "Figure 22 ‣ V-C Performance Analysis ‣ V Evaluation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning")). (v) SpAtten also supports head pruning and progressive quantization.
Figure 15: End-to-end speedup of SpAtten-e2e over baselines. FC layer weights are quantized to 8 or 12 bits.
TABLE IV: FC & Attention FLOPs & Latency Breakdown on GPT-2-Medium.
The parallelism d in A³ is 64, corresponding to 128 multipliers. We compare A³ with SpAtten 1/8, which has the same number of multipliers (128), technology (40nm), and bandwidth (64 GB/s), in Table [III](https://arxiv.org/html/2012.09852v3#S5.T3 "TABLE III ‣ V-B Experimental Results ‣ V Evaluation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"). Under 1 GHz, A³'s throughput is 2×d = 128 GFLOPS. Since it has a 1.73× geomean speedup, the effective throughput is 128 × 1.73 = 221 GFLOPS. We include the same DRAM power for A³ and SpAtten 1/8. SpAtten 1/8 achieves 1.6× better throughput, 1.4× better energy efficiency, and 2.2× better area efficiency than A³. MNNFast essentially only supports local value pruning. We obtain its throughput with our reproduced simulator under the same bandwidth and number of multipliers. MNNFast was originally a Zynq-7020 FPGA design (10 W power); as an optimistic estimate, we assume an ASIC implementation consumes 10× less power, i.e., 1 W. Compared to MNNFast, SpAtten 1/8 has 3.0× higher throughput and 3.2× better energy efficiency.
Figure 16: Co-designed Transformers for SpAtten-e2e can achieve 1.9× speedup and 2.8× model size reduction over vanilla Transformers [[1](https://arxiv.org/html/2012.09852v3#bib.bib1)].
End-to-End Performance with FFN Support. To compare the end-to-end performance of SpAtten with the baselines, we extend SpAtten to support the FC layers in the Feed-Forward Network (FFN) by reusing the multiplier arrays. The extended architecture is named SpAtten-e2e. FC weights are linearly and symmetrically quantized to 12 or 8 bits and stored in DRAM. Since the FCs in the GPT-2 generation stage are matrix-vector multiplications, the performance of SpAtten-e2e is memory-bounded. As shown in Figure [15](https://arxiv.org/html/2012.09852v3#S5.F15 "Figure 15 ‣ V-B Experimental Results ‣ V Evaluation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"), on eight GPT-2-Medium benchmarks, 8-bit-FC SpAtten-e2e achieves on average 35× and 122× speedup over the TITAN Xp GPU and Xeon CPU; 12-bit-FC SpAtten-e2e achieves 24× and 83×, respectively. Table [IV](https://arxiv.org/html/2012.09852v3#S5.T4 "TABLE IV ‣ V-B Experimental Results ‣ V Evaluation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning") breaks down the computation and latency of the FC and attention parts, averaged over four GPT-2-Medium benchmarks. Head pruning is not employed in this comparison. SpAtten-e2e applies token pruning to reduce the attention FLOPs. The FC FLOPs are unchanged because token pruning can only reduce FC computation in the summarization stage (BERT), not the generation stage (GPT-2). On the GPU, attention accounts for only 14.4% of computation but 48.6% of latency, echoing our analysis in Section [II-B](https://arxiv.org/html/2012.09852v3#S2.SS2 "II-B Motivation ‣ II Background and Motivation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"). By contrast, attention can be efficiently supported on SpAtten-e2e, taking only 7.6% of latency.
Co-design Model Architecture with SpAtten. Besides the experiments above that use existing model architectures, we also explore co-designing the model architecture with SpAtten by searching a Hardware-Aware Transformer (HAT) [[20](https://arxiv.org/html/2012.09852v3#bib.bib20)] for SpAtten-e2e. The search space contains [512, 640, 768] for the embedding dimension, [512, 1024, 2048, 3072] for the FFN hidden dimension, [1, 2, 3, 4, 5, 6] for the number of decoder layers, and the last three layers for arbitrary encoder-decoder attention. Because the FC layers are the bottleneck of SpAtten's performance, we intentionally set the lower bound of the FFN hidden dimension as low as 512 to reduce the FC share. We set different latency constraints and obtain a series of co-designed Transformers, shown in Figure [16](https://arxiv.org/html/2012.09852v3#S5.F16 "Figure 16 ‣ V-B Experimental Results ‣ V Evaluation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"). They are compared with layer-number scaling and embedding-dimension scaling of vanilla Transformer models [[1](https://arxiv.org/html/2012.09852v3#bib.bib1)]. The co-designed Transformer-7 achieves 1.9× faster speed and 2.8× smaller size than the vanilla Transformer-Big model. We also show the computation breakdowns of the vanilla Transformer-Base and the co-designed Transformer-3 in Figure [17](https://arxiv.org/html/2012.09852v3#S5.F17 "Figure 17 ‣ V-C Performance Analysis ‣ V Evaluation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning"); the two models have similar accuracy. Since SpAtten-e2e supports attention with higher efficiency, the co-designed model spends more FLOPs on attention. By virtue of the increased attention capacity, the FC computation can be largely shrunk without compromising accuracy.
### V-C Performance Analysis
Roofline Analysis. To understand the distance of SpAtten from the theoretically optimal performance, we analyze its roofline model in Figure [18](https://arxiv.org/html/2012.09852v3#S5.F18 "Figure 18 ‣ V-C Performance Analysis ‣ V Evaluation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning") and compare it with the TITAN Xp GPU. We use the theoretical operational intensity: only memory access for the input QKV and attention outputs is counted. HBM provides 512 GB/s of bandwidth, so the slope of the bandwidth roof is 512 G. SpAtten has 1024 multipliers, hence the theoretical computation roof (multiplication and addition) is 2 TFLOPS. For BERT, the operational intensity is high, so performance is computation-bounded: SpAtten achieves 1.61 TFLOPS on BERT tasks, close to the computation roof and higher than the GPU's 0.02 TFLOPS. The GPT-2 models, on the contrary, have low arithmetic intensity and lie in the memory-bounded region: SpAtten achieves 0.43 TFLOPS, close to the bandwidth roof and higher than the GPU's 0.01 TFLOPS. GPU performance is far from the roofs for both models because of low utilization of the computation units. Progressive quantization improves the operational intensity; thus the SpAtten points lie to the right of the GPU's.
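The roofline argument above reduces to a one-line comparison: attainable throughput is the minimum of the compute roof and intensity × bandwidth. A toy check with SpAtten's quoted roofs (the workload numbers in the usage below are illustrative, not measured):

```python
def roofline(flops, bytes_moved, peak=2.0e12, bw=512.0e9):
    """Roofline check using SpAtten's roofs quoted above: a 2 TFLOPS
    computation roof (1024 multipliers, multiply + add, at 1 GHz) and a
    512 GB/s HBM bandwidth roof. Returns the attainable throughput and
    which roof binds the workload."""
    intensity = flops / bytes_moved               # FLOP per byte
    attainable = min(peak, intensity * bw)        # min of the two roofs
    return attainable, ("compute" if attainable == peak else "memory")
```

A high-intensity workload (1000 FLOP/byte, BERT-like) hits the compute roof; a 1 FLOP/byte workload (GPT-2-like generation) is capped by bandwidth at 512 GFLOPS.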
Figure 17: The co-designed Transformer has slightly more attention computation but much less FC computation because of SpAtten-e2e's high efficiency on attention.
Breakdown of Speedup. Figure [20](https://arxiv.org/html/2012.09852v3#S5.F20 "Figure 20 ‣ V-C Performance Analysis ‣ V Evaluation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning") shows the speedup breakdown of SpAtten over the TITAN Xp GPU on eight GPT-2 benchmarks. With a dedicated datapath, SpAtten is 22.1× faster than the GPU baseline, which must execute numerous memory instructions for attention. Cascade pruning is then applied to remove unimportant tokens and heads, reducing computation by 3.8× and 1.1×, respectively. However, performance improves by only 1.1× for both, because cascade pruning must frequently execute top-k to find unimportant tokens/heads, which becomes a bottleneck without a high-throughput top-k engine. After adding the high-parallelism top-k engine, the bottleneck is resolved and performance jumps by 3×. Finally, progressive quantization reduces the average input bitwidth, achieving another 2.8× speedup through reduced DRAM access.
Efficiency-Accuracy Trade-offs. Without accuracy loss, token pruning achieves a 1.9× pruning ratio on average across all benchmarks, while head pruning achieves 1.1×. Figure [21](https://arxiv.org/html/2012.09852v3#S5.F21 "Figure 21 ‣ V-C Performance Analysis ‣ V Evaluation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning") shows two trade-off curves between the token/head pruning ratio and accuracy, for GPT-2-Small on PTB (left) and BERT-Base on CoLA (right). For the token pruning curve, head pruning is not applied, and vice versa. We apply 12-bit quantization and disable progressive quantization for both. We can prune around 4× of tokens for PTB and 1.2× of heads for CoLA without accuracy loss; small pruning ratios even improve accuracy. Note that the pruning ratio is related to the input sentence length: since the sentence lengths of the GPT-2 benchmarks (around 1000) are much longer than those of the BERT benchmarks (less than 100), GPT-2's pruning ratios can be larger while preserving accuracy.
Design Choice Explorations. We also explore the best architectural settings for SpAtten in Figure [19](https://arxiv.org/html/2012.09852v3#S5.F19 "Figure 19 ‣ V-C Performance Analysis ‣ V Evaluation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning") on one GPT-2 application. The left side shows performance with different parallelism (comparator count) in the top-k engine. The comparator count influences the time to perform a STATE_RUN stage (see Algorithm [3](https://arxiv.org/html/2012.09852v3#alg3 "In IV-A Overview ‣ IV Hardware Architecture ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning")). Beyond parallelism 16, performance does not increase much, because 16 matches the top-k engine's input data rate from the Q×K module; with larger parallelism, top-k is no longer the bottleneck, so it barely influences overall performance. We therefore select 16 in our design. On the right side, we vary the size of the SRAM storing K and V. Since SpAtten supports contexts up to length 1024, the minimum SRAM size is 2 × 1024 × 64 × 12 bits = 196 KB, where the factor of 2 is for double buffering. Increasing the SRAM size hardly improves performance because the whole architecture is fully pipelined, and more intermediate buffering does not significantly impact throughput. Therefore, to reduce SRAM static power, we select the smallest size, 196 KB.
Figure 18: Roofline model. The performance of SpAtten is close to the bandwidth and computation roofs.
Figure 19: Design space exploration. A top-k engine with a parallelism of 16 and a 196 KB key/value buffer is sufficient for our settings.
Figure 20: Speedup breakdown of SpAtten on GPT-2 models. Cascade pruning and progressive quantization bring 3.4× and 2.8× speedup, respectively.
Figure 21: Trade-off curves between token/head pruning ratio and accuracy loss.
Interpretation and Visualization. Figure [22](https://arxiv.org/html/2012.09852v3#S5.F22 "Figure 22 ‣ V-C Performance Analysis ‣ V Evaluation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning") visualizes the cascade token pruning process on various tasks. The pruned tokens are redundant ones such as ‘it’, ‘are’, ‘to’, and ‘is’, showing the effectiveness of SpAtten’s importance score mechanism. In the first example, the tokens that survive pruning are ‘remember’, ‘admire’, and ‘resolve confusion’; we can easily interpret why the sentence is classified as positive sentiment. The second example is similarity score regression: the regressed scores range from 1 to 5, with larger scores indicating higher similarity between the two sentences. SpAtten effectively prunes away meaningless tokens such as ‘your’ and ‘is’, and keeps the token pairs across the two sentences such as ‘upset’ and ‘bothering’. The last example is generative language modeling with GPT-2, where the generated token is ‘English’. SpAtten aggressively prunes away most tokens as they are irrelevant to the generated token, keeping only ‘Du’, ‘translate’, and ‘into’. The model may find the name ‘Du’ not typical in English, so the translation target should be ‘English’.
Figure [23](https://arxiv.org/html/2012.09852v3#S5.F23 "Figure 23 ‣ V-C Performance Analysis ‣ V Evaluation ‣ SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning") shows the cumulative importance scores of every layer in a GPT-2 LM model. The important tokens, such as ‘published’, are consistent across layers. The generated token ‘papers’ heavily attends to several nearby tokens such as ‘published’ and ‘many’; it also attends to important tokens such as ‘researcher’ and ‘architecture’ even though they are far away. In summary, token pruning reduces model complexity and shows which tokens the model attends to most, bringing better interpretability than A³ and MNNFast.
Figure 22: Examples of cascade token pruning in different models and tasks. SpAtten supports both discriminative models (BERT) and generative models (GPT-2). Unlike A³ and MNNFast, cascade token pruning in SpAtten is structured and interpretable.
Figure 23: Cumulative importance scores in GPT-2. Unimportant tokens are pruned on the fly; important tokens are heavily attended.
VI Related Work
---------------
### VI-A Neural Network Pruning and Quantization
Neural networks tend to be over-parameterized, and many algorithms have been proposed to prune away the redundant parameters. Fine-grained pruning [[21](https://arxiv.org/html/2012.09852v3#bib.bib21), [22](https://arxiv.org/html/2012.09852v3#bib.bib22)] cuts off individual connections within the weight matrix and achieves high pruning ratios. However, it is not friendly to CPUs and GPUs, and requires dedicated hardware [[23](https://arxiv.org/html/2012.09852v3#bib.bib23), [24](https://arxiv.org/html/2012.09852v3#bib.bib24)] to support sparse matrix multiplication, which in turn demands extra design and automation effort [[25](https://arxiv.org/html/2012.09852v3#bib.bib25), [26](https://arxiv.org/html/2012.09852v3#bib.bib26), [27](https://arxiv.org/html/2012.09852v3#bib.bib27)]. To this end, structured pruning such as channel pruning [[28](https://arxiv.org/html/2012.09852v3#bib.bib28), [29](https://arxiv.org/html/2012.09852v3#bib.bib29)] was proposed to remove entire channels, enabling acceleration on general-purpose hardware. [[30](https://arxiv.org/html/2012.09852v3#bib.bib30)] further enabled fine-grained pruning speedup on general-purpose platforms, but it requires a complicated guided sparsity learning process. Quantization reduces data precision to shrink model size and bandwidth requirements. SpAtten is fundamentally different from existing weight pruning and quantization because there are no trainable weights in attention: we prune input tokens and heads based on their importance to the sentence, and quantization in SpAtten is applied to the input QKV rather than to weights.
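The distinction can be made concrete with a minimal sketch: weight pruning is a one-time, offline transformation of a parameter tensor, whereas SpAtten-style token pruning selects among inputs at run time, per sentence. Both functions below are illustrative, not from the paper's codebase.

```python
import numpy as np

def magnitude_prune_weights(w, sparsity):
    """Static weight pruning: zero out the smallest-magnitude weights
    once, offline, before deployment."""
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= thresh, w, 0.0)

def prune_tokens_on_the_fly(importance, keep_ratio):
    """Dynamic token pruning: attention has no trainable weights, so
    the pruned entities are input tokens, chosen per sentence at run
    time from their importance scores."""
    k = max(1, int(keep_ratio * len(importance)))
    return np.sort(np.argsort(importance)[-k:])
```

The key consequence for hardware is that the second function must run in the inference critical path, which is why SpAtten needs a high-throughput on-chip ranking engine rather than an offline compression pass.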
### VI-B Accelerators for Sparse and Quantized Neural Networks
There have been various domain-specific FPGA and ASIC accelerators proposed for neural networks [[31](https://arxiv.org/html/2012.09852v3#bib.bib31), [32](https://arxiv.org/html/2012.09852v3#bib.bib32), [33](https://arxiv.org/html/2012.09852v3#bib.bib33), [34](https://arxiv.org/html/2012.09852v3#bib.bib34), [35](https://arxiv.org/html/2012.09852v3#bib.bib35), [36](https://arxiv.org/html/2012.09852v3#bib.bib36), [37](https://arxiv.org/html/2012.09852v3#bib.bib37), [38](https://arxiv.org/html/2012.09852v3#bib.bib38), [39](https://arxiv.org/html/2012.09852v3#bib.bib39), [40](https://arxiv.org/html/2012.09852v3#bib.bib40), [41](https://arxiv.org/html/2012.09852v3#bib.bib41), [42](https://arxiv.org/html/2012.09852v3#bib.bib42), [43](https://arxiv.org/html/2012.09852v3#bib.bib43), [44](https://arxiv.org/html/2012.09852v3#bib.bib44), [45](https://arxiv.org/html/2012.09852v3#bib.bib45), [46](https://arxiv.org/html/2012.09852v3#bib.bib46), [47](https://arxiv.org/html/2012.09852v3#bib.bib47), [48](https://arxiv.org/html/2012.09852v3#bib.bib48), [49](https://arxiv.org/html/2012.09852v3#bib.bib49), [50](https://arxiv.org/html/2012.09852v3#bib.bib50), [51](https://arxiv.org/html/2012.09852v3#bib.bib51), [52](https://arxiv.org/html/2012.09852v3#bib.bib52), [53](https://arxiv.org/html/2012.09852v3#bib.bib53), [54](https://arxiv.org/html/2012.09852v3#bib.bib54), [55](https://arxiv.org/html/2012.09852v3#bib.bib55), [56](https://arxiv.org/html/2012.09852v3#bib.bib56)]. Many of them leverage sparsity to improve performance [[57](https://arxiv.org/html/2012.09852v3#bib.bib57), [58](https://arxiv.org/html/2012.09852v3#bib.bib58), [59](https://arxiv.org/html/2012.09852v3#bib.bib59), [60](https://arxiv.org/html/2012.09852v3#bib.bib60), [61](https://arxiv.org/html/2012.09852v3#bib.bib61), [62](https://arxiv.org/html/2012.09852v3#bib.bib62), [63](https://arxiv.org/html/2012.09852v3#bib.bib63), [64](https://arxiv.org/html/2012.09852v3#bib.bib64), [65](https://arxiv.org/html/2012.09852v3#bib.bib65), [66](https://arxiv.org/html/2012.09852v3#bib.bib66), [67](https://arxiv.org/html/2012.09852v3#bib.bib67), [68](https://arxiv.org/html/2012.09852v3#bib.bib68), [69](https://arxiv.org/html/2012.09852v3#bib.bib69), [70](https://arxiv.org/html/2012.09852v3#bib.bib70), [71](https://arxiv.org/html/2012.09852v3#bib.bib71), [72](https://arxiv.org/html/2012.09852v3#bib.bib72), [73](https://arxiv.org/html/2012.09852v3#bib.bib73)]. General sparse tensor algebra accelerators [[74](https://arxiv.org/html/2012.09852v3#bib.bib74), [75](https://arxiv.org/html/2012.09852v3#bib.bib75), [76](https://arxiv.org/html/2012.09852v3#bib.bib76), [77](https://arxiv.org/html/2012.09852v3#bib.bib77), [78](https://arxiv.org/html/2012.09852v3#bib.bib78), [79](https://arxiv.org/html/2012.09852v3#bib.bib79), [80](https://arxiv.org/html/2012.09852v3#bib.bib80), [81](https://arxiv.org/html/2012.09852v3#bib.bib81), [82](https://arxiv.org/html/2012.09852v3#bib.bib82), [83](https://arxiv.org/html/2012.09852v3#bib.bib83), [84](https://arxiv.org/html/2012.09852v3#bib.bib84), [85](https://arxiv.org/html/2012.09852v3#bib.bib85)] have also been proposed in recent years and can be used to process sparse FC layers. Most of this prior work focuses on exploiting weight sparsity. By contrast, SpAtten leverages activation (token/head) sparsity and employs specialized top-k engines to support on-the-fly cascade token/head pruning. Quantized neural networks are also supported by many accelerators [[86](https://arxiv.org/html/2012.09852v3#bib.bib86), [87](https://arxiv.org/html/2012.09852v3#bib.bib87), [88](https://arxiv.org/html/2012.09852v3#bib.bib88), [89](https://arxiv.org/html/2012.09852v3#bib.bib89), [90](https://arxiv.org/html/2012.09852v3#bib.bib90), [91](https://arxiv.org/html/2012.09852v3#bib.bib91), [92](https://arxiv.org/html/2012.09852v3#bib.bib92), [93](https://arxiv.org/html/2012.09852v3#bib.bib93), [94](https://arxiv.org/html/2012.09852v3#bib.bib94), [95](https://arxiv.org/html/2012.09852v3#bib.bib95)]. In those accelerators, the bitwidth is fixed at compile time, whereas SpAtten adjusts the bitwidth at run time according to the attention probability distributions.
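The run-time bitwidth adjustment mentioned above, i.e. progressive quantization, can be sketched in software as follows. This is a hypothetical model with illustrative names and thresholds; the real design fetches stored MSB/LSB words from DRAM rather than re-quantizing tensors in software.

```python
import numpy as np

def quantize(x, bits):
    # Uniform symmetric quantization to the given bitwidth (illustrative).
    scale = (2 ** (bits - 1) - 1) / (np.max(np.abs(x)) + 1e-9)
    return np.round(x * scale) / scale

def progressive_attention_probs(q, k, msb_bits=4, full_bits=8, conf=0.5):
    """Compute attention probabilities with MSBs first; if the resulting
    distribution is flat (low confidence), refetch the LSBs and recompute
    at full precision, trading extra memory access for accuracy."""
    def probs(bits):
        s = quantize(q, bits) @ quantize(k, bits).T / np.sqrt(q.shape[-1])
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    p = probs(msb_bits)
    if p.max() < conf:           # flat distribution: low confidence
        p = probs(full_bits)     # fetch LSBs and recompute
    return p
```

The check on `p.max()` captures the intuition from the paper: when attention is dominated by a few tokens, low-precision inputs already identify them correctly, so the LSB fetch can be skipped.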
### VI-C Efficient Natural Language Processing
The large computation cost of attention-based NLP models has raised much interest in improving their efficiency [[20](https://arxiv.org/html/2012.09852v3#bib.bib20), [96](https://arxiv.org/html/2012.09852v3#bib.bib96), [97](https://arxiv.org/html/2012.09852v3#bib.bib97), [98](https://arxiv.org/html/2012.09852v3#bib.bib98)]. GOBO [[95](https://arxiv.org/html/2012.09852v3#bib.bib95)] compresses the BERT model down to 3 bits, significantly reducing DRAM access. PoWER-BERT [[99](https://arxiv.org/html/2012.09852v3#bib.bib99)] prunes tokens based on the _instant_ attention probabilities of a single layer, which differs from SpAtten’s _cumulative_ attention probabilities across multiple layers; it supports neither _per-head granularity_ token pruning nor local V vector pruning. Head pruning is proposed in [[100](https://arxiv.org/html/2012.09852v3#bib.bib100), [5](https://arxiv.org/html/2012.09852v3#bib.bib5)], but these works prune head weights rather than activations as in SpAtten. The head pruning in [[5](https://arxiv.org/html/2012.09852v3#bib.bib5)] is not cascaded, since heads pruned in one layer can still appear in later layers. Our token pruning idea can also be generalized to Memory-Augmented Networks [[101](https://arxiv.org/html/2012.09852v3#bib.bib101)] to remove unimportant memory vectors and improve efficiency.
VII Conclusion
--------------
We propose SpAtten, a software-architecture co-design to enable efficient sparse and quantized attention inference. We first propose cascade token and head pruning to remove the computation and memory access of inessential tokens and heads. A novel top-k engine is designed to support on-the-fly token and head importance ranking with O(n) time complexity. Moreover, we propose progressive quantization to allow different bitwidths across layers. SpAtten achieves orders of magnitude speedup and energy savings over traditional platforms, and is 1.6× and 3.0× faster than A³ and MNNFast. We also provide detailed performance analysis, breakdowns of each technique, and design space explorations, offering insights for future NLP accelerator designs.
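The O(n) importance ranking can be modeled with quickselect, which finds the k-th largest score as a threshold in expected linear time, avoiding a full O(n log n) sort. This is a software analogue of the top-k engine (illustrative code, assuming distinct scores), not a description of the hardware pipeline itself.

```python
import random

def quickselect_threshold(scores, k):
    """Return the k-th largest score, usable as a keep/prune threshold,
    in expected O(n) time via quickselect."""
    s = list(scores)
    target = len(s) - k  # index of the cut point among sorted values
    while True:
        pivot = random.choice(s)
        less = [x for x in s if x < pivot]
        equal_count = sum(1 for x in s if x == pivot)
        if target < len(less):
            s = less                          # answer lies below the pivot
        elif target < len(less) + equal_count:
            return pivot                      # pivot is the k-th largest
        else:
            target -= len(less) + equal_count
            s = [x for x in s if x > pivot]   # answer lies above the pivot

def top_k(scores, k):
    """Indices of the k largest scores (assumes distinct scores)."""
    t = quickselect_threshold(scores, k)
    return [i for i, v in enumerate(scores) if v >= t]
```

Each quickselect iteration discards a constant fraction of candidates in expectation, which is what keeps the expected work linear in the number of tokens or heads being ranked.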
Acknowledgement
---------------
Part of this work was supported under NSF CAREER Award #1943349 and the DARPA SDH program. We thank the MIT Data Science and AI Lab (DSAIL) for supporting this research. We thank Joel Emer, Stephen Keckler, Mike O’Connor, and Donghyuk Lee for inspiring discussions.
References
----------
* [1] A. Vaswani _et al._, “Attention is all you need,” in _NeurIPS_, 2017.
* [2] J. Devlin _et al._, “BERT: Pre-training of deep bidirectional transformers for language understanding,” _arXiv_, 2018.
* [3] A. Radford _et al._, “Language models are unsupervised multitask learners,” _OpenAI Blog_, 2019.
* [4] A. Wang _et al._, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” in _EMNLP Workshop BlackboxNLP_, 2018.
* [5] E. Voita _et al._, “Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned,” _ACL_, 2019.
* [6] T. J. Ham _et al._, “A3: Accelerating attention mechanisms in neural networks with approximation,” _HPCA_, 2020.
* [7] H. Jang _et al._, “MnnFast: A fast and scalable system architecture for memory-augmented neural networks,” in _ISCA_, 2019.
* [8] P. Rajpurkar _et al._, “SQuAD: 100,000+ questions for machine comprehension of text,” _EMNLP_, 2016.
* [9] S. Merity _et al._, “Pointer sentinel mixture models,” _arXiv_, 2016.
* [10] M. P. Marcus _et al._, “Building a large annotated corpus of English: The Penn Treebank,” _Computational Linguistics_, 1993.
* [11] C. Chelba _et al._, “One billion word benchmark for measuring progress in statistical language modeling,” _arXiv_, 2013.
* [12] E. F. Tjong Kim Sang _et al._, “Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition,” in _HLT-NAACL_, 2003.
* [13] R. Socher _et al._, “Recursive deep models for semantic compositionality over a sentiment treebank,” in _EMNLP_, 2013.
* [14] D. E. Knuth, _The Art of Computer Programming, Volume 3: (2nd Ed.) Sorting and Searching_. Addison Wesley Longman Publishing Co., Inc., 1998.
* [15] Y. Kim _et al._, “Ramulator: A fast and extensible DRAM simulator,” _IEEE Computer Architecture Letters_, 2015.
* [16] P. Nilsson _et al._, “Hardware implementation of the exponential function using Taylor series,” in _NORCHIP_, 2014.
* [17] S. Salehi _et al._, “Energy and area analysis of a floating-point unit in 15nm CMOS process technology,” in _SoutheastCon_, 2015.
* [18] N. Muralimanohar _et al._, “CACTI 6.0: A tool to model large caches,” _HPL_, 2015.
* [19] M. O’Connor _et al._, “Fine-grained DRAM: Energy-efficient DRAM for extreme bandwidth systems,” in _MICRO_, 2017.
* [20] H. Wang _et al._, “HAT: Hardware-aware transformers for efficient natural language processing,” in _ACL_, 2020.
* [21] S. Han _et al._, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in _ICLR_, 2016.
* [22] T. Wang _et al._, “APQ: Joint search for network architecture, pruning and quantization policy,” in _CVPR_, 2020.
* [23] S. Pal _et al._, “OuterSPACE: An outer product based sparse matrix multiplication accelerator,” in _HPCA_, 2018.
* [24] Z. Zhang, H. Wang _et al._, “SpArch: Efficient architecture for sparse matrix multiplication,” in _HPCA_, 2020.
* [25] H. Wang _et al._, “GCN-RL circuit designer: Transferable transistor sizing with graph neural networks and reinforcement learning,” _DAC_, 2020.
* [26] H. Mao _et al._, “Park: An open platform for learning-augmented computer systems,” _NeurIPS_, 2019.
* [27] H. Wang, J. Yang, H.-S. Lee, and S. Han, “Learning to design circuits,” _NeurIPS MLSys Workshop_, 2018.
* [28] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” _ICCV_, 2017.
* [29] Y. He _et al._, “AMC: AutoML for model compression and acceleration on mobile devices,” in _ECCV_, 2018.
* [30] J. Park _et al._, “Faster CNNs with direct sparse convolutions and guided pruning,” _arXiv_, 2016.
* [31] M. H. Mubarik _et al._, “Printed machine learning classifiers,” in _MICRO_, 2020.
* [32] A. K. Ramanathan _et al._, “Look-up table based energy efficient processing in cache support for neural network acceleration,” in _MICRO_, 2020.
* [33] Z. Song _et al._, “VR-DANN: Real-time video recognition via decoder-assisted neural network acceleration,” in _MICRO_, 2020.
* [34] H. Kim _et al._, “Duplo: Lifting redundant memory accesses of deep neural networks for GPU tensor cores,” in _MICRO_, 2020.
* [35] L. Liu _et al._, “Duet: Boosting deep neural network efficiency on dual-module architecture,” in _MICRO_, 2020.
* [36] H. Mo _et al._, “TFE: Energy-efficient transferred filter-based engine to compress and accelerate convolutional neural networks,” in _MICRO_, 2020.
* [37] J. Weng, S. Liu, V. Dadu, Z. Wang, P. Shah, and T. Nowatzki, “DSAGEN: Synthesizing programmable spatial accelerators,” in _ISCA_, 2020.
* [38] M. Vilim _et al._, “Gorgon: Accelerating machine learning from relational data,” in _ISCA_, 2020.
* [39] L. Ke _et al._, “RecNMP: Accelerating personalized recommendation with near-memory processing,” in _ISCA_, 2020.
* [40] R. D. Evans _et al._, “JPEG-ACT: Accelerating deep learning via transform-based lossy compression,” in _ISCA_, 2020.
* [41] U. Gupta _et al._, “DeepRecSys: A system for optimizing end-to-end at-scale neural recommendation inference,” _ISCA_, 2020.
* [42] Y. Zhao _et al._, “SmartExchange: Trading higher-cost memory storage/access for lower-cost computation,” _ISCA_, 2020.
* [43] E. Baek _et al._, “A multi-neural network acceleration architecture,” in _ISCA_, 2020.
* [44] Y. Ji _et al._, “FPSA: A full system stack solution for reconfigurable ReRAM-based NN accelerator architecture,” in _ASPLOS_, 2019.
* [45] K. Ishida _et al._, “SuperNPU: An extremely fast neural processing unit using superconducting logic devices,” in _MICRO_, 2020.
* [46] Y. Gan _et al._, “Ptolemy: Architecture support for robust deep learning,” in _MICRO_, 2020.
* [47] M. He _et al._, “Newton: A DRAM-maker’s accelerator-in-memory (AiM) architecture for machine learning,” in _MICRO_, 2020.
* [48] S. Ghodrati _et al._, “Planaria: Dynamic architecture fission for spatial multi-tenant acceleration of deep neural networks,” in _MICRO_, 2020.
* [49] Y. Feng _et al._, “Mesorasi: Architecture support for point cloud analytics via delayed-aggregation,” in _MICRO_, 2020.
* [50] S. Singh _et al._, “NEBULA: A neuromorphic spin-based ultra-low power architecture for SNNs and ANNs,” in _ISCA_, 2020.
* [51] M. Imani _et al._, “Deep learning acceleration with neuron-to-memory transformation,” in _HPCA_, 2020.
* [52] K. Shiflett _et al._, “PIXEL: Photonic neural network accelerator,” in _HPCA_, 2020.
* [53] S. Gudaparthi _et al._, “Wire-aware architecture and dataflow for CNN accelerators,” in _MICRO_, 2019.
* [54] Y. S. Shao _et al._, “Simba: Scaling deep-learning inference with multi-chip-module-based architecture,” in _MICRO_, 2019.
* [55] B. Akin _et al._, “ZCOMP: Reducing DNN cross-layer memory footprint using vector extensions,” in _MICRO_, 2019.
* [56] C.-T. Huang _et al._, “eCNN: A block-based and highly-parallel CNN accelerator for edge inference,” in _MICRO_, 2019.
* [57] S. Han _et al._, “EIE: Efficient inference engine on compressed deep neural network,” in _ISCA_, 2016.
* [58] S. Zhang _et al._, “Cambricon-X: An accelerator for sparse neural networks,” in _MICRO_, 2016.
* [59] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An accelerator for compressed-sparse convolutional neural networks,” _ACM SIGARCH Computer Architecture News_, vol. 45, no. 2, pp. 27–40, 2017.
* [60] X. Zhou _et al._, “Cambricon-S: Addressing irregularity in sparse neural networks through a cooperative software/hardware approach,” in _MICRO_, 2018.
* [61] T.-H. Yang _et al._, “Sparse ReRAM engine: Joint exploration of activation and weight sparsity in compressed neural networks,” in _ISCA_, 2019.
* [62] A. Gondimalla _et al._, “SparTen: A sparse tensor accelerator for convolutional neural networks,” in _MICRO_, 2019.
* [63] S. Sharify _et al._, “Laconic deep learning inference acceleration,” in _ISCA_, 2019.
* [64] J. Cong _et al._, “Understanding performance differences of FPGAs and GPUs,” in _FCCM_, 2018.
* [65] D. Yang _et al._, “Procrustes: A dataflow and accelerator for sparse deep neural network training,” in _MICRO_, 2020.
* [66] Z. Gong _et al._, “SAVE: Sparsity-aware vector engine for accelerating DNN training and inference on CPUs,” in _MICRO_, 2020.
* [67] R. Hwang _et al._, “Centaur: A chiplet-based, hybrid sparse-dense accelerator for personalized recommendations,” _ISCA_, 2020.
* [68] E. Qin _et al._, “SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training,” in _HPCA_, 2020.
* [69] A. Delmas Lascorz _et al._, “Bit-Tactical: A software/hardware approach to exploiting value and bit sparsity in neural networks,” in _ASPLOS_, 2019.
* [70] M. Yan _et al._, “HyGCN: A GCN accelerator with hybrid architecture,” in _HPCA_, 2020.
* [71] L. Song _et al._, “AccPar: Tensor partitioning for heterogeneous deep learning accelerators,” in _HPCA_, 2020.
* [72] X. Zhang _et al._, “Enabling highly efficient capsule networks processing through a PIM-based architecture design,” in _HPCA_, 2020.
* [73] W. Hua _et al._, “Boosting the performance of CNN accelerators with dynamic fine-grained channel gating,” in _MICRO_, 2019.
* [74] Y. Chen _et al._, “tpSpMV: A two-phase large-scale sparse matrix-vector multiplication kernel for manycore architectures,” _Information Sciences_, 2020.
* [75] K. Hegde _et al._, “ExTensor: An accelerator for sparse tensor algebra,” in _MICRO_, 2019.
* [76] K. Kanellopoulos _et al._, “SMASH: Co-designing software compression and hardware-accelerated indexing for efficient sparse matrix operations,” in _MICRO_, 2019.
* [77] M. Zhu _et al._, “Sparse tensor core: Algorithm and hardware co-design for vector-wise sparse neural networks on modern GPUs,” in _MICRO_, 2019.
* [78] Y. Kwon _et al._, “TensorDIMM: A practical near-memory processing architecture for embeddings and tensor operations in deep learning,” in _MICRO_, 2019.
* [79] M. Mahmoud _et al._, “TensorDash: Exploiting sparsity to accelerate deep neural network training,” in _MICRO_, 2020.
* [80] N. Srivastava _et al._, “MatRaptor: A sparse-sparse matrix multiplication accelerator based on row-wise product,” in _MICRO_, 2020.
* [81] B. Asgari _et al._, “ALRESCHA: A lightweight reconfigurable sparse-computation accelerator,” in _HPCA_, 2020.
* [82] N. Srivastava _et al._, “Tensaurus: A versatile accelerator for mixed sparse-dense tensor computations,” in _HPCA_, 2020.
* [83] J. Weng _et al._, “A hybrid systolic-dataflow architecture for inductive matrix algorithms,” in _HPCA_, 2020.
* [84] A. D. Lascorz _et al._, “ShapeShifter: Enabling fine-grain data width adaptation in deep learning,” in _MICRO_, 2019.
* [85] F. Sadi _et al._, “Efficient SpMV operation for large and highly sparse matrices using scalable multi-way merge parallelization,” in _MICRO_, 2019.
* [86] J. Lee _et al._, “UNPU: A 50.6 TOPS/W unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision,” in _ISSCC_, 2018.
* [87] P. Judd _et al._, “Stripes: Bit-serial deep neural network computing,” in _MICRO_, 2016.
* [88] S. Sharify _et al._, “Loom: Exploiting weight and activation precisions to accelerate convolutional neural networks,” in _DAC_, 2018.
* [89] Y. Umuroglu _et al._, “BISMO: A scalable bit-serial matrix multiplication overlay for reconfigurable computing,” in _FPL_, 2018.
* [90] T. Rzayev _et al._, “DeepRecon: Dynamically reconfigurable architecture for accelerating deep neural networks,” in _IJCNN_, 2017.
* [91] H. Sharma _et al._, “Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural network,” in _ISCA_, 2018.
* [92] Nvidia, “Nvidia tensor cores,” in _Nvidia_, 2018.
* [93] N. P. Jouppi _et al._, “In-datacenter performance analysis of a tensor processing unit,” in _ISCA_, 2017.
* [94] Z. Song _et al._, “DRQ: Dynamic region-based quantization for deep neural network acceleration,” in _ISCA_, 2020.
* [95] A. H. Zadeh _et al._, “GOBO: Quantizing attention-based NLP models for low latency and energy efficient inference,” in _MICRO_, 2020.
* [96] Z. Yan _et al._, “MicroNet for efficient language modeling,” _JMLR_, 2020.
* [97] J. Gu _et al._, “Non-autoregressive neural machine translation,” in _ICLR_, 2018.
* [98] H. Wang, “Efficient algorithms and hardware for natural language processing,” _Massachusetts Institute of Technology_, 2020.
* [99] S. Goyal _et al._, “PoWER-BERT: Accelerating BERT inference for classification tasks,” _ICML_, 2020.
* [100] P. Michel _et al._, “Are sixteen heads really better than one?” in _NeurIPS_, 2019.
* [101] S. Sukhbaatar _et al._, “End-to-end memory networks,” in _NeurIPS_, 2015.