Title: Evaluating Transformer’s Ability to Learn Mildly Context-Sensitive Languages

URL Source: https://arxiv.org/html/2309.00857


Shunjie Wang

Independent Researcher

shunjiewang@hotmail.com

Shane Steinert-Threlkeld

University of Washington

shanest@uw.edu

Abstract

Although Transformers perform well in NLP tasks, recent studies suggest that self-attention is theoretically limited in learning even some regular and context-free languages. These findings motivated us to think about their implications for modeling natural language, which is hypothesized to be mildly context-sensitive. We test Transformers' ability to learn mildly context-sensitive languages of varying complexity, and find that they generalize well to unseen in-distribution data, but their ability to extrapolate to longer strings is worse than that of LSTMs. Our analyses show that the learned self-attention patterns and representations modeled dependency relations and demonstrated counting behavior, which may have helped the models solve the languages.

1 Introduction

Transformers (Vaswani et al., 2017) have demonstrated versatile language processing capabilities and enabled a wide range of exciting NLP applications since their inception. However, Hahn (2020) shows that hard self-attention Transformers, as well as soft attention under some assumptions, eventually fail at modeling periodic regular languages as well as hierarchical context-free languages when presented with long enough sequences.

These theoretical limitations have since sparked the interest of the formal language community. A variety of formal languages, as well as formal models of computation such as circuits, counter automata, and predicate logic, have been studied to characterize the expressiveness of the architecture.

When it comes to probing an architecture’s linguistic adequacy, a particular class of formal languages and formalisms naturally comes into sight: the mildly context-sensitive class (Joshi, 1985; Kallmeyer, 2010), the formal complexity class hypothesized to have the necessary expressive power for natural language.


Figure 1: How certain MCSGs fit on the Chomsky hierarchy of languages in terms of their weak generative capacities (Stabler, 2011): CFL ⊂ L(TAG) = L(CCG) ⊂ L(MG) = L(LCFRS) = L(MCFG) ⊂ CSL. No formalism generates the largest set of MCSLs.


Figure 2: Swiss German subordinate clauses allow n accusative NPs before m dative NPs, followed by n corresponding accusative-object-taking verbs before m corresponding dative-object-taking verbs. Shieber (1985) defined a homomorphism for Swiss German such that intersecting with the regular wa*b*xc*d*yz yields the non-context-free wa^n b^m xc^n d^m yz, which contradicts the CFL's closure property under intersection with regular languages.

This motivates us to study the Transformer’s ability to learn a variety of linguistically significant, mildly context-sensitive string languages of varying degrees of complexities. Specifically, we ask two research questions:

    1. How well do Transformers learn MCSLs from finite examples, both in terms of generalizing to in-distribution data and extrapolating to strings longer than those seen during training?
    2. What kinds of meaningful representations or patterns do the models learn?

Our contributions are as follows: we extend current empirical studies on formal language learning with Transformers to the mildly context-sensitive class, and find that they generalize well to unseen strings within the same length range as the training data, but their ability to extrapolate is worse than that of LSTMs. We also present analyses suggesting that self-attention learned symbol dependency relations, and that the representations encoded count information for some complex languages, both of which may have been useful for solving languages in this class.

2 Mildly Context-Sensitive Hypothesis

In search of computational models that have adequate power to generate natural language sentences while also assigning meaningful structural descriptions like trees to them, Chomsky (1956, 1959) defined context-sensitive grammars (CSG) and context-free grammars (CFG) as intermediate systems that lie between two extremities: the Turing machine which overgenerates and the finite-state automaton which undergenerates. A question that immediately follows the definitions is whether the CFG could serve as a computational model for natural language, which had been an open question for a few decades until it was settled by evidence such as Swiss German cross-serial dependency (Figure 2; Shieber, 1985) and Bambara vocabulary (Culy, 1985), which demonstrated the existence of natural languages that are supra-context-free.

However, although more restricted than the Turing machine, the CSG is also undesirable, as it still has far more generative capacity than natural languages should ever need and, as a result, is hard to parse efficiently. This motivated Tree-Adjoining Grammars (TAG; Joshi, 1985) and Combinatory Categorial Grammars (CCG; Steedman, 1987), among a few other weakly equivalent formalisms, that extend the CFG with just enough additional descriptive power to treat phenomena like Swiss German cross-serial dependency, while stopping short of the full CSG so that parsing can remain efficient. The properties of formalisms with such additional power roughly characterize a class of languages and grammars that Joshi (1985) calls mildly context-sensitive languages/grammars (MCSL/MCSG).

Another related line of weakly equivalent formalisms, such as Linear Context-Free Rewriting Systems (LCFRS; Vijay-Shanker et al., 1987), Multiple Context-Free Grammars (MCFG; Seki et al., 1991), and Minimalist Grammars (MG; Stabler, 1997), further extends expressive power beyond that of the TAG, motivated by more complex phenomena like German scrambling (Becker et al., 1991). While no single grammar formalism generates the largest possible set of MCSLs satisfying the formal characterization in Kallmeyer (2010), these are the closest approximations we have. Such differences in expressiveness form a subhierarchy within this class (Figure 1), and the languages recognizable by TAGs and MGs and their respective equivalents, denoted L(TAG) and L(MG), are therefore the language complexity classes that we examine in this work.

Therefore, one hypothesis for the complexity of natural language is that it is mildly context-sensitive (henceforth MCS). There have been some challenges citing linguistic data that require more power than MCS (Radzinski, 1991; Michaelis and Kracht, 1997; Kobele, 2006), but the validity of these claims remains controversial (Bhatt and Joshi, 2004; Clark and Yoshinaka, 2012), or no consensus has been reached on the need for more power (Graf, 2021). Thus, while acknowledging that whether the MCS hypothesis is true remains an open question, it is a reasonably good hypothesis that allows us to analyze natural languages meaningfully.

3 Related Work

Regarding the expressiveness of the Transformer, Pérez et al. (2019, 2021) established the Turing-completeness of the hard-attention Transformer, and Bhattamishra et al. (2020b) proved the Turing-completeness of soft attention by showing that such Transformers can simulate RNNs. However, these results assumed arbitrary precision for weights and activations, made certain departures from the original architecture, and relied on their own particular task definitions.

In a practical case of formal language learning from finite examples, the Transformer’s ability is known to be limited, even in the regular language class. Empirically, it was shown that Transformers of different self-attention variants have limited abilities to learn certain star-free languages, as well as non-star-free, periodic regular languages (Hahn, 2020; Bhattamishra et al., 2020a), although the latter may still be recognized theoretically with a simple modification to the architecture (Chiang and Cholak, 2022).

As for context-free languages with hierarchical structural analyses such as Dyck-n, Ebrahimi et al. (2020) empirically demonstrated one Transformer encoder setup in which such languages may be learned, and observed stack-like behavior in self-attention patterns. Yao et al. (2021) proved and empirically showed that self-attention can learn Dyck-n with a bounded depth, although the boundedness reduces the CFL to a regular language. Additional empirical work on Dyck-n includes Bernardy et al. (2021) and Wen et al. (2023), among others.

Besides language recognition guided by the Chomsky hierarchy, another line of research investigates other alternative formal languages, such as counter-recognizable languages (Bhattamishra et al., 2020a) and first-order logic (Merrill and Sabharwal, 2022; Chiang et al., 2023), to characterize the expressiveness of Transformers.

This work introduces MCSGs into these empirical explorations by assessing the ability of Transformers, in certain learning settings, to learn a variety of string languages recognizable by MCSGs of different complexities, which have not yet been studied as a whole the way languages in other classes have. Occasionally, studies on the Transformer's learning ability have worked with data that happen to fall into this class, including a few counter languages that are also TAG-recognizable (Bhattamishra et al., 2020a), discontinuities in Dutch (Kogkalidis and Wijnholds, 2022), reduplication (Deletang et al., 2023), as well as a crossed-parentheses language inspired by crossing dependency (Papadimitriou and Jurafsky, 2023).[^1] This work complements these and the other aforementioned related work by presenting a systematic evaluation guided by basic MCSL constructions and the subhierarchy within the class, as well as by comparing each basic construction against a less complex and a more complex counterpart.

[^1]: However, their construction is not based on the language on which Shieber (1985) based his argument, and a CFG analysis might exist for their string language.

4 Methodology

4.1 Task Setups

Our experiments use the original soft-attention Transformer with sinusoidal positional encoding (henceforth PE) as defined in Vaswani et al. (2017). The architecture consists of just self-attention and feedforward sublayers, and depending on the task, we use either unidirectional or bidirectional attention. We avoid dropout, as it could negatively impact performance on the simple abstract formal languages we work with. For each experiment, we also train an LSTM (Hochreiter and Schmidhuber, 1997) baseline for comparison. The implementations for both the Transformer and the LSTM are from PyTorch (Paszke et al., 2019).[^2]

[^2]: The code for the experiments is available at: https://github.com/shunjiewang/mcsl-transformer

Table 1: Languages this work studies, organized by complexity and by the basic MCSL construction each represents or resembles.

We use one of the following two established tasks for each of the languages depending on which better enables learning for the data. We further elaborate on the reasoning for the choice of task for each language in Appendix A.

Binary Classification (Bidirectional Attention)

Following Weiss et al. (2018), a model g is said to recognize a formal language L ⊆ Σ* if f(g(w)) = 1 for all and only strings w ∈ L. In our case, g is a Transformer encoder, and g(w) is the representation of a positive or negative example w, obtained by averaging the encoder outputs over all symbols in w. f is a fully connected linear layer that maps the pooled representation to a real number, which is then passed through the sigmoid function to obtain the class label 0 or 1 using a threshold of 0.5. The loss is the BCE loss between the prediction and the target label.
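As a concrete illustration, here is a minimal PyTorch sketch of this setup. The hyperparameters (model size, heads, layers) and the omission of the embedding layer are our simplifications for illustration, not the paper's configuration:

```python
import torch
import torch.nn as nn

class Recognizer(nn.Module):
    """Bidirectional encoder g plus linear head f: mean-pool the per-symbol
    outputs, map the pooled vector to a real number, then apply a sigmoid."""
    def __init__(self, d_model: int = 32):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=64,
            dropout=0.0, batch_first=True)   # no dropout, as in the paper
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):
        h = self.encoder(x)                  # (batch, seq_len, d_model)
        pooled = h.mean(dim=1)               # average over all symbols in w
        return torch.sigmoid(self.head(pooled)).squeeze(-1)

model = Recognizer()
x = torch.randn(8, 10, 32)                   # 8 embedded strings of length 10
probs = model(x)                             # membership probabilities
labels = (probs > 0.5).long()                # class label via 0.5 threshold
loss = nn.BCELoss()(probs, labels.float())   # BCE against (here, dummy) targets
```

In practice the inputs would come from a learned symbol embedding plus PE rather than random vectors.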

Next Character Prediction (Unidirectional Attention)

For languages in which training with positive and negative examples is ineffective, because too few examples are available or the set of possible negative examples is too large, we use this task, which requires only positive examples. Given a valid prefix of a string in the language at each timestep, the model is tasked with predicting the set of acceptable next symbols, or predicting [EOS] if the prefix is already in the language. To do this, the Transformer outputs for each symbol in the string are passed in parallel through a linear layer and then the sigmoid function, using a threshold of 0.5, to obtain k-hot vectors of dimension |Σ ∪ {[EOS]}|, where each dimension represents whether a symbol is in the set of next characters at that timestep. We consider the prediction for a string correct only if all predicted k-hot vectors are correct. The loss is each symbol's BCE loss between the predicted and target k-hot vectors, summed and then averaged. A look-ahead mask is applied to prevent self-attention from attending to later positions; since the mask indirectly offers positional information (Irie et al., 2019; Bhattamishra et al., 2020a; Haviv et al., 2022), PE is optional for the languages we study with this task, and the architecture is essentially a Transformer decoder.
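The k-hot targets and the all-vectors-correct criterion can be sketched as follows (pure Python; the helper names are ours, not the paper's):

```python
EOS = "[EOS]"

def k_hot(next_symbols, alphabet):
    """Encode the set of acceptable next symbols as a k-hot vector
    over the alphabet extended with [EOS], one dimension per symbol."""
    return [1 if s in next_symbols else 0 for s in alphabet]

def string_correct(predicted, target):
    """A string counts as correct only if every per-timestep
    k-hot vector matches exactly."""
    return len(predicted) == len(target) and all(
        p == t for p, t in zip(predicted, target))

alphabet = ["a", "b", EOS]
# E.g. for the string "ab" in a^n b^n: after the prefix "a" the valid
# next symbols are {a, b}; after the full string "ab", only [EOS] is valid.
target = [k_hot({"a", "b"}, alphabet), k_hot({EOS}, alphabet)]
```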

4.2 Data

Following the categorization in Ilie (1997), we are interested in three basic constructions that should be contained in MCSLs:

    1. copying: ww
    2. crossing dependency: a^n b^m c^n d^m
    3. multiple agreements: a^n b^n c^n
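The three constructions can be written down directly as string builders (a trivial sketch; the function names are ours):

```python
def copying(w: str) -> str:
    """ww: the string concatenated with itself."""
    return w + w

def crossing(n: int, m: int) -> str:
    """a^n b^m c^n d^m: n a's and m b's, matched crosswise by n c's and m d's."""
    return "a" * n + "b" * m + "c" * n + "d" * m

def multiple_agreements(n: int) -> str:
    """a^n b^n c^n: three blocks of equal length."""
    return "a" * n + "b" * n + "c" * n
```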

All these languages are TAG-recognizable. We try to compare each of the three languages with a similar but less complex context-free language, as well as a similar but more complex MG-recognizable language. We also investigate two relevant scramble languages that are also felicitously MCFG-recognizable.

4.2.1 Copying

The copy language {ww | w ∈ {a, b}*} is in L(TAG) (Joshi, 1985). Its context-free counterpart is the palindrome {ww^R | w ∈ {a, b}*}, where w^R is the reverse of the string w. Joshi (1985) indicates that the double copy language www is not in L(TAG). However, any multiple copying w^k is in L(MG) (Jäger and Rogers, 2012). Thus, we study www as the simplest strictly MG-recognizable language for copying.

We use the binary classification setup for this family of languages. To generate the strings, we enumerate each possible w in our chosen |w| range and then duplicate w to produce ww^R, ww, and www. Negative examples are random strings sampled from {a, b}* in the same length range as the positive examples, excluding the positive examples themselves.
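A sketch of this generation procedure (pure Python; the length parameters, sample sizes, and seed handling are our illustrative choices):

```python
import itertools
import random

def copy_family(min_len, max_len):
    """Enumerate every w over {a, b} in the length range and emit the
    palindrome ww^R, the copy ww, and the double copy www."""
    for length in range(min_len, max_len + 1):
        for w in map("".join, itertools.product("ab", repeat=length)):
            yield w + w[::-1], w + w, w + w + w

def sample_negatives(positives, lengths, k, seed=0):
    """Sample k random strings over {a, b} with lengths drawn from the
    same range as the positives, rejecting any positive example."""
    rng = random.Random(seed)
    out = set()
    while len(out) < k:   # assumes k is small relative to the string space
        s = "".join(rng.choice("ab") for _ in range(rng.choice(lengths)))
        if s not in positives:
            out.add(s)
    return out
```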

4.2.2 Crossing Dependency

The cross-serial dependency language a^n b^m c^n d^m is in L(TAG) (Joshi, 1985). Its context-free counterpart is the nesting dependency language a^n b^m c^m d^n. We use the next character prediction task for these two languages because the potential set of negative examples is too large.

Following Gers and Schmidhuber (2001), to recognize the nested a^n b^m c^m d^n: while the input is a, the next valid character is a or b. As soon as the input becomes b, the value of n is determined, and the next valid symbol is b or c. Once the input becomes c, the value of m is also determined, and the next characters are deterministic from this point on. Lastly, we output [EOS] as soon as the final symbol of the input is consumed. Following the notation in Suzgun et al. (2019), we denote the above input-target scheme as follows, where ⊣ denotes [EOS]:

a^n b^m c^m d^n → (a/b)^n (b/c)^m c^{m-1} d^n ⊣

Trivially, this scheme can be generalized to the crossing dependency language:

a^n b^m c^n d^m → (a/b)^n (b/c)^m c^{n-1} d^m ⊣
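Both schemes can be sketched as sequences of target sets, one per input position (pure Python; the function names are ours):

```python
EOS = "[EOS]"

def nested_targets(n, m):
    """a^n b^m c^m d^n -> (a/b)^n (b/c)^m c^{m-1} d^n ⊣ as the set of
    valid next symbols after each input position."""
    return ([{"a", "b"}] * n + [{"b", "c"}] * m
            + [{"c"}] * (m - 1) + [{"d"}] * n + [{EOS}])

def crossing_targets(n, m):
    """a^n b^m c^n d^m -> (a/b)^n (b/c)^m c^{n-1} d^m ⊣."""
    return ([{"a", "b"}] * n + [{"b", "c"}] * m
            + [{"c"}] * (n - 1) + [{"d"}] * m + [{EOS}])
```

Each target sequence has the same length as the input string (2n + 2m positions), ending in the [EOS] target.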

Table 2: Palindrome/Copy/2-Copy: Transformers surpass LSTMs on in-distribution tests but fall to random guessing on OOD data (the null accuracy for OOD is 50%).

4.2.3 Multiple Agreements

Being simple counter languages, the multiple agreements family has previously been studied extensively. We complement the related work by adding an MG-recognizable language and by providing additional analyses of the learned patterns.

Both a^n b^n c^n and a^n b^n c^n d^n are TAG-recognizable. Their context-free counterpart is a^n b^n. Joshi (1985) indicates that a^n b^n c^n d^n e^n is not in L(TAG), but Stabler (1997) shows that it is in L(MG). Moreover, σ₁^n … σ_k^n for any arbitrary k is MG-recognizable (Jäger and Rogers, 2012). Thus, we study a^n b^n c^n d^n e^n as the simplest strictly MG-recognizable language in this family.

Since the dataset is very small, as only one positive example is available for each n, we use the next character prediction task, which requires fewer examples and is also an established setup for learning these languages. Gers and Schmidhuber (2001) and Suzgun et al. (2019) have proposed the input-target schemes for a^n b^n, a^n b^n c^n, and a^n b^n c^n d^n:

a^n b^n → (a/b)^n b^{n-1} ⊣
a^n b^n c^n → (a/b)^n b^{n-1} c^n ⊣
a^n b^n c^n d^n → (a/b)^n b^{n-1} c^n d^n ⊣

Trivially, this scheme can be generalized to the MG-recognizable a^n b^n c^n d^n e^n:

a^n b^n c^n d^n e^n → (a/b)^n b^{n-1} c^n d^n e^n ⊣

That is, the next valid character is a or b as long as the input is a; once b occurs in the input, n is determined and the next characters are deterministic from that point on.
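The whole family follows one pattern, which can be sketched as a single function (pure Python, our naming; k = 5 gives the MG-recognizable member):

```python
EOS = "[EOS]"

def agreement_targets(n, k):
    """sigma_1^n ... sigma_k^n -> (s1/s2)^n s2^{n-1} s3^n ... s_k^n ⊣,
    covering a^n b^n (k=2) up to a^n b^n c^n d^n e^n (k=5), as the set of
    valid next symbols after each input position."""
    syms = "abcde"[:k]
    targets = [set(syms[:2])] * n + [{syms[1]}] * (n - 1)
    for s in syms[2:]:
        targets += [{s}] * n      # each later block is fully deterministic
    return targets + [{EOS}]
```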

4.2.4 Scramble Languages

Joshi (1985) argued that MCSGs should handle only limited cross-serial dependency, like the type found in Dutch (Bresnan et al., 1982) and Swiss German, but not as in MIX = {w ∈ {a, b, c}* | |w|_a = |w|_b = |w|_c}, where |w|_σ denotes the number of occurrences of symbol σ in string w. MIX is thus the language of strings with an equal number of a's, b's, and c's in any order, and can be seen as scrambled a^n b^n c^n.


Figure 3: Anti-diagonal alignment for $ww^{\mathcal{R}}$, and forward and backward alignment for $ww$.

MIX resembles an extreme case of free word order and is not recognizable by TAG (Kanazawa and Salvati, 2012). However, it turns out that the language is in $L(\text{MCFG})$ (Salvati, 2015). A related language, $O_2 = \{w \in \{\mathtt{a}, \mathtt{b}, \mathtt{c}, \mathtt{d}\}^* \mid |w|_\mathtt{a} = |w|_\mathtt{c} \wedge |w|_\mathtt{b} = |w|_\mathtt{d}\}$, which can be seen as scrambled $\mathtt{a}^n\mathtt{b}^m\mathtt{c}^n\mathtt{d}^m$, is also in $L(\text{MCFG})$ (Salvati, 2015).
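Both definitions translate directly into membership predicates; a minimal sketch (the function names are ours):

```python
from collections import Counter

def in_mix(w: str) -> bool:
    """MIX = { w in {a,b,c}* : |w|_a = |w|_b = |w|_c }."""
    c = Counter(w)
    return set(w) <= {"a", "b", "c"} and c["a"] == c["b"] == c["c"]

def in_o2(w: str) -> bool:
    """O2 = { w in {a,b,c,d}* : |w|_a = |w|_c and |w|_b = |w|_d }."""
    c = Counter(w)
    return (set(w) <= {"a", "b", "c", "d"}
            and c["a"] == c["c"] and c["b"] == c["d"])

assert in_mix("bcaabc") and not in_mix("aabc")
assert in_o2("badc") and not in_o2("abca")
```

Note that order is irrelevant in both predicates: only the symbol counts matter, which is what makes these "scramble" languages.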


Figure 4: $www$ expects six alignments in total, usually distributed across heads, where half are better aligned and the other half partially aligned.

We investigate the two scramble languages using the binary classification task, as the model may benefit from seeing the whole string at once to directly model the occurrences of each symbol. For MIX, the positive examples exhaustively enumerate all permutations of $\mathtt{a}^n\mathtt{b}^n\mathtt{c}^n$ in the chosen $n$ range, and the negative examples enumerate the remaining strings in $\{\mathtt{a}, \mathtt{b}, \mathtt{c}\}^*$ within the same range, to help the model eliminate most wrong hypotheses. For $O_2$, we enumerate permutations of $\mathtt{a}^n\mathtt{b}^m\mathtt{c}^n\mathtt{d}^m$ in the chosen $n, m$ range, and the negative examples come from the remaining strings in $\{\mathtt{a}, \mathtt{b}, \mathtt{c}, \mathtt{d}\}^*$ in the same range. Negative examples for both languages also include strings where $|w|_\sigma = 0$.
The train-test split is performed over each sequence length separately rather than over the entire dataset, so strings of different lengths appear in all splits.
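The enumeration described above can be sketched as follows; `mix_datasets` is a hypothetical helper, and the exhaustive enumeration is only feasible for small lengths:

```python
from itertools import product

def mix_datasets(n_range):
    """Enumerate MIX positives and negatives per string length.

    Positives: all distinct permutations of a^n b^n c^n for n in n_range.
    Negatives: the remaining strings of the same lengths over {a,b,c},
    which naturally include strings where some symbol never occurs.
    """
    by_length = {}
    for n in n_range:
        length = 3 * n
        pos, neg = [], []
        for chars in product("abc", repeat=length):
            w = "".join(chars)
            is_pos = w.count("a") == w.count("b") == w.count("c") == n
            (pos if is_pos else neg).append(w)
        # the 70-15-15 split is then applied within each length bucket
        by_length[length] = (pos, neg)
    return by_length

data = mix_datasets([1, 2])
pos3, neg3 = data[3]
assert len(pos3) == 6                 # all permutations of "abc"
assert len(pos3) + len(neg3) == 3**3  # everything of length 3
```

Splitting within each length bucket, as the text describes, guarantees that every split contains strings of every in-range length.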

Table 1 summarizes all languages studied in this work, and we also include detailed dataset statistics in Appendix B.

5 Experiments

Each model is evaluated on three sets: an in-distribution held-out test set, an out-of-distribution (henceforth OOD) set with strings longer than those seen during training, and a second OOD set with even longer strings. We report the mean and standard deviation of test accuracies over three runs with different random seeds.

We then visualize the heads from our best-performing runs that most clearly demonstrate highly interpretable patterns upon visual inspection, but note that all visualized patterns do recur across different configurations.

5.1 Copying

Our Transformer models learned $ww$ and the related $ww^{\mathcal{R}}, www$ with high accuracy and outperformed LSTMs on in-distribution tests. However, in the two OOD tests, only LSTMs were able to extrapolate, while the accuracies of the Transformers were close to random guessing (Table 2).

We identified certain heads that align the substrings along different diagonals to measure the similarity of a string to itself (Figure 3). For $ww$, the gold alignment is to align the first $w$ against the second and vice versa. In the visualized run, we find that among all positive examples in the test set, the highest query-key attention is on the gold alignment 93.4% of the time. For $ww^{\mathcal{R}}$, the gold alignment expects the head of a substring to attend to the tail of the other substring, resulting in an anti-diagonal pattern; the highest attention is on the gold alignment diagonals 94.8% of the time among all positive examples in the test set.
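The gold-alignment accuracy reported here can be computed mechanically from an attention map; a sketch under the assumption that `attn` is a row-stochastic $L \times L$ matrix for a positive example of even length $L$ (the function name is ours):

```python
import numpy as np

def gold_alignment_accuracy(attn: np.ndarray, reversed_copy: bool = False) -> float:
    """Fraction of query positions whose argmax key is the gold alignment.

    For ww, position i in one half aligns with i +/- L/2 (a diagonal);
    for ww^R, position i aligns with L-1-i (the anti-diagonal).
    """
    L = attn.shape[0]
    half = L // 2
    hits = 0
    for i in range(L):
        if reversed_copy:
            gold = L - 1 - i
        else:
            gold = i + half if i < half else i - half
        hits += int(np.argmax(attn[i]) == gold)
    return hits / L

# A toy map that attends exactly on the ww gold diagonals scores 1.0.
L = 6
toy = np.zeros((L, L))
for i in range(L):
    toy[i, (i + L // 2) % L] = 1.0
assert gold_alignment_accuracy(toy) == 1.0
```

The per-example scores are then averaged over all positive test examples to obtain figures like the 93.4% and 94.8% above.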

For $www$, a gold alignment requires each substring to attend to the other two, resulting in a total of six alignments, which we did obtain during training, as shown in Figure 4. The six alignments are distributed across heads in a multi-head model; only three of the six are well aligned, while the other partial alignments appear to be auxiliary, if they were useful for inference at all. In the visualized model, the three clearer alignments in the first head match the gold alignment 86.1% of the time on average over positive examples in the test set, while the accuracy is 39.0% for the other three partial alignments.

Table 3: Nesting/Crossing: not using sinusoidal PE helped with extrapolation. Note that the two OOD sets include both strings where only one of $n, m$ is OOD, e.g., $\mathtt{a}^1\mathtt{b}^{100}\mathtt{c}^1\mathtt{d}^{100}$, and strings where both $n, m$ are OOD, e.g., $\mathtt{a}^{51}\mathtt{b}^{100}\mathtt{c}^{51}\mathtt{d}^{100}$.


Figure 5: Crossing shows a checkerboard pattern as the result of correctly identifying pairwise dependents, while nesting has a similar pattern except for different symbol dependencies.

5.2 Crossing Dependency

The in-distribution tests were solved almost perfectly by all three model setups, including a Transformer decoder setup where we remove PE and rely only on indirect positional information from the causal mask. For OOD tests, we evaluate models on strings where both $n$ and $m$ are OOD, as well as strings in which only one of $n, m$ is OOD. We find that removing PE and relying only on the causal mask helped with the Transformer's extrapolation, which is consistent with the findings of Bhattamishra et al. (2020a) on other languages. However, this ability is still worse than that of LSTMs (Table 3).
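The way a causal mask alone can carry positional information is easy to see in a toy attention computation: even with identical query-key scores and no PE, row $i$ averages over exactly $i+1$ visible values, so each position produces a distinct output. A minimal sketch, not the paper's model:

```python
import numpy as np

def causal_uniform_attention(values: np.ndarray) -> np.ndarray:
    """Attention with identical query-key scores under a causal mask.

    Row i can only attend to positions 0..i, so even with uniform
    scores the output is a running mean -- position-dependent
    information with no positional encoding at all.
    """
    L = len(values)
    scores = np.zeros((L, L))
    scores[np.triu_indices(L, k=1)] = -np.inf      # mask future positions
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ values

v = np.array([1.0, 0.0, 0.0, 0.0])
out = causal_uniform_attention(v)
# Running mean of the prefix: 1, 1/2, 1/3, 1/4 -- each position differs.
assert np.allclose(out, [1.0, 0.5, 1 / 3, 0.25])
```

This is one intuition for why a decoder without PE can still track counts and positions, while an encoder with full bidirectional attention cannot.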

Figure 5 shows that the Transformer's attention formed a checkerboard pattern for recognizing crossing, the result of each symbol in the query attending to every occurrence of itself and its dependent in the key, but not to non-dependents. For nesting, the pattern is very similar, except that the dependency relations differ. Models trained with or without sinusoidal PE end up learning very similar patterns, except that without PE, the attention from one query symbol to every occurrence of a key symbol is uniformly distributed, resulting in a stack of color bands on the attention map of the visualized head.


Figure 6: Every occurrence of a query symbol attends to every occurrence of a key symbol using similar attention values, thus low variance among the values. The query symbols also attend to different key symbols with different weights, thus resulting in a grid pattern.

The attention maps suggest that, in the optimal case, each query symbol identifies the key symbol on which it is pairwise dependent and then, within the causally visible portion of the string, distributes its attention over every occurrence of itself and its dependent, giving zero attention to the other pair. As an example, we measure how closely the visualized head in the crossing model without PE approaches this optimum: across all in-distribution test datapoints, keys that would receive 100% of a query symbol's attention under the optimum received 93.0% of it on average.
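This measurement can be sketched as an attention-mass computation over the causally visible prefix; `dependent_attention_mass` and the dependency map are our illustrative names:

```python
import numpy as np

def dependent_attention_mass(attn: np.ndarray, string: str, deps: dict) -> float:
    """Mean attention mass a query places on itself and its dependent symbol.

    `deps` maps each symbol to its pairwise dependent, e.g. for crossing
    a^n b^m c^n d^m: {'a': 'c', 'c': 'a', 'b': 'd', 'd': 'b'}.
    Only the causally visible prefix of each row is considered.
    """
    masses = []
    for i, q in enumerate(string):
        allowed = {q, deps[q]}
        visible = attn[i, : i + 1]
        visible = visible / visible.sum()  # renormalize over the prefix
        mass = sum(p for j, p in enumerate(visible) if string[j] in allowed)
        masses.append(mass)
    return float(np.mean(masses))

# A head attending uniformly over visible dependents scores 1.0.
s = "abcd"
deps = {"a": "c", "c": "a", "b": "d", "d": "b"}
attn = np.zeros((4, 4))
for i in range(4):
    ok = [j for j in range(i + 1) if s[j] in {s[i], deps[s[i]]}]
    for j in ok:
        attn[i, j] = 1 / len(ok)
assert dependent_attention_mass(attn, s, deps) == 1.0
```

A score of 0.93 under this metric would mean the head is close to, but not exactly at, the checkerboard optimum described above.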

5.3 Multiple Agreements

We follow the established finding in Bhattamishra et al. (2020a) and only consider the Transformer decoder without sinusoidal PE for these languages, as training with PE was ineffective in pilot experiments. Transformers without PE demonstrated the ability to extrapolate, although on average they are still not as good or consistent as LSTMs (Table 4).

Table 4: Multiple Agreements: Transformers without PE demonstrated the ability to extrapolate, though in general still not as good or consistent as LSTMs. *The in-distribution set is held out and only has strings with $n \in \{5, 15, 25, 35, 45\}$.

We annotate the mean and standard deviation in percentages among all attention values from one query symbol to one key symbol in Figure 6. It can be observed that every input alphabet symbol attends to different symbols in the output alphabet using a different weight, but the attention values to each occurrence of the same symbol are similar, and thus have low variance, culminating in the grid pattern we see on the maps. The differences in weights from a query symbol suggested a particular dependency analysis learned by that run.

Unlike copying and crossing, the multiple agreements strings do not have a definite dependency relation to learn, and many possible analyses exist for the same string: e.g., any two symbols in the alphabet could be pairwise dependent, or all symbols could be dependent on each other (cf. Joshi (1985)). Thus, from run to run, which key symbol a query symbol attends to most, and how much attention each symbol receives, can vary. The lack of a gold analysis also means the model cannot simply focus on a subset of the symbols; it is crucial for every symbol to attend to every symbol, as we see here.


Figure 7: MIX resembles multiple agreements in that the attention weights from one query symbol to one key symbol are similar. $O_2$ resembles the checkerboard in crossing, although it does not ignore non-dependents.

5.4 Scramble Languages

We use the macro F1 score as the metric for this set of languages, since our data generation is skewed towards negative examples. Despite the seeming complexity of the data, Transformers solved the MIX and $O_2$ in-distribution test sets perfectly, and LSTMs also achieved very high scores. However, the MIX OOD sets are challenging for both models, while LSTMs outperformed Transformers on the $O_2$ OOD sets (Table 5).

Since $\mathtt{a}^n\mathtt{b}^n\mathtt{c}^n \subset \text{MIX}$ and $\mathtt{a}^n\mathtt{b}^m\mathtt{c}^n\mathtt{d}^m \subset O_2$, we use unscrambled strings in the visualizations for readability in Figure 7. MIX shows a pattern resembling the one in multiple agreements: the amount of attention from one query symbol to one key symbol is similar across all occurrences and has low variance, as annotated in the visualized example. Similarly, $O_2$ shows a pattern resembling the checkerboard in crossing, although here the queries do not ignore non-dependents. It is still evident that each query in $O_2$ identified which two symbols form pairwise dependents and used similar attention weights for the pair. Note that although we visualize unscrambled strings for readability, the similar-attention and low-variance properties hold for the other, scrambled strings as well.

As an additional analysis, we probe the MIX representations to see what information is encoded. One possibility is the count of symbol occurrences at each timestep, which directly follows from MIX's definition. We decode the MIX embeddings into a full counting target that maintains the running tallies for all 3 symbols, as illustrated in Table 6. We also include a control task (Hewitt and Liang, 2019) whose target is a random shuffle of the counting target; if the probing model trained on this target has a higher error, we can be more confident that the count is actually present in the representations, rather than an artifact of the probing model's capacity.

Table 5: Scramble macro F1 (%): the models performed perfectly on in-distribution tests, but the MIX OOD sets are challenging for both models, whereas LSTMs outperformed Transformers on the $O_2$ OOD sets. Note that we made sure the three tests for $O_2$ do not share datapoints when generating them.

| $w$ | Counting Target (#a #b #c) | Control Task (#a #b #c) |
|---|---|---|
| $\mathtt{a}$ | [1 0 0] | [2 2 2] |
| $\mathtt{b}$ | [1 1 0] | [1 0 0] |
| $\mathtt{c}$ | [1 1 1] | [2 1 1] |
| $\mathtt{a}$ | [2 1 1] | [1 1 0] |
| $\mathtt{b}$ | [2 2 1] | [1 1 1] |
| $\mathtt{c}$ | [2 2 2] | [2 2 1] |

Table 6: Target examples for $w = \mathtt{abcabc}$. The counting target is the count of each symbol at each timestep; the control task target is a random shuffle of the counting target.
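The counting target in Table 6 is straightforward to construct; a minimal sketch (the function name is ours):

```python
import random

def counting_target(w: str, alphabet: str = "abc") -> list:
    """Running count of each alphabet symbol after each timestep."""
    counts = {s: 0 for s in alphabet}
    target = []
    for ch in w:
        counts[ch] += 1
        target.append([counts[s] for s in alphabet])
    return target

target = counting_target("abcabc")
assert target == [[1, 0, 0], [1, 1, 0], [1, 1, 1],
                  [2, 1, 1], [2, 2, 1], [2, 2, 2]]

# The control-task target is a random shuffle of the same rows.
control = list(target)
random.shuffle(control)
```

The shuffled control preserves the marginal distribution of the rows while destroying their alignment with timesteps, which is what makes it a fair capacity control.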

Similar to the methodology in Wallace et al. (2019), we used an MLP regressor prober with 1 hidden layer, ReLU activations, and MSE loss. We trained the prober for up to 300 epochs with early stopping. On the in-distribution test set, the prober using the counting target has an MSE of 0.21 and a Pearson correlation of 0.929 between the target and predicted count values. This contrasts with the control task target, which has an MSE of 1.33. We take this as suggestive evidence that the learned representations contain count information, which may have been useful for solving the scramble languages.
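As a rough stand-in for such a prober (the text specifies only one hidden layer, ReLU, and MSE; everything else here, including the synthetic data, is our assumption), a from-scratch one-hidden-layer MLP regressor might look like:

```python
import numpy as np

def train_mlp_prober(X, Y, hidden=32, lr=0.05, epochs=1000, seed=0):
    """One-hidden-layer ReLU MLP regressor trained with full-batch MSE.

    X: (N, d) frozen model representations; Y: (N, k) counting targets.
    Returns a prediction function over the learned weights.
    """
    rng = np.random.default_rng(seed)
    d, k = X.shape[1], Y.shape[1]
    W1 = rng.normal(0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (hidden, k)); b2 = np.zeros(k)
    for _ in range(epochs):
        H = np.maximum(X @ W1 + b1, 0.0)        # ReLU hidden layer
        pred = H @ W2 + b2
        grad = 2.0 * (pred - Y) / len(X)        # d(MSE)/d(pred)
        gW2, gb2 = H.T @ grad, grad.sum(0)
        gH = (grad @ W2.T) * (H > 0)            # backprop through ReLU
        gW1, gb1 = X.T @ gH, gH.sum(0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return lambda Z: np.maximum(Z @ W1 + b1, 0.0) @ W2 + b2

# Synthetic stand-ins: "representations" whose first 3 dims hide the counts.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
Y = X[:, :3] + 1.0
mse_untrained = np.mean((train_mlp_prober(X, Y, epochs=0)(X) - Y) ** 2)
mse_trained = np.mean((train_mlp_prober(X, Y)(X) - Y) ** 2)
assert mse_trained < mse_untrained
```

In practice the inputs would be the frozen per-timestep MIX embeddings and the targets the running counts, with early stopping on a development split as described above.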

6 Discussion and Conclusion

We empirically studied the Transformer's ability to learn a variety of linguistically significant MCSLs. The significance of these languages is two-fold: they represent a hypothesized upper bound on the complexity of natural language, and they are abstractions of the motivating linguistic phenomena. Overall, the Transformers performed well on in-distribution tests and are comparable to LSTMs, but their ability to extrapolate is limited. In our next-character prediction experiments with Transformer decoders, removing the sinusoidal PE alleviated the problem, an established empirical finding for some formal languages and natural language. However, this technique does not always generalize to other data, nor does it work in the encoder, since only the decoder can rely on indirect positional information from the causal mask in the absence of PE.

Transformers leveraged the attention mechanism to score the similarity between substrings. In our analyses, the learned self-attention alignments often reflect the symbol dependency relations within the string, which is plausibly useful for MCSLs given the rich and complex dependencies in these languages. In a more complex language like MIX, Transformers implicitly learned some form of counting behavior that may have helped solve the language.

Within the same family of languages spanning complexity classes, the learned patterns are similar, and no significant differences in behavior are observed between the reduced- and added-complexity languages. This may suggest that we cannot directly draw parallels between the MCSG formalisms and the Transformer's expressiveness, as is possible with other formal models such as circuits (Hao et al., 2022; Merrill et al., 2022). However, this work serves as an example of how we may draw inspiration from the rich MCSL scholarship to motivate work in NLP, as these languages help us examine the linguistic capacity of current and future NLP models.

Limitations

An empirical study of formal language learning is inherently limited: any experiment can only work with strings up to some length bound, so its conclusions rest on an unintentionally bounded dataset. This could weaken arguments about learnability in general, as the bounded dataset might form a language of reduced complexity.

In addition, the roles of the other heads, the feed-forward sublayer, etc. are not investigated. Therefore, we cannot definitively say how self-attention directly contributed to inference, despite learning meaningful and interpretable patterns (cf. Wen et al. (2023)).

Ideally, we would complement the empirical findings with theoretical constructions showing whether and how the MCSLs can be learned, which the current work lacks. However, the empirical results serve as a foundation towards that goal. In particular, the highly interpretable self-attention patterns could hint at what such theoretical constructions would look like.

References

Appendix A Additional Experiment Details

Table 7: Dataset statistics and label distribution. The next character prediction task uses only positive examples. OOD sets for scramble languages are downsampled. Negative examples for scramble languages also include strings where $|w|_\sigma = 0$.

Choice of Task

In pilot experiments, training crossing and multiple agreements with the binary classification task was unsuccessful, and our analyses suggest that the models learned spurious statistical cues. The task was especially difficult for multiple agreements, where only one example is available for each $n$. For crossing, we tried using an equal number of positive and negative examples, but that was not enough for the model to rule out alternative wrong hypotheses. On the other hand, following what we did for $O_2$ and enumerating all possible negative examples in a length range is impractical, since crossing has a much wider length range than $O_2$, leading to an explosion of negative examples. Teaching these two sets of languages with the binary classification setup may still be possible, but the negative examples would likely need to be carefully curated to avoid an explosion of negative examples over positive ones while still providing enough negative datapoints to help the model eliminate most wrong hypotheses, such as those based on spurious cues.

The [EOS] Decision

In the binary classification task, since we are not scoring or generating a string, the decision on whether to add [EOS] to strings is arbitrary. Newman et al. (2020) suggest that without [EOS], models may extrapolate better to longer strings. We tried the setup without [EOS] in pilot experiments but found no significantly better performance for our Transformer models in the studied languages. We chose to include [EOS] and use the [EOS] embeddings as the sentence representations for LSTMs in this task.

Appendix B Training Details

Table 8: Search space and choices for certain hyperparameters per language.

The datasets for all languages are generated exactly once and used across hyperparameter tuning and the final experiments. We record the random seeds used for generation for reproducibility. We try to enumerate all examples in our chosen length range, except for the OOD sets of the scramble languages, where the number of datapoints explodes as strings get longer, so we downsample in this case. For MIX, positive examples are capped at 756; negative examples for a given string length are capped at 25200. For $O_2$, examples for a given string length are capped at 6300.

We then do a 70%-15%-15% Train-Dev-Test random split of the data. This is performed over the entire dataset, except for the scramble languages, which we split by each string length separately, since the dataset is skewed towards longer strings as a result of enumerating all distinct permutations for a given string length. For the multiple agreements languages, since the dataset is too small to be randomly split, we manually picked strings with $n \in \{5, 15, 25, 35, 45\}$ as the test set and strings with $n \in \{6, 16, 26, 36, 46\}$ as the development set. We give detailed statistics of the dataset size and label distribution in Table 7.

We first tune the hyperparameters for each of the languages on both the Transformer and the LSTM. All models tune $d_{\text{model}}$ in $\{16, 32, 64\}$, and for Transformers, we also tune the number of layers in $\{1, 2\}$ and the number of heads in $\{1, 2, 4\}$. The search space for the learning rate per language is in Table 8. We tuned the hyperparameters using grid search and picked the configuration with the lowest development loss for each language. The final set of hyperparameters is in Table 9. For both tuning and training, we also use early stopping with manually selected minimum delta, patience, and maximum number of epochs, as shown in Table 8. All experiment runs used a batch size of 32 and the AdamW optimizer (Loshchilov and Hutter, 2019). The embedding layer of the Transformer is initialized using a uniform distribution in the range $[-0.1, 0.1]$.
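The grid search over these spaces can be sketched as a simple Cartesian product; the learning-rate values below are placeholders, since the actual space is per language (Table 8):

```python
from itertools import product

# Search spaces from the text; learning rates are placeholder values.
d_models = [16, 32, 64]
n_layers = [1, 2]
n_heads = [1, 2, 4]
learning_rates = [1e-3, 1e-4]  # per-language in the actual experiments

# Transformers tune all four; LSTMs tune only d_model and the learning rate.
transformer_grid = list(product(d_models, n_layers, n_heads, learning_rates))
lstm_grid = list(product(d_models, learning_rates))

assert len(transformer_grid) == 3 * 2 * 3 * len(learning_rates)
```

Each configuration is then trained with early stopping, and the one with the lowest development loss is kept.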

Table 9: Final hyperparameters used in experiments per language.
