Title: Lewis’s Signaling Game as beta-VAE For Natural Word Lengths and Segments

URL Source: https://arxiv.org/html/2311.04453

License: arXiv.org perpetual non-exclusive license. arXiv:2311.04453v2 [cs.CL] 02 Apr 2024

Lewis’s Signaling Game as beta-VAE For Natural Word Lengths and Segments

Ryo Ueda (The University of Tokyo, ryoryoueda@is.s.u-tokyo.ac.jp) and Tadahiro Taniguchi (Kyoto University / Ritsumeikan University, taniguchi@ci.ritsumei.ac.jp)

Abstract

As a sub-discipline of evolutionary and computational linguistics, emergent communication (EC) studies communication protocols, called emergent languages, arising in simulations where agents communicate. A key goal of EC is to give rise to languages that share statistical properties with natural languages. In this paper, we reinterpret Lewis’s signaling game, a frequently used setting in EC, as beta-VAE and reformulate its objective function as ELBO. Consequently, we clarify the existence of prior distributions of emergent languages and show that the choice of the priors can influence their statistical properties. Specifically, we address the properties of word lengths and segmentation, known as Zipf’s law of abbreviation (ZLA) and Harris’s articulation scheme (HAS), respectively. It has been reported that the emergent languages do not follow them when using the conventional objective. We experimentally demonstrate that by selecting an appropriate prior distribution, more natural segments emerge, while suggesting that the conventional one prevents the languages from following ZLA and HAS.

1 Introduction

Understanding how language and communication emerge is one of the ultimate goals in evolutionary linguistics and computational linguistics. Emergent communication (EC, Lazaridou & Baroni, 2020; Galke et al., 2022; Brandizzi, 2023) attempts to simulate the emergence of language using techniques such as deep learning and reinforcement learning. A key challenge in EC is to elucidate how emergent communication protocols, or emergent languages, reproduce statistical properties similar to natural languages, such as compositionality (Kottur et al., 2017; Chaabouni et al., 2020; Ren et al., 2020), word length distribution (Chaabouni et al., 2019; Rita et al., 2020), and word segmentation (Resnick et al., 2020; Ueda et al., 2023). In this paper, we reformulate Lewis’s signaling game (Lewis, 1969; Skyrms, 2010), a popular setting in EC, as beta-VAE (Kingma & Welling, 2014; Higgins et al., 2017) for the emergence of more natural word lengths and segments.

Lewis’s signaling game and its objective: Lewis’s signaling game is frequently adopted in EC (e.g., Chaabouni et al., 2019; Rita et al., 2022b). It is a simple communication game involving a sender 𝑆 𝜙 and receiver 𝑅 𝜽 , with only unidirectional communication allowed from 𝑆 𝜙 to 𝑅 𝜽 . In each play, 𝑆 𝜙 obtains an object 𝑥 and converts it into a message 𝑚 . Then, 𝑅 𝜽 interprets 𝑚 to guess the original object 𝑥 . The game is successful upon a correct guess. 𝑆 𝜙 and 𝑅 𝜽 are typically represented as neural networks such as RNNs (Hochreiter & Schmidhuber, 1997; Cho et al., 2014) and optimized for a certain objective 𝒥 conv , which is in most cases defined as the maximization of the log-likelihood of the receiver’s guess (or equivalently minimization of cross-entropy loss).

ELBO as objective: In contrast, we propose to regard the signaling game as (beta-)VAE (Kingma & Welling, 2014; Higgins et al., 2017) and utilize ELBO (with a KL-weighting parameter 𝛽) as the game objective. We reinterpret the game as representation learning in a generative model, regarding 𝑥 as an observed variable, 𝑚 as a latent variable, 𝑆 𝜙 as an encoder, 𝑅 𝜽 as a decoder, and an additionally introduced “language model” 𝑃 𝜽 prior as the prior latent distribution. Several reasons support our reformulation. First, the conventional objective 𝒥 conv actually resembles ELBO in some respects. We illustrate their similarity in Figure 1 and Table 1. 𝒥 conv has an implicit prior message distribution, while ELBO contains an explicit one. Also, 𝒥 conv is often adjusted with an entropy regularizer, while ELBO contains an entropy maximizer. Although they are not mathematically equal, their motivations coincide in that they keep the sender’s entropy high to encourage exploration (Levine, 2018). Second, the choice of a prior message distribution can affect the statistical properties of emergent languages. Specifically, this paper addresses Zipf’s law of abbreviation (ZLA, Zipf, 1935; 1949) and Harris’s articulation scheme (HAS, Harris, 1955; Tanaka-Ishii, 2021). It has been reported that emergent languages do not give rise to these properties as long as the conventional objective 𝒥 conv is used (Chaabouni et al., 2019; Ueda et al., 2023). We suggest that such an artifact is caused by unnatural prior distributions and that this has been obscured by their implicitness. Moreover, our ELBO-based formulation can be justified from the computational psycholinguistic point of view, namely surprisal theory (Hale, 2001; Levy, 2008; Smith & Levy, 2013). Interestingly, the prior distribution can be seen as a language model, and thus the corresponding term can be seen as the surprisal, or cognitive processing cost, of sentences (messages).
In that sense, ELBO naturally models the trade-off between the informativeness and processability of sentences.

Related work and contribution: The idea per se of viewing communication as representation learning is also seen in previous work. For instance, the linguistic concept of compositionality is often regarded as disentanglement (Andreas, 2019; Chaabouni et al., 2020; Resnick et al., 2020). Also, Tucker et al. (2022) formulated a VQ-VIB game based on the Variational Information Bottleneck (VIB, Alemi et al., 2017), which is known as a generalization of beta-VAE. Moreover, Taniguchi et al. (2022) defined a Metropolis-Hastings (MH) naming game in which communication was formulated as representation learning on a generative model and emerged through MCMC. However, to the best of our knowledge, no previous work has addressed in an integrated way (1) the similarities between the game and (beta-)VAE from the generative viewpoint, (2) the potential impact of prior distributions on the properties of emergent languages, and (3) a psycholinguistic reinterpretation of emergent communication. These are the main contributions of this paper. Another important point is that, since Chaabouni et al. (2019), most EC work on signaling games has adopted either the one-character or fixed-length message setting, avoiding the variable-length setting, though variable-length messages are more natural. We revisit the variable-length setting towards a better simulation of language emergence.

Figure 1: Illustration of the similarity between signaling games and (beta-)VAE.

Table 1: A list of objective functions in this paper. It is useful to grasp their similarity.

| objective | definition | prior (implicit/explicit) | entropy-related term |
| --- | --- | --- | --- |
| 𝒥 conv ( 𝜙 , 𝜽 ) | Eq. 2 (Section 2) | 𝑃 unif prior ( 𝑀 ) (implicit) | entropy regularizer |
| 𝒥 conv ( 𝜙 , 𝜽 ; 𝛼 ) | Eq. 4 (Section 2) | 𝑃 𝛼 prior ( 𝑀 ) (implicit) | entropy regularizer |
| 𝒥 elbo ( 𝜙 , 𝜽 ; 𝛼 ) | Eq. 12 (Section 3) | 𝑃 𝛼 prior ( 𝑀 ) (explicit) | entropy maximizer |
| 𝒥 ours ( 𝜙 , 𝜽 ; 𝛽 ) | Eq. 7 (Section 3) | 𝑃 𝜽 prior ( 𝑀 ) (explicit) | entropy maximizer |

2 Background

Mathematical Notation: Throughout this paper, we use a calligraphic letter for a set (e.g., 𝒳 ), an uppercase letter for a random variable of the set (e.g., 𝑋 ), and a lowercase letter for an element of the set (e.g., 𝑥 ). We denote a finite object space by 𝒳 , a finite alphabet by 𝒜 , a finite message space by ℳ , a probability distribution of objects by 𝑃 obj ( 𝑋 ) , a probabilistic sender agent by 𝑆 𝜙 ( 𝑀 | 𝑋 ) , and a probabilistic receiver agent by 𝑅 𝜽 ( 𝑋 | 𝑀 ) . 𝑆 𝜙 is parametrized by 𝜙 and 𝑅 𝜽 by 𝜽 .

2.1 Language Emergence via Lewis’s Signaling Game

To simulate language emergence, defining an environment surrounding agents is necessary. One of the most standard settings in EC is Lewis’s signaling game (Lewis, 1969; Skyrms, 2010; Rita et al., 2022b), though there are variations in the environment definition (Foerster et al., 2016; Lowe et al., 2017; Mordatch & Abbeel, 2018; Jaques et al., 2019, inter alia).1 The game involves two agents, sender 𝑆 𝜙 and receiver 𝑅 𝜽 . In a single play, 𝑆 𝜙 observes an object 𝑥 and generates a message 𝑚 . 𝑅 𝜽 obtains the message 𝑚 and guesses the object 𝑥 . The game is successful if the guess is correct. 𝑆 𝜙 and 𝑅 𝜽 are often represented as RNNs and optimized toward successful communication. During training, the sender probabilistically observes an object 𝑥 ∼ 𝑃 obj ⁢ ( ⋅ ) , generates a message 𝑚 ∼ 𝑆 𝜙 ( ⋅ | 𝑥 ) , and the receiver guesses via log ⁡ 𝑅 𝜽 ⁢ ( 𝑥 | 𝑚 ) . During validation, the sender observes 𝑥 one by one and greedily generates 𝑚 .
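A single play of the game can be sketched in a few lines. This is a toy illustration with hand-fixed tabular strategies rather than the paper's trained RNN agents; all names and tables here are ours:

```python
import random

random.seed(0)

# Toy spaces: 4 objects, messages of length 2 over the alphabet {a, b}.
objects = [0, 1, 2, 3]

# A hand-fixed deterministic sender strategy S_phi: object -> message,
# and the matching receiver strategy R_theta: message -> object guess.
sender = {0: "aa", 1: "ab", 2: "ba", 3: "bb"}
receiver = {m: x for x, m in sender.items()}

def play_one_round() -> bool:
    x = random.choice(objects)   # x ~ P_obj (uniform here)
    m = sender[x]                # m ~ S_phi(. | x)
    x_hat = receiver[m]          # receiver's guess from the message
    return x == x_hat            # the game succeeds iff the guess is correct

# With perfectly coordinated strategies, every round succeeds.
assert all(play_one_round() for _ in range(100))
```

In training, the deterministic tables above are replaced by probabilistic RNN policies and the success signal drives the gradient updates described below.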

Object Space: Objects 𝑥 can be defined in various ways, from abstract to realistic data. In this paper, an object 𝑥 is assumed to be an attribute-value (att-val) object (Kottur et al., 2017; Chaabouni et al., 2020), which is an 𝑛 att -tuple of integers ( 𝑣 1 , … , 𝑣 𝑛 att ) ∈ 𝒳 , where 𝑣 𝑖 ∈ { 1 , … , 𝑛 val } for each 𝑖 . 𝑛 att is called the number of attributes and 𝑛 val is called the number of values.

Message Space: In most cases, the message space ℳ is defined as a set of sequences of a (maximum) length 𝐿 max over a finite alphabet 𝒜 . The message length, denoted by | 𝑚 | , can be either fixed (Chaabouni et al., 2020) or variable (Chaabouni et al., 2019). This paper adopts the latter setting:

ℳ := { 𝑚 1 ⋯ 𝑚 𝑘 ∣ 𝑚 𝑖 ∈ 𝒜 ∖ { eos } ( 1 ≤ 𝑖 ≤ 𝑘 − 1 ) , 𝑚 𝑘 = eos , 1 ≤ 𝑘 ≤ 𝐿 max } ,

(1)

where eos ∈ 𝒜 is the end-of-sequence marker. We assume | 𝒜 | ≥ 3 and thus log ( | 𝒜 | − 1 ) > 0 .
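Under Eq. 1, a message of length 𝑘 consists of 𝑘 − 1 free symbols from 𝒜 ∖ {eos} followed by eos, so there are (|𝒜| − 1)^(𝑘−1) messages of each length. A short sketch (the parameter values are illustrative):

```python
# Count the variable-length message space of Eq. 1: the first k-1 symbols
# range over the alphabet minus eos, and the k-th symbol is eos.
def message_space_size(alphabet_size: int, l_max: int) -> int:
    return sum((alphabet_size - 1) ** (k - 1) for k in range(1, l_max + 1))

# Hand-enumerable check for |A| = 3 (say {a, b, eos}) and L_max = 3:
# k=1: "eos" -> 1 message; k=2: 2 messages; k=3: 4 messages; total 7.
assert message_space_size(3, 3) == 7

# The count per length grows as (|A|-1)^(k-1) = exp(gamma * (k-1)) with
# gamma = log(|A|-1) > 0, which is why a uniform prior over M favors
# longer messages (Section 3.3).
```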

Game Objective: The game objective is often defined as (Chaabouni et al., 2019; Rita et al., 2022b):

𝒥 conv ⁢ ( 𝜙 , 𝜽 ) := 𝔼 𝑥 ∼ 𝑃 obj ( ⋅ ) , 𝑚 ∼ 𝑆 𝜙 ( ⋅ | 𝑥 ) [ log ⁡ 𝑅 𝜽 ⁢ ( 𝑥 ∣ 𝑚 ) ] .

(2)

For clarity, we refer to 𝒥 conv as the conventional objective, in contrast to our objective 𝒥 ours defined later. The gradient of 𝒥 conv ⁢ ( 𝜙 , 𝜽 ) is obtained as (Chaabouni et al., 2019):

∇ 𝜙 , 𝜽 𝒥 conv ( 𝜙 , 𝜽 ) = 𝔼 𝑥 ∼ 𝑃 obj ( ⋅ ) , 𝑚 ∼ 𝑆 𝜙 ( ⋅ | 𝑥 ) [ ∇ 𝜙 , 𝜽 log 𝑅 𝜽 ( 𝑥 | 𝑚 ) + ( log 𝑅 𝜽 ( 𝑥 | 𝑚 ) − 𝐵 ( 𝑥 ) ) ∇ 𝜙 , 𝜽 log 𝑆 𝜙 ( 𝑚 | 𝑥 ) ] ,

where 𝐵 : 𝒳 → ℝ is a baseline. Intuitively, 𝜙 is updated via REINFORCE (Williams, 1992) and 𝜽 via the standard backpropagation. In practice, an entropy regularizer (Williams & Peng, 1991; Mnih et al., 2016) is added to ∇ 𝜙 , 𝜽 𝒥 conv ⁢ ( 𝜙 , 𝜽 ) :

∇ 𝜙 , 𝜽 entreg := ( 𝜆 entreg / | 𝑚 | ) ∑ 𝑡 = 1 … | 𝑚 | ∇ 𝜙 , 𝜽 ℋ ( 𝑆 𝜙 ( 𝑀 𝑡 ∣ 𝑥 , 𝑚 1 : 𝑡 − 1 ) ) ,

(3)

where 𝜆 entreg is a hyperparameter adjusting the weight. It encourages the sender agent to explore the message space during training by keeping its policy entropy high. A message-length penalty − 𝛼 ⁢ | 𝑚 | is sometimes added to the objective to prevent messages from becoming unnaturally long:

𝒥 conv ( 𝜙 , 𝜽 ; 𝛼 ) := 𝔼 𝑥 ∼ 𝑃 obj ( ⋅ ) , 𝑚 ∼ 𝑆 𝜙 ( ⋅ | 𝑥 ) [ log 𝑅 𝜽 ( 𝑥 ∣ 𝑚 ) − 𝛼 | 𝑚 | ] .

(4)

Chaabouni et al. (2019) and Rita et al. (2020) pointed out that the message-length penalty is necessary to give rise to Zipf’s law of abbreviation (ZLA) in emergent languages.
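The REINFORCE-style sender gradient above, with the length penalty of Eq. 4 folded into the reward, can be checked numerically: on a problem small enough to differentiate analytically, the score-function estimate converges to the true gradient. A minimal sketch with a one-parameter sender and hypothetical reward values (all numbers are ours):

```python
import math, random

random.seed(0)

# One object, two candidate messages, rewards r(m) = log R(x|m) - alpha * |m|
# with alpha = 0.1 and hypothetical receiver likelihoods 0.9 and 0.8.
rewards = {"a!": math.log(0.9) - 0.1 * 2, "bbb!": math.log(0.8) - 0.1 * 4}
msgs = list(rewards)

logit = 0.3                        # the sender's single parameter phi
p = 1 / (1 + math.exp(-logit))     # S_phi(msgs[0]); S_phi(msgs[1]) = 1 - p

# Analytic gradient of J(phi) = p * r0 + (1 - p) * r1 w.r.t. the logit.
analytic = p * (1 - p) * (rewards[msgs[0]] - rewards[msgs[1]])

# REINFORCE estimate: E[ r(m) * d log S_phi(m) / d phi ].
grads = []
for _ in range(200_000):
    m = msgs[0] if random.random() < p else msgs[1]
    dlogp = (1 - p) if m == msgs[0] else -p   # score of a Bernoulli policy
    grads.append(rewards[m] * dlogp)
estimate = sum(grads) / len(grads)

assert abs(estimate - analytic) < 1e-2   # unbiased up to Monte Carlo noise
```

In practice a baseline 𝐵(𝑥) is subtracted from the reward to reduce the variance of exactly this estimator.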

2.2 On the Statistical Properties of Languages

It is a key challenge in EC to fill the gap between emergent and natural languages. Though it is not straightforward to evaluate the emergent languages that are uninterpretable for human evaluators, their natural language likeness has been assessed indirectly by focusing on their statistical properties (e.g., Chaabouni et al., 2019; Kharitonov et al., 2020). We briefly introduce previous EC studies on Zipf’s law of abbreviation and Harris’s articulation scheme.

Word Lengths: Zipf’s law of abbreviation (ZLA, Zipf, 1935; 1949) refers to the statistical property of natural language whereby frequently used words tend to be shorter. Chaabouni et al. (2019) reported that emergent languages do not follow ZLA under the conventional objective 𝒥 conv ( 𝜙 , 𝜽 ) . They defined 𝑃 obj ( 𝑋 ) as a power-law distribution. According to ZLA, it was expected that | 𝑚 (1) | < | 𝑚 (2) | overall for 𝑥 (1) , 𝑥 (2) such that 𝑃 obj ( 𝑥 (1) ) > 𝑃 obj ( 𝑥 (2) ) . Their experiment, however, showed the opposite tendency | 𝑚 (1) | > | 𝑚 (2) | . They also reported that emergent languages follow ZLA when the message-length penalty (Eq. 4) is additionally introduced. Previous work has ascribed such anti-efficiency to the lack of desirable inductive biases, such as the sender’s laziness (Chaabouni et al., 2019), the receiver’s impatience (Rita et al., 2020), and memory constraints (Ueda & Washio, 2021).

Word Segments: Harris’s articulation scheme (HAS, Harris, 1955; Tanaka-Ishii, 2021) is a statistical property of word segmentation in natural languages. According to HAS, the boundaries of word segments in a character sequence can be predicted to some extent from the statistical behavior of character 𝑛 -grams. This scheme is based on the discovery of Harris (1955) that there tend to be word boundaries at points where the number of possible successive characters after given contexts increases in English corpora. HAS reformulates it based on information theory (Cover & Thomas, 2006). Formally, let BE ( 𝑠 ) be the branching entropy of a character sequence 𝑠 = 𝑎 1 ⋯ 𝑎 𝑛 ( 𝑎 𝑖 ∈ 𝒜 ):

BE ( 𝑠 ) := ℋ ( 𝐴 𝑛 + 1 | 𝐴 1 : 𝑛 = 𝑠 ) = − ∑ 𝑎 ∈ 𝒜 𝑃 ( 𝐴 𝑛 + 1 = 𝑎 | 𝐴 1 : 𝑛 = 𝑠 ) log 2 𝑃 ( 𝐴 𝑛 + 1 = 𝑎 | 𝐴 1 : 𝑛 = 𝑠 ) ,

(5)

where 𝑛 is the length of 𝑠 and 𝑃 ( 𝐴 𝑛 + 1 = 𝑎 | 𝐴 1 : 𝑛 = 𝑠 ) is defined as a simple 𝑛 -gram model. BE ( 𝑠 ) is known to decrease monotonically on average (Bell et al., 1990), while its increase can be of special interest. HAS states that there tend to be word boundaries at BE’s increasing points. HAS can be utilized as an unsupervised word segmentation algorithm (Tanaka-Ishii, 2005; Tanaka-Ishii & Ishii, 2007) whose pseudo-code is shown in Appendix A. However, it is not obvious whether segments obtained from emergent languages are meaningful. For instance, no one would believe that segments obtained from a random unigram model represent meaningful words; emergent languages face a similar issue as they are not human languages. We do not even have a direct method to evaluate the meaningfulness, since there is no ground-truth segmentation data in emergent languages. To alleviate this, Ueda et al. (2023) proposed the following criteria to indirectly verify their meaningfulness:

C1.

The number of boundaries per message (denoted by 𝑛 bou ) should increase if 𝑛 att increases.

C2.

The number of distinct segments in a language (denoted by 𝑛 seg ) should increase if 𝑛 val increases.

C3.

W-TopSim > C-TopSim should hold, or equivalently Δ w , c := W-TopSim − C-TopSim > 0 should hold.2

These are rephrased to some extent. C-TopSim is the normal TopSim that regards characters 𝑎 ∈ 𝒜 as symbol units, whereas W-TopSim regards segments (chunked characters) as symbol units. In the att-val setting, meaningful segments are expected to represent attributes and values in a disentangled way, motivating C1 and C2. C3, on the other hand, is based on the concept of double articulation (Martinet, 1960): in natural language, the characters within a word are hardly related to the corresponding meaning, whereas the word itself is. For instance, the characters a, e, l, p are irrelevant to a rounded, red fruit, but the word apple is relevant to it. Thus, word-level compositionality is expected to be higher than character-level compositionality, motivating C3. With the conventional objective, however, emergent languages did not satisfy any of these criteria, and the potential causes were not well addressed in Ueda et al. (2023).
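The boundary detection behind HAS (Eq. 5) is easy to sketch: estimate branching entropy with a simple 𝑛-gram model and place a boundary wherever it increases (threshold 0). A toy illustration with a bigram (one-character context) model on an artificial corpus; the data and all names are ours:

```python
import math
from collections import Counter, defaultdict

# Artificial corpus: the "true words" are used only to build the data.
words = ["ab", "cde", "ab", "cde", "ab", "fg"]
text = "".join(words)   # "abcdeabcdeabfg"

# Bigram model P(A_{n+1} = a | last character of the context).
pair_counts = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    pair_counts[prev][nxt] += 1

def branching_entropy(context: str) -> float:
    counts = pair_counts[context[-1]]
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# HAS-style detection: a boundary wherever BE increases (threshold 0).
boundaries = []
prev_be = branching_entropy(text[0])
for i in range(1, len(text) - 1):
    be = branching_entropy(text[: i + 1])
    if be > prev_be:
        boundaries.append(i + 1)   # boundary before position i + 1
    prev_be = be

# The detector recovers the boundaries after each "ab" (a subset of
# the true word boundaries {2, 5, 7, 10, 12}).
assert boundaries == [2, 7, 12]
```

With a one-character context the model is crude; the algorithm in Appendix A of the paper generalizes this to longer 𝑛-gram contexts and a tunable threshold.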

3 Redefine Objective as ELBO from the Generative Perspective

In this section, we first redefine the objective of signaling games as (beta-)VAE’s (Kingma & Welling, 2014; Higgins et al., 2017), i.e., the evidence lower bound (ELBO) with a KL weighting hyperparameter 𝛽 . Next, we show several reasons supporting our ELBO-based formulation.

Though it may seem abrupt, let us think of a receiver agent 𝑅 𝜽 as a generative model, i.e., a joint distribution over objects 𝑋 and messages 𝑀 :

𝑅 𝜽 joint ⁢ ( 𝑋 , 𝑀 ) := 𝑅 𝜽 ⁢ ( 𝑋 ∣ 𝑀 ) ⁢ 𝑃 𝜽 prior ⁢ ( 𝑀 ) .

(6)

Intuitively, the receiver consists not only of the conventional reconstruction term 𝑅 𝜽 ( 𝑋 | 𝑀 ) but also of a “language model” 𝑃 𝜽 prior ( 𝑀 ) . The optimal sender would be the true posterior 𝑆 𝜽 ∗ ( 𝑀 | 𝑋 ) = 𝑅 𝜽 ( 𝑋 | 𝑀 ) 𝑃 𝜽 prior ( 𝑀 ) / 𝑃 obj ( 𝑋 ) by Bayes’ theorem. Let us assume, however, that the sender can only approximate it via a variational posterior 𝑆 𝜙 ( 𝑀 | 𝑋 ) . The sender cannot access the true posterior because it is intractable, and the sender cannot directly peer into the receiver’s “brain” 𝜽 . Now one notices the similarity between the signaling game and (beta-)VAE.3 Their similarity is illustrated in Figure 1 and Table 1. beta-VAE consists of an encoder, decoder, and prior latent distribution. Here, we regard 𝑀 as the latent variable, 𝑆 𝜙 ( 𝑀 | 𝑋 ) as the encoder, 𝑅 𝜽 ( 𝑋 | 𝑀 ) as the decoder, and 𝑃 𝜽 prior ( 𝑀 ) as the prior distribution. Our objective is now defined as follows:

𝒥 ours ( 𝜙 , 𝜽 ; 𝛽 ) := 𝔼 𝑥 ∼ 𝑃 obj ( ⋅ ) [ 𝔼 𝑚 ∼ 𝑆 𝜙 ( ⋅ | 𝑥 ) [ log 𝑅 𝜽 ( 𝑥 ∣ 𝑚 ) ] − 𝛽 𝐷 KL ( 𝑆 𝜙 ( 𝑀 ∣ 𝑥 ) | | 𝑃 𝜽 prior ( 𝑀 ) ) ] ,

(7)

where 𝐷 KL denotes the KL divergence and 𝛽 ≥ 0 is a hyperparameter that weights the KL term. Although Eq. 7 equals the precise ELBO only if 𝛽 = 1 , 𝛽 is often initially set < 1 and (optionally) annealed to 1 to prevent posterior collapse (Bowman et al., 2016; Alemi et al., 2018; Fu et al., 2019; Klushyn et al., 2019). In what follows, we discuss reasons supporting our redefinition.
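When the spaces are small enough to enumerate, Eq. 7 can be computed exactly, which also makes its bound property easy to check. A sketch with toy distribution tables (all numbers are ours, for illustration only):

```python
import math

# Enumerable toy spaces: two objects, three messages.
objs, msgs = [0, 1], ["a", "b", "c"]
P_obj = {0: 0.5, 1: 0.5}
S = {0: {"a": 0.8, "b": 0.1, "c": 0.1},      # sender S_phi(m | x)
     1: {"a": 0.1, "b": 0.1, "c": 0.8}}
R = {"a": {0: 0.9, 1: 0.1},                  # receiver R_theta(x | m)
     "b": {0: 0.5, 1: 0.5},
     "c": {0: 0.1, 1: 0.9}}
P_prior = {"a": 0.4, "b": 0.2, "c": 0.4}     # prior P_theta^prior(m)

def elbo(beta: float) -> float:
    total = 0.0
    for x in objs:
        rec = sum(S[x][m] * math.log(R[m][x]) for m in msgs)
        kl = sum(S[x][m] * math.log(S[x][m] / P_prior[m]) for m in msgs)
        assert kl >= 0.0                     # KL divergence is non-negative
        total += P_obj[x] * (rec - beta * kl)
    return total

# With beta = 1, Eq. 7 lower-bounds the log evidence of the generative
# model R_theta^joint(X) = sum_m R(x|m) P_prior(m), for any sender S.
evidence = sum(P_obj[x] * math.log(sum(R[m][x] * P_prior[m] for m in msgs))
               for x in objs)
assert elbo(1.0) <= evidence
# A smaller beta weakens the KL penalty, so the objective can only grow.
assert elbo(0.5) >= elbo(1.0)
```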

Recipe for Supporting Our ELBO-based Formulation

  1. The conventional objective actually resembles ELBO in some respects.
     - Section 3.1: The conventional objective has an implicit prior distribution.
     - Section 3.2: The conventional objective is similar to beta-VAE’s.
  2. The choice of prior distributions can affect the properties of emergent languages.
     - Section 3.3: Be aware that unnatural prior distributions cause inefficient messages.
     - Section 3.4: Use a learnable prior distribution instead, as a richer “language model.”
     - Section 4.2: The learnable prior distribution gives rise to more meaningful segments.
  3. ELBO can be justified from the computational psycholinguistic viewpoint.
     - Section 3.5: ELBO models the trade-off between informativeness and surprisal.

3.1 Conventional Objective Has An Implicit Prior Distribution

We show that the conventional signaling game has implicit prior distributions 𝑃 unif prior ⁢ ( 𝑀 ) or 𝑃 𝛼 prior ⁢ ( 𝑀 ) . First, let 𝑃 unif prior ⁢ ( 𝑀 ) be a uniform message distribution and 𝑃 𝜽 , unif joint ⁢ ( 𝑋 , 𝑀 ) be as follows:

𝑃 unif prior ⁢ ( 𝑚 ) := 1 | ℳ | , 𝑃 𝜽 , unif joint ⁢ ( 𝑥 , 𝑚 ) := 𝑅 𝜽 ⁢ ( 𝑥 ∣ 𝑚 ) ⁢ 𝑃 unif prior ⁢ ( 𝑚 ) .

(8)

Then, the following holds (see Section B.1 for its proof):

∇ 𝜙 , 𝜽 𝒥 conv ( 𝜙 , 𝜽 ) = ∇ 𝜙 , 𝜽 𝔼 𝑥 ∼ 𝑃 obj ( ⋅ ) , 𝑚 ∼ 𝑆 𝜙 ( ⋅ | 𝑥 ) [ log 𝑃 𝜽 , unif joint ( 𝑥 , 𝑚 ) ] ,

(9)

i.e., maximizing 𝒥 conv ⁢ ( 𝜙 , 𝜽 ) via the gradient method is the same as maximizing the expectation of log ⁡ 𝑃 𝜽 , unif joint ⁢ ( 𝑥 , 𝑚 ) . Next, the length-penalizing objective 𝒥 conv ⁢ ( 𝜙 , 𝜽 ; 𝛼 ) also has a prior distribution. Let 𝑃 𝛼 prior ⁢ ( 𝑀 ) and 𝑃 𝜽 , 𝛼 joint ⁢ ( 𝑋 , 𝑀 ) be as follows:

𝑃 𝛼 prior ⁢ ( 𝑚 ) ∝ exp ⁡ ( − 𝛼 ⁢ | 𝑚 | ) , 𝑃 𝜽 , 𝛼 joint ⁢ ( 𝑥 , 𝑚 ) := 𝑅 𝜽 ⁢ ( 𝑥 ∣ 𝑚 ) ⁢ 𝑃 𝛼 prior ⁢ ( 𝑚 ) .

(10)

Then, the following holds (see Section B.1 for its proof):

∇ 𝜙 , 𝜽 𝒥 conv ( 𝜙 , 𝜽 ; 𝛼 ) = ∇ 𝜙 , 𝜽 𝔼 𝑥 ∼ 𝑃 obj ( ⋅ ) , 𝑚 ∼ 𝑆 𝜙 ( ⋅ | 𝑥 ) [ log 𝑃 𝜽 , 𝛼 joint ( 𝑥 , 𝑚 ) ] ,

(11)

i.e., maximizing 𝒥 conv ( 𝜙 , 𝜽 ; 𝛼 ) is the same as maximizing the expectation of log 𝑃 𝜽 , 𝛼 joint ( 𝑥 , 𝑚 ) . Note that Eq. 8 is a special case of Eq. 10; if 𝛼 = 0 , then 𝑃 unif prior ( 𝑚 ) = 𝑃 𝛼 prior ( 𝑚 ) and 𝑃 𝜽 , unif joint ( 𝑥 , 𝑚 ) = 𝑃 𝜽 , 𝛼 joint ( 𝑥 , 𝑚 ) .

3.2 Conventional Objective Is Similar to (beta-)VAE

The conventional objective may be similar to variational inference on a generative model, because it turned out to have implicit prior distributions. Indeed, it is similar to (beta-)VAE. As 𝒥 conv ⁢ ( 𝜙 , 𝜽 ; 𝛼 ) has the implicit prior 𝑃 𝛼 prior ⁢ ( 𝑀 ) , one notices the similarity between the signaling game and VAE, regarding 𝑀 as latent, 𝑆 𝜙 ⁢ ( 𝑀 | 𝑋 ) as encoder, 𝑅 𝜽 ⁢ ( 𝑋 | 𝑀 ) as decoder. Here, the corresponding ELBO 𝒥 elbo ⁢ ( 𝜙 , 𝜽 ; 𝛼 ) can be written as follows:

𝒥 elbo ( 𝜙 , 𝜽 ; 𝛼 ) := 𝔼 𝑥 ∼ 𝑃 obj ⁢ ( ⋅ ) [ 𝔼 𝑚 ∼ 𝑆 𝜙 ( ⋅ ∣ 𝑥 ) [ log 𝑅 𝜽 ( 𝑥 ∣ 𝑚 ) ] − 𝐷 KL ( 𝑆 𝜙 ( 𝑀 | 𝑥 ) | | 𝑃 𝛼 prior ( 𝑀 ) ) ] .

(12)

Then, the following holds (see Section B.2 for its proof):

∇ 𝜙 , 𝜽 𝒥 elbo ( 𝜙 , 𝜽 ; 𝛼 ) = ∇ 𝜙 , 𝜽 𝒥 conv ( 𝜙 , 𝜽 ; 𝛼 ) + ∇ 𝜙 , 𝜽 𝔼 𝑥 ∼ 𝑃 obj ( ⋅ ) [ ℋ ( 𝑆 𝜙 ( 𝑀 ∣ 𝑥 ) ) ] ⏟ resembles ∇ 𝜙 , 𝜽 entreg ( Eq. 3 ) ,

(13)

i.e., maximizing 𝒥 elbo ( 𝜙 , 𝜽 ; 𝛼 ) is the same as maximizing 𝒥 conv ( 𝜙 , 𝜽 ; 𝛼 ) plus an entropy maximizer. Recall that previous work typically adopts an entropy regularizer (Eq. 3). Though they are not mathematically equal, they are similar enough in that they keep the sender’s entropy high to encourage exploration (Levine, 2018). In this sense, the conventional signaling game is similar to VAE. Or rather, it is similar to beta-VAE, because 𝜆 entreg adjusts the entropy regularizer in practice, roughly corresponding to the way 𝛽 adjusts the weight of the KL term.

3.3 Implicit Prior Distribution and Inefficiency of Emergent Language

We have thus far seen that the conventional objective has an implicit prior distribution and is similar to beta-VAE. Now we need to elaborate on why one should be aware of them. To this end, in this subsection, we suggest that the inefficiency of emergent languages is due to unnatural prior distributions. Chaabouni et al. (2019) reported that the objective 𝒥 conv ⁢ ( 𝜙 , 𝜽 ) did not result in emergent languages that follow ZLA, while the length-penalizing one 𝒥 conv ⁢ ( 𝜙 , 𝜽 ; 𝛼 ) did. Recall that 𝒥 conv ⁢ ( 𝜙 , 𝜽 ) has the implicit prior distribution 𝑃 unif prior ⁢ ( 𝑀 ) and 𝒥 conv ⁢ ( 𝜙 , 𝜽 ; 𝛼 ) has 𝑃 𝛼 prior ⁢ ( 𝑀 ) . Then, we define the corresponding distributions over message lengths 𝐾 :

𝑃 unif prior ( 𝐾 = 𝑘 ) := ∑ 𝑚 ∈ ℳ 𝟙 [ | 𝑚 | = 𝑘 ] 𝑃 unif prior ( 𝑚 ) ∝ exp ( 𝛾 𝑘 ) ,

𝑃 𝛼 prior ( 𝐾 = 𝑘 ) := ∑ 𝑚 ∈ ℳ 𝟙 [ | 𝑚 | = 𝑘 ] 𝑃 𝛼 prior ( 𝑚 ) ∝ exp ( ( 𝛾 − 𝛼 ) 𝑘 ) ,

(14)

where 𝛾 = log ( | 𝒜 | − 1 ) > 0 and 𝟙 [ ⋅ ] is the indicator function. 𝑃 unif prior ( 𝑘 ) (resp. 𝑃 𝛼 prior ( 𝑘 ) ) defines the probability that the length 𝐾 of a message sampled from 𝑃 unif prior ( 𝑀 ) (resp. 𝑃 𝛼 prior ( 𝑀 ) ) is 𝑘 . 𝑃 unif prior ( 𝑘 ) grows w.r.t. 𝑘 , i.e., longer messages are more common, and even the longest messages are the most likely. It is clearly unnatural as a prior distribution. Regularized by such a prior distribution, emergent languages would obviously become unnaturally long. In contrast, when 𝛼 > 𝛾 , 𝑃 𝛼 prior ( 𝑘 ) decays w.r.t. 𝑘 , i.e., shorter messages are more common. Thus, the unnatural inefficiency is arguably due to the implicit, inappropriate prior distribution 𝑃 unif prior ( 𝑚 ) , while 𝑃 𝛼 prior ( 𝑚 ) may mitigate the problem. This conclusion contrasts with the previous work that has ascribed the issue to inductive biases, as mentioned in Section 2.2. Note that it is an additional interpretation rather than a rebuttal. It is likely to be a complicated issue involving both the prior distribution and inductive bias.
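The two length distributions in Eq. 14 can be checked numerically. We use the experiment's alphabet size |𝒜| = 9; the cutoff 𝐿max = 10 here is ours, just to keep the loop short:

```python
import math

# Exact length distribution of the uniform prior over M (Eq. 1):
# messages of length k number (|A|-1)^(k-1), so P(K = k) grows in k.
A, L_max = 9, 10
counts = [(A - 1) ** (k - 1) for k in range(1, L_max + 1)]
Z = sum(counts)
P_unif = [c / Z for c in counts]
# Longer messages are strictly more likely; L_max is the mode.
assert all(P_unif[k] < P_unif[k + 1] for k in range(L_max - 1))

# With a length penalty alpha > gamma = log(|A| - 1), the trend flips:
# P(K = k) ∝ exp((gamma - alpha) * k) now decays in k.
gamma = math.log(A - 1)
alpha = gamma + 0.5
weights = [(A - 1) ** (k - 1) * math.exp(-alpha * k)
           for k in range(1, L_max + 1)]
Z = sum(weights)
P_alpha = [w / Z for w in weights]
assert all(P_alpha[k] > P_alpha[k + 1] for k in range(L_max - 1))
```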

3.4 Introduce a Learnable Parametrized Prior Distribution

In the previous subsection, we saw that the choice of prior distributions can influence the statistical properties of emergent languages. However, it has not yet been made clear what the signaling-game counterpart of the VAE’s prior distribution is. In other words, as a sub-domain of linguistics, it is important to take a clear stance on which part of real-world communication the prior distribution models. This paper considers it to be a receiver’s neural language model (LM) parametrized by 𝜽 :

𝑃 𝜽 prior ( 𝑚 ) = ∏ 𝑡 = 1 … | 𝑚 | 𝑃 𝜽 prior ( 𝑚 𝑡 ∣ 𝑚 1 : 𝑡 − 1 ) .

(15)

It can be seen as an LM because it approximates the average behavior of the sender: the KL term in Eq. 7, 𝔼 𝑃 obj ( 𝑥 ) [ 𝐷 KL ( 𝑆 𝜙 ( 𝑀 | 𝑥 ) | | 𝑃 𝜽 prior ( 𝑀 ) ) ] = − 𝔼 𝑃 obj ( 𝑥 ) [ ℋ ( 𝑆 𝜙 ( 𝑀 | 𝑥 ) ) ] − 𝔼 𝑆 𝜙 ( 𝑚 ) [ log 𝑃 𝜽 prior ( 𝑚 ) ] , is minimized w.r.t. 𝜽 when 𝑃 𝜽 prior ( 𝑚 ) = 𝔼 𝑃 obj ( 𝑥 ) [ 𝑆 𝜙 ( 𝑚 | 𝑥 ) ] . In contrast, 𝑃 𝛼 prior ( 𝑚 ) seems oversimplified as an LM; though it is more natural than 𝑃 unif prior ( 𝑚 ) , it is still unnatural as a simulation of language since it restricts the complexity of the sender’s behavior. Our modification leads to more meaningful segments (see Section 4).
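This decomposition of the KL term, and the fact that the aggregated sender minimizes it, can be verified by enumeration (the sender and object tables below are ours, for illustration):

```python
import math

objs, msgs = [0, 1], ["a", "b", "c"]
P_obj = {0: 0.5, 1: 0.5}
S = {0: {"a": 0.7, "b": 0.2, "c": 0.1},   # sender S_phi(m | x)
     1: {"a": 0.1, "b": 0.3, "c": 0.6}}

def expected_kl(P):
    """E_x[ KL( S(. | x) || P ) ] for a candidate prior P."""
    return sum(P_obj[x] * sum(S[x][m] * math.log(S[x][m] / P[m])
                              for m in msgs)
               for x in objs)

def decomposed(P):
    """-E_x[H(S)] - E_{aggregated sender}[log P]: the identity in the text."""
    ent = sum(P_obj[x] * -sum(S[x][m] * math.log(S[x][m]) for m in msgs)
              for x in objs)
    agg = {m: sum(P_obj[x] * S[x][m] for x in objs) for m in msgs}
    cross = sum(agg[m] * math.log(P[m]) for m in msgs)
    return -ent - cross

P = {"a": 0.3, "b": 0.3, "c": 0.4}        # an arbitrary prior
assert abs(expected_kl(P) - decomposed(P)) < 1e-9

# The minimizing prior is the sender's average behavior E_x[S(m | x)].
agg = {m: sum(P_obj[x] * S[x][m] for x in objs) for m in msgs}
assert expected_kl(agg) < expected_kl(P)
```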

3.5 Relationship to Surprisal Theory

We formulated the game objective as ELBO and defined its prior distribution as a learnable LM. It can also be seen as a modeling of surprisal theory (Hale, 2001; Levy, 2008; Smith & Levy, 2013) in computational psycholinguistics. The theory has recently been considered important for evaluating neural language models as cognitive models (Wilcox et al., 2020). In the theory, surprisal refers to the negative log-likelihood of sentences/messages − log 𝑃 𝜽 prior ( 𝑚 ) = − ∑ 𝑡 log 𝑃 𝜽 prior ( 𝑚 𝑡 | 𝑚 1 : 𝑡 − 1 ) , and higher surprisals are considered to require greater cognitive processing costs. Intuitively, the cost for a reader/listener/receiver to predict the next token accumulates over time. In fact, our objective 𝒥 ours models it naturally, as one notices from the following rewrite:

𝒥 ours ( 𝜙 , 𝜽 ; 𝛽 ) = 𝔼 𝑥 ∼ 𝑃 obj ( ⋅ ) [ 𝔼 𝑚 ∼ 𝑆 𝜙 ( ⋅ | 𝑥 ) [ log 𝑅 𝜽 ( 𝑥 ∣ 𝑚 ) ⏟ reconstruction + 𝛽 log 𝑃 𝜽 prior ( 𝑚 ) ⏟ negative surprisal ] + 𝛽 ℋ ( 𝑆 𝜙 ( 𝑀 ∣ 𝑥 ) ) ⏟ entropy maximizer ] .

(16)

The first term is the reconstruction, the second is the (negative) surprisal, and the third is the entropy maximizer. Reconstruction and surprisal can be in a trade-off relationship. Conveying more information results in messages with a higher cognitive cost, while messages with a lower cognitive cost may not convey as much information.4
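The chain-rule identity behind the surprisal term is easy to check with a toy autoregressive LM (the probability tables are ours):

```python
import math

# Toy autoregressive LM over {a, b, eos}: P(next token | previous token).
lm = {"<bos>": {"a": 0.6, "b": 0.3, "eos": 0.1},
      "a":     {"a": 0.1, "b": 0.6, "eos": 0.3},
      "b":     {"a": 0.5, "b": 0.1, "eos": 0.4}}

def surprisals(msg):
    """Per-token surprisals -log P(m_t | m_{1:t-1}) under the toy LM."""
    prev, out = "<bos>", []
    for tok in msg:
        out.append(-math.log(lm[prev][tok]))
        prev = tok
    return out

msg = ["a", "b", "eos"]
per_token = surprisals(msg)

# Chain rule: the per-token surprisals sum to the sequence-level
# negative log-likelihood -log P_prior(m).
seq_prob = 0.6 * 0.6 * 0.4
assert abs(sum(per_token) - (-math.log(seq_prob))) < 1e-9
```

It is this cumulative quantity that the 𝛽 log 𝑃 𝜽 prior ( 𝑚 ) term in Eq. 16 penalizes, against the reconstruction term pulling in the other direction.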

4 Experiment

In this section, we validate the effects of the formulation presented in Section 3. Specifically, we demonstrate improvements in response to the problem raised by Ueda et al. (2023) that the conventional objective does not give rise to meaningful word segmentation in emergent languages. The experimental setup is described in Section 4.1 and the results are presented in Section 4.2.5

4.1 Setup

Object and Message Spaces: We define an object space 𝒳 as a set of att-val objects. Following Ueda et al. (2023), we set ( 𝑛 att , 𝑛 val ) ∈ { ( 2 , 64 ) , ( 3 , 16 ) , ( 4 , 8 ) , ( 6 , 4 ) , ( 12 , 2 ) } , ensuring all the object space sizes | 𝒳 | = ( 𝑛 val ) 𝑛 att are 4096.6 To define a message space ℳ , we set 𝐿 max = 32 , the same as Ueda et al. (2023), and | 𝒜 | = 9 , one bigger than Ueda et al. (2023) so that eos ∈ 𝒜 .
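A quick sanity check that every (𝑛att, 𝑛val) setting yields the same object space size, mirroring the numbers above:

```python
from itertools import product

settings = [(2, 64), (3, 16), (4, 8), (6, 4), (12, 2)]
# Every setting yields the same size |X| = n_val ** n_att = 4096.
assert all(n_val ** n_att == 4096 for n_att, n_val in settings)

# Materializing one of the spaces, (n_att, n_val) = (3, 16),
# with values v_i in {1, ..., 16}:
space = list(product(range(1, 17), repeat=3))
assert len(space) == 4096
```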

Agent Architectures and Optimization: The agents 𝑆 𝜙 , 𝑅 𝜽 are based on GRU (Cho et al., 2014), following Chaabouni et al. (2020); Ueda et al. (2023), and have symbol embedding layers. LayerNorm (Ba et al., 2016) is applied to the hidden states and embeddings for faster convergence. The agents have linear layers (with biases) to predict the next symbol; using the symbol prediction layer, the sender generates a message while the receiver computes log 𝑃 𝜽 prior ( 𝑚 ) . The sender embeds an object 𝑥 into the initial hidden state, and the receiver outputs the object logits from the last hidden state. We apply Gaussian dropout (Srivastava et al., 2014) to regularize 𝜽 , because otherwise it can be unstable under the competing pressures of log 𝑅 𝜽 ( 𝑥 | 𝑚 ) and log 𝑃 𝜽 prior ( 𝑚 ) . We optimize ELBO in which latent variables are discrete sequences, similarly to Mnih & Gregor (2014). By the stochastic computation graph approach (Schulman et al., 2015), we obtain a surrogate objective that combines the standard backpropagation w.r.t. the receiver’s parameters 𝜽 and REINFORCE (Williams, 1992) w.r.t. the sender’s parameters 𝜙 . Consequently, we obtain an optimization strategy similar to the one adopted in Chaabouni et al. (2019); Ueda et al. (2023). 𝛽 is initially set ≪ 1 to avoid posterior collapse and annealed to 1 via REWO (Klushyn et al., 2019). For more detail, see Appendix C.

Evaluation: We have three criteria C1, C2, and C3 to measure the meaningfulness of segments obtained by the boundary detection algorithm. The threshold parameter is set to 0.7 We compare our objective with the following baselines: (BL1) the conventional objective 𝒥 conv plus the entropy regularizer (where 𝜆 entreg = 1 ), and (BL2) the ELBO-based objective that is almost the same as ours, except that its prior is 𝑃 𝛼 prior (where 𝛼 = log | 𝒜 | ).8 We ran each setting 16 times with distinct random seeds.

4.2 Experimental Result

Figure 2: Results for 𝑛 bou (C1), 𝑛 seg (C2), Δ w , c (C3), C-TopSim (C3), and W-TopSim (C3) are shown in order from the left. The x-axis represents ( 𝑛 att , 𝑛 val ) while the y-axis represents the values of each metric. The shaded regions and error bars represent the standard error of the mean. The threshold parameter is set to 0. The blue plots represent the results for our ELBO-based objective 𝒥 ours , the orange ones for (BL1) the conventional objective 𝒥 conv plus the entropy regularizer, and the grey ones for (BL2) the ELBO-based objective whose prior is 𝑃 𝛼 prior . The apparently inferior Δ w , c for 𝒥 ours compared to the baselines might be misleading: 𝒥 ours greatly improves both C-TopSim and W-TopSim, and the larger scale of their improvements can result in a seemingly worse Δ w , c without indicating poorer performance.

We show the results in Figure 2. The results for 𝑛 bou (C1), 𝑛 seg (C2), Δ w , c (C3), C-TopSim (C3), and W-TopSim (C3) are shown in order from the left. As will be explained below, it can be observed that the meaningfulness of segments in emergent language improves when using our objective 𝒥 ours .

Criteria C1 and C2 are satisfied with our objective 𝒥 ours : First, see the result for 𝑛 bou in Figure 2. It is evident that the line monotonically increases only for 𝒥 ours . It means that, as 𝑛 att increases, 𝑛 bou also increases only when using 𝒥 ours , confirming that C1 is satisfied. Next, see the result for 𝑛 seg in Figure 2. Again, it is evident that the line monotonically decreases only for 𝒥 ours . That is, as 𝑛 val increases, 𝑛 seg also increases only when using 𝒥 ours , confirming that C2 is satisfied.

Criterion C3 is not met but TopSim improved: Refer to the results of Δ w , c in Figure 2. Δ w , c < 0 regardless of the objective used. Even when using 𝒥 ours , C3 is not satisfied. However, observe the results for C-TopSim and W-TopSim in Figure 2. By using 𝒥 ours , both C-TopSim and W-TopSim have improved. Notably, the improvement in W-TopSim indicates that the meaningfulness of segments has improved. Δ w , c does not become positive since C-TopSim has increased even more. We speculate that one reason for Δ w , c < 0 might be due to the “spelling variation” of segments.

5 Discussion

In Section 3, we proposed formulating the signaling game objective as ELBO and discussed supporting scenarios. In particular, we suggested that the choice of prior distribution can influence whether emergent languages follow ZLA. In Section 4, we demonstrated that a learnable prior distribution (a neural language model) improves the meaningfulness of segments with respect to HAS.
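Schematically, and using generic symbols that are our notation here rather than the paper's (sender S_φ, receiver R_θ, trade-off weight β; see Section 3 for the exact formulation), the ELBO-based objective and the decomposition of its KL term can be written as:

```latex
\mathcal{J}_{\mathrm{ours}}
  = \underbrace{\mathbb{E}_{m \sim S_{\phi}(m \mid x)}\!\left[\log R_{\theta}(x \mid m)\right]}_{\text{reconstruction}}
  \;-\; \beta\, D_{\mathrm{KL}}\!\left(S_{\phi}(m \mid x) \,\middle\|\, P^{\mathrm{prior}}_{\theta}(m)\right),
```
```latex
D_{\mathrm{KL}}
  = -\,\mathcal{H}\!\left(S_{\phi}(m \mid x)\right)
  \;+\; \underbrace{\mathbb{E}_{m \sim S_{\phi}(m \mid x)}\!\Bigl[\sum_{t} -\log P^{\mathrm{prior}}_{\theta}(m_t \mid m_{1:t-1})\Bigr]}_{\text{surprisal}}.
```

The reconstruction term pushes messages to be informative about the object x, while the surprisal term (the cross-entropy with the prior) pushes each symbol to be predictable under the prior language model; these are the competing pressures discussed next.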

Why Did Segments Become More Meaningful? Though it is not straightforward to give a clear mathematical basis, we currently hypothesize that the improvement in the meaningfulness of segments is due to the competing pressures of the reconstruction and surprisal terms. On the one hand, due to the reconstruction term, a receiver has to be “surprised,” i.e., to receive symbols m_t with relatively high −log P_prior(m_t | m_{1:t−1}). Objects contain n_att statistically independent components, i.e., attributes, so the receiver must be surprised a number of times proportional to n_att in order to reconstruct the object from a sequential message. On the other hand, due to the surprisal term, the receiver does not want to be “surprised,” i.e., it prefers relatively low −log P_prior(m_t | m_{1:t−1}). In natural language, branching entropy (Eq. 5) decreases on average, with occasional spikes indicating boundaries. In other words, the next character should usually be predictable (less surprising), while boundaries appear at points where prediction is hard. The conventional objective J_conv provides no incentive to make characters “usually predictable,” whereas the surprisal term in the ELBO does.
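The falling-then-spiking behavior of branching entropy can be illustrated on a toy corpus. The sketch below (corpus and helper names are ours, not the paper's) estimates H(next character | preceding context) from n-gram counts; inside a "word" the next character is nearly determined, while at a word boundary several continuations are possible:

```python
# Toy illustration of boundary detection via branching entropy
# (Harris's articulation scheme): conditional entropy of the next
# character falls within a segment and spikes at segment boundaries.
import random
from collections import Counter, defaultdict
from math import log2

def branching_entropy(corpus, n):
    """Map each observed n-gram context to H(next char | context)."""
    counts = defaultdict(Counter)
    for seq in corpus:
        for i in range(len(seq) - n):
            counts[seq[i:i + n]][seq[i + n]] += 1
    entropy = {}
    for ctx, nxt in counts.items():
        total = sum(nxt.values())
        entropy[ctx] = -sum((c / total) * log2(c / total)
                            for c in nxt.values())
    return entropy

# "Words" ab and cd, concatenated at random, play the role of segments.
random.seed(0)
corpus = ["".join(random.choice(["ab", "cd"]) for _ in range(20))
          for _ in range(100)]

h = branching_entropy(corpus, 1)
# Within-segment contexts ("a", "c") are fully predictable (0 bits);
# boundary contexts ("b", "d") admit two continuations (≈1 bit).
print({ctx: round(e, 2) for ctx, e in sorted(h.items())})
```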

Limitations and Future Work: A populated signaling game (Chaabouni et al., 2022; Rita et al., 2022a; Michel et al., 2023), which involves multiple senders and receivers, is important from a sociolinguistic perspective (Raviv et al., 2019). Defining a reasonable ELBO for such a setting is future work; it would perhaps resemble a multi-view VAE (Suzuki & Matsuo, 2022). Another remaining task is to extend our generative perspective to discrimination games, whose objective is contrastive rather than reconstructive, such as InfoNCE (van den Oord et al., 2018). Furthermore, although we adopted a GRU as the prior message distribution, various neural language models have been proposed as cognitive models in computational (psycho)linguistics (Dyer et al., 2016; Shen et al., 2019; Stanojevic et al., 2021; Kuribayashi et al., 2022, inter alia). Investigating whether such cognitively motivated models give rise to richer structures such as syntax is a key direction.

6 Related Work

EC as Representation Learning: Several studies regard EC as representation learning (Andreas, 2019; Chaabouni et al., 2020; Xu et al., 2022). For instance, compositionality and disentanglement are often seen as synonymous. TopSim is an instance of representation similarity analysis (RSA, Kriegeskorte et al., 2008), as van der Wal et al. (2020) pointed out. Resnick et al. (2020) also formulated the objective as ELBO. However, they defined messages as fixed-length binary strings and agents as non-autoregressive models, which is not applicable to our variable-length setting. Moreover, they did not discuss the choice of prior distribution; they seem to have adopted a Bernoulli prior purely for computational convenience. In contrast, we showed that this choice influences the structure of emergent languages.

VIB: Tucker et al. (2022) defined a communication game called VQ-VIB, based on the Variational Information Bottleneck (VIB, Alemi et al., 2017), which is known as a generalization of beta-VAE. Chaabouni et al. (2021) formalized a color naming game with a similar motivation. Note that messages are single symbols in their settings, whereas messages are of variable length in ours, allowing us to discuss structural properties of emergent languages such as ZLA and HAS.

MH Naming Game: Taniguchi et al. (2022), Inukai et al. (2023), and Hoang et al. (2024) defined a (recursive) MH naming game in which communication was formulated as representation learning on a generative model and emerged through MCMC instead of variational inference.

Natural Language as Prior: Havrylov & Titov (2017) formalized a referential game with a pre-trained natural language model as a prior distribution. Their qualitative evaluation showed that languages became more compositional. Lazaridou et al. (2020) adopted a similar approach.

Dealing with Discreteness: Though we adopted the REINFORCE-like optimization following the previous EC work (Chaabouni et al., 2019; Ueda et al., 2023), there are various approaches dealing with discrete variables (Rolfe, 2017; Vahdat et al., 2018a; b; Jang et al., 2017; Maddison et al., 2017).

7 Conclusion

In this paper, EC was reinterpreted as representation learning on a generative model. We regarded Lewis’s signaling game as beta-VAE and formulated its objective as ELBO. Our main contributions are (1) to show the similarities between the game and (beta-)VAE, (2) to indicate the impact of prior message distribution on the statistical properties of emergent languages, and (3) to present a computational psycholinguistic interpretation of emergent communication. Specifically, we addressed the issues of Zipf’s law of abbreviation and Harris’s articulation scheme, which had not been reproduced well with the conventional objective. Another important point is that we revisited the variable-length message setting, which previous work often avoided. The remaining tasks for future work are formulating a populated or discrimination game (while keeping the generative viewpoint) and introducing cognitively motivated language models as prior distributions.

Acknowledgments

This work was supported by JSPS KAKENHI Grant Numbers JP21H04904, JP23H04835, and JP23KJ0768. We would like to express our gratitude to the anonymous reviewers for their lively and constructive discussions.

Reproducibility Statement

The reproducibility of the experiments in this paper is ensured in the following ways:

The experimental setup is described in Section 4.1 and Appendix C.

The source code has been released upon acceptance (see Appendix C).

References

Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. ICLR 2017. https://openreview.net/forum?id=HyxQzBceg
Alexander A. Alemi, Ben Poole, Ian Fischer, Joshua V. Dillon, Rif A. Saurous, and Kevin Murphy. Fixing a broken ELBO. ICML 2018. http://proceedings.mlr.press/v80/alemi18a.html
Jacob Andreas. Measuring compositionality in representation learning. ICLR 2019. https://openreview.net/forum?id=HJz05o0qK7
Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR abs/1607.06450, 2016. http://arxiv.org/abs/1607.06450
Timothy C. Bell, John G. Cleary, and Ian H. Witten. Text Compression. Prentice-Hall, 1990.
Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating sentences from a continuous space. CoNLL 2016. https://doi.org/10.18653/v1/k16-1002
Nicolo’ Brandizzi. Towards more human-like AI communication: A review of emergent communication research. CoRR abs/2308.02541, 2023. https://doi.org/10.48550/arXiv.2308.02541
Henry Brighton and Simon Kirby. Understanding linguistic evolution by visualizing the emergence of topographic mappings. Artificial Life, 12(2):229–242, 2006. https://doi.org/10.1162/artl.2006.12.2.229
Rahma Chaabouni, Eugene Kharitonov, Emmanuel Dupoux, and Marco Baroni. Anti-efficient encoding in emergent communication. NeurIPS 2019. https://proceedings.neurips.cc/paper/2019/hash/31ca0ca71184bbdb3de7b20a51e88e90-Abstract.html
Rahma Chaabouni, Eugene Kharitonov, Diane Bouchacourt, Emmanuel Dupoux, and Marco Baroni. Compositionality and generalization in emergent languages. ACL 2020. https://doi.org/10.18653/v1/2020.acl-main.407
Rahma Chaabouni, Eugene Kharitonov, Emmanuel Dupoux, and Marco Baroni. Communicating artificial neural networks develop efficient color-naming systems. PNAS, 118(12):e2016569118, 2021. https://www.pnas.org/doi/abs/10.1073/pnas.2016569118
Rahma Chaabouni, Florian Strub, Florent Altché, Eugene Tarassov, Corentin Tallec, Elnaz Davoodi, Kory Wallace Mathewson, Olivier Tieleman, Angeliki Lazaridou, and Bilal Piot. Emergent communication at scale. ICLR 2022. https://openreview.net/forum?id=AUGBfDIV9rL
Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP 2014. https://doi.org/10.3115/v1/d14-1179
Edward Choi, Angeliki Lazaridou, and Nando de Freitas. Compositional obverter communication learning from raw visual input. ICLR 2018. https://openreview.net/forum?id=rknt2Be0-
Brian Conrad and Michael Mitzenmacher. Power laws for monkeys typing randomly: the case of unequal probabilities. IEEE Transactions on Information Theory, 50(7):1403–1414, 2004. doi: 10.1109/TIT.2004.830752
Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, 2006.
Roberto Dessì, Eugene Kharitonov, and Marco Baroni. Interpretable agent communication from scratch (with a generic visual processor emerging on the side). NeurIPS 2021. https://proceedings.neurips.cc/paper/2021/hash/e250c59336b505ed411d455abaa30b4d-Abstract.html
Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. Recurrent neural network grammars. NAACL-HLT 2016. https://doi.org/10.18653/v1/n16-1024
Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. NeurIPS 2016. https://proceedings.neurips.cc/paper/2016/hash/c7635bfd99248a2cdef8249ef7bfbef4-Abstract.html
Hao Fu, Chunyuan Li, Xiaodong Liu, Jianfeng Gao, Asli Celikyilmaz, and Lawrence Carin. Cyclical annealing schedule: A simple approach to mitigating KL vanishing. NAACL-HLT 2019. https://doi.org/10.18653/v1/n19-1021
Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. NeurIPS 2016. https://proceedings.neurips.cc/paper/2016/hash/076a0c97d09cf1a0ec3e19c7f2529f2b-Abstract.html
Lukas Galke, Yoav Ram, and Limor Raviv. Emergent communication for understanding human language evolution: What’s missing? CoRR abs/2204.10590, 2022. https://doi.org/10.48550/arXiv.2204.10590
John Hale. A probabilistic Earley parser as a psycholinguistic model. NAACL 2001. https://aclanthology.org/N01-1021/
Zellig S. Harris. From phoneme to morpheme. Language, 31(2):190–222, 1955. http://www.jstor.org/stable/411036
Serhii Havrylov and Ivan Titov. Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. NeurIPS 2017. https://proceedings.neurips.cc/paper/2017/hash/70222949cc0db89ab32c9969754d4758-Abstract.html
Irina Higgins, Loïc Matthey, Arka Pal, Christopher P. Burgess, Xavier Glorot, Matthew M. Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR 2017. https://openreview.net/forum?id=Sy2fzU9gl
Nguyen Le Hoang, Tadahiro Taniguchi, Yoshinobu Hagiwara, and Akira Taniguchi. Emergent communication of multimodal deep generative models based on Metropolis-Hastings naming game. Frontiers in Robotics and AI, 10, 2024. https://www.frontiersin.org/articles/10.3389/frobt.2023.1290604
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. https://doi.org/10.1162/neco.1997.9.8.1735
Jun Inukai, Tadahiro Taniguchi, Akira Taniguchi, and Yoshinobu Hagiwara. Recursive Metropolis-Hastings naming game: Symbol emergence in a multi-agent system based on probabilistic generative models. CoRR abs/2305.19761, 2023. https://doi.org/10.48550/arXiv.2305.19761
Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. ICLR 2017. https://openreview.net/forum?id=rkE3y85ee
Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Çaglar Gülçehre, Pedro A. Ortega, DJ Strouse, Joel Z. Leibo, and Nando de Freitas. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. ICML 2019. http://proceedings.mlr.press/v97/jaques19a.html
Eugene Kharitonov, Rahma Chaabouni, Diane Bouchacourt, and Marco Baroni. Entropy minimization in emergent languages. ICML 2020. http://proceedings.mlr.press/v119/kharitonov20a.html
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR 2015. http://arxiv.org/abs/1412.6980
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. ICLR 2014. http://arxiv.org/abs/1312.6114
Simon Kirby, Tom Griffiths, and Kenny Smith. Iterated learning and the evolution of language. Current Opinion in Neurobiology, 28:108–114, 2014. https://www.sciencedirect.com/science/article/pii/S0959438814001421
Simon Kirby, Monica Tamariz, Hannah Cornish, and Kenny Smith. Compression and communication in the cultural evolution of linguistic structure. Cognition, 141:87–102, 2015. https://www.sciencedirect.com/science/article/pii/S0010027715000815
Alexej Klushyn, Nutan Chen, Richard Kurle, Botond Cseke, and Patrick van der Smagt. Learning hierarchical priors in VAEs. NeurIPS 2019. https://proceedings.neurips.cc/paper/2019/hash/7d12b66d3df6af8d429c1a357d8b9e1a-Abstract.html
Satwik Kottur, José M. F. Moura, Stefan Lee, and Dhruv Batra. Natural language does not emerge ’naturally’ in multi-agent dialog. EMNLP 2017. https://doi.org/10.18653/v1/d17-1321
Nikolaus Kriegeskorte, Marieke Mur, and Peter Bandettini. Representational similarity analysis — connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2, 2008. https://www.frontiersin.org/articles/10.3389/neuro.06.004.2008
Tatsuki Kuribayashi, Yohei Oseki, Ana Brassard, and Kentaro Inui. Context limitations make neural language models more human-like. EMNLP 2022. https://doi.org/10.18653/v1/2022.emnlp-main.712
Angeliki Lazaridou and Marco Baroni. Emergent multi-agent communication in the deep learning era. CoRR abs/2006.02419, 2020. https://arxiv.org/abs/2006.02419
Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Multi-agent cooperation and the emergence of (natural) language. ICLR 2017. https://openreview.net/forum?id=Hk8N3Sclg
Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls, and Stephen Clark. Emergence of linguistic communication from referential games with symbolic and pixel input. ICLR 2018. https://openreview.net/forum?id=HJGv1Z-AW
Angeliki Lazaridou, Anna Potapenko, and Olivier Tieleman. Multi-agent communication meets natural language: Synergies between functional and structural language learning. ACL 2020. https://doi.org/10.18653/v1/2020.acl-main.685
Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. CoRR abs/1805.00909, 2018. http://arxiv.org/abs/1805.00909
Roger Levy. Expectation-based syntactic comprehension. Cognition, 106(3):1126–1177, 2008. https://www.sciencedirect.com/science/article/pii/S0010027707001436
David K. Lewis. Convention: A Philosophical Study. Wiley-Blackwell, 1969.
Fushan Li and Michael Bowling. Ease-of-teaching and language structure from emergent communication. NeurIPS 2019. https://proceedings.neurips.cc/paper/2019/hash/b0cf188d74589db9b23d5d277238a929-Abstract.html
Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. NeurIPS 2017. https://proceedings.neurips.cc/paper/2017/hash/68a9750337a418a86fe06c1991a1d64c-Abstract.html
Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. ICLR 2017. https://openreview.net/forum?id=S1jE5L5gl
André Martinet. Éléments de linguistique générale. Armand Colin, 1960.
Paul Michel, Mathieu Rita, Kory Wallace Mathewson, Olivier Tieleman, and Angeliki Lazaridou. Revisiting populations in multi-agent communication. ICLR 2023. https://openreview.net/pdf?id=n-UHRIdPju
George A. Miller. Some effects of intermittent silence. The American Journal of Psychology, 70(2):311–314, 1957. http://www.jstor.org/stable/1419346
Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. ICML 2014. http://proceedings.mlr.press/v32/mnih14.html
Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. ICML 2016. http://proceedings.mlr.press/v48/mniha16.html
Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. AAAI 2018. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17007
Limor Raviv, Antje Meyer, and Shiri Lev-Ari. Larger communities create more systematic languages. Proceedings of the Royal Society B, 286(1907):20191262, 2019. https://royalsocietypublishing.org/doi/abs/10.1098/rspb.2019.1262
Yi Ren, Shangmin Guo, Matthieu Labeau, Shay B. Cohen, and Simon Kirby. Compositional languages emerge in a neural iterated learning model. ICLR 2020. https://openreview.net/forum?id=HkePNpVKPB
Cinjon Resnick, Abhinav Gupta, Jakob N. Foerster, Andrew M. Dai, and Kyunghyun Cho. Capacity, bandwidth, and compositionality in emergent language learning. AAMAS 2020. https://dl.acm.org/doi/10.5555/3398761.3398892
Ryokan Ri, Ryo Ueda, and Jason Naradowsky. Emergent communication with attention. CoRR abs/2305.10920, 2023. https://doi.org/10.48550/arXiv.2305.10920
Mathieu Rita, Rahma Chaabouni, and Emmanuel Dupoux. “LazImpa”: Lazy and impatient neural agents learn to communicate efficiently. CoNLL 2020. https://doi.org/10.18653/v1/2020.conll-1.26
Mathieu Rita, Florian Strub, Jean-Bastien Grill, Olivier Pietquin, and Emmanuel Dupoux. On the role of population heterogeneity in emergent communication. ICLR 2022a. https://openreview.net/forum?id=5Qkd7-bZfI
Mathieu Rita, Corentin Tallec, Paul Michel, Jean-Bastien Grill, Olivier Pietquin, Emmanuel Dupoux, and Florian Strub. Emergent communication: Generalization and overfitting in Lewis games. NeurIPS 2022b. http://papers.nips.cc/paper_files/paper/2022/hash/093b08a7ad6e6dd8d34b9cc86bb5f07c-Abstract-Conference.html
Jason Tyler Rolfe. Discrete variational autoencoders. ICLR 2017. https://openreview.net/forum?id=ryMxXPFex
John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. NeurIPS 2015. https://proceedings.neurips.cc/paper/2015/hash/de03beffeed9da5f3639a621bcab5dd4-Abstract.html
Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron C. Courville. Ordered neurons: Integrating tree structures into recurrent neural networks. ICLR 2019. https://openreview.net/forum?id=B1l6qiR5F7
Brian Skyrms. Signals: Evolution, Learning, and Information. Oxford University Press, 2010.
Nathaniel J. Smith and Roger Levy. The effect of word predictability on reading time is logarithmic. Cognition, 128(3):302–319, 2013. https://www.sciencedirect.com/science/article/pii/S0010027713000413
Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014. https://dl.acm.org/doi/10.5555/2627435.2670313
Milos Stanojevic, Shohini Bhattasali, Donald Dunagan, Luca Campanelli, Mark Steedman, Jonathan Brennan, and John T. Hale. Modeling incremental language comprehension in the brain with combinatory categorial grammar. CMCL 2021. https://doi.org/10.18653/v1/2021.cmcl-1.3
Masahiro Suzuki and Yutaka Matsuo. A survey of multimodal deep generative models. Advanced Robotics, 36(5-6):261–278, 2022. https://doi.org/10.1080/01691864.2022.2035253
Kumiko Tanaka-Ishii. Entropy as an indicator of context boundaries: An experiment using a web search engine. IJCNLP 2005. https://doi.org/10.1007/11562214_9
Kumiko Tanaka-Ishii. Articulation of Elements, pp. 115–124. Springer International Publishing, 2021. https://doi.org/10.1007/978-3-030-59377-3_11
Kumiko Tanaka-Ishii and Yuichiro Ishii. Multilingual phrase-based concordance generation in real-time. Information Retrieval, 10(3):275–295, 2007. https://doi.org/10.1007/s10791-006-9021-5
Tadahiro Taniguchi, Yuto Yoshida, Akira Taniguchi, and Yoshinobu Hagiwara. Emergent communication through Metropolis-Hastings naming game with deep generative models. CoRR abs/2205.12392, 2022. https://doi.org/10.48550/arXiv.2205.12392
Mycal Tucker, Roger Levy, Julie A. Shah, and Noga Zaslavsky. Trading off utility, informativeness, and complexity in emergent communication. NeurIPS 2022. http://papers.nips.cc/paper_files/paper/2022/hash/8bb5f66371c7e4cbf6c223162c62c0f4-Abstract-Conference.html
Ryo Ueda and Koki Washio. On the relationship between Zipf’s law of abbreviation and interfering noise in emergent languages. ACL-IJCNLP 2021 Student Research Workshop. https://doi.org/10.18653/v1/2021.acl-srw.6
Ryo Ueda, Taiga Ishii, and Yusuke Miyao. On the word boundaries of emergent languages based on Harris’s articulation scheme. ICLR 2023. https://openreview.net/pdf?id=b4t9_XASt6G
Arash Vahdat, Evgeny Andriyash, and William G. Macready. DVAE#: Discrete variational autoencoders with relaxed Boltzmann priors. NeurIPS 2018a. https://proceedings.neurips.cc/paper/2018/hash/9f53d83ec0691550f7d2507d57f4f5a2-Abstract.html
Arash Vahdat, William G. Macready, Zhengbing Bian, Amir Khoshaman, and Evgeny Andriyash. DVAE++: Discrete variational autoencoders with overlapping transformations. ICML 2018b. http://proceedings.mlr.press/v80/vahdat18a.html
Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. CoRR abs/1807.03748, 2018. http://arxiv.org/abs/1807.03748
Oskar van der Wal, Silvan de Boer, Elia Bruni, and Dieuwke Hupkes. The grammar of emergent languages. EMNLP 2020. https://doi.org/10.18653/v1/2020.emnlp-main.270
Ethan Wilcox, Jon Gauthier, Jennifer Hu, Peng Qian, and Roger Levy. On the predictive power of neural language models for human real-time comprehension behavior. CogSci 2020. https://cogsci.mindmodeling.org/2020/papers/0375/index.html
Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992. https://doi.org/10.1007/BF00992696
Ronald J. Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3:241–268, 1991.
Zhenlin Xu, Marc Niethammer, and Colin Raffel. Compositional generalization in unsupervised compositional representation learning: A study on disentanglement and emergent language. NeurIPS 2022. http://papers.nips.cc/paper_files/paper/2022/hash/9f9ecbf4062842df17ec3f4ea3ad7f54-Abstract-Conference.html
Zaslavsky et al.
(2022) ↑ Noga Zaslavsky, Karee Garvin, Charles Kemp, Naftali Tishby, and Terry Regier.The evolution of color naming reflects pressure for efficiency: Evidence from the recent past.Journal of Language Evolution, 7(2):184–199, 04 2022.ISSN 2058-458X.doi: 10.1093/jole/lzac001.URL https://doi.org/10.1093/jole/lzac001. Zipf (1935) ↑ George K. Zipf.The psycho-biology of language.Houghton Mifflin, 1935. Zipf (1949) ↑ George K. Zipf.Human Behaviour and the Principle of Least Effort.Addison-Wesley, 1949. Appendix AHAS-based Boundary Detection Algorithm Algorithm 1 Boundary Detection Algorithm 1: 𝑖 ← 0 ; 𝑤 ← 1 ; ℬ ← { } 2:while  𝑖 < 𝑛  do 3:     Compute BE ⁢ ( 𝑠 𝑖 : 𝑖 + 𝑤 − 1 ) 4:     if  BE ⁢ ( 𝑠 𝑖 : 𝑖 + 𝑤 − 1 ) − BE ⁢ ( 𝑠 𝑖 : 𝑖 + 𝑤 − 2 )

threshold & 𝑤

1  then 5:          ℬ ← ℬ ∪ { 𝑖 + 𝑤 } 6:     end if 7:     if  𝑖 + 𝑤 < 𝑛 − 1  then 8:          𝑤 ← 𝑤 + 1 9:     else 10:          𝑖 ← 𝑖 + 1 ; 𝑤 ← 1 11:     end if 12:end while

We show the HAS-based boundary detection algorithm (Tanaka-Ishii, 2005; Tanaka-Ishii & Ishii, 2007) in Algorithm 1. Note that this pseudocode is a simplified version due to Ueda et al. (2023). $\mathcal{B}$ is the set of detected boundary positions, and threshold is a hyperparameter.
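As a concrete illustration, the detection loop above can be sketched in Python. The sketch below is ours, not the authors' implementation: `branching_entropy` estimates BE(s) by a naive count-based entropy of next-symbol distributions over a toy corpus, and `detect_boundaries` mirrors Algorithm 1.

```python
import math
from collections import Counter, defaultdict

def branching_entropy(corpus):
    """Estimate BE(s) for every substring s in the corpus: the entropy
    of the next-symbol distribution observed after occurrences of s."""
    followers = defaultdict(Counter)
    for seq in corpus:
        for i in range(len(seq)):
            for j in range(i + 1, len(seq)):
                followers[tuple(seq[i:j])][seq[j]] += 1
    be = {}
    for ctx, counts in followers.items():
        total = sum(counts.values())
        be[ctx] = -sum((c / total) * math.log2(c / total)
                       for c in counts.values())
    return be

def detect_boundaries(seq, be, threshold=0.0):
    """Simplified HAS-based boundary detection (cf. Algorithm 1): insert
    a boundary at i+w whenever extending the context from width w-1 to w
    raises the branching entropy by more than `threshold`."""
    n = len(seq)
    boundaries = set()
    i, w = 0, 1
    while i < n:
        cur = be.get(tuple(seq[i:i + w]), 0.0)
        prev = be.get(tuple(seq[i:i + w - 1]), 0.0)
        if w > 1 and cur - prev > threshold:
            boundaries.add(i + w)
        if i + w < n - 1:
            w += 1
        else:
            i, w = i + 1, 1
    return boundaries
```

On the toy corpus `["abc", "abd"]`, for instance, the entropy after the predictable prefix `ab` jumps from 0 to 1 bit, so a single boundary is detected between `ab` and the final symbol.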

Appendix B: Proofs on the Existence of Implicit Priors

B.1 Proof of Eq. 9 and Eq. 11

Remark 1.

Since $P^{\text{prior}}_{\alpha}(m \mid \alpha) \propto \exp(-\alpha|m|)$ by definition,

$$P^{\text{prior}}_{\alpha}(m \mid \alpha) = \frac{1}{Z_{\alpha}} \exp(-\alpha|m|),$$

where $Z_{\alpha}$ is a normalizer ensuring that $P^{\text{prior}}_{\alpha}(M \mid \alpha)$ is a probability distribution. $Z_{\alpha}$ can be written as follows:

$$\begin{aligned}
Z_{\alpha} &= \sum_{m \in \mathcal{M}} \exp(-\alpha|m|) \\
&= \exp(-\alpha) \sum_{l=1}^{L_{\max}} \left\{ (|\mathcal{A}|-1)\exp(-\alpha) \right\}^{l-1} &&\text{(there are $(|\mathcal{A}|-1)^{l-1}$ messages of length $l$)} \\
&= \exp(-\alpha) \cdot \frac{1 - \left\{ (|\mathcal{A}|-1)\exp(-\alpha) \right\}^{L_{\max}}}{1 - (|\mathcal{A}|-1)\exp(-\alpha)} \\
&= \frac{1 - \left\{ (|\mathcal{A}|-1)\exp(-\alpha) \right\}^{L_{\max}}}{\exp(\alpha) - |\mathcal{A}| + 1}.
\end{aligned}$$
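As a quick sanity check of the closed form above, one can compare it against brute-force enumeration over message lengths (an illustrative script of ours; `A` stands for the alphabet size $|\mathcal{A}|$, including eos):

```python
import math

def z_alpha_closed(alpha, A, L_max):
    """Closed form: Z = (1 - {(A-1) e^{-alpha}}^{L_max}) / (e^{alpha} - A + 1)."""
    r = (A - 1) * math.exp(-alpha)
    return (1.0 - r ** L_max) / (math.exp(alpha) - A + 1)

def z_alpha_bruteforce(alpha, A, L_max):
    """Direct sum: (A-1)^{l-1} messages of length l, each with mass e^{-alpha*l}."""
    return sum((A - 1) ** (l - 1) * math.exp(-alpha * l)
               for l in range(1, L_max + 1))
```

The two computations agree up to floating-point error for any alphabet size and any maximum length, as the geometric-sum manipulation in the derivation predicts.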

Lemma 1.

The following equation holds:

$$\nabla_{\phi,\theta}\, \mathbb{E}_{x \sim P_{\text{obj}}(\cdot),\, m \sim S_{\phi}(\cdot \mid x)}\!\left[ f_{\theta}(x, m) \right] = \mathbb{E}_{x \sim P_{\text{obj}}(\cdot),\, m \sim S_{\phi}(\cdot \mid x)}\!\left[ \nabla_{\phi,\theta} f_{\theta}(x, m) + f_{\theta}(x, m)\, \nabla_{\phi,\theta} \log S_{\phi}(m \mid x) \right],$$

where $f_{\theta} : \mathcal{X} \times \mathcal{M} \to \mathbb{R}$ is any function differentiable w.r.t. $\theta$.

Proof of Lemma 1. Let $\operatorname{supp}(P)$ denote the support of a given probability measure $P$. For notational convenience, define $\mathcal{M}'_{x} := \operatorname{supp}(S_{\phi}(M \mid x))$. Then,

$$\begin{aligned}
&\nabla_{\phi,\theta}\, \mathbb{E}_{x \sim P_{\text{obj}}(\cdot),\, m \sim S_{\phi}(\cdot \mid x)}\!\left[ f_{\theta}(x, m) \right] \\
&= \nabla_{\phi,\theta} \sum_{x \in \mathcal{X}} \sum_{m \in \mathcal{M}'_{x}} P_{\text{obj}}(x)\, S_{\phi}(m \mid x)\, f_{\theta}(x, m) \\
&= \sum_{x \in \mathcal{X}} \sum_{m \in \mathcal{M}'_{x}} P_{\text{obj}}(x) \left( S_{\phi}(m \mid x)\, \nabla_{\phi,\theta} f_{\theta}(x, m) + \nabla_{\phi,\theta} S_{\phi}(m \mid x)\, f_{\theta}(x, m) \right) \\
&= \sum_{x \in \mathcal{X}} \sum_{m \in \mathcal{M}'_{x}} P_{\text{obj}}(x) \left( S_{\phi}(m \mid x)\, \nabla_{\phi,\theta} f_{\theta}(x, m) + S_{\phi}(m \mid x)\, \frac{\nabla_{\phi,\theta} S_{\phi}(m \mid x)}{S_{\phi}(m \mid x)}\, f_{\theta}(x, m) \right) \\
&= \sum_{x \in \mathcal{X}} \sum_{m \in \mathcal{M}'_{x}} P_{\text{obj}}(x) \left( S_{\phi}(m \mid x)\, \nabla_{\phi,\theta} f_{\theta}(x, m) + S_{\phi}(m \mid x) \left( \nabla_{\phi,\theta} \log S_{\phi}(m \mid x) \right) f_{\theta}(x, m) \right) \\
&= \sum_{x \in \mathcal{X}} \sum_{m \in \mathcal{M}'_{x}} P_{\text{obj}}(x)\, S_{\phi}(m \mid x) \left( \nabla_{\phi,\theta} f_{\theta}(x, m) + f_{\theta}(x, m)\, \nabla_{\phi,\theta} \log S_{\phi}(m \mid x) \right) \\
&= \mathbb{E}_{x \sim P_{\text{obj}}(\cdot),\, m \sim S_{\phi}(\cdot \mid x)}\!\left[ \nabla_{\phi,\theta} f_{\theta}(x, m) + f_{\theta}(x, m)\, \nabla_{\phi,\theta} \log S_{\phi}(m \mid x) \right]. \qquad \blacksquare
\end{aligned}$$

Remark 2.

As Chaabouni et al. (2019) pointed out, Lemma 1 can be seen as a special case of the stochastic computation graph approach by Schulman et al. (2015).

Lemma 2.

The following equation holds:

$$\mathbb{E}_{x \sim P_{\text{obj}}(\cdot),\, m \sim S_{\phi}(\cdot \mid x)}\!\left[ g(x)\, \nabla_{\phi,\theta} \log S_{\phi}(m \mid x) \right] = 0,$$

where $g : \mathcal{X} \to \mathbb{R}$ is any function. Such a function $g$ is called a baseline in the reinforcement learning literature.

Proof of Lemma 2. Let $\operatorname{supp}(P)$ denote the support of a given probability measure $P$. For notational convenience, define $\mathcal{M}'_{x} := \operatorname{supp}(S_{\phi}(M \mid x))$. Then,

$$\begin{aligned}
&\mathbb{E}_{x \sim P_{\text{obj}}(\cdot),\, m \sim S_{\phi}(\cdot \mid x)}\!\left[ g(x)\, \nabla_{\phi,\theta} \log S_{\phi}(m \mid x) \right] \\
&= \sum_{x \in \mathcal{X}} P_{\text{obj}}(x)\, g(x) \sum_{m \in \mathcal{M}'_{x}} S_{\phi}(m \mid x)\, \nabla_{\phi,\theta} \log S_{\phi}(m \mid x) \\
&= \sum_{x \in \mathcal{X}} P_{\text{obj}}(x)\, g(x) \sum_{m \in \mathcal{M}'_{x}} S_{\phi}(m \mid x)\, \frac{\nabla_{\phi,\theta} S_{\phi}(m \mid x)}{S_{\phi}(m \mid x)} \\
&= \sum_{x \in \mathcal{X}} P_{\text{obj}}(x)\, g(x) \sum_{m \in \mathcal{M}'_{x}} \nabla_{\phi,\theta} S_{\phi}(m \mid x) \\
&= \sum_{x \in \mathcal{X}} P_{\text{obj}}(x)\, g(x)\, \nabla_{\phi,\theta} \Big( \sum_{m \in \mathcal{M}'_{x}} S_{\phi}(m \mid x) \Big) \\
&= \sum_{x \in \mathcal{X}} P_{\text{obj}}(x)\, g(x)\, \nabla_{\phi,\theta}\, 1 \\
&= 0. \qquad \blacksquare
\end{aligned}$$
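Lemma 2 can also be verified numerically for a single categorical sender with a softmax parametrization: averaging $g \cdot \nabla \log S$ over the exact distribution yields the zero vector. The snippet below is an illustrative sketch of ours, not part of the paper's codebase.

```python
import math

def softmax(logits):
    mx = max(logits)
    exps = [math.exp(v - mx) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def grad_log_softmax(logits, k):
    """d/d(logits_j) of log softmax(logits)_k = 1[j == k] - softmax(logits)_j."""
    p = softmax(logits)
    return [(1.0 if j == k else 0.0) - p[j] for j in range(len(logits))]

def baseline_term_expectation(logits, g):
    """E_{m ~ S}[ g * grad log S(m) ], computed exactly by enumeration."""
    p = softmax(logits)
    n = len(logits)
    total = [0.0] * n
    for k in range(n):            # enumerate every outcome m = k
        grad = grad_log_softmax(logits, k)
        for j in range(n):
            total[j] += p[k] * g * grad[j]
    return total
```

Whatever the logits and the baseline value $g$, the expectation cancels exactly, mirroring the $\nabla_{\phi,\theta}\, 1 = 0$ step of the proof.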

Proof of Eq. 9 and Eq. 11. As $P^{\text{prior}}_{\text{unif}}(M)$ is a special case of $P^{\text{prior}}_{\alpha}(M \mid \alpha)$, it is sufficient to prove Eq. 11.

Recall that what we have to prove, i.e., Eq. 11, is as follows:

$$\nabla_{\phi,\theta}\, \mathcal{J}_{\text{conv}}(\phi, \theta; \alpha) = \nabla_{\phi,\theta}\, \mathbb{E}_{x \sim P_{\text{obj}}(\cdot),\, m \sim S_{\phi}(\cdot \mid x)}\!\left[ \log P^{\text{joint}}_{\theta,\alpha}(x, m) \right],$$

where

$$P^{\text{prior}}_{\alpha}(m \mid \alpha) = \frac{1}{Z_{\alpha}} \exp(-\alpha|m|), \qquad P^{\text{joint}}_{\theta,\alpha}(x, m) = R_{\theta}(x \mid m)\, P^{\text{prior}}_{\alpha}(m).$$

Here, let us define $T$ (for convenience) as:

$$T := \mathbb{E}_{x \sim P_{\text{obj}}(\cdot),\, m \sim S_{\phi}(\cdot \mid x)}\!\left[ \nabla_{\phi,\theta} \log R_{\theta}(x \mid m) + \left\{ \log R_{\theta}(x \mid m) - \alpha|m| \right\} \nabla_{\phi,\theta} \log S_{\phi}(m \mid x) \right].$$

In what follows, we first prove that the left-hand side of Eq. 11 is equal to $T$, next prove that the right-hand side of Eq. 11 is equal to $T$, and finally conclude that both sides are equal, so that Eq. 11 holds. Throughout, $\mathbb{E}$ abbreviates $\mathbb{E}_{x \sim P_{\text{obj}}(\cdot),\, m \sim S_{\phi}(\cdot \mid x)}$.

On the one hand, the left-hand side of Eq. 11 can be transformed as follows:

$$\begin{aligned}
&\nabla_{\phi,\theta}\, \mathcal{J}_{\text{conv}}(\phi, \theta; \alpha) \\
&= \nabla_{\phi,\theta}\, \mathbb{E}\!\left[ \log R_{\theta}(x \mid m) - \alpha|m| \right] &&\text{(by the definition of $\mathcal{J}_{\text{conv}}(\phi,\theta;\alpha)$)} \\
&= \mathbb{E}\!\left[ \nabla_{\phi,\theta} \log R_{\theta}(x \mid m) - \alpha\, \nabla_{\phi,\theta} |m| + \left\{ \log R_{\theta}(x \mid m) - \alpha|m| \right\} \nabla_{\phi,\theta} \log S_{\phi}(m \mid x) \right] &&\text{(from Lemma 1, defining $f_{\theta}(x,m) := \log R_{\theta}(x \mid m) - \alpha|m|$)} \\
&= \mathbb{E}\!\left[ \nabla_{\phi,\theta} \log R_{\theta}(x \mid m) + \left\{ \log R_{\theta}(x \mid m) - \alpha|m| \right\} \nabla_{\phi,\theta} \log S_{\phi}(m \mid x) \right] &&\text{(because $\nabla_{\phi,\theta}|m| = 0$)} \\
&= T.
\end{aligned}$$

On the other hand, the right-hand side of Eq. 11 can be transformed as follows:

$$\begin{aligned}
&\nabla_{\phi,\theta}\, \mathbb{E}\!\left[ \log P^{\text{joint}}_{\theta,\alpha}(x, m) \right] \\
&= \nabla_{\phi,\theta}\, \mathbb{E}\!\left[ \log R_{\theta}(x \mid m) + \log P^{\text{prior}}_{\alpha}(m \mid \alpha) \right] &&\text{(by the definition of $P^{\text{joint}}_{\theta,\alpha}$)} \\
&= \mathbb{E}\big[ \nabla_{\phi,\theta} \log R_{\theta}(x \mid m) + \nabla_{\phi,\theta} \log P^{\text{prior}}_{\alpha}(m \mid \alpha) + \left\{ \log R_{\theta}(x \mid m) + \log P^{\text{prior}}_{\alpha}(m \mid \alpha) \right\} \nabla_{\phi,\theta} \log S_{\phi}(m \mid x) \big] &&\text{(from Lemma 1, defining $f_{\theta}(x,m) := \log R_{\theta}(x \mid m) + \log P^{\text{prior}}_{\alpha}(m \mid \alpha)$)} \\
&= \mathbb{E}\big[ \nabla_{\phi,\theta} \log R_{\theta}(x \mid m) + \left\{ \log R_{\theta}(x \mid m) - \alpha|m| - \log Z_{\alpha} \right\} \nabla_{\phi,\theta} \log S_{\phi}(m \mid x) \big] &&\text{(because $\nabla_{\phi,\theta} \log P^{\text{prior}}_{\alpha}(m \mid \alpha) = 0$ and $\log P^{\text{prior}}_{\alpha}(m \mid \alpha) = -\alpha|m| - \log Z_{\alpha}$)} \\
&= \mathbb{E}\big[ \nabla_{\phi,\theta} \log R_{\theta}(x \mid m) + \left\{ \log R_{\theta}(x \mid m) - \alpha|m| \right\} \nabla_{\phi,\theta} \log S_{\phi}(m \mid x) \big] &&\text{(from Lemma 2, defining $g(x) := \log Z_{\alpha}$)} \\
&= T.
\end{aligned}$$

We thus conclude that $\nabla_{\phi,\theta}\, \mathcal{J}_{\text{conv}}(\phi, \theta; \alpha) = T = \nabla_{\phi,\theta}\, \mathbb{E}\!\left[ \log P^{\text{joint}}_{\theta,\alpha}(x, m) \right]$, and therefore Eq. 11 holds. For Eq. 9, consider the special case $\alpha = 0$. $\blacksquare$

B.2 Proof of Eq. 13

Recall that $\mathcal{J}_{\text{elbo}}(\phi, \theta; \alpha)$ is defined as:

$$\mathcal{J}_{\text{elbo}}(\phi, \theta; \alpha) := \mathbb{E}_{x \sim P_{\text{obj}}(\cdot)}\!\left[ \mathbb{E}_{m \sim S_{\phi}(\cdot \mid x)}\!\left[ \log R_{\theta}(x \mid m) \right] - D_{\mathrm{KL}}\!\left( S_{\phi}(M \mid x)\, \big\|\, P^{\text{prior}}_{\alpha}(M) \right) \right].$$

Also, recall that what we have to prove, i.e., Eq. 13, is as follows:

$$\nabla_{\phi,\theta}\, \mathcal{J}_{\text{elbo}}(\phi, \theta; \alpha) = \nabla_{\phi,\theta}\, \mathcal{J}_{\text{conv}}(\phi, \theta; \alpha) + \nabla_{\phi,\theta}\, \mathbb{E}_{x \sim P_{\text{obj}}(\cdot)}\!\left[ \mathcal{H}(S_{\phi}(M \mid x)) \right].$$

Proof of Eq. 13. First, $\mathcal{J}_{\text{elbo}}(\phi, \theta; \alpha)$ can be transformed as follows, where $\mathbb{E}_{x,m}$ abbreviates $\mathbb{E}_{x \sim P_{\text{obj}}(\cdot),\, m \sim S_{\phi}(\cdot \mid x)}$:

$$\begin{aligned}
\mathcal{J}_{\text{elbo}}(\phi, \theta; \alpha)
&= \mathbb{E}_{x \sim P_{\text{obj}}(\cdot)}\!\left[ \mathbb{E}_{m \sim S_{\phi}(\cdot \mid x)}\!\left[ \log R_{\theta}(x \mid m) \right] - D_{\mathrm{KL}}\!\left( S_{\phi}(M \mid x)\, \big\|\, P^{\text{prior}}_{\alpha}(M \mid \alpha) \right) \right] &&\text{(by the definition of $\mathcal{J}_{\text{elbo}}$)} \\
&= \mathbb{E}_{x \sim P_{\text{obj}}(\cdot)}\!\left[ \mathbb{E}_{m \sim S_{\phi}(\cdot \mid x)}\!\left[ \log R_{\theta}(x \mid m) \right] - \mathbb{E}_{m \sim S_{\phi}(\cdot \mid x)}\!\left[ \log S_{\phi}(m \mid x) - \log P^{\text{prior}}_{\alpha}(m \mid \alpha) \right] \right] &&\text{(by the definition of $D_{\mathrm{KL}}$)} \\
&= \mathbb{E}_{x,m}\!\left[ \log R_{\theta}(x \mid m) + \log P^{\text{prior}}_{\alpha}(m) \right] - \mathbb{E}_{x,m}\!\left[ \log S_{\phi}(m \mid x) \right] &&\text{(by the linearity of $\mathbb{E}$)} \\
&= \mathbb{E}_{x,m}\!\left[ \log R_{\theta}(x \mid m) + \log P^{\text{prior}}_{\alpha}(m) \right] + \mathbb{E}_{x \sim P_{\text{obj}}(\cdot)}\!\left[ \mathcal{H}(S_{\phi}(M \mid x)) \right] &&\text{(by the definition of $\mathcal{H}$)} \\
&= \mathbb{E}_{x,m}\!\left[ \log R_{\theta}(x \mid m) - \alpha|m| \right] + \mathbb{E}_{x \sim P_{\text{obj}}(\cdot)}\!\left[ \mathcal{H}(S_{\phi}(M \mid x)) \right] - \log Z_{\alpha} &&\text{(by the definition of $P^{\text{prior}}_{\alpha}$)} \\
&= \mathcal{J}_{\text{conv}}(\phi, \theta; \alpha) + \mathbb{E}_{x \sim P_{\text{obj}}(\cdot)}\!\left[ \mathcal{H}(S_{\phi}(M \mid x)) \right] - \log Z_{\alpha}. &&\text{(by the definition of $\mathcal{J}_{\text{conv}}$)}
\end{aligned}$$

Thus,

$$\begin{aligned}
\nabla_{\phi,\theta}\, \mathcal{J}_{\text{elbo}}(\phi, \theta; \alpha)
&= \nabla_{\phi,\theta}\, \mathcal{J}_{\text{conv}}(\phi, \theta; \alpha) + \nabla_{\phi,\theta}\, \mathbb{E}_{x \sim P_{\text{obj}}(\cdot)}\!\left[ \mathcal{H}(S_{\phi}(M \mid x)) \right] - \underbrace{\nabla_{\phi,\theta} \log Z_{\alpha}}_{=\,0} \\
&= \nabla_{\phi,\theta}\, \mathcal{J}_{\text{conv}}(\phi, \theta; \alpha) + \nabla_{\phi,\theta}\, \mathbb{E}_{x \sim P_{\text{obj}}(\cdot)}\!\left[ \mathcal{H}(S_{\phi}(M \mid x)) \right]. \qquad \blacksquare
\end{aligned}$$

B.3 Curious Case on Length Penalty

Remark 3 (Curious Case on Length Penalty).

When $\alpha > \gamma$, $P^{\text{prior}}_{\alpha}(M)$ converges to a monkey typing model $P^{\text{prior}}_{\text{monkey}}(M)$, a unigram model that generates a sequence until emitting eos:

$$\lim_{L_{\max} \to \infty} P^{\text{prior}}_{\alpha}(m) = P^{\text{prior}}_{\text{monkey}}(m), \tag{17}$$

$$P^{\text{prior}}_{\text{monkey}}(m_t \mid m_{1:t-1}) = P^{\text{prior}}_{\text{monkey}}(m_t) = \begin{cases} 1 - \exp(-\alpha + \gamma) & (m_t = \text{eos}) \\ \exp(-\alpha) & (m_t \neq \text{eos}), \end{cases}$$

where $\gamma := \log(|\mathcal{A}| - 1) \geq 0$. Intriguingly, this gives another interpretation of the length penalty. Though originally adopted to model the sender's laziness, the length penalty corresponds to the implicit prior distribution $P^{\text{prior}}_{\alpha}(m)$, whose special case is monkey typing, which totally ignores the context $m_{1:t-1}$. The length penalty may thus merely regularize the policy toward a length-insensitive one rather than toward a length-sensitive, lazy one.

Proof of Remark 3. On the one hand, for any message $m = a_1 \cdots a_{|m|} \in \mathcal{M}$, the monkey typing model can be transformed as follows:

$$\begin{aligned}
P^{\text{prior}}_{\text{monkey}}(m)
&= \prod_{t=1}^{|m|} P^{\text{prior}}_{\text{monkey}}(a_t \mid a_{1:t-1}) &&\text{(by definition)} \\
&= P^{\text{prior}}_{\text{monkey}}(\text{eos} \mid a_{1:|m|-1}) \prod_{t=1}^{|m|-1} P^{\text{prior}}_{\text{monkey}}(a_t \mid a_{1:t-1}) &&\text{(only the last symbol is eos)} \\
&= \left( 1 - (|\mathcal{A}| - 1)\exp(-\alpha) \right) \exp(-\alpha(|m| - 1)) &&\text{(by definition)} \\
&= \left( \exp(\alpha) - |\mathcal{A}| + 1 \right) \exp(-\alpha|m|).
\end{aligned}$$

On the other hand, from Remark 1, we have:

$$\lim_{L_{\max} \to \infty} Z_{\alpha} = \lim_{L_{\max} \to \infty} \frac{1 - \left\{ (|\mathcal{A}|-1)\exp(-\alpha) \right\}^{L_{\max}}}{\exp(\alpha) - |\mathcal{A}| + 1} = \frac{1}{\exp(\alpha) - |\mathcal{A}| + 1}$$

when $\alpha > \log(|\mathcal{A}| - 1)$. Therefore, we have:

$$\lim_{L_{\max} \to \infty} P^{\text{prior}}_{\alpha}(m \mid \alpha) = \lim_{L_{\max} \to \infty} \frac{\exp(-\alpha|m|)}{Z_{\alpha}} = \left( \exp(\alpha) - |\mathcal{A}| + 1 \right) \exp(-\alpha|m|) = P^{\text{prior}}_{\text{monkey}}(m)$$

when $\alpha > \log(|\mathcal{A}| - 1)$. $\blacksquare$
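As a numerical illustration of the closed form derived above (ours, with hypothetical parameter values), $P^{\text{prior}}_{\text{monkey}}(m) = (e^{\alpha} - |\mathcal{A}| + 1)\, e^{-\alpha|m|}$ indeed sums to 1 over all messages when $\alpha > \log(|\mathcal{A}| - 1)$:

```python
import math

def monkey_mass_up_to(alpha, A, L):
    """Total probability the monkey typing model assigns to messages of
    length <= L: sum over l of (A-1)^{l-1} * (e^alpha - A + 1) * e^{-alpha*l},
    since there are (A-1)^{l-1} messages of length l (the last symbol is eos)."""
    coef = math.exp(alpha) - A + 1
    return sum((A - 1) ** (l - 1) * coef * math.exp(-alpha * l)
               for l in range(1, L + 1))
```

With $\alpha = 2$ and $|\mathcal{A}| = 5$ (so $\alpha > \log 4$), the truncated mass increases monotonically toward 1 as the maximum length grows, confirming that the limit distribution is properly normalized.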

Remark 4.

In fact, it has been shown (Miller, 1957; Conrad & Mitzenmacher, 2004) that even the monkey typing model $P^{\text{prior}}_{\text{monkey}}(m)$ follows ZLA.

Appendix C: Supplemental Information on Experimental Setup

Sender Architecture: The sender $S_{\phi}$ is based on a GRU (Cho et al., 2014) with hidden states of size 512 and has embedding layers converting symbols into 32-dimensional vectors. In addition, LayerNorm (Ba et al., 2016) is applied to the hidden states and embeddings for faster convergence. For $t \geq 0$,

$$\boldsymbol{e}_{\phi,t} = \begin{cases} \phi_{\text{bos}} & (t = 0) \\ \operatorname{Embedding}(m_t;\, \phi_{\text{s-emb}}) & (t > 0), \end{cases} \tag{18}$$

$$\boldsymbol{e}^{\mathrm{LN}}_{\phi,t} = \operatorname{LayerNorm}(\boldsymbol{e}_{\phi,t}),$$

$$\boldsymbol{h}_{\phi,t} = \begin{cases} \operatorname{Encoder}(x;\, \phi_{\text{o-emb}}) & (t = 0) \\ \operatorname{GRU}(\boldsymbol{e}^{\mathrm{LN}}_{\phi,t-1}, \boldsymbol{h}^{\mathrm{LN}}_{\phi,t-1};\, \phi_{\text{gru}}) & (t > 0), \end{cases}$$

$$\boldsymbol{h}^{\mathrm{LN}}_{\phi,t} = \operatorname{LayerNorm}(\boldsymbol{h}_{\phi,t}),$$

where $\phi_{\text{bos}}, \phi_{\text{s-emb}}, \phi_{\text{o-emb}}, \phi_{\text{gru}} \in \phi$. The object encoder is represented as the additive composition of the embedding layers of attributes. The sender $S_{\phi}$ also has a linear layer (with bias) to predict the next symbol. For $t \geq 0$,

$$S_{\phi}(A_{t+1} \mid m_{1:t}) = \operatorname{Softmax}(\boldsymbol{W}_S \boldsymbol{h}^{\mathrm{LN}}_{\phi,t} + \boldsymbol{b}_S), \tag{19}$$

where $\boldsymbol{W}_S, \boldsymbol{b}_S \in \phi$.
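To make the recurrence in Eq. (18)–(19) concrete, here is a pure-Python sketch of one forward pass of such a sender with toy dimensions and random weights (our illustration, not the authors' code; the GRU follows the standard gate equations of Cho et al. (2014), with biases omitted for brevity):

```python
import math, random

random.seed(0)

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def layer_norm(x, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def sigmoid(v):
    return 1 / (1 + math.exp(-v))

def gru_cell(e, h, Wz, Uz, Wr, Ur, Wh, Uh):
    z = [sigmoid(a + b) for a, b in zip(matvec(Wz, e), matvec(Uz, h))]  # update gate
    r = [sigmoid(a + b) for a, b in zip(matvec(Wr, e), matvec(Ur, h))]  # reset gate
    rh = [ri * hi for ri, hi in zip(r, h)]
    h_tilde = [math.tanh(a + b) for a, b in zip(matvec(Wh, e), matvec(Uh, rh))]
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, h_tilde)]

def softmax(x):
    mx = max(x)
    exps = [math.exp(v - mx) for v in x]
    s = sum(exps)
    return [v / s for v in exps]

def rand_mat(rows, cols, scale=0.1):
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

# Toy sizes: |A| = 4 symbols, 4-dim embeddings, 6-dim hidden states
# (the paper uses 32-dim embeddings and 512-dim hidden states).
A, d_emb, d_hid = 4, 4, 6
emb = rand_mat(A, d_emb, 1.0)                        # symbol embedding table
gru_w = [rand_mat(d_hid, d) for d in (d_emb, d_hid) * 3]
W_S, b_S = rand_mat(A, d_hid), [0.0] * A             # output layer (Eq. 19)

h = layer_norm([random.gauss(0, 1) for _ in range(d_hid)])  # stand-in for Encoder(x)
probs = []
for m_t in [1, 3, 0]:                                # some message prefix
    p = softmax([wi + bi for wi, bi in zip(matvec(W_S, h), b_S)])
    probs.append(p)
    e = layer_norm(emb[m_t])                         # Eq. (18) + LayerNorm
    h = layer_norm(gru_cell(e, h, *gru_w))
```

Each step normalizes the current embedding and hidden state and emits a Softmax distribution over the $|\mathcal{A}|$ symbols, exactly the shape of Eq. (19).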

Receiver Architecture: The receiver $R_{\theta}$ has an architecture similar to the sender's: it is based on a GRU (Cho et al., 2014) with hidden states of size 512 and has embedding layers converting symbols into 32-dimensional vectors. In addition, LayerNorm (Ba et al., 2016) and Gaussian dropout (Srivastava et al., 2014) are applied to the hidden states and embeddings. The dropout scale is set to 0.001. For $t \geq 0$,

$$\boldsymbol{e}_{\theta,t} = \begin{cases} \theta_{\text{bos}} & (t = 0) \\ \operatorname{Embedding}(m_t;\, \theta_{\text{s-emb}}) & (t > 0), \end{cases} \tag{20}$$

$$\boldsymbol{e}^{\mathrm{LN}}_{\theta,t} = \operatorname{LayerNorm}(\boldsymbol{e}_{\theta,t} \odot \boldsymbol{d}_e),$$

$$\boldsymbol{h}_{\theta,t} = \begin{cases} \boldsymbol{0} & (t = 0) \\ \operatorname{GRU}(\boldsymbol{e}^{\mathrm{LN}}_{\theta,t-1}, \boldsymbol{h}^{\mathrm{LN}}_{\theta,t-1};\, \theta_{\text{gru}}) & (t > 0), \end{cases}$$

$$\boldsymbol{h}^{\mathrm{LN}}_{\theta,t} = \operatorname{LayerNorm}(\boldsymbol{h}_{\theta,t} \odot \boldsymbol{d}_h),$$

where $\theta_{\text{bos}}, \theta_{\text{s-emb}}, \theta_{\text{o-emb}}, \theta_{\text{gru}} \in \theta$, $\odot$ is the Hadamard product, and $\boldsymbol{d}_e, \boldsymbol{d}_h$ are dropout masks. Note that the dropout masks are fixed over time $t$, following Gal & Ghahramani (2016). The receiver $R_{\theta}$ also has a linear layer (with bias) to predict the next symbol. For $t \geq 0$,

$$R_{\theta}(A_{t+1} \mid m_{1:t}) = \operatorname{Softmax}(\boldsymbol{W}_R \boldsymbol{h}^{\mathrm{LN}}_{\theta,t} + \boldsymbol{b}_R), \tag{21}$$

where $\boldsymbol{W}_R, \boldsymbol{b}_R \in \theta$.

Optimization: We have to optimize an ELBO whose latent variables are discrete sequences, similarly to Mnih & Gregor (2014). By the stochastic computation graph approach (Schulman et al., 2015), we obtain the following surrogate objective:

$$\begin{aligned}
\mathcal{J}_{\text{surrogate}}(\phi, \theta) :=\; & \log R_{\theta}(x \mid m) + \beta \sum_{t=1}^{|m|} \log P^{\text{prior}}_{\theta}(m_t \mid m_{1:t-1}) \\
& + \sum_{t=1}^{|m|} \operatorname{StopGrad}\!\left( C_{\phi,\theta,t} - B_{\zeta}(x, m_{1:t-1}) - b_{\text{mean}} \right) \log S_{\phi}(m_t \mid x, m_{1:t-1}),
\end{aligned} \tag{22}$$

where $\operatorname{StopGrad}(\cdot)$ is the stop-gradient operator, $B_{\zeta}(x, m_{1:t-1})$ is a state-dependent baseline parametrized by $\zeta$, $b_{\text{mean}}$ is a (state-independent) mean baseline, and

$$C_{\phi,\theta,t} = \log R_{\theta}(x \mid m) + \beta \sum_{u=t}^{|m|} \left( \log P^{\text{prior}}_{\theta}(m_u \mid m_{1:u-1}) - \log S_{\phi}(m_u \mid x, m_{1:u-1}) \right).$$

The architecture of the state-dependent baseline $B_{\zeta}(x, m_{1:t-1})$ is defined as:

$$B_{\zeta}(x, m_{1:t-1}) = B_{\zeta}(\boldsymbol{h}^{\mathrm{LN}}_{\phi,t}) = \boldsymbol{w}_{B,2}^{\top} \operatorname{LeakyReLU}\!\left( \boldsymbol{W}_{B,1}\, \operatorname{StopGrad}(\boldsymbol{h}^{\mathrm{LN}}_{\phi,t}) + \boldsymbol{b}_{B,1} \right) + b_{B,2},$$

where $\boldsymbol{W}_{B,1} \in \mathbb{R}^{256 \times 512}$, $\boldsymbol{b}_{B,1} \in \mathbb{R}^{256}$, $\boldsymbol{w}_{B,2} \in \mathbb{R}^{256}$, and $b_{B,2} \in \mathbb{R}$ are learnable parameters in $\zeta$. We optimize the baseline's parameters $\zeta$ by simultaneously minimizing the sum of squared errors:

$$\mathcal{J}_{\text{baseline}}(\zeta) := \sum_{t=1}^{|m|} \left( C_{\phi,\theta,t} - B_{\zeta}(x, m_{1:t-1}) \right)^2. \tag{23}$$

Note that we optimize the sender's parameters $\phi$ according to $\mathcal{J}_{\text{surrogate}}$ but not $\mathcal{J}_{\text{baseline}}$, so as not to break the original problem setting of ELBO maximization w.r.t. $\phi, \theta$. In addition, to further reduce the variance, we use the (state-independent) mean baseline $b_{\text{mean}}$, which is the batch mean of $C_{\phi,\theta,t} - B_{\zeta}(x, m_{1:t-1})$. We use Adam (Kingma & Ba, 2015) as the optimizer. The learning rate is set to $10^{-4}$, the batch size to 8192, and the parameters are updated 20000 times for each run. To avoid posterior collapse, $\beta$ is initially set $\ll 1$ and annealed toward 1 via the REWO algorithm (Klushyn et al., 2019). We set the constraint parameter of REWO to 0.3, which intuitively means that it tries to bring $\beta$ closer to 1 while trying to keep the exponential moving average of the reconstruction error $-\log R_{\theta}(x \mid m)$ below 0.3.

Appendix D: Supplemental Information on Experimental Results

Figure 3: Results for $n_{\text{bou}}$ (C1), $n_{\text{seg}}$ (C2), $\Delta_{w,c}$ (C3), C-TopSim (C3), and W-TopSim (C3) are shown in order from the left. The x-axis represents $(n_{\text{att}}, n_{\text{val}})$, while the y-axis represents the values of each metric. The shaded regions and error bars represent the standard error of the mean. The threshold parameter is set to 0.25. The blue plots represent the results for our ELBO-based objective $\mathcal{J}_{\text{ours}}$, the orange ones for (BL1) the conventional objective $\mathcal{J}_{\text{conv}}$ plus the entropy regularizer, and the grey ones for (8) the ELBO-based objective whose prior is $P^{\text{prior}}_{\alpha}$. The apparently inferior performance of $\Delta_{w,c}$ for $\mathcal{J}_{\text{ours}}$ compared to the baselines might be misleading: $\mathcal{J}_{\text{ours}}$ greatly improves both C-TopSim and W-TopSim, and the larger scale of these improvements can result in a seemingly worse $\Delta_{w,c}$ without indicating poorer performance.

Figure 4: Same as Figure 3, but with the threshold parameter set to 0.5.

We show the supplemental results in Figure 3 and Figure 4, in which $n_{\text{bou}}$ (C1), $n_{\text{seg}}$ (C2), $\Delta_{w,c}$ (C3), C-TopSim (C3), and W-TopSim (C3) are shown in order from the left. While Figure 2 in the main content shows the results for threshold $= 0$, Figure 3 shows the results for threshold $= 0.25$ and Figure 4 for threshold $= 0.5$. In Figure 4, the blue line plotted for $n_{\text{seg}}$ does not decrease monotonically, i.e., C2 is somewhat violated when using our objective $\mathcal{J}_{\text{ours}}$ as well as the baseline objectives. This implies that the threshold parameter 0.5 is not appropriate in our case.

Appendix E: Supplemental Experiment on Zipf's Law of Abbreviation

E.1 Setup

Object and Message Space: We set $\mathcal{X} := \{1, \ldots, 1000\}$, $P_{\text{obj}}(x) \propto x^{-1}$, $L_{\max} = 30$, and $|\mathcal{A}| \in \{5, 10, 20, 30, 40\}$, following Chaabouni et al. (2019).

Agent Architectures and Optimization: The architectures of the sender $S_{\phi}$ and the receiver $R_{\theta}$ are the same as described in Section 4.1, as are the optimization method and hyperparameters.

E.2 Results and Discussion

Figure 5: Mean message length sorted by objects' frequency across 32 random seeds. A moving average with a window size of 10 is shown for readability.

The experimental results are shown in Figure 5. As can be seen from the figure, there is a tendency for shorter messages to be assigned to high-frequency objects.

It is also observed that the larger the alphabet size $|\mathcal{A}|$, the longer the messages tend to be. This is probably because, as $|\mathcal{A}|$ increases, the probability of eos becomes smaller in the early stages of training.

As an intuitive analogy, let us assume that the sender $S_{\phi}(M \mid X)$ and the prior message distribution $P^{\text{prior}}_{\theta}(M)$ roughly resemble uniform monkey typing (UMT) in the early stages of training:

$$S_{\phi}(m_t \mid x, m_{1:t-1}) \approx \frac{1}{|\mathcal{A}|}, \qquad P^{\text{prior}}_{\theta}(m_t \mid m_{1:t-1}) \approx \frac{1}{|\mathcal{A}|}. \tag{24}$$

The length distribution of UMT is a geometric distribution with parameter $p = |\mathcal{A}|^{-1}$ (imagine rolling a die with "eos" written on one face until that face comes up). The expected message length of UMT is then $p^{-1} = |\mathcal{A}|$. Therefore, as the alphabet size $|\mathcal{A}|$ increases, the sender $S_{\phi}(M \mid X)$ and the prior message distribution $P^{\text{prior}}_{\theta}(M)$ are inclined to prefer longer messages in the early stages. Consequently, they are more likely to converge to longer messages. Designing neural network architectures that maintain a consistent scale for eos regardless of the alphabet size is a challenge for future work.
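The expected-length claim for uniform monkey typing can be checked directly from the geometric distribution (a small illustrative script of ours):

```python
def umt_expected_length(A, L=100_000):
    """Expected message length under uniform monkey typing:
    P(length = l) = (1 - p)^(l-1) * p with eos probability p = 1/A,
    so E[length] = 1/p = A. Approximated here by a truncated sum."""
    p = 1.0 / A
    return sum(l * (1 - p) ** (l - 1) * p for l in range(1, L + 1))
```

For each alphabet size used in this experiment, the truncated sum recovers $|\mathcal{A}|$ almost exactly, matching the intuition that larger alphabets make early-training messages longer.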

Appendix F: Supplemental Information for Related Work

We show the characteristics of this paper and related EC work in several respects to clarify the positioning of this paper. We mean by "Generative?" whether the formulation is based on a generative perspective such as VAE or VIB; by "Variable message length?" whether message length in their settings is variable; by "Compositionality?" whether they study the compositionality (e.g., TopSim) of emergent languages; by "ZLA?" whether they investigate if emergent languages follow ZLA; by "HAS?" whether they investigate if emergent languages follow HAS; and by "Prior?" whether they are aware of the existence of prior distributions and carefully choose an appropriate one. A check mark (✓) indicates yes, and a blank indicates no. (The column positions of the check marks were partially lost in conversion; the assignment below follows the grouping of rows.)

| Paper | Generative? | Variable message length? | Compositionality? | ZLA? | HAS? | Prior? |
|---|---|---|---|---|---|---|
| Chaabouni et al. (2021); Tucker et al. (2022) | ✓ | | | | | |
| Chaabouni et al. (2019); Rita et al. (2020); Ueda & Washio (2021) | | ✓ | | ✓ | | |
| Andreas (2019); Li & Bowling (2019); Chaabouni et al. (2020); Ren et al. (2020) | | | ✓ | | | |
| Resnick et al. (2020) | | ✓ | ✓ | | | |
| Taniguchi et al. (2022); Inukai et al. (2023) | ✓ | | | | | |
| Ueda et al. (2023) | | ✓ | ✓ | | ✓ | |
| Ours | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
