Title: Algorithmic Collective Action in Machine Learning

URL Source: https://arxiv.org/html/2302.04262

License: CC BY-NC-ND 4.0

arXiv:2302.04262v3 [cs.LG] 07 Aug 2024

Algorithmic Collective Action in Machine Learning

Moritz Hardt, Eric Mazumdar, Celestine Mendler-Dünner, Tijana Zrnic

Abstract

We initiate a principled study of algorithmic collective action on digital platforms that deploy machine learning algorithms. We propose a simple theoretical model of a collective interacting with a firm’s learning algorithm. The collective pools the data of participating individuals and executes an algorithmic strategy by instructing participants how to modify their own data to achieve a collective goal. We investigate the consequences of this model in three fundamental learning-theoretic settings: nonparametric optimal learning, parametric risk minimization, and gradient-based optimization. In each setting, we come up with coordinated algorithmic strategies and characterize natural success criteria as a function of the collective’s size. Complementing our theory, we conduct systematic experiments on a skill classification task involving tens of thousands of resumes from a gig platform for freelancers. Through more than two thousand model training runs of a BERT-like language model, we see a striking correspondence emerge between our empirical observations and the predictions made by our theory. Taken together, our theory and experiments broadly support the conclusion that algorithmic collectives of exceedingly small fractional size can exert significant control over a platform’s learning algorithm.

Keywords: collective action, online platforms

1 Introduction

Throughout the gig economy, numerous digital platforms algorithmically profile, control, and discipline workers that offer on-demand services to consumers. Data collection and predictive modeling are critical for a typical platform’s business as machine learning algorithms power ranking, scoring, and classification tasks of various kinds (woodcock2019gig; gray2019ghost; schor2021after).

Troves of academic scholarship document the emergence and preponderance of precarity in the gig economy. wood2019good argue that platform-based algorithmic control can lead to “low pay, social isolation, working unsocial and irregular hours, overwork, sleep deprivation and exhaustion.” This is further exacerbated by “high levels of inter-worker competition with few labor protections and a global oversupply of labor relative to demand.” In response, there have been numerous attempts by gig workers to organize in an effort to reconfigure working conditions. A growing repertoire of strategies, as vast as it is eclectic, uses both physical and digital means towards this goal. Indeed, workers have shown significant ingenuity in creating platform-specific infrastructure, such as their own mobile apps, to organize the labor side of the platform (chen2018thrown; rahman2021invisible). Yet, “the upsurge of worker mobilization should not blind us to the difficulties of organizing such a diverse and spatially dispersed labor force.” (vallas2020platforms)

Beyond the gig economy, evidence of consumers seeking to influence the algorithms that power a platform’s business is abundant. Examples include social media users attempting to suppress the algorithmic upvoting of harmful content by sharing screenshots rather than original posts (twitter), or individuals creating bots to influence crowd-sourced navigation systems (sinai14bots). The ubiquity of such strategic attempts calls for a principled study of how coordinated groups can wield control over the digital platforms to which they contribute data.

In this work, we study how a collective of individuals can algorithmically strategize against a learning platform. We envision a collective that pools the data of participating individuals and executes an algorithmic strategy by instructing participants how to modify their own data. The firm in turn solves a machine learning problem over the resulting data. The goal of the collective is to redirect the firm’s optimization towards a solution that serves the collective. Notably, coordination is a crucial lever. When data are plentiful, a single individual lacks the leverage to unilaterally change the output of a learning algorithm; in contrast, we show that even small collectives can exert substantial influence.

1.1 Our Contribution

We initiate a principled study of algorithmic collective action in digital platforms that deploy machine learning algorithms. We propose a simple theoretical model of a collective interacting with a firm’s learning algorithm. The size of the collective is specified by a value α > 0 that corresponds to the fraction of participating individuals in a population drawn from a base distribution 𝒫0. The firm observes the mixture distribution

𝒫 = α 𝒫* + (1 − α) 𝒫0,

where 𝒫* depends on the strategy of the collective, and the firm runs a learning algorithm 𝒜 on 𝒫.

We investigate the consequences of this model in three fundamental learning-theoretic settings, supported by a systematic empirical evaluation on a real resume classification task from a gig platform. In each setting, we come up with algorithmic collective strategies and characterize different success criteria as a function of the collective’s size  𝛼 . We exhibit critical thresholds for the value of  𝛼 at which the collective succeeds. Our theory and experiments support the conclusion that collectives of vanishing fractional size can exert significant control over the firm’s learning algorithm.

Classification. In line with economic theory, we start with the case of an optimal firm that has full knowledge of the distribution 𝒫. The firm chooses the Bayes optimal predictor f = 𝒜(𝒫) on the distribution 𝒫. In the context of classification, a natural class of objectives for the collective is to correlate a signal g(·) with a target label y*: an individual described by data point (x, y) succeeds if f(g(x)) = y* at test time. For this goal, a natural strategy is to perturb each participating data point (x, y) by applying the signal g(·) to x and switching the label from y to y* at training time. That is, 𝒫* is the distribution of (g(x), y*) for a random draw of a labeled data point (x, y) from 𝒫0. We prove that a collective of vanishing fractional size succeeds with high probability by implementing this strategy, provided that the signal g(x) is unlikely to be encountered in the base distribution 𝒫0. The success probability is maximized for an optimal classifier and deteriorates gracefully with the suboptimality of the classifier.

In practice, the signal  𝑔 ⁢ ( 𝑥 ) may correspond to adding a hidden watermark in image and video content, or subtle syntactic changes in text. It is reasonable to assume that individuals are indifferent to such inconsequential changes to their features. In fact, conventional wisdom in machine learning has it that such hidden signals are easy to come by in practice. The ability to change the label of an instance, however, is a more strenuous requirement. We therefore propose a variant of the strategy where a participating individual does not need to modify the label of their data point. We show that this strategy still succeeds, while quantifying precisely how it diminishes the collective’s success rate as a function of its size and a key parameter of 𝒫 0 .

We provide analogous results when the collective’s goal is for the firm’s predictions to ignore some subset of the feature information. Given a map g(x), the collective succeeds if f(x) = f(g(x)). Here, g(x) is a summary of x that, for example, removes private or sensitive information in x.

Experimental evaluation. We conduct extensive experiments on a dataset of almost thirty thousand resumes from a gig platform for freelancers. The machine learning task is to tag resumes with a set of ten skills related to work in the IT sector. Through more than two thousand model training runs involving a BERT-like language model, we investigate the predictions made by our theory. What emerges is a striking correspondence between our empirical findings and the theory. The ability to modify both resume and label leads to a near-perfect success rate at collective sizes of only a fraction of a percent of the population, corresponding to fewer than one hundred modified resumes. The label-free strategy still succeeds reliably, albeit at a higher threshold, corresponding to a few hundred resumes. The more well-trained the model is, the lower the threshold for the collective’s success. The placement pattern of the signal in the resume is largely irrelevant, so long as the token we plant is unique within the corpus of resumes.

Our theory predicts that collectives, in a certain precise sense, must compete with the strongest alternative signal in the data. The weaker the alternative signals, the lower the threshold for success. To confirm this hypothesis experimentally, we randomize the labels of a random fraction of the data. Confirming our theory, we observe that increasing the fraction of randomized labels, and hence diminishing the strength of alternative signals, indeed lowers the threshold for success of the collective. This observation suggests a blessing of dimensionality: if the data contains many weak signals, as high-dimensional data tends to, algorithmic collectives are especially powerful.

Risk minimization and gradient descent. Generalizing beyond classification, we consider the power of collectives in convex risk minimization and gradient-based learning with possibly nonconvex objectives. In the first case, we show that collectives can force the firm to deploy any model with small suboptimality on 𝒫 0 of the collective’s choosing. In the second case, we show that given a more powerful interaction model the collective can influence the firm to deploy any desired model even under nonconvexity, as long as the path from the initial model, computed on 𝒫 0 , to the desired model does not traverse large gradients when evaluated on 𝒫 0 . Moreover, despite the nonconvexity, convergence to the target is achieved at a convex rate. In both problem settings, the analyzed collective strategies rely on exerting influence on the gradients computed by the firm.

1.2 Related Work

Our approach to algorithmic collective action is decidedly not adversarial. Instead, the strategic manipulations arise through a misalignment of the firm’s and the individuals’ objectives. Individuals legitimately optimize their utility through data sharing and coordination. Yet, at a technical level our results relate to topics studied under the umbrella of adversarial machine learning. Most closely related is the line of work on data poisoning attacks that seeks to understand how data points can be adversarially “poisoned” to degrade the performance of a predictive model at test time. We refer to recent surveys for an overview of data poisoning attacks (zhiyi22survey), and backdoor attacks more specifically (guo22backdoor). While the literature on poisoning attacks focuses predominantly on diminishing the performance of the learning algorithm, documented empirical successes (cherepanova2021lowkey; geiping2021witches) hint at the impact that algorithmic collective action can have on deep learning models. Despite the increasing number of studies on backdoor attacks and defenses, theoretical work explaining how underlying factors affect the success of backdoor attacks has been limited (grosse22demystify). We point out two recent works that study the relationship between the number of manipulated data points and the success probability of the attack. manoj2021excess show that a fixed number of points is sufficient for backdoor attacks to succeed in binary classification problems, provided that the memorization capacity of the model is sufficiently large. cina21learningcurves empirically investigate backdoor learning curves across many image recognition tasks and they observe curves with diminishing returns, similar in shape to those in our experiments. 
Our analysis of Bayes optimal classification provides a new, complementary theoretical perspective and sheds light on the effectiveness of practical data manipulation strategies in a way that is surprisingly predictive of our empirical observations and previous empirical results in adversarial machine learning.

Our analysis of risk minimization is reminiscent of model-targeted attacks (suya21model; farhadkhani22grad), which aim to bias the learner towards a target model. Our gradient-control strategy resembles the counter-gradient attack in (farhadkhani22grad). While the insights from these prior works are valuable to inform the feasibility of collective action in convex risk minimization, our work differs from these existing studies in its focus on the role of collective size and the analysis of nonconvex losses. Only a handful of works in the adversarial machine learning literature have questioned the institution-centric perspective of the community and discussed the political dimension of adversarial machine learning (Albert2020PoliticsOA; vincent21leverage; kendra21sop). In this context, a recent line of work considers the socially beneficial use of adversarial learning techniques (delobelle21ethical; shan20fawkes; abebe2022adversarial; kulynych20pot; li2022untargeted).

Our work can also be viewed as a conceptual reversal of strategic classification (hardt16strategic). In strategic classification, a firm anticipates the optimal response of a strategic individual to a decision rule. Instead, we consider individuals that strategically anticipate the optimizing behavior of the firm, something recently considered by zrinc21follow. Furthermore, our work is conceptually different from strategic classification in its focus on the role and objectives of workers and consumers on online platforms rather than the firm. Another crucial departure from the set of problems considered in strategic classification is our emphasis on collective rather than individual strategies.

The idea of collective action on digital platforms has also been previously studied. elliot21recourse show how algorithmic recourse can be improved through coordination. vincent2019data examine the effectiveness of data strikes. Extending this work to the notion of data leverage, vincent21leverage describe various ways of “reducing, stopping, redirecting, or otherwise manipulating data contributions” for different purposes. See also vincent21CDC. Our work provides a theoretical framework for understanding the effectiveness of such strategies, as well as studying more complex algorithmic strategies that collectives may deploy.

Appendix A continues our discussion of related work.

2 Problem Formulation

We study the strategic interaction of a firm operating a predictive system with a population of individuals. We assume that the firm deploys a learning algorithm 𝒜 that operates on data points in a universe 𝒵 = 𝒳 × 𝒴. Each individual corresponds to a single data point z ∈ 𝒵, typically a feature–label pair. We model the population of individual participants as a distribution 𝒫0 over 𝒵.

We say that a fraction α > 0 of the individuals form a collective in order to strategically respond to the firm’s learning behavior. The collective agrees on a potentially randomized strategy h : 𝒵 → 𝒵 from a space of available strategies ℋ. The space ℋ captures feasible changes to the data. For example, content creators on a video streaming platform may be indifferent between videos that differ only in a hidden watermark not visible to human viewers. Freelancers may be indifferent between two resumes that differ only in inconsequential syntactic details.

The firm therefore observes a mixture distribution

𝒫 = α 𝒫* + (1 − α) 𝒫0,

where we use 𝒫* to denote the distribution of h(z) for z ∼ 𝒫0.
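As a minimal illustration (not from the paper; the strategy h and all constants are hypothetical), the mixture can be simulated directly: each individual independently joins the collective with probability α, and members apply h to their own data point before the firm collects the sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(z):
    """Hypothetical strategy: plant a feature shift and a target label."""
    x, y = z
    return (x + 5.0, 1)

n, alpha = 10_000, 0.02            # population size, collective fraction
xs = rng.normal(size=n)            # toy base distribution P0 over features
ys = (xs > 0).astype(int)          # toy base labels
member = rng.random(n) < alpha     # who participates in the collective

# The firm's training sample is an i.i.d. draw from P = alpha*P* + (1-alpha)*P0.
data = [h((x, y)) if m else (x, y) for x, y, m in zip(xs, ys, member)]
frac = member.mean()               # empirical collective size, close to alpha
```

A single individual flipping its own point changes one of n samples; the collective changes a constant fraction, which is what gives it leverage over 𝒜.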

The collective strives to choose a strategy h so as to maximize a measure of success over the solution f = 𝒜(𝒫) chosen by the firm. Here, f is a mapping from features to labels, f : 𝒳 → 𝒴. Given a strategy, we use S(α) to denote the level of success achieved by a collective of size α. The central question we study is how the success S(α) grows as a function of the collective size α, and how large α needs to be in order to achieve a target success level.

Definition 2.1 (Critical mass).

The critical mass for a target success level S* is defined as the smallest α for which there exists a strategy such that S(α) ≥ S*.

Note that, although motivated from the perspective of labor, our formal model can also serve as a basis for studying collective action on the consumer side of digital platforms. Before presenting our results we briefly discuss why we focus on collective strategies in this paper.

Why collective action? By engaging in collective action, individuals can exert influence on the learning algorithm that they could not achieve by acting selfishly. In large-population settings such as online platforms, an individual contributes an infinitesimal fraction of the data used by the learning algorithm. Thus, under reasonable manipulation constraints, individual behavior is largely powerless in systematically changing the deployed model. Instead, individuals are limited to simple adversarial attacks or individual strategies that do not have lasting effects on the learning outcome. By coordinating individuals, however, collectives can wield enough power to steer learning algorithms towards desired goals. In subsequent sections we show that collectives can often do so while only representing a small fraction of the training data.

3 Collective Action in Classification

We start with classification under the assumption that the firm chooses an approximately optimal classifier on the data distribution  𝒫 .

Definition 3.1 (ε-optimal classifier).

A classifier f : 𝒳 → 𝒴 is ε-optimal under the distribution 𝒫 if there exists a 𝒫′ with TV(𝒫, 𝒫′) ≤ ε such that

f(x) ∈ argmax_{y ∈ 𝒴} 𝒫′(y | x).

Note that a 0-optimal classifier is the Bayes optimal classifier with respect to the zero–one loss.

Under the above assumption, we focus on two general goals for the collective: planting a signal and erasing a signal.

3.1 Planting a Signal

Assume the collective wants the classifier to learn an association between an altered version of the features g(x) and a chosen target class y*. Formally, given a transformation g : 𝒳 → 𝒳, the collective wants to maximize the following measure of success:

S(α) = Pr_{x∼𝒫0}{ f(g(x)) = y* }.

We call this objective “planting a signal” and 𝒳* = { g(x) : x ∈ 𝒳 } the signal set. For example, g(x) could be instance x with an inconsequential trigger (such as a video with an imperceptible watermark or a resume with unique formatting) and y* could be a label indicating that the instance is of high quality (e.g., a high-quality video or a highly qualified individual). As another example, the collective may have an altruistic goal of helping individuals in a vulnerable subpopulation 𝒳0 ⊆ 𝒳 achieve a desired outcome y*. In this case, g(x) could map x to a randomly chosen instance in 𝒳0.

We provide natural strategies for planting a signal and characterize their success as a function of 𝛼 . The key parameter that we identify as driving success is the uniqueness of the signal.

Definition 3.2 (ξ-unique signal).

We say that a signal is ξ-unique if it satisfies 𝒫0(𝒳*) ≤ ξ.

In addition, success naturally depends on how suboptimal y* is on the signal set under the base distribution. To formalize this dependence, we define the suboptimality gap of y* as

Δ = max_{x ∈ 𝒳*} ( max_{y ∈ 𝒴} 𝒫0(y | x) − 𝒫0(y* | x) ).

We consider two possibilities for the space of available strategies ℋ . First, we assume that the individuals can modify both features and labels. We call the resulting strategies feature–label strategies. Modifying features by, say, planting a trigger often comes at virtually no cost. Changing the label, however, may be hard, costly, or even infeasible. This is why we also consider feature-only strategies; such strategies only allow changes to features.

Feature–label signal strategy. We define the feature–label signal strategy as

h(x, y) = (g(x), y*).   (1)
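As a concrete sketch of strategy (1) on text data (the helper names and the trigger token are hypothetical, not from the paper), the strategy amounts to two lines of preprocessing on the collective’s side:

```python
TRIGGER = "zq7token"   # hypothetical trigger token, assumed unique in the corpus

def g(x: str) -> str:
    """Plant the signal: append an inconsequential trigger to the features."""
    return f"{x} {TRIGGER}"

def h_feature_label(x: str, y: int, y_star: int) -> tuple[str, int]:
    """Feature-label signal strategy (1): h(x, y) = (g(x), y*)."""
    return g(x), y_star

resumes = [("senior python developer", 0), ("java backend engineer", 0)]
y_star = 1   # target label the collective wants associated with the signal
planted = [h_feature_label(x, y, y_star) for x, y in resumes]
```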

The result below quantifies the success of this strategy in terms of the collective size and the uniqueness of the signal.

Theorem 3.3.

Consider the feature–label signal strategy and suppose that the signal is ξ-unique. Then, the success against an ε-optimal classifier is lower bounded by

S(α) ≥ 1 − (1 − α)/α · Δ·ξ − ε/(1 − ε).

Rearranging the terms, we obtain an upper bound on the critical mass given a desired success probability (e.g., 90 % ).

Corollary 3.4.

Suppose the signal is ξ-unique. Then, the critical mass for achieving success S* ∈ (0, 1) with feature–label strategies against an ε-optimal classifier is bounded by

α* ≤ Δ·ξ / ( 1 − S* − ε/(1 − ε) + Δ·ξ ).   (2)

Therefore, in order to achieve success it suffices to have a collective size proportional to the uniqueness of the signal and the suboptimality of y* on the signal set, as long as these parameters are sufficiently small relative to the target error rate 1 − S*. This suggests that planting signals that are exceedingly “rare” under the base distribution can be done successfully by small collectives, a property of feature–label strategies that we empirically validate in Section 5.
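To make the scaling concrete, here is a small helper (a sketch, evaluating bound (2) with the symbols defined above; the numeric inputs are illustrative):

```python
def critical_mass_feature_label(xi: float, delta: float, s_star: float,
                                eps: float = 0.0) -> float:
    """Critical-mass bound (2): alpha* <= Delta*xi / (1 - S* - eps/(1-eps) + Delta*xi)."""
    slack = 1.0 - s_star - eps / (1.0 - eps)
    if slack <= 0:
        raise ValueError("target success is infeasible for this epsilon")
    return delta * xi / (slack + delta * xi)

# A very unique signal (xi = 1e-4), Delta bounded by 1, 90% target success:
alpha_star = critical_mass_feature_label(xi=1e-4, delta=1.0, s_star=0.9)
```

With these illustrative numbers the bound lies below a tenth of a percent of the population, and shrinking ξ shrinks the required collective roughly proportionally.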

Figure 1: Illustration of the success rate predicted by Theorem 3.3. In the first panel we fix ε = 0 and vary ξ; in the second we fix ξ and vary the classifier’s suboptimality ε. We upper bound Δ by one.

In the next result we consider feature-only strategies. An impediment to the success of such strategies is a situation where the base distribution carries an overwhelmingly strong signal for a label y ≠ y*, leaving little label uncertainty. For this reason we make one additional assumption: there exists a number p > 0 such that 𝒫0(y* | x) ≥ p for all x ∈ 𝒳.

Feature-only signal strategy. We define the feature-only signal strategy as

h(x, y) = (g(x), y*)  if y = y*;  h(x, y) = (x, y)  otherwise.   (3)
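A minimal sketch of strategy (3) (hypothetical helper names; g is assumed given): participants whose label already equals y* plant the signal in their features, and everyone else leaves their data untouched, so no labels are ever modified.

```python
def h_feature_only(x: str, y: int, y_star: int, g) -> tuple[str, int]:
    """Feature-only signal strategy (3): plant the signal only when y = y*,
    leave the point unchanged otherwise; labels are never modified."""
    return (g(x), y) if y == y_star else (x, y)

g = lambda x: x + " [signal]"    # placeholder signal transformation
data = [("dev resume", 1), ("designer resume", 0)]
out = [h_feature_only(x, y, y_star=1, g=g) for x, y in data]
```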

This strategy achieves a success rate similar to that of the feature–label strategy, but the guarantee weakens as the constant p decreases.

Theorem 3.5.

Consider the feature-only signal strategy and suppose that the signal is ξ-unique. Further, suppose there exists p > 0 such that 𝒫0(y* | x) ≥ p for all x ∈ 𝒳. Then, the success against an ε-optimal classifier is lower bounded by

S(α) ≥ 1 − (1 − p)/(p·α) · ξ − ε/(1 − ε).

The critical mass for achieving a target success probability is thus bounded as follows.

Corollary 3.6.

Suppose the signal is ξ-unique. Then, the critical mass for achieving success S* ∈ (0, 1) with feature-only strategies against an ε-optimal classifier is bounded by

α* ≤ (1 − p)/p · ξ / ( 1 − S* − ε/(1 − ε) ).

The positivity constant p > 0 may be excessively small over the entire data universe. A standard fix to this problem is to restrict 𝒫0 to a subset where the constant is larger, and pay a penalty for the amount of truncation in the bound. For example, if there exists R ⊆ 𝒳 such that 𝒫0(R) ≥ 99%, but the positivity constant over R is much larger than p, then one can obtain a more powerful version of Theorem 3.5.

3.2 Erasing a Signal

Next, we assume the collective wants the classifier to be invariant under a transformation g : 𝒳 → 𝒳 of the features. In particular, success is measured by

S(α) = Pr_{x∼𝒫0}{ f(x) = f(g(x)) }.

In other words, the collective wants the classifier to output the same predictions for all x and x′ that have g(x) = g(x′). The map g can be thought of as a summary of x that removes some feature information. We call this objective “erasing a signal.” For example, if the collective wants the deployed model to be insensitive to the value of a particular feature j*, then it can use g(x) = x′, where x′_j = x_j for j ≠ j* and x′_{j*} = 0. The feature j* could be the length of a video that content creators do not want to affect the ranking of their content, or a sensitive demographic feature that the collective wants to be independent of the predicted label.

Erasure strategy. We define the erasure strategy as

h(x, y) = ( x, argmax_{y′ ∈ 𝒴} 𝒫0(y′ | g(x)) ).
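The strategy needs the statistic argmax_y 𝒫0(y | g(x)), which the collective can estimate from its pooled data. A rough sketch under that assumption (all names hypothetical): bucket the pooled samples by their summary g(x) and take the majority label per bucket.

```python
from collections import Counter, defaultdict

def fit_summary_labels(pool, g):
    """Estimate argmax_y P0(y | g(x)) from pooled samples: the majority
    label among pooled points that share the same summary g(x)."""
    buckets = defaultdict(Counter)
    for x, y in pool:
        buckets[g(x)][y] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in buckets.items()}

def h_erasure(x, y, summary_label, g):
    """Erasure strategy: keep features, relabel using only the summary g(x)."""
    return x, summary_label[g(x)]

g = lambda x: (x[0],)   # summary that drops the second (sensitive) coordinate
pool = [((0, 0), 0), ((0, 1), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
labels = fit_summary_labels(pool, g)
relabeled = [h_erasure(x, y, labels, g) for x, y in pool]
```

The estimate improves with the size of the pool, which is exactly the information-sharing benefit discussed below.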

As before, the success of the strategy depends on problem-dependent quantities. In this case, the quantity of interest is the sensitivity of the labels to the erased signal. We capture this sensitivity in the parameter 𝜏 , defined as

τ := 𝔼_{x∼𝒫0} max_{y∈𝒴} | 𝒫0(y | x) − 𝒫0(y | g(x)) |.

Intuitively, 𝜏 is small if observing the whole feature vector 𝑥 , instead of just the summary 𝑔 ⁢ ( 𝑥 ) , reveals little additional information about the label.

Theorem 3.7.

Consider the erasure strategy. Then, the success against an ε-optimal classifier is lower bounded by

S(α) ≥ 1 − 2(1 − α)/α · τ − ε/(1 − ε).

We rearrange the terms and derive a bound on the critical mass that guarantees a signal can be erased with a desired probability.

Corollary 3.8.

The critical mass for achieving success S* ∈ (0, 1) is bounded by

α* ≤ τ / ( (1 − S*)/2 − ε/(2(1 − ε)) + τ ).

The less sensitive the labels are to the erased information, the smaller the collective needed to successfully enforce a decision rule that is independent of the protected information.

In contrast to the strategies in Section 3.1, the erasure strategy requires knowledge of statistics about 𝒫 0 . This highlights an important benefit of collective action: information sharing. Information about the base distribution is typically difficult to obtain for individual platform users. However, a collective can pool their feature–label information to estimate properties of the distribution from samples; the larger the collective, the better the estimate and consequently the more effective the strategy.

4 Collective Action in Risk Minimization

We next study the effect of collective size when the learner is solving parametric risk minimization. Here the firm is choosing a model from a parameterized set { 𝑓 𝜃 } 𝜃 ∈ Θ . We will use 𝒜 ⁢ ( 𝒫 ) to denote an element in Θ that determines the model chosen by the firm. We begin by studying convex risk minimizers. Then, motivated by nonconvex settings, we look at gradient-descent learners without imposing any convexity assumptions on the objective. Our main working assumption will be that of a risk-minimizing firm.

Definition 4.1 (Risk minimizer).

Fix a loss function ℓ. The firm is a risk minimizer if under distribution 𝒫 it determines the parameter of the model f_θ according to

θ = argmin_{θ′ ∈ Θ} 𝔼_{z∼𝒫} ℓ(θ′; z).

4.1 Convex Risk Minimization

To begin, we assume that ℓ(θ; z) is a convex function of θ, and that the collective’s goal is to move the model from θ0, the optimal model under the base distribution 𝒫0, to a target model θ*. To that end, for a given target model θ* ∈ Θ, we measure success in terms of

S(α) = −‖θ − θ*‖.

Here, ‖·‖ can be any norm (as long as it is kept fixed throughout this section). In line with first-order optimality conditions for convex optimization, the influence of the collective on the learning outcome depends on its ability to influence the average gradient of ℓ. To simplify notation, let g_𝒫(θ′) = 𝔼_{z∼𝒫} ∇ℓ(θ′; z) denote the expected gradient of the loss over distribution 𝒫 evaluated at a point θ′ ∈ Θ.

Gradient-neutralizing strategy. Define the gradient-neutralizing strategy as follows. Find a gradient-neutralizing distribution 𝒫′ for θ*, meaning ∠( g_𝒫′(θ*), −g_𝒫0(θ*) ) = 0. Then, draw z′ ∼ 𝒫′ and let

h(z) = z′  with probability min( 1, (1/α) · ‖g_𝒫0(θ*)‖ / ( ‖g_𝒫′(θ*)‖ + ‖g_𝒫0(θ*)‖ ) ),

h(z) = z  otherwise.

Figure 2: Success rate of Strategy 1 as the collective size varies.

For example, in generalized linear models (GLMs) gradients are given by ∇ℓ(θ; (x, y)) = x(μ_θ(x) − y), where μ_θ(·) is a mean predictor (see, e.g., Chapter 3 in (efron2022exponential)). Therefore, one can obtain a gradient-neutralizing distribution by simply letting h(x, y) = (x′, y′), where x′ = −g_𝒫0(θ*) and y′ is any value less than μ_θ(x′). Alternatively, if the collective is restricted to feature-only strategies, it can set x′ = −g_𝒫0(θ*) only if y < μ_θ(x′), and x′ = 0 otherwise. As long as the label distribution has sufficiently large support under 𝒫0, in particular y < μ_θ(−g_𝒫0(θ*)) with nonzero probability, this strategy likewise results in a gradient-neutralizing distribution.
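The GLM construction can be checked numerically. The sketch below (toy data with a logistic link; the base sample and θ* are illustrative, not from the paper) builds x′ = −g_𝒫0(θ*) with y′ = 0 < μ_θ*(x′) and verifies that the resulting gradient points exactly opposite to the base gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Toy base distribution P0 and an arbitrary target model theta*.
X0 = rng.normal(size=(1000, 3))
y0 = (rng.random(1000) < sigmoid(X0 @ np.array([1.0, -1.0, 0.5]))).astype(float)
theta_star = np.array([0.3, 0.8, -0.2])

def glm_grad(theta, X, y):
    """Average GLM gradient E[x (mu_theta(x) - y)] for the logistic link."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

g0 = glm_grad(theta_star, X0, y0)                 # g_{P0}(theta*)
x_prime = -g0                                     # align the point against g0
y_prime = 0.0                                     # any label below mu_theta*(x')
g_prime = x_prime * (sigmoid(x_prime @ theta_star) - y_prime)

# g_prime is a positive multiple of -g0, i.e. its angle to -g0 is zero.
cosine = g_prime @ g0 / (np.linalg.norm(g_prime) * np.linalg.norm(g0))
```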

Theorem 4.2.

Suppose there exists a gradient-neutralizing distribution 𝒫′ for θ*. Then, if the loss is μ-strongly convex, the success of the gradient-neutralizing strategy is bounded by

S(α) ≥ (1/μ) · min( α‖g_𝒫′(θ*)‖ − (1 − α)‖g_𝒫0(θ*)‖, 0 ).

The natural target success for the collective is for θ* to be reached exactly; this is achieved when S(α) = 0.

Corollary 4.3.

Suppose there exists a gradient-neutralizing distribution 𝒫′ for θ*. Then, for any μ ≥ 0 the critical mass for achieving target success S(α) = 0 is bounded by

α* ≤ ‖g_𝒫0(θ*)‖ / ( ‖g_𝒫′(θ*)‖ + ‖g_𝒫0(θ*)‖ ).   (4)

Even if ℓ is only strictly convex (μ → 0), the collective can reach θ* with α* as in (4). Note that this corollary reveals an intuitive relationship between α* and g_𝒫0(θ*) in the convex regime: target models θ* that look more optimal to the learner under the base distribution are easier to achieve.
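Plugging gradient norms into bound (4) illustrates this relationship (the norms below are illustrative placeholders, not measured quantities):

```python
def critical_mass_neutralize(g0_norm: float, gp_norm: float) -> float:
    """Critical-mass bound (4): alpha* <= ||g_P0|| / (||g_P'|| + ||g_P0||),
    with both gradients evaluated at the target model theta*."""
    return g0_norm / (gp_norm + g0_norm)

# theta* nearly optimal on P0 (tiny base gradient) vs. far from optimal:
alpha_near = critical_mass_neutralize(g0_norm=0.01, gp_norm=1.0)
alpha_far = critical_mass_neutralize(g0_norm=1.00, gp_norm=1.0)
```

With these numbers the nearly optimal target needs a collective roughly fifty times smaller than the distant one.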

If the collective has a utility function u(θ′) that specifies the participants’ preferences over different models θ′, a natural way to define success is via a desired gain in utility: S(α) = u(θ) − u(θ0), where θ0 = 𝒜(𝒫0). Corollary 4.3 implies a bound on the critical mass for this measure of success for all convex utilities (for example, linear utilities of the form u(θ) = θ⊤v for some v).

Proposition 4.4.

Suppose that u(θ′) is convex. Further, assume ℓ is β-smooth and that ‖·‖ is the ℓ2-norm. Then, the critical mass for achieving u(θ) − u(θ0) ≥ U is bounded by

α* ≤ β·U / ( g_lb · ‖∇u(θ0)‖ + β·U ),

where g_lb = min{ ‖g_𝒫′(θ′)‖ : ‖θ′ − θ0‖ ≤ U/‖∇u(θ0)‖ } and 𝒫′ is gradient-neutralizing for θ′.

As a result, the critical mass required to achieve a utility gain of U decreases as the gradient of the utility at θ0 grows. Intuitively, a large ‖∇u(θ0)‖ means that small changes to θ0 can lead to large improvements for the collective.

4.2 Gradient-Based Learning

So far we have assumed that exact optimization is computationally feasible; with nonconvex objectives, this is hardly realistic. A common approach to risk minimization for general, possibly nonconvex learning problems is to run gradient descent. Formally, at each step t we assume the learner observes the current data distribution 𝒫_t, computes the average gradient at the current iterate, and updates the model by taking a gradient step:

θ_{t+1} = θ_t − η · g_{𝒫_t}(θ_t),

where η > 0 is a fixed step size. Given a target model θ*, we define the success of the strategy after t steps as

S_t(α) = −‖θ_t − θ*‖.

Given the online nature of the learner’s updates, we assume that the collective can adaptively interact with the learner; that is, it can choose 𝒫_𝑡∗ at every step 𝑡. This is a typical interaction model in federated learning (mcmahan17federated). In the following we show that the additional leverage afforded by adaptivity allows the collective to implement a refined strategy that controls the outcome of learning even in nonconvex settings.

Gradient-control strategy. We define the gradient-control strategy at 𝜃 as follows. Given the observed model 𝜃 and a target 𝜃∗, the collective finds a gradient-redirecting distribution 𝒫′ for 𝜃, meaning

𝑔_𝒫′(𝜃) = −((1−𝛼)/𝛼)·𝑔_𝒫₀(𝜃) + 𝜉·(𝜃 − 𝜃∗), for some 𝜉 ∈ (0, 1/(𝛼𝜂)).

Then, draw 𝑧′ ∼ 𝒫′ and set ℎ(𝑧) = 𝑧′.

Figure 3: Success rate of Strategy 2 as the collective size varies.

Theorem 4.5.

Assume the collective can implement the gradient-control strategy at all 𝜆𝜃₀ + (1−𝜆)𝜃∗, 𝜆 ∈ [0, 1]. Then, there exists a 𝐶(𝛼) > 0 such that the success of the gradient-control strategy after 𝑇 steps is lower bounded by

𝑆_𝑇(𝛼) ≥ −(1 − 𝜂𝐶(𝛼))^𝑇 · ‖𝜃₀ − 𝜃∗‖.

The above result implies that the collective can reach any model 𝜃 ∗ as long as there exists a path from 𝜃 0 to 𝜃 ∗ that only traverses small gradients on the initial distribution 𝒫 0 .
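As a sanity check on the strategy, the following minimal simulation runs the learner’s gradient descent against the gradient-control strategy on a toy quadratic base risk; all constants are illustrative and chosen so that 𝜉 ∈ (0, 1/(𝛼𝜂)).

```python
# Toy instance of the gradient-control strategy: the base data pulls the
# model toward mu0, while a collective of fraction alpha redirects the
# average gradient toward theta_star. All numbers are illustrative.
mu0 = [0.0, 0.0]                  # minimizer of the base risk
theta_star = [3.0, -1.0]          # the collective's target model
alpha, eta, xi, T = 0.05, 0.5, 1.0, 500   # xi lies in (0, 1/(alpha*eta)) = (0, 40)

theta = [1.0, 1.0]                # theta_0
for _ in range(T):
    g_base = [t - m for t, m in zip(theta, mu0)]   # base-risk gradient (quadratic loss)
    # gradient-redirecting distribution: cancel the base gradient, then
    # pull toward theta_star with strength xi
    g_coll = [-(1 - alpha) / alpha * g + xi * (t - s)
              for g, t, s in zip(g_base, theta, theta_star)]
    # the learner averages over the mixture; this reduces to
    # alpha * xi * (theta - theta_star), a contraction toward theta_star
    g_mix = [alpha * gc + (1 - alpha) * gb for gc, gb in zip(g_coll, g_base)]
    theta = [t - eta * g for t, g in zip(theta, g_mix)]

dist_to_target = sum((t - s) ** 2 for t, s in zip(theta, theta_star)) ** 0.5
```

Each step shrinks the distance to the target by the factor (1 − 𝜂𝛼𝜉), matching the geometric rate in Theorem 4.5.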

5 Experiments

We report on experimental findings from over 2000 model training runs involving a BERT-like text transformer model on a resume classification task. The resume dataset consists of nearly 30000 resumes of freelancers on a major gig labor platform, introduced by (jiechieu2021skills). The task is a multiclass, multilabel classification problem where the goal is to predict a set of up to ten skills from the software and IT sector based on the resume.

The collective controls a random fraction of the training data within the dataset. Its goal is to plant a signal, that is, steer the model’s predictions on transformed data points 𝑔(𝑥) toward a desired target label 𝑦∗. We evaluate the effectiveness of two simple strategies, which are instantiations of the feature-label and feature-only strategies from Section 3.1.

Strategy 1 (Text and label manipulation across dataset). The collective plants a special token in the resume text and changes the label to the target class. This strategy closely mirrors the feature-label signal strategy in (1).

Strategy 2 (Text-only manipulation within target class). The collective manipulates the resume text of resumes within the target class by inserting a special token at a fixed spacing (every 20th word). This strategy closely follows the feature-only signal strategy in (3).

Evaluation. To measure success we insert the special token in all test points and count how often the model (a) includes the target class in its set of predicted skills, or (b) has the target class as its “top-1” prediction.
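Both strategies amount to inserting a trigger token into text at a fixed spacing; a minimal sketch of the manipulation, with a placeholder token string:

```python
def plant_token(text, token, every=20):
    """Insert a trigger token after every `every`-th word. The token string
    and spacing here are placeholders for the unused formatting symbol and
    the 20-word spacing used in the experiments."""
    words = text.split()
    out = []
    for i, word in enumerate(words, start=1):
        out.append(word)
        if i % every == 0:
            out.append(token)
    return " ".join(out)

poisoned = plant_token(" ".join(["skill"] * 40), "[TRIGGER]", every=20)
```

For Strategy 1 the collective additionally swaps the label to the target class; for Strategy 2 only resumes already in the target class are rewritten.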

5.1 Experimental Setup

We use ‘distilbert-base-uncased’, a standard pretrained transformer model. We fine-tune it on the dataset for 5 epochs with standard hyperparameters. After 5 epochs, the model plateaus at around 97% multi-label accuracy (defined as 1 minus Hamming loss), 93.8% precision, and 88.9% recall. The dataset contains 29783 resumes, of which we use 25000 for training and 4783 for testing. We focus on the first three classes of the problem, corresponding to database administrator (class 0), web developer (class 1), and software developer (class 2). These three classes occur with frequency 0.11, 0.23, and 0.5, respectively, in the dataset. As the special token, we use an unused formatting symbol (token 1240, a small dash), which we insert every 20 words.

5.2 Experimental Findings

Text and label manipulation across dataset. We find that Strategy 1 exerts significant control over the model’s prediction even when the collective is exceedingly small (Figure 2). In fact, we see consistent success in controlling the model’s output well below 0.5 % of the dataset, i.e., fewer than 125 manipulated training data points.

Text-only manipulation within target class. We find that Strategy 2 consistently succeeds in controlling the model so as to include the target class in its positive predictions. The strategy succeeds at a threshold of around 10 % of the instances of the target class (Figure 3, top panel). This threshold corresponds to approximately 1 % , 2 % , and 5 % of the dataset for class 0 , 1 , and 2 , respectively. When it comes to controlling the model’s top prediction, the text-only strategy does not consistently succeed (Figure 3, bottom panel).
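The mapping from a within-class threshold to a fraction of the whole training set is a simple product with the class frequencies from Section 5.1; a quick check of the numbers quoted above:

```python
# Class frequencies reported in Section 5.1 and the ~10% within-class
# threshold observed for Strategy 2.
class_freq = {0: 0.11, 1: 0.23, 2: 0.50}
within_class_threshold = 0.10

# Fraction of the full training set the collective must control:
dataset_fraction = {c: within_class_threshold * f for c, f in class_freq.items()}
# roughly 1.1%, 2.3%, and 5% for classes 0, 1, and 2
```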

Effect of positivity constant. Our theory in Section 3.1 suggests that the difficulty of controlling the model’s top prediction via the text-only strategy may be due to a small positivity constant 𝑝. To evaluate this hypothesis, we repeat our experiments after randomizing a fraction of the labels in the training data. This randomization ensures that each feature vector is assigned the target label with nontrivial probability. Our findings confirm that even a small fraction of random labels dramatically increases the success of Strategy 2 in controlling the top prediction (Figure 4).

Figure 4: Random labels increase success of Strategy 2.

Trade-offs between model optimality and success. Figure 5 shows that the success of either strategy is sensitive to the number of epochs. Less optimization during the model training phase leads to a lower success rate. These findings mirror our theoretical results: as the model approaches optimality, small collectives have significant power.

Figure 5: Additional epochs of training increase the success rate.

Robustness to trigger token placement. Figure 7 in the Appendix shows that the success rate of either strategy is insensitive to the spacing of the trigger token. Varying the token spacing from every 10th to every 50th word has little effect on the success rate. This experimental finding, too, is in line with our theory. Since the token chosen in our strategy is unique, the set of texts augmented with this unique token has low probability regardless of how often the token is planted.

6 Discussion

We conclude the paper with a short discussion highlighting the economic significance of understanding the critical mass 𝛼 ∗ for pursuing collective targets. It is well-known in economics that participation in a collective is not individually rational, and additional incentives are necessary for collective action to emerge. Building on a classic model for collective action from economics (olson1965logic), we illustrate how similar conclusions hold for algorithmic collective action, and how they relate to the theoretical quantities studied in this paper.

Assume that individuals incur a cost 𝑐 > 0 for participating in collective action. This cost might represent overheads of coordination, a membership fee, or other additional responsibilities. Furthermore, assume that the utility that individuals get from joining a collective of size 𝛼 is 𝑆(𝛼), and that otherwise they can partially “free ride” on the collective’s efforts: they get utility 𝛾𝑆(𝛼) for some 𝛾 ∈ [0, 1]. Given this setup, individually rational agents will join the collective if

𝑆(𝛼) − 𝑐 > 𝛾𝑆(𝛼),

or equivalently, if 𝑆(𝛼) > 𝑐/(1−𝛾). Therefore, joining the collective is rational if the size of the existing collective 𝛼 is greater than the critical mass for 𝑆∗ = 𝑐/(1−𝛾). Note that, once this critical threshold is reached, all individuals in the population are incentivized to join the collective and the collective is thus self-sustaining.
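The joining condition takes only a few lines to state in code; the cost and free-riding discount below are illustrative.

```python
def joining_is_rational(S_alpha, c, gamma):
    """Joining beats free-riding when S(alpha) - c > gamma * S(alpha),
    i.e. when S(alpha) > c / (1 - gamma). Inputs are illustrative."""
    return S_alpha - c > gamma * S_alpha

# With cost c = 0.1 and discount gamma = 0.5, the self-sustaining
# threshold is S(alpha) > 0.2:
above = joining_is_rational(S_alpha=0.3, c=0.1, gamma=0.5)
below = joining_is_rational(S_alpha=0.15, c=0.1, gamma=0.5)
```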

Consider a principal who would like to invest in the formation of a collective. The area 𝐵(𝛼_crit) visualized in Figure 6 provides an upper bound on the investment required to make the collective self-sustaining and thus achieve any target success 𝑆∗ ≤ 𝑆(1).

Figure 6: Visualization of the critical threshold 𝛼_crit after which a collective is self-sustaining, and the principal’s required investment 𝐵(𝛼_crit) to incentivize the whole population to join the collective.

The derivation above, while simplistic, highlights the importance of collective size in understanding how collectives can emerge both organically and through investment. We believe there is large potential in investigating these questions rigorously. The focus of this paper has been the effect of the collective’s size on its success. More generally, how collectives form, which individuals have the most incentive to join, whether selectively recruiting individuals provides additional leverage, and how collectives should use their informational advantage to optimize their strategies are important open questions in understanding the role of collectives on digital platforms.

Acknowledgements

The authors would like to thank Solon Barocas for pointers to related work, and attendees of the 2023 Annual Meeting of the Simons Collaboration on the Theory of Algorithmic Fairness for feedback. We thank Christos Papadimitriou for stimulating discussions about the work. This work was supported by the Tübingen AI Center and NSF Award 2240110.

Appendix A Related Work

The scholarly literature on the gig economy is vast and interdisciplinary, spanning economic, ethnographic, psychological, and sociological analysis. Gig labor is diverse and heterogeneous. Conditions of precarity and dependence differ widely depending on the type of work and platform (schor2020dependence). Reviewing and integrating more than two hundred articles on the topic, cropanzano2022organizational define gig work as “labor contracted and compensated on a short-term basis to organizations or to individual clients through an external labor market,” and detail how gig work has changed the psychological contract between workers and employers. vallas2020platforms review existing scholarly accounts of the gig economy, arguing that “platforms represent a distinctive form of economic activity, […] different from markets, hierarchies, and networks.” Platforms cede some forms of centralized managerial control over workers by exposing them to the disciplining effect of the market and evaluation by consumers, while retaining power over key functions, such as task allocation, data collection, pricing, and collection of revenues. In another review article, sutherland2018sharing organize more than four hundred articles around the notion of platform centralization and decentralization.

Several works examine the reality of gig labor, e.g., (van2017platform; sun2019your; gray2019ghost). Based on a cross-regional survey, wood2019good find that algorithmic control in the gig economy can lead to “low pay, social isolation, working unsocial and irregular hours, overwork, sleep deprivation and exhaustion”, “marked by high levels of inter-worker competition with few labour protections and a global oversupply of labour relative to demand”.

cameron2022expanding examine the interplay of control and resistance in the gig economy. There are several examples of successful worker organization in the gig economy, involving a range of strategies. For example, rahman2021invisible studies how freelancers on Upwork strategize against the evaluation metrics of the platform, sometimes in cooperation with clients on the platform. Also studying Upwork freelancers, jarrahi2019algorithmic discuss how freelancers cooperate in strategically feeding the algorithm data so as to improve their outcomes. Cooperative strategic behavior among drivers on ride-hailing platforms is common, e.g., (cameron2020rise; robinson2017making; yu2022emergence), as are digital strategies involving bots, or multiple phone apps (chen2018thrown). Workers have also used forums, browser extensions, and online spaces to share information and strategize collectively, e.g., (irani2013turkopticon; salehi2015we; o2019weapons). We focus on the activities on the labor side of digital platforms, leaving out numerous examples of collective action from consumers and users on these platforms. However, as vallas2020platforms conclude, “the upsurge of worker mobilization should not blind us to the difficulties of organizing such a diverse and spatially dispersed labour force or the power of the companies to resist collective action.”

There is extensive scholarship on the topic of collective action. For example, Melucci’s text (melucci1996challenging) examines collective action in the information society. milan2015when examines how social media platforms mediate social movements and collective action.

Appendix B Proofs

The following lemma will be used to analyze suboptimal classifiers.

Lemma B.1.

Suppose that 𝒫, 𝒫′ are two distributions such that TV(𝒫, 𝒫′) ≤ 𝜖. Take any two events 𝐸₁, 𝐸₂ measurable under 𝒫, 𝒫′. If 𝒫(𝐸₁) > 𝒫(𝐸₂) + 𝜖/(1−𝜖), then 𝒫′(𝐸₁) > 𝒫′(𝐸₂).

Proof.

It follows from the optimal coupling lemma for the total variation distance that we can write 𝒫′ = (1−𝜖)𝒫 + 𝜖𝒬 for some distribution 𝒬. Therefore, if 𝒫(𝐸₁) > 𝒫(𝐸₂) + 𝜖/(1−𝜖), then

𝒫′(𝐸₁) = (1−𝜖)𝒫(𝐸₁) + 𝜖𝒬(𝐸₁) > (1−𝜖)𝒫(𝐸₂) + 𝜖 ≥ (1−𝜖)𝒫(𝐸₂) + 𝜖𝒬(𝐸₂) = 𝒫′(𝐸₂). ∎
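The lemma’s margin condition is easy to check numerically: mixing 𝒫 with an arbitrary 𝒬 at weight 𝜖 keeps the total variation within 𝜖, and a margin of 𝜖/(1−𝜖) under 𝒫 survives the mixing. A small discrete check with illustrative numbers:

```python
# Discrete check of Lemma B.1: P' = (1 - eps) P + eps Q, with Q chosen
# adversarially to dump all extra mass on E2. Numbers are illustrative.
eps = 0.2
P = {0: 0.6, 1: 0.3, 2: 0.1}
Q = {0: 0.0, 1: 0.0, 2: 1.0}
P_prime = {z: (1 - eps) * P[z] + eps * Q[z] for z in P}

E1, E2 = {0}, {2}
def prob(dist, event):
    return sum(dist[z] for z in event)

margin_holds = prob(P, E1) > prob(P, E2) + eps / (1 - eps)   # 0.6 > 0.35
conclusion = prob(P_prime, E1) > prob(P_prime, E2)           # 0.48 > 0.28
```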

B.1 Proof of Theorem 3.3

First consider the case 𝜖 = 0. We start with a sufficient condition for a target classification outcome. For a point 𝑥 ∈ 𝒳, we define

Δ_𝑥 = max_{𝑦∈𝒴} 𝒫₀(𝑦|𝑥) − 𝒫₀(𝑦∗|𝑥)

as the suboptimality of a target class on the base data.

Claim B.2.

For any 𝑥 ∈ 𝒳, we have 𝑓(𝑥) = 𝑦∗ provided that 𝛼 > (1−𝛼)·Δ_𝑥·𝒫₀(𝑥)/𝒫∗(𝑥).

Proof.

Note that 𝑓(𝑥) = 𝑦∗ if, for every 𝑦 ≠ 𝑦∗, 𝒫(𝑦∗|𝑥) > 𝒫(𝑦|𝑥). Equivalently, 𝒫(𝑥, 𝑦∗) − 𝒫(𝑥, 𝑦) > 0. But,

𝒫(𝑥, 𝑦∗) = 𝛼𝒫∗(𝑥, 𝑦∗) + (1−𝛼)𝒫₀(𝑥, 𝑦∗) = 𝛼𝒫∗(𝑥) + (1−𝛼)𝒫₀(𝑦∗|𝑥)𝒫₀(𝑥).

In the last step we used the fact that all labels in the support of 𝒫∗ equal 𝑦∗. Similarly, for 𝑦 ≠ 𝑦∗,

𝒫(𝑥, 𝑦) = 𝛼𝒫∗(𝑥, 𝑦) + (1−𝛼)𝒫₀(𝑥, 𝑦) = (1−𝛼)𝒫₀(𝑦|𝑥)𝒫₀(𝑥).

The claim follows by rearranging terms and dividing both sides by 𝒫∗(𝑥). ∎

Now,

𝑆(𝛼) = Pr_{𝑥∼𝒫₀}{𝑓(𝑔(𝑥)) = 𝑦∗}
= Pr_{𝑥∼𝒫∗}{𝑓(𝑥) = 𝑦∗}
≥ Pr_{𝑥∼𝒫∗}{𝛼 > (1−𝛼)·(𝒫₀(𝑥)/𝒫∗(𝑥))·Δ_𝑥}   (Claim B.2)
= 𝔼_{𝑥∼𝒫∗} 𝟏{1 − ((1−𝛼)/𝛼)·(𝒫₀(𝑥)/𝒫∗(𝑥))·Δ_𝑥 > 0}
≥ 𝔼_{𝑥∼𝒫∗}[1 − ((1−𝛼)/𝛼)·(𝒫₀(𝑥)/𝒫∗(𝑥))·Δ_𝑥]
= 1 − ((1−𝛼)/𝛼)·𝔼_{𝑥∼𝒫∗}[(𝒫₀(𝑥)/𝒫∗(𝑥))·Δ_𝑥]
≥ 1 − ((1−𝛼)/𝛼)·𝒫₀(𝒳∗)·Δ,

where the last step uses the definition Δ = max_{𝑥∈𝒳∗} Δ_𝑥.

Consider 𝜖 > 0. By Lemma B.1, we have that 𝒫′(𝑥, 𝑦∗) > 𝒫′(𝑥, 𝑦), meaning 𝑓(𝑥) = 𝑦∗, provided that 𝒫(𝑥, 𝑦∗) > 𝒫(𝑥, 𝑦) + 𝜖/(1−𝜖). Repeating the steps in the proof for 𝜖 = 0 with the additional 𝜖/(1−𝜖) term, we conclude that

𝑆(𝛼) ≥ 1 − 𝜖/(1−𝜖) − ((1−𝛼)/𝛼)·𝒫₀(𝒳∗)·Δ.
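The bound of Theorem 3.3 can be checked on a small discrete example. The sketch below uses illustrative numbers, with a single signal point 𝑥∗ so that the feature-label strategy maps every feature to 𝑥∗ and relabels it 𝑦∗; it computes the Bayes prediction on the mixture and compares the realized success with the bound.

```python
# Discrete check of Theorem 3.3 (illustrative numbers): all features are
# mapped to a single signal point x* = 3 and relabeled y* = 1, i.e. the
# feature-label signal strategy with signal set {3}.
X, Y = [0, 1, 2, 3], [0, 1]
x_star, y_star = 3, 1

P0_x = {0: 0.35, 1: 0.35, 2: 0.25, 3: 0.05}        # base feature marginal
P0_y_given_x = {x: {0: 0.8, 1: 0.2} for x in X}    # base label conditionals

def success_and_bound(alpha):
    # mixture P = (1 - alpha) P0 + alpha P*, with P* a point mass at (x*, y*)
    P = {(x, y): (1 - alpha) * P0_x[x] * P0_y_given_x[x][y]
         for x in X for y in Y}
    P[(x_star, y_star)] += alpha
    f_at_signal = max(Y, key=lambda y: P[(x_star, y)])  # Bayes prediction at x*
    success = 1.0 if f_at_signal == y_star else 0.0     # S(alpha) = 1{f(x*) = y*}
    delta = max(P0_y_given_x[x_star].values()) - P0_y_given_x[x_star][y_star]
    bound = 1 - (1 - alpha) / alpha * P0_x[x_star] * delta
    return success, bound
```

With these numbers the strategy flips the prediction at 𝑥∗ once 𝛼 exceeds roughly 3%, and the realized success stays above the bound on both sides of the threshold.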

B.2 Proof of Theorem 3.5

We prove the case where 𝜖 = 0. The extension to 𝜖 > 0 follows as in Theorem 3.3.

Claim B.3.

Fix a point 𝑥∗ ∈ 𝒳∗ in the signal set. We have 𝑓(𝑥∗) = 𝑦∗ provided that

𝛼 > ((1−𝑝)/𝑝)·𝒫₀(𝑥∗)/𝒫₀(𝑔⁻¹(𝑥∗)).

Here, 𝑔⁻¹(𝑥∗) = {𝑥 ∈ 𝒳 : 𝑔(𝑥) = 𝑥∗}.

Proof.

For 𝑓(𝑥∗) = 𝑦∗ to hold, we need 𝒫(𝑦∗|𝑥∗) > max_{𝑦≠𝑦∗} 𝒫(𝑦|𝑥∗). Equivalently, 𝒫(𝑥∗, 𝑦∗) > max_{𝑦≠𝑦∗} 𝒫(𝑥∗, 𝑦).

By the definition of the feature-only signal strategy and the assumption that 𝒫₀(𝑦∗|𝑥) ≥ 𝑝 for all 𝑥 ∈ 𝒳, each point 𝑥 ∈ 𝑔⁻¹(𝑥∗) has 𝒫₀(𝑦∗|𝑥) ≥ 𝑝. Hence, for all 𝑥∗ ∈ 𝒳∗,

𝒫(𝑥∗, 𝑦∗) = 𝛼𝒫∗(𝑥∗, 𝑦∗) + (1−𝛼)𝒫₀(𝑥∗, 𝑦∗) ≥ 𝛼𝑝𝒫₀(𝑔⁻¹(𝑥∗)).

On the other hand, for every 𝑦 ≠ 𝑦∗, we must have

𝒫(𝑥∗, 𝑦) ≤ 𝒫₀(𝑥∗, 𝑦) = 𝒫₀(𝑦|𝑥∗)𝒫₀(𝑥∗) ≤ (1−𝑝)𝒫₀(𝑥∗).

The claim follows by rearranging. ∎

We can lower bound the success rate as

𝑆(𝛼) = Pr_{𝑥∼𝒫₀}{𝑓(𝑔(𝑥)) = 𝑦∗}
= ∑_{𝑥∗∈𝒳∗} Pr_{𝑥∼𝒫₀}{𝑓(𝑔(𝑥)) = 𝑦∗ ∣ 𝑥 ∈ 𝑔⁻¹(𝑥∗)}·Pr_{𝑥∼𝒫₀}{𝑥 ∈ 𝑔⁻¹(𝑥∗)}
= ∑_{𝑥∗∈𝒳∗} 𝟏{𝑓(𝑥∗) = 𝑦∗}·𝒫₀(𝑔⁻¹(𝑥∗)).   (5)

Proceeding for fixed 𝑥∗ ∈ 𝒳∗,

𝟏{𝑓(𝑥∗) = 𝑦∗} ≥ 𝟏{𝛼 > ((1−𝑝)/𝑝)·𝒫₀(𝑥∗)/𝒫₀(𝑔⁻¹(𝑥∗))}   (Claim B.3)
= 𝟏{1 − ((1−𝑝)/(𝑝𝛼))·𝒫₀(𝑥∗)/𝒫₀(𝑔⁻¹(𝑥∗)) > 0}
≥ 1 − ((1−𝑝)/(𝑝𝛼))·𝒫₀(𝑥∗)/𝒫₀(𝑔⁻¹(𝑥∗)).

Plugging this back into (5),

Pr_{𝑥∼𝒫₀}{𝑓(𝑔(𝑥)) = 𝑦∗} ≥ 1 − ((1−𝑝)/(𝑝𝛼))·∑_{𝑥∗∈𝒳∗} (𝒫₀(𝑥∗)/𝒫₀(𝑔⁻¹(𝑥∗)))·𝒫₀(𝑔⁻¹(𝑥∗))
≥ 1 − ((1−𝑝)/(𝑝𝛼))·𝒫₀(𝒳∗).

B.3 Proof of Theorem 3.7

We again prove the case where 𝜖 = 0. The extension to 𝜖 > 0 follows by invoking Lemma B.1, as in Theorem 3.3.

We start from the following claim.

Claim B.4.

For any 𝑥 ∈ 𝒳 we have 𝑓(𝑥) = 𝑓(𝑔(𝑥)) provided that

𝛼 > (1−𝛼)·2𝜏(𝑥),

where 𝜏(𝑥) = max_{𝑦∈𝒴} |𝒫₀(𝑦|𝑥) − 𝒫₀(𝑦|𝑔(𝑥))|.

Proof.

Denote 𝑦∗(𝑥) = argmax_{𝑦∈𝒴} 𝒫₀(𝑦|𝑔(𝑥)). By construction of the strategy we know that 𝑓(𝑔(𝑥)) = 𝑦∗(𝑥), and it remains to prove that 𝑓(𝑥) = 𝑦∗(𝑥) under the condition of the claim.

We have 𝑓(𝑥) = 𝑦∗(𝑥) if 𝒫(𝑦∗(𝑥)|𝑥) > 𝒫(𝑦|𝑥) for any 𝑦 ≠ 𝑦∗(𝑥). We have

𝒫(𝑦∗(𝑥)|𝑥) = (1−𝛼)𝒫₀(𝑦∗(𝑥)|𝑥) + 𝛼𝒫∗(𝑦∗(𝑥)|𝑥) = (1−𝛼)𝒫₀(𝑦∗(𝑥)|𝑥) + 𝛼,
𝒫(𝑦|𝑥) = (1−𝛼)𝒫₀(𝑦|𝑥) + 𝛼𝒫∗(𝑦|𝑥) = (1−𝛼)𝒫₀(𝑦|𝑥),

where we used that the erasure strategy implies 𝒫∗(𝑦∗(𝑥)|𝑥) = 1. Together this means that, when

𝛼 > (1−𝛼)·[max_{𝑦∈𝒴} 𝒫₀(𝑦|𝑥) − 𝒫₀(𝑦∗(𝑥)|𝑥)],

then 𝑓(𝑥) = 𝑦∗(𝑥). Using the definition of 𝑦∗(𝑥), we can bound the right-hand side by

𝒫₀(𝑦|𝑥) − 𝒫₀(𝑦∗(𝑥)|𝑥) ≤ 𝒫₀(𝑦|𝑥) − 𝒫₀(𝑦|𝑔(𝑥)) + 𝒫₀(𝑦∗(𝑥)|𝑔(𝑥)) − 𝒫₀(𝑦∗(𝑥)|𝑥) ≤ 2𝜏(𝑥).

The claim follows. ∎

It remains to bound the success of the strategy:

𝑆(𝛼) = Pr_{𝑥∼𝒫₀}{𝑓(𝑥) = 𝑓(𝑔(𝑥))}
= Pr_{𝑥∼𝒫₀}{𝑓(𝑥) = 𝑦∗(𝑥)}
≥ Pr_{𝑥∼𝒫₀}{𝛼 > (1−𝛼)·2𝜏(𝑥)}
= Pr_{𝑥∼𝒫₀}{1 − ((1−𝛼)/𝛼)·2𝜏(𝑥) > 0}
≥ 𝔼_{𝑥∼𝒫₀}[1 − (2(1−𝛼)/𝛼)·𝜏(𝑥)]
= 1 − (2(1−𝛼)/𝛼)·𝜏,

where we use the definition 𝜏 = 𝔼_{𝑥∼𝒫₀} 𝜏(𝑥).

B.4 Proof of Theorem 4.2

Let 𝒫′ be a gradient-neutralizing distribution for 𝜃∗. Denote 𝑝 = min(1, (1/𝛼)·‖𝑔_𝒫₀(𝜃∗)‖/(‖𝑔_𝒫′(𝜃∗)‖ + ‖𝑔_𝒫₀(𝜃∗)‖)). Then,

𝔼_{𝑧∼𝒫} ∇ℓ(𝜃∗; 𝑧) = (1−𝛼)𝔼_{𝑧∼𝒫₀} ∇ℓ(𝜃∗; 𝑧) + 𝛼𝔼_{𝑧∼𝒫∗} ∇ℓ(𝜃∗; 𝑧)
= (1−𝛼𝑝)𝔼_{𝑧∼𝒫₀} ∇ℓ(𝜃∗; 𝑧) + 𝛼𝑝·𝔼_{𝑧∼𝒫′} ∇ℓ(𝜃∗; 𝑧)
= (1−𝛼𝑝)𝑔_𝒫₀(𝜃∗) + 𝛼𝑝·𝑔_𝒫′(𝜃∗)
= (1 − 𝛼𝑝 − 𝛼𝑝·‖𝑔_𝒫′(𝜃∗)‖/‖𝑔_𝒫₀(𝜃∗)‖)·𝑔_𝒫₀(𝜃∗)
= (1 − 𝛼𝑝·(‖𝑔_𝒫₀(𝜃∗)‖ + ‖𝑔_𝒫′(𝜃∗)‖)/‖𝑔_𝒫₀(𝜃∗)‖)·𝑔_𝒫₀(𝜃∗)
= max(1 − 𝛼·(‖𝑔_𝒫₀(𝜃∗)‖ + ‖𝑔_𝒫′(𝜃∗)‖)/‖𝑔_𝒫₀(𝜃∗)‖, 0)·𝑔_𝒫₀(𝜃∗)
= max((1−𝛼)‖𝑔_𝒫₀(𝜃∗)‖ − 𝛼‖𝑔_𝒫′(𝜃∗)‖, 0)·𝑔_𝒫₀(𝜃∗)/‖𝑔_𝒫₀(𝜃∗)‖.

Therefore, ‖𝔼_{𝑧∼𝒫} ∇ℓ(𝜃∗; 𝑧)‖ = max((1−𝛼)‖𝑔_𝒫₀(𝜃∗)‖ − 𝛼‖𝑔_𝒫′(𝜃∗)‖, 0). Applying the definition of 𝜇-strong convexity, we get

‖𝜃∗ − 𝜃‖ ≤ (1/𝜇)·‖𝔼_{𝑧∼𝒫} ∇ℓ(𝜃∗; 𝑧) − 𝔼_{𝑧∼𝒫} ∇ℓ(𝜃; 𝑧)‖
= (1/𝜇)·‖𝔼_{𝑧∼𝒫} ∇ℓ(𝜃∗; 𝑧)‖
= (1/𝜇)·max((1−𝛼)‖𝑔_𝒫₀(𝜃∗)‖ − 𝛼‖𝑔_𝒫′(𝜃∗)‖, 0).

The first equality follows because 𝔼_{𝑧∼𝒫} ∇ℓ(𝜃; 𝑧) = 0, due to the loss being convex and the firm being a risk minimizer. Multiplying both sides by −1, we obtain a lower bound on the success 𝑆(𝛼) = −‖𝜃∗ − 𝜃‖.
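The chain of equalities above can be verified numerically for a particular gradient-neutralizing direction; the gradient vectors below are hypothetical.

```python
import math

# Numeric check of the mixture-gradient computation in the proof of
# Theorem 4.2, with hypothetical gradients at theta*.
def norm(v):
    return math.sqrt(sum(x * x for x in v))

g0 = [1.0, 2.0]                        # g_P0(theta*)
n0 = norm(g0)
g_prime = [-3.0 * x / n0 for x in g0]  # gradient-neutralizing: anti-parallel, norm 3
n1 = norm(g_prime)

alpha = 0.5
p = min(1.0, n0 / (alpha * (n1 + n0)))  # mixing weight p from the proof
g_mix = [(1 - alpha * p) * a + alpha * p * b for a, b in zip(g0, g_prime)]

# ||g_mix|| should equal max((1-alpha)||g0|| - alpha||g'||, 0)
closed_form = max((1 - alpha) * n0 - alpha * n1, 0.0)
```

Here 𝛼 exceeds the critical mass, so both the mixture gradient and the closed-form expression vanish, making 𝜃∗ a risk minimizer on the mixture.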

B.5 Proof of Corollary 4.3

To achieve 𝑆(𝛼) = 0, Theorem 4.2 shows that it suffices to have 𝛼‖𝑔_𝒫′(𝜃∗)‖ ≥ (1−𝛼)‖𝑔_𝒫₀(𝜃∗)‖, for any 𝜇. Rearranging the terms and expressing 𝛼 completes the proof.

B.6 Proof of Proposition 4.4

If 𝑢 is convex, then for all 𝜃′ we know

𝑢(𝜃′) ≥ 𝑢(𝜃₀) + ∇𝑢(𝜃₀)⊤(𝜃′ − 𝜃₀).

Let 𝜃∗ = 𝜃₀ + (∇𝑢(𝜃₀)/‖∇𝑢(𝜃₀)‖²)·𝑈. Then, 𝑢(𝜃∗) − 𝑢(𝜃₀) ≥ 𝑈.

Now, we apply Corollary 4.3 to upper bound the critical mass needed to reach 𝜃∗. We have

𝛼∗ ≤ ‖𝑔_𝒫₀(𝜃∗)‖/(‖𝑔_𝒫′(𝜃∗)‖ + ‖𝑔_𝒫₀(𝜃∗)‖) ≤ 𝛽‖𝜃∗ − 𝜃₀‖/(𝑔_lb + 𝛽‖𝜃∗ − 𝜃₀‖),

where we apply smoothness and the definition of 𝑔_lb. Observing that ‖𝜃∗ − 𝜃₀‖ = 𝑈/‖∇𝑢(𝜃₀)‖ completes the proof.

B.7 Proof of Theorem 4.5

Fix a time step 𝑡 and a model 𝜃_𝑡. Denote by 𝒫′_𝑡 the gradient-redirecting distribution found at step 𝑡, and let 𝜉(𝜃_𝑡) = ‖𝑔_{𝒫′_𝑡}(𝜃_𝑡) + ((1−𝛼)/𝛼)·𝑔_𝒫₀(𝜃_𝑡)‖ / ‖𝜃_𝑡 − 𝜃∗‖. Then, the gradient-redirecting strategy induces the following gradient evaluated on 𝒫_𝑡:

𝑔_{𝒫_𝑡}(𝜃_𝑡) = 𝛼𝑔_{𝒫′_𝑡}(𝜃_𝑡) + (1−𝛼)𝑔_𝒫₀(𝜃_𝑡)
= −𝛼·((1−𝛼)/𝛼)·𝑔_𝒫₀(𝜃_𝑡) + 𝛼𝜉(𝜃_𝑡)(𝜃_𝑡 − 𝜃∗) + (1−𝛼)𝑔_𝒫₀(𝜃_𝑡)
= 𝛼𝜉(𝜃_𝑡)(𝜃_𝑡 − 𝜃∗).

Now let 𝑐 = min_{𝜆∈[0,1]} 𝜉(𝜆𝜃₀ + (1−𝜆)𝜃∗). Applying the strategy repeatedly across time steps yields

‖𝜃_𝑇 − 𝜃∗‖ ≤ ‖𝜃_{𝑇−1} − 𝜂𝛼𝜉(𝜃_{𝑇−1})(𝜃_{𝑇−1} − 𝜃∗) − 𝜃∗‖
≤ (1 − 𝜂𝛼𝜉(𝜃_{𝑇−1}))·‖𝜃_{𝑇−1} − 𝜃∗‖
≤ (1 − 𝜂𝛼𝑐)·‖𝜃_{𝑇−1} − 𝜃∗‖
≤ (1 − 𝜂𝛼𝑐)^𝑇·‖𝜃₀ − 𝜃∗‖,

which yields 𝑆_𝑇(𝛼) = −‖𝜃_𝑇 − 𝜃∗‖ ≥ −(1 − 𝜂𝛼𝑐)^𝑇·‖𝜃₀ − 𝜃∗‖. Setting 𝐶(𝛼) = 𝛼𝑐 concludes the proof.

Appendix C Additional Experimental Details

The dataset introduced by jiechieu2021skills is available at https://github.com/florex/resume_corpus. We preprocessed each resume by removing its first 20 words. The reason is that the opening of the resume essentially encodes the associated skills, since the dataset creation process extracted the skills from the opening of the resume. Removing the first 20 words leads to a more realistic classification task.
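The preprocessing step is a one-liner; a sketch:

```python
def strip_opening(resume_text, n_words=20):
    """Drop the first n_words of a resume; the dataset's opening words
    essentially encode the skill labels, so removing them makes the
    classification task more realistic."""
    return " ".join(resume_text.split()[n_words:])

cleaned = strip_opening(" ".join(f"w{i}" for i in range(30)))
```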

We used the HuggingFace transformers open-source Python library (wolf2020transformers). We used the distilbert-base-uncased model from the library corresponding to the DistilBERT transformer model (sanh2019distilbert). We used the HuggingFace Trainer module for training with its default settings. We also experimented with larger models, including RoBERTa (both roberta-base and roberta-large), but we did not see any improvements in classification accuracy from using these larger models.

The DistilBERT tokenizer has a vocabulary of 30522 tokens of which thousands are unused in the resume corpus. We picked token 1240 corresponding to a small dash, which we inserted in the resume every 20 words. Our findings are largely insensitive to the trigger spacing as Figure 7 shows.

Figure 7: Trigger spacing is largely irrelevant.
