Title: Which Tricks Are Important for Learning to Rank?

URL Source: https://arxiv.org/html/2204.01500

Ivan Lyzhin, Aleksei Ustimenko, Andrey Gulin, Liudmila Prokhorenkova

Abstract

Nowadays, state-of-the-art learning-to-rank methods are based on gradient-boosted decision trees (GBDT). The most well-known algorithm is LambdaMART, which was proposed more than a decade ago. Recently, several other GBDT-based ranking algorithms were proposed. In this paper, we thoroughly analyze these methods in a unified setup. In particular, we address the following questions. Is direct optimization of a smoothed ranking loss preferable to optimizing a convex surrogate? How should surrogate ranking losses be constructed and smoothed? To address these questions, we compare LambdaMART with the YetiRank and StochasticRank methods and their modifications. We also propose a simple improvement of the YetiRank approach that allows for optimizing specific ranking loss functions. As a result, we gain insights into learning-to-rank techniques and obtain a new state-of-the-art algorithm.

Keywords: Learning to rank, GBDT, LambdaMART, StochasticRank, YetiRank, YetiLoss

1 Introduction

Learning to rank (LTR) is a central problem in information retrieval (Burges, 2010). The objective of LTR is to rank a given set of items to optimize the overall utility of the list. In information retrieval, given a query and a list of candidate documents, it is desirable to rank the documents in decreasing order of relevance to the query. Usually, LTR methods learn a function computing scores for all documents and then sort the documents according to their scores. Thus, a major challenge for LTR is that the sorting operation is non-differentiable, which prevents effective gradient-based optimization.

Despite the recent success of neural approaches in various machine learning tasks, gradient-boosted decision trees (GBDT) are still state-of-the-art algorithms for tabular datasets containing heterogeneous and noisy features (Gorishniy et al., 2021; Katzir et al., 2021). In particular, GBDT methods outperform other approaches for the learning-to-rank problem. A recent paper by Qin et al. (2021) observes that for standard tabular LTR datasets, the best results are achieved by a classic LambdaMART algorithm (Wu et al., 2010) proposed more than a decade ago and implemented within the LightGBM library (Ke et al., 2017). Thus, our paper focuses on GBDT-based learning-to-rank methods and provides an extensive study of existing and new approaches and their important algorithmic details.

In addition to LambdaMART, we revisit another long-known algorithm called YetiRank (Gulin et al., 2011). YetiRank won the Transfer Learning Track of Yahoo!'s LTR challenge in 2010 but was not much explored after that. The main algorithmic differences of YetiRank compared to LambdaMART are the stochastic smoothing used during training to randomize the orderings and a different way of constructing an upper bound for the loss function with a tighter approximation. Our empirical evaluation shows that YetiRank outperforms LambdaMART and other competitors in most cases. In addition, we show that YetiRank implicitly optimizes a particular form of the DCG ranking quality function. This observation allows us to modify the YetiRank procedure by re-weighting the gradients according to a proper loss. The proposed modification, YetiLoss, further improves the performance of YetiRank for specific losses, for instance, MRR and MAP. For MAP, we obtain new state-of-the-art results with YetiLoss. We also explain why YetiLoss is expected to have more stable optimization and better generalization.

A recent work by Ustimenko & Prokhorenkova (2020) shows that the LambdaMART approach, implemented within the CatBoost library (Prokhorenkova et al., 2018), can be outperformed by an algorithm called StochasticRank. StochasticRank directly optimizes any given ranking loss with provable convergence guarantees. For this purpose, it uses stochastic smoothing of the loss function (without upper bounding). The obtained objective is often non-convex, so it is optimized with Stochastic Gradient Langevin Boosting (Ustimenko & Prokhorenkova, 2020) — a boosting algorithm designed to optimize non-convex functions.

The main difference between StochasticRank and YetiLoss is that the former optimizes the training loss directly, while the latter constructs special convex bounds on the loss function at each iteration. While minimizing such bounds cannot guarantee reaching the global optimum, convexity allows for more efficient optimization, which can often be beneficial. We provide extensive experiments to validate in which scenarios direct optimization is helpful.

To sum up, our main contributions are the following.

We conduct a thorough analysis of several state-of-the-art GBDT-based LTR algorithms. We provide a theoretical explanation of their differences and extensive empirical evaluation.

We analyze the effect of several algorithmic details: optimizing the loss function directly versus constructing a convex upper bound, adding stochastic smoothing to the loss function, and different ways of constructing an upper bound.

We propose a simple extension of the YetiRank algorithm that handles arbitrary loss functions. The obtained algorithm achieves new state-of-the-art results for such loss functions as MRR and MAP that are often overlooked in LTR literature.

In the next section, we give the necessary background: define the learning-to-rank problem and popular quality functions, and then describe several well-known ranking algorithms. In Section 3, we propose our modification of YetiRank that allows for optimizing any given ranking loss and also analyze and compare the properties of all the considered algorithms. In Section 4, we conduct a thorough comparison of existing LTR algorithms on several benchmarks, show that YetiLoss outperforms the competitors for specific ranking quality functions, and analyze the effect of the main algorithmic details on the quality of LTR. Section 5 concludes the paper.

2 Background and Related Work

2.1 Learning-to-rank Problem

Learning to rank is a classic information retrieval problem (Liu, 2009). To formally define the LTR task, consider a query $q$ sampled from a distribution $Q$. For this query, we are given a set of documents $\{x_1, \dots, x_{n_q}\}$ and a relevance vector $r = (r_1, \dots, r_{n_q}) \in \mathbb{R}^{n_q}$. Each document $x_i$ is represented as a vector of features that describes the query-document pair.

A learning-to-rank model is a function $f(x)$ that takes the document's features $x_i$ and returns a value $z_i$, the predicted relevance for this document. For a query $q$, we compute the vector $z = (f(x_1), \dots, f(x_{n_q})) \in \mathbb{R}^{n_q}$, called the vector of scores, and then compute $s = \operatorname{argsort}(z)$, the ranking of documents predicted by the model $f(x)$.

A ranking loss function $L(z, r)$ measures how well the ranking $s = \operatorname{argsort}(z)$ agrees with the relevance vector $r$. To simplify the notation, we also write $L(f, q)$, meaning that $\{x_1, \dots, x_{n_q}\}$ and $r$ are contained in $q$ and $z = (f(x_1), \dots, f(x_{n_q}))$ is computed internally.

The ultimate goal of LTR is to find a model $f(x) \in \mathcal{F}$ ($\mathcal{F}$ is a predefined class of models) such that:

$$f = \arg\min_{f \in \mathcal{F}} \mathbb{E}_{q \sim Q}\, L(f, q). \qquad (1)$$

Unfortunately, in practice, we do not know the distribution $Q$ but rather have an i.i.d. sample $q_1, \dots, q_N \sim Q$, which we call the training dataset. Thus, instead of (1), we solve:

$$f = \arg\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} L(f, q_i). \qquad (2)$$

The function $\mathcal{L}_N(f) := \frac{1}{N} \sum_{i=1}^{N} L(f, q_i)$ is called the empirical loss function.

The main difficulty of the LTR problem is that $\mathcal{L}_N(f)$ is locally constant as a function of $z$, discontinuous, and non-convex in $z$, since each $L(z, r)$ is defined by a permutation $s = \operatorname{argsort}(z)$. As a result, $\mathcal{L}_N(f)$ does not have a well-defined derivative (the derivatives of $L(z, r)$ are zero everywhere due to local constancy). Thus, one cannot directly optimize this loss with standard methods like gradient boosting or backpropagation in neural networks.
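To make the local constancy concrete, the following plain-Python sketch (with a toy loss of our own choosing, not one from the paper) shows that perturbing a score slightly leaves the induced permutation, and hence any argsort-based loss, unchanged:

```python
def argsort_desc(z):
    # Indices of documents sorted by decreasing score.
    return sorted(range(len(z)), key=lambda i: -z[i])

def rank_loss(z, r):
    # A toy ranking loss: 1 minus the relevance of the top-ranked document.
    s = argsort_desc(z)
    return 1.0 - r[s[0]]

z = [0.5, 0.2, 0.9]
r = [1.0, 0.0, 0.0]

base = rank_loss(z, r)
# A small perturbation of a score does not change the ordering, so the
# finite-difference "gradient" of the loss is exactly zero.
z_eps = [0.5 + 1e-6, 0.2, 0.9]
assert rank_loss(z_eps, r) == base  # locally constant
```

Since every such finite difference vanishes almost everywhere, gradient-based learners see no signal from the raw ranking loss, which motivates the surrogates and smoothings discussed below.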

2.2 Ranking Loss Functions

In this section, we define some widely used ranking loss functions. First, recall that all ranking losses depend on the permutation $s = \operatorname{argsort}(z)$. However, it can be the case that two documents $x_i$ and $x_j$ have the same score, i.e., $z_i = z_j$. This situation is called a tie, and in this case, $\operatorname{argsort}(z)$ is not well-defined. Ties may occur due to the model's discrete structure (e.g., if $f$ is a GBDT model) or when two different documents have the same feature vectors. Although ties are often neglected in the LTR literature, they can affect the optimization procedure and lead to an overestimated ranking quality (Ustimenko & Prokhorenkova, 2020). Therefore, Ustimenko & Prokhorenkova (2020) propose resolving the ties as $s = \operatorname{argsort}((z, -r))$, meaning that the documents are ordered according to $z$, but in the case of ties, we order them according to $-r$, i.e., put the less relevant document first. In other words, the worst permutation is used for ties. In this paper, we follow this tie-resolution policy.
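This worst-case tie resolution can be sketched in plain Python with a lexicographic sort key: rank by decreasing score, and break ties by increasing relevance, so the less relevant document comes first:

```python
def rank_with_worst_ties(z, r):
    # Sort by decreasing score z; among tied scores, put the less
    # relevant document first (i.e., sort by increasing relevance).
    return sorted(range(len(z)), key=lambda i: (-z[i], r[i]))

z = [0.7, 0.7, 0.3]   # documents 0 and 1 are tied
r = [1.0, 0.0, 1.0]

s = rank_with_worst_ties(z, r)
# Within the tied pair, the less relevant document (index 1) comes first.
assert s == [1, 0, 2]
```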

Let $s = \operatorname{argsort}((z, -r))$ and let $n = n_q$ be the length of the vectors $z$ and $r$. Probably the most well-known ranking quality function is $\mathrm{DCG}@k$:

$$\mathrm{DCG}@k(z, r) = \sum_{i=1}^{\min\{n, k\}} \frac{2^{r_{s_i}} - 1}{2^4 \log_2(i + 1)}, \qquad (3)$$

where $r_i \in [0, 4]$ are relevance labels. This quality function is called Discounted Cumulative Gain: we divide the gain for the relevance by the discount for a lower position. Different options for gain and discount functions are possible, and (3) is the most widely used combination.
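As a direct transcription of (3), the following plain-Python sketch computes DCG@k, including the constant $2^4$ factor in the denominator used with labels in $[0, 4]$ (positions in the loop are 0-based, hence `log2(i + 2)`):

```python
import math

def dcg_at_k(z, r, k):
    s = sorted(range(len(z)), key=lambda i: -z[i])  # rank by decreasing score
    return sum((2 ** r[s[i]] - 1) / (2 ** 4 * math.log2(i + 2))
               for i in range(min(len(z), k)))

z = [0.9, 0.1, 0.5]
r = [4, 0, 2]
# The most relevant document is ranked first and contributes
# (2**4 - 1) / (2**4 * log2(2)) = 15/16 to the sum.
assert dcg_at_k(z, r, 1) == 15 / 16
```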

$\mathrm{NDCG}@k$ is a normalized variant of $\mathrm{DCG}@k$:

$$\mathrm{NDCG}@k(z, r) = \frac{\mathrm{DCG}@k(z, r)}{\max_{z' \in \mathbb{R}^n} \mathrm{DCG}@k(z', r)}.$$

Usually, LTR papers report the values of $\mathrm{NDCG}@k$, while other measures can be of interest in practice. For instance, for binary relevance labels $r_j \in \{0, 1\}$, one can be interested in Average Precision ($\mathrm{AP}$):

$$\mathrm{AP}(z, r) = \frac{\sum_{i=1}^{n} P(i)\, r_{s_i}}{\sum_{i=1}^{n} r_i}, \quad \text{where } P(i) = \frac{1}{i} \sum_{j=1}^{i} r_{s_j}.$$
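The AP definition above translates to a short plain-Python sketch (precision at each position, accumulated at the positions of relevant documents):

```python
def average_precision(z, r):
    # AP for binary relevance labels r in {0, 1}.
    s = sorted(range(len(z)), key=lambda i: -z[i])
    total_relevant = sum(r)
    ap = 0.0
    for i in range(len(z)):                                 # 0-based position
        prec_at_i = sum(r[s[j]] for j in range(i + 1)) / (i + 1)
        ap += prec_at_i * r[s[i]]
    return ap / total_relevant

z = [0.9, 0.8, 0.1]
r = [1, 0, 1]
# Relevant documents land at positions 1 and 3: AP = (1/1 + 2/3) / 2 = 5/6.
assert abs(average_precision(z, r) - 5 / 6) < 1e-12
```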

Mean Average Precision ($\mathrm{MAP}$) is $\mathrm{AP}$ averaged over the queries. Another measure used for binary labels is Reciprocal Rank (called $\mathrm{MRR}$ after the averaging):

$$\mathrm{RR}(z, r) = \sum_{i=1}^{n} \frac{r_{s_i}}{i} \prod_{j=1}^{i-1} (1 - r_{s_j}), \qquad r_j \in \{0, 1\},$$

which is the inverse rank of the first relevant document. Finally, Expected Reciprocal Rank ($\mathrm{ERR}$) is a similar measure applied to non-binary labels:

$$\mathrm{ERR}(z, r) = \sum_{i=1}^{n} \frac{r_{s_i}}{i} \prod_{j=1}^{i-1} (1 - r_{s_j}), \qquad r_j \in [0, 1].$$
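Both RR and ERR share the same formula and can be sketched in plain Python with a running "not stopped yet" product; with binary labels the product zeroes out every term after the first relevant document, so the sum reduces to its inverse rank:

```python
def err(z, r):
    # (Expected) Reciprocal Rank; r in {0,1} gives RR, r in [0,1] gives ERR.
    s = sorted(range(len(z)), key=lambda i: -z[i])
    value, not_stopped = 0.0, 1.0
    for i, doc in enumerate(s, start=1):        # i is the 1-based position
        value += not_stopped * r[doc] / i
        not_stopped *= 1.0 - r[doc]
    return value

# Binary labels: the first relevant document is at position 2, so RR = 1/2.
assert err([0.9, 0.7, 0.3], [0, 1, 1]) == 0.5
# Graded labels in [0, 1] give the ERR generalization:
# 0.5/1 + (1 - 0.5) * 1.0/2 = 0.75.
assert abs(err([0.9, 0.7, 0.3], [0.5, 1.0, 0.0]) - 0.75) < 1e-12
```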

The above examples are ranking quality functions, i.e., higher values are preferred. When we need to convert a quality function $M(z, r)$ into a loss function, we can simply take $L(z, r) := 1 - M(z, r)$. Note that in our experiments, we report ranking quality values multiplied by 100.

2.3 Ranking Algorithms

This section describes LTR algorithms that are of particular interest to the current research. For more methods and a broader overview, we refer to Liu (2009); Ustimenko & Prokhorenkova (2020); Qin et al. (2021).

As discussed in Section 2.1, the ranking loss function $\mathcal{L}_N(f)$ is non-differentiable. Thus, if we want to apply a convenient optimization method like gradient boosting or backpropagation, we need to build a differentiable surrogate. One of the most popular approaches is the Majorization-Minimization technique, where at each iteration $t$ we build a function $l_t(z, r)$, usually convex and differentiable, such that:

$$L(z, r) \le l_t(z, r) \quad \forall z \in \mathbb{R}^{n_q}. \qquad (4)$$

After that, we sum up the upper bounds (4) for all the queries and obtain:¹

$$\mathcal{L}_N(f) \le \frac{1}{N} \sum_{i=1}^{N} l_t(f, q_i) = \mathcal{L}_t^*(f). \qquad (5)$$

¹As before, we denote by $l_t(f, q)$ the value of $l_t(z, r)$, where we set $z = (f(x_1), \dots, f(x_{n_q}))$.

The loss function $\mathcal{L}_t^*(f)$ can be optimized directly by gradient boosting or backpropagation.

Many popular learning-to-rank algorithms like LambdaRank (Burges et al., 2007), LambdaMART (Wu et al., 2010), or LambdaLoss (Wang et al., 2018) fall into this category. In this section, we mostly follow the reasoning of Wang et al. (2018).

To get a differentiable upper bound, these algorithms first build a bound of the form:

$$L(z, r) \le \mathrm{const} + \sum_{i,j=1}^{n_q} w_{ij} \cdot \mathbb{1}_{z_j - z_i \ge 0}. \qquad (6)$$

After that, one can use the inequality $\mathbb{1}_{z_j - z_i \ge 0} \le \log_2(1 + e^{-(z_i - z_j)})$ to obtain a convex differentiable bound:

$$L(z, r) \le \mathrm{const} + \sum_{i,j=1}^{n_q} w_{ij} \log_2(1 + e^{-(z_i - z_j)}). \qquad (7)$$
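The key inequality behind (7) is easy to verify numerically; the following plain-Python check confirms that the step function is dominated by the logistic term for any score difference, with equality exactly at a tie:

```python
import math

def indicator(z_i, z_j):
    # 1[z_j - z_i >= 0]: document j is scored at least as high as document i.
    return 1.0 if z_j - z_i >= 0 else 0.0

def logistic_bound(z_i, z_j):
    # The convex differentiable majorant log2(1 + exp(-(z_i - z_j))).
    return math.log2(1.0 + math.exp(-(z_i - z_j)))

for d in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    assert indicator(d, 0.0) <= logistic_bound(d, 0.0)
# At a tie (z_i == z_j), the bound is tight: both sides equal 1.
assert logistic_bound(0.0, 0.0) == 1.0
```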

To be more precise, for $\mathrm{NDCG}@k$, LambdaRank and LambdaMART select the weights $w_{ij}$ as:

$$w_{ij} = \Delta_{ij}\mathrm{NDCG}@k(z, r) \cdot \mathbb{1}_{r_i > r_j}, \qquad (8)$$

where $\Delta_{ij}\mathrm{NDCG}@k(z, r)$ denotes the difference in $\mathrm{NDCG}@k(z, r)$ caused by permuting the $i$-th and $j$-th documents in the sorting defined by $z$. Formally, recall that $s = \operatorname{argsort}(z)$ and consider the inverse permutation $p = s^{-1}$. The value $p_i$ is the position of the $i$-th document in the list ordered by $z$. Then,

$$\Delta_{ij}\mathrm{NDCG}@k(z, r) = \frac{2^{r_i} - 2^{r_j}}{2^4 \max_{z'} \mathrm{DCG}@k(z', r)} \cdot \left| \frac{1}{\log_2(1 + p_j)} - \frac{1}{\log_2(1 + p_i)} \right|. \qquad (9)$$

As proven by Wang et al. (2018), using such $w_{ij}$ indeed gives the desired upper bound. Similarly, one can adapt LambdaMART to any ranking metric by substituting $\Delta_{ij}\mathrm{NDCG}@k(z, r)$ in (8) with the difference of the target loss function. It is important to note that the upper bound property may not hold in this case. However, empirical evidence shows that this trick works well in practice.
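The Δ-NDCG@k term can also be computed by explicitly swapping the two documents in the current ranking, which coincides with the closed form in (9) when both positions fall inside the top-$k$; the plain-Python sketch below takes this swap-based route:

```python
import math

def dcg_of_order(order, r, k):
    # DCG@k of a given document ordering, with the 2**4 factor from Eq. (3).
    return sum((2 ** r[order[i]] - 1) / (2 ** 4 * math.log2(i + 2))
               for i in range(min(len(order), k)))

def delta_ndcg(z, r, i, j, k):
    s = sorted(range(len(z)), key=lambda d: -z[d])      # current ranking
    pos = {doc: idx for idx, doc in enumerate(s)}       # inverse permutation
    swapped = list(s)
    swapped[pos[i]], swapped[pos[j]] = swapped[pos[j]], swapped[pos[i]]
    ideal = dcg_of_order(sorted(range(len(r)), key=lambda d: -r[d]), r, k)
    return abs(dcg_of_order(swapped, r, k) - dcg_of_order(s, r, k)) / ideal

z = [0.2, 0.9, 0.5]
r = [4, 0, 2]
# Swapping the most relevant document (ranked last here) with the irrelevant
# one (ranked first) produces a large NDCG@3 change.
assert delta_ndcg(z, r, 0, 1, 3) > 0.0
```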

Now let us describe the YetiRank algorithm. The original formulation given by Gulin et al. (2011) does not specify which ranking metric is optimized, but the pairwise loss is similar to the ones described above. YetiRank uses the following convex differentiable loss:

$$l_t(z, r) = -\sum_{i,j=1}^{n_q} w_{ij} \log \frac{1}{1 + e^{-(z_i - z_j)}} = \sum_{i,j=1}^{n_q} w_{ij} \log\left(1 + e^{-(z_i - z_j)}\right), \qquad (10)$$

where the weights are chosen as:

𝑤 𝑖 ⁢ 𝑗

( 𝑟 𝑖 − 𝑟 𝑗 ) ⋅ 𝟙 𝑟 𝑖 > 𝑟 𝑗 ⋅ 𝔼 𝜖 ⁢ [ 𝑏 𝑝 𝑖 − 1 ⋅ 𝟙 | 𝑝 𝑖 − 𝑝 𝑗 |

1 ] , (11)

where 𝑏 ∈ ( 0 , 1 ) is a hyper-parameter and 𝑝 𝑖 is a random permutation obtained as the inverse of arg ⁢ sort ⁢ ( 𝑧 + 𝜖 ) with 𝜖

log ⁡ 𝑢 1 − 𝑢 , 𝑢 ∼ 𝑈 ⁢ ( [ 0 , 1 ] 𝑛 𝑞 ) , i.e., 𝜖 is distributed according to the Logistic distribution. In practice, to estimate 𝑤 𝑖 ⁢ 𝑗 , one adds random noise to the current scores to generate several random permutations. By default, ten permutations are used.222The complexity of computing the gradients is thus increased with the number of permutations, but this does not affect the overall time complexity much since the most time-consuming operation is building a decision tree, and this procedure is not affected. Importantly, for YetiRank, we can obtain only stochastic gradient estimates of ∇ 𝑧 𝑙 𝑡 ⁢ ( 𝑧 , 𝑟 ) due to the randomness of 𝜖 . We provide a deeper analysis of YetiRank in Section 3.
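A Monte Carlo estimate of the expectation in (11) can be sketched in plain Python: logistic noise is added to the scores, the noisy ranking is recomputed, and the neighboring-pair term is averaged over random permutations (ten by default, per the text). The full weight would additionally multiply this estimate by $(r_i - r_j) \cdot \mathbb{1}_{r_i > r_j}$; the parameter values below are illustrative only:

```python
import math
import random

def yetirank_expectation(z, i, j, b=0.5, n_perm=10, seed=0):
    # Estimates E_eps[ b**(p_i - 1) * 1[|p_i - p_j| = 1] ] by sampling.
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_perm):
        noisy = []
        for zk in z:
            u = rng.random()
            noisy.append(zk + math.log(u / (1.0 - u)))   # Logistic(0, 1) noise
        s = sorted(range(len(z)), key=lambda d: -noisy[d])
        p = {doc: pos + 1 for pos, doc in enumerate(s)}  # 1-based positions
        if abs(p[i] - p[j]) == 1:                        # neighbors only
            acc += b ** (p[i] - 1)
    return acc / n_perm

# Documents 0 and 1 have close scores, so they are often adjacent in the
# noisy ranking and receive a non-trivial weight.
w = yetirank_expectation([0.4, 0.3, -3.0], 0, 1, n_perm=1000)
assert 0.0 < w <= 1.0
```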

The CatBoost library also has a modification of YetiRank called YetiRankPairwise (CatBoost, 2023), which is described in Gulin et al. (2011). It is a GBDT-specific modification of YetiRank, allowing one to get more accurate results on large datasets. However, in our preliminary experiments, we have not observed consistent improvements of YetiRankPairwise over YetiRank. Since YetiRankPairwise is significantly slower than all other algorithms considered in this paper, we do not include it in our empirical analysis.

Finally, we overview the recently proposed StochasticRank algorithm (Ustimenko & Prokhorenkova, 2020). In sharp contrast with the previous approaches, which build convex differentiable surrogates to optimize a given non-convex ranking loss, StochasticRank smooths $L(z, r)$ directly using the Gaussian distribution $\mathcal{N}(-\mu r, \sigma I_{n_q})$:

$$l(z, r) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I_{n_q})}\, L(z - \mu r + \sigma \epsilon, r), \qquad (12)$$

where $\mu \ge 0$ and $\sigma > 0$ are hyper-parameters. Such smoothing leads to the loss function $\mathcal{L}_N(f, \sigma, \mu) = \frac{1}{N} \sum_{i=1}^{N} l(f, q_i)$, whose minimum point is arbitrarily close to the minimum of $\mathcal{L}_N(f)$; see Ustimenko & Prokhorenkova (2020) for the details. Moreover, the paper proposes novel stochastic gradient estimates that are more effective than the log-derivative trick (Nesterov & Spokoiny, 2017). Finally, StochasticRank relies on Stochastic Gradient Langevin Boosting (SGLB) (Ustimenko & Prokhorenkova, 2021), which guarantees that the optimization eventually reaches the global optimum even for the non-convex loss function $\mathcal{L}_N(f, \mu, \sigma)$.
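The smoothing in (12) can be approximated by plain Monte Carlo, as in the following Python sketch; the toy top-1 loss and the parameter values here are illustrative, not the ones used by StochasticRank:

```python
import random

def top1_loss(z, r):
    # A toy ranking loss: 1 minus the relevance of the top-ranked document.
    s = sorted(range(len(z)), key=lambda i: -z[i])
    return 1.0 - r[s[0]]

def smoothed_loss(loss, z, r, mu=0.1, sigma=0.5, n=2000, seed=0):
    # Monte Carlo estimate of E_eps[ L(z - mu*r + sigma*eps, r) ].
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z_pert = [zi - mu * ri + sigma * rng.gauss(0.0, 1.0)
                  for zi, ri in zip(z, r)]
        total += loss(z_pert, r)
    return total / n

z = [0.6, 0.5, -2.0]
r = [0.0, 1.0, 0.0]
# The raw loss is exactly 1 (an irrelevant document is on top), but the
# smoothed value lies strictly between 0 and 1: under noise, the relevant
# document sometimes reaches the top position.
assert top1_loss(z, r) == 1.0
val = smoothed_loss(top1_loss, z, r)
assert 0.0 < val < 1.0
```

Unlike the raw loss, the smoothed value changes continuously with the scores, which is what makes gradient-based optimization possible.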

3 Analysis of LTR Algorithms

In Section 4, we will show that YetiRank achieves state-of-the-art results for most datasets and quality functions. However, it is not specifically optimized for any of these metrics. In this section, we first discuss which loss function is optimized by YetiRank and then propose a simple modification called YetiLoss that can optimize any given ranking quality function. Finally, we discuss how YetiLoss relates to LambdaMART and StochasticRank.

3.1 YetiLoss Algorithm

Let us consider the following ranking quality function for a fixed $b \in (0, 1)$:

$$\mathrm{ExpDCG}(z, r) = \sum_{i=1}^{n_q} b^{i-1}\, r_{s_i}, \qquad (13)$$

where $s = \operatorname{argsort}(z)$. This loss resembles the classic DCG given in (3), but the gain is $r_{s_i}$ instead of $2^{r_{s_i}} - 1$, while the discount is exponential ($b^{i-1}$) instead of logarithmic ($\log_2(i + 1)$).

It follows from (11) and (13) that, up to a constant multiplier, we have

$$w_{ij} = \mathbb{1}_{r_i > r_j} \cdot \mathbb{E}_\epsilon \left[ \Delta_{ij}\mathrm{ExpDCG}(z + \epsilon, r) \cdot \mathbb{1}_{|p_i - p_j| = 1} \right],$$

which means that YetiRank optimizes ExpDCG, but in a smoothed form.

This observation suggests that to adapt YetiRank to an arbitrary loss function, we need to replace $\Delta_{ij}\mathrm{ExpDCG}$ with the difference in the desired ranking loss. For $\mathrm{NDCG}@k$, the expression is given in (9). For $\mathrm{MRR}$, we have:

$$\Delta_{ij}\mathrm{MRR}(z, r) = \left| \frac{1}{p_i} - \frac{1}{p_j} \right| \cdot \prod_{k=1}^{\min\{p_i, p_j\} - 1} (1 - r_{s_k}) \cdot \mathbb{1}_{r_i = 1 - r_j}.$$

Analogously, we obtain the expressions for $\mathrm{ERR}$ and $\mathrm{MAP}$. Similarly to LambdaMART, for an arbitrary metric (rather than $\mathrm{NDCG}$), we do not have the upper bound guarantees. However, our experiments show that YetiLoss outperforms standard YetiRank for $\mathrm{MRR}$ and $\mathrm{MAP}$ and in half of the cases for $\mathrm{ERR}$.

3.2 Comparison of YetiLoss and LambdaMART

For a given ranking loss function $M(z, r)$, LambdaMART and YetiLoss use the following weights:

$$w_{ij}^{LM} = \Delta_{ij} M(z, r) \cdot \mathbb{1}_{r_i > r_j},$$

$$w_{ij}^{YL} = \mathbb{E}_\epsilon\, \Delta_{ij} M(z + \epsilon, r) \cdot \mathbb{1}_{r_i > r_j} \cdot \mathbb{1}_{|p_i - p_j| = 1}.$$

Even though both weights share the term $\Delta_{ij} M \cdot \mathbb{1}_{r_i > r_j}$, they have principal differences, which we discuss below.

First, let us note that $w^{LM}$ is a discontinuous function of the argument $z$ (the model predictions). Hence, it may change drastically when the model $f$ changes only slightly. In contrast, $w^{YL}$ is a smooth function of $z$ due to smoothing with an absolutely continuous distribution. Training with a smooth objective is expected to be more robust; thus, we expect YetiLoss to have better generalization.

Let us also note that a similar smoothing was proposed in a recent paper by Bruch et al. (2020). The authors use a different smoothing distribution and also add noise to $f_i - f_j$, which appears in the logarithmic term multiplied by the weights. The authors confirm that smoothing the loss function improves the performance. While comparing with Bruch et al. (2020) is out of the scope of the current paper, since we focus on widely used open-source implementations, we do empirically evaluate the importance of smoothing and of different smoothing distributions in Section 4.3.

The second difference is that $w^{YL}$ contains $\Delta_{ij}$ only for the neighboring documents with $|p_i - p_j| = 1$. In other words, YetiLoss uses only those pairwise permutations that can be realized by small changes in $z$. In contrast, $w^{LM}$ does not have this restriction and contains pairwise permutations that cannot be realized by small changes of $z$. This implies that for LambdaMART, the resulting gradients may contain redundant information about permutations that cannot be achieved within one gradient-boosting step with a small step size. Hence, the obtained upper bound on the ranking loss can be too loose. In Section 4.3, we empirically evaluate the effect of this difference.

3.3 Comparison of YetiLoss and StochasticRank

The approach underlying the StochasticRank algorithm significantly differs from both YetiLoss and LambdaMART (see Section 2.3). However, it turns out that StochasticRank has a structural similarity with YetiLoss but not with LambdaMART. This similarity lies in the pairwise permutations that YetiLoss incorporates. We already mentioned that LambdaMART relies on all possible pairwise permutations to build the surrogate loss $\mathcal{L}_t^*$. In contrast, YetiLoss uses only those that permute neighboring elements in the sorted list. Similarly, we can show that StochasticRank also relies only on such permutations. To see that, we write down the formula for the $\frac{\partial}{\partial z_i} l(z, r, \sigma, \mu)$ estimate introduced by Ustimenko & Prokhorenkova (2020) and called Coordinate Conditional Sampling (CCS). Let us assume for simplicity that $\mu = 0$:

$$\frac{\partial^{CCS}}{\partial z_i} l(z, r, \sigma) = (2\pi\sigma^2)^{-\frac{1}{2}} \cdot \sum_{j=0}^{n_q} \delta_{i,j} L(z + \epsilon, r)\, e^{-\frac{1}{2\sigma^2}\left(z_i - z_j + \sigma(\epsilon_i - \epsilon_j)\right)^2},$$

where $\delta_{i,j}$ corresponds to the difference of the ranking loss if we change only $z_i$ without changing the remaining values, so that the $i$-th item is placed after the $j$-th item for $j > 0$. For $j = 0$, it means the difference if we put the $i$-th item in the first position.

Next, if we assume that all $z_i$ are different, the super-exponential decay of $e^{-y^2}$ ensures that for small enough $\sigma \approx 0$, the latter can be approximated as:

$$\frac{\partial^{CCS}}{\partial z_i} l(z, r, \sigma) \approx (2\pi\sigma^2)^{-\frac{1}{2}} \cdot \left( \delta_{i,a} L(z + \epsilon, r)\, e^{-\frac{1}{2\sigma^2}(z_i - z_a)^2} + \delta_{i,b} L(z + \epsilon, r)\, e^{-\frac{1}{2\sigma^2}(z_i - z_b)^2} \right),$$

where $a$ is the index of the largest $z_a$ that is smaller than $z_i$, and $b$ is the index of the smallest $z_b$ that is larger than $z_i$, i.e., exactly the pairwise permutation of neighboring items.

We note that the approximation can be made arbitrarily precise by letting 𝜎 → 0 + . We also note that 𝜎 ≈ 0 is preferred by the algorithm’s design to achieve the theoretical guarantees on the optimization of ℒ 𝑁 by StochasticRank.

This observation suggests that the pairwise permutations of neighboring items are the only permutations that can be achieved by small perturbations of the model we are learning. Hence, the surrogate $\mathcal{L}_t^*$ that arises from YetiLoss contains more information about the local behavior of the ranking loss function $\mathcal{L}_N$.

The similarity between YetiLoss and StochasticRank allows us to address the following question in the next section: is direct optimization of the non-convex smoothed loss function better than optimization of a convex surrogate loss?

4 Experiments

This section empirically compares popular LTR algorithms, evaluates the proposed modification called YetiLoss, and analyzes the importance of several algorithmic details in a unified setup.

Table 1: Learning-to-rank datasets

| Dataset | # features | # queries (train / validation / test) | # documents (train / validation / test) |
| --- | --- | --- | --- |
| Web10K | 136 | 6,000 / 2,000 / 2,000 | 723,412 / 235,259 / 241,521 |
| Web30K | 136 | 18,919 / 6,306 / 6,306 | 2,270,296 / 747,218 / 753,611 |
| Yahoo S1 | 699 | 19,944 / 2,944 / 6,983 | 473,134 / 71,083 / 165,660 |
| Yahoo S2 | 699 | 1,266 / 1,266 / 3,798 | 34,815 / 34,881 / 103,174 |
| Istella | 220 | 20,307 / 2,912 / 9,799 | 5,497,064 / 1,828,561 / 3,129,004 |
| Istella-S | 220 | 19,246 / 7,211 / 6,562 | 2,043,304 / 684,076 / 681,250 |

4.1 Setup

Datasets

We use six publicly available datasets. The first two are Web10K and Web30K released by Microsoft (Qin & Liu, 2013). Following previous studies (Qin et al., 2021; Ustimenko & Prokhorenkova, 2020; Wang et al., 2018), we use Fold 1 for these two datasets. We also use two datasets from YAHOO! Learning to Rank Challenge (Chapelle & Chang, 2011). Finally, we take Istella and Istella-S datasets (Dato et al., 2016). All datasets except for Istella are pre-divided into the train, validation, and test sets. For Istella, there is no standard validation set, so we randomly divided the train part into train and validation. Table 1 overviews the datasets used in the current study.

Ranking loss functions

We compare the algorithms according to all quality functions described in Section 2.2.

The first one is NDCG@10, which is very popular in LTR research. In some of our experiments, we also use NDCG@1 and NDCG@5. We observe no significant differences in the obtained results, so we omit these losses from the main text. Some results for NDCG@1 and NDCG@5 can be found in the Appendix.

The second quality function is MRR, a well-known click-based metric. As MRR requires binary labels, we binarize each label as $\tilde{y}_i := \mathbb{1}\{y_i > 0\}$. MRR is often used in practice, while it is much less studied than NDCG@k.

We also consider MAP , for which we binarize the labels in the same way.

Finally, for ERR, we convert the labels to $[0, 1]$ as $\tilde{y}_i := y_i / 4$. Using all these different loss functions is essential for analyzing the generalizability of the approaches to different ranking tasks.

Algorithms

All the algorithms are implemented and available within the official CatBoost gradient boosting library (Prokhorenkova et al., 2018; CatBoost, 2023).

For YetiRank, we use the original implementation, which we modify to obtain YetiLoss compatible with NDCG , MRR , ERR , and MAP .

For StochasticRank, we use the official implementation for NDCG , while we also add MRR and ERR loss functions to the optimization. MAP is not supported in StochasticRank out of the box, and Ustimenko & Prokhorenkova (2020) only provide algorithms for efficiently computing ERR-like and DCG-like metrics. Making an efficient and correct gradient estimate for MAP in StochasticRank can be a subject of a separate investigation, which is why we only implemented MRR and ERR .

For LambdaMART, we implement the original algorithm (Wu et al., 2010) within the CatBoost library with all ranking quality functions considered in our analysis.

For completeness of the analysis, we also add the CatBoost trained in the QueryRMSE regime (CatBoost, 2023). In this case, the algorithm optimizes RMSE averaged over the queries. This simple loss function can be considered as a baseline for other methods.

Finally, we also add the implementation of LambdaMART provided by Microsoft within the LightGBM library (Ke et al., 2017). This implementation is known to outperform other open-source versions and to achieve state-of-the-art LTR results for NDCG (Qin et al., 2021), while it does not allow for optimizing MRR, ERR, or MAP. Let us note, however, that LightGBM and CatBoost are different implementations of GBDT. This fact prevents a fully direct comparison of LTR algorithms. For instance, in contrast to LightGBM, CatBoost uses oblivious decision trees, where the same splitting criterion is used for all tree nodes at a given depth.

Parameter tuning

For parameter tuning, we use 20 iterations of random search followed by 80 iterations of Bayesian optimization (Balandat et al., 2020).³ For all algorithms, we set the maximum number of trees to 1000. We choose the best parameters, including the optimal number of trees, using the value of the desired loss function on the validation set. The list of tuned parameters is given in Appendix A.

³Small variations of the results compared to previous papers can be explained by the particular parameter optimization used.

Table 2: Comparison of learning-to-rank algorithms with tuned hyper-parameters

NDCG@10

| Algorithm | Web10K | Web30K | Yahoo S1 | Yahoo S2 | Istella | Istella-S |
| --- | --- | --- | --- | --- | --- | --- |
| LightGBM | 50.39 | 52.21 | 78.88 | 77.25 | 72.89 | 76.48 |
| QueryRMSE | 50.55 | 52.19 | 79.10 | 78.01 | 69.66 | 74.83 |
| LambdaMART | 49.38 | 50.93 | 78.63 | 77.31 | 69.83 | 75.55 |
| StochasticRank | 50.34 | 51.63 | 78.98 | 77.77 | 68.46 | 74.88 |
| YetiRank | 50.75 | 52.31 | 79.19 | 77.99 | 73.02 | 77.15 |
| YetiLoss | 50.76 | 51.90 | 78.67 | 77.77 | 71.60 | 76.32 |

MRR

| Algorithm | Web10K | Web30K | Yahoo S1 | Yahoo S2 | Istella | Istella-S |
| --- | --- | --- | --- | --- | --- | --- |
| QueryRMSE | 83.47 | 85.38 | 91.03 | 92.56 | 96.23 | 97.63 |
| LambdaMART | 82.59 | 84.12 | 90.75 | 92.34 | 95.18 | 96.96 |
| StochasticRank | 83.51 | 85.43 | 90.72 | 93.08 | 96.54 | 97.83 |
| YetiRank | 84.21 | 85.58 | 91.20 | 93.05 | 96.70 | 97.58 |
| YetiLoss | 84.17 | 86.01 | 90.88 | 93.21 | 97.16 | 97.97 |

MAP

| Algorithm | Web10K | Web30K | Yahoo S1 | Yahoo S2 | Istella | Istella-S |
| --- | --- | --- | --- | --- | --- | --- |
| QueryRMSE | 62.11 | 63.51 | 86.06 | 88.40 | 78.11 | 88.59 |
| LambdaMART | 57.93 | 58.58 | 85.91 | 89.14 | 80.80 | 90.28 |
| YetiRank | 62.07 | 63.39 | 85.94 | 88.00 | 80.01 | 89.09 |
| YetiLoss | 62.62 | 64.03 | 86.46 | 89.37 | 82.66 | 91.07 |

ERR

| Algorithm | Web10K | Web30K | Yahoo S1 | Yahoo S2 | Istella | Istella-S |
| --- | --- | --- | --- | --- | --- | --- |
| QueryRMSE | 57.27 | 59.30 | 66.76 | 67.16 | 85.29 | 85.55 |
| LambdaMART | 56.54 | 58.76 | 66.22 | 66.55 | 84.71 | 85.24 |
| StochasticRank | 56.94 | 59.34 | 66.64 | 67.17 | 85.44 | 85.98 |
| YetiRank | 57.41 | 59.53 | 66.69 | 67.27 | 87.29 | 86.97 |
| YetiLoss | 57.46 | 59.53 | 66.71 | 67.15 | 86.83 | 86.59 |

4.2 Comparison of LTR Algorithms

The results of the comparison are presented in Table 2. The best results are highlighted. The differences between the highlighted results and the other algorithms are statistically significant according to a paired one-tailed t-test with p-value < 0.05.

We make the following observations. First, we confirm the conclusion of Ustimenko & Prokhorenkova (2020) that StochasticRank outperforms LambdaMART in most cases. However, YetiRank beats both of them, leading to state-of-the-art results, especially for NDCG@10. In turn, the proposed YetiLoss modification of YetiRank achieves the best results for MAP, outperforming the competitors by a considerable margin. It is also the best on most datasets for MRR and performs similarly to YetiRank for ERR. For NDCG@10, YetiRank outperforms YetiLoss: this can be explained by the fact that the loss function optimized within YetiRank is similar to NDCG.

Table 3: The effect of stochastic smoothing for YetiLoss (each cell lists Web10K / Yahoo S2 / Istella-S)

| Smoothing | NDCG@10 | MRR | MAP |
| --- | --- | --- | --- |
| Logistic | 50.76 / 77.77 / 76.32 | 84.17 / 93.21 / 97.97 | 62.62 / 89.37 / 91.07 |
| Gaussian | 50.79 / 77.55 / 76.28 | 84.28 / 93.61 / 97.83 | 62.52 / 89.37 / 91.08 |
| No smoothing | 48.49 / 75.29 / 74.55 | 82.70 / 91.94 / 97.38 | 61.32 / 88.13 / 89.89 |

Figure 1: The effect of the number of neighbors, relative loss compared to YetiLoss. (a) MAP; (b) ERR.

Interestingly, the simple QueryRMSE method is very competitive in some cases. For instance, for NDCG ⁢ @ ⁢ 10 , it is superior to LightGBM, LambdaMART, and StochasticRank in at least half of the cases. QueryRMSE also achieves the best results on Yahoo S2 for NDCG ⁢ @ ⁢ 10 and on Yahoo S1 for ERR . However, in most cases, YetiLoss is significantly better than QueryRMSE. Our experiments with QueryRMSE show that considering such simple ranking-agnostic baselines is important for a fair analysis.

By comparing YetiLoss with StochasticRank, we can see that convexity of the optimization problem is usually preferred over direct optimization of a non-convex smoothed ranking loss: except for some results on Yahoo datasets, YetiLoss outperforms StochasticRank.

To better understand the differences between LTR algorithms, we also analyze their generalization ability. These results are shown in Appendix B.

Comparison with default hyper-parameters

To evaluate the effect of parameter tuning on the reported results, we additionally evaluate all the algorithms with their default hyper-parameters (only the optimal number of trees is chosen on the validation set). The results are shown in Table 4. Note that all our observations also hold in this setup. Indeed, YetiLoss is the best with a significant margin for MRR and MAP. Moreover, it wins on half of the datasets for ERR. For NDCG@10, similarly to Table 2, the best results are achieved with YetiRank.

Table 4: Comparison of learning-to-rank algorithms with default hyper-parameters

NDCG@10

| Algorithm  | Web10K | Web30K | Yahoo S1 | Yahoo S2 | Istella | Istella-S |
|------------|--------|--------|----------|----------|---------|-----------|
| LightGBM   | 50.34  | 51.06  | 78.34    | 76.96    | 67.71   | 73.67     |
| QueryRMSE  | 49.82  | 50.68  | 78.34    | 78.03    | 65.45   | 72.60     |
| LambdaMART | 48.56  | 49.50  | 77.44    | 76.46    | 66.25   | 73.37     |
| YetiRank   | 50.18  | 50.91  | 78.58    | 78.15    | 68.86   | 74.72     |
| YetiLoss   | 50.14  | 50.91  | 78.03    | 77.55    | 68.73   | 74.62     |

MRR

| Algorithm  | Web10K | Web30K | Yahoo S1 | Yahoo S2 | Istella | Istella-S |
|------------|--------|--------|----------|----------|---------|-----------|
| QueryRMSE  | 83.44  | 84.87  | 90.78    | 93.25    | 94.76   | 96.89     |
| LambdaMART | 82.36  | 83.97  | 90.72    | 91.84    | 94.73   | 96.81     |
| YetiRank   | 83.81  | 85.10  | 91.02    | 93.16    | 95.59   | 97.09     |
| YetiLoss   | 84.50  | 85.98  | 91.14    | 93.76    | 96.40   | 97.71     |

MAP

| Algorithm  | Web10K | Web30K | Yahoo S1 | Yahoo S2 | Istella | Istella-S |
|------------|--------|--------|----------|----------|---------|-----------|
| QueryRMSE  | 61.85  | 63.00  | 85.66    | 88.40    | 72.61   | 85.95     |
| LambdaMART | 56.89  | 55.81  | 85.92    | 89.14    | 75.21   | 87.49     |
| YetiRank   | 61.80  | 63.03  | 85.66    | 88.00    | 75.36   | 86.82     |
| YetiLoss   | 62.43  | 63.58  | 86.41    | 89.36    | 76.61   | 88.14     |

ERR

| Algorithm  | Web10K | Web30K | Yahoo S1 | Yahoo S2 | Istella | Istella-S |
|------------|--------|--------|----------|----------|---------|-----------|
| QueryRMSE  | 56.98  | 58.64  | 66.26    | 67.20    | 83.06   | 84.14     |
| LambdaMART | 55.92  | 57.74  | 65.99    | 66.41    | 83.01   | 84.41     |
| YetiRank   | 57.12  | 58.73  | 66.54    | 67.28    | 85.31   | 85.77     |
| YetiLoss   | 57.16  | 58.80  | 66.50    | 67.22    | 85.46   | 85.71     |

4.3 Analysis of Algorithmic Details

Stochastic treatment of scores

First, we analyze the effect of stochastic smoothing of weights. Recall that the presence of such smoothing is the key difference between YetiLoss and LambdaMART. Table 3 confirms that smoothing is essential for YetiLoss. On the other hand, a particular shape of smoothing is not so important: the results obtained with Gaussian and Logistic smoothing are similar.
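A minimal sketch of such stochastic smoothing, under stated assumptions: unit-scale noise, a fixed number of Monte-Carlo samples, and counting only adjacent pairs in each sampled permutation (roughly following the YetiRank-style procedure; the exact weighting used in the paper is richer than this):

```python
import math
import random

def smoothed_pair_weights(scores, n_samples=1000, noise="gaussian"):
    """Estimate, for each document pair (i, j), how often i lands directly
    above j after the scores are perturbed with random noise."""
    n = len(scores)
    counts = {}
    for _ in range(n_samples):
        noisy = []
        for s in scores:
            if noise == "gaussian":
                noisy.append(s + random.gauss(0.0, 1.0))
            else:  # logistic noise via the inverse CDF log(u / (1 - u))
                u = random.random()
                noisy.append(s + math.log(u / (1.0 - u)))
        order = sorted(range(n), key=lambda i: noisy[i], reverse=True)
        for pos in range(n - 1):  # adjacent pairs in the sampled permutation
            pair = (order[pos], order[pos + 1])
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: c / n_samples for pair, c in counts.items()}

random.seed(0)
weights = smoothed_pair_weights([2.0, 1.0, 0.0])
```

Without noise, the same pair ordering would be counted in every sample, which mirrors why the "No smoothing" rows in Table 3 degrade: the pairwise weights collapse to a single deterministic permutation.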

Number of neighbors

Another difference between YetiLoss and LambdaMART is that YetiLoss uses only neighboring documents with |p_i − p_j| ≤ 1 while computing the weights. To analyze the effect of this detail, we consider modifications of YetiLoss with |p_i − p_j| ≤ k for k ∈ {1, 2, 3} (see Figure 1). We also added the LambdaMART version, where all pairs are considered. We can see that there is no stable significant dependence on this parameter. However, choosing k ≤ 2 is a good option. Note that smaller values of k lead to more efficient algorithms.
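The neighbor restriction can be sketched as follows; the positions here are hypothetical current ranked positions of the documents, and setting k to the list length reproduces the all-pairs LambdaMART-style variant:

```python
def candidate_pairs(positions, k=1):
    """Index pairs whose current ranked positions differ by at most k."""
    n = len(positions)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(positions[i] - positions[j]) <= k]

# Four documents currently ranked at positions 0..3; with k = 1 only
# adjacent pairs remain.
adjacent = candidate_pairs([0, 1, 2, 3], k=1)
```

The efficiency remark above follows directly: with k fixed, the number of pairs grows linearly in the list length instead of quadratically.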

5 Conclusion and Discussion

This paper analyzes several GBDT-based state-of-the-art LTR algorithms and their modifications to determine which algorithmic details are important for better learning to rank. In our experiments, we found that YetiRank is currently the state-of-the-art LTR approach. We make a simple modification of YetiRank and propose an extension called YetiLoss that can optimize arbitrary ranking loss functions. The experiments confirm that YetiLoss is better at optimizing MRR and is considerably better than the competitors on all the datasets for MAP . Another important contribution of our research is a thorough comparison of existing state-of-the-art learning-to-rank algorithms and their algorithmic details.

Note that our work is limited to GBDT-based models. We chose this particular setup as GBDTs are known to outperform other approaches on classic tabular LTR benchmarks. Indeed, Qin et al. (2021) compare the existing neural approaches with GBDT-based methods and show that if fairly evaluated, GBDTs still outperform neural methods. (The model proposed by Qin et al. (2021) is comparable with GBDTs, but it uses self-attention, so to score a particular document the neural network uses features of other documents that are not available to GBDT.) This agrees with the experiments of Gorishniy et al. (2021) comparing GBDTs and neural approaches on non-ranking tabular datasets. Importantly, GBDTs have other advantages useful in some applications: they are fast, easy to use, and do not require much parameter tuning. Also, in production scenarios (e.g., in search engines), GBDT models are often used on top of neural networks: the networks compute some features that are then passed to a GBDT model, which combines them with other heterogeneous features, since GBDTs are known to be good at dealing with features of different natures and scales. That is why we strongly believe that the analysis of GBDT-based ranking is very important.

On the other hand, a useful direction for future research is to perform a similar systematic analysis for neural network models. All the discussed approaches (YetiLoss, QueryRMSE, and StochasticRank) can be straightforwardly applied to neural rankers. However, it is not guaranteed that the conclusions obtained in the current research would transfer to other types of models and problems, and a separate analysis is required, similarly to what is done for the LambdaLoss framework by Jagerman et al. (2022). In addition, it would be useful to compare the neural rankers in a suitable setup like image retrieval or recommendation systems.

Finally, note that all the algorithms are implemented within the CatBoost open-source gradient boosting library for a fair comparison. We choose this particular library since it contains the original implementations of StochasticRank and YetiRank so that we can compare these methods and their modifications fairly. While we believe the obtained conclusions would transfer to other libraries, additional analysis is needed to confirm this.

References

- Balandat, M., Karrer, B., Jiang, D. R., Daulton, S., Letham, B., Wilson, A. G., and Bakshy, E. BoTorch: A framework for efficient Monte-Carlo Bayesian optimization. In Advances in Neural Information Processing Systems 33, 2020.
- Bruch, S., Han, S., Bendersky, M., and Najork, M. A stochastic treatment of learning to rank scoring functions. In Proceedings of the 13th ACM International Conference on Web Search and Data Mining, pp. 61–69, 2020.
- Burges, C. J., Ragno, R., and Le, Q. V. Learning to rank with nonsmooth cost functions. In Advances in Neural Information Processing Systems, 19:193–200, 2007.
- Burges, C. J. C. From RankNet to LambdaRank to LambdaMART: An overview. Technical report, Microsoft Research, 2010.
- CatBoost. Ranking: objectives and metrics. https://catboost.ai/docs/concepts/loss-functions-ranking.html, 2023.
- Chapelle, O. and Chang, Y. Yahoo! learning to rank challenge overview. In Proceedings of the Learning to Rank Challenge, pp. 1–24, 2011.
- Dato, D., Lucchese, C., Nardini, F. M., Orlando, S., Perego, R., Tonellotto, N., and Venturini, R. Fast ranking with additive ensembles of oblivious and non-oblivious regression trees. ACM Transactions on Information Systems (TOIS), 35(2):1–31, 2016.
- Gorishniy, Y., Rubachev, I., Khrulkov, V., and Babenko, A. Revisiting deep learning models for tabular data. In Advances in Neural Information Processing Systems 34, 2021.
- Gulin, A., Kuralenok, I., and Pavlov, D. Winning the transfer learning track of Yahoo!'s learning to rank challenge with YetiRank. In Proceedings of the Learning to Rank Challenge, pp. 63–76, 2011.
- Jagerman, R., Qin, Z., Wang, X., Bendersky, M., and Najork, M. On optimizing top-k metrics for neural ranking models. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2303–2307, 2022.
- Katzir, L., Elidan, G., and El-Yaniv, R. Net-DNF: Effective deep modeling of tabular data. In International Conference on Learning Representations, 2021.
- Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pp. 3146–3154, 2017.
- Liu, T.-Y. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.
- Nesterov, Y. and Spokoiny, V. G. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17:527–566, 2017.
- Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pp. 6638–6648, 2018.
- Qin, T. and Liu, T. Introducing LETOR 4.0 datasets. CoRR, abs/1306.2597, 2013.
- Qin, Z., Yan, L., Zhuang, H., Tay, Y., Pasumarthi, R. K., Wang, X., Bendersky, M., and Najork, M. Are neural rankers still outperformed by gradient boosted decision trees? In International Conference on Learning Representations, 2021.
- Ustimenko, A. and Prokhorenkova, L. StochasticRank: Global optimization of scale-free discrete functions. In International Conference on Machine Learning, pp. 9669–9679, 2020.
- Ustimenko, A. and Prokhorenkova, L. SGLB: Stochastic Gradient Langevin Boosting. In International Conference on Machine Learning, pp. 10487–10496, 2021.
- Wang, X., Li, C., Golbandi, N., Bendersky, M., and Najork, M. The LambdaLoss framework for ranking metric optimization. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 1313–1322, 2018.
- Wu, Q., Burges, C. J., Svore, K. M., and Gao, J. Adapting boosting for information retrieval measures. Information Retrieval, 13(3):254–270, 2010.

Appendix A Experimental Setup

We tune the following hyper-parameters:

- learning-rate: log-uniform distribution over [10⁻³, 1] for all models;
- l2-leaf-reg: 0 for StochasticRank; log-uniform distribution over [0.1, 100] for LightGBM; over [0.001, 100] for other models;
- depth: uniform distribution over {6, …, 8} for YetiRank and LambdaMART, over {6, …, 10} for StochasticRank, over {6, …, 16} for LightGBM;
- model-shrink-rate: log-uniform distribution over [10⁻⁵, 10⁻²] for StochasticRank;
- diffusion-temperature: log-uniform distribution over [10⁸, 10¹¹] for StochasticRank;
- mu: log-uniform distribution over [10⁻², 10] for StochasticRank;
- num-leaves: integer log-uniform distribution over [16, 256] for LightGBM;
- min-data-in-leaf: integer log-uniform distribution over [1, 1000] for LightGBM; default value of 10 for other algorithms.
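Drawing a configuration from log-uniform ranges like those above can be sketched as follows (a generic random-search illustration; the parameter names mirror the list, but this is not the exact tuner used in our experiments):

```python
import math
import random

def log_uniform(low, high):
    """Sample uniformly in log-space over [low, high], then exponentiate,
    so each order of magnitude is equally likely to be explored."""
    return math.exp(random.uniform(math.log(low), math.log(high)))

# One hypothetical configuration drawn from the ranges listed above.
config = {
    "learning-rate": log_uniform(1e-3, 1.0),
    "l2-leaf-reg": log_uniform(0.001, 100.0),
    "depth": random.randint(6, 8),
}
```

Log-uniform sampling is the natural choice for scale parameters such as the learning rate, where the interesting variation spans several orders of magnitude.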

The best hyper-parameters used for Table 2 are listed in Tables 27-30.

Appendix B Additional Experimental Details

Generalization gap

Tables 23-26 provide additional information on the generalization of different algorithms. The column "train" shows the corresponding quality on the train set, while "gap" is the difference between the train and test values. A smaller "gap" indicates better generalization.

As expected, YetiLoss usually has larger generalization gaps than YetiRank (though there are some exceptions) since it is trained for a particular target quality function on the train set. Also, YetiLoss usually has better quality on the train set compared to YetiRank. Compared with LightGBM, all CatBoost-based algorithms usually have better generalization (see Table 23). We assume that this is because CatBoost uses oblivious decision trees known to be resistant to overfitting.

Number of neighbors

Extended results on the analysis of the number of neighbors (including additional loss functions NDCG@1 and NDCG@5) can be found in Tables 5-10.

Stochastic smoothing

Extended results on the analysis of stochastic smoothing (including additional loss functions NDCG@1 and NDCG@5) can be found in Tables 11-22.

Table 5: The effect of the number of neighbors for YetiLoss, NDCG@1

| k   | Web   | Yahoo | Istella |
|-----|-------|-------|---------|
| 1   | 48.24 | 74.64 | 70.36   |
| 2   | 49.27 | 74.60 | 70.70   |
| all | 47.70 | 74.74 | 70.93   |

Table 6: The effect of the number of neighbors for YetiLoss, NDCG@5

| k   | Web   | Yahoo | Istella |
|-----|-------|-------|---------|
| 1   | 48.61 | 74.49 | 70.39   |
| 2   | 49.02 | 74.29 | 70.61   |
| all | 48.36 | 74.65 | 70.44   |

Table 7: The effect of the number of neighbors for YetiLoss, NDCG@10

| k   | Web   | Yahoo | Istella |
|-----|-------|-------|---------|
| 1   | 50.76 | 77.77 | 76.32   |
| 2   | 50.90 | 77.68 | 76.69   |
| 3   | 50.99 | 77.76 | 76.74   |
| all | 50.56 | 77.90 | 76.79   |

Table 8: The effect of the number of neighbors for YetiLoss, MRR

| k   | Web   | Yahoo | Istella |
|-----|-------|-------|---------|
| 1   | 84.17 | 93.21 | 97.97   |
| 2   | 84.18 | 93.75 | 97.99   |
| 3   | 84.39 | 93.22 | 97.82   |
| all | 84.19 | 93.84 | 97.90   |

Table 9: The effect of the number of neighbors for YetiLoss, MAP

| k   | Web   | Yahoo | Istella |
|-----|-------|-------|---------|
| 1   | 62.62 | 89.37 | 91.07   |
| 2   | 62.62 | 89.30 | 91.06   |
| 3   | 62.54 | 89.38 | 91.16   |
| all | 62.32 | 89.36 | 90.43   |

Table 10: The effect of the number of neighbors for YetiLoss, ERR

| k   | Web   | Yahoo | Istella |
|-----|-------|-------|---------|
| 1   | 57.46 | 67.15 | 86.59   |
| 2   | 57.91 | 67.28 | 86.71   |
| 3   | 57.59 | 67.13 | 86.48   |
| all | 57.07 | 67.32 | 86.44   |

Table 11: The effect of stochastic smoothing for YetiRank, NDCG@1

| Smoothing    | Web   | Yahoo | Istella |
|--------------|-------|-------|---------|
| Logistic     | 48.59 | 75.34 | 70.80   |
| Gaussian     | 49.86 | 75.19 | 71.62   |
| No smoothing | 48.26 | 73.70 | 70.47   |

Table 12: The effect of stochastic smoothing for YetiRank, NDCG@5

| Smoothing    | Web   | Yahoo | Istella |
|--------------|-------|-------|---------|
| Logistic     | 48.54 | 74.93 | 70.81   |
| Gaussian     | 49.40 | 74.99 | 70.94   |
| No smoothing | 47.52 | 72.14 | 69.82   |

Table 13: The effect of stochastic smoothing for YetiRank, NDCG@10

| Smoothing    | Web   | Yahoo | Istella |
|--------------|-------|-------|---------|
| Logistic     | 50.75 | 77.99 | 77.15   |
| Gaussian     | 51.00 | 78.07 | 77.23   |
| No smoothing | 49.05 | 75.57 | 75.80   |

Table 14: The effect of stochastic smoothing for YetiRank, MRR

| Smoothing    | Web   | Yahoo | Istella |
|--------------|-------|-------|---------|
| Logistic     | 84.21 | 93.05 | 97.58   |
| Gaussian     | 83.85 | 93.07 | 97.47   |
| No smoothing | 83.15 | 92.23 | 97.13   |

Table 15: The effect of stochastic smoothing for YetiRank, MAP

| Smoothing    | Web   | Yahoo | Istella |
|--------------|-------|-------|---------|
| Logistic     | 62.07 | 88.00 | 89.09   |
| Gaussian     | 61.85 | 87.80 | 89.02   |
| No smoothing | 60.38 | 86.85 | 87.47   |

Table 16: The effect of stochastic smoothing for YetiRank, ERR

| Smoothing | Web   | Yahoo | Istella |
|-----------|-------|-------|---------|
| Logistic  | 57.41 | 67.27 | 86.97   |
| Gaussian  | 57.76 | 67.15 | 86.87   |

Table 17: The effect of stochastic smoothing for YetiLoss, NDCG@1

| Smoothing    | Web   | Yahoo | Istella |
|--------------|-------|-------|---------|
| Logistic     | 48.24 | 74.64 | 70.36   |
| Gaussian     | 49.99 | 74.36 | 70.45   |
| No smoothing | 48.75 | 71.89 | 68.32   |

Table 18: The effect of stochastic smoothing for YetiLoss, NDCG@5

| Smoothing    | Web   | Yahoo | Istella |
|--------------|-------|-------|---------|
| Logistic     | 48.61 | 74.49 | 70.39   |
| Gaussian     | 49.25 | 74.17 | 70.37   |
| No smoothing | 46.85 | 71.51 | 68.46   |

Table 19: The effect of stochastic smoothing for YetiLoss, NDCG@10

| Smoothing    | Web   | Yahoo | Istella |
|--------------|-------|-------|---------|
| Logistic     | 50.76 | 77.77 | 76.32   |
| Gaussian     | 50.79 | 77.55 | 76.28   |
| No smoothing | 48.49 | 75.29 | 74.55   |

Table 20: The effect of stochastic smoothing for YetiLoss, MRR

| Smoothing    | Web   | Yahoo | Istella |
|--------------|-------|-------|---------|
| Logistic     | 84.17 | 93.21 | 97.97   |
| Gaussian     | 84.28 | 93.61 | 97.83   |
| No smoothing | 82.70 | 91.94 | 97.38   |

Table 21: The effect of stochastic smoothing for YetiLoss, MAP

| Smoothing    | Web   | Yahoo | Istella |
|--------------|-------|-------|---------|
| Logistic     | 62.62 | 89.37 | 91.07   |
| Gaussian     | 62.52 | 89.37 | 91.08   |
| No smoothing | 61.32 | 88.13 | 89.89   |

Table 22: The effect of stochastic smoothing for YetiLoss, ERR

| Smoothing | Web   | Yahoo | Istella |
|-----------|-------|-------|---------|
| Logistic  | 57.46 | 67.15 | 86.59   |
| Gaussian  | 57.55 | 66.99 | 86.25   |

In Tables 23-26, each cell shows the train value followed by the gap (train / gap).

Table 23: Analysis of generalization gap for NDCG@10

| Algorithm      | Web10K       | Web30K        | Yahoo S1     | Yahoo S2      | Istella      | Istella-S     |
|----------------|--------------|---------------|--------------|---------------|--------------|---------------|
| YetiRank       | 56.88 / 6.13 | 57.43 / 5.12  | 86.56 / 7.37 | 90.53 / 12.54 | 79.86 / 6.83 | 85.86 / 8.71  |
| YetiLoss       | 57.68 / 6.92 | 52.48 / 3.57  | 80.46 / 5.34 | 94.83 / 17.06 | 77.03 / 5.57 | 85.23 / 8.91  |
| StochasticRank | 56.97 / 10.23| 54.96 / 6.33  | 82.16 / 6.73 | 89.66 / 13.73 | 72.02 / 3.71 | 82.53 / 8.17  |
| LambdaMART     | 55.25 / 5.87 | 55.72 / 4.79  | 83.35 / 4.72 | 90.48 / 13.17 | 73.38 / 3.55 | 80.12 / 4.57  |
| LightGBM       | 56.04 / 9.25 | 59.58 / 10.37 | 84.99 / 9.67 | 84.79 / 9.39  | 82.36 / 9.61 | 87.73 / 11.77 |

Table 24: Analysis of generalization gap for MRR

| Algorithm      | Web10K       | Web30K       | Yahoo S1     | Yahoo S2     | Istella      | Istella-S    |
|----------------|--------------|--------------|--------------|--------------|--------------|--------------|
| YetiRank       | 87.02 / 2.80 | 86.85 / 1.27 | 93.06 / 1.86 | 96.26 / 3.21 | 98.40 / 1.70 | 99.48 / 1.90 |
| YetiLoss       | 87.82 / 3.64 | 87.64 / 1.63 | 94.93 / 4.04 | 97.88 / 4.67 | 98.98 / 1.82 | 99.85 / 1.88 |
| StochasticRank | 88.14 / 4.63 | 89.20 / 3.77 | 93.92 / 3.20 | 95.64 / 2.56 | 98.24 / 1.70 | 99.45 / 1.62 |
| LambdaMART     | 86.11 / 3.52 | 85.27 / 1.15 | 93.26 / 2.51 | 96.01 / 3.67 | 97.60 / 2.42 | 99.66 / 2.70 |

Table 25: Analysis of generalization gap for MAP

| Algorithm  | Web10K       | Web30K       | Yahoo S1     | Yahoo S2     | Istella      | Istella-S    |
|------------|--------------|--------------|--------------|--------------|--------------|--------------|
| YetiRank   | 65.38 / 3.31 | 65.02 / 1.63 | 89.48 / 3.54 | 94.16 / 6.15 | 84.55 / 4.53 | 92.26 / 3.18 |
| YetiLoss   | 65.86 / 3.24 | 66.22 / 2.20 | 90.70 / 4.24 | 96.06 / 6.69 | 89.40 / 6.74 | 96.27 / 5.20 |
| LambdaMART | 62.40 / 4.46 | 59.89 / 1.32 | 87.13 / 1.22 | 96.50 / 7.36 | 85.75 / 4.94 | 94.02 / 3.74 |

Table 26: Analysis of generalization gap for ERR

| Algorithm      | Web10K       | Web30K       | Yahoo S1     | Yahoo S2     | Istella      | Istella-S    |
|----------------|--------------|--------------|--------------|--------------|--------------|--------------|
| YetiRank       | 61.80 / 4.39 | 62.53 / 3.00 | 70.34 / 3.64 | 73.40 / 6.13 | 91.30 / 4.02 | 94.53 / 7.55 |
| YetiLoss       | 61.72 / 4.26 | 62.11 / 2.58 | 70.06 / 3.34 | 72.23 / 5.08 | 91.00 / 4.17 | 94.14 / 7.55 |
| StochasticRank | 62.82 / 5.88 | 62.81 / 3.48 | 70.00 / 3.36 | 73.26 / 6.09 | 87.77 / 2.33 | 91.82 / 5.84 |
| LambdaMART     | 63.85 / 7.31 | 63.68 / 4.92 | 68.85 / 2.63 | 72.04 / 5.50 | 87.82 / 3.11 | 91.89 / 6.65 |

Table 27: Chosen parameters for NDCG@10

| Algorithm      | Parameter             | Web10K   | Web30K  | Yahoo S1 | Yahoo S2 | Istella | Istella-S |
|----------------|-----------------------|----------|---------|----------|----------|---------|-----------|
| YetiRank       | depth                 | 8        | 8       | 8        | 7        | 8       | 8         |
|                | l2-leaf-reg           | 0.03343  | 0.00100 | 0.00588  | 0.08913  | 0.00461 | 0.00100   |
|                | learning-rate         | 0.05450  | 0.10085 | 0.10285  | 0.04172  | 0.18317 | 0.18166   |
| YetiLoss       | depth                 | 8        | 8       | 8        | 8        | 8       | 8         |
|                | l2-leaf-reg           | 0.00625  | 0.11205 | 0.10000  | 0.00313  | 0.13938 | 0.00100   |
|                | learning-rate         | 0.05111  | 0.11108 | 0.09002  | 0.03479  | 0.25685 | 0.15083   |
| StochasticRank | depth                 | 10       | 10      | 10       | 8        | 10      | 10        |
|                | learning-rate         | 0.18210  | 0.30549 | 0.12293  | 0.05506  | 1.00000 | 1.00000   |
|                | model-shrink-rate     | 0.00001  | 0.00117 | 0.00049  | 0.01000  | 0.00075 | 0.00125   |
|                | diffusion-temperature | 675M     | 63274M  | 100M     | 102M     | 304M    | 5873M     |
|                | mu                    | 0.02034  | 0.01314 | 0.06495  | 0.04142  | 0.01190 | 0.01794   |
| LambdaMART     | depth                 | 8        | 8       | 8        | 6        | 8       | 7         |
|                | l2-leaf-reg           | 0.00394  | 0.00170 | 0.00176  | 0.00100  | 0.00100 | 0.00100   |
|                | learning-rate         | 0.07478  | 0.14310 | 0.14611  | 0.10625  | 0.18117 | 0.15533   |
| LightGBM       | depth                 | 14       | 14      | 16       | 9        | 14      | 16        |
|                | learning-rate         | 0.16656  | 0.12405 | 0.09355  | 0.12593  | 0.22409 | 0.22599   |
|                | num-leaves            | 97       | 256     | 256      | 16       | 256     | 256       |
|                | min-data-in-leaf      | 1        | 1       | 2        | 11       | 9       | 1         |
|                | l2-leaf-reg           | 73.06401 | 0.39895 | 6.58340  | 11.54666 | 0.74747 | 1.76279   |

Table 28: Chosen parameters for MRR

| Algorithm      | Parameter             | Web10K  | Web30K  | Yahoo S1 | Yahoo S2 | Istella | Istella-S |
|----------------|-----------------------|---------|---------|----------|----------|---------|-----------|
| YetiRank       | depth                 | 7       | 7       | 8        | 6        | 8       | 8         |
|                | l2-leaf-reg           | 0.23902 | 0.55462 | 0.01642  | 6.24819  | 0.04491 | 0.11146   |
|                | learning-rate         | 0.09387 | 0.22263 | 0.07333  | 0.23151  | 0.14079 | 0.17564   |
| YetiLoss       | depth                 | 6       | 8       | 7        | 8        | 8       | 8         |
|                | l2-leaf-reg           | 0.71013 | 0.32674 | 0.10000  | 0.02186  | 0.10000 | 0.00356   |
|                | learning-rate         | 0.16364 | 0.07572 | 0.15166  | 0.07400  | 0.18719 | 0.06319   |
| StochasticRank | depth                 | 7       | 10      | 10       | 6        | 10      | 10        |
|                | learning-rate         | 0.19705 | 0.34981 | 0.14603  | 0.04305  | 1.00000 | 0.36481   |
|                | model-shrink-rate     | 0.00001 | 0.00016 | 0.01000  | 0.00001  | 0.00006 | 0.00031   |
|                | diffusion-temperature | 251M    | 8890M   | 100M     | 100M     | 100M    | 984M      |
|                | mu                    | 0.01120 | 0.01000 | 0.03325  | 0.01689  | 0.03698 | 0.01000   |
| LambdaMART     | depth                 | 6       | 6       | 8        | 8        | 6       | 6         |
|                | l2-leaf-reg           | 0.00769 | 0.04354 | 0.00100  | 0.01220  | 0.00100 | 0.00100   |
|                | learning-rate         | 0.23696 | 0.08981 | 0.07174  | 0.12050  | 0.09162 | 0.09379   |

Table 29: Chosen parameters for MAP

| Algorithm  | Parameter     | Web10K  | Web30K  | Yahoo S1 | Yahoo S2 | Istella | Istella-S |
|------------|---------------|---------|---------|----------|----------|---------|-----------|
| YetiRank   | depth         | 8       | 8       | 8        | 7        | 8       | 8         |
|            | l2-leaf-reg   | 0.01157 | 0.00320 | 0.00708  | 0.20218  | 0.00287 | 0.02457   |
|            | learning-rate | 0.05033 | 0.09699 | 0.11392  | 0.05680  | 0.18562 | 0.15196   |
| YetiLoss   | depth         | 8       | 8       | 8        | 8        | 8       | 8         |
|            | l2-leaf-reg   | 0.54619 | 0.04957 | 0.05314  | 0.00274  | 0.00819 | 0.02790   |
|            | learning-rate | 0.05031 | 0.07755 | 0.06111  | 0.01103  | 0.25184 | 0.25536   |
| LambdaMART | depth         | 8       | 8       | 8        | 7        | 8       | 8         |
|            | l2-leaf-reg   | 0.01119 | 0.00441 | 0.00100  | 0.00879  | 0.00100 | 0.00100   |
|            | learning-rate | 0.01769 | 0.01267 | 0.02798  | 0.04096  | 0.18745 | 0.16072   |

Table 30: Chosen parameters for ERR

| Algorithm      | Parameter             | Web10K  | Web30K  | Yahoo S1 | Yahoo S2 | Istella | Istella-S |
|----------------|-----------------------|---------|---------|----------|----------|---------|-----------|
| YetiRank       | depth                 | 8       | 8       | 8        | 8        | 8       | 8         |
|                | l2-leaf-reg           | 0.06115 | 0.00100 | 0.00309  | 0.01318  | 0.02537 | 0.00100   |
|                | learning-rate         | 0.06827 | 0.09341 | 0.09871  | 0.02678  | 0.17561 | 0.16226   |
| YetiLoss       | depth                 | 8       | 8       | 8        | 6        | 8       | 8         |
|                | l2-leaf-reg           | 0.01401 | 0.00175 | 0.00966  | 0.05289  | 0.00100 | 0.00160   |
|                | learning-rate         | 0.04484 | 0.07811 | 0.06638  | 0.05021  | 0.12214 | 0.09334   |
| StochasticRank | depth                 | 10      | 10      | 10       | 10       | 10      | 10        |
|                | learning-rate         | 0.14777 | 0.30041 | 0.25101  | 0.05651  | 0.54801 | 0.39642   |
|                | model-shrink-rate     | 0.00042 | 0.00001 | 0.00004  | 0.01000  | 0.00085 | 0.00206   |
|                | diffusion-temperature | 100M    | 589M    | 344M     | 1425M    | 100B    | 100M      |
|                | mu                    | 0.12610 | 0.02085 | 0.12851  | 0.18189  | 0.01375 | 0.01000   |
| LambdaMART     | depth                 | 8       | 8       | 8        | 6        | 6       | 6         |
|                | l2-leaf-reg           | 0.00100 | 0.00100 | 0.00116  | 0.00100  | 0.00100 | 0.00100   |
|                | learning-rate         | 0.05517 | 0.10631 | 0.09575  | 0.04895  | 0.15040 | 0.15975   |
