Title: Learning Mixtures of Gaussians with Censored Data
URL Source: https://arxiv.org/html/2305.04127
Learning Mixtures of Gaussians with Censored Data

Wai Ming Tai, University of Chicago
Bryon Aragam, University of Chicago

Abstract
We study the problem of learning mixtures of Gaussians with censored data. Statistical learning with censored data is a classical problem with numerous practical applications; however, finite-sample guarantees are missing even for simple latent variable models such as Gaussian mixtures. Formally, we are given censored data from a mixture of univariate Gaussians

$$\sum_{i=1}^{k} w_i \mathcal{N}(\mu_i, \sigma^2),$$

i.e. a sample is observed only if it lies inside a set $S$. The goal is to learn the weights $w_i$ and the means $\mu_i$. We propose an algorithm that takes only $\frac{1}{\varepsilon^{O(k)}}$ samples to estimate the weights $w_i$ and the means $\mu_i$ within $\varepsilon$ error.
1 Introduction
When we collect data, we often encounter situations in which the data are partially observed. This can arise for a variety of reasons, such as measurements falling outside of the range of some apparatus or device. In machine learning and statistics, this phenomenon is known as truncated or censored data. Both refer to the case where we do not observe the data when they fall outside a certain domain. For censored data, we know the existence of data that fall outside the domain, while for truncated data, we do not.
It is common to encounter truncated or censored data in our daily lives. An example of truncated data is census data. When a census bureau collects data, it may have difficulty collecting data on certain demographics due to security, privacy, or legal constraints, and these individuals may have no incentive to report their data. Therefore, the bureau cannot collect data about these populations. In this case, the census data are truncated.
On the other hand, an example of censored data is test scores. The range of scores in a test is typically set to be from 0 to 100. Some students may score the maximum score of 100 points, in which case it is unknown if they could have scored even higher if the upper bound of the test score was higher than 100. Of course, even though the students’ scores are capped at 100, their scores are still being reported. Hence, their scores are censored, which distinguishes them from truncated data.
Indeed, statistical estimation on truncated or censored data is a classical problem, dating back to the eighteenth century [3]. After Bernoulli, [15, 31, 32, 21, 14] studied how to estimate the mean and the variance of a univariate Gaussian distribution from truncated samples. However, most existing results do not address the problem of finite-sample bounds, i.e. the results are mostly experimental or asymptotic [22, 26]. In fact, one can learn the distribution with infinitely many truncated or censored samples: under mild assumptions, one can show that the density restricted to a certain region can be extended to the entire space by the identity theorem from complex analysis. Unfortunately, it is still not clear how to translate such results into finite-sample bounds.
A recent notable result by Daskalakis et al. [7] gave the first efficient algorithm to learn the mean and the covariance of a single Gaussian from finitely many truncated samples. A natural extension of the problem of learning a single Gaussian is the problem of learning a mixture of Gaussians. To the best of our knowledge, there are no provable guarantees for the problem of learning a mixture of Gaussians from a finite number of truncated or censored samples, even in one dimension.
As we will discuss in the related work section, there is a long line of work on learning a mixture of Gaussians. Likelihood-based approaches often do not provide provable guarantees for learning mixtures of Gaussians since the objective function is not convex unless we impose strong assumptions [39, 6]. On the other hand, many recent results rely heavily on the method of moments, i.e. the algorithm estimates the moments $\mathsf{E}(X^s)$ as an intermediate step. With truncated or censored data, estimating $\mathsf{E}(X^s)$ (here, the expectation is over the original, untruncated data) becomes very challenging.
To overcome this, we propose an approach for estimating moments from censored data. Recall that ordinary moments are just expectations of monomials of a random variable. By generalizing monomials to more general functions of a random variable, we open up the possibility of capturing more complex structures of the distribution. In particular, when the data is censored, these generalized functions allow us to relate the expectations back to the raw, uncensored distribution. One must keep in mind that we still need to choose which functions to consider, in addition to providing efficient estimators of these generalized moments. We preview that a suitable choice is a specific linear combination of Hermite polynomials derived from the solution to a system of linear equations. In our proof, we delve deeper into the analysis of the expectations of these functions, depending on the domain, and provide a delicate analysis to prove our desired result.
Based on the above discussion, we may want to ask the following question in a general sense: Can we learn a mixture of Gaussians with truncated or censored data? In this paper, we consider this problem and focus on the case that the data is censored and the Gaussians are univariate and homogeneous. We now define the problem formally.
2 Problem Definition
Let $\mathcal{N}(\mu, \sigma^2)$ be the normal distribution with mean $\mu$ and variance $\sigma^2$. Namely, the pdf of $\mathcal{N}(\mu, \sigma^2)$ is

$$g_{\mu,\sigma^2}(x) := \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2\sigma^2}(x-\mu)^2}.$$

For any subset $S \subset \mathbb{R}$, let $I_{\mu,\sigma^2}(S)$ be the probability mass of $\mathcal{N}(\mu, \sigma^2)$ on $S$, i.e.

$$I_{\mu,\sigma^2}(S) := \int_{x \in S} g_{\mu,\sigma^2}(x)\,\mathsf{d}x.$$

Also, let $\mathcal{N}(\mu, \sigma^2, S)$ denote the conditional distribution of a normal $\mathcal{N}(\mu, \sigma^2)$ given the set $S$. Namely, the pdf of $\mathcal{N}(\mu, \sigma^2, S)$ is

$$g_{\mu,\sigma^2,S}(x) := \begin{cases} \frac{1}{I_{\mu,\sigma^2}(S)}\, g_{\mu,\sigma^2}(x) & \text{if } x \in S \\ 0 & \text{if } x \notin S. \end{cases}$$
Given a subset $S \subset \mathbb{R}$, we consider the following sampling procedure. Each time, a sample is drawn from a mixture of Gaussians

$$\sum_{i=1}^{k} w_i \mathcal{N}(\mu_i, \sigma^2),$$

where $w_i > 0$, $\sum_{i=1}^{k} w_i = 1$, $\mu_i \in \mathbb{R}$, and $\sigma > 0$. If this sample is inside $S$, we obtain this sample; otherwise, we fail to generate a sample. Formally, $X$ is a random variable drawn from the following distribution. Let $\alpha$ be the probability mass $\sum_{i=1}^{k} w_i I_{\mu_i,\sigma^2}(S)$. Then

$$X \sim \begin{cases} \sum_{i=1}^{k} w_i \mathcal{N}(\mu_i, \sigma^2, S) & \text{with probability } \alpha \\ \mathsf{FAIL} & \text{with probability } 1 - \alpha. \end{cases} \quad (1)$$
The value FAIL here refers to values that are not directly accessible to the algorithm.
We assume that

(A1) $S$ is an interval $[-R, R]$ for some constant $R > 0$ and is known (it is easy to extend $S$ to be any measurable subset of $[-R, R]$; for simplicity, we assume $S = [-R, R]$);

(A2) All $\mu_i$ are bounded, i.e. $|\mu_i| < M$ for some constant $M > 0$;

(A3) The variance $\sigma^2$ is known.
We also assume that the integral $\int_0^z e^{-\frac{1}{2}t^2}\,\mathsf{d}t$ can be computed exactly for any $z$. Indeed, one can always approximate this integral with an exponential convergence rate via Taylor expansion. As we can see in our proof, this error is negligible.
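This assumption is mild. For instance, integrating the Taylor series of $e^{-\frac{1}{2}t^2}$ term by term converges at a factorial rate; the sketch below (our illustration, not part of the paper's algorithm) checks the series against the closed form $\sqrt{\pi/2}\,\mathrm{erf}(z/\sqrt{2})$.

```python
import math

def gauss_integral(z: float, terms: int = 40) -> float:
    """Approximate the integral of e^{-t^2/2} on [0, z] by integrating the
    Taylor series e^{-t^2/2} = sum_n (-1/2)^n t^{2n} / n! term by term.
    The factorial in the denominator gives exponentially fast convergence."""
    total = 0.0
    for n in range(terms):
        total += (-0.5) ** n * z ** (2 * n + 1) / (math.factorial(n) * (2 * n + 1))
    return total

# Sanity check against the closed form sqrt(pi/2) * erf(z / sqrt(2)).
z = 1.5
exact = math.sqrt(math.pi / 2) * math.erf(z / math.sqrt(2))
assert abs(gauss_integral(z) - exact) < 1e-12
```

In practice a few dozen terms already reach machine precision for moderate $z$, which is why this error is negligible in the analysis.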
For a given error parameter $\varepsilon > 0$, we want to estimate all $w_i, \mu_i$ within $\varepsilon$ error. The question is: how many samples from the above sampling procedure do we need to achieve this goal? Our main contribution is a quantitative answer to this question. We will prove the following theorem:
Theorem 1.

Suppose we have $n$ samples drawn from the distribution (1) and we assume that the mixture satisfies (A1)-(A3). Furthermore, let $w_{\min}$ be $\min\{w_i \mid i = 1, \ldots, k\}$ and $\Delta_{\min}$ be $\min\{|\mu_i - \mu_j| \mid i, j = 1, \ldots, k \text{ and } i \neq j\}$. Then, for a sufficiently small $\varepsilon > 0$, if $w_{\min}$ and $\Delta_{\min}$ satisfy $w_{\min}, \Delta_{\min} = \Omega(\varepsilon)$, there is an efficient algorithm that takes $n = C_k \cdot \frac{1}{\varepsilon^{O(k)}}$ samples[^1] (where $C_k$ is a constant depending on $k$ only) as the input and outputs $\hat{w}_i, \hat{\mu}_i$ for $i = 1, \ldots, k$ such that, up to an index permutation $\Pi$,

$$|\hat{w}_{\Pi(i)} - w_i| < \varepsilon, \quad |\hat{\mu}_{\Pi(i)} - \mu_i| < \varepsilon \quad \text{for } i = 1, \ldots, k$$

with probability $\frac{99}{100}$. The running time of the algorithm is $O(n \cdot \mathsf{poly}(k, \frac{1}{\varepsilon}))$.

[^1]: Here we assume the parameters $R$ and $M$ to be constant for simplicity. It is easy to keep track of them in our proof and show that the sample bound is $C_k \cdot (\frac{1}{\varepsilon})^{O(k \cdot \log(M + R + \frac{1}{R}))}$.
In other words, this theorem states that the sample complexity of learning mixtures of $k$ univariate Gaussians with censored data is $\frac{1}{\varepsilon^{O(k)}}$, which is optimal in terms of the asymptotic growth of the exponent $O(k)$ [38]. The optimality of the constant in the exponent $O(k)$ is an interesting open problem.
3 Related Work
Without truncation or censoring, the study of learning Gaussian mixture models [30] has a long history. We focus on recent algorithmic results; see Lindsay [23] for additional background. Dasgupta [5] proposed an algorithm to learn the centers of each Gaussian when the centers are $\Omega(\sqrt{d})$ apart from each other. There are other results such as [37, 34] that are based on similar separation assumptions and that use clustering techniques.
There are other results using the method of moments; namely, the algorithm estimates the moments $\mathsf{E}(X^s)$ as an intermediate step. Moitra and Valiant [28] and Kalai et al. [19] showed that, assuming $k = O(1)$, there is an efficient algorithm that learns the parameters with $\frac{1}{\varepsilon^{O(k)}}$ samples. Hardt and Price [16] showed that, when $k = 2$, the optimal sample complexity of learning the parameters is $\Theta(\frac{1}{\varepsilon^{12}})$. For the case that the Gaussians in the mixture have equal variance, Wu and Yang [38] proved that the optimal sample complexity for learning the centers is $\Theta(\frac{1}{\varepsilon^{4k-2}})$ if the variance is known and $\Theta(\frac{1}{\varepsilon^{4k}})$ if the variance is unknown. Later, Doss et al. [13] extended the optimal sample complexity to high dimensions.
When the data are truncated or censored, however, the task becomes more challenging. [35, 2, 4] provided detailed surveys on the topic of learning Gaussians with truncated or censored data. Recently, Daskalakis et al. [7] showed that, if the samples are from a single Gaussian in a high dimensional space, there is an algorithm that uses $\tilde{O}(\frac{d^2}{\varepsilon^2})$ samples to learn the mean vector and the covariance matrix. Their approach is likelihood-based: they optimize the negative log-likelihood function to find the optimal value. This approach relies on the fact that, for a single Gaussian, the negative log-likelihood function is convex, and hence one can use greedy approaches such as stochastic gradient descent to find the optimal value.
Unfortunately, when there are multiple Gaussians in the mixture, we may not have such a convexity property for the negative log-likelihood function. Nagarajan and Panageas [29] showed that, for the special case of a truncated mixture of two Gaussians whose centers are symmetric around the origin and assuming the truncated density is known, the output of the EM algorithm converges to the true mean as the number of iterations tends to infinity.
There are other problem settings that are closely related to ours, such as robust estimation of the parameters of a Gaussian in high dimensional spaces. The setting of robust estimation is the following: the samples we observe are generated from a single high dimensional Gaussian, except that a fraction of them is corrupted. Multiple previous results such as [18, 24, 9, 12, 20, 10, 11] proposed algorithms to learn the mean vector and the covariance matrix.
Regression with truncated or censored data is another common formulation. Namely, we only observe the data when the value of the dependent variable lies in a certain subset. A classic formulation is the truncated linear regression model [36, 1, 17, 25]. Recently, in the truncated linear regression model, Daskalakis et al. [8] proposed a likelihood-based estimator to learn the parameters.
4 Preliminaries
We denote by $[n]$ the set $\{0, 1, \ldots, n-1\}$ for any positive integer $n$. Let $h_j(x)$ be the (probabilist's) Hermite polynomials, i.e.

$$h_j(x) = (-1)^j e^{\frac{1}{2}x^2} \left.\frac{\mathsf{d}^j}{\mathsf{d}\xi^j} e^{-\frac{1}{2}\xi^2}\right|_{\xi = x} \quad \text{for all } x \in \mathbb{R}.$$

Hermite polynomials can also be given by the exponential generating function, i.e.

$$e^{x\mu - \frac{1}{2}\mu^2} = \sum_{j=0}^{\infty} h_j(x) \frac{\mu^j}{j!} \quad \text{for any } x, \mu \in \mathbb{R}. \quad (2)$$

Also, the explicit formula for $h_j$ is

$$h_j(x) = j! \sum_{i=0}^{\lfloor j/2 \rfloor} \frac{(-1/2)^i}{i!\,(j-2i)!}\, x^{j-2i},$$

and this explicit formula is useful in our analysis.
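As an aside, the explicit formula is easy to implement and can be sanity-checked against the standard three-term recurrence $h_{j+1}(x) = x\,h_j(x) - j\,h_{j-1}(x)$; a small sketch (ours, not the paper's):

```python
import math

def hermite(j: int, x: float) -> float:
    """Probabilists' Hermite polynomial h_j(x) via the explicit formula
    h_j(x) = j! * sum_{i=0}^{floor(j/2)} (-1/2)^i / (i! (j-2i)!) * x^{j-2i}."""
    return math.factorial(j) * sum(
        (-0.5) ** i / (math.factorial(i) * math.factorial(j - 2 * i)) * x ** (j - 2 * i)
        for i in range(j // 2 + 1)
    )

# Cross-check against the three-term recurrence h_{j+1}(x) = x h_j(x) - j h_{j-1}(x).
x = 0.7
a, b = 1.0, x  # h_0(x), h_1(x)
for j in range(1, 8):
    a, b = b, x * b - j * a
    assert abs(b - hermite(j + 1, x)) < 1e-9
```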
In our proof, we will solve multiple systems of linear equations. Cramer’s rule provides an explicit formula for the solution of a system of linear equations whenever the system has a unique solution.
Lemma 2 (Cramer's rule).

Consider the following system of $n$ linear equations in $n$ variables:

$$Ax = b$$

where $A$ is an $n$-by-$n$ matrix with nonzero determinant and $b$ is an $n$-dimensional vector. Then the solution $\hat{x} = A^{-1}b$ of this system satisfies that the $i$-th entry of $\hat{x}$ is

$$\det(A^{(i \leftarrow b)}) / \det(A)$$

where $A^{(i \leftarrow b)}$ is the same matrix as $A$ except that the $i$-th column is replaced with $b$.
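As a concrete illustration (a toy instance of our choosing, not from the paper), Cramer's rule can be implemented directly for small systems using exact rational arithmetic:

```python
from fractions import Fraction

def det(A):
    """Determinant by Laplace expansion along the first row (fine for small n)."""
    n = len(A)
    if n == 1:
        return A[0][0]
    total = Fraction(0)
    for c in range(n):
        minor = [row[:c] + row[c + 1:] for row in A[1:]]
        total += (-1) ** c * A[0][c] * det(minor)
    return total

def cramer_solve(A, b):
    """Solve Ax = b via Cramer's rule: x_i = det(A with column i replaced by b) / det(A)."""
    d = det(A)
    x = []
    for i in range(len(A)):
        Ai = [row[:i] + [b[r]] + row[i + 1:] for r, row in enumerate(A)]
        x.append(det(Ai) / d)
    return x

# 2x + y = 5 and x + 3y = 10 has the solution (x, y) = (1, 3).
A = [[Fraction(2), Fraction(1)], [Fraction(1), Fraction(3)]]
b = [Fraction(5), Fraction(10)]
assert cramer_solve(A, b) == [Fraction(1), Fraction(3)]
```

The explicit determinant formula is what matters for the analysis here: it lets each solution entry be bounded through determinant bounds, which is exactly how Cramer's rule is used in this paper.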
Thanks to the application of Cramer's rule, we often encounter determinants. The Cauchy-Binet formula computes the determinant of a matrix each of whose entries can be expressed as an inner product of two vectors corresponding to its row and column. Note that the Cauchy-Binet formula usually applies to the case where the entries are finite sums. For our purpose, we state the Cauchy-Binet formula for the case where the entries are in integral form.
Lemma 3 (Cauchy-Binet formula).

Let $A$ be an $n$-by-$n$ matrix whose $(r, c)$-entry has the form $\int_{x \in S} f_r(x) g_c(x)\,\mathsf{d}x$ for some functions $f_r, g_c$ and some domain $S \subset \mathbb{R}$. Then the determinant of $A$ is

$$\det(A) = \int_{x_0 < \cdots < x_{n-1},\ \mathbf{x} \in S^n} \det(B(\mathbf{x})) \cdot \det(C(\mathbf{x}))\,\mathsf{d}\mathbf{x}$$

where, for any $\mathbf{x} = (x_0, \ldots, x_{n-1}) \in S^n$, $B(\mathbf{x})$ is an $n$-by-$n$ matrix whose $(r, i)$-entry is $f_r(x_i)$ and $C(\mathbf{x})$ is an $n$-by-$n$ matrix whose $(i, c)$-entry is $g_c(x_i)$.
Another tool that helps us compute determinants is Schur polynomials, which are defined as follows. For any partition $\lambda = (\lambda_1, \ldots, \lambda_n)$ such that $\lambda_1 \geq \cdots \geq \lambda_n$ and $\lambda_i \geq 0$, define the function $a_{(\lambda_1 + n - 1, \lambda_2 + n - 2, \ldots, \lambda_n)}(x_1, x_2, \ldots, x_n)$ to be

$$a_{(\lambda_1 + n - 1, \lambda_2 + n - 2, \ldots, \lambda_n)}(x_1, x_2, \ldots, x_n) := \det \begin{bmatrix} x_1^{\lambda_1 + n - 1} & x_2^{\lambda_1 + n - 1} & \cdots & x_n^{\lambda_1 + n - 1} \\ x_1^{\lambda_2 + n - 2} & x_2^{\lambda_2 + n - 2} & \cdots & x_n^{\lambda_2 + n - 2} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{\lambda_n} & x_2^{\lambda_n} & \cdots & x_n^{\lambda_n} \end{bmatrix}.$$

In particular, when $\lambda = (0, 0, \ldots, 0)$, it becomes the Vandermonde determinant, i.e.

$$a_{(n-1, n-2, \ldots, 0)}(x_1, x_2, \ldots, x_n) = \prod_{1 \leq j < k \leq n} (x_j - x_k).$$

Then, Schur polynomials are defined to be

$$s_\lambda(x_1, x_2, \ldots, x_n) := \frac{a_{(\lambda_1 + n - 1, \lambda_2 + n - 2, \ldots, \lambda_n)}(x_1, x_2, \ldots, x_n)}{a_{(n-1, n-2, \ldots, 0)}(x_1, x_2, \ldots, x_n)}.$$
It is known that $s_\lambda(x_1, x_2, \ldots, x_n)$ can be written as $\sum_Y \mathbf{x}^Y$ where the summation is over all semi-standard Young tableaux $Y$ of shape $\lambda$. Here, each term $\mathbf{x}^Y$ means $x_1^{y_1} \cdots x_n^{y_n}$ where $y_i$ is the number of occurrences of the number $i$ in $Y$; note that $\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \lambda_i$. Also, a semi-standard Young tableau $Y$ of shape $\lambda = (\lambda_1, \ldots, \lambda_n)$ can be represented by a finite collection of boxes arranged in left-justified rows, where the $i$-th row has length $\lambda_i$ and each box is filled with a number from $1$ to $n$ such that the numbers in each row are non-decreasing and the numbers in each column are increasing. To avoid overcomplicating our argument, when we count the number of semi-standard Young tableaux of some shape, we only use a loose bound.
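For intuition, the bialternant definition can be checked on a tiny example of our choosing: with $\lambda = (2, 1)$ and $n = 2$, the SSYT expansion gives $s_{(2,1)}(x_1, x_2) = x_1^2 x_2 + x_1 x_2^2$, and the determinant ratio agrees. A sketch with an exact-arithmetic determinant:

```python
from fractions import Fraction
from itertools import permutations

def det(A):
    """Determinant via the Leibniz formula (adequate for tiny matrices)."""
    n = len(A)
    total = Fraction(0)
    for perm in permutations(range(n)):
        # The sign of the permutation is (-1)^{number of inversions}.
        sign = 1
        for i in range(n):
            for j in range(i + 1, n):
                if perm[i] > perm[j]:
                    sign = -sign
        prod = Fraction(1)
        for r in range(n):
            prod *= A[r][perm[r]]
        total += sign * prod
    return total

def schur(lam, xs):
    """Schur polynomial s_lambda(x_1,...,x_n) as the bialternant ratio
    a_{(lambda + delta)} / a_{delta} with delta = (n-1, n-2, ..., 0)."""
    n = len(xs)
    num = det([[x ** (lam[r] + n - 1 - r) for x in xs] for r in range(n)])
    den = det([[x ** (n - 1 - r) for x in xs] for r in range(n)])
    return num / den

# s_{(2,1)}(x1, x2) = x1^2 x2 + x1 x2^2; at (2, 3) this is 4*3 + 2*9 = 30.
assert schur([2, 1], [Fraction(2), Fraction(3)]) == 30
```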
5 Proof Overview
Recall that our setting is the following (cf. (1)): We are given samples drawn from the following sampling procedure. Each time, a sample is drawn from a mixture of Gaussians

$$\sum_{i=1}^{k} w_i \mathcal{N}(\mu_i, \sigma^2)$$

where $w_i > 0$, $\sum_{i=1}^{k} w_i = 1$, $\mu_i \in \mathbb{R}$ and $\sigma > 0$. If this sample is inside $S$, we obtain this sample; otherwise, we fail to generate a sample. Our goal is to learn $w_i$ and $\mu_i$.
One useful way to view a mixture of Gaussians is to express it as

$$\left(\sum_{i=1}^{k} w_i \delta_{\mu_i}\right) * \mathcal{N}(0, \sigma^2)$$

where $\delta_{\mu_i}$ is the delta distribution at $\mu_i$ and $*$ is the convolution operator. We call the distribution $\sum_{i=1}^{k} w_i \delta_{\mu_i}$ the mixing distribution. Let $\mathbf{m}_j$ be the $j$-th moment of the mixing distribution, i.e.

$$\mathbf{m}_j := \sum_{i=1}^{k} w_i \mu_i^j.$$
Since we assume that the variance is known, without loss of generality, we set $\sigma = 1$; otherwise, we can rescale all samples so that $\sigma = 1$. First, we reduce the problem to estimating $\mathbf{m}_j$, so that we can employ known results on estimating mixtures of Gaussians using the method of moments. For example, Wu and Yang [38] proved the following theorem.
Theorem 4 (Denoised method of moments, [38]).

Suppose $\mathbf{m}_j$ are the moments of a distribution supported on $k$ points of $\mathbb{R}$, i.e. $\mathbf{m}_j$ has the form $\sum_{i=1}^{k} w_i \mu_i^j$ where $w_i > 0$, $\sum_{i=1}^{k} w_i = 1$ and $\mu_i \in \mathbb{R}$. Let $w_{\min}$ be $\min\{w_i \mid i = 1, \ldots, k\}$ and $\Delta_{\min}$ be $\min\{|\mu_i - \mu_j| \mid i, j = 1, \ldots, k \text{ and } i \neq j\}$. For any $\delta > 0$, let $\hat{\mathbf{m}}_j$ be numbers that satisfy

$$|\hat{\mathbf{m}}_j - \mathbf{m}_j| < \delta \quad \text{for all } j = 1, \ldots, 2k-1.$$

Then, if $w_{\min}$ and $\Delta_{\min}$ satisfy $w_{\min}, \Delta_{\min} = \Omega(\delta^{O(\frac{1}{k})})$, there is an algorithm that takes $\hat{\mathbf{m}}_j$ as the input and outputs $\hat{w}_i, \hat{\mu}_i$ such that, up to an index permutation $\Pi$,

$$|\hat{w}_{\Pi(i)} - w_i| < C_k \cdot \frac{\delta^{\Omega(\frac{1}{k})}}{w_{\min}} \quad \text{and} \quad |\hat{\mu}_{\Pi(i)} - \mu_i| < C_k \cdot \frac{\delta^{\Omega(\frac{1}{k})}}{\Delta_{\min}}$$

where $C_k$ is a constant depending on $k$ only.
Unfortunately, unlike with fully observed mixtures of Gaussians, estimating these moments is no longer straightforward. As we will see in our proof, the existence of unbiased estimators relies on specific structures of Gaussians; when the data is censored, such structures may not exist. Hence, we look for a biased estimator and provide a delicate analysis to bound the bias. To see how we can estimate $\mathbf{m}_j$, we first express the mixture in terms of $\mathbf{m}_j$. Suppose $X$ is the random variable drawn from the sampling procedure conditioned on non-FAIL samples. For any function $f$, the expectation of $f(X)$ is
$$\mathsf{E}(f(X)) = \int_S f(x) \cdot \left(\sum_{i=1}^{k} w_i g_{\mu_i,1,S}(x)\right) \mathsf{d}x = \frac{1}{\alpha} \cdot \int_S f(x) \cdot \left(\sum_{i=1}^{k} w_i g_{\mu_i,1}(x)\right) \mathsf{d}x. \quad (3)$$
Recall that $\alpha$ is the probability mass $\sum_{i=1}^{k} w_i I_{\mu_i,1}(S)$. Note that, for any $\mu$,

$$g_{\mu,1}(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x-\mu)^2} = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2} e^{x\mu - \frac{1}{2}\mu^2} = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2} \sum_{j=0}^{\infty} h_j(x) \frac{\mu^j}{j!} \quad (4)$$
where $h_j$ is the $j$-th Hermite polynomial and the last equality follows from (2). In other words, when we plug (4) into (3), we have

$$\alpha \cdot \mathsf{E}(f(X)) = \int_S f(x) \cdot \left(\sum_{i=1}^{k} w_i \cdot \left(\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2} \sum_{j=0}^{\infty} h_j(x) \frac{\mu_i^j}{j!}\right)\right) \mathsf{d}x = \sum_{j=0}^{\infty} \left(\int_S f(x) \cdot \frac{1}{\sqrt{2\pi}\,j!} e^{-\frac{1}{2}x^2} h_j(x)\,\mathsf{d}x\right) \cdot \left(\sum_{i=1}^{k} w_i \mu_i^j\right).$$
To ease the notation, for any function $f$ and nonnegative integer $j$, we define

$$J_{f,j} := \int_S f(x) \cdot \frac{1}{\sqrt{2\pi}\,j!} e^{-\frac{1}{2}x^2} h_j(x)\,\mathsf{d}x. \quad (5)$$

If we plug $J_{f,j}$ and $\mathbf{m}_j$ into the equation for $\alpha \cdot \mathsf{E}(f(X))$, we have

$$\alpha \cdot \mathsf{E}(f(X)) = \sum_{j=0}^{\infty} J_{f,j} \cdot \mathbf{m}_j. \quad (6)$$
Ideally, if we manage to find $2k-1$ functions $f_1, \ldots, f_{2k-1}$ such that

$$J_{f_i,j} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j, \end{cases}$$

then we have

$$\alpha \cdot \mathsf{E}(f_i(X)) = \mathbf{m}_i \quad \text{for all } i = 1, \ldots, 2k-1.$$

This means that we have an unbiased estimator for $\mathbf{m}_i$, and therefore we just need to determine the number of samples we need by bounding the variance. Indeed, if $S = \mathbb{R}$ and we pick $f_i$ to be the $i$-th Hermite polynomial $h_i$, then the aforementioned conditions hold; this is how [38] managed to show their result. However, when $S \neq \mathbb{R}$, it becomes trickier.
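To illustrate the uncensored case $S = \mathbb{R}$, the following Monte Carlo sketch (a toy mixture of our choosing, not from the paper) checks that the empirical average of $h_j(X)$ tracks the mixing-distribution moment $\mathbf{m}_j$:

```python
import math
import random

def hermite(j, x):
    # Probabilists' Hermite polynomial via the explicit formula.
    return math.factorial(j) * sum(
        (-0.5) ** i / (math.factorial(i) * math.factorial(j - 2 * i)) * x ** (j - 2 * i)
        for i in range(j // 2 + 1)
    )

random.seed(0)
w, mu = [0.3, 0.7], [-1.0, 2.0]      # toy mixture with sigma = 1, no censoring
n = 200_000
samples = [random.gauss(mu[0], 1.0) if random.random() < w[0] else random.gauss(mu[1], 1.0)
           for _ in range(n)]

for j in (1, 2, 3):
    m_j = sum(wi * mi ** j for wi, mi in zip(w, mu))   # m_j of the mixing distribution
    est = sum(hermite(j, x) for x in samples) / n      # empirical E[h_j(X)]
    assert abs(est - m_j) < 0.15                        # well within sampling error
```

Once samples outside $S$ are discarded, $\mathsf{E}(h_j(X))$ no longer equals $\mathbf{m}_j$, which is what motivates the corrected linear combinations discussed next.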
A natural extension is to pick $f_i$ to be a linear combination of Hermite polynomials, i.e.

$$f_i = \sum_{a=0}^{\ell-1} \beta_{i,a} h_a$$

for some positive integer $\ell$. The integer $\ell$ is a parameter indicating how accurate our estimator is; indeed, $\ell \to \infty$ as $\varepsilon \to 0$, as we will show in our proof. For each $f_i$, there are $\ell$ coefficients $\beta_{i,a}$ in the expression, and therefore we can enforce $\ell$ terms of $J_{f_i,j}$ to take the desired values. More precisely, we can set $\beta_{i,a}$ such that

$$J_{f_i,j} = \int_S f_i(x) \cdot \frac{1}{\sqrt{2\pi}\,j!} e^{-\frac{1}{2}x^2} h_j(x)\,\mathsf{d}x = \sum_{a=0}^{\ell-1} \beta_{i,a} \int_S h_a(x) \cdot \frac{1}{\sqrt{2\pi}\,j!} e^{-\frac{1}{2}x^2} h_j(x)\,\mathsf{d}x = \sum_{a=0}^{\ell-1} \beta_{i,a} J_{h_a,j} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$

for $j = 0, \ldots, \ell-1$. If we assume the integrals can be computed exactly, then all $J_{h_a,j}$ are known. Hence, we can find $\beta_{i,a}$ by solving this system of linear equations.
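This construction can be sketched numerically. In the toy code below, the parameters ($\ell = 8$, the windows, the quadrature grid) are illustrative assumptions of ours, and trapezoidal quadrature stands in for the exact integrals assumed by the paper. A very wide window recovers $\beta_i \approx \mathbf{e}_i$, i.e. $f_i$ collapses back to the Hermite polynomial $h_i$ of the uncensored case, while a genuinely censoring window produces a nontrivial correction:

```python
import math
import numpy as np

def hermite(j, x):
    # Probabilists' Hermite polynomial h_j via the explicit formula; x may be an array.
    return math.factorial(j) * sum(
        (-0.5) ** i / (math.factorial(i) * math.factorial(j - 2 * i)) * x ** (j - 2 * i)
        for i in range(j // 2 + 1)
    )

def build_V(R, ell, grid_pts=40_001):
    """V[j][a] = J_{h_a, j}: the integral over S = [-R, R] of
    h_a(x) * e^{-x^2/2} * h_j(x) / (sqrt(2 pi) j!), by trapezoidal quadrature."""
    x = np.linspace(-R, R, grid_pts)
    dx = x[1] - x[0]
    phi = np.exp(-x ** 2 / 2) / math.sqrt(2 * math.pi)
    H = [hermite(j, x) for j in range(ell)]

    def integrate(f):  # trapezoidal rule on the uniform grid
        return (f.sum() - 0.5 * (f[0] + f[-1])) * dx

    return np.array([[integrate(H[a] * phi * H[j]) / math.factorial(j)
                      for a in range(ell)] for j in range(ell)])

ell, i = 8, 1
e_i = np.eye(ell)[i]

# Very wide window: censoring is negligible, V is close to the identity, beta_i ~ e_i.
beta_wide = np.linalg.solve(build_V(8.0, ell), e_i)
assert np.allclose(beta_wide, e_i, atol=1e-3)

# Genuine censoring (R = 2.5): the linear combination is a nontrivial correction.
beta_censored = np.linalg.solve(build_V(2.5, ell), e_i)
assert not np.allclose(beta_censored, e_i, atol=1e-3)
```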
Now, if we plug them into (6), we have

$$\alpha \cdot \mathsf{E}(f_i(X)) = \mathbf{m}_i + \underbrace{\sum_{j=\ell}^{\infty} J_{f_i,j} \cdot \mathbf{m}_j}_{:=\,\mathcal{E}_i}.$$

Note that the term $\mathbf{m}_i$ is what we aim at, and hence the term $\mathcal{E}_i$ is the error term. Indeed, our estimator is a biased estimator whose bias is $\mathcal{E}_i$. Thanks to the factor $\frac{1}{j!}$ in the term $J_{f_i,j}$, intuitively, the term $\mathcal{E}_i \to 0$ as $\ell \to \infty$.
Define our estimator $\hat{\mathbf{m}}_i$ to be

$$\hat{\mathbf{m}}_i = \frac{1}{n} \left(\sum_{s=1}^{n'} f_i(x_s)\right) \quad (7)$$

where $n'$ is the number of samples that are non-FAIL and $x_s$ are the non-FAIL samples. Note that, on average, $\frac{1}{n} \approx \frac{\alpha}{n'}$, so normalizing by $n$ instead of $n'$ supplies the factor $\alpha$ implicitly. Then, by Chebyshev's inequality, we have

$$|\hat{\mathbf{m}}_i - \mathbf{m}_i| < \delta + |\mathcal{E}_i| \quad \text{with probability } 1 - \frac{\mathsf{Var}(\hat{\mathbf{m}}_i)}{\delta^2}.$$
Now, we break the problem down into the following two subproblems.

- How large does $\ell$ need to be in order to make $|\mathcal{E}_i| < \delta$?
- Given $\delta > 0$, how many samples do we need to make the variance $\mathsf{Var}(\hat{\mathbf{m}}_i) < \frac{\delta^2}{100}$, and hence the success probability larger than $\frac{99}{100}$?
Detailed proofs are deferred to the appendix.
5.1 Bounds for the Number of Terms
To see how large ℓ needs to be, we first define the following notations. Let 𝑣 ( 𝑗 ) be the ℓ -dimensional vector whose 𝑎 -th entry is 𝐽 ℎ 𝑎 , 𝑗 , i.e.
𝑣 ( 𝑗 )
[ 𝐽 ℎ 0 , 𝑗
𝐽 ℎ 1 , 𝑗
⋯
𝐽 ℎ ℓ − 1 , 𝑗 ] ⊤ ,
and 𝑉 be the the ℓ -by- ℓ matrix whose 𝑟 -th row is ( 𝑣 ( 𝑟 ) ) ⊤ , i.e.
𝑉
[ 𝑣 ( 0 )
𝑣 ( 1 )
⋯
𝑣 ( ℓ − 1 ) ] ⊤ . (8)
Recall that, by definition, $\beta_{i,a}$ satisfies

$$\sum_{a=0}^{\ell-1} \beta_{i,a} J_{h_a,j} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j. \end{cases}$$

We can rewrite this as a system of linear equations:

$$V \beta_i = \mathbf{e}_i \quad (9)$$

where $\beta_i$ is the $\ell$-dimensional vector whose $a$-th entry is $\beta_{i,a}$ and $\mathbf{e}_i$ is the $\ell$-dimensional canonical vector which is zero everywhere except that the $i$-th entry is $1$, i.e.

$$\beta_i = \begin{bmatrix} \beta_{i,0} & \beta_{i,1} & \cdots & \beta_{i,\ell-1} \end{bmatrix}^\top \quad \text{and} \quad \mathbf{e}_i = \begin{bmatrix} 0 & \cdots & 1 & \cdots & 0 \end{bmatrix}^\top.$$

Namely, we have $\beta_i = V^{-1} \mathbf{e}_i$. Recall that the definition of $\mathcal{E}_i$ is

$$\mathcal{E}_i = \sum_{j=\ell}^{\infty} J_{f_i,j} \cdot \mathbf{m}_j.$$
To bound the term $J_{f_i,j}$, observe that

$$J_{f_i,j} = \sum_{a=0}^{\ell-1} \beta_{i,a} J_{h_a,j} = (v^{(j)})^\top V^{-1} \mathbf{e}_i$$

and, by Cramer's rule, $J_{f_i,j}$ can be expressed as

$$J_{f_i,j} = \frac{\det(V^{(i \to j)})}{\det(V)}$$

where $V^{(i \to j)}$ is the same matrix as $V$ except that the $i$-th row is replaced with $v^{(j)}$, i.e.

$$V^{(i \to j)} = \begin{bmatrix} v^{(0)} & \cdots & v^{(i-1)} & v^{(j)} & v^{(i+1)} & \cdots & v^{(\ell-1)} \end{bmatrix}^\top \quad (10)$$

for $i = 1, \ldots, 2k-1$ and $j \geq \ell$. The right arrow in the superscript indicates the row replacement. We preview that there are column replacements in our calculation, and we will use left arrows to indicate them.
In Lemma 8, we show that

$$|J_{f_i,j}| = \frac{|\det(V^{(i \to j)})|}{|\det(V)|} \leq \frac{1}{2^{\Omega(j \log j)}}.$$

Also, by the assumption that $|\mu_i| < M$ where $M$ is a constant, we have $|\mathbf{m}_j| \leq M^j$. Hence, we prove that

$$|\mathcal{E}_i| \leq \sum_{j=\ell}^{\infty} |J_{f_i,j}|\,|\mathbf{m}_j| \leq \sum_{j=\ell}^{\infty} \frac{1}{2^{\Omega(j \log j)}} \cdot M^j \leq \frac{1}{2^{\Omega(\ell \log \ell)}} \cdot M^\ell \leq \delta$$

as long as $\ell = \Omega\left(\frac{\log \frac{1}{\delta}}{\log \log \frac{1}{\delta}}\right)$.
Hence, we have the following lemma.
Lemma 5.

For a sufficiently small $\delta > 0$, when $\ell = \Omega\left(\frac{\log \frac{1}{\delta}}{\log \log \frac{1}{\delta}}\right)$, the estimator $\hat{\mathbf{m}}_i$ computed by (7) satisfies

$$|\hat{\mathbf{m}}_i - \mathbf{m}_i| < 2\delta \quad \text{with probability } 1 - \frac{\mathsf{Var}(\hat{\mathbf{m}}_i)}{\delta^2}.$$
5.2 Bounds for the Variance
Recall that our second subproblem is to bound the variance of our estimator. To bound $\mathsf{Var}(\hat{\mathbf{m}}_i)$, observe that

$$\mathsf{Var}(\hat{\mathbf{m}}_i) \leq \frac{\alpha}{n}\,\mathsf{E}(f_i(X)^2) = \frac{\alpha}{n}\,\mathsf{E}\left(\left(\sum_{a=0}^{\ell-1} \beta_{i,a} h_a(X)\right)^2\right) \leq \frac{\alpha}{n} \left(\sum_{a=0}^{\ell-1} |\beta_{i,a}| \sqrt{\mathsf{E}(h_a(X)^2)}\right)^2. \quad (11)$$
By expanding the expectation explicitly,

$$\mathsf{E}(h_a(X)^2) = \int_S h_a(x)^2 \cdot \left(\sum_{i=1}^{k} w_i g_{\mu_i,1,S}(x)\right) \mathsf{d}x \leq \frac{1}{\alpha} \int_{\mathbb{R}} h_a(x)^2 \cdot \left(\sum_{i=1}^{k} w_i g_{\mu_i,1}(x)\right) \mathsf{d}x \leq \frac{1}{\alpha} \left(O(M + \sqrt{a})\right)^{2a}. \quad (12)$$

The last inequality comes from Lemma 5 of [38], where it is shown that

$$\int_{\mathbb{R}} h_a(x)^2 \cdot \left(\sum_{i=1}^{k} w_i g_{\mu_i,1}(x)\right) \mathsf{d}x \leq \left(O(M + \sqrt{a})\right)^{2a}.$$
Now, we also need to bound $|\beta_{i,a}|$. Recall that $\beta_i = V^{-1} \mathbf{e}_i$. By Cramer's rule, each coordinate of $\beta_i$ is

$$\beta_{i,a} = \frac{\det(V^{(\mathbf{e}_i \leftarrow a)})}{\det(V)}$$

where $V^{(\mathbf{e}_i \leftarrow a)}$ is the same matrix as $V$ except that the $a$-th column is replaced with $\mathbf{e}_i$, i.e.

$$V^{(\mathbf{e}_i \leftarrow a)} = \begin{bmatrix} v^{(0)}_0 & \cdots & v^{(0)}_{a-1} & 0 & v^{(0)}_{a+1} & \cdots & v^{(0)}_{\ell-1} \\ \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ v^{(i)}_0 & \cdots & v^{(i)}_{a-1} & 1 & v^{(i)}_{a+1} & \cdots & v^{(i)}_{\ell-1} \\ \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ v^{(\ell-1)}_0 & \cdots & v^{(\ell-1)}_{a-1} & 0 & v^{(\ell-1)}_{a+1} & \cdots & v^{(\ell-1)}_{\ell-1} \end{bmatrix}. \quad (13)$$

In Lemma 9, we show that

$$|\beta_{i,a}| \leq 2^{O(\ell \log \ell)}. \quad (14)$$
Therefore, if we plug (12) and (14) into (11), we have the following lemma.
Lemma 6.
For any positive integer $\ell$, the estimator $\hat{\mathbf{m}}_i$ computed by (7) has variance

$$\mathsf{Var}(\hat{\mathbf{m}}_i) \leq \frac{1}{n} \cdot 2^{O(\ell \log \ell)}.$$
5.3 Full Algorithm and Main Theorem

Algorithm 1 Learning mixtures of Gaussians with censored data

Input: $n$ iid samples $x_1, \ldots, x_n$, number of Gaussians $k$, parameter $\ell$, mean boundary parameter $M$, sample domain $S = [-R, R]$

1: for $i = 0$ to $2k-1$ do
2:   solve (9) to obtain $\beta_i = (\beta_{i,0}, \beta_{i,1}, \ldots, \beta_{i,\ell-1})^\top$, i.e. solve the system of linear equations $V \beta_i = \mathbf{e}_i$ where the $(r, c)$-entry of $V$ is

$$\int_S \frac{1}{\sqrt{2\pi}\,r!} e^{-\frac{1}{2}x^2} h_c(x) h_r(x)\,\mathsf{d}x \quad (15)$$

and $\mathbf{e}_i$ is the canonical vector
3: for each sample $x_s$ do
4:   compute $\hat{f}_i(x_s) := f_i(x_s)$ if $x_s$ is non-$\mathsf{FAIL}$, and $\hat{f}_i(x_s) := 0$ if $x_s$ is $\mathsf{FAIL}$; recall that $f_i$ is $f_i(x) = \sum_{a=0}^{\ell-1} \beta_{i,a} h_a(x)$ and $h_a$ is the $a$-th Hermite polynomial $h_a(x) = a! \sum_{j=0}^{\lfloor a/2 \rfloor} \frac{(-1/2)^j}{j!\,(a-2j)!}\, x^{a-2j}$
5: for $i = 1$ to $2k-1$ do
6:   compute $\hat{\mathbf{m}}_i = \frac{1}{n} \sum_{s=1}^{n} \hat{f}_i(x_s)$, which is the same as the estimator defined in (7)
7: let $\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_k$ and $\hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_k$ be the output of Algorithm 2 using $\hat{\mathbf{m}} = (\hat{\mathbf{m}}_1, \ldots, \hat{\mathbf{m}}_{2k-1})$ and $M$ as the input

Output: estimated weights $\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_k$ and estimated means $\hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_k$
In this subsection, we present the full algorithm and combine it with the analysis in the previous subsections to prove our main theorem.

Proof of Theorem 1.

Suppose we are given $n$ iid samples $x_1, \ldots, x_n$ from the distribution (1). We will show that the estimated weights $\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_k$ and the estimated means $\hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_k$ output by Algorithm 1 on input $x_1, \ldots, x_n$ satisfy the desired guarantees.
By Lemma 5, when $\ell = \Omega\left(\frac{\log \frac{1}{\delta}}{\log \log \frac{1}{\delta}}\right)$, we have

$$|\hat{\mathbf{m}}_i - \mathbf{m}_i| < 2\delta \quad \text{with probability } 1 - \frac{\mathsf{Var}(\hat{\mathbf{m}}_i)}{\delta^2}$$

where $\hat{\mathbf{m}}_i$ are computed in Algorithm 1. Moreover, by Lemma 6, we have

$$\mathsf{Var}(\hat{\mathbf{m}}_i) \leq \frac{1}{n} \cdot 2^{O(\ell \log \ell)},$$

which implies that when $\ell = \Omega\left(\frac{\log \frac{1}{\delta}}{\log \log \frac{1}{\delta}}\right)$ the failure probability is less than

$$\frac{\mathsf{Var}(\hat{\mathbf{m}}_i)}{\delta^2} \leq \frac{1}{n} \cdot \mathsf{poly}\left(\frac{1}{\delta}\right).$$

By applying the union bound over all $i = 1, 2, \ldots, 2k-1$, when $n = \Omega(\mathsf{poly}(\frac{1}{\delta}))$, we have

$$|\hat{\mathbf{m}}_i - \mathbf{m}_i| < 2\delta \quad \text{with probability } \frac{99}{100}.$$
In [38], it is shown that Algorithm 2 is the algorithm that makes the guarantees in Theorem 4 hold. Therefore, if we pick $\delta = \varepsilon^{\Omega(k)}$, along with the assumption $w_{\min}, \Delta_{\min} = \Omega(\varepsilon)$, we have, up to an index permutation $\Pi$,

$$|\hat{w}_{\Pi(i)} - w_i| < \varepsilon, \quad |\hat{\mu}_{\Pi(i)} - \mu_i| < \varepsilon \quad \text{for } i = 1, \ldots, k.$$
We now examine the running time of Algorithm 1. It first takes $k \cdot \mathsf{poly}(\ell)$ time[^2] to obtain $\beta_i$. Then, it takes $n \cdot k \cdot \mathsf{poly}(\ell)$ time to compute $\hat{\mathbf{m}}_i$. Finally, the running time of Algorithm 2 is $\mathsf{poly}(k)$. Hence, by plugging in $\ell = O(k \log \frac{1}{\varepsilon})$, the running time of Algorithm 1 is $n \cdot \mathsf{poly}(k, \frac{1}{\varepsilon})$.

[^2]: Computing the integral in (15) can be reduced to computing the integral $\int_0^z e^{-\frac{1}{2}t^2}\,\mathsf{d}t$ by observing that $h_c$ and $h_r$ are polynomials and using integration by parts. If we remove the assumption that the exact computation can be done, we need to approximate the integral up to an additive error of $1/2^{\mathsf{poly}(k, \ell)}$. One can approximate the integral at an exponential convergence rate by Taylor expansion, and hence the running time is still $k \cdot \mathsf{poly}(\ell)$ for this step.
∎
Algorithm 2 Denoised method of moments [38]

Input: estimated moments $\hat{\mathbf{m}} = (\hat{\mathbf{m}}_1, \ldots, \hat{\mathbf{m}}_{2k-1})$, mean boundary parameter $M$

1: let $\mathbf{m}^* = (\mathbf{m}^*_1, \ldots, \mathbf{m}^*_{2k-1})$ be the optimal solution of the following convex optimization problem

$$\arg\min_{\mathbf{m}} \|\hat{\mathbf{m}} - \mathbf{m}\| \quad \text{s.t.} \quad M \cdot \mathbf{M}_{0,2k-2} \succeq \mathbf{M}_{1,2k-1} \succeq -M \cdot \mathbf{M}_{0,2k-2}$$

where $\mathbf{M}_{i,j}$ is the Hankel matrix whose entries are $\mathbf{m}_i, \ldots, \mathbf{m}_j$, i.e.

$$\mathbf{M}_{i,j} = \begin{bmatrix} \mathbf{m}_i & \mathbf{m}_{i+1} & \cdots & \mathbf{m}_{\frac{i+j}{2}} \\ \mathbf{m}_{i+1} & \mathbf{m}_{i+2} & \cdots & \mathbf{m}_{\frac{i+j}{2}+1} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{m}_{\frac{i+j}{2}} & \mathbf{m}_{\frac{i+j}{2}+1} & \cdots & \mathbf{m}_j \end{bmatrix}$$

2: let $\hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_k$ be the roots of the polynomial $P$ where

$$P(x) = \det \begin{bmatrix} 1 & \mathbf{m}^*_1 & \cdots & \mathbf{m}^*_k \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{m}^*_{k-1} & \mathbf{m}^*_k & \cdots & \mathbf{m}^*_{2k-1} \\ 1 & x & \cdots & x^k \end{bmatrix}$$

3: let $(\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_k)^\top$ be the solution of the following system of linear equations

$$\begin{bmatrix} 1 & 1 & \cdots & 1 \\ \hat{\mu}_1 & \hat{\mu}_2 & \cdots & \hat{\mu}_k \\ \vdots & \vdots & \ddots & \vdots \\ \hat{\mu}_1^{k-1} & \hat{\mu}_2^{k-1} & \cdots & \hat{\mu}_k^{k-1} \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_k \end{bmatrix} = \begin{bmatrix} 1 \\ \mathbf{m}^*_1 \\ \vdots \\ \mathbf{m}^*_{k-1} \end{bmatrix}$$

Output: estimated weights $\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_k$ and estimated means $\hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_k$
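To make steps 2 and 3 concrete, here is a small numerical sketch. The mixture ($k = 2$, weights $(0.3, 0.7)$, means $(-1, 2)$) is our toy example, and we feed in exact moments in place of the denoised $\mathbf{m}^*$, so step 1's projection is skipped:

```python
import numpy as np

# Toy mixing distribution: weights (0.3, 0.7), means (-1, 2), so k = 2.
w, mu = np.array([0.3, 0.7]), np.array([-1.0, 2.0])
k = 2
m = [np.sum(w * mu ** j) for j in range(2 * k)]   # m[0..3] = 1, m_1, m_2, m_3

# Step 2: the means are the roots of P(x) = det of the (k+1) x (k+1) moment
# matrix whose last row is (1, x, ..., x^k).  Expanding the determinant along
# that row yields the coefficients of P via cofactors.
top = np.array([[m[r + c] for c in range(k + 1)] for r in range(k)])  # first k rows
coeffs = []
for c in range(k + 1):                        # cofactor of the entry x^c
    minor = np.delete(top, c, axis=1)
    coeffs.append((-1) ** (k + c) * np.linalg.det(minor))
mu_hat = np.sort(np.roots(coeffs[::-1]).real)  # np.roots wants highest degree first

# Step 3: recover the weights from the Vandermonde system.
V = np.vander(mu_hat, k, increasing=True).T   # rows: 1, mu, ..., mu^{k-1}
w_hat = np.linalg.solve(V, np.array(m[:k]))

assert np.allclose(mu_hat, [-1.0, 2.0], atol=1e-6)
assert np.allclose(w_hat, [0.3, 0.7], atol=1e-6)
```

With exact moments the recovery is exact up to floating-point error; the point of the denoising step 1 is precisely to make this root-finding stable when only noisy moment estimates are available.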
6 Conclusion and Discussion
In this paper, we study the classical problem of learning mixtures of Gaussians with censored data. The problem is more challenging than learning with uncensored data because the data are only partially observed. Our result shows that there is an efficient algorithm to estimate the weights and the means of the Gaussians. Specifically, we show that one only needs $\frac{1}{\varepsilon^{O(k)}}$ censored samples to estimate the weights and the means within $\varepsilon$ error. To the best of our knowledge, this is the first finite-sample bound for the problem of learning mixtures of Gaussians with censored data, even in the simple setting where the Gaussians are univariate and homogeneous.
There are multiple natural extensions of this setting. One is to consider mixtures of multivariate Gaussians. Without truncation or censoring, one popular approach to learning mixtures of multivariate Gaussians is to apply random projections and reduce the problem to univariate Gaussians. This approach relies on the fact that the projection of a mixture of Gaussians is also a mixture of Gaussians. Unfortunately, this fact is no longer true when the data are truncated or censored.
Another interesting direction is to relax the assumption of known and homogeneous variances to unknown and/or non-homogeneous variances. When the Gaussians are homogeneous and the samples are not truncated or censored, one can estimate the variance by computing the pairwise distances among $k+1$ samples and taking their minimum: by the pigeonhole principle, two of the samples come from the same Gaussian, and the expected value of their squared distance is twice the variance. This becomes more challenging when the samples are truncated or censored, because the expected value of the squared distance may no longer be proportional to the variance.
Furthermore, previous results indicate that, in the uncensored setting, sample bounds can be improved when the centers of the Gaussians in the mixture are well-separated [27, 34, 33]. An interesting direction for future research would be to improve our results under stronger separation assumptions on the components. For example, one strategy to exploit separation is to apply the Fourier transform to the pdf of the mixture. With uncensored samples, it is straightforward to estimate the Fourier transform; however, when the pdf is truncated, a challenge arises because the Fourier transform may not take the convenient form required by these analyses. We anticipate that delicate modifications may still be needed, and leave this open to future work.
References

[1] T. Amemiya. Regression analysis when the dependent variable is truncated normal. Econometrica: Journal of the Econometric Society, pages 997–1016, 1973.
[2] N. Balakrishnan and E. Cramer. The Art of Progressive Censoring. Statistics for Industry and Technology, 2014.
[3] D. Bernoulli. Essai d'une nouvelle analyse de la mortalité causée par la petite vérole, et des avantages de l'inoculation pour la prévenir. Histoire de l'Acad., Roy. Sci. (Paris) avec Mém., pages 1–45, 1760.
[4] A. C. Cohen. Truncated and Censored Samples: Theory and Applications. CRC Press, 2016.
[5] S. Dasgupta. Learning mixtures of Gaussians. In 40th Annual Symposium on Foundations of Computer Science (FOCS), pages 634–644. IEEE, 1999.
[6] C. Daskalakis, C. Tzamos, and M. Zampetakis. Ten steps of EM suffice for mixtures of two Gaussians. In Conference on Learning Theory, pages 704–710. PMLR, 2017.
[7] C. Daskalakis, T. Gouleakis, C. Tzamos, and M. Zampetakis. Efficient statistics, in high dimensions, from truncated samples. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 639–649. IEEE, 2018.
[8] C. Daskalakis, T. Gouleakis, C. Tzamos, and M. Zampetakis. Computationally and statistically efficient truncated regression. In Conference on Learning Theory, pages 955–960. PMLR, 2019.
[9] I. Diakonikolas and D. M. Kane. Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911, 2019.
[10] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Being robust (in high dimensions) can be practical. In International Conference on Machine Learning, pages 999–1008. PMLR, 2017.
[11] I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Robustly learning a Gaussian: Getting optimal error, efficiently. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2683–2702. SIAM, 2018.
[12] I. Diakonikolas, G. Kamath, D. Kane, J. Li, A. Moitra, and A. Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742–864, 2019.
[13] N. Doss, Y. Wu, P. Yang, and H. H. Zhou. Optimal estimation of high-dimensional Gaussian mixtures. arXiv preprint arXiv:2002.05818, 2020.
[14] R. Fisher. Properties and applications of Hh functions. Mathematical Tables, 1:815–852, 1931.
[15] F. Galton. An examination into the registered speeds of American trotting horses, with remarks on their value as hereditary data. Proceedings of the Royal Society of London, 62(379–387):310–315, 1898.
[16] M. Hardt and E. Price. Tight bounds for learning a mixture of two Gaussians. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 753–760, 2015.
[17] J. A. Hausman and D. A. Wise. Social experimentation, truncated distributions, and efficient estimation. Econometrica: Journal of the Econometric Society, pages 919–938, 1977.
[18] S. B. Hopkins, G. Kamath, and M. Majid. Efficient mean estimation with pure differential privacy via a sum-of-squares exponential mechanism. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 1406–1417, 2022.
[19] A. T. Kalai, A. Moitra, and G. Valiant. Efficiently learning mixtures of two Gaussians. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, pages 553–562, 2010.
[20] K. A. Lai, A. B. Rao, and S. Vempala. Agnostic estimation of mean and covariance. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674. IEEE, 2016.
[21] A. Lee. Table of the Gaussian "tail" functions; when the "tail" is larger than the body. Biometrika, 10(2/3):208–214, 1914.
[22] G. Lee and C. Scott. EM algorithms for multivariate Gaussian mixture models with truncated and censored data. Computational Statistics & Data Analysis, 56(9):2816–2829, 2012.
[23] B. G. Lindsay. Mixture models: theory, geometry and applications. In NSF-CBMS Regional Conference Series in Probability and Statistics, pages i–163. JSTOR, 1995.
[24] X. Liu, W. Kong, S. Kakade, and S. Oh. Robust and differentially private mean estimation. Advances in Neural Information Processing Systems, 34:3887–3901, 2021.
[25] G. S. Maddala. Limited-Dependent and Qualitative Variables in Econometrics. Number 3. Cambridge University Press, 1986.
[26] G. McLachlan and P. Jones. Fitting mixture models to grouped and truncated data via the EM algorithm. Biometrics, pages 571–578, 1988.
[27] A. Moitra. Super-resolution, extremal functions and the condition number of Vandermonde matrices. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 821–830, 2015.
[28] A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 93–102. IEEE, 2010.
[29] S. G. Nagarajan and I. Panageas. On the analysis of EM for truncated mixtures of two Gaussians. In Algorithmic Learning Theory, pages 634–659. PMLR, 2020.
[30] K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, 185:71–110, 1894.
[31] K. Pearson. On the systematic fitting of curves to observations and measurements. Biometrika, 1(3):265–303, 1902.
[32] K. Pearson and A. Lee. On the generalised probable error in multiple normal correlation. Biometrika, 6(1):59–68, 1908.
[33] M. Qiao, G. Guruganesh, A. Rawat, K. A. Dubey, and M. Zaheer. A Fourier approach to mixture learning. Advances in Neural Information Processing Systems, 35:20850–20861, 2022.
[34] O. Regev and A. Vijayaraghavan. On learning mixtures of well-separated Gaussians. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 85–96. IEEE, 2017.
[35] H. Schneider. Truncated and Censored Samples from Normal Populations. Marcel Dekker, Inc., 1986.
[36] J. Tobin. Estimation of relationships for limited dependent variables. Econometrica: Journal of the Econometric Society, pages 24–36, 1958.
[37] S. Vempala and G. Wang. A spectral algorithm for learning mixture models. Journal of Computer and System Sciences, 68(4):841–860, 2004.
[38] Y. Wu and P. Yang. Optimal estimation of Gaussian mixtures via denoised method of moments. arXiv preprint arXiv:1807.07237, 2018.
[39] J. Xu, D. J. Hsu, and A. Maleki. Global analysis of expectation maximization for mixtures of two Gaussians. Advances in Neural Information Processing Systems, 29, 2016.

Appendix A Proof
In this section, we will present the proofs of the lemmas.
Lemma 7.
Let $V$ be the matrix defined in (8), i.e. $V$ is the $\ell$-by-$\ell$ matrix whose $(r,c)$-entry is $J_{h_c,r}$ for $r,c = 0,1,\dots,\ell-1$. Recall that, from (5), $J_{h_c,r}$ is defined as
$$J_{h_c,r} = \int_S \frac{1}{\sqrt{2\pi}\,r!}\, e^{-\frac{1}{2}x^2}\, h_c(x)\, h_r(x)\,\mathsf{d}x.$$
Then the determinant of $V$ is
$$\det(V) = \left(\frac{1}{\sqrt{2\pi}}\right)^{\ell}\cdot\prod_{r=0}^{\ell-1}\frac{1}{r!}\cdot\int_{x_0>\cdots>x_{\ell-1},\,\mathbf{x}\in S^{\ell}} e^{-\frac{1}{2}\sum_{c=0}^{\ell-1}x_c^2}\cdot\prod_{0\le c_1<c_2\le\ell-1}(x_{c_1}-x_{c_2})^2\,\mathsf{d}\mathbf{x}.$$
Proof.
Since the $(r,c)$-entry of $V$ is
$$J_{h_c,r} = \int_S \frac{1}{\sqrt{2\pi}\,r!}\, e^{-\frac{1}{2}x^2} h_c(x) h_r(x)\,\mathsf{d}x,$$
by factoring out the term $\frac{1}{\sqrt{2\pi}\,r!}$ from each row, we have
$$\det(V) = \left(\frac{1}{\sqrt{2\pi}}\right)^{\ell}\cdot\prod_{r=0}^{\ell-1}\frac{1}{r!}\cdot\det(W) \quad (16)$$
where $W$ is the $\ell$-by-$\ell$ matrix whose $(r,c)$-entry is
$$W_{r,c} = \int_S e^{-\frac{1}{2}x^2} h_c(x) h_r(x)\,\mathsf{d}x. \quad (17)$$
By the Cauchy–Binet formula, we can further express $\det(W)$ as
$$\det(W) = \int_{x_0>\cdots>x_{\ell-1},\,\mathbf{x}\in S^{\ell}} \big(\det(U(\mathbf{x}))\big)^2\,\mathsf{d}\mathbf{x} \quad (18)$$
where $U(\mathbf{x})$ is the $\ell$-by-$\ell$ matrix whose $(r,c)$-entry is
$$U(\mathbf{x})_{r,c} = e^{-\frac{1}{4}x_c^2}\, h_r(x_c) \quad (19)$$
for any $\mathbf{x} = (x_0,\dots,x_{\ell-1})\in S^{\ell}$. By factoring out the term $e^{-\frac{1}{4}x_c^2}$ from each column, we have
$$\det(U(\mathbf{x})) = e^{-\frac{1}{4}\sum_{c=0}^{\ell-1}x_c^2}\,\det(P(\mathbf{x})) \quad (20)$$
where $P(\mathbf{x})$ is the $\ell$-by-$\ell$ matrix whose $(r,c)$-entry is
$$P(\mathbf{x})_{r,c} = h_r(x_c) \quad (21)$$
for any $\mathbf{x} = (x_0,\dots,x_{\ell-1})\in S^{\ell}$. Since $h_r$ is a polynomial of degree $r$ with leading coefficient $1$, by applying row and column operations, the determinant $\det(P(\mathbf{x}))$ is the same as the determinant of the Vandermonde matrix, i.e.
$$\det(P(\mathbf{x})) = \prod_{0\le c_1<c_2\le\ell-1}(x_{c_1}-x_{c_2}). \quad (22)$$
In other words, the determinant $\det(V)$ is
$$\det(V) = \left(\frac{1}{\sqrt{2\pi}}\right)^{\ell}\cdot\prod_{r=0}^{\ell-1}\frac{1}{r!}\cdot\int_{x_0>\cdots>x_{\ell-1},\,\mathbf{x}\in S^{\ell}} e^{-\frac{1}{2}\sum_{c=0}^{\ell-1}x_c^2}\cdot\prod_{0\le c_1<c_2\le\ell-1}(x_{c_1}-x_{c_2})^2\,\mathsf{d}\mathbf{x}.$$
∎
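This identity can be sanity-checked numerically. The following sketch is our own illustration (not part of the paper): it verifies the $\ell=2$ case on $S=[-1,1]$, with monic probabilists' Hermite polynomials $h_0(x)=1$ and $h_1(x)=x$, using simple trapezoidal quadrature.

```python
import numpy as np
from math import factorial, sqrt, pi

# Numerical check of Lemma 7 for ell = 2 on the censored set S = [-1, 1],
# with monic (probabilists') Hermite polynomials h_0(x) = 1, h_1(x) = x.
R, n = 1.0, 2001
xs = np.linspace(-R, R, n)
dx = xs[1] - xs[0]
w = np.full(n, dx)
w[0] = w[-1] = dx / 2  # trapezoid weights

def h(r, x):
    return np.ones_like(x) if r == 0 else x  # degrees 0 and 1 suffice here

# Left-hand side: V[r, c] = J_{h_c, r}
V = np.array([[np.sum(w * np.exp(-xs**2 / 2) * h(c, xs) * h(r, xs))
               / (sqrt(2 * pi) * factorial(r))
               for c in range(2)] for r in range(2)])

# Right-hand side: ordered integral over x_0 > x_1 of the squared Vandermonde
X0, X1 = np.meshgrid(xs, xs, indexing="ij")
W2 = np.outer(w, w)
F = np.exp(-(X0**2 + X1**2) / 2) * (X0 - X1)**2
F[X0 <= X1] = 0.0
rhs = (1 / sqrt(2 * pi))**2 * np.sum(W2 * F)  # prod 1/r! = 1 when ell = 2

print(abs(np.linalg.det(V) - rhs))  # small discretization error
```

The ordered region $x_0 > x_1$ is handled by zeroing the complementary half of the grid; since the factor $(x_0-x_1)^2$ vanishes on the diagonal, this cut introduces negligible error.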
Lemma 8.
Let $V^{(i\to j)}$ be the matrix defined in (10) for $i\le 2k-1$ and $j\ge\ell\ge 2(2k-1)\ge 2i$. Then the absolute value of the determinant of $V^{(i\to j)}$ satisfies
$$|\det(V^{(i\to j)})| \le \frac{1}{2^{\Omega(j\log j)}}\cdot|\det(V)|.$$
Proof.
We can perform a computation similar to that of $\det(V)$. Namely, factoring out the term $\frac{1}{\sqrt{2\pi}\,r!}$ from each row, we have
$$|\det(V^{(i\to j)})| = \left(\frac{1}{\sqrt{2\pi}}\right)^{\ell}\cdot\prod_{r=0,\,r\ne i}^{\ell-1}\frac{1}{r!}\cdot\frac{1}{j!}\cdot|\det(W^{(i\to j)})|$$
where $W^{(i\to j)}$ is the same matrix as $W$ from (17) except that the $i$-th row is replaced by the row $\sqrt{2\pi}\,j!\,v^{(j)}$. By comparing to (16), we simplify $|\det(V^{(i\to j)})|$ to
$$|\det(V^{(i\to j)})| = \frac{i!}{j!}\cdot\frac{|\det(W^{(i\to j)})|}{|\det(W)|}\cdot|\det(V)|. \quad (23)$$
By the Cauchy–Binet formula, we can further express $\det(W^{(i\to j)})$ as
$$\det(W^{(i\to j)}) = \int_{x_0>\cdots>x_{\ell-1},\,\mathbf{x}\in S^{\ell}} \det(U(\mathbf{x}))\,\det(U^{(i\to j)}(\mathbf{x}))\,\mathsf{d}\mathbf{x}$$
where $U^{(i\to j)}(\mathbf{x})$ is the same matrix as $U(\mathbf{x})$ from (19) except that the $i$-th row is replaced with the row whose $c$-th entry is $e^{-\frac{1}{4}x_c^2}h_j(x_c)$, for any $\mathbf{x} = (x_0,\dots,x_{\ell-1})\in\mathbb{R}^{\ell}$. Furthermore, by the Cauchy–Schwarz inequality and comparing to (18),
$$|\det(W^{(i\to j)})| \le \left(\int_{x_0>\cdots>x_{\ell-1},\,\mathbf{x}\in S^{\ell}}\big(\det(U(\mathbf{x}))\big)^2\,\mathsf{d}\mathbf{x}\right)^{1/2}\left(\int_{x_0>\cdots>x_{\ell-1},\,\mathbf{x}\in S^{\ell}}\big(\det(U^{(i\to j)}(\mathbf{x}))\big)^2\,\mathsf{d}\mathbf{x}\right)^{1/2} = \left(\frac{\int_{x_0>\cdots>x_{\ell-1},\,\mathbf{x}\in S^{\ell}}\big(\det(U^{(i\to j)}(\mathbf{x}))\big)^2\,\mathsf{d}\mathbf{x}}{\int_{x_0>\cdots>x_{\ell-1},\,\mathbf{x}\in S^{\ell}}\big(\det(U(\mathbf{x}))\big)^2\,\mathsf{d}\mathbf{x}}\right)^{1/2}|\det(W)|. \quad (24)$$
By factoring out the term $e^{-\frac{1}{4}x_c^2}$ from each column, we have
$$\det(U^{(i\to j)}(\mathbf{x})) = e^{-\frac{1}{4}\sum_{c=0}^{\ell-1}x_c^2}\,\det(P^{(i\to j)}(\mathbf{x})) \quad (25)$$
where $P^{(i\to j)}(\mathbf{x})$ is the same matrix as $P(\mathbf{x})$ from (21) except that the $i$-th row is replaced with the row whose $c$-th entry is $h_j(x_c)$, for any $\mathbf{x} = (x_0,\dots,x_{\ell-1})\in\mathbb{R}^{\ell}$.
This time, the computation of $\det(P^{(i\to j)}(\mathbf{x}))$ is not as easy as that of $\det(P(\mathbf{x}))$. In Lemma 10 below, we show that
$$|\det(P^{(i\to j)}(\mathbf{x}))| \le \frac{j!}{i!\left(\frac{j-i}{2}\right)!}\cdot 2^{O(j)}\cdot|\det(P(\mathbf{x}))|.$$
Plugging this into (25) and comparing (25) to (20), we have
$$|\det(U^{(i\to j)}(\mathbf{x}))| \le \frac{j!}{i!\left(\frac{j-i}{2}\right)!}\cdot 2^{O(j)}\cdot|\det(U(\mathbf{x}))|.$$
Furthermore, by plugging this into (24),
$$|\det(W^{(i\to j)})| \le \left(\frac{\int_{x_0>\cdots>x_{\ell-1},\,\mathbf{x}\in S^{\ell}}\big(\det(U^{(i\to j)}(\mathbf{x}))\big)^2\,\mathsf{d}\mathbf{x}}{\int_{x_0>\cdots>x_{\ell-1},\,\mathbf{x}\in S^{\ell}}\big(\det(U(\mathbf{x}))\big)^2\,\mathsf{d}\mathbf{x}}\right)^{1/2}|\det(W)| \le \frac{j!}{i!\left(\frac{j-i}{2}\right)!}\cdot 2^{O(j)}\cdot|\det(W)|.$$
Finally, plugging this into (23), we obtain
$$|\det(V^{(i\to j)})| = \frac{i!}{j!}\cdot\frac{|\det(W^{(i\to j)})|}{|\det(W)|}\cdot|\det(V)| \le \frac{i!}{j!}\cdot\frac{j!\,2^{O(j)}}{i!\left(\frac{j-i}{2}\right)!}\cdot|\det(V)| = \frac{2^{O(j)}}{\left(\frac{j-i}{2}\right)!}\cdot|\det(V)|.$$
Recall that $i\le 2k-1$ and the assumption $j\ge\ell\ge 2(2k-1)\ge 2i$. Hence $\frac{j-i}{2}\ge\frac{j}{4}$, so $\left(\frac{j-i}{2}\right)! \ge \left(\frac{j}{4}\right)! = 2^{\Omega(j\log j)}$, and we conclude
$$|\det(V^{(i\to j)})| \le \frac{1}{2^{\Omega(j\log j)}}\cdot|\det(V)|.$$
∎
Lemma 9.
Let $V^{(\mathbf{e}_i\leftarrow a)}$ be the matrix defined in (13) for $i\le 2k-1$ and $a\le\ell$. Then the absolute value of the determinant of $V^{(\mathbf{e}_i\leftarrow a)}$ satisfies
$$|\det(V^{(\mathbf{e}_i\leftarrow a)})| \le 2^{O(\ell\log\ell)}\cdot|\det(V)|.$$
Proof.
Recall that $V^{(\mathbf{e}_i\leftarrow a)}$ is the same matrix as $V$ except that the $a$-th column is replaced with $\mathbf{e}_i$. Hence, we first expand the determinant along that column and factor out the term $\frac{1}{\sqrt{2\pi}\,r!}$ from each remaining row:
$$|\det(V^{(\mathbf{e}_i\leftarrow a)})| = \left(\frac{1}{\sqrt{2\pi}}\right)^{\ell-1}\cdot\prod_{r=0,\,r\ne i}^{\ell-1}\frac{1}{r!}\cdot|\det(W^{(-i,-a)})|$$
where $W^{(-i,-a)}$ is the same matrix as $W$ from (17) except that the $i$-th row and the $a$-th column are omitted. By comparing to (16), we first simplify $|\det(V^{(\mathbf{e}_i\leftarrow a)})|$ to
$$|\det(V^{(\mathbf{e}_i\leftarrow a)})| = \sqrt{2\pi}\,i!\cdot\frac{|\det(W^{(-i,-a)})|}{|\det(W)|}\cdot|\det(V)|.$$
This means we need to bound the ratio $\frac{|\det(W^{(-i,-a)})|}{|\det(W)|}$ from above. To achieve this, we bound $|\det(W^{(-i,-a)})|$ from above and $|\det(W)|$ from below.
By the Cauchy–Binet formula, we further express $\det(W^{(-i,-a)})$ as
$$\det(W^{(-i,-a)}) = \int_{x_0>\cdots>x_{\ell-2},\,\mathbf{x}\in S^{\ell-1}} \det(U^{(-i)}(\mathbf{x}))\,\det(U^{(-a)}(\mathbf{x}))\,\mathsf{d}\mathbf{x} \quad (26)$$
where $U^{(-i)}(\mathbf{x})$ (resp. $U^{(-a)}(\mathbf{x})$) is the $(\ell-1)$-by-$(\ell-1)$ matrix whose $(r,c)$-entry is $e^{-\frac{1}{4}x_c^2}h_r(x_c)$ for $r\in[\ell]\setminus\{i\}$ (resp. $r\in[\ell]\setminus\{a\}$), $c\in[\ell-1]$, and any $\mathbf{x} = (x_0,\dots,x_{\ell-2})\in\mathbb{R}^{\ell-1}$. By factoring out the term $e^{-\frac{1}{4}x_c^2}$ from each column,
$$\det(U^{(-i)}(\mathbf{x})) = e^{-\frac{1}{4}\sum_{c=0}^{\ell-2}x_c^2}\,\det(P^{(-i)}(\mathbf{x})) \quad (27)$$
where $P^{(-i)}(\mathbf{x})$ is the $(\ell-1)$-by-$(\ell-1)$ matrix whose $(r,c)$-entry is $h_r(x_c)$ for $r\in[\ell]\setminus\{i\}$, $c\in[\ell-1]$, and any $\mathbf{x} = (x_0,\dots,x_{\ell-2})\in\mathbb{R}^{\ell-1}$.
Again, the computation of $\det(P^{(-i)}(\mathbf{x}))$ is not as easy as that of $\det(P(\mathbf{x}))$. In Lemma 11, we show that
$$|\det(P^{(-i)}(\mathbf{x}))| \le 2^{O(\ell\log\ell)}\cdot\prod_{0\le c_1<c_2\le\ell-2}|x_{c_1}-x_{c_2}|.$$
Note that the bound is independent of $i$, and hence the same bound holds for $|\det(P^{(-a)}(\mathbf{x}))|$. By plugging it into (27) and further into (26), we have
$$|\det(W^{(-i,-a)})| \le 2^{O(\ell\log\ell)}\cdot\int_{x_0>\cdots>x_{\ell-2},\,\mathbf{x}\in S^{\ell-1}} e^{-\frac{1}{2}\sum_{c=0}^{\ell-2}x_c^2}\cdot\prod_{0\le c_1<c_2\le\ell-2}(x_{c_1}-x_{c_2})^2\,\mathsf{d}\mathbf{x}. \quad (28)$$
Recall from Lemma 7 and (16) that
$$\det(W) = \int_{x_0>\cdots>x_{\ell-1},\,\mathbf{x}\in S^{\ell}} e^{-\frac{1}{2}\sum_{c=0}^{\ell-1}x_c^2}\cdot\prod_{0\le c_1<c_2\le\ell-1}(x_{c_1}-x_{c_2})^2\,\mathsf{d}\mathbf{x}.$$
Since the integrand $e^{-\frac{1}{2}\sum_{c=0}^{\ell-1}x_c^2}\cdot\prod_{0\le c_1<c_2\le\ell-1}(x_{c_1}-x_{c_2})^2$ is symmetric with respect to $x_0,\dots,x_{\ell-1}$, we have
$$\det(W) = \frac{1}{\ell!}\cdot\int_{\mathbf{x}\in S^{\ell}} e^{-\frac{1}{2}\sum_{c=0}^{\ell-1}x_c^2}\cdot\prod_{0\le c_1<c_2\le\ell-1}(x_{c_1}-x_{c_2})^2\,\mathsf{d}\mathbf{x}.$$
To bound $\det(W)$ from below, we restrict the integral to the sub-region $\{\mathbf{x}\in S^{\ell} \mid |x_{\ell-1}-x_c| > \frac{R}{\ell}\ \text{for all}\ c<\ell-1\}$ of $S^{\ell}$. On this sub-region, each of the $\ell-1$ factors $(x_c-x_{\ell-1})^2$ is at least $\left(\frac{R}{\ell}\right)^2$ and $e^{-\frac{1}{2}x_{\ell-1}^2}\ge e^{-\frac{1}{2}R^2}$, so
$$\det(W) \ge \frac{1}{\ell!}\cdot\int_{|x_{\ell-1}-x_c|>\frac{R}{\ell},\,\mathbf{x}\in S^{\ell}} e^{-\frac{1}{2}\sum_{c=0}^{\ell-1}x_c^2}\cdot\prod_{0\le c_1<c_2\le\ell-1}(x_{c_1}-x_{c_2})^2\,\mathsf{d}\mathbf{x}$$
$$\ge \frac{1}{\ell!}\cdot\left(\frac{R}{\ell}\right)^{2(\ell-1)} e^{-\frac{1}{2}R^2}\cdot\int_{|x_{\ell-1}-x_c|>\frac{R}{\ell},\,\mathbf{x}\in S^{\ell}} e^{-\frac{1}{2}\sum_{c=0}^{\ell-2}x_c^2}\cdot\prod_{0\le c_1<c_2\le\ell-2}(x_{c_1}-x_{c_2})^2\,\mathsf{d}\mathbf{x}$$
$$\ge \frac{1}{\ell!}\cdot\frac{2R}{\ell}\cdot\left(\frac{R}{\ell}\right)^{2(\ell-1)} e^{-\frac{1}{2}R^2}\cdot\int_{\mathbf{x}\in S^{\ell-1}} e^{-\frac{1}{2}\sum_{c=0}^{\ell-2}x_c^2}\cdot\prod_{0\le c_1<c_2\le\ell-2}(x_{c_1}-x_{c_2})^2\,\mathsf{d}\mathbf{x}$$
$$= \frac{2R}{\ell^2}\cdot\left(\frac{R}{\ell}\right)^{2(\ell-1)} e^{-\frac{1}{2}R^2}\cdot\int_{x_0>\cdots>x_{\ell-2},\,\mathbf{x}\in S^{\ell-1}} e^{-\frac{1}{2}\sum_{c=0}^{\ell-2}x_c^2}\cdot\prod_{0\le c_1<c_2\le\ell-2}(x_{c_1}-x_{c_2})^2\,\mathsf{d}\mathbf{x}$$
$$\ge \frac{1}{2^{O(\ell\log\ell)}}\cdot\int_{x_0>\cdots>x_{\ell-2},\,\mathbf{x}\in S^{\ell-1}} e^{-\frac{1}{2}\sum_{c=0}^{\ell-2}x_c^2}\cdot\prod_{0\le c_1<c_2\le\ell-2}(x_{c_1}-x_{c_2})^2\,\mathsf{d}\mathbf{x}, \quad (29)$$
where the third inequality integrates out $x_{\ell-1}$, whose set of admissible values in $S=[-R,R]$ has measure at least $2R-(\ell-1)\cdot\frac{2R}{\ell} = \frac{2R}{\ell}$, and the equality uses the symmetry of the remaining integrand in $x_0,\dots,x_{\ell-2}$.
In other words, by comparing $|\det(W)|$ in (29) to $|\det(W^{(-i,-a)})|$ in (28), we have
$$\frac{|\det(W^{(-i,-a)})|}{|\det(W)|} \le 2^{O(\ell\log\ell)}$$
and hence
$$\frac{|\det(V^{(\mathbf{e}_i\leftarrow a)})|}{|\det(V)|} = \sqrt{2\pi}\,i!\cdot\frac{|\det(W^{(-i,-a)})|}{|\det(W)|} \le 2^{O(\ell\log\ell)}.$$
∎
Lemma 10.
Let $P^{(i\to j)}(\mathbf{x})$ be the matrix defined in the proof of Lemma 8. Then the absolute value of the determinant of $P^{(i\to j)}(\mathbf{x})$ satisfies
$$|\det(P^{(i\to j)}(\mathbf{x}))| \le \frac{j!}{i!\left(\frac{j-i}{2}\right)!}\cdot 2^{O(j)}\cdot|\det(P(\mathbf{x}))|.$$
Recall that $P(\mathbf{x})$ is the matrix defined in (21).
Proof.
Since the entries of $P^{(i\to j)}(\mathbf{x})$ are Hermite polynomials, we can decompose it as
$$P^{(i\to j)}(\mathbf{x}) = C^{(i\to j)}\cdot X^{[j+1]}$$
where $C^{(i\to j)}$ is the $\ell$-by-$(j+1)$ matrix whose $(r,c)$-entry is the coefficient of $x^c$ in the $r$-th Hermite polynomial and $X^{[j+1]}$ is the $(j+1)$-by-$\ell$ matrix whose $(r,c)$-entry is $x_c^r$. For example, take $\ell=4$, $i=2$, $j=6$; then
$$h_0(x) = 1,\qquad h_1(x) = x,\qquad h_3(x) = -3x+x^3,\qquad h_6(x) = -15+45x^2-15x^4+x^6,$$
and hence
$$C^{(i\to j)} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0 & 0 & 0\\ -15 & 0 & 45 & 0 & -15 & 0 & 1\\ 0 & -3 & 0 & 1 & 0 & 0 & 0 \end{bmatrix}.$$
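As a sanity check (our own illustration, with arbitrary test points), the decomposition for this example can be verified numerically:

```python
import numpy as np

# Verify P^(i->j)(x) = C^(i->j) . X^[j+1] for the example ell=4, i=2, j=6
# (rows h_0, h_1, h_6, h_3 evaluated at arbitrary points x_0, ..., x_3).
C = np.array([
    [1,   0,  0, 0,   0, 0, 0],   # h_0(x) = 1
    [0,   1,  0, 0,   0, 0, 0],   # h_1(x) = x
    [-15, 0, 45, 0, -15, 0, 1],   # h_6(x) = -15 + 45x^2 - 15x^4 + x^6
    [0,  -3,  0, 1,   0, 0, 0],   # h_3(x) = -3x + x^3
], dtype=float)

x = np.array([0.3, -0.7, 1.1, -1.9])
X = np.vander(x, 7, increasing=True).T    # (j+1)-by-ell matrix, entry (r,c) = x_c^r

P = np.array([np.ones(4),
              x,
              -15 + 45 * x**2 - 15 * x**4 + x**6,
              -3 * x + x**3])

print(np.max(np.abs(C @ X - P)))   # the two sides agree up to rounding
```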
To compute $\det(P^{(i\to j)}(\mathbf{x}))$, we use the Cauchy–Binet formula:
$$\det(P^{(i\to j)}(\mathbf{x})) = \sum_T \det(C^{(i\to j)}_{:,T})\cdot\det(X^{[j+1]}_{T,:})$$
where the summation is over all subsets $T$ of size $\ell$ of $[j+1]$, $C^{(i\to j)}_{:,T}$ is the $\ell$-by-$\ell$ matrix whose columns are the columns of $C^{(i\to j)}$ at indices from $T$, and $X^{[j+1]}_{T,:}$ is the $\ell$-by-$\ell$ matrix whose rows are the rows of $X^{[j+1]}$ at indices from $T$. Here, for any positive integer $n$, we denote by $[n]$ the set $\{0,1,\dots,n-1\}$. Furthermore, by the triangle inequality,
$$|\det(P^{(i\to j)}(\mathbf{x}))| \le \sum_T |\det(C^{(i\to j)}_{:,T})|\cdot|\det(X^{[j+1]}_{T,:})|. \quad (30)$$
We first make some simplifications to see which $T$ make the determinants nonzero. For example, take $\ell=8$, $i=2$, $j=10$; then
$$h_0(x) = 1,\qquad h_1(x) = x,\qquad h_3(x) = -3x+x^3,\qquad h_4(x) = 3-6x^2+x^4,\qquad h_5(x) = 15x-10x^3+x^5,$$
$$h_6(x) = -15+45x^2-15x^4+x^6,\qquad h_7(x) = -105x+105x^3-21x^5+x^7,$$
$$h_{10}(x) = -945+4725x^2-3150x^4+630x^6-45x^8+x^{10},$$
and, up to row and column swaps (grouping the even-degree and odd-degree polynomials; blank entries are $0$),
$$C^{(i\to j)} = \left[\begin{array}{cccccc|ccccc} 1 & & & & & & & & & & \\ 3 & -6 & 1 & & & & & & & & \\ -15 & 45 & -15 & 1 & & & & & & & \\ -945 & 4725 & -3150 & 630 & -45 & 1 & & & & & \\\hline & & & & & & 1 & & & & \\ & & & & & & -3 & 1 & & & \\ & & & & & & 15 & -10 & 1 & & \\ & & & & & & -105 & 105 & -21 & 1 & \end{array}\right].$$
For simplicity, we assume that $i$, $j$, $\ell$ are even numbers; the other cases follow similarly by symmetry. If $T$ satisfies one of the following conditions:
•
does not contain all odd numbers less than $\ell$, i.e. $1,3,\dots,\ell-1$;
•
does not contain all even numbers less than $i$, i.e. $0,2,\dots,i-2$;
•
contains more than one even number larger than or equal to $\ell$, i.e. among $\ell,\ell+2,\dots,j$;
then $\det(C^{(i\to j)}_{:,T}) = 0$. In other words, the only remaining choices are
•
$T = [\ell]$, or
•
$T = [\ell]\setminus\{a\}\cup\{b\}$ for $a\in\{i,i+2,\dots,\ell-2\}$ and $b\in\{\ell,\ell+2,\dots,j\}$.
Therefore, there are only $\frac{\ell-i}{2}\cdot\frac{j-\ell+2}{2}+1 = O(j^2)$ choices of $T$ for which $\det(C^{(i\to j)}_{:,T})$ may be nonzero.
If $T = [\ell]$, by expanding the determinant $\det(C^{(i\to j)}_{:,T})$ along the rows whose diagonal entry is $1$, what is left is the determinant of a matrix $A$, where $A$ is the $\frac{\ell-i}{2}$-by-$\frac{\ell-i}{2}$ matrix whose $(r,c)$-entry is $(-1)^{\frac{r-c}{2}}\frac{r!}{\left(\frac{r-c}{2}\right)!\,c!\,2^{\frac{r-c}{2}}}$ for $r\in\{i+2,\dots,\ell-2,j\}$ and $c\in\{i,i+2,\dots,\ell-2\}$. In the example, the matrix $A$ is
$$\begin{bmatrix} -6 & 1 & 0\\ 45 & -15 & 1\\ 4725 & -3150 & 630 \end{bmatrix}.$$
By applying row and column operations, we can compute the exact expression for $\det(A)$:
$$\det(A) = (-1)^{\frac{j-i}{2}}\frac{j!}{i!\,2^{\frac{j-i}{2}}}\left(\sum_{m=0}^{\frac{\ell-i-2}{2}}(-1)^m\frac{1}{m!\left(\frac{j-i}{2}-m\right)!}\right).$$
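The entries of $A$ are coefficients of the monic (probabilists') Hermite polynomials, and the entry formula above can be checked against NumPy's HermiteE basis. This is a sketch of our own, not part of the paper:

```python
import numpy as np
from numpy.polynomial.hermite_e import herme2poly
from math import factorial

# Check the entry formula (-1)^((r-c)/2) * r! / (((r-c)/2)! c! 2^((r-c)/2))
# against the coefficient of x^c in the monic (probabilists') Hermite h_r.
max_err = 0.0
for r in range(12):
    coeffs = herme2poly([0] * r + [1])   # power-basis coefficients of h_r
    for c in range(r % 2, r + 1, 2):     # c and r must have the same parity
        m = (r - c) // 2
        formula = (-1)**m * factorial(r) / (factorial(m) * factorial(c) * 2**m)
        max_err = max(max_err, abs(coeffs[c] - formula))
print(max_err)   # zero up to floating point
```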
In the example, we have
$$\det\left(\begin{bmatrix} -6 & 1 & 0\\ 45 & -15 & 1\\ 4725 & -3150 & 630 \end{bmatrix}\right) = 14175.$$
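This value can be cross-checked numerically against the closed form (our own sketch):

```python
import numpy as np
from math import factorial

# Check det(A) = 14175 against the closed form, for ell=8, i=2, j=10.
A = np.array([[-6, 1, 0], [45, -15, 1], [4725, -3150, 630]], dtype=float)
i, j, ell = 2, 10, 8

s = sum((-1)**m / (factorial(m) * factorial((j - i) // 2 - m))
        for m in range((ell - i - 2) // 2 + 1))
closed_form = (-1)**((j - i) // 2) * factorial(j) / (factorial(i) * 2**((j - i) // 2)) * s

print(round(np.linalg.det(A)), round(closed_form))   # both 14175
```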
Note that the sum $\sum_{m=0}^{\frac{\ell-i-2}{2}}(-1)^m\frac{1}{m!\left(\frac{j-i}{2}-m\right)!}$ in the expression for $\det(A)$ is easily bounded:
$$\left|\sum_{m=0}^{\frac{\ell-i-2}{2}}(-1)^m\frac{1}{m!\left(\frac{j-i}{2}-m\right)!}\right| \le \sum_{m=0}^{\frac{\ell-i-2}{2}}\frac{1}{m!\left(\frac{j-i}{2}-m\right)!} \le \sum_{m=0}^{\frac{j-i}{2}}\frac{1}{m!\left(\frac{j-i}{2}-m\right)!} = \frac{2^{\frac{j-i}{2}}}{\left(\frac{j-i}{2}\right)!}.$$
Hence, we have
$$|\det(C^{(i\to j)}_{:,T})| = |\det(A)| \le \frac{j!}{i!\left(\frac{j-i}{2}\right)!}.$$
Also, since $T = [\ell]$, we have $|\det(X^{[j+1]}_{T,:})| = \prod_{0\le c_1<c_2\le\ell-1}|x_{c_1}-x_{c_2}|$. Therefore, when $T = [\ell]$,
$$|\det(C^{(i\to j)}_{:,T})|\cdot|\det(X^{[j+1]}_{T,:})| \le \frac{j!}{i!\left(\frac{j-i}{2}\right)!}\cdot\prod_{0\le c_1<c_2\le\ell-1}|x_{c_1}-x_{c_2}|.$$
Now, consider the case $T = [\ell]\setminus\{a\}\cup\{b\}$ for $a\in\{i,i+2,\dots,\ell-2\}$ and $b\in\{\ell,\ell+2,\dots,j\}$. Similar to the previous calculation, by expanding the determinant $\det(C^{(i\to j)}_{:,T})$ along the rows whose diagonal entry is $1$, what is left is the determinant of a matrix $A$, where $A$ is the $\frac{\ell-i}{2}$-by-$\frac{\ell-i}{2}$ matrix whose $(r,c)$-entry is $(-1)^{\frac{r-c}{2}}\frac{r!}{\left(\frac{r-c}{2}\right)!\,c!\,2^{\frac{r-c}{2}}}$ for $r\in\{i+2,\dots,a,j\}$ and $c\in\{i,i+2,\dots,a-2,b\}$. For example, take $a=6$ and $b=8$; the matrix $A$ in the running example is
$$\begin{bmatrix} -6 & 1 & 0\\ 45 & -15 & 0\\ 4725 & -3150 & -45 \end{bmatrix}.$$
By applying row and column operations, we can compute the exact expression for $\det(A)$:
$$\det(A) = (-1)^{\frac{j-b}{2}}\frac{j!}{\left(\frac{j-b}{2}\right)!\,b!\,2^{\frac{j-b}{2}}}\cdot(-1)^{\frac{a-i}{2}}\frac{a!}{\left(\frac{a-i}{2}\right)!\,i!\,2^{\frac{a-i}{2}}}.$$
In the example, we have
$$\det\left(\begin{bmatrix} -6 & 1 & 0\\ 45 & -15 & 0\\ 4725 & -3150 & -45 \end{bmatrix}\right) = -2025.$$
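Again, the value can be cross-checked numerically against the closed form (our own sketch):

```python
import numpy as np
from math import factorial

# Check det(A) = -2025 against the closed form, for a=6, b=8, i=2, j=10.
A = np.array([[-6, 1, 0], [45, -15, 0], [4725, -3150, -45]], dtype=float)
i, j, a, b = 2, 10, 6, 8

closed_form = ((-1)**((j - b) // 2) * factorial(j)
               / (factorial((j - b) // 2) * factorial(b) * 2**((j - b) // 2))
               * (-1)**((a - i) // 2) * factorial(a)
               / (factorial((a - i) // 2) * factorial(i) * 2**((a - i) // 2)))

print(round(np.linalg.det(A)), round(closed_form))   # both -2025
```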
To bound $|\det(A)|$, write
$$|\det(A)| = \frac{j!}{\left(\frac{j-b}{2}\right)!\,b!\,2^{\frac{j-b}{2}}}\cdot\frac{a!}{\left(\frac{a-i}{2}\right)!\,i!\,2^{\frac{a-i}{2}}} = \frac{j!}{i!}\cdot\frac{a!}{\left(\frac{a-i}{2}\right)!\,b!\left(\frac{j-b}{2}\right)!}\cdot\frac{1}{2^{\frac{j-b+a-i}{2}}}.$$
Note that $\frac{1}{2^{\frac{j-b+a-i}{2}}}\le 1$. Recall that $i\le a\le\ell-2$ and $\ell\le b\le j$. We also have
$$\frac{a!}{\left(\frac{a-i}{2}\right)!} \le 2^a\cdot\left(\frac{a+i}{2}\right)! \le 2^j\cdot\left(\frac{b+i}{2}\right)!.$$
Hence,
$$|\det(A)| \le \frac{j!}{i!}\cdot\frac{2^j\left(\frac{b+i}{2}\right)!}{b!\left(\frac{j-b}{2}\right)!} = \frac{j!}{i!}\cdot 2^j\cdot\frac{\left(\frac{b+i}{2}\right)!\left(\frac{b-i}{2}\right)!}{b!}\cdot\frac{\left(\frac{j-i}{2}\right)!}{\left(\frac{b-i}{2}\right)!\left(\frac{j-b}{2}\right)!}\cdot\frac{1}{\left(\frac{j-i}{2}\right)!}.$$
Observe that
$$\frac{\left(\frac{b+i}{2}\right)!\left(\frac{b-i}{2}\right)!}{b!}\le 1\qquad\text{and}\qquad \frac{\left(\frac{j-i}{2}\right)!}{\left(\frac{b-i}{2}\right)!\left(\frac{j-b}{2}\right)!}\le 2^{\frac{j-i}{2}}\le 2^{\frac{j}{2}}.$$
Plugging these into the above inequality,
$$|\det(C^{(i\to j)}_{:,T})| = |\det(A)| \le \frac{j!}{i!}\cdot\frac{2^{\frac{3j}{2}}}{\left(\frac{j-i}{2}\right)!}.$$
Since $a$ is omitted from $\{i,i+2,\dots,\ell-2\}$ and $b$ is selected from $\{\ell,\ell+2,\dots,j\}$, we have $T = [\ell]\setminus\{a\}\cup\{b\}$. By the properties of Schur polynomials,
$$\det(X^{[j+1]}_{T,:}) = \left(\sum_Y \mathbf{x}^Y\right)\cdot\prod_{0\le c_1<c_2\le\ell-1}(x_{c_1}-x_{c_2})$$
where the summation is over all semi-standard Young tableaux $Y$ of shape $(b-\ell+1,\underbrace{1,\dots,1}_{\ell-1-a\text{ 1's}},\underbrace{0,\dots,0}_{a\text{ 0's}})$. Here, the term $\mathbf{x}^Y$ means $x_0^{y_0}\cdots x_{\ell-1}^{y_{\ell-1}}$ where $y_m$ is the number of occurrences of the number $m$ in $Y$; note that $\sum_{m=0}^{\ell-1}y_m = b-a$. The given shape consists of one row of size $b-\ell+1$ and one column of size $\ell-a$, sharing the first cell. For the row, the number of non-decreasing sequences of size $b-\ell+1$ whose entries are between $0$ and $\ell-1$ inclusive is $\binom{b}{\ell-1}\le 2^j$. For the column, the number of increasing sequences of size $\ell-a$ whose entries are between $0$ and $\ell-1$ inclusive is $\binom{\ell}{a}\le 2^j$. Hence, the number of semi-standard Young tableaux of this shape is bounded by $\binom{b}{\ell-1}\cdot\binom{\ell}{a}\le 2^{2j}$. By the assumption that $S = [-R,R]$, we can also bound the term $|\mathbf{x}^Y|$ by
$$|\mathbf{x}^Y| \le R^{b-a} \le 2^{O(j)}.$$
We can now bound the determinant $|\det(X^{[j+1]}_{T,:})|$ by
$$|\det(X^{[j+1]}_{T,:})| \le 2^{O(j)}\cdot\prod_{0\le c_1<c_2\le\ell-1}|x_{c_1}-x_{c_2}|.$$
Namely, when $T = [\ell]\setminus\{a\}\cup\{b\}$ for $a\in\{i,i+2,\dots,\ell-2\}$ and $b\in\{\ell,\ell+2,\dots,j\}$,
$$|\det(C^{(i\to j)}_{:,T})|\cdot|\det(X^{[j+1]}_{T,:})| \le \frac{j!}{i!}\cdot\frac{2^{\frac{3j}{2}}}{\left(\frac{j-i}{2}\right)!}\cdot 2^{O(j)}\cdot\prod_{0\le c_1<c_2\le\ell-1}|x_{c_1}-x_{c_2}| = \frac{j!}{i!\left(\frac{j-i}{2}\right)!}\cdot 2^{O(j)}\cdot\prod_{0\le c_1<c_2\le\ell-1}|x_{c_1}-x_{c_2}|.$$
By considering all $O(j^2)$ cases for $T$ and plugging them into (30), we have
$$|\det(P^{(i\to j)}(\mathbf{x}))| \le \sum_T |\det(C^{(i\to j)}_{:,T})|\cdot|\det(X^{[j+1]}_{T,:})| \le \frac{j!}{i!\left(\frac{j-i}{2}\right)!}\cdot 2^{O(j)}\cdot\prod_{0\le c_1<c_2\le\ell-1}|x_{c_1}-x_{c_2}|$$
and, by comparing to $\det(P(\mathbf{x}))$ in (22), which equals $\prod_{0\le c_1<c_2\le\ell-1}(x_{c_1}-x_{c_2})$,
$$|\det(P^{(i\to j)}(\mathbf{x}))| \le \frac{j!}{i!\left(\frac{j-i}{2}\right)!}\cdot 2^{O(j)}\cdot|\det(P(\mathbf{x}))|.$$
∎
Lemma 11.
Let $P^{(-i)}(\mathbf{x})$ be the matrix defined in the proof of Lemma 9. Then the absolute value of the determinant of $P^{(-i)}(\mathbf{x})$ satisfies
$$|\det(P^{(-i)}(\mathbf{x}))| \le 2^{O(\ell\log\ell)}\cdot\prod_{0\le c_1<c_2\le\ell-2}|x_{c_1}-x_{c_2}|.$$
Proof.
Since the entries of $P^{(-i)}(\mathbf{x})$ are Hermite polynomials, we can decompose it as
$$P^{(-i)}(\mathbf{x}) = C^{(-i)}\cdot X^{[\ell]}$$
where $C^{(-i)}$ is the $(\ell-1)$-by-$\ell$ matrix whose $(r,c)$-entry is the coefficient of $x^c$ in the $r$-th Hermite polynomial for $r\in[\ell]\setminus\{i\}$, and $X^{[\ell]}$ is the $\ell$-by-$(\ell-1)$ matrix whose $(r,c)$-entry is $x_c^r$. For example, take $\ell=4$, $i=2$; then
$$h_0(x) = 1,\qquad h_1(x) = x,\qquad h_3(x) = -3x+x^3,$$
and hence
$$C^{(-i)} = \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & -3 & 0 & 1 \end{bmatrix} \qquad\text{and}\qquad X^{[\ell]} = \begin{bmatrix} 1 & 1 & 1\\ x_0 & x_1 & x_2\\ x_0^2 & x_1^2 & x_2^2\\ x_0^3 & x_1^3 & x_2^3 \end{bmatrix}.$$
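As before, this example decomposition can be verified numerically (our own sketch, with arbitrary test points):

```python
import numpy as np

# Verify P^(-i)(x) = C^(-i) . X^[ell] for the example ell=4, i=2
# (rows h_0, h_1, h_3 evaluated at arbitrary points x_0, x_1, x_2).
C = np.array([[1,  0, 0, 0],    # h_0(x) = 1
              [0,  1, 0, 0],    # h_1(x) = x
              [0, -3, 0, 1]],   # h_3(x) = -3x + x^3
             dtype=float)

x = np.array([0.5, -0.25, 1.5])
X = np.vander(x, 4, increasing=True).T   # ell-by-(ell-1) matrix, entry (r,c) = x_c^r
P = np.array([np.ones(3), x, -3 * x + x**3])

print(np.max(np.abs(C @ X - P)))   # ~0
```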
To compute $\det(P^{(-i)}(\mathbf{x}))$, we use the Cauchy–Binet formula:
$$\det(P^{(-i)}(\mathbf{x})) = \sum_T \det(C^{(-i)}_{:,T})\cdot\det(X^{[\ell]}_{T,:})$$
where the summation is over all subsets $T$ of size $\ell-1$ of $[\ell]$, $C^{(-i)}_{:,T}$ is the $(\ell-1)$-by-$(\ell-1)$ matrix whose columns are the columns of $C^{(-i)}$ at indices from $T$, and $X^{[\ell]}_{T,:}$ is the $(\ell-1)$-by-$(\ell-1)$ matrix whose rows are the rows of $X^{[\ell]}$ at indices from $T$. Furthermore, by the triangle inequality,
$$|\det(P^{(-i)}(\mathbf{x}))| \le \sum_T |\det(C^{(-i)}_{:,T})|\cdot|\det(X^{[\ell]}_{T,:})|. \quad (31)$$
We first make some simplifications to see which $T$ make the determinants nonzero. For example, take $\ell=8$, $i=2$; then
$$h_0(x) = 1,\qquad h_1(x) = x,\qquad h_3(x) = -3x+x^3,\qquad h_4(x) = 3-6x^2+x^4,\qquad h_5(x) = 15x-10x^3+x^5,$$
$$h_6(x) = -15+45x^2-15x^4+x^6,\qquad h_7(x) = -105x+105x^3-21x^5+x^7,$$
and, up to row and column swaps (grouping the even-degree and odd-degree polynomials; blank entries are $0$),
$$C^{(-i)} = \left[\begin{array}{cccc|cccc} 1 & & & & & & & \\ 3 & -6 & 1 & & & & & \\ -15 & 45 & -15 & 1 & & & & \\\hline & & & & 1 & & & \\ & & & & -3 & 1 & & \\ & & & & 15 & -10 & 1 & \\ & & & & -105 & 105 & -21 & 1 \end{array}\right].$$
For simplicity, we assume that $i$ and $\ell$ are even numbers; the other cases follow similarly by symmetry. If $T$ does not contain all odd numbers, or does not contain all even numbers less than $i$, then $\det(C^{(-i)}_{:,T}) = 0$. In other words, the choices are $T = [\ell]\setminus\{b\}$ for $b\in\{i,i+2,\dots,\ell-2\}$. Therefore, there are only $\frac{\ell-i}{2} = O(\ell)$ choices of $T$ for which $\det(C^{(-i)}_{:,T})$ may be nonzero.
Now, we expand the determinant $\det(C^{(-i)}_{:,T})$ along the rows whose diagonal entry is $1$. What is left is the determinant of a matrix $A$, where $A$ is the $\frac{b-i}{2}$-by-$\frac{b-i}{2}$ matrix whose $(r,c)$-entry is $(-1)^{\frac{r-c}{2}}\frac{r!}{\left(\frac{r-c}{2}\right)!\,c!\,2^{\frac{r-c}{2}}}$ for $r\in\{i+2,\dots,b\}$ and $c\in\{i,i+2,\dots,b-2\}$. For example, take $b=6$; the matrix $A$ in the above example is
$$\begin{bmatrix} -6 & 1\\ 45 & -15 \end{bmatrix}.$$
By applying row and column operations, we can compute the exact expression for $\det(A)$ as
$$\det(A) = (-1)^{\frac{b-i}{2}}\frac{b!}{\left(\frac{b-i}{2}\right)!\,i!\,2^{\frac{b-i}{2}}}$$
and hence
$$|\det(C^{(-i)}_{:,T})| = |\det(A)| \le \frac{b!}{\left(\frac{b-i}{2}\right)!\,i!\,2^{\frac{b-i}{2}}} \le \ell!. \quad (32)$$
In the example, we have
$$\det\left(\begin{bmatrix} -6 & 1\\ 45 & -15 \end{bmatrix}\right) = 45.$$
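A quick numeric check of this value against the closed form (our own sketch):

```python
import numpy as np
from math import factorial

# Check det(A) = 45 against the closed form, for b=6, i=2.
A = np.array([[-6.0, 1.0], [45.0, -15.0]])
b, i = 6, 2
closed_form = ((-1)**((b - i) // 2) * factorial(b)
               / (factorial((b - i) // 2) * factorial(i) * 2**((b - i) // 2)))
print(round(np.linalg.det(A)), round(closed_form))   # both 45
```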
By the properties of Schur polynomials,
$$\det(X^{[\ell]}_{T,:}) = \left(\sum_Y \mathbf{x}^Y\right)\cdot\prod_{0\le c_1<c_2\le\ell-2}(x_{c_1}-x_{c_2})$$
where the summation is over all semi-standard Young tableaux $Y$ of shape $(\underbrace{1,\dots,1}_{\ell-1-b\text{ 1's}},\underbrace{0,\dots,0}_{b\text{ 0's}})$. Recall that the term $\mathbf{x}^Y$ means $x_0^{y_0}\cdots x_{\ell-2}^{y_{\ell-2}}$ where $y_m$ is the number of occurrences of the number $m$ in $Y$; note that $\sum_{m=0}^{\ell-2}y_m = \ell-1-b$. The given shape consists of a single column of size $\ell-1-b$, so the number of semi-standard Young tableaux of this shape is the number of increasing sequences of size $\ell-1-b$ whose entries are between $0$ and $\ell-2$ inclusive, which is $\binom{\ell-1}{b}\le 2^{\ell}$. By the assumption that $S = [-R,R]$, we can also bound the term $|\mathbf{x}^Y|$ by
$$|\mathbf{x}^Y| \le R^{\ell-1-b} \le 2^{O(\ell)}.$$
It follows that
$$|\det(X^{[\ell]}_{T,:})| \le 2^{O(\ell)}\cdot\prod_{0\le c_1<c_2\le\ell-2}|x_{c_1}-x_{c_2}|. \quad (33)$$
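For a single-column shape, the Schur polynomial is the elementary symmetric polynomial $e_{\ell-1-b}$, so the identity above can be checked numerically in a small case. This is a sketch of our own, with arbitrary test points, taking $\ell=6$ and $b=4$ so that $T=\{0,1,2,3,5\}$ and the shape is one column of size $1$:

```python
import numpy as np
from itertools import combinations

# Numeric check of the Schur identity for a single-column shape:
# ell = 6, b = 4, T = {0, 1, 2, 3, 5}; the Schur polynomial of a column of
# size ell - 1 - b = 1 is e_1(x) = x_0 + ... + x_4.
x = np.array([0.4, -1.3, 2.2, 0.9, -0.5])      # arbitrary test points
XT = np.array([x**0, x**1, x**2, x**3, x**5])  # rows of X^[ell] at indices in T

e1 = x.sum()
vandermonde = np.prod([x[c1] - x[c2] for c1, c2 in combinations(range(5), 2)])

print(np.linalg.det(XT), e1 * vandermonde)     # equal up to rounding
```

With $5$ variables there are $10$ pairs, so the two sign conventions for the Vandermonde factor coincide here.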
By plugging (32) and (33) into (31), and since there are only $O(\ell)$ choices of $T$, we can now bound $|\det(P^{(-i)}(\mathbf{x}))|$ by
$$|\det(P^{(-i)}(\mathbf{x}))| \le \sum_T |\det(C^{(-i)}_{:,T})|\cdot|\det(X^{[\ell]}_{T,:})| \le 2^{O(\ell\log\ell)}\cdot\prod_{0\le c_1<c_2\le\ell-2}|x_{c_1}-x_{c_2}|.$$
∎