Title: MLE convergence speed to information projection of exponential family: Criterion for model dimension and sample size – complete proof version–

URL Source: https://arxiv.org/html/2105.08947

License: arXiv.org perpetual non-exclusive license

arXiv:2105.08947v5 [math.ST] 13 Oct 2025

MLE convergence speed to information projection of exponential family: Criterion for model dimension and sample size – complete proof version –

Yo Sheena, Faculty of Data Science, Shiga University, Japan; Visiting Professor of the Institute of Statistical Mathematics, Japan (May 2021)

Abstract

For a parametric model of distributions, the closest distribution in the model to the true distribution located outside the model is considered. Measuring the closeness between two distributions with the Kullback-Leibler (K-L) divergence, the closest distribution is called the “information projection.” The estimation risk of the maximum likelihood estimator (MLE) is defined as the expectation of the K-L divergence between the information projection and the predictive distribution with the plugged-in MLE. Here, the asymptotic expansion of the risk is derived up to the $n^{-2}$-order, and the sufficient condition on the risk for the Bayes error rate between the true distribution and the information projection to be lower than a specified value is investigated. Combining these results, the “$p$-$n$ criterion” is proposed, which determines whether the MLE is sufficiently close to the information projection for the given model and sample. In particular, the criterion for an exponential family model is relatively simple and can be used for a complex model with no explicit form of the normalizing constant. This criterion can constitute a solution to the sample size or model acceptance problem. Use of the $p$-$n$ criterion is demonstrated for two practical datasets. The relationship between the results and information criteria is also studied.

MSC(2010) Subject Classification: Primary 60F99; Secondary 62F12 Keywords and phrases: Kullback-Leibler divergence, exponential family, asymptotic risk, information projection, multinomial distribution.

1 Introduction

Given a certain data set, an unknown probability distribution that generates the data as the independent, identically distributed (i.i.d.) sample can be assumed. Under this assumption, if a certain parametric distribution model is adopted to “explain” the data, the first task is to find the “best” distribution in the model. Because the true distribution is assumed to be outside the model (except for some rare cases), the “best” means the “closest” to the true distribution.

If the true distribution is successfully approximated by the “best” distribution, it has many possible applications. For example, regression or discrimination analysis can be performed based on the conditional distribution of one variable in the distribution (target variable) with respect to other variables (the explanatory variables). The conditional or unconditional distributions can also be used to complete missing values with multiple candidates (multiple imputation). Note that it is possible to decide whether an individual is an outlier based on a contour region of a certain probability. In essence, we can answer any type of question on the true distribution theoretically or, in most cases, numerically using the generated random variables from the approximating distribution.

The most important merit of the approximating distribution is that it naturally provides “knowledge of the amount of uncertainty” in the famous equation of C. R. Rao [18]:

“Uncertain knowledge” + “Knowledge of the amount of uncertainty in it” = “Usable knowledge”

For example, for a prediction based on a regression analysis using the conditional distribution of the target variable with respect to the explanatory variables, the target variable value can be predicted and its prediction interval can also be constituted. The multiple imputation method is preferred to the single approach, as it reflects the likelihood of each imputed value.

During the true distribution approximation process, significant problems arise regarding the methods for the following:

1. systematic construction of a distribution model;

2. evaluation of the estimator closeness to the best distribution;

3. evaluation of the best distribution closeness to the true distribution.

This study focuses on the second problem, with the aim of establishing a criterion to determine whether the maximum likelihood estimator (MLE) is sufficiently close to the best distribution. The result for a general distribution model is stated, but the main focus is on an exponential family model, for which a concise criterion is presented. We also address the third problem in relation to the information criteria.

For the finite-dimensional exponential family model, the first problem equates to the basis function selection. Portnoy [17], Stone [22], and Barron and Sheu [6] have investigated cases involving a series of exponential families on a one-dimensional compact set with splines, polynomials, and trigonometric functions as the basis functions. Those researchers focused on the convergence rate of the predictive distribution to the true distribution as the basis-function dimension increased with the sample size. Further, Wainwright and Jordan [27] extensively studied basis function selection in the context of graphical models. Sundberg [23] produced a comprehensive book on exponential family models and introduced many model types for various fields. In addition, Efron and Tibshirani [13] studied the hybrid construction of an exponential family using a nonparametric reference function and finite-dimensional basis functions. For recent developments regarding exponential families in association with the reproducing kernel Hilbert space, see [8], [14], [21] and [1].

However, most of the asymptotic results reported in the above-mentioned papers pertain to the closeness between the predictive and the true distributions (not the best distribution). These results concern consistency or the convergence order as the model inflates along with the sample size. (Note that Barron and Sheu [6] studied the convergence of the predictive distribution to the best distribution within a theorem proof; however, their main concern was the distance to the true distribution.) The approach considered in this work is characterized by the separation of the second and third problems, focusing on the second problem. To address it, we fix the model and derive the asymptotic expansion of the risk with respect to the sample size $n$ (i.e., the expected distance between the predictive and best distributions). The approximated risk (up to the $n^{-1}$- or $n^{-2}$-order) yields a criterion for the second problem when combined with a certain threshold $C$.

1.1 Framework

The framework of this study is as follows. Consider the following parametric distribution model:

$$
\{\, g(x;\theta) \mid \theta = (\theta^1,\dots,\theta^p) \in \Theta \,\},
$$

where $g(x;\theta)$ is the probability density function (p.d.f.) with respect to a reference measure $d\mu$ on a measurable space. The p.d.f. of the unknown true distribution with respect to $d\mu$ is denoted by $g(x)$.

The Kullback–Leibler divergence (K-L divergence) is used to measure the closeness between two distributions, and the MLE is chosen as the estimator. This pair, i.e., the K-L divergence and MLE, is the natural choice for the following reasons.

First, the divergence is a geometrical tool that is independent of the parameter (i.e., the coordinate system with respect to the differential manifold ℳ ). This allows extraction of purely geometrical results; in other words, result dependence on parameter choices can be avoided. Second, K-L divergence is essentially the only “decomposable,” “flat,” and “invariant” divergence (see Theorem 4.1 of [3]). Invariance is especially important for comparing two distributions, because for a one-to-one transfer of the observed variable, the results should remain unchanged even with the transformed variable (see [25] for other important divergence properties).

K-L divergence in the context of the “best” distribution in the model is also explained here. First, consider the $\alpha$-divergence, a class of divergences with one parameter $\alpha$ defined by

$$
D_\alpha[g(x):g(x;\theta)] =
\begin{cases}
\dfrac{4}{1-\alpha^2}\Bigl\{1-\displaystyle\int (g(x))^{(1-\alpha)/2}\,(g(x;\theta))^{(1+\alpha)/2}\,d\mu\Bigr\}, & \text{if } \alpha \ne \pm 1,\\[2mm]
\displaystyle\int g(x;\theta)\log\bigl(g(x;\theta)/g(x)\bigr)\,d\mu, & \text{if } \alpha = 1,\\[2mm]
\displaystyle\int g(x)\log\bigl(g(x)/g(x;\theta)\bigr)\,d\mu, & \text{if } \alpha = -1.
\end{cases}
\tag{1}
$$

Note that the $\alpha$-divergence is the only “decomposable,” “flat,” and “invariant” divergence when it is extended to the positive measure space (see Theorem 4.2 of [3]). Further, the $\alpha$-divergence contains frequently used divergences such as the K-L divergence ($\alpha = -1$), the Hellinger distance ($\alpha = 0$), and the $\chi^2$-divergence ($\alpha = 3$).
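As a numerical aid (not part of the paper), the piecewise definition (1) can be checked for discrete distributions: the $\alpha \to -1$ limit recovers the K-L divergence continuously, and $\alpha = 0$ yields $2\sum(\sqrt{g}-\sqrt{h})^2$, the squared-Hellinger form. A minimal Python sketch, taking the counting measure as $d\mu$:

```python
import math

def alpha_divergence(g, h, alpha):
    """alpha-divergence D_alpha[g : h] of eq. (1) for two discrete
    distributions g, h (lists of probabilities, counting measure)."""
    if alpha == 1.0:
        return sum(hx * math.log(hx / gx) for gx, hx in zip(g, h))
    if alpha == -1.0:  # the Kullback-Leibler divergence D[g : h]
        return sum(gx * math.log(gx / hx) for gx, hx in zip(g, h))
    integral = sum(gx ** ((1 - alpha) / 2) * hx ** ((1 + alpha) / 2)
                   for gx, hx in zip(g, h))
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - integral)

g = [0.2, 0.3, 0.5]
h = [0.25, 0.25, 0.5]

kl = alpha_divergence(g, h, -1.0)
near_kl = alpha_divergence(g, h, -1.0 + 1e-4)  # generic branch near alpha = -1
hellinger = alpha_divergence(g, h, 0.0)        # equals 2 * sum (sqrt g - sqrt h)^2
```

The generic branch approaches the K-L case continuously, which is why the three branches of (1) define a single one-parameter family.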

The “best” approximating distribution in $\mathcal{M}$ is the closest distribution $g(x;\theta^*)$ to $g(x)$, where

$$
\theta^* = \mathop{\mathrm{arg\,min}}_{\theta\in\Theta} D_\alpha[g(x) \,|\, g(x;\theta)].
\tag{2}
$$

Csiszár [10] called $g(x;\theta^*)$ the “information projection.” If the parametric model is regarded as a device for approximating the true distribution, $g(x;\theta^*)$ is the best distribution in $\mathcal{M}$ for this purpose.

In fact, $g(x;\theta^*)$ is essentially given by the solution of the equations

$$
\frac{\partial}{\partial\theta^i} D_\alpha[g(x)\,|\,g(x;\theta)] = 0,\quad i=1,\dots,p
\tag{3}
$$

$$
\Longleftrightarrow
\begin{cases}
\displaystyle\int (g(x;\theta))^{\frac{\alpha-1}{2}}\,(g(x))^{\frac{1-\alpha}{2}}\,\frac{\partial g(x;\theta)}{\partial\theta^i}\,d\mu = 0,\quad i=1,\dots,p, & \text{if } \alpha \ne 1,\\[2mm]
\displaystyle\int \frac{\partial g(x;\theta)}{\partial\theta^i}\log\bigl(g(x;\theta)/g(x)\bigr)\,d\mu = 0,\quad i=1,\dots,p, & \text{if } \alpha = 1.
\end{cases}
\tag{4}
$$

Note that, if $\alpha = -1$, equation (3) is equivalent to

$$
E\Bigl[\frac{\partial}{\partial\theta^i}\log g(X;\theta)\Bigr] = 0,\quad i=1,\dots,p,
\tag{5}
$$

and its solution $\theta^*$ is estimated via the MLE, which is the solution of

$$
\sum_{t=1}^{n}\frac{\partial}{\partial\theta^i}\log g(X_t;\theta) = 0,\quad i=1,\dots,p,
$$

where $X_t$, $t=1,\dots,n$, is the i.i.d. sample from $g(x)$. If $\alpha \ne -1$, equation (3) does not have the form

$$
E[h(X,\theta)] = 0
$$

with some known function $h(x,\theta)$, which is the M-estimator formulation.
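To make (5) and its empirical counterpart concrete, here is a small sketch (an illustrative setup, not taken from the paper): the model is Exponential($\theta$) while the data are Uniform(0,1), so the projection equation $E[1/\theta - X] = 0$ gives $\theta^* = 1/E[X] = 2$, and the MLE $\hat\theta = 1/\bar X$ converges to $\theta^*$ rather than to the true distribution:

```python
import random

random.seed(0)

# Model: Exponential(theta) with density g(x; theta) = theta * exp(-theta x).
# True distribution: Uniform(0, 1), which lies outside the model.
# The information projection theta* solves E[d/dtheta log g(X; theta)] = 0,
# i.e. E[1/theta - X] = 0, so theta* = 1 / E[X] = 2.
theta_star = 1.0 / 0.5

# The MLE solves the empirical score equation sum_t (1/theta - X_t) = 0,
# i.e. theta_hat = 1 / (sample mean); it targets theta*, not the truth.
n = 200_000
xs = [random.random() for _ in range(n)]
theta_hat = 1.0 / (sum(xs) / n)
```

With a large sample, `theta_hat` sits close to $\theta^* = 2$ even though no exponential distribution generated the data.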

An overview of the contents of each section is now provided. Throughout this paper, the focus is on the “distance” between $g(x;\theta^*)$ and the predictive distribution $g(x;\hat\theta)$ with MLE $\hat\theta$, i.e.,

$$
D[g(x;\theta^*) \,|\, g(x;\hat\theta)],
$$

as well as the “estimation risk”

$$
R[g(x;\theta^*) \,|\, g(x;\hat\theta)] \triangleq E\bigl[D[g(x;\theta^*) \,|\, g(x;\hat\theta)]\bigr].
$$

It is known that $\hat\theta$ converges in probability to $\theta^*$. The convergence of the estimation risk to zero with increasing $n$ (and a fixed model) is investigated. Further, the asymptotic expansion of the risk with respect to $n$ (Section 2) is derived for both a general model (Section 2.1) and an exponential family model (Section 2.2).

In Section 3, the criterion to determine whether the MLE is sufficiently close to the information projection of the model is considered; namely, the “$p$-$n$ criterion.” In other words, this criterion indicates whether $n$ is sufficiently large for the given model, or whether the model dimension is sufficiently small for the given $n$. Section 3.1 shows how to estimate the unknown elements that appear in the asymptotic expansion of the estimation risk in Section 2. A method for setting a threshold $C$ for the estimation risk in relation to the Bayes error rate is also proposed. Section 3.2 describes the algorithm for calculating the $p$-$n$ criterion in the case of an exponential family. Section 3.3 demonstrates the use of the $p$-$n$ criterion for two practical examples.

Finally, Section 4 treats the “total risk”

$$
E\bigl[D[g(x) \,|\, g(x;\hat\theta)]\bigr];
$$

that is, the expected distance between $g(x)$ and $g(x;\hat\theta)$. By decomposing the total risk into the estimation risk and the “approximation risk,”

$$
D[g(x) \,|\, g(x;\theta^*)],
$$

the relationships between the obtained result and Takeuchi’s information criterion (TIC) and Akaike’s information criterion (AIC) are explained.

Throughout this study, the expectation over $\boldsymbol{X}$ under $g(x)$ is denoted by $E[\cdot]$, while the expectation under $g(x;\theta^*)$ is denoted by $E_{\theta^*}[\cdot]$.

2 Estimation Risk for General Case and Exponential Family

The convergence speed of $g(x;\hat\theta)$ to $g(x;\theta^*)$ is considered. The most relevant work to the present result is that of Barron and Sheu [6], which considers the convergence with respect to the K-L divergence for an exponential family on a compact set. In particular, those researchers considered an exponential family with polynomials, splines, and trigonometric functions as the basis functions, and obtained the convergence order of the divergence itself (rather than the risk) as $p$ and $n$ increased simultaneously under the condition $p^2/n \to 0$. (For the K-L divergence and nonparametric density estimation in general, see [15].)

In this section, the concrete terms of the estimation-risk asymptotic expansion are derived as preparation for the criterion based on the relationship between 𝑝 and 𝑛 , which is discussed in the next section.

2.1 Estimation Risk for General Case

The asymptotic expansion of

$$
R[g(x;\theta^*) \,|\, g(x;\hat\theta)] = E\bigl[D[g(x;\theta^*) \,|\, g(x;\hat\theta)]\bigr]
\tag{6}
$$

is derived up to the second-order term with respect to $n$.

Taylor expansion of

$$
D[g(x;\theta^*) \,|\, g(x;\hat\theta)] = \int g(x;\theta^*)\log\bigl(g(x;\theta^*)/g(x;\hat\theta)\bigr)\,d\mu
$$

as a function of $\hat\theta$ around $\theta^*$ is considered:

$$
\begin{aligned}
D[g(x;\theta^*) \,|\, g(x;\hat\theta)]
&= -\sum_i \int \frac{\partial}{\partial\theta^i} g(x;\theta)\Big|_{\theta=\theta^*}\,d\mu\,(\hat\theta^i-\theta^{*i})\\
&\quad + \frac{1}{2}\sum_{i,j}\int g(x;\theta^*)\Bigl(\frac{\partial}{\partial\theta^i}\log g(x;\theta)\Big|_{\theta=\theta^*}\Bigr)\Bigl(\frac{\partial}{\partial\theta^j}\log g(x;\theta)\Big|_{\theta=\theta^*}\Bigr)d\mu
\times(\hat\theta^i-\theta^{*i})(\hat\theta^j-\theta^{*j})\\
&\quad - \frac{1}{2}\sum_{i,j}\int \frac{\partial^2}{\partial\theta^i\partial\theta^j} g(x;\theta)\Big|_{\theta=\theta^*}\,d\mu\,(\hat\theta^i-\theta^{*i})(\hat\theta^j-\theta^{*j})\\
&\quad - \sum_{t=3}^{\infty}\frac{1}{t!}\sum_{i_1,\dots,i_t}\int g(x;\theta^*)\,\frac{\partial^t}{\partial\theta^{i_1}\cdots\partial\theta^{i_t}}\log g(x;\theta)\Big|_{\theta=\theta^*}\,d\mu
\times(\hat\theta^{i_1}-\theta^{*i_1})\cdots(\hat\theta^{i_t}-\theta^{*i_t}).
\end{aligned}
\tag{7}
$$

The equation

$$
\int g(x;\theta)\,d\mu = 1,\quad \forall\theta\in\Theta,
$$

yields

$$
\int \frac{\partial}{\partial\theta^i} g(x;\theta)\,d\mu = 0,\qquad
\int \frac{\partial^2}{\partial\theta^i\partial\theta^j} g(x;\theta)\,d\mu = 0,\quad \forall\theta\in\Theta.
\tag{8}
$$

Therefore,

$$
\begin{aligned}
R[g(x;\theta^*) \,|\, g(x;\hat\theta)]
&= \frac{1}{2}\sum_{i,j} g^*_{ij}(\theta^*)\,E\bigl[(\hat\theta^i-\theta^{*i})(\hat\theta^j-\theta^{*j})\bigr]\\
&\quad - \sum_{t=3}^{\infty}\frac{1}{t!}\sum_{i_1,\dots,i_t}\tau_{i_1,\dots,i_t}\,E\bigl[(\hat\theta^{i_1}-\theta^{*i_1})\cdots(\hat\theta^{i_t}-\theta^{*i_t})\bigr].
\end{aligned}
\tag{9}
$$

Here,

$$
\tau_{i_1,\dots,i_t} \triangleq \int g(x;\theta^*)\,\frac{\partial^t}{\partial\theta^{i_1}\cdots\partial\theta^{i_t}}\log g(x;\theta)\Big|_{\theta=\theta^*}\,d\mu,
$$

and $g^*_{ij}$ indicates the components of the Fisher metric matrix on $\mathcal{M}$, given by

$$
\begin{aligned}
g^*_{ij}(\theta^*) \triangleq (G^*(\theta^*))_{ij}
&\triangleq E_{\theta^*}\Bigl[\Bigl(\frac{\partial}{\partial\theta^i}\log g(x;\theta)\Big|_{\theta=\theta^*}\Bigr)\Bigl(\frac{\partial}{\partial\theta^j}\log g(x;\theta)\Big|_{\theta=\theta^*}\Bigr)\Bigr]\\
&= \int g(x;\theta^*)\Bigl(\frac{\partial}{\partial\theta^i}\log g(x;\theta)\Big|_{\theta=\theta^*}\Bigr)\Bigl(\frac{\partial}{\partial\theta^j}\log g(x;\theta)\Big|_{\theta=\theta^*}\Bigr)d\mu\\
&= -\int g(x;\theta^*)\,\frac{\partial^2}{\partial\theta^i\partial\theta^j}\log g(x;\theta)\Big|_{\theta=\theta^*}\,d\mu.
\end{aligned}
\tag{10}
$$

The second equation is derived from (8).
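Equation (10) can be sanity-checked in one dimension, where the K-L divergence has a closed form. For the model $g(x;\theta) = \theta e^{-\theta x}$, $D[g(\cdot;\theta^*) \,|\, g(\cdot;\theta)] = \log(\theta^*/\theta) + \theta/\theta^* - 1$, so the gradient at $\theta = \theta^*$ vanishes and the Hessian equals the Fisher information $1/\theta^{*2}$ (a numerical sketch, not from the paper):

```python
import math

# Closed-form K-L divergence from Exp(t_star) to Exp(t); its Hessian at
# t = t_star should equal the Fisher metric 1 / t_star^2, as in (10).
t_star = 2.0
D = lambda t: math.log(t_star / t) + t / t_star - 1.0

# Central finite differences around t_star
h = 1e-4
gradient = (D(t_star + h) - D(t_star - h)) / (2 * h)
hessian = (D(t_star + h) - 2.0 * D(t_star) + D(t_star - h)) / h ** 2
fisher = 1.0 / t_star ** 2
```

The vanishing gradient reflects that the zeroth- and first-order Taylor terms in (7) drop out by (8), leaving the Fisher metric as the leading coefficient.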

As $\theta^*$ is the solution of equation (5) and $\hat\theta$ is its empirical solution (i.e., an M-estimator), the following result holds (see, e.g., Theorem 5.21 of [26]):

$$
\sqrt{n}\,(\hat\theta-\theta^*) \xrightarrow{\ d\ } N_p\bigl(0,\ \tilde G^{-1} G \tilde G^{-1}\bigr),
\tag{11}
$$

where

$$
g_{ij}(\theta^*) \triangleq (G(\theta^*))_{ij}
\triangleq E\Bigl[\frac{\partial}{\partial\theta^i}\log g(X;\theta)\Big|_{\theta=\theta^*}\,\frac{\partial}{\partial\theta^j}\log g(X;\theta)\Big|_{\theta=\theta^*}\Bigr],
\tag{12}
$$

$$
\tilde g_{ij}(\theta^*) \triangleq (\tilde G(\theta^*))_{ij}
\triangleq -E\Bigl[\frac{\partial^2}{\partial\theta^j\partial\theta^i}\log g(X;\theta)\Big|_{\theta=\theta^*}\Bigr].
\tag{13}
$$
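The sandwich covariance in (11) can be checked by simulation (a sketch, not from the paper). For the misspecified fit of an Exponential($\theta$) model to Uniform(0,1) data, $\theta^* = 2$, $G = \mathrm{Var}(X) = 1/12$, and $\tilde G = 1/\theta^{*2} = 1/4$, so $\sqrt{n}(\hat\theta - \theta^*)$ should have variance $\tilde G^{-1} G \tilde G^{-1} = 4/3$:

```python
import random, math

random.seed(0)

# Monte Carlo: fit Exponential(theta) to Uniform(0,1) samples; the MLE is
# theta_hat = 1 / sample mean, with theta* = 2. Asymptotic variance of
# sqrt(n)(theta_hat - theta*) is G_tilde^{-1} G G_tilde^{-1} = 4 * (1/12) * 4.
n, reps = 300, 2000
zs = []
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    zs.append(math.sqrt(n) * (1.0 / xbar - 2.0))

mean_z = sum(zs) / reps
mc_var = sum((z - mean_z) ** 2 for z in zs) / reps
sandwich = 4.0 * (1.0 / 12.0) * 4.0  # = 4/3
```

Note that the naive "$G^{-1}$" or "$\tilde G^{-1}$" alone would give the wrong scale here; the sandwich form is what the misspecified M-estimation theory delivers.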

The following notation is defined, for $1 \le i,j,k,l \le p$:

$$
\begin{aligned}
L_{(ijk)} &\triangleq E\Bigl[\frac{\partial^3\log g(x;\theta^*)}{\partial\theta^i\,\partial\theta^j\,\partial\theta^k}\Bigr],\\
L_{(ij)k} &\triangleq E\Bigl[\frac{\partial^2\log g(x;\theta^*)}{\partial\theta^i\,\partial\theta^j}\,\frac{\partial\log g(x;\theta^*)}{\partial\theta^k}\Bigr],\\
L_{ijk} &\triangleq E\Bigl[\frac{\partial\log g(x;\theta^*)}{\partial\theta^i}\,\frac{\partial\log g(x;\theta^*)}{\partial\theta^j}\,\frac{\partial\log g(x;\theta^*)}{\partial\theta^k}\Bigr],\\
L_{(ijkl)} &\triangleq E\Bigl[\frac{\partial^4\log g(x;\theta^*)}{\partial\theta^i\,\partial\theta^j\,\partial\theta^k\,\partial\theta^l}\Bigr],\\
L_{(ij)(kl)} &\triangleq E\Bigl[\frac{\partial^2\log g(x;\theta^*)}{\partial\theta^i\,\partial\theta^j}\,\frac{\partial^2\log g(x;\theta^*)}{\partial\theta^k\,\partial\theta^l}\Bigr],\\
L_{(ijk)l} &\triangleq E\Bigl[\frac{\partial^3\log g(x;\theta^*)}{\partial\theta^i\,\partial\theta^j\,\partial\theta^k}\,\frac{\partial\log g(x;\theta^*)}{\partial\theta^l}\Bigr],\\
L_{(ij)kl} &\triangleq E\Bigl[\frac{\partial^2\log g(x;\theta^*)}{\partial\theta^i\,\partial\theta^j}\,\frac{\partial\log g(x;\theta^*)}{\partial\theta^k}\,\frac{\partial\log g(x;\theta^*)}{\partial\theta^l}\Bigr].
\end{aligned}
$$
The next theorem states the asymptotic expansion of the estimation risk up to the $n^{-2}$ term for a general distribution model. For brevity, Einstein notation is used and the dependency on $\theta^*$ is omitted; e.g., $G$ for $G(\theta^*)$ and $\tilde g_{ij}$ for $\tilde g_{ij}(\theta^*)$.

Theorem 1.

The MLE estimation risk with respect to the K-L divergence is given by

$$
\begin{aligned}
&R[g(x;\theta^*) \,|\, g(x;\hat\theta)]\\
&= (2n)^{-1}\,\mathrm{tr}\bigl(\tilde G^{-1} G \tilde G^{-1} G^*\bigr)\\
&\quad + n^{-2}\Bigl[\,2^{-1} g^*_{ij}\Bigl(\tilde g^{sj}\tilde g^{it}\tilde g^{lm}\bigl(L_{(sl)tm} + \tilde g_{ls}g_{tm}\bigr)
+ \tilde g^{si}\tilde g^{jt}\tilde g^{lm}\bigl(L_{(sl)tm} + \tilde g_{ls}g_{tm}\bigr)\\
&\qquad + 2^{-1}\tilde g^{uj}\tilde g^{ik}\tilde g^{ls}\tilde g^{mt}L_{kst}L_{(lmu)}
+ 2^{-1}\tilde g^{ui}\tilde g^{jk}\tilde g^{ls}\tilde g^{mt}L_{kst}L_{(lmu)}\\
&\qquad + \tilde g^{jk}\tilde g^{lu}\tilde g^{is}\tilde g^{mt}\bigl(L_{(kl)(um)}g_{st} - \tilde g_{kl}\tilde g_{um}g_{st} + L_{(kl)s}L_{(um)t} + L_{(kl)t}L_{(um)s}\bigr)\\
&\qquad + \tilde g^{ik}\tilde g^{lu}\tilde g^{js}\tilde g^{mt}\bigl(L_{(kl)(um)}g_{st} - \tilde g_{kl}\tilde g_{um}g_{st} + L_{(kl)s}L_{(um)t} + L_{(kl)t}L_{(um)s}\bigr)\\
&\qquad + 2^{-1}\tilde g^{jk}\tilde g^{it}\tilde g^{mu}\tilde g^{sv}\tilde g^{wl}L_{(msw)}\bigl(L_{(lk)t}g_{uv} + L_{(lk)u}g_{tv} + L_{(lk)v}g_{tu}\bigr)\\
&\qquad + 2^{-1}\tilde g^{ik}\tilde g^{jt}\tilde g^{mu}\tilde g^{sv}\tilde g^{wl}L_{(msw)}\bigl(L_{(lk)t}g_{uv} + L_{(lk)u}g_{tv} + L_{(lk)v}g_{tu}\bigr)\\
&\qquad + 2^{-1}\tilde g^{js}\tilde g^{it}\tilde g^{lu}\tilde g^{mv}\bigl(L_{(slm)t}g_{uv} + L_{(slm)u}g_{tv} + L_{(slm)v}g_{tu}\bigr)\\
&\qquad + 2^{-1}\tilde g^{is}\tilde g^{jt}\tilde g^{lu}\tilde g^{mv}\bigl(L_{(slm)t}g_{uv} + L_{(slm)u}g_{tv} + L_{(slm)v}g_{tu}\bigr)\\
&\qquad + \tilde g^{mt}\tilde g^{iu}\tilde g^{lv}\tilde g^{sw}\tilde g^{kj}L_{(lmk)}\bigl(L_{(ts)u}g_{vw} + L_{(ts)v}g_{uw} + L_{(ts)w}g_{uv}\bigr)\\
&\qquad + \tilde g^{mt}\tilde g^{ju}\tilde g^{lv}\tilde g^{sw}\tilde g^{ki}L_{(lmk)}\bigl(L_{(ts)u}g_{vw} + L_{(ts)v}g_{uw} + L_{(ts)w}g_{uv}\bigr)\\
&\qquad + 2^{-1}\tilde g^{ik}\tilde g^{lu}\tilde g^{sv}\tilde g^{tw}\tilde g^{oj}\tilde g^{hm}L_{(lmo)}L_{(sth)}\bigl(g_{ku}g_{vw} + g_{kv}g_{uw} + g_{kw}g_{uv}\bigr)\\
&\qquad + 2^{-1}\tilde g^{jk}\tilde g^{lu}\tilde g^{sv}\tilde g^{tw}\tilde g^{oi}\tilde g^{hm}L_{(lmo)}L_{(sth)}\bigl(g_{ku}g_{vw} + g_{kv}g_{uw} + g_{kw}g_{uv}\bigr)\\
&\qquad + 6^{-1}\tilde g^{ik}\tilde g^{ls}\tilde g^{mu}\tilde g^{tv}\tilde g^{wj}L_{(lmtw)}\bigl(g_{ks}g_{uv} + g_{ku}g_{sv} + g_{kv}g_{su}\bigr)\\
&\qquad + 6^{-1}\tilde g^{jk}\tilde g^{ls}\tilde g^{mu}\tilde g^{tv}\tilde g^{wi}L_{(lmtw)}\bigl(g_{ks}g_{uv} + g_{ku}g_{sv} + g_{kv}g_{su}\bigr)\\
&\qquad + \tilde g^{ik}\tilde g^{js}\tilde g^{lt}\tilde g^{mu}\bigl(L_{(kl)(sm)}g_{tu} - \tilde g_{kl}\tilde g_{sm}g_{tu} + L_{(kl)t}L_{(sm)u} + L_{(kl)u}L_{(sm)t}\bigr)\\
&\qquad + 2^{-1}\tilde g^{ik}\tilde g^{lu}\tilde g^{sv}\tilde g^{tw}\tilde g^{jm}L_{(stm)}\bigl(L_{(kl)u}g_{vw} + L_{(kl)v}g_{uw} + L_{(kl)w}g_{uv}\bigr)\\
&\qquad + 2^{-1}\tilde g^{jk}\tilde g^{lu}\tilde g^{sv}\tilde g^{tw}\tilde g^{im}L_{(stm)}\bigl(L_{(kl)u}g_{vw} + L_{(kl)v}g_{uw} + L_{(kl)w}g_{uv}\bigr)\\
&\qquad + 4^{-1}\tilde g^{lk}\tilde g^{mu}\tilde g^{sv}\tilde g^{tw}\tilde g^{io}\tilde g^{jh}L_{(lmo)}L_{(sth)}\bigl(g_{ku}g_{vw} + g_{kv}g_{uw} + g_{kw}g_{uv}\bigr)\Bigr)\\
&\quad - 6^{-1}\tau_{ijk}\Bigl(\tilde g^{is}\tilde g^{jt}\tilde g^{ku}L_{stu}
+ \tilde g^{it}\tilde g^{su}\tilde g^{jv}\tilde g^{kw}\bigl(L_{(st)u}g_{vw} + L_{(st)v}g_{uw} + L_{(st)w}g_{uv}\bigr)\\
&\qquad + \tilde g^{jt}\tilde g^{su}\tilde g^{iv}\tilde g^{kw}\bigl(L_{(st)u}g_{vw} + L_{(st)v}g_{uw} + L_{(st)w}g_{uv}\bigr)
+ \tilde g^{kt}\tilde g^{su}\tilde g^{iv}\tilde g^{jw}\bigl(L_{(st)u}g_{vw} + L_{(st)v}g_{uw} + L_{(st)w}g_{uv}\bigr)\\
&\qquad + 2^{-1}\tilde g^{su}\tilde g^{tv}\tilde g^{jw}\tilde g^{km}\tilde g^{il}L_{(stl)}\bigl(g_{uv}g_{wm} + g_{uw}g_{vm} + g_{um}g_{vw}\bigr)\\
&\qquad + 2^{-1}\tilde g^{su}\tilde g^{tv}\tilde g^{iw}\tilde g^{km}\tilde g^{jl}L_{(stl)}\bigl(g_{uv}g_{wm} + g_{uw}g_{vm} + g_{um}g_{vw}\bigr)\\
&\qquad + 2^{-1}\tilde g^{su}\tilde g^{tv}\tilde g^{iw}\tilde g^{jm}\tilde g^{kl}L_{(stl)}\bigl(g_{uv}g_{wm} + g_{uw}g_{vm} + g_{um}g_{vw}\bigr)\Bigr)\\
&\quad - 24^{-1}\tau_{ijkl}\,\tilde g^{is}\tilde g^{jt}\tilde g^{ku}\tilde g^{lv}\bigl(g_{st}g_{uv} + g_{su}g_{tv} + g_{sv}g_{tu}\bigr)\Bigr]\\
&\quad + O(n^{-3}).
\end{aligned}
\tag{14}
$$

Proof.

The proof is lengthy and is given in Section 5.1 of the Appendix. ∎

As shown by Efron [12] and Amari [2], the first- and second-order terms (the $n^{-1}$- and $n^{-2}$-order terms) are related, respectively, to the metrics and connections in the extended manifold that includes both $g(x)$ and $\mathcal{M}$. This issue is not investigated here. Note only that, if $g(x)$ exists within the model, then $G = \tilde G = G^*$; hence, the first-order term equals $p/(2n)$ (see also [19]). Thus, the first-order term is mainly determined by $p$ if $g(x;\theta^*)$ is close to $g(x)$.

The second-order term is very complex, and its calculation for a specific distribution model requires extensive computational resources. Because this term is difficult to use for practical purposes, the first-order term is the focus of the remainder of this section; it is studied using some examples.
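The first-order behavior is easy to see numerically. In the well-specified case the estimation risk should be close to $p/(2n)$; a Monte Carlo sketch (not from the paper) for the one-parameter Exponential family, where the K-L divergence has a closed form:

```python
import random, math

random.seed(1)

# True distribution Exp(2) lies inside the model {Exp(theta)}; p = 1,
# so the leading term of E[ D[g(.;theta*) | g(.;theta_hat)] ] is 1/(2n).
n, reps = 200, 5000
t_star = 2.0
risks = []
for _ in range(reps):
    xbar = sum(random.expovariate(t_star) for _ in range(n)) / n
    t_hat = 1.0 / xbar  # MLE for the exponential rate
    # Closed-form K-L divergence between Exp(t_star) and Exp(t_hat)
    risks.append(math.log(t_star / t_hat) + t_hat / t_star - 1.0)

risk = sum(risks) / reps
first_order = 1.0 / (2 * n)   # p/(2n) with p = 1
```

The simulated risk tracks $p/(2n)$ up to an $O(n^{-2})$ remainder, which is exactly what Theorem 1 predicts for a correctly specified model.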

First, a normal regression model is taken as an example and the first-order term of Theorem 1 is applied.

– Example 1: Normal regression model –

Let $h(x)$ be the true p.d.f. of the explanatory variables $X \triangleq (X_1,\dots,X_p)$ with respect to the Lebesgue measure. Consider the following normal regression model:

$$
Y = \sum_{i=1}^{p}\theta^i X_i + \epsilon,\qquad \epsilon\sim N\bigl(0,(\theta^0)^{-1}\bigr),
$$

where $X$ and $\epsilon$ are independently distributed. The parametric distribution model of $(Y,X)$ is given by

$$
\{\, g(y,x;\theta) \mid \theta^0>0,\ -\infty<\theta^i<\infty,\ i=1,\dots,p \,\},
$$

where

$$
g(y,x;\theta) = \exp\Bigl(-\frac{\theta^0}{2}\Bigl(y-\sum_{i=1}^{p}\theta^i x_i\Bigr)^2 - \frac{1}{2}\log(2\pi) + \frac{1}{2}\log\theta^0\Bigr)
$$

and $d\mu = h(x)\,dx$.

Under the true distribution of $(Y,X)$, consider the distribution of the random variable

$$
\epsilon(Y,X;\theta^*) \triangleq Y - \sum_{i=1}^{p}\theta^{*i}X_i.
$$

As the true distribution of $(Y,X)$ is outside $\mathcal{M}$, $\epsilon(Y,X;\theta^*)$ is distributed differently from what the normal regression model assumes; that is, the assumption that $\epsilon(Y,X;\theta^*)$ is independent of $X$ and normally distributed is invalid. Consider the following three non-normal cases for the distribution of $\epsilon(Y,X;\theta^*)$, ordered from the greatest discrepancy with the regression model assumption (Case 1) to the least (Case 3).

Case 1: $\epsilon(Y,X;\theta^*)$ is correlated with $X$.

Case 2: $\epsilon(Y,X;\theta^*)$ is distributed independently of $X$ but has moments different from those of $N(0,(\theta^{*0})^{-1})$.

Case 3: $\epsilon(Y,X;\theta^*)$ is distributed independently of $X$ and has the same kurtosis as $N(0,(\theta^{*0})^{-1})$; i.e.,

$$
\frac{E[\epsilon^4(Y,X;\theta^*)]}{E^2[\epsilon^2(Y,X;\theta^*)]} = 3.
$$

For each case, the asymptotic risk is determined as follows (for the derivation, see Section 5.3 of the Appendix). For Case 1,

$$
R[g(x;\theta^*) \,|\, g(x;\hat\theta)]
= \frac{1}{2n}\Bigl(\mathrm{tr}(S^{-1}T)/E[\epsilon^2(Y,X;\theta^*)]
+ \frac{1}{2}\bigl(E[\epsilon^4(Y,X;\theta^*)]/E^2[\epsilon^2(Y,X;\theta^*)] - 1\bigr)\Bigr) + o(n^{-1}),
\tag{15}
$$

where $(S)_{ij} \triangleq E[X_iX_j]$ and $(T)_{ij} \triangleq E[X_iX_j\,\epsilon^2(Y,X;\theta^*)]$, $i,j=1,\dots,p$. For Case 2,

$$
R[g(x;\theta^*) \,|\, g(x;\hat\theta)]
= \frac{1}{2n}\Bigl(p + \frac{1}{2}\bigl(E[\epsilon^4(Y,X;\theta^*)]/E^2[\epsilon^2(Y,X;\theta^*)] - 1\bigr)\Bigr) + o(n^{-1}).
\tag{16}
$$

For Case 3,

$$
R[g(x;\theta^*) \,|\, g(x;\hat\theta)] = \frac{p+1}{2n} + o(n^{-1}).
\tag{17}
$$

Hence, if the kurtosis of $\epsilon(Y,X;\theta^*)$ is less than 3 (“platykurtic”) in (16), the first-order term is clearly less than $(p+1)/(2n)$. This means that the risk converges faster than in the case where the model contains the true distribution.

Note that a small estimation risk does not guarantee a small total risk. In this example, if the kurtosis of $\epsilon$ under $g(x)$ is smaller than that under $g(x;\theta^*)$, this indicates a discrepancy between the two distributions and may produce a large approximation risk (see Section 4).
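Case 2 can be made concrete with a specific platykurtic error law (an illustrative choice, not from the paper): for errors uniform on $(-a,a)$, $E[\epsilon^2] = a^2/3$ and $E[\epsilon^4] = a^4/5$, so the kurtosis is $9/5$ and the first-order coefficient $p + (\text{kurtosis}-1)/2$ of (16) falls below the well-specified value $p+1$:

```python
import random

random.seed(3)

# Uniform(-a, a) errors: kurtosis E[eps^4]/E^2[eps^2] = (a^4/5)/(a^2/3)^2 = 9/5.
p = 3
kurtosis_uniform = 9.0 / 5.0
coeff = p + 0.5 * (kurtosis_uniform - 1.0)   # first-order coefficient in (16)
coeff_normal = p + 1.0                       # kurtosis = 3 when the model holds

# Monte Carlo check of the kurtosis value (a = 1)
eps = [2.0 * random.random() - 1.0 for _ in range(200_000)]
m2 = sum(e ** 2 for e in eps) / len(eps)
m4 = sum(e ** 4 for e in eps) / len(eps)
mc_kurtosis = m4 / m2 ** 2
```

So with $p = 3$ the coefficient drops from $4$ to $3.4$: platykurtic residuals genuinely speed up the first-order convergence of the estimation risk.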

Next, the Poisson regression model is considered as another example.

– Example 2: Poisson Regression Model –

Let $h(x)$ be the true p.d.f. of the explanatory variables $X \triangleq (X_1,\dots,X_p)$ with respect to the Lebesgue measure. Suppose that, when $X$ is given as $x = (x_1,\dots,x_p)$, $Y$ follows the Poisson distribution with mean

$$
\lambda(x;\theta) = \exp\Bigl(\sum_{i=1}^{p}\theta^i x_i\Bigr).
$$

Then, the distribution model of $(Y,X)$ is given by the p.d.f. of the form

$$
g(y,x\,|\,\theta) = \exp\Bigl(\sum_{i=1}^{p}\theta^i x_i y - \lambda\Bigr) = \lambda^y\exp(-\lambda),
$$

where the reference measure $d\mu$ is the product of the discrete measure $1/y!$ on $\{y \mid y = 0,1,2,\dots\}$ and the continuous measure $h(x)\,dx$ on $\Re^p$.

As for the true distribution, we postulate that the conditional distribution of $Y$ given $X = x$ is the Poisson distribution with mean $\lambda_0(x) = \exp(\xi(x))$. This differs from the model in that the conditional log mean of $Y$ is nonlinear.

As

$$
\frac{\partial}{\partial\theta^i}\log g(y,x\,|\,\theta) = x_i y - \lambda(x;\theta)\,x_i,\qquad
\frac{\partial^2}{\partial\theta^i\partial\theta^j}\log g(y,x\,|\,\theta) = -\lambda(x;\theta)\,x_i x_j,
$$

for $i,j = 1,\dots,p$,

$$
\tilde g_{ij}(\theta^*) = E[X_iX_j\,\lambda(X;\theta^*)] = g^*_{ij}(\theta^*),
$$

$$
\begin{aligned}
g_{ij}(\theta^*) &= E\bigl[X_iX_j\,(Y-\lambda(X;\theta^*))^2\bigr]\\
&= E[X_iX_jY^2] - 2E[X_iX_jY\lambda(X;\theta^*)] + E[X_iX_j\lambda^2(X;\theta^*)]\\
&= E\bigl[X_iX_j(\lambda_0(X)+\lambda_0^2(X))\bigr] - 2E[X_iX_j\lambda_0(X)\lambda(X;\theta^*)] + E[X_iX_j\lambda^2(X;\theta^*)]\\
&= E\bigl[X_iX_j\bigl(\lambda_0(X) + (\lambda(X;\theta^*)-\lambda_0(X))^2\bigr)\bigr].
\end{aligned}
$$

Consequently, $\tilde G(\theta^*) = G^*(\theta^*)$ and

$$
\mathrm{tr}\bigl(\tilde G(\theta^*)^{-1}G(\theta^*)\tilde G(\theta^*)^{-1}G^*(\theta^*)\bigr)
= \mathrm{tr}\bigl(\tilde G(\theta^*)^{-1}G(\theta^*)\bigr)
= p + \mathrm{tr}\bigl(\tilde G(\theta^*)^{-1}(G(\theta^*)-\tilde G(\theta^*))\bigr),
$$

with

$$
\bigl(G(\theta^*)-\tilde G(\theta^*)\bigr)_{ij}
= E\bigl[X_iX_j\bigl(\lambda(X;\theta^*)-\lambda_0(X)\bigr)\bigl(\lambda(X;\theta^*)-\lambda_0(X)-1\bigr)\bigr].
$$

Hence, if $0 < \lambda(X;\theta^*) - \lambda_0(X) < 1$ almost everywhere, the estimation risk converges faster than in the case where the model includes the true distribution.
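A minimal numeric illustration of the same mechanism (an intercept-only simplification of this example, not from the paper): with the single basis function $x_1 \equiv 1$, $\tilde G = G^* = E[Y] = \lambda^*$ while $G = \mathrm{Var}(Y)$, so the first-order risk term is $(\mathrm{Var}\,Y/E[Y])/(2n)$. Underdispersed count data (e.g. Binomial, where variance < mean) then beat the well-specified rate $1/(2n)$:

```python
import random

random.seed(2)

# Y ~ Binomial(10, 0.3): mean = 3, variance = 2.1, dispersion Var/mean = 0.7.
k, q = 10, 0.3
dispersion = (k * q * (1 - q)) / (k * q)   # = 1 - q = 0.7 < 1

# Monte Carlo check of the same dispersion ratio
ys = [sum(random.random() < q for _ in range(k)) for _ in range(50_000)]
m = sum(ys) / len(ys)
v = sum((y - m) ** 2 for y in ys) / len(ys)
mc_dispersion = v / m
```

Here $G/\tilde G = 0.7 < 1$, so the first-order estimation-risk coefficient is $0.7/(2n)$ rather than $1/(2n)$, mirroring the condition $0 < \lambda(X;\theta^*) - \lambda_0(X) < 1$ in the regression version.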

2.2 Estimation Risk for Exponential Family

This subsection investigates the estimation risk when the parametric model is an exponential family (for general references on exponential families, see [7] and [5]).

Let the model $\mathcal{M}$ be given by

$$
\Bigl\{\, g(x;\theta) = \exp\Bigl(\sum_{i=1}^{p}\theta^i\xi_i(x) - \Psi(\theta)\Bigr) \Bigm| \theta\in\Theta \,\Bigr\},
\tag{18}
$$

where $\Psi(\theta)$ is the cumulant-generating function of the $\xi$ terms:

$$
\Psi(\theta) = \log\int\exp\Bigl(\sum_{i=1}^{p}\theta^i\xi_i(x)\Bigr)\,d\mu.
$$

The “dual coordinate” $\eta$ is defined as

$$
\eta_i(\theta) \triangleq \frac{\partial\Psi(\theta)}{\partial\theta^i} = E_\theta[\xi_i],\quad i=1,\dots,p.
\tag{19}
$$

In particular, from the definition of $\theta^*$ (see (5)),

$$
\eta^*_i \triangleq \eta_i(\theta^*) = E_{\theta^*}[\xi_i] = E[\xi_i],\quad i=1,\dots,p.
\tag{20}
$$

The last equation requires the means of $\xi_i$ to coincide under $g(x)$ and $g(x;\theta^*)$. It is known that $g(x;\theta^*)$ maximizes the Shannon entropy among all probability distributions for a given $E[\xi_i]$, $i=1,\dots,p$ (the “entropy maximization property” of an exponential family; see, e.g., [27]). The K-L divergence is the difference between the cross-entropy and the Shannon entropy. Another association of the K-L divergence with the exponential family derives from a geometrical perspective. That is, the $\alpha$-divergence induces a corresponding “flat” manifold of the parametric distribution model ($\alpha$-family). When $\alpha = 1$, the divergence is the conjugate divergence of the K-L divergence, and the corresponding manifold is the exponential family (see [4]).

The $\eta$ coordinate is easily estimated. In fact, $\hat\eta$, the MLE for $\eta$, is the sample mean of $\xi$. Hence,

$$
\hat\eta_i = \frac{\partial\Psi}{\partial\theta^i}(\hat\theta) = \bar\xi_i\ \Bigl(\triangleq n^{-1}\sum_{t=1}^{n}\xi_i(X_t)\Bigr).
\tag{21}
$$

In contrast, $\hat\theta$ is difficult to obtain explicitly because $\Psi$ or its derivative cannot be obtained theoretically for a complex model. This could pose a serious obstacle to the application of an exponential family model to a practical problem, and is discussed in Section 3.2.
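In one dimension, $\partial\Psi/\partial\theta$ is strictly increasing, so (21) can be inverted numerically even when no closed form is at hand. A sketch (an illustrative case, not from the paper) for the Bernoulli family $\xi_1(x) = x$, $\Psi(\theta) = \log(1+e^\theta)$, where the inverse happens to be the logit:

```python
import math

# dPsi/dtheta for Psi(theta) = log(1 + exp(theta)) is the logistic function.
psi_dot = lambda t: 1.0 / (1.0 + math.exp(-t))

def mle(xi_bar, lo=-30.0, hi=30.0, iters=100):
    """Solve dPsi/dtheta (theta) = xi_bar by bisection, exploiting that
    the map theta -> dPsi/dtheta is strictly increasing (Psi is convex)."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if psi_dot(mid) < xi_bar:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

xi_bar = 0.8                                      # observed sample mean of xi
theta_hat = mle(xi_bar)
closed_form = math.log(xi_bar / (1.0 - xi_bar))   # logit(0.8), exact inverse
```

The monotonicity that makes bisection valid is exactly the convexity of $\Psi$ established below via (22); for $p > 1$ the same idea becomes a convex optimization problem.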

Let the matrix $\ddot\Psi(\theta)$ be defined by

$$
(\ddot\Psi(\theta))_{ij} \triangleq \frac{\partial^2\Psi(\theta)}{\partial\theta^i\partial\theta^j}
= E_\theta\bigl[(\xi_i-\eta_i)(\xi_j-\eta_j)\bigr],\quad 1\le i,j\le p.
\tag{22}
$$

Thus, $\ddot\Psi$ is the covariance matrix of the $\xi_i$ terms under $g(x;\theta)$; hence, it is positive definite and $\Psi(\theta)$ is a convex function. The notable property

$$
g^*_{ij}(\theta) = \tilde g_{ij}(\theta),\quad 1\le i,j\le p,\ \forall\theta,
$$

is proven by the fact that both sides equal $(\ddot\Psi(\theta))_{ij}$.
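Identity (22) can be checked on the one-dimensional Poisson family, where $\xi_1(x) = x$ and $\Psi(\theta) = e^\theta$ with reference measure $1/x!$: the Hessian of $\Psi$ and the variance of $\xi_1$ under $g(\cdot;\theta)$ are both $e^\theta$ (a numerical sketch, not from the paper):

```python
import math

# Poisson family in natural parameters: Psi(theta) = exp(theta), and the
# distribution g(.; theta) is Poisson with mean exp(theta), whose variance
# is also exp(theta). Eq. (22) says Psi''(theta) equals that variance.
theta = 0.7
psi = math.exp

h = 1e-4
psi_ddot = (psi(theta + h) - 2 * psi(theta) + psi(theta - h)) / h ** 2
variance = math.exp(theta)   # Var of xi_1 under Poisson(exp(theta))
```

The check is trivial here because $\Psi$ is its own derivative, but the same finite-difference probe applies to any family whose $\Psi$ can only be evaluated numerically.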

The following notation is used for the third- and fourth-order cumulants:

$$
\begin{aligned}
\kappa_{ijk} &\triangleq E\bigl[(\xi_i-\eta^*_i)(\xi_j-\eta^*_j)(\xi_k-\eta^*_k)\bigr] = L_{ijk},\\
\kappa^*_{ijk} &\triangleq E_{\theta^*}\bigl[(\xi_i-\eta^*_i)(\xi_j-\eta^*_j)(\xi_k-\eta^*_k)\bigr]
= \frac{\partial^3\Psi(\theta^*)}{\partial\theta^i\partial\theta^j\partial\theta^k} = -L_{(ijk)},\\
\kappa^*_{ijkl} &\triangleq E_{\theta^*}\bigl[(\xi_i-\eta^*_i)(\xi_j-\eta^*_j)(\xi_k-\eta^*_k)(\xi_l-\eta^*_l)\bigr]\\
&\quad - E_{\theta^*}\bigl[(\xi_i-\eta^*_i)(\xi_j-\eta^*_j)\bigr]E_{\theta^*}\bigl[(\xi_k-\eta^*_k)(\xi_l-\eta^*_l)\bigr]\\
&\quad - E_{\theta^*}\bigl[(\xi_i-\eta^*_i)(\xi_k-\eta^*_k)\bigr]E_{\theta^*}\bigl[(\xi_j-\eta^*_j)(\xi_l-\eta^*_l)\bigr]\\
&\quad - E_{\theta^*}\bigl[(\xi_i-\eta^*_i)(\xi_l-\eta^*_l)\bigr]E_{\theta^*}\bigl[(\xi_j-\eta^*_j)(\xi_k-\eta^*_k)\bigr]\\
&= \frac{\partial^4\Psi(\theta^*)}{\partial\theta^i\partial\theta^j\partial\theta^k\partial\theta^l} = -L_{(ijkl)}
\end{aligned}
\tag{23}
$$

for $1\le i,j,k,l\le p$. As a corollary of Theorem 1, the following result holds.

Corollary 1.

If, additionally, the parametric model is an exponential family, the estimation risk is given by

$$
\begin{aligned}
R[g(x;\theta^*) \,|\, g(x;\hat\theta)]
&= \frac{1}{2n}\,\mathrm{tr}\bigl(\tilde G^{-1}G\bigr)\\
&\quad + \frac{1}{24n^2}\Bigl[-8\,\tilde g^{uk}\tilde g^{ls}\tilde g^{mt}\kappa_{kst}\kappa^*_{lmu}\\
&\qquad + 9\,\tilde g^{ko}\tilde g^{lu}\tilde g^{sv}\tilde g^{tw}\tilde g^{hm}\kappa^*_{lmo}\kappa^*_{sth}\bigl(g_{ku}g_{vw} + g_{kv}g_{uw} + g_{kw}g_{uv}\bigr)\\
&\qquad - 3\,\tilde g^{kw}\tilde g^{ls}\tilde g^{mu}\tilde g^{tv}\kappa^*_{lmtw}\bigl(g_{ks}g_{uv} + g_{ku}g_{sv} + g_{kv}g_{su}\bigr)\Bigr]\\
&\quad + O(n^{-3}).
\end{aligned}
\tag{24}
$$

Proof.

See Section 5.2 of the Appendix. ∎

The estimation risk up to the second-order term is determined by the moments of the $\xi_i$ terms under $g(x)$, namely $g_{ij}$ and $\kappa_{ijk}$, together with their moments under $g(x;\theta^*)$, namely $\tilde g_{ij}$, $\kappa^*_{ijk}$, and $\kappa^*_{ijkl}$.

Note that the first-order term can be rewritten in different ways:

$$\begin{aligned}
R[g(x;\theta^*)\,|\,g(x;\hat\theta)]
&= \frac{1}{2n}\,\mathrm{tr}\bigl(\tilde G(\theta^*)^{-1} G(\theta^*)\bigr) + o(n^{-1}) & (25)\\
&= \frac{p}{2n} + \frac{1}{2n}\,\mathrm{tr}\bigl(\tilde G(\theta^*)^{-1}(G(\theta^*) - \tilde G(\theta^*))\bigr) + o(n^{-1}) & (26)\\
&= \frac{p}{2n} + \frac{1}{2n}\,\mathrm{tr}\bigl(\tilde G(\theta^*)^{-1}(S - S^*)\bigr) + o(n^{-1}), & (27)
\end{aligned}$$

where

$$(S)_{ij} \triangleq E[\xi_i\xi_j], \qquad (S^*)_{ij} \triangleq E_{\theta^*}[\xi_i\xi_j], \qquad i,j = 1,\dots,p.$$

Further, (27) can be proven as follows. As $E[\xi_i] = E_{\theta^*}[\xi_i] = \eta^*_i$,

$$\begin{aligned}
(G(\theta^*))_{ij}
&= E\bigl[(\xi_i-\eta^*_i)(\xi_j-\eta^*_j)\bigr] \\
&= E[\xi_i\xi_j] - E[\xi_i]E[\xi_j] \\
&= E[\xi_i\xi_j] - E_{\theta^*}[\xi_i\xi_j] + E_{\theta^*}[\xi_i\xi_j] - E_{\theta^*}[\xi_i]E_{\theta^*}[\xi_j] \\
&= (S)_{ij} - (S^*)_{ij} + (G^*(\theta^*))_{ij} \\
&= (S)_{ij} - (S^*)_{ij} + (\tilde G(\theta^*))_{ij},
\end{aligned}$$

which means

$$G(\theta^*) - \tilde G(\theta^*) = S - S^*. \qquad (28)$$

Because $G(\theta^*)$ and $\tilde G(\theta^*)$ are the variance–covariance matrices of the $\xi_i$ terms under $g(x)$ and $g(x;\theta^*)$, respectively, the first-order term in (25) can be interpreted as measuring the distance between these two covariance matrices.

From equation (26) or (27),

$$G(\theta^*) - \tilde G(\theta^*) = S - S^* < (>)\ 0 \;\Longrightarrow\; \text{first-order term of } R[g(x;\theta^*)\,|\,g(x;\hat\theta)] < (>)\ \frac{p}{2n}.$$

That is, the first-order risk converges faster than in the case where the model includes $g(x)$ if the second-moment matrix of the $\xi_i$ terms under $g(x;\theta^*)$ is larger than that under $g(x)$.
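This implication can be checked numerically. The sketch below (a Python illustration; the matrices `G` and `G_tilde` are made up for this example and are not taken from the paper) evaluates the first-order term $\mathrm{tr}(\tilde G^{-1}G)/(2n)$ and compares it with the well-specified value $p/(2n)$:

```python
import numpy as np

# Illustrative (made-up) 3-dimensional example: G_tilde is the covariance
# of the xi_i terms under the information projection g(x; theta*); G is
# their covariance under the true distribution g(x).
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
G_tilde = A @ A.T + 3.0 * np.eye(3)   # positive-definite model covariance
G = G_tilde + 0.5 * np.eye(3)         # true covariance; G - G_tilde = S - S* > 0

n, p = 1000, 3
first_order = np.trace(np.linalg.solve(G_tilde, G)) / (2 * n)
well_specified = p / (2 * n)

# Since S - S* > 0, the first-order term must exceed p/(2n).
assert first_order > well_specified
print(first_order, well_specified)
```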

As a first example, let us consider the special case in which $\Psi(\theta)$ is a quadratic function.

– Example 3: Quadratic Exponential Model – Let $\Psi$ be defined by

$$\Psi(\theta) = \sum_i m_i\theta_i + \frac{1}{2}\sum_{i,j}\theta_i\theta_j q_{ij} = m\theta^t + \frac{1}{2}\theta Q\theta^t, \qquad (29)$$

where

$$\theta = (\theta_1,\dots,\theta_p), \qquad m = (m_1,\dots,m_p), \qquad (Q)_{ij} = q_{ij},$$

and $Q$ is positive-definite. Because the higher-order cumulants vanish for the model distribution, the estimation risk is given by

$$\frac{1}{2n}\,\mathrm{tr}\bigl(\tilde G^{-1}(\theta^*)G(\theta^*)\bigr) = \frac{1}{2n}\,\mathrm{tr}\bigl(Q^{-1}G(\theta^*)\bigr) = \frac{p}{2n} + \frac{1}{2n}\,\mathrm{tr}\bigl(Q^{-1}(G(\theta^*)-Q)\bigr). \qquad (30)$$

For this type of exponential family, the $\xi_i$ terms follow the normal distribution

$$\xi(X) = (\xi_1(X),\dots,\xi_p(X)) \sim N_p(m + \theta Q,\ Q).$$

From (30), $Q > G(\theta^*)$ (in the positive-definite ordering) indicates faster convergence than in the well-specified case (i.e., the case where the model contains the true distribution).

The next example is the multinomial distribution, where the explicit form of the second-order term is given.

– Example 4: Multinomial Distribution Model – Consider a multinomial distribution with $p+1$ possible values $x_i$, $i = 0,\dots,p$, with the corresponding probabilities $m = (m_0,\dots,m_p)$. This is an exponential family (18), where

$$\theta_i \triangleq \log(m_i/m_0), \qquad i = 1,\dots,p,$$

$$\xi_i(x) \triangleq \begin{cases}1, & \text{if } x = x_i,\\ 0, & \text{otherwise,}\end{cases} \qquad i = 1,\dots,p,$$

and $d\mu$ is the counting measure on $\{x_0, x_1,\dots,x_p\}$. Here, with $\theta_0 \triangleq 0$,

$$\Psi(\theta) = \log\Bigl(\sum_{i=0}^{p}\exp(\theta_i)\Bigr) = -\log m_0 = -\log\Bigl(1 - \sum_{i=1}^{p} m_i\Bigr).$$

Suppose that $g(x)$ is continuous, and the parametric model $g(x;m)$ is an approximation of $g(x)$ with the step function

$$g(x;m) = \sum_{i=0}^{p} I(x\in S_i)\,\frac{m_i}{Vol(S_i)},$$

where $S_i$, $i = 0,1,\dots,p$, is a partition of the range of $x$ with volume

$$Vol(S_i) \triangleq \int_{S_i} 1\, d\mu(x),$$

and $I(x\in S_i)$ is the indicator function of $S_i$. In this case, from (5), the information projection $g(x;m^*)$ is given by $m^*_i = P(X\in S_i \mid g(x))$. The step-function model is not an exponential family. However, because the $\alpha$-divergence is invariant with respect to contraction by a sufficient statistic, the divergence between two multinomial distributions (where $d\mu$ is the counting measure) equals the divergence between the corresponding step functions (where $d\mu$ is the continuous measure). Hence, the argument on the estimation risk can be deduced from that of the multinomial distribution model. It is notable that, if $X$ is originally a discrete random variable, the model always contains $g(x)$.

The asymptotic expansion of the estimation risk up to second order can be derived as follows (this corresponds to equation (41) of [19] with $\alpha = -1$, which investigates the asymptotic estimation risk for a well-specified model):

$$R[g(x;\theta^*)\,|\,g(x;\hat\theta)] = \frac{p}{2n} + \frac{1}{12n^2}(M-1) + O(n^{-3}), \qquad M \triangleq \sum_{i=0}^{p} m_i^{-1}, \qquad (31)$$

where $\theta = (m_1,\dots,m_p)$ is the free parameter of the true (projected) distribution. Because $M \ge (p+1)^2$, the second-order term is always positive. If some $m_i$ is close to zero, the convergence speed slows considerably. A numerical example of this model is given in the next section.

3 Criterion for Model Complexity and Sample Size

3.1 $p$–$n$ Criterion

In this section, the aim is to derive a simple criterion indicating whether the MLE is sufficiently close to the best distribution in the model (the information projection). To use (14) or (24), the distributional properties appearing there, which depend on the unknown $\theta^*$ and/or $g(x)$, must first be estimated. Next, a threshold $C$ against which the estimated risk is compared must be set.

If the estimated risk is not sufficiently small compared with $C$, there are two possible remedies: increasing $n$ or reducing $p$. As the risk convergence speed depends on the geometrical properties of $g(x)$ and $\mathcal{M}$, the author conjectures that reducing $p$ does not always reduce the risk. However, as the first-order term of the risk expansion is almost $p/(2n)$ when $g(x)$ is close to $g(x;\theta^*)$, reducing $p$ is likely to reduce the risk in many cases. Based on this observation, the criterion developed in this study is named the “$p$–$n$ criterion.”

First, the risk estimation is considered. To use (14), the following properties must be estimated:

$$G^* = (g^*_{ij}), \quad \tilde G = (\tilde g_{ij}), \quad G = (g_{ij}), \quad L_{ijk}, \quad L_{(ij)k}, \;\dots, \qquad 1\le i,j,k,l\le p.$$

Naive estimators of these properties (denoted by the “hat” mark: $\hat G$, $\hat L_{ijk}$, etc.) are obtained by replacing $\theta^*$ with the MLE $\hat\theta$, and $g(x)$ with the empirical distribution. For example,

$$(\hat G)_{ij} \triangleq n^{-1}\sum_{t=1}^{n}\frac{\partial}{\partial\theta_i}\log g(X_t;\theta)\Big|_{\theta=\hat\theta}\,\frac{\partial}{\partial\theta_j}\log g(X_t;\theta)\Big|_{\theta=\hat\theta}, \qquad (32)$$

$$(\hat{\tilde G})_{ij} \triangleq -n^{-1}\sum_{t=1}^{n}\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log g(X_t;\theta)\Big|_{\theta=\hat\theta}, \qquad (33)$$

$$(\hat G^*)_{ij} \triangleq \int g(x;\hat\theta)\Bigl(\frac{\partial}{\partial\theta_i}\log g(x;\theta)\Big|_{\theta=\hat\theta}\Bigr)\Bigl(\frac{\partial}{\partial\theta_j}\log g(x;\theta)\Big|_{\theta=\hat\theta}\Bigr)\,d\mu, \qquad (34)$$

$$\hat L_{ijk} \triangleq n^{-1}\sum_{t=1}^{n}\frac{\partial}{\partial\theta_i}\log g(X_t;\theta)\Big|_{\theta=\hat\theta}\,\frac{\partial}{\partial\theta_j}\log g(X_t;\theta)\Big|_{\theta=\hat\theta}\,\frac{\partial}{\partial\theta_k}\log g(X_t;\theta)\Big|_{\theta=\hat\theta}, \qquad (35)$$

$$\hat L_{(ij)k} \triangleq n^{-1}\sum_{t=1}^{n}\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log g(X_t;\theta)\Big|_{\theta=\hat\theta}\,\frac{\partial}{\partial\theta_k}\log g(X_t;\theta)\Big|_{\theta=\hat\theta}. \qquad (36)$$

The estimated risk can be obtained using these estimators; however, because of the complicated form of the asymptotic risk (14), the resulting $p$–$n$ criterion is difficult to handle. Here, only the criterion obtained from the first-order asymptotic risk in (14) is stated. For a given threshold $C$, the $p$–$n$ criterion is as follows.

Criterion for a general model

$$C \ge \frac{1}{2n}\,\mathrm{tr}\bigl(\hat{\tilde G}^{-1}\hat G\,\hat{\tilde G}^{-1}\hat G^*\bigr) \qquad (37)$$

For the exponential family, a simpler criterion can be derived. To use (24), the following properties must be estimated:

$$\tilde G = (\tilde g_{ij}), \quad G = (g_{ij}), \quad \kappa_{ijk}, \quad \kappa^*_{ijk}, \quad \kappa^*_{ijkl}, \qquad 1\le i,j,k,l\le p.$$

$\hat G$ is the sample covariance matrix of the $\xi_i$ terms, $\hat\Sigma$:

$$\hat G = \hat\Sigma, \qquad \hat g_{ij} = (\hat\Sigma)_{ij}, \qquad (\hat\Sigma)_{ij} \triangleq n^{-1}\sum_{t=1}^{n}(\xi_i(X_t)-\bar\xi_i)(\xi_j(X_t)-\bar\xi_j). \qquad (38)$$

Similarly, the estimator of the true third-order cumulant is given by the sample third-order cumulant:

$$\hat\kappa_{ijk} = n^{-1}\sum_{t=1}^{n}(\xi_i(X_t)-\bar\xi_i)(\xi_j(X_t)-\bar\xi_j)(\xi_k(X_t)-\bar\xi_k). \qquad (39)$$
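In matrix and tensor form, (38) and (39) are one-liners. A sketch, assuming the values $\xi_i(X_t)$ are available as an $n\times p$ array (function and variable names are ours):

```python
import numpy as np

def sample_moments(xi):
    """Estimators (38) and (39): the (1/n) sample covariance matrix and
    the sample third-order central moment tensor of the xi_i terms.
    xi has shape (n, p), row t holding (xi_1(X_t), ..., xi_p(X_t))."""
    n = len(xi)
    c = xi - xi.mean(axis=0)                              # centered xi values
    Sigma_hat = c.T @ c / n                               # (38)
    kappa_hat = np.einsum('ti,tj,tk->ijk', c, c, c) / n   # (39)
    return Sigma_hat, kappa_hat

# Check on skewed data: for Exponential(1), the variance is 1 and the
# third central moment is 2, which the estimators roughly recover.
rng = np.random.default_rng(1)
xi = rng.exponential(size=(5000, 2))
Sigma_hat, kappa_hat = sample_moments(xi)
print(np.round(np.diag(Sigma_hat), 2), round(float(kappa_hat[0, 0, 0]), 2))
```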

Further,

$$\hat{\tilde G} = \ddot\Psi(\hat\theta), \qquad \hat{\tilde g}_{ij} = (\ddot\Psi(\hat\theta))_{ij}, \qquad (40)$$

$$\hat\kappa^*_{ijk} = \frac{\partial^3}{\partial\theta_i\,\partial\theta_j\,\partial\theta_k}\Psi(\theta)\Big|_{\theta=\hat\theta}, \qquad (41)$$

$$\hat\kappa^*_{ijkl} = \frac{\partial^4}{\partial\theta_i\,\partial\theta_j\,\partial\theta_k\,\partial\theta_l}\Psi(\theta)\Big|_{\theta=\hat\theta}. \qquad (42)$$

Consequently, for an exponential family, the $p$–$n$ criterion is given as follows (summation over repeated indices is implied, and $\hat{\tilde g}^{ij} \triangleq ((\ddot\Psi(\hat\theta))^{-1})_{ij}$).

Criterion for an exponential family

$$\begin{aligned}
C \ge{}& \frac{1}{2n}\,\mathrm{tr}\bigl(\hat\Sigma\,(\ddot\Psi(\hat\theta))^{-1}\bigr) \\
&+ \frac{1}{24n^2}\Bigl[\,-8\,\hat{\tilde g}^{uk}\hat{\tilde g}^{ls}\hat{\tilde g}^{mt}\,\hat\kappa_{kst}\,\hat\kappa^*_{lmu} \\
&\qquad + 9\,\hat{\tilde g}^{ko}\hat{\tilde g}^{lu}\hat{\tilde g}^{sv}\hat{\tilde g}^{tw}\hat{\tilde g}^{hm}\,\hat\kappa^*_{lmo}\,\hat\kappa^*_{sth}\,\bigl(\hat g_{ku}\hat g_{vw} + \hat g_{kv}\hat g_{uw} + \hat g_{kw}\hat g_{uv}\bigr) \\
&\qquad - 3\,\hat{\tilde g}^{kw}\hat{\tilde g}^{ls}\hat{\tilde g}^{mu}\hat{\tilde g}^{tv}\,\hat\kappa^*_{lmtw}\,\bigl(\hat g_{ks}\hat g_{uv} + \hat g_{ku}\hat g_{sv} + \hat g_{kv}\hat g_{su}\bigr)\Bigr].
\end{aligned} \qquad (43)$$

Only naive estimators, such as the MLE or empirical moments, are used here. More sophisticated estimators, such as unbiased estimators, shrinkage estimators, or bootstrap estimators, could be used, but they are outside the scope of this paper.

Now, let us move to the second concern: the selection of $C$. Another often-used measure of the closeness between two distributions is the error rate, which is more intuitive than the divergence and suitable for setting a threshold. Let $g_i(x)$, $i = 1,2$, be p.d.f.s. If both are known, the discrimination rule is as follows.

For the sample $X$ from either $g_1(x)$ or $g_2(x)$,

$$\frac{g_{i_1}(X)}{g_{i_2}(X)} > 1 \;\Longleftrightarrow\; \text{judge that } X \text{ is generated from } g_{i_1}(x), \qquad \{i_1, i_2\} = \{1, 2\}.$$

The Bayes error rate $Er$, i.e., the probability that this rule gives an error, is formally defined by

$$Er[g_1(x)\,|\,g_2(x)] \triangleq \frac{1}{2}\int \min\bigl(g_1(x),\, g_2(x)\bigr)\, d\mu.$$

The next theorem states the relation between $Er$ and the K-L divergence.

Theorem 2.

If $D[g_1(x)\,|\,g_2(x)] \le \delta$, then

$$Er[g_1(x)\,|\,g_2(x)] \ge \min\{t \mid (x,t)\in A(\delta)\}, \qquad (44)$$

where

$$A(\delta) \triangleq \Bigl\{(x,t)\ \Big|\ x\log\Bigl(\frac{1-2t}{x}+1\Bigr) + (1-x)\log\Bigl(\frac{2t-1}{1-x}+1\Bigr) = -\delta,\ 0 < x < 2t < 1\Bigr\}.$$

Proof.

See 5.4 in the Appendix. ∎

Suppose that the standard of closeness between two distributions in view of the Bayes error rate is set to

$$Er \ge 1/2 - \alpha, \qquad (45)$$

where $\alpha$ is a certain small number, such as $\alpha = 0.05$ or $0.01$. From (44), if

$$\min\{t \mid (x,t)\in A(\delta)\} \ge 1/2 - \alpha, \qquad (46)$$

standard (45) is satisfied.

Analytical calculation of $\min\{t \mid (x,t)\in A(\delta)\}$ is difficult. An approximation when $t$ is close to $1/2$ is given here. As $\log(1+x) \approx x - x^2/2$ around $x = 0$,

$$\begin{aligned}
&x\log\Bigl(\frac{1-2t}{x}+1\Bigr) + (1-x)\log\Bigl(\frac{2t-1}{1-x}+1\Bigr) \\
&\quad\approx x\,\frac{1-2t}{x} - \frac{x}{2}\Bigl(\frac{1-2t}{x}\Bigr)^2 + (1-x)\,\frac{2t-1}{1-x} - \frac{1-x}{2}\Bigl(\frac{2t-1}{1-x}\Bigr)^2 \\
&\quad= -\frac{1}{2}\,\frac{(1-2t)^2}{x(1-x)}.
\end{aligned}$$

Therefore, $A(\delta)$ is approximated by

$$A^*(\delta) \triangleq \Bigl\{(x,t)\ \Big|\ t = \frac{1}{2}\bigl(1 - \sqrt{2\delta x(1-x)}\bigr),\ 0 < x < 2t < 1\Bigr\}.$$

Note that

$$\min\{t \mid (x,t)\in A^*(\delta)\} \ge \min_{0<x<1}\ \frac{1}{2}\bigl(1 - \sqrt{2\delta x(1-x)}\bigr) = \frac{1}{2} - \sqrt{\delta/8}.$$

Hence, the condition $\sqrt{\delta/8} \le \alpha$ or, equivalently, $\delta \le 8\alpha^2$ is approximately sufficient for (46). This result is stated as a corollary.

Corollary 2.

Let $\delta = D[g(x;\theta_1)\,|\,g(x;\theta_2)]$. If

$$\min\{t \mid (x,t)\in A(\delta)\} \ge 1/2 - \alpha, \qquad (47)$$

then

$$Er[g(x;\theta_1)\,|\,g(x;\theta_2)] \ge 1/2 - \alpha. \qquad (48)$$

Condition (47) is approximately equivalent to

$$\delta \le 8\alpha^2. \qquad (49)$$

Consequently, the $C$ in (37) or (43) corresponding to the error rate $1/2 - \alpha$ is given by the solution in $\delta$ (say, $C_\alpha$) of the equation

$$\min\{t \mid (x,t)\in A(\delta)\} = 1/2 - \alpha. \qquad (50)$$

More simply, $C_\alpha$ is given by

$$C_\alpha = 8\alpha^2. \qquad (51)$$

Thus, if $\alpha = 0.01\ (0.05)$, then $C_\alpha = 1/1250\ (1/50)$.
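These thresholds can be checked numerically through the approximate set $A^*(\delta)$: the minimum of $t = \frac{1}{2}(1-\sqrt{2\delta x(1-x)})$ over $x$ is attained at $x = 1/2$ and equals $1/2 - \sqrt{\delta/8}$. A short sketch (the helper name is ours):

```python
import math

def min_t_approx(delta, grid=100000):
    """Minimum of t over A*(delta), evaluated on a grid in x."""
    return min(0.5 * (1.0 - math.sqrt(2.0 * delta * (i / grid) * (1 - i / grid)))
               for i in range(1, grid))

# With delta = C_alpha = 8*alpha^2, the minimum equals 1/2 - alpha:
for alpha in (0.01, 0.05):
    assert abs(min_t_approx(8 * alpha**2) - (0.5 - alpha)) < 1e-9
print(8 * 0.01**2, 8 * 0.05**2)   # C_alpha for alpha = 0.01 and 0.05
```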

– Example 3 (continued) – For the quadratic exponential model of (29), the right-hand side of (43) is given by

$$\frac{1}{2n}\,\mathrm{tr}\bigl(Q^{-1}\hat\Sigma\bigr).$$

Taking an empirical approach and letting $Q$ be the sample covariance matrix $\hat\Sigma$, the right-hand side of (43) equals $p/(2n)$. With $C_\alpha$ as in (51), the $p$–$n$ criterion is approximately equivalent to

$$\frac{p}{n} \le 16\alpha^2. \qquad (52)$$

Note that this criterion does not guarantee that $n$ is sufficiently large for the $O(n^{-3})$ term in (24) to be negligible. This is a different concern, which is not addressed here.

– Example 4 (continued) –

For a multinomial distribution with the first-order approximation in (31), the $p$–$n$ criterion equals (52). The second-order approximation gives the following $p$–$n$ criterion:

$$96n^2\alpha^2 - 6np - (\hat M - 1) \ge 0, \qquad (53)$$

where

$$\hat M = \sum_{i=0}^{p}\hat m_i^{-1}$$

and $\hat m_i$ is the MLE, i.e., the sample relative frequency, for each $i$. Applying the criterion to the determination of $n$ gives the formula

$$n \ge \frac{3p + \sqrt{9p^2 + 96\alpha^2(\hat M - 1)}}{96\alpha^2}. \qquad (54)$$

In contrast, if the criterion is used for category determination, i.e., the “bin number” or “bin width” problem for a histogram, the formula is given by

$$6np + \hat M < 96n^2\alpha^2 + 1. \qquad (55)$$
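Formula (54) is straightforward to apply. The sketch below (function name ours) reproduces the abalone sample sizes reported in Section 3.3 ($p = 62$, $\hat M = 36128.33$):

```python
import math

def required_n(p, M_hat, alpha):
    """Smallest integer n satisfying (53), via formula (54):
    n >= (3p + sqrt(9 p^2 + 96 alpha^2 (M_hat - 1))) / (96 alpha^2)."""
    a = 96.0 * alpha**2
    return math.ceil((3 * p + math.sqrt(9 * p**2 + a * (M_hat - 1))) / a)

print(required_n(62, 36128.33, alpha=0.05))   # 1642
print(required_n(62, 36128.33, alpha=0.01))   # 38847
```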

Use of these criteria for practical examples is discussed in Section 3.3.

3.2 Algorithm for the $p$–$n$ Criterion of an Exponential Family

This section describes the calculation of the right-hand side of (43). If the function $\Psi(\theta)$ can be calculated analytically, the algorithm is simply the following.

Step 1

Calculate $\hat\eta_i = \bar\xi_i$, $i = 1,\dots,p$, from the sample.

Step 2

Solve the simultaneous equations (21) with respect to $\theta$ to give $\hat\theta = (\hat\theta_1,\dots,\hat\theta_p)$:

$$\hat\eta_i = \eta_i(\hat\theta) = \frac{\partial\Psi}{\partial\theta_i}(\hat\theta), \qquad i = 1,\dots,p. \qquad (56)$$

Step 3

Calculate (40), (41), and (42) from the derivatives of $\Psi(\theta)$ at $\hat\theta$.

Step 4

Calculate (38) and (39) from the sample.

Step 5

Calculate the right-hand side of (43) and compare it with $C_\alpha$.

Often, $\Psi(\theta)$ is not explicitly given, especially for a complex model. Then, $\hat\theta$ can be calculated iteratively using the Newton–Raphson method with the Jacobian matrix (40). Because $\ddot\Psi(\theta)$ is the variance–covariance matrix of the $\xi_i$ terms under the $g(x;\theta)$ distribution, its value can be approximated from a generated sample. The alternative steps are as follows.

Step 2’

Iteratively search for $\hat\theta$ with

$$\theta^{(n+1)} = \theta^{(n)} - \bigl(\eta(\theta^{(n)}) - \hat\eta\bigr)\bigl(\ddot\Psi(\theta^{(n)})\bigr)^{-1},$$

where $\eta(\theta^{(n)})$ and $\ddot\Psi(\theta^{(n)})$ are approximated by the sample mean and the sample covariance matrix of the $\xi_i$ terms from the $g(x;\theta^{(n)})$ distribution.

Further, (40), (41), and (42) can also be approximated using a generated sample.

Step 3’

Approximate (40), (41), and (42) using the sample moments and cumulants, where the sample is generated from $g(x;\hat\theta)$.

The point here is that $\Psi(\theta)$ is not required for sample generation in Steps 2’ and 3’ if methods such as MCMC (which require no normalizing constant) are used. Although Steps 2’ and 3’ are computationally heavy, they enable the construction of a complex model without the calculation of $\Psi$.
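Step 2’ can be sketched as follows (a Python illustration; the sampler interface `draw_xi` is a hypothetical stand-in for an MCMC routine, and a one-parameter Gaussian family is used only as a toy check, not as the paper's model):

```python
import numpy as np

def fit_theta_mc(xi_data, draw_xi, theta0, iters=50, mc_size=4000, seed=0):
    """Step 2': Monte-Carlo Newton iteration for the MLE theta_hat.
    draw_xi(theta, size, rng) returns xi values simulated from g(x; theta)
    (e.g. via MCMC; no normalizing constant is needed)."""
    rng = np.random.default_rng(seed)
    eta_hat = xi_data.mean(axis=0)            # Step 1: observed mean of xi
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        sim = draw_xi(theta, mc_size, rng)    # sample from g(x; theta^(n))
        eta = sim.mean(axis=0)                # approximates eta(theta^(n))
        c = sim - eta
        Psi_dd = c.T @ c / mc_size            # approximates Psi-double-dot
        theta = theta - np.linalg.solve(Psi_dd, eta - eta_hat)
    return theta

# Toy check: a 1-parameter Gaussian family with xi(x) = x and
# Psi(theta) = theta^2 / 2, so eta(theta) = theta and the MLE
# should coincide with the sample mean of xi.
draw = lambda th, size, rng: rng.normal(loc=th[0], scale=1.0, size=(size, 1))
xi_data = np.random.default_rng(2).normal(loc=1.3, size=(2000, 1))
theta_hat = fit_theta_mc(xi_data, draw, theta0=[0.0])
print(theta_hat, xi_data.mean())   # theta_hat is close to the sample mean
```

The update uses `np.linalg.solve` with the simulated covariance, which is the column-vector form of the row-vector update displayed above ($\ddot\Psi$ is symmetric).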

3.3 Example – $p$–$n$ Criterion Application –

This section demonstrates the use of the $p$–$n$ criterion through two practical examples under the exponential family model. Systematic selection of the $\xi_i$ terms is important for model construction, but it is not studied here; for further discussion of this issue, see the references given in the Introduction.

– Example 5: Red Wine – The first example is a well-known dataset on wine quality, taken from the U.C.I. Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/wine+quality).

Only the red wine data are used. The sample size is 1599, and the variables consist of 11 chemical substances (continuous variables) and a “quality” index (integers from 3 to 8). The vector of the chemical substances and the “quality” variable are denoted by $x^{(1)} = (x^{(1)}_1,\dots,x^{(1)}_{11})$ and $x^{(2)}$, respectively. We randomly divided the sample into two halves, one of which (“data_base”) was used for the model formulation, and the other (“data_est”) for the estimation of the parameter.

As the model formulation, we determined the following: the normalization method for the original data, the reference (probability) measure $d\mu(x)$, and the $\xi$ elements. Using “data_base”, we proceeded as follows. 1. Each variable $x^{(1)}_i$ ($i = 1,\dots,11$) is divided by twice its maximum so that its range is $[0,1)$. Further, 2 is subtracted from each “quality” index to give the range $\{1,2,\dots,6\}$. 2. As $d\mu(x)$, 11 independent Beta distributions are applied to $x^{(1)}$ so that their means and variances equal those of “data_base”. For $x^{(2)}$, a multinomial distribution is adopted, using each category’s sample relative frequency as the category probability parameter (say, $m_i$, $i = 1,\dots,6$). In addition, $x^{(1)}$ and $x^{(2)}$ are taken to be independent.

Consequently, $d\mu$ is selected as

$$x = (x^{(1)}, x^{(2)}), \qquad d\mu(x) = \prod_{i=1}^{11}\bigl(x^{(1)}_i\bigr)^{\beta_{1i}-1}\bigl(1 - x^{(1)}_i\bigr)^{\beta_{2i}-1}\,d(x^{(1)}) \times \prod_{i=1}^{6} m_i^{I(x^{(2)}=i)}\,d^*(x^{(2)}),$$

where $d(x^{(1)})$ is the Lebesgue measure on $[0,1]^{11}$, $d^*(x^{(2)})$ is the counting measure on $\{1,2,\dots,6\}$, and $I(\cdot)$ is the indicator function. Further, $\beta_{1i}$, $\beta_{2i}$, and $m_i$ satisfy the relations

$$\frac{\beta_{1i}}{\beta_{1i}+\beta_{2i}} = \text{sample mean of } x^{(1)}_i, \qquad i = 1,\dots,11,$$

$$\frac{\beta_{1i}\beta_{2i}}{(\beta_{1i}+\beta_{2i})^2(\beta_{1i}+\beta_{2i}+1)} = \text{sample variance of } x^{(1)}_i, \qquad i = 1,\dots,11,$$

$$m_i = \text{relative frequency of } i \text{ in } x^{(2)}.$$

3. The candidates for the $\xi_i$ terms are as follows:

$$\xi_1(x) = x^{(1)}_1 x^{(1)}_2, \quad \xi_2(x) = x^{(1)}_1 x^{(1)}_3, \;\dots,\; \xi_{10}(x) = x^{(1)}_1 x^{(1)}_{11},$$
$$\xi_{11}(x) = x^{(1)}_2 x^{(1)}_3, \;\dots,\; \xi_{19}(x) = x^{(1)}_2 x^{(1)}_{11},$$
$$\vdots$$
$$\xi_{55}(x) = x^{(1)}_{10} x^{(1)}_{11},$$

and

$$\xi_{56}(x) = x^{(1)}_1 x^{(2)}, \;\dots,\; \xi_{66}(x) = x^{(1)}_{11} x^{(2)}.$$

Since some of these terms are highly correlated, we eliminate one member of each pair with a correlation higher than 0.95. The following 19 $\xi_i$ terms were thereby removed from the full model:

$$\xi_i, \quad i = 8, 17, 19, 24, 25, 27, 32, 34, 38, 40, 43, 45, 46, 47, 49, 53, 58, 62, 64.$$

Consequently, an exponential family model with $p = 47$ is formulated. As the probability distribution $g(x;\theta)\,d\mu$ equals $d\mu$ when the $\theta$ terms all equal zero, it is denoted by $g(x;0)$. Note that the $g(x;\theta^*)$ of this model is the closest to $g(x;0)$ in the sense that

$$D[g(x;\theta^*)\,|\,g(x;0)] = \min_{h\in\mathcal{H}} D[h(x)\,|\,g(x;0)],$$

where $\mathcal{H}$ is the set of p.d.f.s $h(x)$ (w.r.t. $d\mu$) that satisfy

$$E_h[\xi_i(X)] \triangleq \int h(x)\,\xi_i(x)\,d\mu(x) = E[\xi_i(X)]$$

for each $\xi_i$ in the model. This is a consequence of the so-called “minimum relative entropy characterization” of an exponential family (see [10]).

Under the formulated exponential family model, the algorithm of the previous section was implemented, and the right-hand side of (43) was calculated using “data_est”, whose size $n$ equals 799. Because of the model complexity, the explicit form of $\Psi(\theta)$ could not be obtained; hence, the alternative Steps 2’ and 3’ were used. The R and RStan program codes are available on GitHub (https://github.com/YSheena/P-N_Criteria_Program.git). The first- and second-order terms and the total estimation risk in (43) were as follows: first-order term 2.95e-02; second-order term −1.30e-04; estimation risk 2.93e-02.

Note that the second-order term contributes little to the estimation risk; thus, the first-order approximation seems sufficient for this model and these data. Using (49) as an equation with $\delta =$ 2.93e-02, we have $\alpha \approx 0.06$. Hence, the Bayes error rate between $g(x;\hat\theta)$ and $g(x;\theta^*)$ is higher than 0.44. If we set the threshold at $\alpha = 0.05$, we must trim the model further. For example, if we eliminate one member of each $\xi$-element pair with a correlation higher than 0.9, then $p$ becomes as small as 37. For this model, the estimation risk is lower than the target value 0.02: first-order term 1.60e-02; second-order term 2.04e-04; estimation risk 1.62e-02.

As mentioned in the Introduction, the distribution $g(x;\theta^*)$, as the best approximation of $g(x)$, can be used for many purposes. As an example, the classification of each wine into a “quality” class ($x^{(2)}$) based on its chemical substances ($x^{(1)}$) is briefly shown. The algorithm is quite simple: a wine with $x^{(1)}$ is classified into the class at which $g((x^{(1)}, x^{(2)});\hat\theta)$, viewed as a function of $x^{(2)}$, attains its maximum. This approach is called a “generative” classifier in the field of machine learning, in contrast with a “discriminative” classifier such as a decision tree. (Note that, in the ordinary “generative” model approach, after the distributions of $x^{(1)}$ are learned for each $x^{(2)}$ class, the conditional distribution of $x^{(2)}$ given $x^{(1)}$ is calculated using Bayes’ theorem. In the above example, the joint distribution of $(x^{(1)}, x^{(2)})$ is obtained directly. This approach is useful when the sample size in some classes is so small that it is difficult to estimate the distribution of the explanatory variables within the class.)

Cross-validation was performed ten times, with 10% of the sample ($n = 160$) randomly chosen for the test each time. For each training set, the model formulation and the estimation were performed as above using the whole training set in common. The accuracy on the overall test data ($n = 1600$) was 58%. For comparison, a similar cross-validation of a naive decision-tree classifier (no bagging, no boosting) was also performed, yielding 63% accuracy (the “C50” R package was used with the default settings). Although the generative model had inferior accuracy (i.e., higher uncertainty in the “uncertain knowledge” of Rao’s equation), it could provide more reliable “knowledge of the amount of uncertainty” in the prediction. The accuracy difference between the training and test data illustrates this point. The model accuracy on the overall training data ($n = 14390$) was 56%; hence, the accuracy difference between the training and test data was 2% on average, whereas the accuracy of the decision-tree model on the overall training data was 91%, so that the average difference was 28%. The generative model obtained here exhibited no “overfitting” (rather, “underfitting”); hence, its accuracy could safely be reported before applying it to new data, or even before cross-validation.

– Example 6: Abalone Data –

The next example also features a well-known dataset, in this case the physical measurements of abalones (U.C.I. Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/Abalone). These data comprise eight properties (sex, length, diameter, etc.) of 4177 abalones. Here, only two discrete variables were considered: “sex” and “rings,” where “sex” has the three values “Female,” “Infant,” and “Male,” and “rings” has integer values from 1 to 29. The frequency of each group classified by “sex” and “rings” is given in Table 1. The original frequencies were aggregated at both ends: in the table, if a starred cell is located to the immediate left or right, the count is aggregated into the adjacent numbered cell. For example, the female abalones with 24 or more rings were aggregated into the frequency 4. The total number of cells is 63.

Table 1: Abalones by sex & rings

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| F | * | * | * | * | 4 | 16 | 44 | 122 | 238 | 248 | 200 | 128 | 88 |
| I | 1 | 1 | 12 | 51 | 100 | 216 | 267 | 274 | 173 | 92 | 62 | 21 | 24 |
| M | * | * | 3 | 6 | 11 | 27 | 80 | 172 | 278 | 294 | 225 | 118 | 91 |

| | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 ≤ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| F | 56 | 41 | 30 | 26 | 19 | 15 | 12 | 7 | 3 | 6 | 4 | * |
| I | 14 | 10 | 7 | 7 | 5 | 2 | 2 | 1 | * | * | * | * |
| M | 56 | 52 | 30 | 25 | 18 | 15 | 12 | 6 | 3 | 3 | 3 | * |

A multinomial distribution over the 63 cells was considered; hence, $p = 62$. From the sample relative frequency of each cell, $\hat m_i$, $i = 0,\dots,62$,

$$\hat M = \sum_{i=0}^{62}\hat m_i^{-1} = 36128.33.$$

The first-order risk and the estimated second-order risk in (31) were 0.0074 and 1.73e-04, respectively. Consequently, for $\alpha = 0.05$, the total estimation risk satisfied

$$C_\alpha = 0.02 > 0.0074 + 1.73\times 10^{-4} \approx 0.0076;$$

hence, the model satisfied the $p$–$n$ criterion. Use of the formula (54) for $n$ yielded

$$n \ge 1642.$$

However, setting $\alpha = 0.01$ increased the required $n$ to 38847, with the actual $n$ being far below this value.

4 Total Risk Decomposition and Model Comparison

This section treats the third problem mentioned in the Introduction, i.e., the distance between $g(x)$ and $g(x;\theta^*)$: $D[g(x)\,|\,g(x;\theta^*)]$.

For an exponential family model, the following “generalized Pythagorean theorem” holds:

$$D[g(x)\,|\,g(x;\hat\theta)] = D[g(x)\,|\,g(x;\theta^*)] + D[g(x;\theta^*)\,|\,g(x;\hat\theta)] \qquad (57)$$

(see, e.g., Lemma 3 of [6] and Theorem 3.8 of [4]). The convergence of the total distance $D[g(x)\,|\,g(x;\hat\theta)]$ as both $n$ and $p$ go to infinity has been studied in [17], [6], and [22] for the exponential family model.

In general, estimation of 𝐷 ​ [ 𝑔 ​ ( 𝑥 ) | 𝑔 ​ ( 𝑥 ; 𝜃 ∗ ) ] is difficult compared to estimation of 𝐷 ​ [ 𝑔 ​ ( 𝑥 ; 𝜃 ∗ ) | 𝑔 ​ ( 𝑥 ; 𝜃 ^ ) ] (or its expectation as treated in the previous sections), because it very subtly depends on 𝑔 ​ ( 𝑥 ) . For example, Theorem 1 of [6] indicates that the convergence rate in probability as 𝑝 goes to infinity depends on the square integrability of the higher-order derivative of log ⁡ 𝑔 ​ ( 𝑥 ) for an exponential family on [ 0 , 1 ] with basis functions such as polynomial, spline, and trigonometric functions.

Taking the expectation of both sides of (57) with respect to $g(x)$,

$$R[g(x)\,|\,g(x;\hat\theta)] = D[g(x)\,|\,g(x;\theta^*)] + R[g(x;\theta^*)\,|\,g(x;\hat\theta)], \qquad (58)$$

where $R[g(x)\,|\,g(x;\hat\theta)]$ (say, the “total risk”) is defined by

$$R[g(x)\,|\,g(x;\hat\theta)] \triangleq E\bigl[D[g(x)\,|\,g(x;\hat\theta)]\bigr]. \qquad (59)$$

Taking $D[g(x)\,|\,g(x;\theta^*)]$ as the “approximation risk” (although it is a constant and need not be averaged over $g(x)$), (58) states that the “total risk” is the sum of the “approximation risk” and the “estimation risk.” If a model satisfies the $p$–$n$ criterion, the estimation risk is relatively small; hence, the total risk is mostly determined by the approximation risk.

The approximation risk is decomposed as

$$D[g(x)\,|\,g(x;\theta^*)] = \int g(x)\log\bigl(g(x)/g(x;\theta^*)\bigr)\,d\mu = \int g(x)\log g(x)\,d\mu - \int g(x)\log g(x;\theta^*)\,d\mu.$$

Because the latter part (including the minus sign) is the cross entropy between $g(x)$ and $g(x;\theta^*)$, and it is determined by the model $\mathcal{M}$, let it be denoted by $Ce(\mathcal{M})$. Naturally, the following estimator is used:

$$\widehat{Ce(\mathcal{M})} \triangleq -\frac{1}{n}\sum_{t=1}^{n}\log g\bigl(X_t;\hat\theta(X)\bigr),$$

based on the sample $X = (X_1,\dots,X_n)$ from $g(x)$. The bias of this estimator up to the $n^{-1}$ order is evaluated as

$$E\bigl[\widehat{Ce(\mathcal{M})}\bigr] - Ce(\mathcal{M}) = -\frac{1}{2n}\,\mathrm{tr}\bigl(\tilde G^{-1}G\bigr) + o(n^{-1}). \qquad (60)$$

(For the proof, see Section 5.5 of the Appendix).

Using the bias-corrected estimator, the approximation risk is evaluated as

$$D[g(x)\,|\,g(x;\theta^*)] \approx \int g(x)\log g(x)\,d\mu - \frac{1}{n}\sum_{t=1}^{n}\log g\bigl(X_t;\hat\theta(X)\bigr) + \frac{1}{2n}\,\mathrm{tr}\bigl(\tilde G^{-1}G\bigr).$$

Combining this with the estimation risk (24), the estimated total risk to the $n^{-1}$ order is

$$\int g(x)\log g(x)\,d\mu - \frac{1}{n}\sum_{t=1}^{n}\log g\bigl(X_t;\hat\theta(X)\bigr) + \frac{1}{n}\,\mathrm{tr}\bigl(\tilde G^{-1}G\bigr).$$

As the first term on the right-hand side is common to all models, the second and third terms, or equivalently $2n$ times them,

$$-2\sum_{t=1}^{n}\log g\bigl(X_t;\hat\theta(X)\bigr) + 2\,\mathrm{tr}\bigl(\tilde G^{-1}G\bigr),$$

can be used as the criterion for the total-risk comparison between two exponential family models. If $G$ is replaced by a consistent estimator (e.g., $\hat\Sigma$, as in (38)), the TIC is obtained (see [24] and [16]). Needless to say, this criterion is equivalent to the AIC when the model includes $g(x)$, in which case $\tilde G = G$ and the penalty term equals $2p$. Recall that, to calculate the first term of the TIC, $\Psi(\hat\theta)$ must be calculated. Because it is difficult to obtain $\Psi(\theta)$ analytically for a complicated model, numerical alternatives are required.
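As an illustration, the penalty part of the criterion above, $2\,\mathrm{tr}(\tilde G^{-1}G)$, estimated TIC-style by $2\,\mathrm{tr}(\ddot\Psi(\hat\theta)^{-1}\hat\Sigma)$, can be computed directly. The sketch below (function and variable names ours) uses the quadratic model of Example 3, for which $\ddot\Psi = Q$; in the well-specified case the penalty is close to the AIC value $2p$:

```python
import numpy as np

def tic_penalty(xi, Psi_dd_hat):
    """TIC-style penalty 2 * tr(Psi_dd(theta_hat)^{-1} Sigma_hat),
    with Sigma_hat the sample covariance (38) of the xi terms (n x p)."""
    c = xi - xi.mean(axis=0)
    Sigma_hat = c.T @ c / len(xi)        # consistent estimator of G
    return 2.0 * np.trace(np.linalg.solve(Psi_dd_hat, Sigma_hat))

# Well-specified sanity check: xi ~ N(0, I_3) and a quadratic model with
# Q = I_3, so Psi_dd = Q and the penalty should be near 2p = 6 (the AIC).
rng = np.random.default_rng(3)
xi = rng.normal(size=(20000, 3))
penalty = tic_penalty(xi, np.eye(3))
print(round(penalty, 2))   # close to 6
```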

Finally, consider a comparison between two models. Suppose that, between the two models

$$\mathcal{M}_1 \triangleq \{g_1(x;\theta) \mid \theta\in\Theta\}, \qquad \mathcal{M}_2 \triangleq \{g_2(x;\tau) \mid \tau\in T\},$$

$\mathcal{M}_1$ is preferable in terms of an information criterion. This indicates only that $g_1(x;\hat\theta)$ is likely to be closer to $g(x)$ than $g_2(x;\hat\tau)$. However, $g_1(x;\theta^*)$ is not guaranteed to be preferable to $g_2(x;\tau^*)$; hence, $\mathcal{M}_1$ is not confirmed to be the better model. It is possible that $g_2(x;\hat\tau)$ is far from $g_2(x;\tau^*)$ while the approximation risk of $\mathcal{M}_2$ is smaller than that of $\mathcal{M}_1$.

To compare the two models $\mathcal{M}_1$ and $\mathcal{M}_2$, it is better to first determine whether the present $n$ is sufficiently large to satisfy the $p$–$n$ criterion for each model. If both models satisfy this criterion, their approximation risks can be compared based on $\widehat{Ce(\mathcal{M})}$ or its bias-corrected version. (As observed above, the bias equals the first-order term of the estimation risk; hence, satisfaction of the $p$–$n$ criterion indicates that the bias is somewhat negligible.) When $\mathcal{M}_1$ includes $\mathcal{M}_2$, the approximation risk of $\mathcal{M}_1$ is obviously smaller than that of $\mathcal{M}_2$.

References

[1] M. Arbel and A. Gretton. Kernel conditional exponential family. Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, PMLR 84: 1337–1346, 2018.
[2] S. Amari. Differential geometry of curved exponential families – curvature and information loss. Annals of Statistics, 10(2): 357–385, 1982.
[3] S. Amari. Information Geometry and Its Applications. Springer, 2016.
[4] S. Amari and H. Nagaoka. Methods of Information Geometry. Translations of Mathematical Monographs 191. American Mathematical Society, 2000.
[5] O. E. Barndorff-Nielsen. Information and Exponential Families in Statistical Theory. Wiley, 2014.
[6] A. R. Barron and C. Sheu. Approximation of density functions by sequences of exponential families. Annals of Statistics, 19(3): 1347–1369, 1991.
[7] L. D. Brown. Fundamentals of Statistical Exponential Families. IMS, 1986.
[8] S. Canu and A. Smola. Kernel methods and the exponential family. Neurocomputing, 69(7–9): 714–720, 2006.
[9] N. N. Cencov. Statistical Decision Rules and Optimal Inference. American Mathematical Society Translations, 1982.
[10] I. Csiszár. I-divergence geometry of probability distributions and minimization problems. Annals of Probability, 3: 146–158, 1975.
[11] I. Csiszár. Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Annals of Statistics, 19(4): 2032–2066, 1991.
[12] B. Efron. Defining the curvature of a statistical problem (with applications to second-order efficiency). Annals of Statistics, 3(6): 1189–1242, 1975.
[13] B. Efron and R. Tibshirani. Using specially designed exponential families for density estimation. Annals of Statistics, 24(6): 2431–2461, 1996.
[14] K. Fukumizu. Exponential manifold by reproducing kernel Hilbert spaces. In Algebraic and Geometric Methods in Statistics, ed. by Paolo Gibilisco et al. Cambridge University Press, 2010.
[15] P. Hall. On Kullback–Leibler loss and density estimation. Annals of Statistics, 15: 1491–1519, 1987.
[16] S. Konishi and G. Kitagawa. Generalised information criteria in model selection. Biometrika, 83(4): 875–890, 1996.
[17] S. Portnoy. Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. Annals of Statistics, 16(1): 356–366, 1988.
[18] C. R. Rao. Statistics and Truth: Putting Chance to Work. World Scientific, 1997.
[19] Y. Sheena. Asymptotic expansion of the risk of maximum likelihood estimator with respect to α-divergence as a measure of the difficulty of specifying a parametric model. Communications in Statistics – Theory and Methods, 47(16): 4059–4087, 2018.
[20] Y. Sheena. The convergence speed of MLE to the information projection of an exponential family – a criteria for the model dimension and the sample size – with complete proof. arXiv, XXX, 2021.
[21] B. Sriperumbudur, K. Fukumizu, A. Gretton, A. Hyvärinen, and R. Kumar. Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research, 18: 1–59, 2017.
[22] C. J. Stone. Large-sample inference for log-spline models. Annals of Statistics, 18(2): 717–741, 1990.
[23] R. Sundberg. Statistical Modeling for Exponential Families. Cambridge University Press, 2019.
[24] K. Takeuchi. Distribution of information statistics and criteria for adequacy of models. Mathematical Science, 153: 12–18, 1976 (in Japanese).
[25] I. Vajda. Theory of Statistical Inference and Information. Kluwer Academic Publishers, 1989.
[26] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
[27] M. J. Wainwright and M. I. Jordan. Graphical Models, Exponential Families, and Variational Inference. now Publishers, 2008.
[28] W. H. Wong and T. A. Severini. On maximum likelihood estimation in infinite dimensional parameter spaces. Annals of Statistics, 19(2): 603–632, 1991.

5 Appendix

5.1 The proof of (14)

We denote an i.i.d. sample from $g(x)$ by $\boldsymbol{X} \triangleq (X_1,\ldots,X_n)$. For $1 \le i \le p$, let

$$
\bar e_i(\boldsymbol{X};\theta) \triangleq \frac{1}{n}\sum_{a=1}^{n}\frac{\partial}{\partial\theta^i}\log f(X_a;\theta),
\qquad
\bar e^{\,i}(\boldsymbol{X};\theta) \triangleq \sum_{j=1}^{p}\tilde g^{ij}(\theta)\,\bar e_j(\boldsymbol{X};\theta),
$$

where

$$
\tilde g^{ij}(\theta) \triangleq \bigl(\tilde G(\theta)^{-1}\bigr)^{ij}.
$$

Since the MLE $\hat\theta$ maximizes the log-likelihood $\sum_{a=1}^{n}\log f(x_a;\theta)$, we have

$$
\bar e_i(\boldsymbol{X};\hat\theta)=0,\qquad i=1,\ldots,p.
\tag{61}
$$
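As a concrete numerical illustration of (61) — a minimal sketch, not from the paper — consider fitting the one-parameter exponential model $f(x;\theta)=\theta e^{-\theta x}$ to data drawn from a Gamma$(2,1)$ truth $g$, so the model is misspecified as in the paper's setting. The averaged score $\bar e_1$ vanishes exactly at the MLE $\hat\theta=1/\bar x$:

```python
import numpy as np

# Illustrative sketch (assumed setup, not from the paper):
# model f(x; theta) = theta * exp(-theta * x); true density g = Gamma(shape=2, scale=1).
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.0, size=10_000)

def e_bar(theta, x):
    # \bar{e}_1(X; theta) = (1/n) sum_a d/dtheta log f(X_a; theta) = 1/theta - mean(x)
    return np.mean(1.0 / theta - x)

theta_hat = 1.0 / x.mean()   # closed-form MLE of the exponential model
print(e_bar(theta_hat, x))   # zero up to floating-point rounding, as (61) states
```

Here $\hat\theta$ converges not to a "true" parameter (none exists) but to the information projection $\theta^*=1/E_g[X]=0.5$, the minimizer of the K-L divergence from $g$ to the model.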

The Taylor expansion of $\bar e_i(\boldsymbol{X};\hat\theta)$ around $\theta^*$ is given by

$$
\begin{aligned}
\bar e_i(\boldsymbol{X};\hat\theta)
&= \bar e_i(\boldsymbol{X};\theta^*)
+\sum_{j}\left.\frac{\partial \bar e_i(\boldsymbol{X};\theta)}{\partial\theta^j}\right|_{\theta=\theta^*}(\hat\theta^j-\theta^{*j})\\
&\quad+\frac12\sum_{j,k}\left.\frac{\partial^2 \bar e_i(\boldsymbol{X};\theta)}{\partial\theta^j\,\partial\theta^k}\right|_{\theta=\theta^*}(\hat\theta^j-\theta^{*j})(\hat\theta^k-\theta^{*k})\\
&\quad+\frac{1}{3!}\sum_{j,k,l}\left.\frac{\partial^3 \bar e_i(\boldsymbol{X};\theta)}{\partial\theta^j\,\partial\theta^k\,\partial\theta^l}\right|_{\theta=\theta^*}(\hat\theta^j-\theta^{*j})(\hat\theta^k-\theta^{*k})(\hat\theta^l-\theta^{*l})\\
&\quad+\frac{1}{4!}\sum_{j,k,l,m}\left.\frac{\partial^4 \bar e_i(\boldsymbol{X};\theta)}{\partial\theta^j\,\partial\theta^k\,\partial\theta^l\,\partial\theta^m}\right|_{\theta=\tilde\theta_{ijklm}}(\hat\theta^j-\theta^{*j})(\hat\theta^k-\theta^{*k})(\hat\theta^l-\theta^{*l})(\hat\theta^m-\theta^{*m}),
\end{aligned}
$$

where $\tilde\theta_{ijklm}$ lies on the segment from $\theta^*$ to $\hat\theta$. If we add $\sum_j \tilde g_{ij}(\theta^*)(\hat\theta^j-\theta^{*j})$ to both sides of the above expansion and use (61), then we have

$$
\begin{aligned}
\sum_j \tilde g_{ij}(\theta^*)(\hat\theta^j-\theta^{*j})
&= \bar e_i(\boldsymbol{X};\theta^*)
+\sum_j\Bigl(\left.\frac{\partial \bar e_i(\boldsymbol{X};\theta)}{\partial\theta^j}\right|_{\theta=\theta^*}+\tilde g_{ij}(\theta^*)\Bigr)(\hat\theta^j-\theta^{*j})\\
&\quad+\frac12\sum_{j,k}\left.\frac{\partial^2 \bar e_i(\boldsymbol{X};\theta)}{\partial\theta^j\,\partial\theta^k}\right|_{\theta=\theta^*}(\hat\theta^j-\theta^{*j})(\hat\theta^k-\theta^{*k})\\
&\quad+\frac{1}{3!}\sum_{j,k,l}\left.\frac{\partial^3 \bar e_i(\boldsymbol{X};\theta)}{\partial\theta^j\,\partial\theta^k\,\partial\theta^l}\right|_{\theta=\theta^*}(\hat\theta^j-\theta^{*j})(\hat\theta^k-\theta^{*k})(\hat\theta^l-\theta^{*l})\\
&\quad+\frac{1}{4!}\sum_{j,k,l,m}\frac{\partial^4 \bar e_i(\boldsymbol{X};\tilde\theta_{ijklm})}{\partial\theta^j\,\partial\theta^k\,\partial\theta^l\,\partial\theta^m}(\hat\theta^j-\theta^{*j})(\hat\theta^k-\theta^{*k})(\hat\theta^l-\theta^{*l})(\hat\theta^m-\theta^{*m}).
\end{aligned}
$$

Furthermore, if we multiply both sides by $\tilde g^{is}(\theta^*)$ and sum over $i$ from $1$ to $p$, then we have

$$
\bar\theta^s
= \bar e^{\,s}
+\sum_j A^s_j\,\bar\theta^j
+\sum_{j,k} B^s_{jk}\,\bar\theta^j\bar\theta^k
+\sum_{j,k} \bar B^s_{jk}\,\bar\theta^j\bar\theta^k
+\sum_{j,k,l} C^s_{jkl}\,\bar\theta^j\bar\theta^k\bar\theta^l
+\sum_{j,k,l} \bar C^s_{jkl}\,\bar\theta^j\bar\theta^k\bar\theta^l
+\sum_{j,k,l,m} D^s_{jklm}\,\bar\theta^j\bar\theta^k\bar\theta^l\bar\theta^m,
\tag{62}
$$

where we used the following notations: for $1\le j,k,l,m,s\le p$,

$$
\begin{aligned}
\bar\theta^s &\triangleq \hat\theta^s-\theta^{*s},\\
A^s_j &\triangleq \sum_{i=1}^{p}\tilde g^{is}(\theta^*)\Bigl(\frac{\partial \bar e_i(\boldsymbol{X};\theta^*)}{\partial\theta^j}+\tilde g_{ij}(\theta^*)\Bigr),\\
B^s_{jk} &\triangleq \frac12\sum_{i=1}^{p}\tilde g^{is}(\theta^*)\Bigl(\frac{\partial^2 \bar e_i(\boldsymbol{X};\theta^*)}{\partial\theta^j\,\partial\theta^k}-E\Bigl[\frac{\partial^2 \bar e_i(\boldsymbol{X};\theta^*)}{\partial\theta^j\,\partial\theta^k}\Bigr]\Bigr),\\
\bar B^s_{jk} &\triangleq \frac12\sum_{i=1}^{p}\tilde g^{is}(\theta^*)\,E\Bigl[\frac{\partial^2 \bar e_i(\boldsymbol{X};\theta^*)}{\partial\theta^j\,\partial\theta^k}\Bigr]
= \frac12\sum_{i=1}^{p}\tilde g^{is}(\theta^*)\,L_{(ijk)},\\
C^s_{jkl} &\triangleq \frac{1}{3!}\sum_{i}\tilde g^{is}(\theta^*)\Bigl(\frac{\partial^3 \bar e_i(\boldsymbol{X};\theta^*)}{\partial\theta^j\,\partial\theta^k\,\partial\theta^l}-E\Bigl[\frac{\partial^3 \bar e_i(\boldsymbol{X};\theta^*)}{\partial\theta^j\,\partial\theta^k\,\partial\theta^l}\Bigr]\Bigr),\\
\bar C^s_{jkl} &\triangleq \frac{1}{3!}\sum_{i=1}^{p}\tilde g^{is}(\theta^*)\,E\Bigl[\frac{\partial^3 \bar e_i(\boldsymbol{X};\theta^*)}{\partial\theta^j\,\partial\theta^k\,\partial\theta^l}\Bigr]
= \frac{1}{3!}\sum_{i=1}^{p}\tilde g^{is}(\theta^*)\,L_{(ijkl)},\\
D^s_{jklm} &\triangleq \frac{1}{4!}\sum_{i}\tilde g^{is}(\theta^*)\,\frac{\partial^4 \bar e_i(\boldsymbol{X};\tilde\theta_{ijklm})}{\partial\theta^j\,\partial\theta^k\,\partial\theta^l\,\partial\theta^m}.
\end{aligned}
$$

If we insert the right-hand side of (62) into $\bar\theta^j,\bar\theta^k,\bar\theta^l,\bar\theta^m$ of itself, that is, if we replace each $\bar\theta^j$ on the right-hand side of (62) by

$$
\bar e^{\,j}+\sum_i A^j_i\bar\theta^i+\sum_{i,k}B^j_{ik}\bar\theta^i\bar\theta^k+\sum_{i,k}\bar B^j_{ik}\bar\theta^i\bar\theta^k
+\sum_{i,k,l}C^j_{ikl}\bar\theta^i\bar\theta^k\bar\theta^l+\sum_{i,k,l}\bar C^j_{ikl}\bar\theta^i\bar\theta^k\bar\theta^l
+\sum_{i,k,l,m}D^j_{iklm}\bar\theta^i\bar\theta^k\bar\theta^l\bar\theta^m
$$

(and similarly for $\bar\theta^k$, $\bar\theta^l$, $\bar\theta^m$, with the dummy indices relabeled so as not to clash), we have the following equation, in which each "$\cdots$" stands for the remaining terms of the fully displayed factor above:

$$
\begin{aligned}
\bar\theta^s
&=\bar e^{\,s}
+\sum_j A^s_j\Bigl(\bar e^{\,j}+\sum_i A^j_i\bar\theta^i+\cdots\Bigr)
+\sum_{j,k}\bigl(B^s_{jk}+\bar B^s_{jk}\bigr)\Bigl(\bar e^{\,j}+\sum_i A^j_i\bar\theta^i+\cdots\Bigr)\Bigl(\bar e^{\,k}+\sum_i A^k_i\bar\theta^i+\cdots\Bigr)\\
&\quad+\sum_{j,k,l}\bigl(C^s_{jkl}+\bar C^s_{jkl}\bigr)\Bigl(\bar e^{\,j}+\cdots\Bigr)\Bigl(\bar e^{\,k}+\cdots\Bigr)\Bigl(\bar e^{\,l}+\cdots\Bigr)
+\sum_{j,k,l,m}D^s_{jklm}\Bigl(\bar e^{\,j}+\cdots\Bigr)\Bigl(\bar e^{\,k}+\cdots\Bigr)\Bigl(\bar e^{\,l}+\cdots\Bigr)\Bigl(\bar e^{\,m}+\cdots\Bigr).
\end{aligned}
\tag{63}
$$

Expanding this equation and counting the order of each term, we can rewrite (63) as

$$
\begin{aligned}
\bar\theta^s
&= \bar e^{\,s}
+\sum_j A^s_j\Bigl(\bar e^{\,j}+\sum_i A^j_i\bar\theta^i+\sum_{i,k}\bar B^j_{ik}\bar\theta^i\bar\theta^k\Bigr)\\
&\quad+\sum_{j,k}\bigl(B^s_{jk}+\bar B^s_{jk}\bigr)
\Bigl(\bar e^{\,j}+\sum_i A^j_i\bar\theta^i+\sum_{i,k}\bar B^j_{ik}\bar\theta^i\bar\theta^k\Bigr)
\Bigl(\bar e^{\,k}+\sum_i A^k_i\bar\theta^i+\sum_{i,j}\bar B^k_{ij}\bar\theta^i\bar\theta^j\Bigr)\\
&\quad+\sum_{j,k,l}\bigl(C^s_{jkl}+\bar C^s_{jkl}\bigr)\bar e^{\,j}\bar e^{\,k}\bar e^{\,l}
+Re_1,
\end{aligned}
$$

where $Re_1$ is a polynomial in the variables $\bar\theta^s,\bar e^{\,s},A^s_j,B^s_{jk},C^s_{jkl},D^s_{jklm}$ $(1\le j,k,l,m,s\le p)$, each term of which is of at least fourth order with respect to $\bar\theta^s,\bar e^{\,s},A^s_j,B^s_{jk},C^s_{jkl}$ $(1\le j,k,l,s\le p)$. If we insert this result into the right-hand side of itself, then we obtain

$$
\begin{aligned}
\bar\theta^s
&= \bar e^{\,s}+\sum_j A^s_j\bar e^{\,j}+\sum_{i,j}A^s_jA^j_i\bar e^{\,i}+\sum_{i,j,k}A^s_j\bar B^j_{ik}\bar e^{\,i}\bar e^{\,k}\\
&\quad+\sum_{j,k}\bigl(B^s_{jk}+\bar B^s_{jk}\bigr)
\Bigl(\bar e^{\,j}+\sum_i A^j_i\bar e^{\,i}+\sum_{i,l}\bar B^j_{il}\bar e^{\,i}\bar e^{\,l}\Bigr)
\Bigl(\bar e^{\,k}+\sum_i A^k_i\bar e^{\,i}+\sum_{i,l}\bar B^k_{il}\bar e^{\,i}\bar e^{\,l}\Bigr)\\
&\quad+\sum_{j,k,l}\bigl(C^s_{jkl}+\bar C^s_{jkl}\bigr)\bar e^{\,j}\bar e^{\,k}\bar e^{\,l}+Re_2\\
&= \bar e^{\,s}+\sum_j A^s_j\bar e^{\,j}+\sum_{j,k}\bar B^s_{jk}\bar e^{\,j}\bar e^{\,k}
+\sum_{i,j}A^s_jA^j_i\bar e^{\,i}+\sum_{i,j,k}A^s_j\bar B^j_{ik}\bar e^{\,i}\bar e^{\,k}
+\sum_{j,k}B^s_{jk}\bar e^{\,j}\bar e^{\,k}\\
&\quad+2\sum_{i,j,k}\bar B^s_{jk}A^k_i\bar e^{\,i}\bar e^{\,j}
+2\sum_{i,j,k,l}\bar B^s_{jk}\bar B^k_{il}\bar e^{\,i}\bar e^{\,j}\bar e^{\,l}
+\sum_{j,k,l}\bar C^s_{jkl}\bar e^{\,j}\bar e^{\,k}\bar e^{\,l}
+Re_3,
\end{aligned}
\tag{64}
$$

where $Re_2$ and $Re_3$ have the same property as $Re_1$.

Here we impose the following moment conditions: all suitably high-order joint moments of the variables

$$
\sqrt{n}\,\bar\theta^s,\quad \sqrt{n}\,\bar e^{\,s},\quad \sqrt{n}\,A^s_j,\quad \sqrt{n}\,B^s_{jk},\quad \sqrt{n}\,C^s_{jkl},\quad D^s_{jklm},
\tag{65}
$$

where $1\le j,k,l,m,s\le p$, are bounded with respect to $n$. Then the following results on the expectations hold. In the calculations we use the Einstein summation convention for brevity. First we consider $E[\bar\theta^i\bar\theta^j]$. From (64), we have

$$
\begin{aligned}
\bar\theta^i\bar\theta^j
&=\Bigl(\bar e^{\,i}+A^i_l\bar e^{\,l}+\bar B^i_{lm}\bar e^{\,l}\bar e^{\,m}+A^i_lA^l_m\bar e^{\,m}+A^i_l\bar B^l_{ms}\bar e^{\,m}\bar e^{\,s}+B^i_{lm}\bar e^{\,l}\bar e^{\,m}
+2\bar B^i_{lm}A^m_s\bar e^{\,l}\bar e^{\,s}+2\bar B^i_{lm}\bar B^m_{st}\bar e^{\,l}\bar e^{\,s}\bar e^{\,t}+\bar C^i_{lmt}\bar e^{\,l}\bar e^{\,m}\bar e^{\,t}+Re_3\Bigr)\\
&\quad\times\Bigl(\bar e^{\,j}+A^j_l\bar e^{\,l}+\bar B^j_{lm}\bar e^{\,l}\bar e^{\,m}+A^j_lA^l_m\bar e^{\,m}+A^j_l\bar B^l_{ms}\bar e^{\,m}\bar e^{\,s}+B^j_{lm}\bar e^{\,l}\bar e^{\,m}
+2\bar B^j_{lm}A^m_s\bar e^{\,l}\bar e^{\,s}+2\bar B^j_{lm}\bar B^m_{st}\bar e^{\,l}\bar e^{\,s}\bar e^{\,t}+\bar C^j_{lmt}\bar e^{\,l}\bar e^{\,m}\bar e^{\,t}+Re_3\Bigr)\\
&=\bar e^{\,i}\bar e^{\,j}+A^j_l\bar e^{\,i}\bar e^{\,l}+A^i_l\bar e^{\,j}\bar e^{\,l}+\bar B^j_{lm}\bar e^{\,i}\bar e^{\,l}\bar e^{\,m}+\bar B^i_{lm}\bar e^{\,j}\bar e^{\,l}\bar e^{\,m}
+A^j_lA^l_m\bar e^{\,i}\bar e^{\,m}+A^j_l\bar B^l_{ms}\bar e^{\,i}\bar e^{\,m}\bar e^{\,s}+B^j_{lm}\bar e^{\,i}\bar e^{\,l}\bar e^{\,m}\\
&\quad+2\bar B^j_{lm}A^m_s\bar e^{\,i}\bar e^{\,l}\bar e^{\,s}+2\bar B^j_{lm}\bar B^m_{st}\bar e^{\,i}\bar e^{\,l}\bar e^{\,s}\bar e^{\,t}+\bar C^j_{lmt}\bar e^{\,i}\bar e^{\,l}\bar e^{\,m}\bar e^{\,t}
+A^i_lA^l_m\bar e^{\,j}\bar e^{\,m}+A^i_l\bar B^l_{ms}\bar e^{\,j}\bar e^{\,m}\bar e^{\,s}+B^i_{lm}\bar e^{\,j}\bar e^{\,l}\bar e^{\,m}\\
&\quad+2\bar B^i_{lm}A^m_s\bar e^{\,j}\bar e^{\,l}\bar e^{\,s}+2\bar B^i_{lm}\bar B^m_{st}\bar e^{\,j}\bar e^{\,l}\bar e^{\,s}\bar e^{\,t}+\bar C^i_{lmt}\bar e^{\,j}\bar e^{\,l}\bar e^{\,m}\bar e^{\,t}
+A^i_lA^j_m\bar e^{\,l}\bar e^{\,m}+A^i_l\bar B^j_{st}\bar e^{\,l}\bar e^{\,s}\bar e^{\,t}+A^j_l\bar B^i_{st}\bar e^{\,l}\bar e^{\,s}\bar e^{\,t}\\
&\quad+\bar B^i_{lm}\bar B^j_{st}\bar e^{\,l}\bar e^{\,m}\bar e^{\,s}\bar e^{\,t}+Re_4,
\end{aligned}
\tag{66}
$$

where $Re_4$ is a polynomial in the variables $\bar\theta^s,\bar e^{\,s},A^s_j,B^s_{jk},C^s_{jkl},D^s_{jklm}$ $(1\le j,k,l,m,s\le p)$, each term of which is of at least fifth order with respect to $\bar\theta^s,\bar e^{\,s},A^s_j,B^s_{jk},C^s_{jkl}$ $(1\le j,k,l,s\le p)$. We calculate the expectation of each term on the right-hand side of this equation. Hereafter $\tilde g^{ij}(\theta^*)$, $g_{ij}(\theta^*)$ $(1\le i,j\le p)$ are abbreviated as $\tilde g^{ij}$, $g_{ij}$ (and similarly for $\tilde g_{ij}$). Note that

$$
E\Bigl[\frac{\partial}{\partial\theta^i}\log f(X_t;\theta^*)\Bigr]=0,\qquad i=1,\ldots,p,\quad t=1,\ldots,n.
\tag{67}
$$

$$
\begin{aligned}
E[\bar e^{\,i}\bar e^{\,j}]
&=n^{-2}E\Bigl[\Bigl(\sum_{a=1}^{n}\tilde g^{il}\frac{\partial}{\partial\theta^l}\log f(X_a;\theta^*)\Bigr)\Bigl(\sum_{b=1}^{n}\tilde g^{jm}\frac{\partial}{\partial\theta^m}\log f(X_b;\theta^*)\Bigr)\Bigr]\\
&=n^{-1}E\Bigl[\tilde g^{il}\tilde g^{jm}\frac{\partial}{\partial\theta^l}\log f(X;\theta^*)\,\frac{\partial}{\partial\theta^m}\log f(X;\theta^*)\Bigr]
+n^{-2}\sum_{a\ne b}\tilde g^{il}E\Bigl[\frac{\partial}{\partial\theta^l}\log f(X_a;\theta^*)\Bigr]\tilde g^{jm}E\Bigl[\frac{\partial}{\partial\theta^m}\log f(X_b;\theta^*)\Bigr]\\
&=n^{-1}\tilde g^{il}\tilde g^{jm}g_{lm}.
\end{aligned}
\tag{68}
$$
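Equation (68) can be checked by simulation in the same one-dimensional exponential-model example (an illustrative sketch under an assumed Gamma$(2,1)$ truth, not from the paper): with $p=1$ and $\theta^*=0.5$, we have $\tilde g_{11}=1/\theta^{*2}=4$ and $g_{11}=\mathrm{Var}_g(X)=2$, so (68) predicts $E[(\bar e^{\,1})^2]\approx n^{-1}\tilde g^{11}\tilde g^{11}g_{11}=0.125/n$:

```python
import numpy as np

# Sketch (assumed setup): model f(x; theta) = theta * exp(-theta * x),
# truth g = Gamma(2, 1), information projection theta* = 0.5.
rng = np.random.default_rng(1)
n, reps = 200, 20_000
theta_star = 0.5
g_tilde_11 = 1.0 / theta_star**2      # -E_g[d^2 log f / dtheta^2] = 1/theta*^2 = 4
g_11 = 2.0                            # E_g[(d log f / dtheta)^2] = Var_g(X)

x = rng.gamma(2.0, 1.0, size=(reps, n))
e_lower = np.mean(1.0 / theta_star - x, axis=1)   # \bar{e}_1 per replication
e_upper = e_lower / g_tilde_11                    # \bar{e}^1 = g~^{11} \bar{e}_1

lhs = np.mean(e_upper**2)                         # Monte Carlo E[(\bar{e}^1)^2]
rhs = (1.0 / n) * (1.0 / g_tilde_11)**2 * g_11    # n^{-1} g~^{11} g~^{11} g_{11}
print(lhs, rhs)                                   # both approximately 6.25e-4
```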

$$
\begin{aligned}
E[A^j_l\,\bar e^{\,i}\bar e^{\,l}]
&=E\Bigl[n^{-1}\tilde g^{sj}\sum_{c=1}^{n}\Bigl(\frac{\partial^2}{\partial\theta^s\,\partial\theta^l}\log f(X_c;\theta^*)+\tilde g_{ls}\Bigr)\\
&\qquad\times n^{-2}\,\tilde g^{it}\Bigl(\sum_{a=1}^{n}\frac{\partial}{\partial\theta^t}\log f(X_a;\theta^*)\Bigr)\tilde g^{lm}\Bigl(\sum_{b=1}^{n}\frac{\partial}{\partial\theta^m}\log f(X_b;\theta^*)\Bigr)\Bigr]\\
&=n^{-2}\,\tilde g^{sj}\tilde g^{it}\tilde g^{lm}\Bigl\{E\Bigl[\Bigl(\frac{\partial^2}{\partial\theta^s\,\partial\theta^l}\log f(X;\theta^*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^t}\log f(X;\theta^*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^m}\log f(X;\theta^*)\Bigr)\Bigr]+\tilde g_{ls}\,g_{tm}\Bigr\}\\
&=n^{-2}\,\tilde g^{sj}\tilde g^{it}\tilde g^{lm}\bigl(L_{(sl)tm}+\tilde g_{ls}\,g_{tm}\bigr),
\end{aligned}
\tag{69}
$$

since, by (67) and $E[\partial^2\log f(X;\theta^*)/\partial\theta^s\partial\theta^l]=-\tilde g_{ls}$, every term in which the observation indices are not all equal has vanishing expectation; the same bookkeeping applies to (70)–(79) below, where in the four-index cases the observations may also pair up two by two.

$$
\begin{aligned}
E[\bar B^j_{lm}\,\bar e^{\,i}\bar e^{\,l}\bar e^{\,m}]
&=\bar B^j_{lm}\,E[\bar e^{\,i}\bar e^{\,l}\bar e^{\,m}]\\
&=n^{-3}\,\bar B^j_{lm}\,\tilde g^{ik}\tilde g^{ls}\tilde g^{mt}\,
E\Bigl[\Bigl(\sum_{a=1}^{n}\frac{\partial}{\partial\theta^k}\log f(X_a;\theta^*)\Bigr)\Bigl(\sum_{b=1}^{n}\frac{\partial}{\partial\theta^s}\log f(X_b;\theta^*)\Bigr)\Bigl(\sum_{c=1}^{n}\frac{\partial}{\partial\theta^t}\log f(X_c;\theta^*)\Bigr)\Bigr]\\
&=n^{-2}\,\bar B^j_{lm}\,\tilde g^{ik}\tilde g^{ls}\tilde g^{mt}\,
E\Bigl[\Bigl(\frac{\partial}{\partial\theta^k}\log f(X;\theta^*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^s}\log f(X;\theta^*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^t}\log f(X;\theta^*)\Bigr)\Bigr]\\
&=n^{-2}\,\bar B^j_{lm}\,\tilde g^{ik}\tilde g^{ls}\tilde g^{mt}\,L_{kst}.
\end{aligned}
\tag{70}
$$
$$
\begin{aligned}
E[A^j_lA^l_m\,\bar e^{\,i}\bar e^{\,m}]
&=n^{-4}\,\tilde g^{jk}\tilde g^{lu}\tilde g^{is}\tilde g^{mt}\,
E\Bigl[\Bigl(\sum_{a=1}^{n}\Bigl(\frac{\partial^2}{\partial\theta^k\,\partial\theta^l}\log f(X_a;\theta^*)+\tilde g_{kl}\Bigr)\Bigr)
\Bigl(\sum_{b=1}^{n}\Bigl(\frac{\partial^2}{\partial\theta^u\,\partial\theta^m}\log f(X_b;\theta^*)+\tilde g_{um}\Bigr)\Bigr)\\
&\qquad\times\Bigl(\sum_{c=1}^{n}\frac{\partial}{\partial\theta^s}\log f(X_c;\theta^*)\Bigr)
\Bigl(\sum_{d=1}^{n}\frac{\partial}{\partial\theta^t}\log f(X_d;\theta^*)\Bigr)\Bigr]\\
&=n^{-2}\,\tilde g^{jk}\tilde g^{lu}\tilde g^{is}\tilde g^{mt}\,
\bigl(L_{(kl)(um)}g_{st}-\tilde g_{kl}\tilde g_{um}g_{st}+L_{(kl)s}L_{(um)t}+L_{(kl)t}L_{(um)s}\bigr)+O(n^{-3}).
\end{aligned}
\tag{71}
$$

$$
\begin{aligned}
E[A^j_l\bar B^l_{ms}\,\bar e^{\,i}\bar e^{\,m}\bar e^{\,s}]
&=n^{-4}\,\bar B^l_{ms}\,\tilde g^{jk}\tilde g^{it}\tilde g^{mu}\tilde g^{sv}\,
E\Bigl[\Bigl(\sum_{a=1}^{n}\Bigl(\frac{\partial^2}{\partial\theta^l\,\partial\theta^k}\log f(X_a;\theta^*)+\tilde g_{kl}\Bigr)\Bigr)
\Bigl(\sum_{b=1}^{n}\frac{\partial}{\partial\theta^t}\log f(X_b;\theta^*)\Bigr)\\
&\qquad\times\Bigl(\sum_{c=1}^{n}\frac{\partial}{\partial\theta^u}\log f(X_c;\theta^*)\Bigr)
\Bigl(\sum_{d=1}^{n}\frac{\partial}{\partial\theta^v}\log f(X_d;\theta^*)\Bigr)\Bigr]\\
&=n^{-2}\,\bar B^l_{ms}\,\tilde g^{jk}\tilde g^{it}\tilde g^{mu}\tilde g^{sv}\,
\bigl(L_{(lk)t}g_{uv}+L_{(lk)u}g_{tv}+L_{(lk)v}g_{tu}\bigr)+O(n^{-3}).
\end{aligned}
\tag{72}
$$
$$
\begin{aligned}
E[B^j_{lm}\,\bar e^{\,i}\bar e^{\,l}\bar e^{\,m}]
&=n^{-4}\,2^{-1}\,\tilde g^{js}\tilde g^{it}\tilde g^{lu}\tilde g^{mv}\,
E\Bigl[\sum_{a=1}^{n}\Bigl(\frac{\partial^3}{\partial\theta^s\,\partial\theta^l\,\partial\theta^m}\log f(X_a;\theta^*)-E\Bigl[\frac{\partial^3}{\partial\theta^s\,\partial\theta^l\,\partial\theta^m}\log f(X_a;\theta^*)\Bigr]\Bigr)\\
&\qquad\times\Bigl(\sum_{b=1}^{n}\frac{\partial}{\partial\theta^t}\log f(X_b;\theta^*)\Bigr)
\Bigl(\sum_{c=1}^{n}\frac{\partial}{\partial\theta^u}\log f(X_c;\theta^*)\Bigr)
\Bigl(\sum_{d=1}^{n}\frac{\partial}{\partial\theta^v}\log f(X_d;\theta^*)\Bigr)\Bigr]\\
&=n^{-2}\,2^{-1}\,\tilde g^{js}\tilde g^{it}\tilde g^{lu}\tilde g^{mv}\,
\bigl(L_{(slm)t}g_{uv}+L_{(slm)u}g_{tv}+L_{(slm)v}g_{tu}\bigr)+O(n^{-3}).
\end{aligned}
\tag{73}
$$
$$
\begin{aligned}
E[\bar B^j_{lm}A^m_s\,\bar e^{\,i}\bar e^{\,l}\bar e^{\,s}]
&=n^{-4}\,\bar B^j_{lm}\,\tilde g^{mt}\tilde g^{iu}\tilde g^{lv}\tilde g^{sw}\,
E\Bigl[\Bigl(\sum_{a=1}^{n}\Bigl(\frac{\partial^2}{\partial\theta^t\,\partial\theta^s}\log f(X_a;\theta^*)+\tilde g_{ts}\Bigr)\Bigr)
\Bigl(\sum_{b=1}^{n}\frac{\partial}{\partial\theta^u}\log f(X_b;\theta^*)\Bigr)\\
&\qquad\times\Bigl(\sum_{c=1}^{n}\frac{\partial}{\partial\theta^v}\log f(X_c;\theta^*)\Bigr)
\Bigl(\sum_{d=1}^{n}\frac{\partial}{\partial\theta^w}\log f(X_d;\theta^*)\Bigr)\Bigr]\\
&=n^{-2}\,\bar B^j_{lm}\,\tilde g^{mt}\tilde g^{iu}\tilde g^{lv}\tilde g^{sw}\,
\bigl(L_{(ts)u}g_{vw}+L_{(ts)v}g_{uw}+L_{(ts)w}g_{uv}\bigr)+O(n^{-3}).
\end{aligned}
\tag{74}
$$
$$
\begin{aligned}
E[\bar B^j_{lm}\bar B^m_{st}\,\bar e^{\,i}\bar e^{\,l}\bar e^{\,s}\bar e^{\,t}]
&=\bar B^j_{lm}\bar B^m_{st}\,E[\bar e^{\,i}\bar e^{\,l}\bar e^{\,s}\bar e^{\,t}]\\
&=n^{-4}\,\bar B^j_{lm}\bar B^m_{st}\,\tilde g^{ik}\tilde g^{lu}\tilde g^{sv}\tilde g^{tw}\,
E\Bigl[\Bigl(\sum_{a=1}^{n}\frac{\partial}{\partial\theta^k}\log f(X_a;\theta^*)\Bigr)\Bigl(\sum_{b=1}^{n}\frac{\partial}{\partial\theta^u}\log f(X_b;\theta^*)\Bigr)\\
&\qquad\times\Bigl(\sum_{c=1}^{n}\frac{\partial}{\partial\theta^v}\log f(X_c;\theta^*)\Bigr)\Bigl(\sum_{d=1}^{n}\frac{\partial}{\partial\theta^w}\log f(X_d;\theta^*)\Bigr)\Bigr]\\
&=n^{-2}\,\bar B^j_{lm}\bar B^m_{st}\,\tilde g^{ik}\tilde g^{lu}\tilde g^{sv}\tilde g^{tw}\,
\bigl(g_{ku}g_{vw}+g_{kv}g_{uw}+g_{kw}g_{uv}\bigr)+O(n^{-3}).
\end{aligned}
\tag{75}
$$
$$
\begin{aligned}
E[\bar C^j_{lmt}\,\bar e^{\,i}\bar e^{\,l}\bar e^{\,m}\bar e^{\,t}]
&=n^{-4}\,\bar C^j_{lmt}\,\tilde g^{ik}\tilde g^{ls}\tilde g^{mu}\tilde g^{tv}\,
E\Bigl[\Bigl(\sum_{a=1}^{n}\frac{\partial}{\partial\theta^k}\log f(X_a;\theta^*)\Bigr)\Bigl(\sum_{b=1}^{n}\frac{\partial}{\partial\theta^s}\log f(X_b;\theta^*)\Bigr)\\
&\qquad\times\Bigl(\sum_{c=1}^{n}\frac{\partial}{\partial\theta^u}\log f(X_c;\theta^*)\Bigr)\Bigl(\sum_{d=1}^{n}\frac{\partial}{\partial\theta^v}\log f(X_d;\theta^*)\Bigr)\Bigr]\\
&=n^{-2}\,\bar C^j_{lmt}\,\tilde g^{ik}\tilde g^{ls}\tilde g^{mu}\tilde g^{tv}\,
\bigl(g_{ks}g_{uv}+g_{ku}g_{sv}+g_{kv}g_{su}\bigr)+O(n^{-3}).
\end{aligned}
\tag{76}
$$
$$
\begin{aligned}
E[A^i_lA^j_m\,\bar e^{\,l}\bar e^{\,m}]
&=n^{-4}\,\tilde g^{ik}\tilde g^{js}\tilde g^{lt}\tilde g^{mu}\,
E\Bigl[\Bigl(\sum_{a=1}^{n}\Bigl(\frac{\partial^2}{\partial\theta^k\,\partial\theta^l}\log f(X_a;\theta^*)+\tilde g_{kl}\Bigr)\Bigr)
\Bigl(\sum_{b=1}^{n}\Bigl(\frac{\partial^2}{\partial\theta^s\,\partial\theta^m}\log f(X_b;\theta^*)+\tilde g_{sm}\Bigr)\Bigr)\\
&\qquad\times\Bigl(\sum_{c=1}^{n}\frac{\partial}{\partial\theta^t}\log f(X_c;\theta^*)\Bigr)
\Bigl(\sum_{d=1}^{n}\frac{\partial}{\partial\theta^u}\log f(X_d;\theta^*)\Bigr)\Bigr]\\
&=n^{-2}\,\tilde g^{ik}\tilde g^{js}\tilde g^{lt}\tilde g^{mu}\,
\bigl(L_{(kl)(sm)}g_{tu}-\tilde g_{kl}\tilde g_{sm}g_{tu}+L_{(kl)t}L_{(sm)u}+L_{(kl)u}L_{(sm)t}\bigr)+O(n^{-3}).
\end{aligned}
\tag{77}
$$
$$
\begin{aligned}
E[A^i_l\bar B^j_{st}\,\bar e^{\,l}\bar e^{\,s}\bar e^{\,t}]
&=n^{-4}\,\bar B^j_{st}\,\tilde g^{ik}\tilde g^{lu}\tilde g^{sv}\tilde g^{tw}\,
E\Bigl[\Bigl(\sum_{a=1}^{n}\Bigl(\frac{\partial^2}{\partial\theta^k\,\partial\theta^l}\log f(X_a;\theta^*)+\tilde g_{kl}\Bigr)\Bigr)
\Bigl(\sum_{b=1}^{n}\frac{\partial}{\partial\theta^u}\log f(X_b;\theta^*)\Bigr)\\
&\qquad\times\Bigl(\sum_{c=1}^{n}\frac{\partial}{\partial\theta^v}\log f(X_c;\theta^*)\Bigr)
\Bigl(\sum_{d=1}^{n}\frac{\partial}{\partial\theta^w}\log f(X_d;\theta^*)\Bigr)\Bigr]\\
&=n^{-2}\,\bar B^j_{st}\,\tilde g^{ik}\tilde g^{lu}\tilde g^{sv}\tilde g^{tw}\,
\bigl(L_{(kl)u}g_{vw}+L_{(kl)v}g_{uw}+L_{(kl)w}g_{uv}\bigr)+O(n^{-3}).
\end{aligned}
\tag{78}
$$
$$
\begin{aligned}
E[\bar B^i_{lm}\bar B^j_{st}\,\bar e^{\,l}\bar e^{\,m}\bar e^{\,s}\bar e^{\,t}]
&=n^{-4}\,\bar B^i_{lm}\bar B^j_{st}\,\tilde g^{lk}\tilde g^{mu}\tilde g^{sv}\tilde g^{tw}\,
E\Bigl[\Bigl(\sum_{a=1}^{n}\frac{\partial}{\partial\theta^k}\log f(X_a;\theta^*)\Bigr)\Bigl(\sum_{b=1}^{n}\frac{\partial}{\partial\theta^u}\log f(X_b;\theta^*)\Bigr)\\
&\qquad\times\Bigl(\sum_{c=1}^{n}\frac{\partial}{\partial\theta^v}\log f(X_c;\theta^*)\Bigr)\Bigl(\sum_{d=1}^{n}\frac{\partial}{\partial\theta^w}\log f(X_d;\theta^*)\Bigr)\Bigr]\\
&=n^{-2}\,\bar B^i_{lm}\bar B^j_{st}\,\tilde g^{lk}\tilde g^{mu}\tilde g^{sv}\tilde g^{tw}\,
\bigl(g_{ku}g_{vw}+g_{kv}g_{uw}+g_{kw}g_{uv}\bigr)+O(n^{-3}).
\end{aligned}
\tag{79}
$$

Consequently, the following result holds:

$$
\begin{aligned}
E[(\hat\theta^i-\theta^{*i})(\hat\theta^j-\theta^{*j})]
&=n^{-1}\,\tilde g^{il}\tilde g^{jm}g_{lm}\\
&\quad+n^{-2}\Bigl(
\tilde g^{sj}\tilde g^{it}\tilde g^{lm}\bigl(L_{(sl)tm}+\tilde g_{ls}g_{tm}\bigr)
+\tilde g^{si}\tilde g^{jt}\tilde g^{lm}\bigl(L_{(sl)tm}+\tilde g_{ls}g_{tm}\bigr)\\
&\qquad+\bar B^j_{lm}\tilde g^{ik}\tilde g^{ls}\tilde g^{mt}L_{kst}
+\bar B^i_{lm}\tilde g^{jk}\tilde g^{ls}\tilde g^{mt}L_{kst}\\
&\qquad+\tilde g^{jk}\tilde g^{lu}\tilde g^{is}\tilde g^{mt}\bigl(L_{(kl)(um)}g_{st}-\tilde g_{kl}\tilde g_{um}g_{st}+L_{(kl)s}L_{(um)t}+L_{(kl)t}L_{(um)s}\bigr)\\
&\qquad+\tilde g^{ik}\tilde g^{lu}\tilde g^{js}\tilde g^{mt}\bigl(L_{(kl)(um)}g_{st}-\tilde g_{kl}\tilde g_{um}g_{st}+L_{(kl)s}L_{(um)t}+L_{(kl)t}L_{(um)s}\bigr)\\
&\qquad+\bar B^l_{ms}\tilde g^{jk}\tilde g^{it}\tilde g^{mu}\tilde g^{sv}\bigl(L_{(lk)t}g_{uv}+L_{(lk)u}g_{tv}+L_{(lk)v}g_{tu}\bigr)\\
&\qquad+\bar B^l_{ms}\tilde g^{ik}\tilde g^{jt}\tilde g^{mu}\tilde g^{sv}\bigl(L_{(lk)t}g_{uv}+L_{(lk)u}g_{tv}+L_{(lk)v}g_{tu}\bigr)\\
&\qquad+2^{-1}\tilde g^{js}\tilde g^{it}\tilde g^{lu}\tilde g^{mv}\bigl(L_{(slm)t}g_{uv}+L_{(slm)u}g_{tv}+L_{(slm)v}g_{tu}\bigr)\\
&\qquad+2^{-1}\tilde g^{is}\tilde g^{jt}\tilde g^{lu}\tilde g^{mv}\bigl(L_{(slm)t}g_{uv}+L_{(slm)u}g_{tv}+L_{(slm)v}g_{tu}\bigr)\\
&\qquad+2\bar B^j_{lm}\tilde g^{mt}\tilde g^{iu}\tilde g^{lv}\tilde g^{sw}\bigl(L_{(ts)u}g_{vw}+L_{(ts)v}g_{uw}+L_{(ts)w}g_{uv}\bigr)\\
&\qquad+2\bar B^i_{lm}\tilde g^{mt}\tilde g^{ju}\tilde g^{lv}\tilde g^{sw}\bigl(L_{(ts)u}g_{vw}+L_{(ts)v}g_{uw}+L_{(ts)w}g_{uv}\bigr)\\
&\qquad+2\bar B^j_{lm}\bar B^m_{st}\tilde g^{ik}\tilde g^{lu}\tilde g^{sv}\tilde g^{tw}\bigl(g_{ku}g_{vw}+g_{kv}g_{uw}+g_{kw}g_{uv}\bigr)\\
&\qquad+2\bar B^i_{lm}\bar B^m_{st}\tilde g^{jk}\tilde g^{lu}\tilde g^{sv}\tilde g^{tw}\bigl(g_{ku}g_{vw}+g_{kv}g_{uw}+g_{kw}g_{uv}\bigr)\\
&\qquad+\bar C^j_{lmt}\tilde g^{ik}\tilde g^{ls}\tilde g^{mu}\tilde g^{tv}\bigl(g_{ks}g_{uv}+g_{ku}g_{sv}+g_{kv}g_{su}\bigr)\\
&\qquad+\bar C^i_{lmt}\tilde g^{jk}\tilde g^{ls}\tilde g^{mu}\tilde g^{tv}\bigl(g_{ks}g_{uv}+g_{ku}g_{sv}+g_{kv}g_{su}\bigr)\\
&\qquad+\tilde g^{ik}\tilde g^{js}\tilde g^{lt}\tilde g^{mu}\bigl(L_{(kl)(sm)}g_{tu}-\tilde g_{kl}\tilde g_{sm}g_{tu}+L_{(kl)t}L_{(sm)u}+L_{(kl)u}L_{(sm)t}\bigr)\\
&\qquad+\bar B^j_{st}\tilde g^{ik}\tilde g^{lu}\tilde g^{sv}\tilde g^{tw}\bigl(L_{(kl)u}g_{vw}+L_{(kl)v}g_{uw}+L_{(kl)w}g_{uv}\bigr)\\
&\qquad+\bar B^i_{st}\tilde g^{jk}\tilde g^{lu}\tilde g^{sv}\tilde g^{tw}\bigl(L_{(kl)u}g_{vw}+L_{(kl)v}g_{uw}+L_{(kl)w}g_{uv}\bigr)\\
&\qquad+\bar B^i_{lm}\bar B^j_{st}\tilde g^{lk}\tilde g^{mu}\tilde g^{sv}\tilde g^{tw}\bigl(g_{ku}g_{vw}+g_{kv}g_{uw}+g_{kw}g_{uv}\bigr)
\Bigr)\\
&\quad+O(n^{-3}).
\end{aligned}
\tag{80}
$$
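The leading $n^{-1}$ term of (80) can likewise be checked numerically in the one-dimensional exponential-model example used above (a sketch under an assumed Gamma$(2,1)$ truth, not from the paper): there $E[(\hat\theta-\theta^*)^2]\approx n^{-1}\tilde g^{11}\tilde g^{11}g_{11}=0.125/n$, with the $n^{-2}$ terms of (80) entering as a relative $O(n^{-1})$ correction:

```python
import numpy as np

# Sketch (assumed setup): f(x; theta) = theta * exp(-theta * x),
# truth g = Gamma(2, 1), projection theta* = 0.5, MLE theta_hat = 1 / sample mean.
rng = np.random.default_rng(2)
n, reps = 500, 10_000
theta_star = 0.5

x = rng.gamma(2.0, 1.0, size=(reps, n))
theta_hat = 1.0 / x.mean(axis=1)

mse = np.mean((theta_hat - theta_star)**2)       # Monte Carlo E[(theta_hat - theta*)^2]
leading = (1.0 / n) * 0.25 * 0.25 * 2.0          # n^{-1} g~^{11} g~^{11} g_{11} = 0.125/n
print(mse, leading)                              # agree up to an O(1/n) relative correction
```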

If we substitute 𝐵 ¯ 𝑖 ​ 𝑗 𝑘 and 𝐶 ¯ 𝑖 ​ 𝑗 ​ 𝑘 𝑙 for the expression with 𝑔 ~ 𝑖 ​ 𝑗 , 𝐿 ( 𝑖 ​ 𝑗 ​ 𝑘 ) , 𝐿 ( 𝑖 ​ 𝑗 ​ 𝑘 ​ 𝑙 ) , then

$$
\begin{aligned}
&E\bigl[(\hat\theta^i-\theta_*^i)(\hat\theta^j-\theta_*^j)\bigr]\\
&= n^{-1}\,\tilde g^{il}\tilde g^{jm}g_{lm}\\
&\quad + n^{-2}\Bigl(\tilde g^{sj}\tilde g^{it}\tilde g^{lm}\bigl(L_{(sl)tm} + \tilde g_{ls}g_{tm}\bigr) + \tilde g^{si}\tilde g^{jt}\tilde g^{lm}\bigl(L_{(sl)tm} + \tilde g_{ls}g_{tm}\bigr)\\
&\qquad + 2^{-1}\tilde g^{uj}\tilde g^{ik}\tilde g^{ls}\tilde g^{mt}L_{kst}L_{(lmu)} + 2^{-1}\tilde g^{ui}\tilde g^{jk}\tilde g^{ls}\tilde g^{mt}L_{kst}L_{(lmu)}\\
&\qquad + \tilde g^{jk}\tilde g^{lu}\tilde g^{is}\tilde g^{mt}\bigl(L_{(kl)(um)}g_{st} - \tilde g_{kl}\tilde g_{um}g_{st} + L_{(kl)s}L_{(um)t} + L_{(kl)t}L_{(um)s}\bigr)\\
&\qquad + \tilde g^{ik}\tilde g^{lu}\tilde g^{js}\tilde g^{mt}\bigl(L_{(kl)(um)}g_{st} - \tilde g_{kl}\tilde g_{um}g_{st} + L_{(kl)s}L_{(um)t} + L_{(kl)t}L_{(um)s}\bigr)\\
&\qquad + 2^{-1}\tilde g^{jk}\tilde g^{it}\tilde g^{mu}\tilde g^{sv}\tilde g^{wl}L_{(msw)}\bigl(L_{(lk)t}g_{uv} + L_{(lk)u}g_{tv} + L_{(lk)v}g_{tu}\bigr)\\
&\qquad + 2^{-1}\tilde g^{ik}\tilde g^{jt}\tilde g^{mu}\tilde g^{sv}\tilde g^{wl}L_{(msw)}\bigl(L_{(lk)t}g_{uv} + L_{(lk)u}g_{tv} + L_{(lk)v}g_{tu}\bigr)\\
&\qquad + 2^{-1}\tilde g^{js}\tilde g^{it}\tilde g^{lu}\tilde g^{mv}\bigl(L_{(slm)t}g_{uv} + L_{(slm)u}g_{tv} + L_{(slm)v}g_{tu}\bigr)\\
&\qquad + 2^{-1}\tilde g^{is}\tilde g^{jt}\tilde g^{lu}\tilde g^{mv}\bigl(L_{(slm)t}g_{uv} + L_{(slm)u}g_{tv} + L_{(slm)v}g_{tu}\bigr)\\
&\qquad + \tilde g^{mt}\tilde g^{iu}\tilde g^{lv}\tilde g^{sw}\tilde g^{kj}L_{(lmk)}\bigl(L_{(ts)u}g_{vw} + L_{(ts)v}g_{uw} + L_{(ts)w}g_{uv}\bigr)\\
&\qquad + \tilde g^{mt}\tilde g^{ju}\tilde g^{lv}\tilde g^{sw}\tilde g^{ki}L_{(lmk)}\bigl(L_{(ts)u}g_{vw} + L_{(ts)v}g_{uw} + L_{(ts)w}g_{uv}\bigr)\\
&\qquad + 2^{-1}\tilde g^{ik}\tilde g^{lu}\tilde g^{sv}\tilde g^{tw}\tilde g^{oj}\tilde g^{hm}L_{(lmo)}L_{(sth)}\bigl(g_{ku}g_{vw} + g_{kv}g_{uw} + g_{kw}g_{uv}\bigr)\\
&\qquad + 2^{-1}\tilde g^{jk}\tilde g^{lu}\tilde g^{sv}\tilde g^{tw}\tilde g^{oi}\tilde g^{hm}L_{(lmo)}L_{(sth)}\bigl(g_{ku}g_{vw} + g_{kv}g_{uw} + g_{kw}g_{uv}\bigr)\\
&\qquad + 6^{-1}\tilde g^{ik}\tilde g^{ls}\tilde g^{mu}\tilde g^{tv}\tilde g^{wj}L_{(lmtw)}\bigl(g_{ks}g_{uv} + g_{ku}g_{sv} + g_{kv}g_{su}\bigr)\\
&\qquad + 6^{-1}\tilde g^{jk}\tilde g^{ls}\tilde g^{mu}\tilde g^{tv}\tilde g^{wi}L_{(lmtw)}\bigl(g_{ks}g_{uv} + g_{ku}g_{sv} + g_{kv}g_{su}\bigr)\\
&\qquad + \tilde g^{ik}\tilde g^{js}\tilde g^{lt}\tilde g^{mu}\bigl(L_{(kl)(sm)}g_{tu} - \tilde g_{kl}\tilde g_{sm}g_{tu} + L_{(kl)t}L_{(sm)u} + L_{(kl)u}L_{(sm)t}\bigr)\\
&\qquad + 2^{-1}\tilde g^{ik}\tilde g^{lu}\tilde g^{sv}\tilde g^{tw}\tilde g^{jm}L_{(stm)}\bigl(L_{(kl)u}g_{vw} + L_{(kl)v}g_{uw} + L_{(kl)w}g_{uv}\bigr)\\
&\qquad + 2^{-1}\tilde g^{jk}\tilde g^{lu}\tilde g^{sv}\tilde g^{tw}\tilde g^{im}L_{(stm)}\bigl(L_{(kl)u}g_{vw} + L_{(kl)v}g_{uw} + L_{(kl)w}g_{uv}\bigr)\\
&\qquad + 4^{-1}\tilde g^{lk}\tilde g^{mu}\tilde g^{sv}\tilde g^{tw}\tilde g^{io}\tilde g^{jh}L_{(lmo)}L_{(sth)}\bigl(g_{ku}g_{vw} + g_{kv}g_{uw} + g_{kw}g_{uv}\bigr)\Bigr)\\
&\quad + O(n^{-3}).
\end{aligned}
\tag{81}
$$
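The leading $n^{-1}$ term of (81) is the sandwich form $n^{-1}\tilde G^{-1}G\tilde G^{-1}$. As a sanity check (this sketch is not part of the paper; the model and constants are illustrative), the Monte Carlo run below fits an intentionally misspecified exponential model to Gamma data, for which $\tilde g = 1/\theta_*^2$ and $g = \mathrm{Var}(X)$, and compares the empirical variance of the MLE around the information projection $\theta_* = 1/E[X]$ with the sandwich prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Misspecified model: Exponential(theta), density theta*exp(-theta*x),
# fitted by MLE theta_hat = 1/mean(x) to Gamma(2, 1) data.
# Information projection: theta_* = 1/E[X] = 0.5.
n, reps = 500, 20000
mu, var_x = 2.0, 2.0                  # mean and variance of Gamma(2, 1)
theta_star = 1.0 / mu

x = rng.gamma(shape=2.0, scale=1.0, size=(reps, n))
theta_hat = 1.0 / x.mean(axis=1)

# Sandwich prediction: g_tilde = 1/theta_*^2 and g = Var(X), so
# Var(theta_hat) ~ g_tilde^{-1} g g_tilde^{-1} / n = theta_*^4 Var(X) / n.
pred = theta_star**4 * var_x / n
emp = theta_hat.var()
print(emp / pred)                     # close to 1 (n^{-2} terms are negligible here)
```

At $n = 500$ the $n^{-2}$ corrections in (81) are well below Monte Carlo noise, so the ratio printed is close to one.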

Now we consider $E[\bar\theta^i\bar\theta^j\bar\theta^k]$. From (64), we have

$$
\begin{aligned}
\bar\theta^i\bar\theta^j\bar\theta^k
&= \bar e^i\bar e^j\bar e^k + A^i_s\,\bar e^s\bar e^j\bar e^k + A^j_s\,\bar e^s\bar e^i\bar e^k + A^k_s\,\bar e^s\bar e^i\bar e^j\\
&\quad + \bar B^i_{st}\,\bar e^s\bar e^t\bar e^j\bar e^k + \bar B^j_{st}\,\bar e^s\bar e^t\bar e^i\bar e^k + \bar B^k_{st}\,\bar e^s\bar e^t\bar e^i\bar e^j + Re4,
\end{aligned}
\tag{82}
$$

where $Re4$ is defined similarly to before. We evaluate the expectation of each term.

$$
\begin{aligned}
E[\bar e^i\bar e^j\bar e^k]
&= n^{-3}E\Bigl[\Bigl(\tilde g^{is}\sum_{a=1}^{n}\frac{\partial}{\partial\theta^s}\log f(X_a;\theta_*)\Bigr)\Bigl(\tilde g^{jt}\sum_{b=1}^{n}\frac{\partial}{\partial\theta^t}\log f(X_b;\theta_*)\Bigr)\Bigl(\tilde g^{ku}\sum_{c=1}^{n}\frac{\partial}{\partial\theta^u}\log f(X_c;\theta_*)\Bigr)\Bigr]\\
&= n^{-3}\sum_{a=1}^{n}\tilde g^{is}\tilde g^{jt}\tilde g^{ku}\,E\Bigl[\Bigl(\frac{\partial}{\partial\theta^s}\log f(X_a;\theta_*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^t}\log f(X_a;\theta_*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^u}\log f(X_a;\theta_*)\Bigr)\Bigr]\\
&\quad + n^{-3}\sum_{a\ne b}\tilde g^{is}\tilde g^{jt}\tilde g^{ku}
\Bigl\{E\Bigl[\Bigl(\frac{\partial}{\partial\theta^s}\log f(X_a;\theta_*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^t}\log f(X_a;\theta_*)\Bigr)\Bigr]E\Bigl[\frac{\partial}{\partial\theta^u}\log f(X_b;\theta_*)\Bigr]\\
&\qquad + E\Bigl[\Bigl(\frac{\partial}{\partial\theta^s}\log f(X_a;\theta_*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^u}\log f(X_a;\theta_*)\Bigr)\Bigr]E\Bigl[\frac{\partial}{\partial\theta^t}\log f(X_b;\theta_*)\Bigr]\\
&\qquad + E\Bigl[\Bigl(\frac{\partial}{\partial\theta^t}\log f(X_a;\theta_*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^u}\log f(X_a;\theta_*)\Bigr)\Bigr]E\Bigl[\frac{\partial}{\partial\theta^s}\log f(X_b;\theta_*)\Bigr]\Bigr\}\\
&\quad + n^{-3}\sum_{a\ne b,\;a\ne c,\;b\ne c}\tilde g^{is}\tilde g^{jt}\tilde g^{ku}\,E\Bigl[\frac{\partial}{\partial\theta^s}\log f(X_a;\theta_*)\Bigr]E\Bigl[\frac{\partial}{\partial\theta^t}\log f(X_b;\theta_*)\Bigr]E\Bigl[\frac{\partial}{\partial\theta^u}\log f(X_c;\theta_*)\Bigr]\\
&= n^{-2}\tilde g^{is}\tilde g^{jt}\tilde g^{ku}\,E\Bigl[\Bigl(\frac{\partial}{\partial\theta^s}\log f(X;\theta_*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^t}\log f(X;\theta_*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^u}\log f(X;\theta_*)\Bigr)\Bigr] + O(n^{-3})\\
&= n^{-2}\tilde g^{is}\tilde g^{jt}\tilde g^{ku}L_{stu} + O(n^{-3}).
\end{aligned}
\tag{83}
$$
$$
\begin{aligned}
E[A^i_s\,\bar e^s\bar e^j\bar e^k]
&= n^{-4}\tilde g^{it}\tilde g^{su}\tilde g^{jv}\tilde g^{kw}\sum_{1\le a,b,c,d\le n}E\Bigl[\Bigl(\frac{\partial^2}{\partial\theta^s\partial\theta^t}\log f(X_a;\theta_*)+\tilde g_{st}\Bigr)\\
&\qquad\times\Bigl(\frac{\partial}{\partial\theta^u}\log f(X_b;\theta_*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^v}\log f(X_c;\theta_*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^w}\log f(X_d;\theta_*)\Bigr)\Bigr]\\
&= n^{-4}\tilde g^{it}\tilde g^{su}\tilde g^{jv}\tilde g^{kw}\\
&\quad\times\sum_{a\ne b}\Bigl\{E\Bigl[\Bigl(\frac{\partial^2}{\partial\theta^s\partial\theta^t}\log f(X_a;\theta_*)+\tilde g_{st}\Bigr)\Bigl(\frac{\partial}{\partial\theta^u}\log f(X_a;\theta_*)\Bigr)\Bigr]
E\Bigl[\Bigl(\frac{\partial}{\partial\theta^v}\log f(X_b;\theta_*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^w}\log f(X_b;\theta_*)\Bigr)\Bigr]\\
&\qquad + E\Bigl[\Bigl(\frac{\partial^2}{\partial\theta^s\partial\theta^t}\log f(X_a;\theta_*)+\tilde g_{st}\Bigr)\Bigl(\frac{\partial}{\partial\theta^v}\log f(X_a;\theta_*)\Bigr)\Bigr]
E\Bigl[\Bigl(\frac{\partial}{\partial\theta^u}\log f(X_b;\theta_*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^w}\log f(X_b;\theta_*)\Bigr)\Bigr]\\
&\qquad + E\Bigl[\Bigl(\frac{\partial^2}{\partial\theta^s\partial\theta^t}\log f(X_a;\theta_*)+\tilde g_{st}\Bigr)\Bigl(\frac{\partial}{\partial\theta^w}\log f(X_a;\theta_*)\Bigr)\Bigr]
E\Bigl[\Bigl(\frac{\partial}{\partial\theta^u}\log f(X_b;\theta_*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^v}\log f(X_b;\theta_*)\Bigr)\Bigr]\Bigr\} + O(n^{-3})\\
&= n^{-2}\tilde g^{it}\tilde g^{su}\tilde g^{jv}\tilde g^{kw}\bigl(L_{(st)u}\,g_{vw}+L_{(st)v}\,g_{uw}+L_{(st)w}\,g_{uv}\bigr)+O(n^{-3}).
\end{aligned}
\tag{84}
$$
$$
\begin{aligned}
E[\bar B^i_{st}\,\bar e^s\bar e^t\bar e^j\bar e^k]
&= n^{-4}\bar B^i_{st}\,\tilde g^{su}\tilde g^{tv}\tilde g^{jw}\tilde g^{km}\sum_{1\le a,b,c,d\le n}E\Bigl[\Bigl(\frac{\partial}{\partial\theta^u}\log f(X_a;\theta_*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^v}\log f(X_b;\theta_*)\Bigr)\\
&\qquad\times\Bigl(\frac{\partial}{\partial\theta^w}\log f(X_c;\theta_*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^m}\log f(X_d;\theta_*)\Bigr)\Bigr]\\
&= n^{-2}\bar B^i_{st}\,\tilde g^{su}\tilde g^{tv}\tilde g^{jw}\tilde g^{km}
\Bigl\{E\Bigl[\Bigl(\frac{\partial}{\partial\theta^u}\log f(X;\theta_*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^v}\log f(X;\theta_*)\Bigr)\Bigr]
E\Bigl[\Bigl(\frac{\partial}{\partial\theta^w}\log f(X;\theta_*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^m}\log f(X;\theta_*)\Bigr)\Bigr]\\
&\qquad + E\Bigl[\Bigl(\frac{\partial}{\partial\theta^u}\log f(X;\theta_*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^w}\log f(X;\theta_*)\Bigr)\Bigr]
E\Bigl[\Bigl(\frac{\partial}{\partial\theta^v}\log f(X;\theta_*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^m}\log f(X;\theta_*)\Bigr)\Bigr]\\
&\qquad + E\Bigl[\Bigl(\frac{\partial}{\partial\theta^u}\log f(X;\theta_*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^m}\log f(X;\theta_*)\Bigr)\Bigr]
E\Bigl[\Bigl(\frac{\partial}{\partial\theta^v}\log f(X;\theta_*)\Bigr)\Bigl(\frac{\partial}{\partial\theta^w}\log f(X;\theta_*)\Bigr)\Bigr]\Bigr\} + O(n^{-3})\\
&= n^{-2}\bar B^i_{st}\,\tilde g^{su}\tilde g^{tv}\tilde g^{jw}\tilde g^{km}\bigl(g_{uv}g_{wm}+g_{uw}g_{vm}+g_{um}g_{vw}\bigr)+O(n^{-3}).
\end{aligned}
\tag{85}
$$

Therefore

$$
\begin{aligned}
&E\bigl[(\hat\theta^i-\theta_*^i)(\hat\theta^j-\theta_*^j)(\hat\theta^k-\theta_*^k)\bigr]\\
&= n^{-2}\Bigl(\tilde g^{is}\tilde g^{jt}\tilde g^{ku}L_{stu}\\
&\qquad + \tilde g^{it}\tilde g^{su}\tilde g^{jv}\tilde g^{kw}\bigl(L_{(st)u}g_{vw}+L_{(st)v}g_{uw}+L_{(st)w}g_{uv}\bigr)\\
&\qquad + \tilde g^{jt}\tilde g^{su}\tilde g^{iv}\tilde g^{kw}\bigl(L_{(st)u}g_{vw}+L_{(st)v}g_{uw}+L_{(st)w}g_{uv}\bigr)\\
&\qquad + \tilde g^{kt}\tilde g^{su}\tilde g^{iv}\tilde g^{jw}\bigl(L_{(st)u}g_{vw}+L_{(st)v}g_{uw}+L_{(st)w}g_{uv}\bigr)\\
&\qquad + \bar B^i_{st}\,\tilde g^{su}\tilde g^{tv}\tilde g^{jw}\tilde g^{km}\bigl(g_{uv}g_{wm}+g_{uw}g_{vm}+g_{um}g_{vw}\bigr)\\
&\qquad + \bar B^j_{st}\,\tilde g^{su}\tilde g^{tv}\tilde g^{iw}\tilde g^{km}\bigl(g_{uv}g_{wm}+g_{uw}g_{vm}+g_{um}g_{vw}\bigr)\\
&\qquad + \bar B^k_{st}\,\tilde g^{su}\tilde g^{tv}\tilde g^{iw}\tilde g^{jm}\bigl(g_{uv}g_{wm}+g_{uw}g_{vm}+g_{um}g_{vw}\bigr)\Bigr)
+ O(n^{-3})\\
&= n^{-2}\Bigl(\tilde g^{is}\tilde g^{jt}\tilde g^{ku}L_{stu}\\
&\qquad + \tilde g^{it}\tilde g^{su}\tilde g^{jv}\tilde g^{kw}\bigl(L_{(st)u}g_{vw}+L_{(st)v}g_{uw}+L_{(st)w}g_{uv}\bigr)\\
&\qquad + \tilde g^{jt}\tilde g^{su}\tilde g^{iv}\tilde g^{kw}\bigl(L_{(st)u}g_{vw}+L_{(st)v}g_{uw}+L_{(st)w}g_{uv}\bigr)\\
&\qquad + \tilde g^{kt}\tilde g^{su}\tilde g^{iv}\tilde g^{jw}\bigl(L_{(st)u}g_{vw}+L_{(st)v}g_{uw}+L_{(st)w}g_{uv}\bigr)\\
&\qquad + 2^{-1}\tilde g^{su}\tilde g^{tv}\tilde g^{jw}\tilde g^{km}\tilde g^{il}L_{(stl)}\bigl(g_{uv}g_{wm}+g_{uw}g_{vm}+g_{um}g_{vw}\bigr)\\
&\qquad + 2^{-1}\tilde g^{su}\tilde g^{tv}\tilde g^{iw}\tilde g^{km}\tilde g^{jl}L_{(stl)}\bigl(g_{uv}g_{wm}+g_{uw}g_{vm}+g_{um}g_{vw}\bigr)\\
&\qquad + 2^{-1}\tilde g^{su}\tilde g^{tv}\tilde g^{iw}\tilde g^{jm}\tilde g^{kl}L_{(stl)}\bigl(g_{uv}g_{wm}+g_{uw}g_{vm}+g_{um}g_{vw}\bigr)\Bigr)
+ O(n^{-3}).
\end{aligned}
\tag{86}
$$

Finally, we calculate $E[\bar\theta^i\bar\theta^j\bar\theta^k\bar\theta^l]$. Notice

$$
\bar\theta^i\bar\theta^j\bar\theta^k\bar\theta^l = \bar e^i\bar e^j\bar e^k\bar e^l + Re4,
\tag{87}
$$

where $Re4$ is defined as before. Therefore

$$
E[\bar\theta^i\bar\theta^j\bar\theta^k\bar\theta^l]
= n^{-4}E\Bigl[\Bigl(\tilde g^{is}\sum_{a=1}^{n}\frac{\partial}{\partial\theta^s}\log f(X_a;\theta_*)\Bigr)\Bigl(\tilde g^{jt}\sum_{b=1}^{n}\frac{\partial}{\partial\theta^t}\log f(X_b;\theta_*)\Bigr)\Bigl(\tilde g^{ku}\sum_{c=1}^{n}\frac{\partial}{\partial\theta^u}\log f(X_c;\theta_*)\Bigr)\Bigl(\tilde g^{lv}\sum_{d=1}^{n}\frac{\partial}{\partial\theta^v}\log f(X_d;\theta_*)\Bigr)\Bigr]
\tag{88}
$$

$$
\begin{aligned}
&= n^{-4}\tilde g^{is}\tilde g^{jt}\tilde g^{ku}\tilde g^{lv}\\
&\quad\times\sum_{a\ne b}\Bigl\{E\Bigl[\frac{\partial}{\partial\theta^s}\log f(X_a;\theta_*)\,\frac{\partial}{\partial\theta^t}\log f(X_a;\theta_*)\Bigr]
E\Bigl[\frac{\partial}{\partial\theta^u}\log f(X_b;\theta_*)\,\frac{\partial}{\partial\theta^v}\log f(X_b;\theta_*)\Bigr]\\
&\qquad + E\Bigl[\frac{\partial}{\partial\theta^s}\log f(X_a;\theta_*)\,\frac{\partial}{\partial\theta^u}\log f(X_a;\theta_*)\Bigr]
E\Bigl[\frac{\partial}{\partial\theta^t}\log f(X_b;\theta_*)\,\frac{\partial}{\partial\theta^v}\log f(X_b;\theta_*)\Bigr]\\
&\qquad + E\Bigl[\frac{\partial}{\partial\theta^s}\log f(X_a;\theta_*)\,\frac{\partial}{\partial\theta^v}\log f(X_a;\theta_*)\Bigr]
E\Bigl[\frac{\partial}{\partial\theta^t}\log f(X_b;\theta_*)\,\frac{\partial}{\partial\theta^u}\log f(X_b;\theta_*)\Bigr]\Bigr\}
+ O(n^{-3})\\
&= n^{-2}\tilde g^{is}\tilde g^{jt}\tilde g^{ku}\tilde g^{lv}\bigl(g_{st}g_{uv}+g_{su}g_{tv}+g_{sv}g_{tu}\bigr)+O(n^{-3}).
\end{aligned}
\tag{89}
$$

Therefore we have

$$
E\bigl[(\hat\theta^i-\theta_*^i)(\hat\theta^j-\theta_*^j)(\hat\theta^k-\theta_*^k)(\hat\theta^l-\theta_*^l)\bigr]
= n^{-2}\tilde g^{is}\tilde g^{jt}\tilde g^{ku}\tilde g^{lv}\bigl(g_{st}g_{uv}+g_{su}g_{tv}+g_{sv}g_{tu}\bigr)+O(n^{-3}).
\tag{90}
$$

5.2 Proof of (24)

Note that, for $1\le i,j,k,l\le p$,

$$
g^*_{ij}(\theta_*)
= -\int g(x;\theta_*)\Bigl(\frac{\partial^2}{\partial\theta^i\partial\theta^j}\log g(x;\theta)\Big|_{\theta=\theta_*}\Bigr)\,d\mu
= \frac{\partial^2\Psi(\theta)}{\partial\theta^i\partial\theta^j}\Big|_{\theta=\theta_*}
= \tilde g_{ij}(\theta_*),
$$

and

$$
L_{(ij)kl} = -\tilde g_{ij}\,g_{kl},\qquad
L_{(ij)(kl)} = \tilde g_{ij}\,\tilde g_{kl},\qquad
L_{(ij)k} = 0,\qquad
L_{(ijk)l} = 0.
$$

Combining these relations with (14) gives the following:

$$
\begin{aligned}
&R[g(x;\theta_*)\,|\,g(x;\hat\theta)]\\
&= \frac{1}{2n}\,\mathrm{tr}(\tilde G^{-1}G)\\
&\quad + \frac{1}{24n^2}\Bigl[\Bigl(12\,\tilde g^{uk}\tilde g^{ls}\tilde g^{mt}L_{kst}L_{(lmu)}\\
&\qquad + 12\,\tilde g^{ko}\tilde g^{lu}\tilde g^{sv}\tilde g^{tw}\tilde g^{hm}L_{(lmo)}L_{(sth)}\bigl(g_{ku}g_{vw}+g_{kv}g_{uw}+g_{kw}g_{uv}\bigr)\\
&\qquad + 4\,\tilde g^{kw}\tilde g^{ls}\tilde g^{mu}\tilde g^{tv}L_{(lmtw)}\bigl(g_{ks}g_{uv}+g_{ku}g_{sv}+g_{kv}g_{su}\bigr)\\
&\qquad + 3\,\tilde g^{lk}\tilde g^{mu}\tilde g^{sv}\tilde g^{tw}\tilde g^{oh}L_{(lmo)}L_{(sth)}\bigl(g_{ku}g_{vw}+g_{kv}g_{uw}+g_{kw}g_{uv}\bigr)\Bigr)\\
&\qquad - \tau_{ijk}\Bigl(4\,\tilde g^{is}\tilde g^{jt}\tilde g^{ku}L_{stu}
+ 6\,\tilde g^{su}\tilde g^{tv}\tilde g^{jw}\tilde g^{km}\tilde g^{il}L_{(stl)}\bigl(g_{uv}g_{wm}+g_{uw}g_{vm}+g_{um}g_{vw}\bigr)\Bigr)\\
&\qquad - \tau_{ijkl}\,\tilde g^{is}\tilde g^{jt}\tilde g^{ku}\tilde g^{lv}\bigl(g_{st}g_{uv}+g_{su}g_{tv}+g_{sv}g_{tu}\bigr)\Bigr]\\
&\quad + O(n^{-3}).
\end{aligned}
\tag{91}
$$

As

$$
\tau_{ijk} = -\frac{\partial^3\Psi}{\partial\theta^i\partial\theta^j\partial\theta^k} = L_{(ijk)},\qquad
\tau_{ijkl} = -\frac{\partial^4\Psi}{\partial\theta^i\partial\theta^j\partial\theta^k\partial\theta^l} = L_{(ijkl)},
$$

for $1\le i,j,k,l\le p$,

$$
\begin{aligned}
&R[g(x;\theta_*)\,|\,g(x;\hat\theta)]\\
&= \frac{1}{2n}\,\mathrm{tr}(\tilde G^{-1}G)
+ \frac{1}{24n^2}\Bigl[8\,\tilde g^{uk}\tilde g^{ls}\tilde g^{mt}L_{kst}L_{(lmu)}\\
&\qquad + 9\,\tilde g^{ko}\tilde g^{lu}\tilde g^{sv}\tilde g^{tw}\tilde g^{hm}L_{(lmo)}L_{(sth)}\bigl(g_{ku}g_{vw}+g_{kv}g_{uw}+g_{kw}g_{uv}\bigr)\\
&\qquad + 3\,\tilde g^{kw}\tilde g^{ls}\tilde g^{mu}\tilde g^{tv}L_{(lmtw)}\bigl(g_{ks}g_{uv}+g_{ku}g_{sv}+g_{kv}g_{su}\bigr)\Bigr]
+ O(n^{-3}).
\end{aligned}
\tag{92}
$$

Substituting the cumulant expression (23) gives

$$
\begin{aligned}
&R[g(x;\theta_*)\,|\,g(x;\hat\theta)]\\
&= \frac{1}{2n}\,\mathrm{tr}(\tilde G^{-1}G)
+ \frac{1}{24n^2}\Bigl[-8\,\tilde g^{uk}\tilde g^{ls}\tilde g^{mt}\kappa_{kst}\kappa^*_{lmu}\\
&\qquad + 9\,\tilde g^{ko}\tilde g^{lu}\tilde g^{sv}\tilde g^{tw}\tilde g^{hm}\kappa^*_{lmo}\kappa^*_{sth}\bigl(g_{ku}g_{vw}+g_{kv}g_{uw}+g_{kw}g_{uv}\bigr)\\
&\qquad - 3\,\tilde g^{kw}\tilde g^{ls}\tilde g^{mu}\tilde g^{tv}\kappa^*_{lmtw}\bigl(g_{ks}g_{uv}+g_{ku}g_{sv}+g_{kv}g_{su}\bigr)\Bigr]
+ O(n^{-3}).
\end{aligned}
\tag{93}
$$

5.3 Derivation of (15), (16), (17)

The information projection is given by the solution of

$$
E\Bigl[\frac{\partial}{\partial\theta^i}\log g(Y,X;\theta)\Bigr] =
\begin{cases}
-2^{-1}E[\epsilon^2] + 2^{-1}(\theta^0)^{-1} = 0, & \text{if } i = 0,\\[2pt]
\theta^0\,E[\epsilon X^i] = 0, & \text{if } i = 1,\ldots,p.
\end{cases}
$$

In other words, $g(y,x;\theta_*)$ is given by $\theta_* = (\theta_*^0,\ldots,\theta_*^p)$, which satisfies

$$
(\theta_*^0)^{-1} = E[\epsilon^2(Y,X;\theta_*)],\qquad
E[X^i\,\epsilon(Y,X;\theta_*)] = 0,\quad i = 1,\ldots,p.
$$

Note that

$$
\begin{aligned}
\tilde g_{00}(\theta_*) &= \frac{1}{2(\theta_*^0)^2},\\
\tilde g_{0i}(\theta_*) &= -E[X^i\,\epsilon(Y,X;\theta_*)] = 0,\quad i = 1,\ldots,p,\\
\tilde g_{ij}(\theta_*) &= \theta_*^0\,s_{ij},\qquad s_{ij} \triangleq E[X^iX^j],\quad i,j = 1,\ldots,p,\\
g^*_{00}(\theta_*) &= \frac{1}{2(\theta_*^0)^2},\\
g^*_{0i}(\theta_*) &= -E_{\theta_*}[X^i\,\epsilon(Y,X;\theta_*)] = 0,\quad i = 1,\ldots,p,\\
g^*_{ij}(\theta_*) &= \theta_*^0\,E_{\theta_*}[X^iX^j] = \theta_*^0\,E[X^iX^j] = \theta_*^0\,s_{ij},\quad i,j = 1,\ldots,p.
\end{aligned}
$$

Hence, $\tilde G = G^*$ and

$$
\tilde G^{-1} =
\begin{pmatrix}
2(\theta_*^0)^2 & 0\\
0 & (\theta_*^0)^{-1}S^{-1}
\end{pmatrix},
\qquad S = (s_{ij}).
$$

Further, we have

$$
\begin{aligned}
g_{00}(\theta_*) &= \frac{1}{4}E\bigl[\bigl(\epsilon^2(Y,X;\theta_*)-(\theta_*^0)^{-1}\bigr)^2\bigr]\\
&= \frac{1}{4}E[\epsilon^4(Y,X;\theta_*)] - \frac{1}{2}(\theta_*^0)^{-1}E[\epsilon^2(Y,X;\theta_*)] + \frac{1}{4}(\theta_*^0)^{-2}\\
&= \frac{1}{4}\bigl(E[\epsilon^4(Y,X;\theta_*)] - E^2[\epsilon^2(Y,X;\theta_*)]\bigr),\\
g_{ij}(\theta_*) &= (\theta_*^0)^2\,t_{ij},\qquad t_{ij} \triangleq E[X^iX^j\epsilon^2(Y,X;\theta_*)],\quad i,j = 1,\ldots,p.
\end{aligned}
$$

Consequently, for Case 1,

$$
\begin{aligned}
R[g(x;\theta_*)\,|\,g(x;\hat\theta)]
&= \frac{1}{2n}\,\mathrm{tr}(\tilde G^{-1}G\tilde G^{-1}G^*) + o(n^{-1})\\
&= \frac{1}{2n}\,\mathrm{tr}(\tilde G^{-1}G) + o(n^{-1})\\
&= \frac{1}{2n}\Bigl(\theta_*^0\,\mathrm{tr}(S^{-1}T) + \frac{(\theta_*^0)^2}{2}\bigl(E[\epsilon^4(Y,X;\theta_*)]-E^2[\epsilon^2(Y,X;\theta_*)]\bigr)\Bigr) + o(n^{-1})\\
&= \frac{1}{2n}\Bigl(\mathrm{tr}(S^{-1}T)/E[\epsilon^2(Y,X;\theta_*)] + \frac{1}{2}\bigl(E[\epsilon^4(Y,X;\theta_*)]/E^2[\epsilon^2(Y,X;\theta_*)]-1\bigr)\Bigr) + o(n^{-1}),
\end{aligned}
$$

where $(T)_{ij} = t_{ij}$.

For Case 2, we observe that

$$
t_{ij} = E[X^iX^j]\,E[\epsilon^2(Y,X;\theta_*)],\quad i,j = 1,\ldots,p;
$$

hence,

$$
T = E[\epsilon^2(Y,X;\theta_*)]\,S,
$$

and

$$
R[g(x;\theta_*)\,|\,g(x;\hat\theta)]
= \frac{1}{2n}\Bigl(p + \frac{1}{2}\bigl(E[\epsilon^4(Y,X;\theta_*)]/E^2[\epsilon^2(Y,X;\theta_*)]-1\bigr)\Bigr) + o(n^{-1}).
$$

For Case 3, as

$$
E[\epsilon^4(Y,X;\theta_*)]/E^2[\epsilon^2(Y,X;\theta_*)] = 3,
$$

we have

$$
R[g(x;\theta_*)\,|\,g(x;\hat\theta)] = \frac{p+1}{2n} + o(n^{-1}).
$$
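The Case 3 rate can be checked numerically. The Monte Carlo sketch below is illustrative and not from the paper: it simulates a correctly specified Gaussian linear model with $p$ regressors plus the precision parameter (so $p+1$ parameters in all), computes the exact K-L divergence from the true conditional density to the plug-in density averaged over the regressor distribution, and compares its mean with $(p+1)/(2n)$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Correctly specified linear model with p regressors and one variance
# parameter: the K-L risk of the plug-in MLE should be about (p + 1)/(2 n).
p, n, reps = 3, 200, 4000
beta, sigma2 = np.ones(p), 1.0
kl = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal((n, p))
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
    s2hat = np.mean((y - X @ bhat) ** 2)      # MLE of the noise variance
    d = beta - bhat
    # E_x KL( N(x'beta, sigma2) || N(x'bhat, s2hat) ) with x ~ N(0, I_p):
    kl[r] = 0.5 * np.log(s2hat / sigma2) + (sigma2 + d @ d) / (2 * s2hat) - 0.5
print(kl.mean(), (p + 1) / (2 * n))           # both near 0.01
```

With $p = 3$ and $n = 200$ the asymptotic value is $4/400 = 0.01$; the simulated mean agrees up to higher-order and Monte Carlo error.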

5.4 Proof of (44)

A suitably fine partition $S_i,\ i=1,\ldots,m$ of the domain of $d\mu$ and the associated step functions $\tilde g_j(x) = \sum_{i=1}^m c_{ji}\,I(x\in S_i),\ j=1,2$, are taken such that the two integrals

$$
Er[g_1(x)\,|\,g_2(x)] = \frac{1}{2}\int\min\bigl(g_1(x),g_2(x)\bigr)\,d\mu
= \frac{1}{2}\int g_1(x)\min\bigl(1,\,g_2(x)/g_1(x)\bigr)\,d\mu,
\tag{94}
$$

$$
D[g_1(x)\,|\,g_2(x)] = \int g_1(x)\log\bigl(g_1(x)/g_2(x)\bigr)\,d\mu,
\tag{95}
$$

are sufficiently well approximated by

$$
\frac{1}{2}\int \tilde g_1(x)\min\bigl(1,\,\tilde g_2(x)/\tilde g_1(x)\bigr)\,d\mu
= \frac{1}{2}\sum_{i=1}^m \min(1,\,c_{2i}/c_{1i})\int_{S_i} c_{1i}\,d\mu,
\tag{96}
$$

$$
\int \tilde g_1(x)\log\bigl(\tilde g_1(x)/\tilde g_2(x)\bigr)\,d\mu
= \sum_{i=1}^m \log(c_{1i}/c_{2i})\int_{S_i} c_{1i}\,d\mu,
\tag{97}
$$

respectively. Furthermore, we can choose the partition such that

$$
\int_{S_i} c_{1i}\,d\mu = 1/m,\quad i = 1,\ldots,m.
$$

Then, (96) and (97) equal

$$
\frac{1}{2m}\sum_{i=1}^m \min(1,\Delta_i)\ \bigl(\triangleq t(\Delta_1,\ldots,\Delta_m)\bigr),\qquad
\frac{1}{m}\sum_{i=1}^m \bigl(-\log\Delta_i\bigr),
$$

where $\Delta_i \triangleq c_{2i}/c_{1i},\ i=1,\ldots,m$. Suppose that $D[g(x;\theta_1)\,|\,g(x;\theta_2)] < \delta$. Then, we can suppose

$$
f(\Delta_1,\ldots,\Delta_m) \triangleq \frac{1}{m}\sum_{i=1}^m \log\Delta_i \ge -\delta.
\tag{98}
$$

The lower bound of $t(\Delta)$ is sought under condition (98). Let

$$
\tilde m \triangleq \sum_{i=1}^m \Delta_i,\qquad \tilde 1 \triangleq \frac{\tilde m}{m}.
\tag{99}
$$

Note that, as the partition $S_i,\ i=1,\ldots,m$ becomes finer,

$$
\sum_{i=1}^m \int_{S_i} c_{2i}\,d\mu = \sum_{i=1}^m \Delta_i/m = \tilde 1 \to \int g_2(x)\,d\mu = 1.
$$

Without loss of generality, the following can be assumed:

$$
\Delta_1 \ge \cdots \ge \Delta_s \ge 1 > \Delta_{s+1} \ge \cdots \ge \Delta_m \ge 0,\qquad \exists\, s\,(\ge 1).
$$

Let $u = m - s$ and

$$
\Delta_+ \triangleq \frac{1}{s}\sum_{i=1}^{s}\Delta_i,\qquad
\Delta_- \triangleq \frac{1}{u}\sum_{i=s+1}^{m}\Delta_i.
$$

Note that

$$
t(\underbrace{\Delta_+,\cdots,\Delta_+}_{s},\underbrace{\Delta_-,\cdots,\Delta_-}_{u}) = t(\Delta_1,\ldots,\Delta_m)
$$

and, because of the concavity of $\log\Delta_i$,

$$
f(\underbrace{\Delta_+,\cdots,\Delta_+}_{s},\underbrace{\Delta_-,\cdots,\Delta_-}_{u}) \ge f(\Delta_1,\ldots,\Delta_m) \ge -\delta.
$$

Therefore, in search of the lower bound of $t(\Delta)$, we need only consider the case where

$$
\Delta_1 = \Delta_2 = \cdots = \Delta_s = \Delta_+ \ge 1,\qquad
0 < \Delta_{s+1} = \Delta_{s+2} = \cdots = \Delta_m = \Delta_- < 1.
\tag{100}
$$

Under condition (100), the relations (98) and (99) become

$$
\frac{1}{m}\bigl(s\log\Delta_+ + u\log\Delta_-\bigr) \ge -\delta,\qquad
s\Delta_+ + u\Delta_- = \tilde m,
$$

respectively, or equivalently,

$$
x\log\Delta_+ + (1-x)\log\Delta_- \ge -\delta,
\tag{101}
$$

$$
x\Delta_+ + (1-x)\Delta_- = \tilde 1,
\tag{102}
$$

where

$$
0 < x = s/m < 1.
\tag{103}
$$

Substituting the relation from (102), i.e.,

$$
\Delta_- = \frac{\tilde 1 - x\Delta_+}{1-x},
$$

into $\Delta_- > 0$ and (101) gives

$$
1 < \Delta_+ < \frac{\tilde 1}{x},
\tag{104}
$$

$$
h(x;\Delta_+) \triangleq x\log\Delta_+ + (1-x)\log\Bigl(\frac{\tilde 1 - x\Delta_+}{1-x}\Bigr) \ge -\delta.
\tag{105}
$$

Furthermore, under condition (100),

$$
\frac{1}{2m}\sum_{i=1}^m \min(1,\Delta_i) = t(\Delta_1,\ldots,\Delta_m)
= \frac{1}{2m}(s + u\Delta_-)
= \frac{1}{2}\bigl(x + (1-x)\Delta_-\bigr)
= \frac{1}{2}\bigl(\tilde 1 + x(1-\Delta_+)\bigr)\ \bigl(\triangleq t(x;\Delta_+)\bigr).
$$

Consider the minimization of $t(x;\Delta_+)$ under conditions (103), (104), and (105). As

$$
\begin{aligned}
\frac{d}{dx}h(x;\Delta_+) = h'(x;\Delta_+)
&= \log\Delta_+ - \log\Bigl(\frac{\tilde 1 - x\Delta_+}{1-x}\Bigr) + (1-x)\Bigl\{-\frac{\Delta_+}{\tilde 1 - x\Delta_+} + \frac{1}{1-x}\Bigr\}\\
&= \log\Bigl(\frac{\Delta_+(1-x)}{\tilde 1 - x\Delta_+}\Bigr) + \frac{\tilde 1 - \Delta_+}{\tilde 1 - x\Delta_+}\\
&\le \frac{\Delta_+ - \tilde 1}{\tilde 1 - x\Delta_+} + \frac{\tilde 1 - \Delta_+}{\tilde 1 - x\Delta_+}
= 0\qquad(\because\ \log(1+x)\le x),
\end{aligned}
$$

the minimum value of $t(x;\Delta_+)$ (say, $t_*$) is attained when (105) holds with equality. Let $x_*$ denote the point that attains $t_*$; then,

$$
\Delta_+ = \frac{\tilde 1 - 2t_*}{x_*} + 1.
\tag{106}
$$

(106)

Inserting (106) into the left-hand side of (105) and equating it with $-\delta$ gives

$$
x_*\log\Bigl(\frac{\tilde 1 - 2t_*}{x_*} + 1\Bigr) + (1-x_*)\log\Bigl(\frac{2t_* - 1}{1-x_*} + 1\Bigr) = -\delta,
$$

while, from (103), (104), and (106),

$$
0 < x_* < 2t_* < \tilde 1.
$$

Let us define the region $\tilde A(\delta)$ by

$$
\tilde A(\delta) \triangleq \Bigl\{(x_*,t_*)\ \Big|\
x_*\log\Bigl(\frac{\tilde 1 - 2t_*}{x_*} + 1\Bigr) + (1-x_*)\log\Bigl(\frac{2t_* - 1}{1-x_*} + 1\Bigr) = -\delta,\
0 < x_* < 2t_* < \tilde 1\Bigr\}.
$$

Then,

$$
\frac{1}{2m}\sum_{i=1}^m \min(1,\Delta_i) = t(x;\Delta_+) \ge \min\bigl\{t_*\ \big|\ (x_*,t_*)\in\tilde A(\delta)\bigr\}.
$$

Taking the limit on both sides as the partition becomes finer gives the result.
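The bound $\min\{t_*\mid(x_*,t_*)\in\tilde A(\delta)\}$ has no closed form, but it is easy to evaluate numerically in the limiting case $\tilde 1 = 1$. The sketch below is illustrative (grid size and iteration count are arbitrary): for each $x_*$ on a grid it solves the defining equation by bisection in $t_*$ (the left-hand side is strictly increasing in $t_*$, from $-\infty$ at $t_* = x_*/2$ up to $0$ at $t_* = 1/2$) and then minimizes over $x_*$.

```python
import numpy as np

# Constraint function of A~(delta) with 1~ = 1.
def h(x, t):
    return (x * np.log((1 - 2 * t) / x + 1)
            + (1 - x) * np.log((2 * t - 1) / (1 - x) + 1))

def t_star(delta, grid=400):
    best = 0.5
    for x in np.linspace(1e-3, 0.999, grid):
        lo, hi = x / 2 + 1e-12, 0.5
        for _ in range(60):                 # bisection: h increases in t
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if h(x, mid) < -delta else (lo, mid)
        best = min(best, hi)
    return best

for delta in (0.01, 0.1, 0.5):
    print(delta, t_star(delta))             # bound decreases as delta grows
```

As $\delta \to 0$ the bound approaches $1/2$ (the two distributions become indistinguishable), and it decreases monotonically as $\delta$ grows.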

5.5 Proof of (60)

For simplicity, the notation $\doteqdot$ is used when terms of order $o(n^{-1})$ or $o_p(n^{-1})$ are ignored. The expansion of $\log g(x_t;\hat\theta)$ around $\theta_*$ is given by

$$
\begin{aligned}
\log g(x_t;\hat\theta)
&\doteqdot \log g(x_t;\theta_*) + \sum_{i=1}^p \frac{\partial}{\partial\theta^i}\log g(x_t;\theta)\Big|_{\theta=\theta_*}(\hat\theta^i-\theta_*^i)\\
&\quad + \frac{1}{2}\sum_{1\le i,j\le p}\frac{\partial^2}{\partial\theta^i\partial\theta^j}\log g(x_t;\theta)\Big|_{\theta=\theta_*}(\hat\theta^i-\theta_*^i)(\hat\theta^j-\theta_*^j)\\
&= \log g(x_t;\theta_*) + \sum_{i=1}^p\bigl(\xi_i(x_t)-\eta_i^*\bigr)(\hat\theta^i-\theta_*^i)
- \frac{1}{2}\sum_{1\le i,j\le p}(\ddot\Psi)_{ij}(\hat\theta^i-\theta_*^i)(\hat\theta^j-\theta_*^j).
\end{aligned}
\tag{107}
$$

Meanwhile, $\hat\theta^i = \theta^i(\hat\eta)$ can be expanded around $\eta_*\ (=\eta(\theta_*))$ as

$$
\theta^i(\hat\eta) \doteqdot \theta^i(\eta_*) + \sum_{j=1}^p \frac{\partial\theta^i}{\partial\eta_j}(\hat\eta_j - \eta_j^*)
= \theta^i(\eta_*) + \sum_{j=1}^p (\ddot\Phi)_{ij}(\hat\eta_j - \eta_j^*),
\tag{108}
$$

where $\Phi(\eta)$ is the conjugate convex function of $\Psi$, which satisfies the following relations:

$$
\frac{\partial\Phi}{\partial\eta_i}(\eta) = \theta^i,\quad i = 1,\ldots,p,
\tag{109}
$$

$$
\ddot\Phi \triangleq \Bigl(\frac{\partial^2\Phi}{\partial\eta_i\partial\eta_j}\Bigr) = \ddot\Psi^{-1}.
\tag{110}
$$

Inserting (108) and (110) into (107) gives

$$
\begin{aligned}
\log g(x_t;\hat\theta)
&\doteqdot \log g(x_t;\theta_*) + \sum_{1\le i,j\le p}(\ddot\Psi^{-1})_{ij}\bigl(\xi_i(x_t)-\eta_i^*\bigr)(\hat\eta_j - \eta_j^*)\\
&\quad - \frac{1}{2}\sum_{1\le i,j\le p}(\ddot\Psi)_{ij}(\hat\theta^i-\theta_*^i)(\hat\theta^j-\theta_*^j).
\end{aligned}
\tag{111}
$$

Taking the expectation of both sides gives

$$
\begin{aligned}
E[\log g(X_t;\hat\theta)]
&\doteqdot E[\log g(X_t;\theta_*)] + \sum_{1\le i,j\le p}(\ddot\Psi^{-1})_{ij}\,E\bigl[(\xi_i(X_t)-\eta_i^*)(\hat\eta_j-\eta_j^*)\bigr]\\
&\quad - \frac{1}{2}\sum_{1\le i,j\le p}(\ddot\Psi)_{ij}\,E\bigl[(\hat\theta^i-\theta_*^i)(\hat\theta^j-\theta_*^j)\bigr].
\end{aligned}
\tag{112}
$$

Note that

$$
\begin{aligned}
E\bigl[(\xi_i(X_t)-\eta_i^*)(\hat\eta_j-\eta_j^*)\bigr]
&= n^{-1}\sum_{s=1}^n E\bigl[(\xi_i(X_t)-\eta_i^*)(\xi_j(X_s)-\eta_j^*)\bigr]\\
&= n^{-1}E\bigl[(\xi_i(X_t)-\eta_i^*)(\xi_j(X_t)-\eta_j^*)\bigr]\\
&= n^{-1}(G)_{ij},
\end{aligned}
\tag{113}
$$

since, for $s\ne t$,

$$
E\bigl[(\xi_i(X_t)-\eta_i^*)(\xi_j(X_s)-\eta_j^*)\bigr]
= E[\xi_i(X_t)-\eta_i^*]\,E[\xi_j(X_s)-\eta_j^*] = 0.
$$

From (80) in Section 5.1 of the Appendix,

$$
E\bigl[(\hat\theta^i-\theta_*^i)(\hat\theta^j-\theta_*^j)\bigr]
\doteqdot n^{-1}\sum_{1\le l,m\le p}\tilde g^{il}\tilde g^{jm}g_{lm}
= n^{-1}(\tilde G^{-1}G\tilde G^{-1})_{ij}.
\tag{114}
$$

Inserting (113) and (114) into (112) and using the fact that $\tilde G = \ddot\Psi$ gives

$$
\begin{aligned}
E[\log g(X_t;\hat\theta)]
&\doteqdot E[\log g(X_t;\theta_*)] + \frac{1}{n}\,\mathrm{tr}(\tilde G^{-1}G) - \frac{1}{2n}\,\mathrm{tr}(\tilde G\tilde G^{-1}G\tilde G^{-1})\\
&= E[\log g(X;\theta_*)] + \frac{1}{2n}\,\mathrm{tr}(\tilde G^{-1}G),
\end{aligned}
$$

and

$$
\begin{aligned}
E[\widehat{Ce(M)}] - Ce(M)
&= -\frac{1}{n}\sum_{t=1}^n E[\log g(X_t;\hat\theta)] + E[\log g(X;\theta_*)]\\
&= -\frac{1}{n}\sum_{t=1}^n\Bigl\{E[\log g(X_t;\hat\theta)] - E[\log g(X;\theta_*)]\Bigr\}\\
&= -\frac{1}{2n}\,\mathrm{tr}(\tilde G^{-1}G).
\end{aligned}
$$
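The bias formula can be illustrated with the simplest exponential family. In the sketch below (illustrative, not from the paper), the model is a correctly specified Exponential(1) density, so $\tilde G = G$ and the trace equals one; the average training log-loss $\widehat{Ce(M)}$ should then underestimate $Ce(M)$ by about $1/(2n)$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Well-specified Exponential(1) model: one parameter, G~ = G, so
# E[Ce(M)^] - Ce(M) = -tr(G~^{-1} G)/(2 n) = -1/(2 n).
n, reps = 100, 40000
x = rng.exponential(size=(reps, n))
lam_hat = 1.0 / x.mean(axis=1)                   # MLE of the rate
# Ce_hat = -(1/n) sum_t log g(x_t; lam_hat) = -log(lam_hat) + lam_hat * xbar,
# and lam_hat * xbar = 1 by definition of the MLE.
ce_hat = -np.log(lam_hat) + 1.0
ce_true = 1.0                                    # Ce(M) = entropy of Exp(1)
print(ce_hat.mean() - ce_true, -1 / (2 * n))     # both near -0.005
```

Here the bias can even be computed in closed form, $E[\log\bar x] = \psi(n) - \log n \approx -1/(2n)$, so the simulation and the formula agree to higher order.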
